├── Code
    ├── scripts
    │   ├── test.json
    │   ├── simulate_request.sh
    │   ├── build_application.sh
    │   └── teardown_resources.sh
    ├── .DS_Store
    ├── src
    │   ├── package.json
    │   ├── Dockerfile
    │   └── app.js
    └── templates
    │   ├── automation_role.yml
    │   ├── runbook_scale_ecs_service.yml
    │   ├── playbook_investigate_application.yml
    │   ├── runbook_approval_gate.yml
    │   ├── playbook_gather_resources.yml
    │   ├── base_resources.yml
    │   ├── base_app.yml
    │   └── playbook_investigate_application_resources.yml
├── .DS_Store
├── Images
    ├── section4-iam.png
    ├── section3-alarm.png
    ├── section3-email.png
    ├── section3-canary.png
    ├── section4-normal.png
    ├── section4-output.png
    ├── section4-scale-up.png
    ├── section4-scale-up2.png
    ├── section4-scale-up3.png
    ├── section2-dns-outputs.png
    ├── section2-email-confirm.png
    ├── section3-alarm-detail.png
    ├── section3-alarm-email.png
    ├── section3-canary-detail.png
    ├── section3-stackoutput.png
    ├── section3-steps-explain.png
    ├── section4-approveordeny.png
    ├── section2-base-app-build.png
    ├── section2-base-bootstrap.png
    ├── section3-automationrole.png
    ├── section3-canary-monitor.png
    ├── section2-base-application.png
    ├── section2-ecr-repo-confirm.png
    ├── section4-create-automation.png
    ├── section5-create-automation.png
    ├── section2-environment-open-ide.png
    ├── section4-approve-timer-step1.png
    ├── section3-gather-resources-stepid.png
    ├── section4-architecture-graphics1.png
    ├── section4-architecture-graphics2.png
    ├── section4-architecture-graphics3.png
    ├── section5-create-automation-step1.png
    ├── section5-create-automation-step2.png
    ├── section2-base-app-create-complete.png
    ├── section3-failure-traffic-requests.png
    ├── section3-investigate-resourcelist.png
    ├── section3-success-traffic-requests.png
    ├── section4-approve-timer-input-param.png
    ├── section4-create-automation-addstep.png
    ├── section3-playbook-gather-resource-tab.png
    ├── section4-create-approval-gate-step1.png
    ├── section4-create-approval-gate-step2.png
    ├── section4-create-approval-gate-step3.png
    ├── section5-create-automation-graphics1.png
    ├── section5-create-automation-graphics2.png
    ├── section2-base-resources-create-complete.png
    ├── section4-create-automation-additionals.png
    ├── section5-create-automation-step2-input.png
    ├── section5-create-automation2-step1-input.png
    ├── section3-testing-canary-alarm-architecture.png
    ├── section4-create-automation-parameter-input.png
    ├── section4-create-automation-playbook-role.png
    ├── section5-create-automation-parameter-input.png
    ├── section4-create-automation-parameter-input-2.png
    ├── section4-create-automation-playbook-execute.png
    ├── section4-create-automation-playbook-execute2.png
    ├── section4-create-automation-playbook-owned-by-me.png
    ├── section4-create-automation-playbook-run-output.png
    ├── section4-create-automation-playbook-test-email.png
    ├── section4-create-automation-parameter-input-2-step1.png
    ├── section4-create-automation-parameter-input-2-step2.png
    ├── section4-create-automation-parameter-input-2-step3.png
    ├── section4-create-automation-playbook-execute-output.png
    ├── section4-create-automation-playbook-test-run-playbook.png
    ├── section3-create-automation-playbook-test-run-playbook-cpu.png
    ├── section4-create-automation-playbook-test-execute-playbook.png
    ├── section4-create-automation-playbook-test-run-playbook-summary.png
    ├── section4-create-automation-playbook-test-execute-playbook-observe.png
    ├── section4-create-automation-playbook-test-execute-playbook-summary.png
    ├── section4-create-automation-playbook-test-run-playbook-email-summary.png
    └── section4-create-automation-playbook-test-execute-playbook-email-summary.png
├── CODE_OF_CONDUCT.md
├── LICENSE
├── CONTRIBUTING.md
└── README.md


/Code/scripts/test.json:
--------------------------------------------------------------------------------
{"Name":"Test User","Text":"This Message is a Test!"}
--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/.DS_Store
--------------------------------------------------------------------------------
/Code/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Code/.DS_Store
--------------------------------------------------------------------------------
/Images/section4-iam.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-iam.png
--------------------------------------------------------------------------------
/Images/section3-alarm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-alarm.png
--------------------------------------------------------------------------------
/Images/section3-email.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-email.png
--------------------------------------------------------------------------------
/Images/section3-canary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-canary.png
--------------------------------------------------------------------------------
/Images/section4-normal.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-normal.png
--------------------------------------------------------------------------------
/Images/section4-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-output.png
--------------------------------------------------------------------------------
/Images/section4-scale-up.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-scale-up.png
--------------------------------------------------------------------------------
/Images/section4-scale-up2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-scale-up2.png
--------------------------------------------------------------------------------
/Images/section4-scale-up3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-scale-up3.png
--------------------------------------------------------------------------------
/Code/scripts/simulate_request.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Repeatedly load-test the application's /encrypt endpoint with ApacheBench (ab).
# The first argument is the host name of the load balancer to target.
ALBURL="$1"
while :
do
  # POST test.json as JSON: 3000 concurrent clients, 60000000 requests, verbosity level 4.
  ab -p test.json -T application/json -c 3000 -n 60000000 -v 4 "http://$ALBURL/encrypt"
done
--------------------------------------------------------------------------------
/Images/section2-dns-outputs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-dns-outputs.png
--------------------------------------------------------------------------------
/Images/section2-email-confirm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-email-confirm.png
--------------------------------------------------------------------------------
/Images/section3-alarm-detail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-alarm-detail.png
--------------------------------------------------------------------------------
/Images/section3-alarm-email.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-alarm-email.png
--------------------------------------------------------------------------------
/Images/section3-canary-detail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-canary-detail.png
--------------------------------------------------------------------------------
/Images/section3-stackoutput.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-stackoutput.png
--------------------------------------------------------------------------------
/Images/section3-steps-explain.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-steps-explain.png
--------------------------------------------------------------------------------
/Images/section4-approveordeny.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-approveordeny.png
--------------------------------------------------------------------------------
/Images/section2-base-app-build.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-base-app-build.png
--------------------------------------------------------------------------------
/Images/section2-base-bootstrap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-base-bootstrap.png
--------------------------------------------------------------------------------
/Images/section3-automationrole.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-automationrole.png
--------------------------------------------------------------------------------
/Images/section3-canary-monitor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-canary-monitor.png
--------------------------------------------------------------------------------
/Images/section2-base-application.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-base-application.png
--------------------------------------------------------------------------------
/Images/section2-ecr-repo-confirm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-ecr-repo-confirm.png
--------------------------------------------------------------------------------
/Images/section4-create-automation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation.png
--------------------------------------------------------------------------------
/Images/section5-create-automation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation.png
--------------------------------------------------------------------------------
/Images/section2-environment-open-ide.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-environment-open-ide.png
--------------------------------------------------------------------------------
/Images/section4-approve-timer-step1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-approve-timer-step1.png
--------------------------------------------------------------------------------
/Images/section3-gather-resources-stepid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-gather-resources-stepid.png
--------------------------------------------------------------------------------
/Images/section4-architecture-graphics1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-architecture-graphics1.png
--------------------------------------------------------------------------------
/Images/section4-architecture-graphics2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-architecture-graphics2.png
--------------------------------------------------------------------------------
/Images/section4-architecture-graphics3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-architecture-graphics3.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-step1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-step1.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-step2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-step2.png
--------------------------------------------------------------------------------
/Images/section2-base-app-create-complete.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-base-app-create-complete.png
--------------------------------------------------------------------------------
/Images/section3-failure-traffic-requests.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-failure-traffic-requests.png
--------------------------------------------------------------------------------
/Images/section3-investigate-resourcelist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-investigate-resourcelist.png
--------------------------------------------------------------------------------
/Images/section3-success-traffic-requests.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-success-traffic-requests.png
--------------------------------------------------------------------------------
/Images/section4-approve-timer-input-param.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-approve-timer-input-param.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-addstep.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-addstep.png
--------------------------------------------------------------------------------
/Images/section3-playbook-gather-resource-tab.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-playbook-gather-resource-tab.png
--------------------------------------------------------------------------------
/Images/section4-create-approval-gate-step1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-approval-gate-step1.png
--------------------------------------------------------------------------------
/Images/section4-create-approval-gate-step2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-approval-gate-step2.png
--------------------------------------------------------------------------------
/Images/section4-create-approval-gate-step3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-approval-gate-step3.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-graphics1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-graphics1.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-graphics2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-graphics2.png
--------------------------------------------------------------------------------
/Images/section2-base-resources-create-complete.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-base-resources-create-complete.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-additionals.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-additionals.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-step2-input.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-step2-input.png
--------------------------------------------------------------------------------
/Images/section5-create-automation2-step1-input.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation2-step1-input.png
--------------------------------------------------------------------------------
/Images/section3-testing-canary-alarm-architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-testing-canary-alarm-architecture.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-parameter-input.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-parameter-input.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-role.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-role.png -------------------------------------------------------------------------------- /Images/section5-create-automation-parameter-input.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-parameter-input.png -------------------------------------------------------------------------------- /Images/section4-create-automation-parameter-input-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-parameter-input-2.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-execute.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-execute.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-execute2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-execute2.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-owned-by-me.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-owned-by-me.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-run-output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-run-output.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-test-email.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-email.png -------------------------------------------------------------------------------- /Images/section4-create-automation-parameter-input-2-step1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-parameter-input-2-step1.png -------------------------------------------------------------------------------- /Images/section4-create-automation-parameter-input-2-step2.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-parameter-input-2-step2.png -------------------------------------------------------------------------------- /Images/section4-create-automation-parameter-input-2-step3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-parameter-input-2-step3.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-execute-output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-execute-output.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-test-run-playbook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-run-playbook.png -------------------------------------------------------------------------------- /Images/section3-create-automation-playbook-test-run-playbook-cpu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-create-automation-playbook-test-run-playbook-cpu.png 
-------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-test-execute-playbook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-execute-playbook.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-test-run-playbook-summary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-run-playbook-summary.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-test-execute-playbook-observe.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-execute-playbook-observe.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-test-execute-playbook-summary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-execute-playbook-summary.png -------------------------------------------------------------------------------- /Images/section4-create-automation-playbook-test-run-playbook-email-summary.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-run-playbook-email-summary.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-test-execute-playbook-email-summary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-execute-playbook-email-summary.png
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.
--------------------------------------------------------------------------------
/Code/src/package.json:
--------------------------------------------------------------------------------
{
  "name": "app",
  "version": "1.0.0",
  "description": "",
  "main": "app.js",
  "dependencies": {
    "aws-sdk": "^2.850.0",
    "aws-xray-sdk": "^3.2.0",
    "body-parser": "^1.19.0",
    "express": "^4.17.1",
    "mysql": "^2.18.1",
    "zlib": "^1.0.5"
  },
  "devDependencies": {},
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}
--------------------------------------------------------------------------------
/Code/src/Dockerfile:
--------------------------------------------------------------------------------
FROM node:12-slim

# Create app directory
WORKDIR /usr/src/app

# Install app dependencies
# A wildcard is used to ensure both package.json AND package-lock.json are copied
# where available (npm@5+)
COPY package*.json ./
# ENV NODE_ENV=production
# If you are building your code for production
# RUN npm ci --only=production
ENV NODE_ENV=production
RUN npm install

# Bundle app source
COPY . .

EXPOSE 80
CMD [ "node", "app.js" ]
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/Code/templates/automation_role.yml:
--------------------------------------------------------------------------------
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  AutomationRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ssm.amazonaws.com
                - ec2.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: PassRole
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action: 'iam:PassRole'
                Resource: '*'
        - PolicyName: SNSPublish
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action: 'sns:Publish'
                Resource: '*'
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole
        - arn:aws:iam::aws:policy/CloudWatchReadOnlyAccess
        - arn:aws:iam::aws:policy/CloudWatchLogsReadOnlyAccess
        - arn:aws:iam::aws:policy/AmazonRDSReadOnlyAccess
        -
arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess 34 | - arn:aws:iam::aws:policy/AmazonECS_FullAccess 35 | - arn:aws:iam::aws:policy/CloudWatchSyntheticsReadOnlyAccess 36 | Path: "/" 37 | RoleName: AutomationRole 38 | -------------------------------------------------------------------------------- /Code/templates/runbook_scale_ecs_service.yml: -------------------------------------------------------------------------------- 1 | Parameters: 2 | PlaybookIAMRole: 3 | Type: String 4 | 5 | Resources: 6 | ScaleECSWithApproval: 7 | Type: "AWS::SSM::Document" 8 | Properties: 9 | DocumentType: Automation 10 | Name: Runbook-ECS-Scale-Up 11 | Content: 12 | schemaVersion: '0.3' 13 | assumeRole: !Ref PlaybookIAMRole 14 | parameters: 15 | ECSClusterName: 16 | type: String 17 | ECSServiceName: 18 | type: String 19 | ECSDesiredCount: 20 | type: Integer 21 | Timer: 22 | type: String 23 | default: PT10M 24 | NotificationTopicArn: 25 | type: String 26 | NotificationMessage: 27 | type: String 28 | ApproverArn: 29 | type: String 30 | mainSteps: 31 | - name: ExecuteApprovalGateWithTimer 32 | action: 'aws:executeAutomation' 33 | inputs: 34 | DocumentName: Approval-Gate 35 | RuntimeParameters: 36 | Timer: '{{Timer}}' 37 | NotificationTopicArn: '{{NotificationTopicArn}}' 38 | NotificationMessage: '{{NotificationMessage}}' 39 | ApproverArn: '{{ApproverArn}}' 40 | - name: UpdateECSServiceDesiredCount 41 | action: aws:executeAwsApi 42 | inputs: 43 | Service: ecs 44 | Api: UpdateService 45 | service: '{{ECSServiceName}}' 46 | forceNewDeployment: true 47 | desiredCount: '{{ECSDesiredCount}}' 48 | cluster: '{{ECSClusterName}}' -------------------------------------------------------------------------------- /Code/templates/playbook_investigate_application.yml: -------------------------------------------------------------------------------- 1 | Parameters: 2 | PlaybookIAMRole: 3 | Type: String 4 | 5 | Resources: 6 | PlaybookInvestigateAlarm: 7 | Type: "AWS::SSM::Document" 8 | Properties: 9 | 
DocumentType: Automation 10 | Name: Playbook-Investigate-Application-From-Alarm 11 | Content: 12 | description: |2- 13 | # What does this playbook do? 14 | 15 | This playbook executes **Playbook-Gather-Resources** to gather the application resources monitored by the canary. 16 | 17 | It then executes **Playbook-Investigate-Application-Resources** to investigate those resources for issues. 18 | 19 | The output of the investigation is sent to the SNS topic subscribers. 20 | schemaVersion: '0.3' 21 | assumeRole: !Ref PlaybookIAMRole 22 | parameters: 23 | AlarmARN: 24 | type: String 25 | SNSTopicARN: 26 | type: String 27 | mainSteps: 28 | - name: gatherResources 29 | action: 'aws:executeAutomation' 30 | inputs: 31 | DocumentName: Playbook-Gather-Resources 32 | RuntimeParameters: 33 | AlarmARN: '{{AlarmARN}}' 34 | - name: investigateAppResources 35 | action: 'aws:executeAutomation' 36 | inputs: 37 | DocumentName: Playbook-Investigate-Application-Resources 38 | RuntimeParameters: 39 | Resources: '{{gatherResources.Output}}' 40 | - name: AWSPublishSNSNotification 41 | action: 'aws:executeAutomation' 42 | inputs: 43 | DocumentName: AWS-PublishSNSNotification 44 | RuntimeParameters: 45 | TopicArn: '{{SNSTopicARN}}' 46 | Message: '{{ investigateAppResources.Output }}' 47 | -------------------------------------------------------------------------------- /Code/scripts/build_application.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Build Script 4 | LABEL='latest' 5 | ECR_REPONAME='walab-ops-sample-application' 6 | SAMPLE_APPNAME=$ECR_REPONAME 7 | MAIN_STACK='walab-ops-base-resources' 8 | SYSOPSEMAIL=$2 9 | SYSOWNEREMAIL=$3 10 | 11 | sudo yum install jq -y -q 12 | AWS_REGION=$(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq '.region' | sed -e 's/^"//' -e 's/"$//') 13 | AWS_ACCOUNT=$(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq '.accountId'
| sed -e 's/^"//' -e 's/"$//') 14 | RESOURCEID=$(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq '.instanceId' | sed -e 's/^"//' -e 's/"$//') 15 | 16 | 17 | echo '#################################################' 18 | echo 'Script will deploy application with below details' 19 | echo '#################################################' 20 | echo 'Region: ' $AWS_REGION 21 | echo 'Account: '$AWS_ACCOUNT 22 | echo 'Repo Name: '$ECR_REPONAME 23 | echo 'Label: '$LABEL 24 | 25 | echo '##############################' 26 | echo 'Building Application Container' 27 | echo '##############################' 28 | aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com 29 | docker build -t $ECR_REPONAME ../src/ 30 | docker tag $ECR_REPONAME:latest $AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPONAME:$LABEL 31 | docker push $AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPONAME:$LABEL 32 | 33 | echo '########################' 34 | echo 'Deploy Application Stack' 35 | echo '########################' 36 | aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com 37 | sleep 15 38 | 39 | aws cloudformation create-stack --stack-name $ECR_REPONAME \ 40 | --template-body file://../templates/base_app.yml \ 41 | --parameters ParameterKey=BaselineVpcStack,ParameterValue=$MAIN_STACK \ 42 | ParameterKey=ECRImageURI,ParameterValue=$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPONAME:$LABEL \ 43 | ParameterKey=SystemOpsNotificationEmail,ParameterValue=$SYSOPSEMAIL \ 44 | ParameterKey=SystemOwnerNotificationEmail,ParameterValue=$SYSOWNEREMAIL \ 45 | --capabilities CAPABILITY_NAMED_IAM \ 46 | --tags Key=Application,Value=OpsExcellence-Lab 47 | 48 | echo '#########################################' 49 | echo 'Waiting for Application Stack to complete' 50 | echo '#########################################' 51 | aws cloudformation wait 
stack-create-complete --stack-name $ECR_REPONAME 52 | 53 | echo '#########################################' 54 | echo 'Application create complete' 55 | echo '#########################################' -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. 
Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | -------------------------------------------------------------------------------- /Code/templates/runbook_approval_gate.yml: -------------------------------------------------------------------------------- 1 | Parameters: 2 | PlaybookIAMRole: 3 | Type: String 4 | 5 | Resources: 6 | AutomaticApproveWithTimer: 7 | Type: "AWS::SSM::Document" 8 | Properties: 9 | DocumentType: Automation 10 | Name: Approval-Timer 11 | Content: 12 | schemaVersion: '0.3' 13 | assumeRole: !Ref PlaybookIAMRole 14 | parameters: 15 | AutomationExecutionId: 16 | type: String 17 | Timer: 18 | type: String 19 | default: PT10M 20 | mainSteps: 21 | - name: SleepTimer 22 | action: 'aws:sleep' 23 | inputs: 24 | Duration: '{{Timer}}' 25 | - name: ApproveExecution 26 | action: 'aws:executeAwsApi' 27 | inputs: 28 | Api: SendAutomationSignal 29 | Service: ssm 30 | Payload: 31 | Comment: 32 | - 'Automatic Approved by Automatic-Approval-With-Timer' 33 | AutomationExecutionId: '{{AutomationExecutionId}}' 34 | SignalType: Approve 35 | ApprovalGateWithTimer: 36 | Type: "AWS::SSM::Document" 37 | Properties: 38 | DocumentType: Automation 39 | Name: Approval-Gate 40 | Content: 41 | schemaVersion: '0.3' 42 | assumeRole: !Ref PlaybookIAMRole 43 | parameters: 44 | Timer: 45 | type: String 46 | default: PT10M 47 | NotificationTopicArn: 48 | type: String 49 | NotificationMessage: 50 | type: String 51 | ApproverArn: 52 | type: String 53 | outputs: 54 | - getApprovalStatus.approvalStatusVariable 55 | mainSteps: 56 | - name: executeAutoApproveTimer 57 | action: 'aws:executeScript' 58 | inputs: 59 | Runtime: python3.6 60 | Handler: handler 61 | InputPayload: 62 | AutomationExecutionId: '{{automation:EXECUTION_ID}}' 63 | Timer: '{{Timer}}' 64 | Script: |- 65 | import boto3 66 | def handler(event, context): 67 | client = boto3.client('ssm') 68 | response = client.start_automation_execution( 69 | DocumentName='Approval-Timer', 70 | Parameters={ 71 | 'Timer': [ event['Timer'] ], 72 | 'AutomationExecutionId' : [ 
event['AutomationExecutionId'] ] 73 | } 74 | ) 75 | return None 76 | - name: ApproveOrDeny 77 | action: 'aws:approve' 78 | onFailure: Continue 79 | isCritical: false 80 | inputs: 81 | NotificationArn: '{{NotificationTopicArn}}' 82 | Message: '{{NotificationMessage}}' 83 | MinRequiredApprovals: 1 84 | Approvers: 85 | - '{{ApproverArn}}' 86 | - !Ref PlaybookIAMRole 87 | - name: getApprovalStatus 88 | action: 'aws:executeAwsApi' 89 | maxAttempts: 1 90 | inputs: 91 | Service: ssm 92 | Api: DescribeAutomationStepExecutions 93 | AutomationExecutionId: '{{automation:EXECUTION_ID}}' 94 | Filters: 95 | - Key: StepName 96 | Values: 97 | - ApproveOrDeny 98 | outputs: 99 | - Name: approvalStatusVariable 100 | Selector: '$.StepExecutions[0].Outputs.ApprovalStatus[0]' 101 | Type: String 102 | -------------------------------------------------------------------------------- /Code/scripts/teardown_resources.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | ECR_REPONAME='walab-ops-sample-application' 4 | SAMPLE_APPNAME=$ECR_REPONAME 5 | CANARY_RESULT_BUCKET=$(aws cloudformation describe-stacks --stack-name $SAMPLE_APPNAME | jq '.Stacks[0].Outputs[] | select(.OutputKey == "OutputCanaryResultsBucket") | .OutputValue' | sed -e 's/^"//' -e 's/"$//') 6 | MAIN_STACK='walab-ops-base-resources' 7 | 8 | 9 | 10 | 11 | echo '############' 12 | echo 'Cleanup Repo' 13 | echo '############' 14 | echo $ECR_REPONAME 15 | aws ecr delete-repository --repository-name $ECR_REPONAME --force 16 | 17 | echo '####################' 18 | echo 'Cleanup Canary Bucket' 19 | echo '####################' 20 | echo $CANARY_RESULT_BUCKET 21 | aws s3 rm s3://$CANARY_RESULT_BUCKET --recursive 22 | 23 | echo '####################' 24 | echo 'Cleanup Canary ' 25 | echo '####################' 26 | CANARY_LAMBDA=$(aws synthetics describe-canaries --query "Canaries[?Name == 'mysecretword-canary'].EngineArn" | sed '1d;$d' | sed -r 's/\s+//g' | tr -d '",' | cut
-d':' -f7) 27 | CANARY_SECGROUP=$(aws lambda get-function-configuration --function-name $CANARY_LAMBDA --query VpcConfig.SecurityGroupIds[0] | tr -d '",') 28 | CANARY_NICS=$(aws ec2 describe-network-interfaces --filters Name=group-id,Values=$CANARY_SECGROUP --query NetworkInterfaces[].NetworkInterfaceId | sed ':a;N;$!ba;s/\n/ /g' | sed -r 's/\s+//g' | sed -r 's/,/ /g' | tr -d '[,' | tr -d '],' | tr -d '",') 29 | echo 'Delete ' $CANARY_LAMBDA 30 | aws lambda delete-function --function-name $CANARY_LAMBDA 31 | echo 'Sleep for 5 mins before deleting the Network Interface:' $CANARY_NICS 32 | sleep 300 33 | for t in ${CANARY_NICS[@]}; do 34 | echo 'Deleting ENI' $t 35 | aws ec2 delete-network-interface --network-interface-id $t 36 | done 37 | echo 'Sleep for 1 min before deleting the SEC Group:' $CANARY_SECGROUP 38 | sleep 60 39 | echo 'Delete Security Group' 40 | aws ec2 delete-security-group --group-id $CANARY_SECGROUP 41 | echo 'Delete Canary' 42 | aws synthetics stop-canary --name mysecretword-canary 43 | aws synthetics delete-canary --name mysecretword-canary 44 | 45 | echo '##########################' 46 | echo 'Deleting Application Stack' 47 | echo '##########################' 48 | aws cloudformation delete-stack --stack-name $SAMPLE_APPNAME 49 | aws cloudformation wait stack-delete-complete --stack-name $SAMPLE_APPNAME 50 | 51 | echo '################################' 52 | echo 'Deleting Playbook/Runbook Stacks' 53 | echo '################################' 54 | 55 | 56 | 57 | aws cloudformation delete-stack --stack-name waopslab-runbook-approval-gate 58 | aws cloudformation wait stack-delete-complete --stack-name waopslab-runbook-approval-gate 59 | 60 | aws cloudformation delete-stack --stack-name waopslab-automation-role 61 | aws cloudformation wait stack-delete-complete --stack-name waopslab-automation-role 62 | 63 | aws cloudformation delete-stack --stack-name waopslab-runbook-scale-ecs-service 64 | aws cloudformation wait stack-delete-complete --stack-name 
waopslab-runbook-scale-ecs-service 65 | 66 | aws cloudformation delete-stack --stack-name waopslab-playbook-gather-resources 67 | aws cloudformation wait stack-delete-complete --stack-name waopslab-playbook-gather-resources 68 | 69 | aws cloudformation delete-stack --stack-name waopslab-playbook-investigate-resources 70 | aws cloudformation wait stack-delete-complete --stack-name waopslab-playbook-investigate-resources 71 | 72 | aws cloudformation delete-stack --stack-name waopslab-playbook-investigate-application 73 | aws cloudformation wait stack-delete-complete --stack-name waopslab-playbook-investigate-application 74 | 75 | echo '##########################' 76 | echo 'Deleting Base Resources' 77 | echo '##########################' 78 | aws cloudformation delete-stack --stack-name $MAIN_STACK 79 | aws cloudformation wait stack-delete-complete --stack-name $MAIN_STACK 80 | 81 | echo '#########################################' 82 | echo 'Application Teardown Complete' 83 | echo '#########################################' 84 | 85 | 86 | -------------------------------------------------------------------------------- /Code/templates/playbook_gather_resources.yml: -------------------------------------------------------------------------------- 1 | Parameters: 2 | PlaybookIAMRole: 3 | Type: String 4 | 5 | Resources: 6 | PlaybookGatherAppResourceAlarm: 7 | Type: "AWS::SSM::Document" 8 | Properties: 9 | DocumentType: Automation 10 | Name: Playbook-Gather-Resources 11 | Content: 12 | schemaVersion: '0.3' 13 | assumeRole: "{{AutomationAssumeRole}}" 14 | parameters: 15 | AlarmARN: 16 | description: (Required) The Alarm ARN triggering incident. 17 | type: String 18 | AutomationAssumeRole: 19 | type: String 20 | default: !Ref PlaybookIAMRole 21 | description: (Optional) The ARN of the role that allows Automation to perform the actions on your behalf. 
22 | outputs: 23 | - Gather_Resources_For_Alarm.Resources 24 | mainSteps: 25 | - name: Gather_Resources_For_Alarm 26 | action: aws:executeScript 27 | description: Gather AWS resources related to the alarm, based on its tag 28 | outputs: 29 | - Name: Resources 30 | Selector: $.Payload.ApplicationStackResources 31 | Type: String 32 | inputs: 33 | Runtime: python3.6 34 | Handler: handler 35 | InputPayload: 36 | CloudWatchAlarmARN: '{{AlarmARN}}' 37 | Script: |- 38 | import json 39 | import re 40 | from datetime import datetime 41 | import boto3 42 | import os 43 | 44 | def arn_deconstruct(arn): 45 | # arn:aws:cloudwatch:us-east-1:754323466686:alarm:mysecretword-canary-alarm 46 | arnlist = arn.split(":") 47 | service=arnlist[2] 48 | region=arnlist[3] 49 | accountid=arnlist[4] 50 | servicetype=arnlist[5] 51 | name=arnlist[6] 52 | 53 | return { 54 | "Service": service, 55 | "Region": region, 56 | "AccountId": accountid, 57 | "Type": servicetype, 58 | "Name": name 59 | } 60 | 61 | def locate_alarm_source(alarm): 62 | cwclient = boto3.client('cloudwatch', region_name = alarm['Region'] ) 63 | alarm_source = {} 64 | alarm_detail = cwclient.describe_alarms(AlarmNames=[alarm['Name']]) 65 | 66 | if len(alarm_detail['MetricAlarms']) > 0: 67 | metric_alarm = alarm_detail['MetricAlarms'][0] 68 | namespace = metric_alarm['Namespace'] 69 | 70 | # Condition if the namespace is CloudWatch Synthetics 71 | if namespace == 'CloudWatchSynthetics': 72 | if 'Dimensions' in metric_alarm: 73 | dimensions = metric_alarm['Dimensions'] 74 | for i in dimensions: 75 | if i['Name'] == 'CanaryName': 76 | source_name = i['Value'] 77 | alarm_source['Type'] = namespace 78 | alarm_source['Name'] = source_name 79 | alarm_source['Region'] = alarm['Region'] 80 | alarm_source['AccountId'] = alarm['AccountId'] 81 | 82 | result = alarm_source 83 | return result 84 | 85 | # #Condition for CompositeAlarms 86 | # if len(alarm_detail['CompositeAlarms']) > 0: 87 | 88 | def locate_canary_endpoint(canaryname,region): 89
| result = None 90 | synclient = boto3.client('synthetics', region_name = region ) 91 | res = synclient.get_canary(Name=canaryname) 92 | canary = res['Canary'] 93 | if 'Tags' in canary: 94 | if 'TargetEndpoint' in canary['Tags']: 95 | target_endpoint = canary['Tags']['TargetEndpoint'] 96 | result = target_endpoint 97 | 98 | return result 99 | 100 | 101 | def locate_app_tag_value(resource): 102 | result = None 103 | 104 | if resource['Type'] == 'CloudWatchSynthetics': 105 | synclient = boto3.client('synthetics', region_name = resource['Region'] ) 106 | res = synclient.get_canary(Name=resource['Name']) 107 | canary = res['Canary'] 108 | if 'Tags' in canary: 109 | if 'Application' in canary['Tags']: 110 | apptag_val = canary['Tags']['Application'] 111 | result = apptag_val 112 | 113 | return result 114 | 115 | def locate_app_resources_by_tag(tag,region): 116 | result = None 117 | 118 | # Search CloudFormation stacks for the tag 119 | cfnclient = boto3.client('cloudformation', region_name = region ) 120 | stack_list = cfnclient.list_stacks(StackStatusFilter=['CREATE_COMPLETE','ROLLBACK_COMPLETE','UPDATE_COMPLETE','UPDATE_ROLLBACK_COMPLETE','IMPORT_COMPLETE','IMPORT_ROLLBACK_COMPLETE'] ) 121 | for stack in stack_list['StackSummaries']: 122 | app_resources_list = [] 123 | stack_name = stack['StackName'] 124 | stack_details = cfnclient.describe_stacks(StackName=stack_name) 125 | stack_info = stack_details['Stacks'][0] 126 | if 'Tags' in stack_info: 127 | for t in stack_info['Tags']: 128 | if t['Key'] == 'Application' and t['Value'] == tag: 129 | app_stack_name = stack_info['StackName'] 130 | app_resources = cfnclient.describe_stack_resources(StackName=app_stack_name) 131 | for resource in app_resources['StackResources']: 132 | app_resources_list.append( 133 | { 134 | 'PhysicalResourceId' : resource['PhysicalResourceId'], 135 | 'Type': resource['ResourceType'] 136 | } 137 | ) 138 | result = app_resources_list 139 | 140 | return result 141 | def handler(event, context): 142 | result = {}
143 | arn = event['CloudWatchAlarmARN'] 144 | alarm = arn_deconstruct(arn) 145 | # Locate tag from CloudWatch Alarm 146 | 147 | 148 | alarm_source = locate_alarm_source(alarm) # Identify Alarm Source 149 | tag_value = locate_app_tag_value(alarm_source) #Identify tag from source 150 | 151 | if alarm_source['Type'] == 'CloudWatchSynthetics': 152 | endpoint = locate_canary_endpoint(alarm_source['Name'],alarm_source['Region']) 153 | result['CanaryEndpoint'] = endpoint 154 | 155 | # Locate cloudformation with tag 156 | resources = locate_app_resources_by_tag(tag_value,alarm['Region']) 157 | result['ApplicationStackResources'] = json.dumps(resources) 158 | 159 | return result -------------------------------------------------------------------------------- /Code/templates/base_resources.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | 3 | Description: > 4 | Well Architected Operational Excellence Lab 5 | 6 | Parameters: 7 | Cloud9CidrBlock: 8 | Description: The CIDR block range for your Cloud9 IDE VPC 9 | Type: String 10 | Default: 10.43.0.0/28 11 | GitRepositoryURL: 12 | Description: The Git repository URL for the project we are cloning 13 | Type: String 14 | Default: https://github.com/awslabs/aws-well-architected-labs.git 15 | 16 | 17 | Resources: 18 | #------------------------------------------------------------ 19 | # Create a VPC with a public and private subnet 20 | #------------------------------------------------------------ 21 | VPC: 22 | Type: 'AWS::EC2::VPC' 23 | Properties: 24 | CidrBlock: 172.31.0.0/16 25 | Tags: 26 | - Key: Name 27 | Value: !Sub "${AWS::StackName}-VPC" 28 | EnableDnsHostnames: true 29 | EnableDnsSupport: true 30 | 31 | PublicSubnet1: 32 | Type: 'AWS::EC2::Subnet' 33 | Properties: 34 | VpcId: !Ref VPC 35 | AvailabilityZone: !Select 36 | - '0' 37 | - !GetAZs '' 38 | CidrBlock: 172.31.1.0/24 39 | MapPublicIpOnLaunch: true 40 | 41 | PublicSubnet2: 42 | Type: 
'AWS::EC2::Subnet' 43 | Properties: 44 | VpcId: !Ref VPC 45 | AvailabilityZone: !Select 46 | - '1' 47 | - !GetAZs '' 48 | CidrBlock: 172.31.3.0/24 49 | MapPublicIpOnLaunch: true 50 | 51 | PrivateSubnet1: 52 | Type: 'AWS::EC2::Subnet' 53 | Properties: 54 | VpcId: !Ref VPC 55 | AvailabilityZone: !Select 56 | - '0' 57 | - !GetAZs '' 58 | CidrBlock: 172.31.2.0/24 59 | MapPublicIpOnLaunch: false 60 | 61 | PrivateSubnet2: 62 | Type: 'AWS::EC2::Subnet' 63 | Properties: 64 | VpcId: !Ref VPC 65 | AvailabilityZone: !Select 66 | - '1' 67 | - !GetAZs '' 68 | CidrBlock: 172.31.4.0/24 69 | MapPublicIpOnLaunch: false 70 | 71 | #------------------------------------------------------------- 72 | # Create an IGW and attach to the created VPC 73 | # Create a NAT GW with an associated public IP address. 74 | #------------------------------------------------------------- 75 | 76 | IGW: 77 | Type: AWS::EC2::InternetGateway 78 | Properties: 79 | Tags: 80 | - Key: Name 81 | Value: !Sub "${AWS::StackName}-InternetGateway" 82 | 83 | IGWAttach: 84 | Type: AWS::EC2::VPCGatewayAttachment 85 | Properties: 86 | VpcId: !Ref VPC 87 | InternetGatewayId: !Ref IGW 88 | 89 | NatGateway: 90 | Type: "AWS::EC2::NatGateway" 91 | DependsOn: NatPublicIP 92 | Properties: 93 | AllocationId: !GetAtt NatPublicIP.AllocationId 94 | SubnetId: !Ref PublicSubnet1 95 | 96 | NatPublicIP: 97 | Type: "AWS::EC2::EIP" 98 | DependsOn: VPC 99 | Properties: 100 | Domain: vpc 101 | 102 | #------------------------------------------------------------- 103 | # Create public route table and attach to the public subnets 104 | #------------------------------------------------------------- 105 | 106 | PublicRouteTable1: 107 | Type: 'AWS::EC2::RouteTable' 108 | Properties: 109 | VpcId: !Ref VPC 110 | Tags: 111 | - Key: Name 112 | Value: !Sub "${AWS::StackName}-Public-RouteTable1" 113 | 114 | 115 | PublicRouteTable2: 116 | Type: 'AWS::EC2::RouteTable' 117 | Properties: 118 | VpcId: !Ref VPC 119 | Tags: 120 | - Key: Name 121 | Value: 
!Sub "${AWS::StackName}-Public-RouteTable2" 122 | 123 | PublicRoute1: 124 | Type: 'AWS::EC2::Route' 125 | DependsOn: 126 | - IGW 127 | - IGWAttach 128 | Properties: 129 | RouteTableId: !Ref PublicRouteTable1 130 | DestinationCidrBlock: 0.0.0.0/0 131 | GatewayId: !Ref IGW 132 | 133 | PublicRoute2: 134 | Type: 'AWS::EC2::Route' 135 | DependsOn: 136 | - IGW 137 | - IGWAttach 138 | Properties: 139 | RouteTableId: !Ref PublicRouteTable2 140 | DestinationCidrBlock: 0.0.0.0/0 141 | GatewayId: !Ref IGW 142 | 143 | PublicSubnet1RouteTableAssociation1: 144 | Type: 'AWS::EC2::SubnetRouteTableAssociation' 145 | Properties: 146 | SubnetId: !Ref PublicSubnet1 147 | RouteTableId: !Ref PublicRouteTable1 148 | 149 | PublicSubnet1RouteTableAssociation2: 150 | Type: 'AWS::EC2::SubnetRouteTableAssociation' 151 | Properties: 152 | SubnetId: !Ref PublicSubnet2 153 | RouteTableId: !Ref PublicRouteTable2 154 | 155 | #------------------------------------------------------------- 156 | # Create private route tables and attach to the private subnets 157 | #------------------------------------------------------------- 158 | 159 | PrivateRouteTable1: 160 | Type: 'AWS::EC2::RouteTable' 161 | Properties: 162 | VpcId: !Ref VPC 163 | 164 | 165 | PrivateRouteTable2: 166 | Type: 'AWS::EC2::RouteTable' 167 | Properties: 168 | VpcId: !Ref VPC 169 | 170 | PrivateRoute1: 171 | Type: 'AWS::EC2::Route' 172 | DependsOn: IGW 173 | Properties: 174 | RouteTableId: !Ref PrivateRouteTable1 175 | DestinationCidrBlock: 0.0.0.0/0 176 | NatGatewayId: !Ref NatGateway 177 | 178 | PrivateRoute2: 179 | Type: 'AWS::EC2::Route' 180 | DependsOn: IGW 181 | Properties: 182 | RouteTableId: !Ref PrivateRouteTable2 183 | DestinationCidrBlock: 0.0.0.0/0 184 | NatGatewayId: !Ref NatGateway 185 | 186 | PrivateSubnet1RouteTableAssociation1: 187 | Type: 'AWS::EC2::SubnetRouteTableAssociation' 188 | Properties: 189 | SubnetId: !Ref PrivateSubnet1 190 | RouteTableId: !Ref PrivateRouteTable1 191 | 192 |
PrivateSubnet1RouteTableAssociation2: 193 | Type: 'AWS::EC2::SubnetRouteTableAssociation' 194 | Properties: 195 | SubnetId: !Ref PrivateSubnet2 196 | RouteTableId: !Ref PrivateRouteTable2 197 | 198 | # ------------------ 199 | # ECR Repository 200 | # ------------------ 201 | 202 | AppContainerRepository: 203 | Type: AWS::ECR::Repository 204 | Properties: 205 | RepositoryName: walab-ops-sample-application 206 | 207 | Cloud9: 208 | Type: AWS::Cloud9::EnvironmentEC2 209 | Properties: 210 | AutomaticStopTimeMinutes: 30 211 | Description: Well Architected Operational Excellence lab workspace 212 | InstanceType: t2.small 213 | ImageId: amazonlinux-2-x86_64 214 | Name: !Sub "WellArchitectedOps-${AWS::StackName}" 215 | Repositories: 216 | - PathComponent: /aws-well-architected-labs 217 | RepositoryUrl: !Ref GitRepositoryURL 218 | SubnetId: !Ref PublicSubnet1 219 | 220 | Outputs: 221 | Cloud9DevEnvUrl: 222 | Description: Cloud9 Development Environment 223 | Value: !Sub "https://${AWS::Region}.console.aws.amazon.com/cloud9/ide/${Cloud9}" 224 | 225 | 226 | 227 | OutputVPC: 228 | Description: Baseline VPC 229 | Value: !Ref VPC 230 | Export: 231 | Name: !Sub "${AWS::StackName}-VpcId" 232 | OutputVPCCidrBlock: 233 | Description: Baseline VPC Cidr Block 234 | Value: !GetAtt VPC.CidrBlock 235 | Export: 236 | Name: !Sub "${AWS::StackName}-VpcCidrBlock" 237 | OutputPublicSubnet1: 238 | Description: Public Subnet 1 VPC 239 | Value: !Ref PublicSubnet1 240 | Export: 241 | Name: !Sub "${AWS::StackName}-PublicSubnet1" 242 | OutputPublicSubnet2: 243 | Description: Public Subnet 2 VPC 244 | Value: !Ref PublicSubnet2 245 | Export: 246 | Name: !Sub "${AWS::StackName}-PublicSubnet2" 247 | OutputPrivateSubnet1: 248 | Description: Private Subnet 1 VPC 249 | Value: !Ref PrivateSubnet1 250 | Export: 251 | Name: !Sub "${AWS::StackName}-PrivateSubnet1" 252 | OutputPrivateSubnet2: 253 | Description: Private Subnet 2 VPC 254 | Value: !Ref PrivateSubnet2 255 | Export: 256 | Name: !Sub
"${AWS::StackName}-PrivateSubnet2" 257 | OutputAppContainerRepository: 258 | Description: Applicaton ECR Repository 259 | Value: !Ref AppContainerRepository 260 | Export: 261 | Name: !Sub "${AWS::StackName}-AppContainerRepository" 262 | -------------------------------------------------------------------------------- /Code/src/app.js: -------------------------------------------------------------------------------- 1 | 'use strict'; 2 | const AWS = require('aws-sdk'); 3 | const AWSXRay = require('aws-xray-sdk'); 4 | const kmsClient = AWSXRay.captureAWSClient(new AWS.KMS({region: process.env.REGION })); 5 | const secretsmanager = AWSXRay.captureAWSClient(new AWS.SecretsManager({region: process.env.REGION })); 6 | const express = require('express'); 7 | const router = express.Router(); 8 | const bodyParser = require("body-parser"); 9 | var mysql = AWSXRay.captureMySQL(require('mysql')); 10 | const zlib = require('zlib'); 11 | 12 | 13 | // Constants 14 | const PORT = 80; 15 | const HOST = '0.0.0.0'; 16 | 17 | // App 18 | const app = express(); 19 | app.use(AWSXRay.express.openSegment('mysecretapp-api')); 20 | 21 | 22 | app.use(bodyParser.urlencoded({ extended: false })); 23 | app.use(bodyParser.json()); 24 | 25 | 26 | const DBHOST = process.env.DBHOST; 27 | const KeyId = process.env.KeyId; 28 | const DBSecret = process.env.DBSecret; 29 | 30 | function hydrateDBCreds( DBSecret ){ 31 | var promise = new Promise(function(resolve,reject){ 32 | var params = { 33 | SecretId: DBSecret 34 | }; 35 | secretsmanager.getSecretValue(params, function(err, data) { 36 | if (err){ 37 | console.log(err); 38 | reject(err); 39 | } 40 | else{ 41 | var secString = data['SecretString'] 42 | var secObj = JSON.parse(secString) 43 | process.env.DBUSER = secObj['username'] 44 | process.env.DBPASS = secObj['password'] 45 | resolve ( data ); 46 | } 47 | // successful response 48 | }); 49 | }); 50 | return promise; 51 | }; 52 | 53 | function encryptData( KeyId, Plaintext ){ 54 | var promise = new 
Promise(function(resolve,reject){
    kmsClient.encrypt({ KeyId, Plaintext }, (err, data) => {
      if (err) {
        console.log(err)
        reject(err); // an error occurred
      }
      else {
        const { CiphertextBlob } = data;
        resolve ( CiphertextBlob );
      };
    });

  });
  return promise;
};


// Decrypt a CiphertextBlob with the given KMS key; resolves with the plaintext string.
function decryptData( KeyId, CiphertextBlob ){
  var promise = new Promise(function(resolve,reject){
    kmsClient.decrypt({ CiphertextBlob, KeyId }, (err, data) => {
      if (err) {
        console.log(err)
        reject(err); // an error occurred
      }
      else {
        const { Plaintext } = data;
        resolve ( Plaintext.toString() );
      };
    });

  });
  return promise;
};



// Create the application database if it does not exist yet.
function createDB(DBSecret){

  var promise = new Promise(function(resolve,reject){
    try
    {
      var con = mysql.createConnection({host: DBHOST,user: process.env.DBUSER ,password: process.env.DBPASS});
      var sql = "CREATE DATABASE IF NOT EXISTS mydb";
      con.query(sql, function (err, result) {
        if (err) {
          con.end();
          reject(err);
        }
        else{
          con.end();
          resolve("database create done");
        }
      });
    }
    catch(err){
      reject(err);
    }
  });
  return promise;
};

// Create the peoplesecret table if it does not exist yet.
function createTable(){
  var promise = new Promise(function(resolve,reject){
    try
    {
      var con = mysql.createConnection({host: DBHOST,user: process.env.DBUSER ,password: process.env.DBPASS,database: "mydb"});
      var
      sql = "CREATE TABLE IF NOT EXISTS peoplesecret (name VARCHAR(244) NOT NULL, secret TEXT, PRIMARY KEY (name) )";
      con.query(sql, function (err, result) {
        if (err) {
          con.end();
          reject(err);
        }
        else{
          con.end();
          resolve("table create done");
        }
      });
    }
    catch(err){
      reject(err);
    }
  });
  return promise;
};

// Insert or update the encrypted secret for Payload['Name'].
function storeSecret(Payload){

  var promise = new Promise(function(resolve,reject){
    try{
      var con = mysql.createConnection({host: DBHOST,user: process.env.DBUSER ,password: process.env.DBPASS,database: "mydb"});
      // Use ? placeholders so the driver escapes request values instead of
      // concatenating them into the SQL string (which allowed SQL injection).
      var sql = "INSERT INTO peoplesecret (name, secret) VALUES (?, ?) ON DUPLICATE KEY UPDATE name = ?, secret = ?";
      var values = [Payload['Name'], Payload['Text'], Payload['Name'], Payload['Text']];
      con.query(sql, values, function (err, result) {
        if (err) {
          con.end();
          reject(err);
        }
        else{
          con.end();
          resolve("1 record inserted");
        }
      });
    }
    catch(err){
      reject(err);
    }
  });
  return promise;
};

// Look up the stored secret for Payload['Name']; rejects if no row matches.
function getSecret(Payload){

  var promise = new Promise(function(resolve,reject){
    try{
      var con = mysql.createConnection({host: DBHOST,user: process.env.DBUSER ,password: process.env.DBPASS,database: "mydb"});
      // Placeholder query, for the same reason as in storeSecret.
      var sql = "SELECT secret from peoplesecret WHERE name = ?";
      con.query(sql, [Payload['Name']], function (err, result) {
        if (err) {
          con.end();
          reject(err);
        }
        else{
          if(result.length < 1){
            con.end();
            reject('no record found');
          }
          else{
            con.end();
            resolve(result[0].secret);
          }
        }
      });
    }
    catch(err){
      reject(err)
    }
  });
  return promise;
};



router.get('/', (req, res) => {
  res.status(200).send( 'OK' );
});

router.post('/encrypt', (req,
res) => {
  const Payload = {
    'Name': req.body.Name,
    'Text': req.body.Text
  }

  hydrateDBCreds(DBSecret)
  .then(function(){
    return encryptData(KeyId, Payload['Text'])
    .then(function(encryptedData) {
      const EncryptedDataBase64Str = zlib.gzipSync(JSON.stringify(encryptedData)).toString('base64');
      Payload['Text'] = EncryptedDataBase64Str

      // Prepare database and table, then insert the record. Each promise is
      // returned so the 200 response is sent only after the insert succeeds,
      // and the callbacks no longer shadow the Express `res` object.
      return createDB()
      .then(function(){
        return createTable();
      })
      .then(function(){
        return storeSecret(Payload);
      })
      .then(function(response){
        console.log(response);
        var output = {
          'Message':'Data encrypted and stored, keep your key safe',
          'Key' : KeyId
        };
        res.status(200).send( output );
      })
      .catch(function(err){
        console.log(err);
        res.status(500).send( {'Message':'oops something went wrong !' });
      });
    })
    .catch(function(err) {
      console.log(err);

      res.status(400).send( {'Message':'Data encryption failed, check logs for more details' });
    });
  })
  .catch(function(err){
    console.log(err);
    res.status(400).send( {'Message':'Failed Hydrating Credentials' });
  })
});

router.get('/decrypt', (req, res) => {
  const Payload = {
    'Name': req.body.Name,
    'Key': req.body.Key
  }
  hydrateDBCreds(DBSecret)
  .then(function(response){
    getSecret(Payload)
    .then(function(response) {
      var secretText = response;
      // Reverse the encoding applied in /encrypt: base64 -> gunzip -> JSON -> Buffer.
      const originalObj = JSON.parse(zlib.unzipSync(Buffer.from(secretText, 'base64')));
      var buf = Buffer.from(originalObj, 'utf8');
      decryptData(Payload['Key'],buf)
      .then(function(response){
        var output = {
          'Text':response,
        };
        res.status(200).send( output );
      })
      .catch(function(err){
        res.status(400).send( {'Message':'Data decryption failed, make sure you have the correct key' });
      });
      return response;
    })
    .catch(function(err) {
      res.status(400).send( {'Message':'Failed getting secret text, check the user name' });
    });
    return response;
  })
  .catch(function(err){
    console.log(err);
    res.status(400).send( {'Message':'Failed Hydrating Credentials' });
  })
});

app.use("/",router);
app.use(AWSXRay.express.closeSegment());


app.listen(PORT, HOST);
console.log(`Running on http://${HOST}:${PORT}`);
--------------------------------------------------------------------------------
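The `/encrypt` and `/decrypt` handlers in `app.js` store the KMS ciphertext as text: the Buffer is serialized to JSON, gzipped, and base64-encoded, and the decrypt path reverses those steps. The following is a minimal sketch of that round trip using only Node's built-in `zlib`. The helper names `packCiphertext` and `unpackCiphertext` are hypothetical (they do not exist in `app.js`), and the hard-coded bytes stand in for a real `CiphertextBlob`.

```javascript
'use strict';
const zlib = require('zlib');

// Hypothetical helper: mirrors the encoding applied in the /encrypt handler.
function packCiphertext(ciphertextBlob) {
  // JSON.stringify on a Buffer yields {"type":"Buffer","data":[...]}.
  return zlib.gzipSync(JSON.stringify(ciphertextBlob)).toString('base64');
}

// Hypothetical helper: reverses packCiphertext, as the /decrypt handler must.
function unpackCiphertext(packed) {
  const json = zlib.unzipSync(Buffer.from(packed, 'base64')).toString('utf8');
  // Rebuild the Buffer from the byte array stored under "data".
  return Buffer.from(JSON.parse(json).data);
}

// Stand-in bytes; a real CiphertextBlob would come from kmsClient.encrypt.
const fakeBlob = Buffer.from([0x01, 0x02, 0xfe, 0xff]);
const packed = packCiphertext(fakeBlob);
const restored = unpackCiphertext(packed);
console.log(restored.equals(fakeBlob));
```

The sketch rebuilds the Buffer from the parsed `data` array explicitly, which makes the intermediate JSON shape visible; the stored value stays plain base64 text, so it fits the `TEXT` column of `peoplesecret`.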
/Code/templates/base_app.yml:
--------------------------------------------------------------------------------
Parameters:
  BaselineVpcStack:
    Type: String
  ECRImageURI:
    Type: String
  SystemOpsNotificationEmail:
    Type: String
  SystemOwnerNotificationEmail:
    Type: String

Outputs:
  OutputApplicationEndpoint:
    Description: Application Endpoint
    Value: !GetAtt ALB.DNSName
  OutputECSService:
    Description: App Task ECS Service
    Value: !GetAtt ECSService.Name
  OutputECSCluster:
    Description: App Task ECS Cluster
    Value: !Ref ECSCluster
  OutputSystemOwnersTopicArn:
    Description: Arn of the SNS Topic for System Owners
    Value: !Ref SystemOwnersTopic
  OutputSystemEventTopicArn:
    Description: Arn of the SNS Topic for System Events
    Value: !Ref SystemEventTopic
  SyntheticsCanaryDurationAlarmArn:
    Description: Arn CloudWatch Alarm
    Value: !GetAtt SyntheticsCanaryDurationAlarm.Arn
  OutputCanaryResultsBucket:
    Description: Canary Result Bucket
    Value: !Ref ResultsBucket

Resources:
  #----------------------------------------------------------------------------------------
  # Build load balancer.
  #----------------------------------------------------------------------------------------
  ALB:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      SecurityGroups:
        - !Ref ELBSecurityGroup
      Subnets:
        -
          Fn::ImportValue:
            !Sub "${BaselineVpcStack}-PublicSubnet1"
        -
          Fn::ImportValue:
            !Sub "${BaselineVpcStack}-PublicSubnet2"
      Tags:
        - Key: Name
          Value: !Join [ "-", [ !Ref AWS::StackName, "ExternalALB"]]
        - Key: Application
          Value: "OpsExcellence-Lab"
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: 30
      Scheme: internal

  ALBTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      TargetType: ip
      HealthCheckEnabled: True
      HealthCheckIntervalSeconds: 60
      HealthCheckPath: /
      HealthCheckPort: 80
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 30
      HealthyThresholdCount: 3
      UnhealthyThresholdCount: 5
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: 0
      VpcId:
        Fn::ImportValue:
          !Sub "${BaselineVpcStack}-VpcId"
      Port: 80
      Protocol: HTTP
      Tags:
        - Key: Name
          Value: !Join [ "-", [ !Ref AWS::StackName, "ExternalALBTargetGroup"]]
        - Key: Application
          Value: "OpsExcellence-Lab"

  ALBListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ALB
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref ALBTargetGroup

  #----------------------------------------------------------------------------------------
  # Build load balancer security group.
  #----------------------------------------------------------------------------------------
  ELBSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable HTTP from the Internet
      VpcId:
        Fn::ImportValue:
          !Sub "${BaselineVpcStack}-VpcId"
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: '0.0.0.0/0'
      Tags:
        - Key: Name
          Value: !Join [ "-", [ !Ref AWS::StackName, "ExternalELBSecurityGroup"]]
        - Key: Application
          Value: "OpsExcellence-Lab"

  #----------------------------------------------------------------------------------------
  # Build ECS Resources.
  #----------------------------------------------------------------------------------------

  ContainerSecGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: !Join ['', [!Ref 'AWS::StackName', -ContainerSecGroup]]
      VpcId:
        Fn::ImportValue:
          !Sub "${BaselineVpcStack}-VpcId"
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          SourceSecurityGroupId: !Ref ELBSecurityGroup

  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: mysecretword-cluster
      CapacityProviders:
        - FARGATE
      DefaultCapacityProviderStrategy:
        - CapacityProvider: FARGATE
          Weight: 1

  ECSService:
    DependsOn: ALB
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ECSCluster
      ServiceName: mysecretword-service
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 100
      DesiredCount: 1
      HealthCheckGracePeriodSeconds: 60
      LoadBalancers:
        - ContainerName: mysecretword-app
          ContainerPort: 80
          TargetGroupArn: !Ref ALBTargetGroup
      TaskDefinition: !Ref TaskDefinition
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          Subnets:
            - Fn::ImportValue:
                !Sub "${BaselineVpcStack}-PrivateSubnet1"
            - Fn::ImportValue:
                !Sub "${BaselineVpcStack}-PrivateSubnet2"
          SecurityGroups:
            - !Ref ContainerSecGroup


  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Join ['', [!Ref 'AWS::StackName', -app]]
      TaskRoleArn: !Ref ECSTaskRole
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      Cpu: 256
      Memory: 0.5GB
      ContainerDefinitions:
        - Name: mysecretword-app
          Essential: 'true'
          Image: !Ref ECRImageURI
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref 'ECSCloudWatchLogsGroup'
              awslogs-region: !Ref 'AWS::Region'
              awslogs-stream-prefix: !Join ['', [!Ref 'AWS::StackName', -app]]
          Environment:
            - Name: DBHOST
              Value: !GetAtt RDS.Endpoint.Address
            - Name: KeyId
              Value: !Ref KMSKey
            - Name: DBSecret
              Value: !Ref RDSSecret
            - Name: REGION
              Value: !Ref AWS::Region
          PortMappings:
            - ContainerPort: 80


  ECSCloudWatchLogsGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Join ['', [!Ref 'AWS::StackName', -app-loggroup]]
      RetentionInDays: 365

  #----------------------------------------------------------------------------------------
  # Build ECS IAM Roles.
  #----------------------------------------------------------------------------------------

  ECSServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2008-10-17
        Statement:
          - Sid: ''
            Effect: Allow
            Principal:
              Service: ecs.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceRole'

  ECSTaskRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join ['', [!Ref 'AWS::StackName', -ECSTaskRole]]
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      Path: /
      Policies:
        - PolicyName: KMSAccess
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: '*'
                Resource: !GetAtt KMSKey.Arn
        - PolicyName: SMAccess
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: 'secretsmanager:GetSecretValue'
                Resource: '*'
        - PolicyName: CloudWatchLogs
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: 'logs:*'
                Resource: !GetAtt ECSCloudWatchLogsGroup.Arn
        - PolicyName: Xray
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: 'xray:*'
                Resource: '*'
                ### This needs to be restricted

  ECSTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join ['', [!Ref 'AWS::StackName', -ECSTaskExecutionRole]]
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'

  #----------------------------------------------------------------------------------------
  # Build RDS Instance.
  #----------------------------------------------------------------------------------------

  RDS:
    Type: AWS::RDS::DBInstance
    Properties:
      AllocatedStorage: 5
      DBInstanceClass: db.t2.micro
      Engine: MySQL
      MasterUsername: !Join ['', ['{{resolve:secretsmanager:', !Ref RDSSecret, ':SecretString:username}}' ]]
      MasterUserPassword: !Join ['', ['{{resolve:secretsmanager:', !Ref RDSSecret, ':SecretString:password}}' ]]
      DBSubnetGroupName: !Ref RDSSubnetGroup
      VPCSecurityGroups:
        - !Ref RDSSecGroup
      MultiAZ: False


  RDSSecGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: !Join ['', [!Ref 'AWS::StackName', -RDSSecGroup]]
      VpcId:
        Fn::ImportValue:
          !Sub "${BaselineVpcStack}-VpcId"
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 3306
          ToPort: 3306
          CidrIp:
            Fn::ImportValue:
              !Sub "${BaselineVpcStack}-VpcCidrBlock"

  RDSSubnetGroup:
    Type: "AWS::RDS::DBSubnetGroup"
    Properties:
      DBSubnetGroupDescription: "Subnet Group"
      SubnetIds:
        - Fn::ImportValue:
            !Sub "${BaselineVpcStack}-PrivateSubnet1"
        - Fn::ImportValue:
            !Sub "${BaselineVpcStack}-PrivateSubnet2"

  RDSSecret:
    Type: AWS::SecretsManager::Secret
    Properties:
      Description: 'This is the secret for my RDS instance'
      GenerateSecretString:
        SecretStringTemplate: '{"username": "masteradmin"}'
        GenerateStringKey: 'password'
        PasswordLength: 16
        ExcludeCharacters: '"@/\'

  #----------------------------------------------------------------------------------------
  # Build Parameter KMS Key
  #----------------------------------------------------------------------------------------

  KMSKey:
    Type: "AWS::KMS::Key"
    Properties:
      KeyPolicy:
        Version: 2012-10-17
        Id: key--1
        Statement:
          - Sid: Enable IAM User Permissions
            Effect: Allow
            Principal:
              AWS: !Join
                - ""
                - - "arn:aws:iam::"
                  - !Ref "AWS::AccountId"
                  - ":root"
            Action: "kms:*"
            Resource: "*"

  #----------------------------------------------------------------------------------------
  # Canary
  #----------------------------------------------------------------------------------------
  CloudWatchSyntheticsRole:
    Type: AWS::IAM::Role
    Properties:
      Description: CloudWatch Synthetics lambda execution role for running canaries
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
            Condition: {}

  RolePermissions:
    Type: AWS::IAM::Policy
    Properties:
      Roles:
        - Ref: CloudWatchSyntheticsRole
      PolicyName: CloudWatchSyntheticsPolicy
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - s3:PutObject
              - s3:GetBucketLocation
            Resource:
              - Fn::Sub: arn:aws:s3:::${ResultsBucket}/*
          - Effect: Allow
            Action:
              - logs:CreateLogStream
              - logs:PutLogEvents
              - logs:CreateLogGroup
            Resource:
              - '*'
          - Effect: Allow
            Action:
              - s3:ListAllMyBuckets
            Resource: '*'
          - Effect: Allow
            Resource: '*'
            Action: cloudwatch:PutMetricData
            Condition:
              StringEquals:
                cloudwatch:namespace: CloudWatchSynthetics
          - Effect: Allow
            Resource: '*'
            Action:
              - ec2:*

  ResultsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
    DeletionPolicy: Retain

  CanarySecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Canary Sec Group
      VpcId:
        Fn::ImportValue:
          !Sub "${BaselineVpcStack}-VpcId"
      Tags:
        - Key: Name
          Value: !Join [ "-", [ !Ref AWS::StackName, "CanarySecurityGroup"]]
        - Key: Application
          Value: "OpsExcellence-Lab"

  SyntheticsCanary:
    Type: 'AWS::Synthetics::Canary'
    Properties:
      Name: mysecretword-canary
      ExecutionRoleArn: !GetAtt CloudWatchSyntheticsRole.Arn
      Code:
        Handler: apiCanaryBlueprint.handler
        Script: |
          var synthetics = require('Synthetics');
          const log = require('SyntheticsLogger');
          apiCanaryBlueprint = async function (ms) {

            // Handle validation for positive scenario
            const validateSuccessfull = async function(res) {
              return new Promise((resolve, reject) => {
                if (res.statusCode < 200 || res.statusCode > 299) {
                  throw res.statusCode + ' ' + res.statusMessage;
                }

                let responseBody = '';
                res.on('data', (d) => {
                  responseBody += d;
                });

                res.on('end', () => {
                  // Add validation on 'responseBody' here if required.
                  resolve();
                });
              });
            };


            let requestOptionsStep1 = {
              hostname: process.env.CANARY_ENDPOINT,
              method: 'POST',
              path: '/encrypt',
              port: '80',
              protocol: 'http:',
              body: "{\"Name\":\"Test User\",\"Text\":\"This Message is a Test!\"}",
              headers: {"Content-Type":"application/json"}
            };
            requestOptionsStep1['headers']['User-Agent'] = [synthetics.getCanaryUserAgentString(), requestOptionsStep1['headers']['User-Agent']].join(' ');

            let stepConfig1 = {
              includeRequestHeaders: true,
              includeResponseHeaders: true,
              includeRequestBody: true,
              includeResponseBody: true,
              restrictedHeaders: [],
              continueOnHttpStepFailure: true
            };

            await synthetics.executeHttpStep('Verify', requestOptionsStep1, validateSuccessfull, stepConfig1);


          };
          exports.handler = async () => {
            return await apiCanaryBlueprint();
          };
      ArtifactS3Location:
        Fn::Join:
          - ''
          - - s3://
            - Ref: ResultsBucket
      RuntimeVersion: syn-nodejs-puppeteer-3.5
      Schedule:
        Expression: 'rate(1 minute)'
        DurationInSeconds: 0
      RunConfig:
        TimeoutInSeconds: 60
        EnvironmentVariables: { "CANARY_ENDPOINT" : !GetAtt ALB.DNSName }
      VPCConfig:
        SecurityGroupIds:
          - !Ref CanarySecurityGroup
        SubnetIds:
          -
            Fn::ImportValue:
              !Sub "${BaselineVpcStack}-PrivateSubnet1"
          -
            Fn::ImportValue:
              !Sub "${BaselineVpcStack}-PrivateSubnet2"
        VpcId:
          Fn::ImportValue:
            !Sub "${BaselineVpcStack}-VpcId"
      FailureRetentionPeriod: 30
      SuccessRetentionPeriod: 30
      StartCanaryAfterCreation: true
      Tags:
        - Key: Name
          Value: !Join [ "-", [ !Ref AWS::StackName, "Canary"]]
        - Key: Application
          Value: "OpsExcellence-Lab"
        - Key: TargetEndpoint
          Value: !GetAtt ALB.DNSName

  SyntheticsCanaryDurationAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Canary Alarm for My Secret Word
      AlarmName: mysecretword-canary-duation-alarm
      AlarmActions:
        - !Ref SystemEventTopic
      ComparisonOperator: GreaterThanOrEqualToThreshold
      EvaluationPeriods: 12
      DatapointsToAlarm: 3
      Dimensions:
        - Name: CanaryName
          Value: mysecretword-canary
        - Name: StepName
          Value: Verify
      Namespace: "CloudWatchSynthetics"
      MetricName: "Duration"
      Period: 30
      Statistic: Average
      Threshold: 5000
      TreatMissingData: ignore

  SystemEventTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: SystemEventTopic
      Subscription:
        - Endpoint: !Ref SystemOpsNotificationEmail
          Protocol: "Email"

  SystemOwnersTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: SystemOwnersTopic
      Subscription:
        - Endpoint: !Ref SystemOwnerNotificationEmail
          Protocol: "Email"

--------------------------------------------------------------------------------
/Code/templates/playbook_investigate_application_resources.yml:
--------------------------------------------------------------------------------
Parameters:
  PlaybookIAMRole:
    Type: String

Resources:
  PlaybookInvestigateAlarm:
    Type: "AWS::SSM::Document"
    Properties:
      DocumentType: Automation
      Name: Playbook-Investigate-Application-Resources
      Content:
        schemaVersion: '0.3'
        assumeRole: "{{AutomationAssumeRole}}"
        parameters:
          Resources:
            description: (Required) The Stringified Resources list from Gather Resource Alarm Output.
            type: String
          AutomationAssumeRole:
            type: String
            default: !Ref PlaybookIAMRole
            description: (Optional) The ARN of the role that allows Automation to perform the actions on your behalf.
        outputs:
          - Inspect_Playbook_Results.Result
        mainSteps:
          - name: Gather_ELB_Statistics
            action: aws:executeScript
            description: Gather ELB Statistics
            outputs:
              - Name: Result
                Selector: $.Payload.Result
                Type: String
            inputs:
              Runtime: python3.6
              Handler: handler
              InputPayload:
                Resourceslist: '{{Resources}}'
              Script: |-
                import json
                import re
                from datetime import datetime,timedelta
                import boto3
                import os

                # Split a listener ARN into its parts; index 5 holds
                # "listener/<mode>/<name>/<load balancer id>/<listener id>".
                def arn_deconstruct(arn):
                    arnlist = arn.split(":")

                    service=arnlist[2]
                    region=arnlist[3]
                    accountid=arnlist[4]
                    resources = arnlist[5].split("/")
                    servicetype = resources[0]
                    servicemode = resources[1]
                    resourcename = resources[2]
                    resourceid = resources[3]

                    return {
                        "Service": service,
                        "Region": region,
                        "AccountId": accountid,
                        "Type": servicetype,
                        "Mode" : servicemode,
                        "Name" : resourcename,
                        "Id" : resourceid
                    }


                def get_related_metrics(elb):
                    cwclient = boto3.client('cloudwatch', region_name = elb['Region'] )
                    if elb['Mode'] == 'app':
                        response = cwclient.list_metrics(
                            Namespace='AWS/ApplicationELB',
                            Dimensions=[
                                {
                                    'Name':'LoadBalancer',
                                    'Value': '{}/{}/{}'.format(elb['Mode'],elb['Name'],elb['Id'])
                                }
                            ]
                        )
                        return(response['Metrics'])


                # Average the per-minute datapoints of one metric over the last 60 minutes.
                def get_stat(elb,metricname,stat):
                    cwclient = boto3.client('cloudwatch', region_name = elb['Region'] )

                    if elb['Mode'] == 'app':
                        response = cwclient.get_metric_statistics(
                            Namespace='AWS/ApplicationELB',
                            MetricName=metricname,
                            StartTime=datetime.now() - timedelta(minutes=60),
                            EndTime=datetime.now(),
                            Period=60,
                            Dimensions=[
                                {
                                    'Name':'LoadBalancer',
                                    'Value': '{}/{}/{}'.format(elb['Mode'],elb['Name'],elb['Id'])
                                }
                            ],
                            Statistics=[stat]
                        )

                    x = []
                    result = {}
                    if len(response['Datapoints']) > 0:
                        for i in response['Datapoints']:
                            x.append(i[stat])
                        result['OverallValue'] = cal_average(x)
                    else:
                        result['OverallValue'] = None
                    result['Statistics'] = stat
                    result['TimeWindow'] = 60
                    return(result)

                def find_elb_resource(res):
                    result = None
                    r = json.loads(res['Resourceslist'])
                    for i in r:
                        if i['Type'] == 'AWS::ElasticLoadBalancingV2::Listener':
                            result = i['PhysicalResourceId']
                    return result

                def cal_average(num):
                    sum_num = 0
                    for t in num:
                        sum_num = sum_num + t

                    avg = sum_num / len(num)
                    return avg

                def myconverter(o):
                    if isinstance(o, datetime):
                        return o.__str__()

                def handler(event, context):

                    arn = find_elb_resource(event)
                    result = {}

                    if arn is not None:
                        elb = arn_deconstruct(arn)

                        metricslist = get_related_metrics(elb)
                        result['TargetResponseTime'] = get_stat(elb,'TargetResponseTime','Average')
                        result['Target2XXCount'] = get_stat(elb,'HTTPCode_Target_2XX_Count','Sum')
                        # Query the 3XX metric here; the original repeated the 2XX metric name.
                        result['Target3XXCount'] = get_stat(elb,'HTTPCode_Target_3XX_Count','Sum')
                        result['Target4XXCount'] = get_stat(elb,'HTTPCode_Target_4XX_Count','Sum')
                        result['Target5XXCount'] = get_stat(elb,'HTTPCode_Target_5XX_Count','Sum')
                        result['TargetConnectionErrorCount'] = get_stat(elb,'TargetConnectionErrorCount','Sum')
                        result['UnHealthyHostCount'] = get_stat(elb,'UnHealthyHostCount','Average')
                        result['ActiveConnectionCount'] = get_stat(elb,'ActiveConnectionCount','Sum')
                        result['ELB3XXCount'] = get_stat(elb,'HTTPCode_ELB_3XX_Count','Sum')
                        result['ELB4XXCount'] = get_stat(elb,'HTTPCode_ELB_4XX_Count','Sum')
                        result['ELB5XXCount'] = get_stat(elb,'HTTPCode_ELB_5XX_Count','Sum')
                        result['ELB500Count'] = get_stat(elb,'HTTPCode_ELB_500_Count','Sum')
                        result['ELB502Count'] = get_stat(elb,'HTTPCode_ELB_502_Count','Sum')
                        result['ELB503Count'] = get_stat(elb,'HTTPCode_ELB_503_Count','Sum')
                        result['ELB504Count'] = get_stat(elb,'HTTPCode_ELB_504_Count','Sum')

                    serialized_result = json.dumps(result, default = myconverter )
                    result['Result'] = json.dumps(json.loads(serialized_result))

                    return result
          - name: Gather_RDS_Config
            action: aws:executeScript
            description: Gather RDS Configurations
            outputs:
              - Name: Result
                Selector: $.Payload.Result
                Type: String
            inputs:
              Runtime: python3.6
              Handler: handler
              InputPayload:
                Resourceslist: '{{Resources}}'
              Script: |-
                import json
                import re
                from datetime import datetime,timedelta
                import boto3
                import os

                def arn_deconstruct(arn):
                    arnlist = arn.split(":")

                    service=arnlist[2]
                    region=arnlist[3]
                    accountid=arnlist[4]
                    resources = arnlist[5].split("/")
                    servicetype = resources[0]
                    servicemode = resources[1]
                    resourcename = resources[2]
                    resourceid = resources[3]

                    return {
                        "Service": service,
                        "Region": region,
                        "AccountId": accountid,
                        "Type": servicetype,
                        "Mode" : servicemode,
                        "Name" : resourcename,
                        "Id" : resourceid
                    }



                def get_rds_config(rdsname):
                    rdsclient = boto3.client('rds')

                    res = rdsclient.describe_db_instances(
                        DBInstanceIdentifier=rdsname
                    )
                    result = res['DBInstances'][0]

                    return(result)

                def get_rds_parameters(rdsparamgroups):
                    result = []
                    rdsclient = boto3.client('rds')

                    for i in rdsparamgroups:
                        name = i['DBParameterGroupName']
                        res = rdsclient.describe_db_parameters(
                            DBParameterGroupName=name
                        )
                        x = {
                            'DBParamGroup' : i,
                            'Parameters' : res['Parameters']
                        }
                        result.append(x)

                    return result


                def find_rds_resource(res):
                    result = None
                    r = json.loads(res['Resourceslist'])
                    for i in r:
                        if i['Type'] == 'AWS::RDS::DBInstance':
                            result = i['PhysicalResourceId']
                    return result

                def cal_average(num):
                    sum_num = 0
                    for t in num:
                        sum_num = sum_num + t

                    avg = sum_num / len(num)
                    return avg

                def myconverter(o):
                    if isinstance(o, datetime):
                        return o.__str__()

                def handler(event, context):
                    param = None
                    result = {}

                    rdsrsname = find_rds_resource(event)
                    rdsconfig = get_rds_config(rdsrsname)

                    if len(rdsconfig['DBParameterGroups']) > 0:
                        param = get_rds_parameters(rdsconfig['DBParameterGroups'])

                    result['Result'] = json.dumps({
                        'config' : json.loads(json.dumps(rdsconfig,default = myconverter)),
                        'parameters' : param
                    } );

                    return result
          - name: Gather_RDS_Statistics
            action: aws:executeScript
            description: Gather RDS Statistics
            outputs:
              - Name: Result
                Selector: $.Payload.Result
                Type: String
            inputs:
              Runtime: python3.6
              Handler: handler
              InputPayload:
                Resourceslist: '{{Resources}}'
              Script: |-
                import json
                import re
                from datetime import datetime,timedelta
                import boto3
                import os

                def arn_deconstruct(arn):
                    arnlist = arn.split(":")

                    service=arnlist[2]
                    region=arnlist[3]
                    accountid=arnlist[4]
                    resources = arnlist[5].split("/")
                    servicetype = resources[0]
                    servicemode = resources[1]
                    resourcename = resources[2]
                    resourceid = resources[3]

                    return {
                        "Service": service,
                        "Region": region,
                        "AccountId": accountid,
                        "Type": servicetype,
                        "Mode" : servicemode,
                        "Name" : resourcename,
                        "Id" : resourceid
                    }


                def get_related_metrics(rdsname):
312 | cwclient = boto3.client('cloudwatch') 313 | response = cwclient.list_metrics( 314 | Namespace='AWS/RDS', 315 | Dimensions=[ 316 | { 317 | 'Name':'DBInstanceIdentifier', 318 | 'Value': rdsname 319 | } 320 | ] 321 | ) 322 | return(response['Metrics']) 323 | 324 | 325 | def get_stat(rdsname,metricname,stat): 326 | cwclient = boto3.client('cloudwatch') 327 | 328 | response = cwclient.get_metric_statistics( 329 | Namespace='AWS/RDS', 330 | MetricName=metricname, 331 | StartTime=datetime.now() - timedelta(minutes=60), 332 | EndTime=datetime.now(), 333 | Period=60, 334 | Dimensions=[ 335 | { 336 | 'Name':'DBInstanceIdentifier', 337 | 'Value': rdsname 338 | } 339 | ], 340 | Statistics=[stat] 341 | ) 342 | 343 | x = [] 344 | result = {} 345 | if len(response['Datapoints']) > 0: 346 | for i in response['Datapoints']: 347 | x.append(i[stat]) 348 | result['OverallValue'] = cal_average(x) 349 | else: 350 | result['OverallValue'] = None 351 | result['Statistics'] = stat 352 | result['TimeWindow'] = 60 353 | return(result) 354 | 355 | 356 | def find_rds_resource(res): 357 | result = None 358 | r = json.loads(res['Resourceslist']) 359 | for i in r: 360 | if i['Type'] == 'AWS::RDS::DBInstance': 361 | result = i['PhysicalResourceId'] 362 | return result 363 | 364 | def cal_average(num): 365 | sum_num = 0 366 | for t in num: 367 | sum_num = sum_num + t 368 | 369 | avg = sum_num / len(num) 370 | return avg 371 | 372 | def myconverter(o): 373 | if isinstance(o, datetime): 374 | return o.__str__() 375 | 376 | def handler(event, context): 377 | 378 | rdsrsname = find_rds_resource(event) 379 | metrics = get_related_metrics(rdsrsname) 380 | result = {} 381 | output = {} 382 | 383 | result['BinLogDiskUsage'] = get_stat(rdsrsname,'BinLogDiskUsage','Sum') 384 | result['BurstBalance'] = get_stat(rdsrsname,'BurstBalance','Average') 385 | result['CPUUtilization'] = get_stat(rdsrsname,'CPUUtilization','Average') 386 | result['CPUCreditUsage'] = get_stat(rdsrsname,'CPUCreditUsage','Sum') 387 
| result['CPUCreditBalance'] = get_stat(rdsrsname,'CPUCreditBalance','Maximum') 388 | result['DatabaseConnections'] = get_stat(rdsrsname,'DatabaseConnections','Sum') 389 | result['DiskQueueDepth'] = get_stat(rdsrsname,'DiskQueueDepth','Maximum') 390 | result['FailedSQLServerAgentJobsCount'] = get_stat(rdsrsname,'FailedSQLServerAgentJobsCount','Average') 391 | result['FreeableMemory'] = get_stat(rdsrsname,'FreeableMemory','Maximum') 392 | result['MaximumUsedTransactionIDs'] = get_stat(rdsrsname,'MaximumUsedTransactionIDs','Maximum') 393 | result['NetworkReceiveThroughput'] = get_stat(rdsrsname,'NetworkReceiveThroughput','Average') 394 | result['NetworkTransmitThroughput'] = get_stat(rdsrsname,'NetworkTransmitThroughput','Average') 395 | result['OldestReplicationSlotLag'] = get_stat(rdsrsname,'OldestReplicationSlotLag','Maximum') 396 | result['ReadIOPS'] = get_stat(rdsrsname,'ReadIOPS','Average') 397 | result['ReadLatency'] = get_stat(rdsrsname,'ReadLatency','Average') 398 | result['ReadThroughput'] = get_stat(rdsrsname,'ReadThroughput','Average') 399 | result['ReplicaLag'] = get_stat(rdsrsname,'ReplicaLag','Average') 400 | result['ReplicationSlotDiskUsage'] = get_stat(rdsrsname,'ReplicationSlotDiskUsage','Maximum') 401 | result['SwapUsage'] = get_stat(rdsrsname,'SwapUsage','Maximum') 402 | result['TransactionLogsDiskUsage'] = get_stat(rdsrsname,'TransactionLogsDiskUsage','Maximum') 403 | result['TransactionLogsGeneration'] = get_stat(rdsrsname,'TransactionLogsGeneration','Average') 404 | result['ReplicationSlotDiskUsage'] = get_stat(rdsrsname,'ReplicationSlotDiskUsage','Maximum') 405 | result['WriteIOPS'] = get_stat(rdsrsname,'WriteIOPS','Average') 406 | result['WriteLatency'] = get_stat(rdsrsname,'WriteLatency','Average') 407 | result['WriteThroughput'] = get_stat(rdsrsname,'WriteThroughput','Average') 408 | output['Result'] = json.dumps(result) 409 | 410 | return output 411 | - name: Gather_ECS_Statistics 412 | action: aws:executeScript 413 | description: Gather 
ECS Service CloudWatch metrics 414 | outputs: 415 | - Name: Result 416 | Selector: $.Payload.Result 417 | Type: String 418 | inputs: 419 | Runtime: python3.6 420 | Handler: handler 421 | InputPayload: 422 | Resourceslist: '{{Resources}}' 423 | Script: |- 424 | import json 425 | import re 426 | from datetime import datetime,timedelta 427 | import boto3 428 | import os 429 | 430 | def arn_deconstruct(arn): 431 | arnlist = arn.split(":") 432 | 433 | service=arnlist[2] 434 | region=arnlist[3] 435 | accountid=arnlist[4] 436 | resources = arnlist[5].split("/") 437 | servicetype = resources[0] 438 | clustername = resources[1] 439 | servicename = resources[2] 440 | 441 | return { 442 | "Service": service, 443 | "Region": region, 444 | "AccountId": accountid, 445 | "Type": servicetype, 446 | "ClusterName" : clustername, 447 | "ServiceName" : servicename 448 | } 449 | 450 | 451 | def get_related_metrics(res): 452 | cwclient = boto3.client('cloudwatch', region_name = res['Region'] ) 453 | 454 | response = cwclient.list_metrics( 455 | Namespace='AWS/ECS', 456 | Dimensions=[ 457 | { 458 | 'Name':'ServiceName', 459 | 'Value': res['ServiceName'] 460 | }, 461 | { 462 | 'Name':'ClusterName', 463 | 'Value': res['ClusterName'] 464 | } 465 | ] 466 | ) 467 | return(response['Metrics']) 468 | 469 | 470 | def get_stat(res,metricname,stat): 471 | cwclient = boto3.client('cloudwatch', region_name = res['Region'] ) 472 | 473 | response = cwclient.get_metric_statistics( 474 | Namespace='AWS/ECS', 475 | MetricName=metricname, 476 | StartTime=datetime.now() - timedelta(minutes=6), 477 | EndTime=datetime.now(), 478 | Period=1, 479 | Dimensions=[ 480 | { 481 | 'Name':'ServiceName', 482 | 'Value': res['ServiceName'] 483 | }, 484 | { 485 | 'Name':'ClusterName', 486 | 'Value': res['ClusterName'] 487 | } 488 | ], 489 | Statistics=[stat] 490 | ) 491 | 492 | x = [] 493 | result = {} 494 | if len(response['Datapoints']) > 0: 495 | for i in response['Datapoints']: 496 | x.append(i[stat]) 497 | 
result['OverallValue'] = cal_average(x) 498 | else: 499 | result['OverallValue'] = None 500 | result['Statistics'] = stat 501 | result['TimeWindow'] = 60 502 | # result['Datapoints'] = response['Datapoints'] 503 | return(result) 504 | 505 | 506 | def find_ecsservice_resource(res): 507 | result = None 508 | r = json.loads(res['Resourceslist']) 509 | for i in r: 510 | if i['Type'] == 'AWS::ECS::Service': 511 | result = i['PhysicalResourceId'] 512 | return result 513 | 514 | def cal_average(num): 515 | sum_num = 0 516 | for t in num: 517 | sum_num = sum_num + t 518 | 519 | avg = sum_num / len(num) 520 | return avg 521 | 522 | def myconverter(o): 523 | if isinstance(o, datetime): 524 | return o.__str__() 525 | 526 | def handler(event, context): 527 | 528 | arn = find_ecsservice_resource(event) 529 | result = {} 530 | 531 | if arn is not None: 532 | ecsservice = arn_deconstruct(arn) 533 | result = {} 534 | output = {} 535 | result['CPUUtilization'] = get_stat(ecsservice,'CPUUtilization','Maximum') 536 | result['MemoryUtilization'] = get_stat(ecsservice,'MemoryUtilization','Maximum') 537 | serialized_result = json.dumps(result,default = myconverter ) 538 | result = json.loads(serialized_result) 539 | output['Result']=json.dumps(result) 540 | 541 | result = output 542 | 543 | return result 544 | - name: Gather_ECS_Error_Logs 545 | action: aws:executeScript 546 | description: Search and gather error in ECS logs 547 | outputs: 548 | - Name: Result 549 | Selector: $.Payload.Result 550 | Type: String 551 | inputs: 552 | Runtime: python3.6 553 | Handler: handler 554 | InputPayload: 555 | Resourceslist: '{{Resources}}' 556 | Script: |- 557 | import json 558 | import re 559 | from datetime import datetime,timedelta 560 | import boto3 561 | import os 562 | import time 563 | 564 | def arn_deconstruct(arn): 565 | arnlist = arn.split(":") 566 | 567 | service=arnlist[2] 568 | region=arnlist[3] 569 | accountid=arnlist[4] 570 | resources = arnlist[5].split("/") 571 | servicetype = 
resources[0] 572 | servicemode = resources[1] 573 | resourcename = resources[2] 574 | 575 | return { 576 | "Service": service, 577 | "Region": region, 578 | "AccountId": accountid, 579 | "Type": servicetype, 580 | "Mode" : servicemode, 581 | "Name" : resourcename 582 | } 583 | 584 | 585 | def find_ecs_resource(res): 586 | result = {} 587 | 588 | r = json.loads(res['Resourceslist']) 589 | for i in r: 590 | if i['Type'] == 'AWS::ECS::Cluster': 591 | result['ECSCluster'] = i['PhysicalResourceId'] 592 | if i['Type'] == 'AWS::ECS::Service': 593 | result['ECSService'] = i['PhysicalResourceId'] 594 | 595 | return result 596 | 597 | def find_ecs_logs(ecsclsname,ecssvcname,region): 598 | result = [] 599 | 600 | ecsclient = boto3.client('ecs', region_name = region ) 601 | ecssvcres = ecsclient.describe_services( 602 | cluster=ecsclsname, 603 | services=[ ecssvcname ] 604 | ) 605 | 606 | if len(ecssvcres['services']) > 0: 607 | taskdef = ecssvcres['services'][0]['taskDefinition'] 608 | taskdefres = ecsclient.describe_task_definition( 609 | taskDefinition=taskdef 610 | ) 611 | 612 | contdef = taskdefres['taskDefinition']['containerDefinitions'] 613 | 614 | for i in contdef: 615 | result.append(i['logConfiguration']) 616 | 617 | return result 618 | 619 | 620 | def find_error_in_logs(loglist): 621 | result = [] 622 | loggroups = [] 623 | logsclient = boto3.client('logs') 624 | 625 | for i in loglist: 626 | options = i['options'] 627 | if 'awslogs-group' in options: 628 | loggroups.append(options['awslogs-group']) 629 | now = int(datetime.now().timestamp()) 630 | 631 | res = logsclient.start_query( 632 | logGroupNames=loggroups, 633 | startTime = now - 3000, 634 | endTime = now, 635 | queryString = "fields @message | filter @message like \"Error:\" | limit 5" 636 | ) 637 | 638 | response = None 639 | while response == None or response['status'] == 'Running': 640 | time.sleep(1) 641 | response = logsclient.get_query_results( 642 | queryId= res['queryId'] 643 | ) 644 | 645 | if 
'results' in response: 646 | if len(response['results']) > 0: 647 | for i in response['results']: 648 | for x in i: 649 | if x['field'] == '@ptr': 650 | pointer = x['value'] 651 | recdetail = logsclient.get_log_record( 652 | logRecordPointer=pointer 653 | ) 654 | 655 | result.append(recdetail['logRecord']) 656 | 657 | return result 658 | 659 | 660 | def cal_average(num): 661 | sum_num = 0 662 | for t in num: 663 | sum_num = sum_num + t 664 | 665 | avg = sum_num / len(num) 666 | return avg 667 | 668 | def myconverter(o): 669 | if isinstance(o, datetime): 670 | return o.__str__() 671 | 672 | def handler(event, context): 673 | result = {} 674 | x = [] 675 | res = find_ecs_resource(event) 676 | ecssvc = arn_deconstruct(res['ECSService']) 677 | loglist = find_ecs_logs(res['ECSCluster'],ecssvc['Name'],ecssvc['Region']) 678 | 679 | x = find_error_in_logs(loglist) 680 | 681 | if len(x) > 0: 682 | result['Result'] = json.dumps(x) 683 | else: 684 | result['Result'] = "None" 685 | 686 | return result 687 | - name: Gather_ECS_Config 688 | action: aws:executeScript 689 | description: Gather ECS Configurations 690 | outputs: 691 | - Name: Result 692 | Selector: $.Payload.Result 693 | Type: String 694 | inputs: 695 | Runtime: python3.6 696 | Handler: handler 697 | InputPayload: 698 | Resourceslist: '{{Resources}}' 699 | Script: |- 700 | import json 701 | import re 702 | from datetime import datetime,timedelta 703 | import boto3 704 | import os 705 | 706 | 707 | 708 | def arn_deconstruct(arn): 709 | arnlist = arn.split(":") 710 | 711 | service=arnlist[2] 712 | region=arnlist[3] 713 | accountid=arnlist[4] 714 | resources = arnlist[5].split("/") 715 | servicetype = resources[0] 716 | clustername = resources[1] 717 | servicename = resources[2] 718 | 719 | return { 720 | "Service": service, 721 | "Region": region, 722 | "AccountId": accountid, 723 | "Type": servicetype, 724 | "ClusterName" : clustername, 725 | "ServiceName" : servicename 726 | } 727 | 728 | def 
get_ecs_service_config(res): 729 | ecsclient = boto3.client('ecs') 730 | 731 | response = ecsclient.describe_services( 732 | cluster= res['ClusterName'], 733 | services=[ res['ServiceName'] ] 734 | ) 735 | 736 | if len(response['services']) > 0: 737 | result = response['services'][0] 738 | 739 | return(result) 740 | 741 | def get_scaling_policy(res): 742 | result = [] 743 | aaclient = boto3.client('application-autoscaling') 744 | 745 | response = aaclient.describe_scaling_policies( 746 | ServiceNamespace = 'ecs', 747 | ResourceId = 'service/{}/{}'.format(res['ClusterName'],res['ServiceName']) 748 | ) 749 | 750 | if len(response['ScalingPolicies']) > 0: 751 | result = response['ScalingPolicies'] 752 | 753 | return(result) 754 | 755 | def find_ecsservice_resource(res): 756 | result = None 757 | r = json.loads(res['Resourceslist']) 758 | for i in r: 759 | if i['Type'] == 'AWS::ECS::Service': 760 | result = i['PhysicalResourceId'] 761 | return result 762 | 763 | def cal_average(num): 764 | sum_num = 0 765 | for t in num: 766 | sum_num = sum_num + t 767 | 768 | avg = sum_num / len(num) 769 | return avg 770 | 771 | def myconverter(o): 772 | if isinstance(o, datetime): 773 | return o.__str__() 774 | 775 | def handler(event, context): 776 | 777 | 778 | arn = find_ecsservice_resource(event) 779 | ecsres = arn_deconstruct(arn) 780 | result = {} 781 | output = {} 782 | 783 | if ecsres is not None: 784 | ecssvccfg = json.dumps(get_ecs_service_config(ecsres),default = myconverter ) 785 | 786 | result = json.loads(ecssvccfg) 787 | result['scalingpolicies'] = json.loads(json.dumps( get_scaling_policy(ecsres),default = myconverter )) 788 | 789 | output['Result'] = json.dumps(result,default = myconverter ) 790 | return output 791 | - name: Inspect_Playbook_Results 792 | action: aws:executeScript 793 | description: Inspect Results 794 | outputs: 795 | - Name: Result 796 | Selector: $.Payload.Result 797 | Type: String 798 | inputs: 799 | Runtime: python3.6 800 | Handler: handler 801 
| InputPayload: 802 | ELBStatistics: '{{Gather_ELB_Statistics.Result}}' 803 | RDSConfig: '{{Gather_RDS_Config.Result}}' 804 | RDSStatistics: '{{Gather_RDS_Statistics.Result}}' 805 | ECSStatistics: '{{Gather_ECS_Statistics.Result}}' 806 | ECSErrorLogs: '{{Gather_ECS_Error_Logs.Result}}' 807 | ECSConfig: '{{Gather_ECS_Config.Result}}' 808 | Script: |- 809 | import json 810 | import re 811 | from datetime import datetime,timedelta 812 | import boto3 813 | import os 814 | 815 | def inspect_elb_stats(elbstat): 816 | 817 | result = {} 818 | stat = json.loads(elbstat) 819 | 820 | #Benchmark Max Values 821 | TargetResponseTime = 5 822 | TargetConnectionErrorCount = 0 823 | UnHealthyHostCount = 0 824 | ELB5XXCount = 0 825 | ELB500Count = 0 826 | ELB502Count = 0 827 | ELB503Count = 0 828 | ELB504Count = 0 829 | Target4XXCount = 0 830 | Target5XXCount = 0 831 | 832 | if stat['TargetResponseTime']['OverallValue'] is not None and stat['TargetResponseTime']['OverallValue'] > TargetResponseTime: 833 | result['TargetResponseTime'] = stat['TargetResponseTime']['OverallValue'] 834 | 835 | if stat['TargetConnectionErrorCount']['OverallValue'] is not None and stat['TargetConnectionErrorCount']['OverallValue'] > TargetConnectionErrorCount: 836 | result['TargetConnectionErrorCount'] = stat['TargetConnectionErrorCount']['OverallValue'] 837 | 838 | if stat['UnHealthyHostCount']['OverallValue'] is not None and stat['UnHealthyHostCount']['OverallValue'] > UnHealthyHostCount : 839 | result['UnHealthyHostCount'] = stat['UnHealthyHostCount']['OverallValue'] 840 | 841 | if stat['ELB5XXCount']['OverallValue'] is not None and stat['ELB5XXCount']['OverallValue'] > ELB5XXCount : 842 | result['ELB5XXCount'] = stat['ELB5XXCount']['OverallValue'] 843 | 844 | if stat['ELB500Count']['OverallValue'] is not None and stat['ELB500Count']['OverallValue'] > ELB500Count : 845 | result['ELB500Count'] = stat['ELB500Count']['OverallValue'] 846 | 847 | if stat['ELB502Count']['OverallValue'] is not None and 
stat['ELB502Count']['OverallValue'] > ELB502Count: 848 | result['ELB502Count'] = stat['ELB502Count']['OverallValue'] 849 | 850 | if stat['ELB503Count']['OverallValue'] is not None and stat['ELB503Count']['OverallValue'] > ELB503Count: 851 | result['ELB503Count'] = stat['ELB503Count']['OverallValue'] 852 | 853 | if stat['ELB504Count']['OverallValue'] is not None and stat['ELB504Count']['OverallValue'] > ELB504Count: 854 | result['ELB504Count'] = stat['ELB504Count']['OverallValue'] 855 | 856 | if stat['Target4XXCount']['OverallValue'] is not None and stat['Target4XXCount']['OverallValue'] > Target4XXCount : 857 | result['Target4XXCount'] = stat['Target4XXCount']['OverallValue'] 858 | 859 | if stat['Target5XXCount']['OverallValue'] is not None and stat['Target5XXCount']['OverallValue'] > Target5XXCount : 860 | result['Target5XXCount'] = stat['Target5XXCount']['OverallValue'] 861 | 862 | return result 863 | 864 | def inspect_rds_stats(): 865 | #Benchmark Values 866 | DatabaseConnections = 150 867 | 868 | 869 | def inspect_ecs_logs(ecslogs): 870 | #Benchmark Max Values 871 | Count = 0 872 | 873 | result = [] 874 | print(ecslogs) 875 | 876 | if ecslogs is not None: 877 | stat = json.loads(ecslogs) 878 | if len(stat) > 0 : 879 | result = stat 880 | 881 | return result 882 | 883 | 884 | def inspect_ecs_stats(ecstat): 885 | 886 | result = {} 887 | stat = json.loads(ecstat) 888 | 889 | #Benchmark Max Values 890 | CPUUtilization = 80 891 | 892 | if stat['CPUUtilization']['OverallValue'] is not None and stat['CPUUtilization']['OverallValue'] > CPUUtilization: 893 | result['CPUUtilization'] = stat['CPUUtilization']['OverallValue'] 894 | 895 | return result 896 | 897 | def inspect_ecs_config(ecsconf): 898 | 899 | result = {} 900 | conf = json.loads(ecsconf) 901 | 902 | if 'runningCount' in conf: 903 | result['TaskRunningCount'] = conf['runningCount'] 904 | 905 | if 'desiredCount' in conf: 906 | result['TaskDesiredCount'] = conf['desiredCount'] 907 | 908 | if 'pendingCount' in 
conf: 909 | result['TaskPendingCount'] = conf['pendingCount'] 910 | 911 | if 'launchType' in conf: 912 | result['LaunchType'] = conf['launchType'] 913 | 914 | 915 | return result 916 | 917 | def myconverter(o): 918 | if isinstance(o, datetime): 919 | return o.__str__() 920 | 921 | def handler(event, context): 922 | 923 | result = {} 924 | output = {} 925 | 926 | elbstat = event['ELBStatistics'] 927 | output['ELB'] = inspect_elb_stats(elbstat) 928 | 929 | ecsstat = event['ECSStatistics'] 930 | ecslogs = event['ECSErrorLogs'] 931 | ecsconf = event['ECSConfig'] 932 | 933 | output['ECS'] = inspect_ecs_stats(ecsstat) 934 | output['ECS']['CurrentConfig'] =inspect_ecs_config(ecsconf) 935 | 936 | 937 | if ecslogs != "None": 938 | output['ECS']['Logs'] = inspect_ecs_logs(ecslogs) 939 | 940 | x = json.dumps(output, default = myconverter ) 941 | 942 | result['Result'] = x 943 | 944 | 945 | return result -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Achieving Operational Excellence Using Automated Playbook and Runbook 2 | 3 | ℹ️ You will run this lab in your own AWS account. Please follow the directions at the end of the lab to remove resources and avoid future costs. 4 | 5 | ## Introduction 6 | 7 | This lab is derived directly from one of the Operational Excellence labs, [Automating operations with Playbooks and Runbooks](https://wellarchitectedlabs.com/operational-excellence/200_labs/), in the AWS Well-Architected Labs. 8 | 9 | Manually running your [runbooks](https://wa.aws.amazon.com/wat.concept.runbook.en.html) and [playbooks](https://wa.aws.amazon.com/wat.concept.playbook.en.html) for operational activities has a number of drawbacks: 10 | 11 | * Activities are prone to errors and difficult to trace. 12 | * Manual activities do not allow your operational practice to scale in line with your business requirements. 
13 | 14 | In contrast, implementing automation in these activities has the following benefits: 15 | 16 | * Improved reliability by preventing the introduction of errors through manual processes. 17 | * Increased scalability by allowing non-linear resource investment to operate your workload. 18 | * Increased traceability of your operations through log collection of the automation activity. 19 | * Improved incident response by reducing idle time and automatically triggering activity based on known events. 20 | 21 |
22 | Click here if you would like to know what runbooks and playbooks are 23 | 24 | 25 | At a glance, both **runbooks** and **playbooks** appear to be similar documents that technical users can use to perform operational activities. However, there is an essential difference between them: 26 | 27 | * A [playbook](https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.playbook.en.html) documents processes that guide you through activities to investigate an issue. For example, gathering applicable information, identifying potential sources of failure, isolating faults, or determining the root cause of issues. Playbooks can follow multiple paths and yield more than one outcome. 28 | 29 | * A [runbook](https://wa.aws.amazon.com/wat.concept.runbook.en.html) contains the procedures necessary to achieve a specific outcome. For example, creating a user, rolling back configuration, or scaling resources to resolve the identified issue. 30 | 31 |
32 | 33 | This hands-on lab will guide you through the steps to automate your operational activities using runbooks and playbooks built with AWS tools. 34 | 35 | We will show how you can build automated runbooks and playbooks to investigate and remediate application issues using the following AWS services: 36 | 37 | * [Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html) 38 | * [Simple Notification Service](https://aws.amazon.com/sns/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc) 39 | * [Amazon CloudWatch synthetic monitoring](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 40 | 41 | ## Prerequisites: 42 | 43 | * An [AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html) that you are able to use for testing. The account should not be used for production purposes. 44 | * An [IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) in your AWS account with full access to [CloudFormation](https://aws.amazon.com/cloudformation/), [Amazon ECS](https://aws.amazon.com/ecs/), [Amazon RDS](https://aws.amazon.com/rds/), [Amazon Virtual Private Cloud (VPC)](https://aws.amazon.com/vpc/), [AWS Identity and Access Management (IAM)](https://aws.amazon.com/iam/), and [AWS Cloud9](https://aws.amazon.com/cloud9/). 45 | 46 | ## Costs 47 | 48 | NOTE: If you complete this lab, you will be billed for any applicable AWS resources that are not covered by the [AWS Free Tier](https://aws.amazon.com/free/). 49 | 50 | This lab walks you through creating a CI/CD workflow for serverless applications. 51 | ## Content 52 | 53 | - [Step 1. Deploy the sample application environment](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#step-1-Deploy-the-sample-application-environment) 54 | - [Step 2. 
Simulate an Application Issue](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#step-2-Simulate-an-Application-Issue) 55 | - [Step 3. Build and Run an Investigative Playbook](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#step-3-Build-and-Run-an-Investigative-Playbook) 56 | - [Step 4. Build and Run Remediation Runbook](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#step-4-Build-and-Run-Remediation-Runbook) 57 | - [Teardown](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#Teardown) 58 | - [Summary](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#Summary) 59 | 60 | ### Step 1. Deploy the sample application environment 61 | In this section, you will prepare a sample application. The application is an API hosted inside a Docker container, using [Amazon Elastic Container Service (ECS)](https://aws.amazon.com/ecs/). The container is accessed via an [Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html). 62 | 63 | The API is a private microservice within your [Amazon Virtual Private Cloud (VPC)](https://aws.amazon.com/vpc/). Communication to the API can only be done privately through routes within the VPC subnet. In our lab example, the business owner has agreed to run the API over the HTTP protocol to simplify the implementation. 64 | 65 | The API has two actions available, which encrypt and decrypt information. Each action is triggered by a REST POST call to the */encrypt* or */decrypt* method, as appropriate. 66 | 67 | * The *encrypt* action allows you to pass a secret message along with a 'Name' key as the identifier, and it returns a 'Secret Key Id' that you can use later to decrypt your message. 
68 | * The *decrypt* action allows you to decrypt the secret message by passing along the 'Name' key and the 'Secret Key Id' you obtained before. 69 | 70 | Both actions will make a write and read call to the application database hosted in [Amazon Relational Database Service (RDS)](https://aws.amazon.com/rds/), where the encrypted messages are stored. 71 | 72 | The following step-by-step instructions will provision the application that you will use with your **runbooks** and **playbooks**. 73 | 74 | Explore the contents of the CloudFormation template to learn more about the environment and application. 75 | 76 | You will use this sample application as a sandbox to simulate an application performance issue, then start your **runbooks** and **playbooks** to investigate and remediate it autonomously. 77 | 78 | #### Action items in this section: 79 | 80 | 1. You will prepare the [Cloud9](https://aws.amazon.com/cloud9/) workspace launched with a new VPC. 81 | 2. You will run the application build script from the Cloud9 console to build the sample application as shown in the diagram below. 82 | 83 | ![Section1 App Arch](/Images/section2-base-application.png) 84 | 85 | 86 | ### 1.0 Prepare Cloud9 workspace. 87 | 88 | In this first step you will provision a [CloudFormation](https://aws.amazon.com/cloudformation/) stack that builds a Cloud9 workspace along with the VPC for the sample application. This Cloud9 workspace will be used to run the provisioning script of the sample application. You can choose to deploy the stack in one of the regions below. 89 | 90 | 1. Click on the link below to deploy the stack. This will take you to the CloudFormation console in your account. Use `walab-ops-base-resources` as the stack name, and take the default values for all options. 
91 | 92 | * **us-west-2** : [here](https://console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/create/review?stackName=walab-ops-base-resources&templateURL=https://aws-well-architected-labs-singapore.s3.ap-southeast-1.amazonaws.com/Operations/200_Automating_operations_with_playbooks_and_runbooks/base_resources.yml) 93 | * **ap-southeast-2** : [here](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-2#/stacks/create/review?stackName=walab-ops-base-resources&templateURL=https://aws-well-architected-labs-singapore.s3.ap-southeast-1.amazonaws.com/Operations/200_Automating_operations_with_playbooks_and_runbooks/base_resources.yml) 94 | * **ap-southeast-1** : [here](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-1#/stacks/create/review?stackName=walab-ops-base-resources&templateURL=https://aws-well-architected-labs-singapore.s3.ap-southeast-1.amazonaws.com/Operations/200_Automating_operations_with_playbooks_and_runbooks/base_resources.yml) 95 | 96 | 2. Once the template is deployed, wait until the CloudFormation stack reaches the **CREATE_COMPLETE** state. 97 | 98 | ![Section1 ](/Images/section2-base-resources-create-complete.png) 99 | 100 | 101 | ### 1.1 Run the build application script. 102 | 103 | Next, run the build script to build and deploy your application environment from the Cloud9 workspace as follows: 104 | 105 | 1. From the main console, access the **Cloud9** service. 106 | 2. Click the **Environments** section on the left menu, locate the environment named `WellArchitectedOps-walab-ops-base-resources` as shown below, then click **Open**. 107 | 108 | ![Section 2 Cloud9 IDE Welcome Screen](/Images/section2-environment-open-ide.png) 109 | 110 | 3. Your environment will bootstrap the lab repository. 
You should see terminal output like the following: 111 | 112 | ![Section 2](/Images/section2-base-bootstrap.png) 113 | 114 | When the bootstrap script finishes you will see a folder called `aws-well-architected-labs`. 115 | 116 | 4. In the IDE terminal console, change directory to the working folder where the build script is located: 117 | 118 | ``` 119 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/scripts/ 120 | ``` 121 | 122 | 5. Copy and paste the command below, replacing `sysops@domain.com` with the email address of the system operators team and `owner@domain.com` with the email address of the business owner. The application sends its notifications to these addresses. 123 | 124 | 125 | ``` 126 | bash build_application.sh walab-ops-base-resources sysops@domain.com owner@domain.com 127 | ``` 128 | 129 | 130 | > The `build_application.sh` script will build and deploy your sample application, along with the architecture that hosts it. 131 | The application architecture will have capabilities to notify systems operators and owners, leveraging [Amazon Simple Notification Service](https://aws.amazon.com/sns/). 132 | You can use the same email address for `sysops@domain.com` and `owner@domain.com` if you need to, but ensure that you have both values specified. 133 | 134 | If you have deployed Amazon ECS before in your account, you may encounter an InvalidInput error with the message "AWSServiceRoleForECS has been taken" while running the build_application.sh script. You can safely ignore this message, as the script will continue despite the error. 135 | 136 | 6. The above command runs the build and provisioning of the application stack. The script should take about 20 minutes to finish. 
137 | 138 | ![Section 2 Cloud9 IDE Welcome Screen](/Images/section2-base-app-build.png) 139 | 140 | > The `build_application.sh` script will build the application Docker image and push it to [Amazon ECR](https://aws.amazon.com/ecr/), where it is used by [Amazon ECS](https://aws.amazon.com/ecs/). Once the build script completes, another CloudFormation stack containing the application resources (ECS, RDS, ALB, and others) will be deployed. 141 | 142 | 7. In the CloudFormation console, you should see a new stack being deployed called `walab-ops-sample-application`. Wait until the stack reaches the **CREATE_COMPLETE** state and proceed to the next step. 143 | 144 | ![Section 2 CreateComplete](/Images/section2-base-app-create-complete.png) 145 | 146 | ### 1.2. Confirm the application status. 147 | 148 | Once the application is successfully deployed, go to your [CloudFormation console](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-2) and locate the stack named `walab-ops-sample-application`. 149 | 150 | 1. Confirm that the stack is in the **CREATE_COMPLETE** state. 151 | 2. Record the following output details, as they will be required later: 152 | 3. Take note of the DNS value specified under **OutputApplicationEndpoint** in the Outputs tab. 153 | 154 | The screenshot below shows the output from the CloudFormation stack: 155 | 156 | ![Section2 DNS Output](/Images/section2-dns-outputs.png) 157 | 158 | 4. Check for an email sent to the system operator and owner addresses you specified in the `build_application.sh` command. These addresses should also be visible in the CloudFormation stack parameters **SystemOpsNotificationEmail** and **SystemOwnerNotificationEmail**. 159 | 160 | 5. Click `confirm subscription` in each email to subscribe. 161 | 162 | ![Section2 Email Confirm](/Images/section2-email-confirm.png) 163 | 164 | > Two emails will be sent to your address. Make sure you confirm the subscription in **both** of them. 165 | 166 | ### 1.3. Test the application.
167 | 168 | In this section, you will test the encrypt API action of the deployed application. 169 | 170 | The application takes a JSON payload with `Name` as the identifier and `Text` as the secret message. 171 | 172 | The application encrypts the value under the `Text` key with a designated KMS key and stores the encrypted text in the RDS database with `Name` as the primary key. 173 | 174 | > **Note:** For simplicity, the sample application re-uses the same KMS key for each record generated. 175 | 176 |
177 | Click here to test 178 | 179 | 1. In the **Cloud9** terminal, run the command below, replacing the `ApplicationEndpoint` with the **OutputApplicationEndpoint** from previous step. This command will run [curl](https://curl.se/) to send a POST request with the secret message payload `{"Name":"Bob","Text":"Run your operations as code"}` to the API. 180 | 181 | ``` 182 | ALBEndpoint="ApplicationEndpoint" 183 | ``` 184 | 185 | ``` 186 | curl --header "Content-Type: application/json" --request POST --data '{"Name":"Bob","Text":"Run your operations as code"}' $ALBEndpoint/encrypt 187 | ``` 188 | 189 | 2. Once you run this command, you should see output as follows: 190 | 191 | ``` 192 | {"Message":"Data encrypted and stored, keep your key save","Key":"EncryptKey"} 193 | ``` 194 | 195 | 3. Take note of the encrypt key value under **Key** . 196 | 197 | 4. Run the command below, pasting the encrypt key you took note of previously under the **Key** section to test the decrypt API. 198 | 199 | 200 | ``` 201 | curl --header "Content-Type: application/json" --request GET --data '{"Name":"Bob","Key":"EncryptKey"}' $ALBEndpoint/decrypt 202 | 203 | ``` 204 | 205 | 5. Once you run the command you should see the following output: 206 | 207 | ``` 208 | {"Text":"Run your operations as code"} 209 | ``` 210 |
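The two curl calls above can also be expressed in Python. The following is a minimal sketch using only the standard library; the helper names (`build_encrypt_request`, `build_decrypt_request`, `send`) are illustrative and not part of the lab code, and it assumes the endpoint is reachable over plain HTTP, as in the curl commands.

```python
import json
import urllib.request


def _url(endpoint, path):
    # The stack output is a bare DNS name, so assume plain HTTP when no scheme is given.
    if not endpoint.startswith(("http://", "https://")):
        endpoint = "http://" + endpoint
    return endpoint.rstrip("/") + path


def build_encrypt_request(endpoint, name, text):
    # POST /encrypt with the Name/Text payload described above.
    body = json.dumps({"Name": name, "Text": text}).encode("utf-8")
    return urllib.request.Request(
        _url(endpoint, "/encrypt"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_decrypt_request(endpoint, name, key):
    # Mirrors the curl call above: a GET request that carries a JSON body.
    body = json.dumps({"Name": name, "Key": key}).encode("utf-8")
    return urllib.request.Request(
        _url(endpoint, "/decrypt"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


def send(request, timeout=10):
    # Send the request and decode the JSON response returned by the API.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To try it, call `send(build_encrypt_request(endpoint, "Bob", "Run your operations as code"))`, then pass the returned `Key` value to `build_decrypt_request`, where `endpoint` is your **OutputApplicationEndpoint** value.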
211 | 212 | ## Congratulations! 213 | 214 | You have now completed the first section of the Lab. 215 | 216 | You should have a sample application API which we will use for the remainder of the lab. 217 | 218 | ### Step 2. Simulate an Application Issue 219 | Understanding the health of your workload is an essential component of Operational Excellence. Defining metrics and thresholds, together with appropriate alerts, will ensure that issues can be acknowledged and remediated within an appropriate timeframe. 220 | 221 | In this section of the lab, you will simulate a performance issue within the API. Using Amazon CloudWatch Synthetics, a canary monitor continuously checks the API response time to detect an issue. 222 | 223 | In this example, should the API take longer than 6 seconds to respond, an alert will be created, triggering a notification email. 224 | 225 | #### Action items in this section: 226 | 227 | 1. You will run a script that will send a large amount of traffic to the API. 228 | 2. You will observe and confirm the issue through AWS monitoring tools. 229 | 230 | The following resources have been deployed to perform these actions. 231 | 232 | ![Section3 Base Architecture](/Images/section3-testing-canary-alarm-architecture.png) 233 | 234 | ### 2.0 Sending traffic to the application 235 | 236 | In this section, you will send multiple concurrent requests to the application, simulating a large surge of incoming traffic. This will overwhelm the API and gradually increase the response time of the application. As a result, the canary monitor exceeds the set threshold, triggering the CloudWatch alarm to send a notification. 237 | 238 | Follow the steps below to continue: 239 | 240 | 1.
From the **Cloud9** terminal, run the command shown below to change directory to the working script folder: 241 | 242 | ``` 243 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/scripts/ 244 | ``` 245 | 246 | 2. Confirm that the folder contains `test.json` and that the file contains the following text: 247 | 248 | ``` 249 | {"Name":"Test User","Text":"This Message is a Test!"} 250 | ``` 251 | 252 | 3. Go to the CloudFormation console and take note of the **OutputApplicationEndpoint** value under the Outputs tab of the `walab-ops-sample-application` stack. This is the DNS endpoint of the Application Load Balancer. 253 | 254 | 255 | ![Section3 Success Screenshot](/Images/section3-stackoutput.png) 256 | 257 | 258 | 4. Make sure you have already tested the application in section **1.3**, so that the `ALBEndpoint` variable is set in your terminal. Then execute the command below: 259 | 260 | ``` 261 | bash simulate_request.sh $ALBEndpoint 262 | ``` 263 | 264 | This script uses [Apache Benchmark](https://httpd.apache.org/docs/2.4/programs/ab.html) to send 60,000,000 requests, 3000 concurrent requests at a time. 265 | 266 | When you run the command you will see the output gradually change from consistently successful 200 responses to include 504 time-out responses. 267 | 268 | The requests generated by the script overwhelm the application API and result in occasional timeouts at your load balancer. 269 | 270 | Keep the command running in the background as you proceed through the lab. 271 | 272 | ![Section3 Success Screenshot](/Images/section3-success-traffic-requests.png) 273 | 274 | ![Section3 Failure Screenshot](/Images/section3-failure-traffic-requests.png) 275 | 276 | 277 | ### 2.1 Observing the alarm being triggered. 278 | 279 | 1. After approximately 6 minutes, an alarm is triggered in response to the generated traffic, and an email is sent indicating that the CloudWatch alarm has been triggered.
280 | 281 | ![Section3 Email](/Images/section3-email.png) 282 | 283 | 2. Check and confirm the alarm by going to the CloudWatch console. 284 | 285 | 3. Click on the Alarms section on the left menu. 286 | 287 | 4. Click on the alarm called `mysecretword-canary-duration-alarm`, which should be in an alarm state. 288 | 289 | ![Section3 Failure Screenshot](/Images/section3-alarm.png) 290 | 291 | 5. Click on the alarm to display the CloudWatch metrics that the alarm is based on. 292 | 293 | 6. The alarm is based on the `Duration` metric data emitted by the `mysecretword-canary` CloudWatch Synthetics canary monitor. The `Duration` metric measures how long it takes for the canary requests to receive a response from the application. 294 | 295 | 7. The alarm is triggered whenever the value of the `Duration` metric is above 6 seconds within a 1-minute period. The latest threshold will be 5000 for 3 datapoints within 6 minutes. 296 | 297 | ![Section3 Failure Screenshot](/Images/section3-alarm-detail.png) 298 | 299 | 8. On the left menu click on **Application monitoring** and **Synthetics Canaries** and locate the canary monitor named `mysecretword-canary`. 300 | 301 | ![Section3 Canary](/Images/section3-canary.png) 302 | 303 | 9. Click on the canary and then select the **Configuration** tab. 304 | 305 | 10. From here you will see the canary configuration and a snippet of the canary script. 306 | 307 | 11. In the canary script section, scroll down to the section that contains `let requestOptionStep1` as shown in the screenshot below. This is the configuration that controls the destination of the request (hostname, path and payload body). 308 | 309 | ![Section3 Canary](/Images/section3-canary-detail.png) 310 | 311 | 12. Click on the **Monitoring** tab. 312 | 313 | 13. From here you will see the visualization of the metrics that the canary monitor generates. 314 | 315 | 14. Locate the `Duration` metric that is being used to trigger the CloudWatch alarm. 316 | 317 | 15.
You will see the average duration of the canary requests, representing the time each request takes to complete. A value above 6000ms signifies that the request has taken more than 6 seconds to receive a response from the application, indicating a performance issue in the API. 318 | 319 | ![Section3 Canary](/Images/section3-canary-monitor.png) 320 | 321 | You have now completed the second section of the lab. 322 | 323 | You should still have `simulate_request.sh` running in the background, simulating a large influx of traffic to your API. This causes the application to respond slowly and time out periodically. The CloudWatch alarm will be triggered and performance issue notifications sent to your System Operator to prompt them into action. 324 | 325 | > This concludes **Section 2** of this lab. In the next section of the lab we will build an automated **playbook** to assist investigation of the issue. 326 | 327 | ### Step 3. Build and Run an Investigative Playbook 328 | The efficiency of issue resolution within an Operations team is directly linked to their tenure and experience. Where an Operator has prior knowledge of a particular issue, they will have a head start in reaching resolution, because they understand the logs and metrics that were used in previous situations. Whilst this constitutes value to an Operations group, it also represents a single point of failure and a scalability challenge. 329 | 330 | This is where [playbooks](https://wa.aws.amazon.com/wat.concept.playbook.en.html) become important. Playbooks are a documented set of predefined steps, which are run to identify an issue. The result of each step can be used either to call more steps to run, or alternatively to trigger manual intervention. 331 | 332 | Automating **playbook** activities wherever possible is critical to reducing the time to respond to an incident.
333 | 334 | The AWS Cloud offers multiple services you can use to build an automated playbook, one of which is AWS Systems Manager. 335 | 336 | AWS Systems Manager offers an automation document capability (known within Systems Manager as [runbooks](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html)), 337 | which allows for the creation of a series of executable steps to orchestrate your investigation and remediation. AWS Systems Manager Automation Documents allow a user to run custom scripts, call AWS service APIs, or even run remote commands on cloud or on-premises compute instances. 338 | 339 | In this section, you will focus on creating an automated **playbook** to assist your investigation as a Systems Operator. 340 | 341 | #### Action items in this section: 342 | 343 | 1. You will build a **playbook** to gather information about the workload and query the relevant metrics and logs. 344 | 2. You will run the automation document to investigate your issue. 345 | 346 | ### 3.0 Prepare Automation Document IAM Role 347 | 348 | The Systems Manager Automation Document you are building needs permissions to run the investigation and remediation steps. You will need to create the IAM role that the document assumes to perform the **playbook** activities. To simplify the deployment process, a CloudFormation template has been provided that you can deploy via the console or AWS CLI. Please choose one of the two following deployment steps: 349 | 350 |
351 | Click here for CloudFormation Console deployment step 352 | 353 | 1. Download the template [here.](/Code/templates/automation_role.yml "Resources template") 354 | 2. Follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template. 355 | 3. Use `waopslab-automation-role` as the **Stack Name**, as this is referenced by other stacks later in the lab. 356 | 357 |
358 | 359 |
360 | Click here for CloudFormation CLI deployment step (Preferred way) 361 | 362 | **Note:** To deploy from the command line, ensure that you have installed and configured AWS CLI with the appropriate credentials. 363 | 364 | 1. From the **Cloud9** terminal change to the appropriate folder as shown: 365 | 366 | ``` 367 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates 368 | ``` 369 | 370 | 2. Then run the command listed below: 371 | 372 | ``` 373 | aws cloudformation create-stack --stack-name waopslab-automation-role \ 374 | --capabilities CAPABILITY_NAMED_IAM \ 375 | --template-body file://automation_role.yml 376 | ``` 377 | 378 | 3. Confirm that the stack has installed correctly. You can do this by running the **describe-stacks** command: 379 | 380 | ``` 381 | aws cloudformation describe-stacks --stack-name waopslab-automation-role 382 | ``` 383 | 384 | Locate the **StackStatus** and confirm it is set to **CREATE_COMPLETE** 385 |
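Rather than re-running **describe-stacks** by hand, the status check can be scripted. The following is a minimal Python sketch under stated assumptions: `stack_status` and `wait_for_create_complete` are illustrative names that are not part of the lab code, and the `describe` argument is expected to behave like the boto3 CloudFormation client's `describe_stacks` method.

```python
import time

# Statuses after which a stack creation will never reach CREATE_COMPLETE.
TERMINAL_FAILURES = {"CREATE_FAILED", "ROLLBACK_COMPLETE", "ROLLBACK_FAILED"}


def stack_status(response):
    """Extract StackStatus from a describe-stacks style response dictionary."""
    return response["Stacks"][0]["StackStatus"]


def wait_for_create_complete(describe, stack_name, delay=15, attempts=40):
    """Poll `describe` until the stack is created, fails, or attempts run out."""
    for _ in range(attempts):
        status = stack_status(describe(StackName=stack_name))
        if status == "CREATE_COMPLETE":
            return status
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"Stack {stack_name} ended in {status}")
        time.sleep(delay)
    raise TimeoutError(f"Stack {stack_name} did not complete in time")
```

With boto3 installed and credentials configured, this could be called as `wait_for_create_complete(boto3.client("cloudformation").describe_stacks, "waopslab-automation-role")`. The AWS CLI also offers `aws cloudformation wait stack-create-complete --stack-name waopslab-automation-role`, which blocks until the stack is created.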
386 | 387 | 1. Once you have deployed the CloudFormation stack above, go to the IAM Console. 388 | 389 | 2. On the side menu, click on **Roles** and locate the IAM role named **AutomationRole**. 390 | 391 | 3. Take note of the ARN of the role, as we will need it later in the lab. 392 | 393 | 394 | ![Section3 ](/Images/section3-automationrole.png) 395 | 396 | ### 3.1 Building the "Gather-Resources" Playbook. 397 | 398 | In preparation for the investigation, you need to know all services and resources associated with the issue. The email notification does not contain any information about these resources. To gather this necessary information, we will build a **playbook** to acquire all related resources using our CloudWatch alarm ARN as a reference. 399 | 400 | Codifying your **playbook** with AWS Systems Manager allows for maximum code reusability. This reduces the overhead of re-writing code that has identical objectives. 401 | 402 | ![Section4 ](/Images/section4-architecture-graphics1.png) 403 | 404 | 405 | > **Note:** Follow these steps to build and run the playbook. Select a guide to deploy using either the AWS console, the AWS CLI or via a CloudFormation template deployment. 406 | 407 |
408 | Click here for CloudFormation Console deployment step 409 | 410 | Download the template [here.](/Code/templates/playbook_gather_resources.yml "Resources template") 411 | 412 | 413 | If you decide to deploy the stack from the console, ensure that you follow the requirements and steps below: 414 | 415 | 1. Follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template. 416 | 2. Use `waopslab-playbook-gather-resources` as the **Stack Name**, as this is referenced by other stacks later in the lab. 417 | 418 |
419 | 420 |
421 | Click here for CloudFormation CLI deployment step (Preferred way) 422 | 423 | **Note:** To deploy from the command line, ensure that you have installed and configured AWS CLI with the appropriate credentials. 424 | 425 | 1. From the **Cloud9** terminal, run the command below to change to the templates folder: 426 | 427 | ``` 428 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates 429 | ``` 430 | 431 | 2. Then run the command below, replacing `AutomationRoleArn` with the ARN of the **AutomationRole** you noted in step **3.0**. 432 | 433 | ``` 434 | aws cloudformation create-stack --stack-name waopslab-playbook-gather-resources \ 435 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \ 436 | --template-body file://playbook_gather_resources.yml 437 | ``` 438 | 439 | Example: 440 | 441 | 442 | ``` 443 | aws cloudformation create-stack --stack-name waopslab-playbook-gather-resources \ 444 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/AutomationRole \ 445 | --template-body file://playbook_gather_resources.yml 446 | ``` 447 | 448 | **Note:** If you use named profiles with the AWS CLI, adjust the command accordingly. 449 | 450 | 3. Confirm that the stack has been created successfully. You can do this by running the **describe-stacks** command below, then locating the **StackStatus** and confirming it is set to **CREATE_COMPLETE**. 451 | 452 | ``` 453 | aws cloudformation describe-stacks --stack-name waopslab-playbook-gather-resources 454 | ``` 455 | 456 |
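A common slip in this step is leaving the literal `AutomationRoleArn` placeholder in the command. The following is a minimal Python sketch of a guard against that: `playbook_stack_arguments` is an illustrative helper, not part of the lab code, and the ARN pattern is a simplified assumption rather than a full validation of IAM role ARNs.

```python
import re

# Simplified pattern for an IAM role ARN (an assumption, not an exhaustive check).
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


def playbook_stack_arguments(stack_name, template_body, role_arn):
    """Build keyword arguments for a CloudFormation create-stack call."""
    if not ROLE_ARN_PATTERN.match(role_arn):
        raise ValueError(f"Not an IAM role ARN: {role_arn!r}")
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            # PlaybookIAMRole is the parameter key used in the CLI command above.
            {"ParameterKey": "PlaybookIAMRole", "ParameterValue": role_arn},
        ],
    }
```

The returned dictionary matches the keyword arguments of the boto3 CloudFormation client's `create_stack` method, so it could be passed as `client.create_stack(**arguments)` after reading the template file into `template_body`.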
457 | 458 | 459 |
460 | Click here for Console step-by-step 461 | 462 | 1. Go to the AWS Systems Manager console. Click **Documents** under **Shared Resources** on the left menu. Then click **Create Automation** as shown in the screenshot below: 463 | 464 | ![Section4 ](/Images/section4-create-automation.png) 465 | 466 | 2. Enter `Playbook-Gather-Resources` in the **Name** field and copy the notes shown below into the **Document description** field. 467 | 468 | ``` 469 | # What does this **playbook** do? 470 | 471 | Query the CloudWatch Synthetics Canary and look for all resources related to the application based on its Application Tag. This **playbook** takes an input of the CloudWatch Alarm ARN triggered by the canary. 472 | 473 | Note : Application resources must be deployed using CloudFormation and tagged accordingly. 474 | 475 | ## Actions taken in this playbook. 476 | 1. Describe CloudWatch Alarm ARN and identify the Canary resource. 477 | 2. Describe the Canary resource to gather the value of 'Application' tag 478 | 3. Gather CloudFormation Stack with the same value of 'Application' tag. 479 | 4. List all resources in CloudFormation Stack. 480 | 5. Parse list of resources into String Output. 481 | ``` 482 | 483 | 3. In the **Assume role** field, enter the ARN of the IAM role we created in the previous section **3.0 Prepare Automation Document IAM Role**. 484 | 485 | ![Section4 ](/Images/section4-create-automation-playbook-role.png) 486 | 487 | 488 | 4. Expand the **Input Parameters** section and enter `AlarmARN` as the **Parameter name**. Set the type as `String` and **Required** as `Yes`. This defines a parameter within our playbook, so that the value of the CloudWatch Alarm ARN can be passed into the playbook to run the action. 489 | 490 | ![Section4 ](/Images/section4-create-automation-parameter-input.png) 491 | 492 | 5. Under the **Step 1** section, specify `Gather_Resources_For_Alarm` as the **Step name** and select `aws:executeScript` as the **Action type**. 493 | 494 | 6.
Under **Inputs** set `Python3.6` as the **Runtime** and specify `script_handler` as the **Handler**. 495 | 7. Paste the Python code below into the **Script** section. 496 | 497 | ![Section4 ](/Images/section4-create-automation-addstep.png) 498 |

```
import json
import re
from datetime import datetime
import boto3
import os

def arn_deconstruct(arn):
    # Split an ARN of the form arn:partition:service:region:account-id:type:name
    arnlist = arn.split(":")
    service = arnlist[2]
    region = arnlist[3]
    accountid = arnlist[4]
    servicetype = arnlist[5]
    name = arnlist[6]
    return {
        "Service": service,
        "Region": region,
        "AccountId": accountid,
        "Type": servicetype,
        "Name": name
    }

def locate_alarm_source(alarm):
    # Describe the alarm and identify the resource that emits its metric
    cwclient = boto3.client('cloudwatch', region_name = alarm['Region'] )
    alarm_source = {}
    alarm_detail = cwclient.describe_alarms(AlarmNames=[alarm['Name']])

    if len(alarm_detail['MetricAlarms']) > 0:
        metric_alarm = alarm_detail['MetricAlarms'][0]
        namespace = metric_alarm['Namespace']

        # Condition if the namespace is CloudWatch Synthetics
        if namespace == 'CloudWatchSynthetics':
            if 'Dimensions' in metric_alarm:
                dimensions = metric_alarm['Dimensions']
                for i in dimensions:
                    if i['Name'] == 'CanaryName':
                        source_name = i['Value']
                        alarm_source['Type'] = namespace
                        alarm_source['Name'] = source_name
                        alarm_source['Region'] = alarm['Region']
                        alarm_source['AccountId'] = alarm['AccountId']

    result = alarm_source
    return result

def locate_canary_endpoint(canaryname,region):
    # Read the 'TargetEndpoint' tag from the canary, if present
    result = None
    synclient = boto3.client('synthetics', region_name = region )
    res = synclient.get_canary(Name=canaryname)
    canary = res['Canary']
    if 'Tags' in canary:
        if 'TargetEndpoint' in canary['Tags']:
            target_endpoint = canary['Tags']['TargetEndpoint']
            result = target_endpoint
    return result


def locate_app_tag_value(resource):
    # Read the 'Application' tag from the canary that raised the alarm
    result = None
    if resource['Type'] == 'CloudWatchSynthetics':
        synclient = boto3.client('synthetics', region_name = resource['Region'] )
        res = synclient.get_canary(Name=resource['Name'])
        canary = res['Canary']
        if 'Tags' in canary:
            if 'Application' in canary['Tags']:
                apptag_val = canary['Tags']['Application']
                result = apptag_val
    return result

def locate_app_resources_by_tag(tag,region):
    result = None

    # Search CloudFormation stacks for the tag
    cfnclient = boto3.client('cloudformation', region_name = region )
    # Named 'stacks' rather than 'list' to avoid shadowing the built-in
    stacks = cfnclient.list_stacks(StackStatusFilter=['CREATE_COMPLETE','ROLLBACK_COMPLETE','UPDATE_COMPLETE','UPDATE_ROLLBACK_COMPLETE','IMPORT_COMPLETE','IMPORT_ROLLBACK_COMPLETE'] )
    for stack in stacks['StackSummaries']:
        app_resources_list = []
        stack_name = stack['StackName']
        stack_details = cfnclient.describe_stacks(StackName=stack_name)
        stack_info = stack_details['Stacks'][0]
        if 'Tags' in stack_info:
            for t in stack_info['Tags']:
                if t['Key'] == 'Application' and t['Value'] == tag:
                    app_stack_name = stack_info['StackName']
                    app_resources = cfnclient.describe_stack_resources(StackName=app_stack_name)
                    for resource in app_resources['StackResources']:
                        app_resources_list.append(
                            {
                                'PhysicalResourceId' : resource['PhysicalResourceId'],
                                'Type': resource['ResourceType']
                            }
                        )
                    result = app_resources_list

    return result

def script_handler(event, context):
    result = {}
    arn = event['CloudWatchAlarmARN']
    alarm = arn_deconstruct(arn)

    # Locate tag from CloudWatch Alarm
    alarm_source = locate_alarm_source(alarm)       # Identify alarm source
    tag_value = locate_app_tag_value(alarm_source)  # Identify tag from source

    if alarm_source['Type'] == 'CloudWatchSynthetics':
        endpoint = locate_canary_endpoint(alarm_source['Name'],alarm_source['Region'])
        result['CanaryEndpoint'] = endpoint

    # Locate the CloudFormation stack with the tag and list its resources
    resources = locate_app_resources_by_tag(tag_value,alarm['Region'])
    result['ApplicationStackResources'] = json.dumps(resources)

    return result
```

614 | 615 | 8. Under **Additional inputs** specify the input value to the step, passing in the parameter we created previously. To do this, specify the values below: 616 | 617 | * `InputPayload` as the **Input name** 618 | * `CloudWatchAlarmARN: '{{AlarmARN}}'` as the **Input Value**. 619 | 620 | 9. Under **Outputs** specify the values below: 621 | 622 | * `Resources` as **Name** 623 | * `$.Payload.ApplicationStackResources` as **Selector** 624 | * `String` as **Type** 625 | 626 | 10. Once your settings match the screenshot below, click on **Create Automation** 627 | 628 | ![Section4 ](/Images/section4-create-automation-additionals.png) 629 | 630 |
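The playbook script starts by splitting the alarm ARN on `:` and reading fixed positions. The following is a minimal sketch of a slightly more defensive variant, for illustration only and not the code the lab deploys. It limits the split to seven parts so that a name containing `:` would stay intact, and it rejects strings that are not shaped like an ARN. The account id and alarm name in the usage note are placeholders.

```python
def arn_deconstruct(arn):
    """Split arn:partition:service:region:account-id:type:name into its fields."""
    # maxsplit=6 keeps any further ':' characters inside the resource name.
    parts = arn.split(":", 6)
    if len(parts) != 7 or parts[0] != "arn":
        raise ValueError(f"Unexpected ARN format: {arn!r}")
    _, _partition, service, region, account_id, resource_type, name = parts
    return {
        "Service": service,
        "Region": region,
        "AccountId": account_id,
        "Type": resource_type,
        "Name": name,
    }
```

For example, `arn_deconstruct("arn:aws:cloudwatch:ap-southeast-2:000000000000:alarm:mysecretword-canary-duration-alarm")` yields `Service` `cloudwatch`, `Type` `alarm` and `Name` `mysecretword-canary-duration-alarm`, which is the shape the rest of the script expects.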
631 | 632 | 633 | Once the automation document is created, you can test it. 634 | 635 | 1. You can find the newly created document under the **Owned by me** tab of the **Documents** section in the Systems Manager console. 636 | 637 | ![Section3 ](/Images/section3-playbook-gather-resource-tab.png) 638 | 639 | 2. Click on the **playbook** called `Playbook-Gather-Resources` and click on **Execute Automation** to run your playbook. 640 | 3. Paste in the CloudWatch Alarm ARN (you can find this ARN in the email notification from section **2.1 Observing the alarm being triggered**) and click on **Execute** to test the playbook. 641 | 642 | ![Section3 ](/Images/section3-alarm-email.png) 643 | 644 | 4. Once the **playbook** run has completed successfully, click on the **Step Id** to see the final message and output of the step. You should see output listing all the resources of the application. 645 | 646 | ![Section3 ](/Images/section3-gather-resources-stepid.png) 647 | 648 | 5. **Copy** the Resources list output from the section highlighted in the screenshot below. This list consists of all the resources defined in the CloudFormation stack related to our application. It includes the Elastic Load Balancer, ECS and RDS resource IDs that we can now use to further our investigation of the underlying issue. 649 | 650 | ![Section3 ](/Images/section4-create-automation-playbook-run-output.png) 651 | 652 | 6. **Paste** the output into a temporary location, such as a text editor, for now. You will need this value for the next step. 653 | 654 | ### 3.2 Building the "Investigate-Application-Resources" Playbook. 655 | 656 | In the previous step, you created a **playbook** that finds all related AWS resources in the application. 657 | In this step you will create a **playbook** that interrogates those resources and captures recent metrics and logs, to look for insights and better understand the root cause of the issue.
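The Resources output copied in section 3.1 is a JSON string: a list of objects with `PhysicalResourceId` and `Type` keys, as produced by the `Playbook-Gather-Resources` script. The following is a minimal sketch of how such a list could be grouped by resource type before querying each service. `group_resources_by_type` is an illustrative helper, not the code the investigation playbook deploys, and the type strings in the usage note are standard CloudFormation resource types assumed for the example.

```python
import json
from collections import defaultdict


def group_resources_by_type(resources_json):
    """Group physical resource ids by their CloudFormation resource type."""
    grouped = defaultdict(list)
    # The gather playbook emits "null" when no matching stack is found.
    for resource in json.loads(resources_json) or []:
        grouped[resource["Type"]].append(resource["PhysicalResourceId"])
    return dict(grouped)
```

For instance, a list containing one `AWS::ECS::Service` entry and one `AWS::RDS::DBInstance` entry yields a dictionary with those two types as keys, each mapping to a list with the matching physical resource id.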
658 | 659 | In practice, there are many actions that the **playbook** can take to investigate, depending on the scenario presented by the issue. The purpose of this lab is to showcase how you can use a **playbook** to aid investigation, rather than advise on a specific action path. 660 | 661 | Therefore, in this lab we will assume an example scenario. The **playbook** will look at metrics and logs of the ELB, ECS and RDS services in the resource list. The **playbook** will then highlight the metrics and logs that are considered outside of the normal operational threshold. 662 | 663 | 664 | ![Section3 ](/Images/section4-architecture-graphics2.png) 665 | 666 | Please follow the instructions below to build this playbook: 667 | 668 | > **Note:** We will deploy this **playbook** via a CloudFormation template to simplify deployment. Please follow the steps below to deploy the CloudFormation template via the CLI or the console. 669 | 670 | 671 |
672 | Click here for CloudFormation Console deployment step 673 | 674 | Download the template [here.](/Code/templates/playbook_investigate_application_resources.yml "Resources template") 675 | 676 | 677 | If you decide to deploy the stack from the console, ensure that you follow the requirements and steps below: 678 | 679 | 1. Please follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template. 680 | 2. Use `waopslab-playbook-investigate-resources` as the **Stack Name**, as this is referenced by other stacks later in the lab. 681 | 682 | 683 |
684 | 685 |
686 | Click here for CloudFormation CLI deployment step (Preferred way) 687 | 688 | 689 | 1. From the Cloud9 terminal, change to the required folder as shown: 690 | 691 | ``` 692 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates 693 | ``` 694 | 695 | 2. Run the command below, replacing `AutomationRoleArn` with the ARN of the **AutomationRole** you noted in step **3.0 Prepare Automation Document IAM Role**. 696 | 697 | ``` 698 | aws cloudformation create-stack --stack-name waopslab-playbook-investigate-resources \ 699 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \ 700 | --template-body file://playbook_investigate_application_resources.yml 701 | ``` 702 | Example: 703 | 704 | ``` 705 | aws cloudformation create-stack --stack-name waopslab-playbook-investigate-resources \ 706 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/AutomationRole \ 707 | --template-body file://playbook_investigate_application_resources.yml 708 | ``` 709 | 710 | 3. Confirm that the stack has been created successfully. You can do this by running the **describe-stacks** command as follows: 711 | 712 | ``` 713 | aws cloudformation describe-stacks --stack-name waopslab-playbook-investigate-resources 714 | ``` 715 | 716 | 4. Locate the **StackStatus** and confirm it is set to **CREATE_COMPLETE**. 717 | 718 |
719 | 720 | When the document is created, you can go ahead and run a quick test. 721 | 722 | You can find the newly created document under the **Owned by me** tab of the **Documents** section in the Systems Manager console. 723 | 724 | 1. Click on the **playbook** called `Playbook-Investigate-Application-Resources` and click on **Execute Automation** to run the playbook. 725 | 726 | 2. Under **Resources**, paste in the resources list you copied from the output of the previous **playbook** (refer to section **3.1 Building the "Gather-Resources" Playbook**) and click on **Execute**. 727 | 728 | ![Section3 ](/Images/section3-investigate-resourcelist.png) 729 | 730 | 3. Under **Executed Steps** you should be able to see each of the steps in the **playbook**. If you view the content of the document you will be able to see the code and find out what each step does. 731 | 732 | ![Section3 ](/Images/section3-steps-explain.png) 733 | 734 | For simplicity, we have created a list of the outputs and a description for each step. Expand the list below to view it. 735 | 736 |
Output list

| Step Name | Description | Output list |
|---|---|---|
| Gather_ELB_Statistics | Go through the resource list and locate the ELB. Query data from the ELB CloudWatch metrics, looking at metrics from the last 60 minutes. | TargetResponseTime (Average) |
||| HTTPCode_Target_2XX_Count (Sum) |
||| HTTPCode_Target_3XX_Count (Sum) |
||| HTTPCode_Target_4XX_Count (Sum) |
||| HTTPCode_Target_5XX_Count (Sum) |
||| TargetConnectionErrorCount (Sum) |
||| UnHealthyHostCount (Average) |
||| ActiveConnectionCount (Sum) |
||| HTTPCode_ELB_3XX_Count (Sum) |
||| HTTPCode_ELB_4XX_Count (Sum) |
||| HTTPCode_ELB_5XX_Count (Sum) |
||| HTTPCode_ELB_500_Count (Sum) |
||| HTTPCode_ELB_502_Count (Sum) |
||| HTTPCode_ELB_503_Count (Sum) |
||| HTTPCode_ELB_504_Count (Sum) |
| Gather_RDS_Statistics | Go through the resource list and locate the RDS resource. Query data from the RDS CloudWatch metrics, looking at metrics from the last 60 minutes. | BinLogDiskUsage (Sum) |
||| BurstBalance (Average) |
||| CPUUtilization (Average) |
||| CPUCreditUsage (Sum) |
||| CPUCreditBalance (Maximum) |
||| DatabaseConnections (Sum) |
||| DiskQueueDepth (Maximum) |
||| FailedSQLServerAgentJobsCount (Average) |
||| FreeableMemory (Maximum) |
||| MaximumUsedTransactionIDs (Maximum) |
||| NetworkReceiveThroughput (Average) |
||| OldestReplicationSlotLag (Average) |
||| ReadIOPS (Average) |
||| ReadLatency (Average) |
||| ReadThroughput (Maximum) |
||| ReplicaLag (Average) |
||| ReplicationSlotDiskUsage (Maximum) |
||| SwapUsage (Maximum) |
||| TransactionLogsDiskUsage (Maximum) |
||| TransactionLogsGeneration (Average) |
||| WriteIOPS (Average) |
||| WriteLatency (Average) |
||| WriteThroughput (Average) |
| Gather_ECS_Statistics | Go through the resource list and locate the ECS resource. Query data from the ECS CloudWatch metrics, looking at metrics from the last 6 minutes. | CPUUtilization (Maximum) |
||| MemoryUtilization (Maximum) |
| Gather_ECS_Error_Logs | Go through the resource list and locate the ECS Service. Search in CloudWatch logs for any Error occurrence. ||
| Gather_ECS_Config | Go through the resource list and locate the ECS resource. Describe the ECS service configuration. ||
| Gather_RDS_Config | Go through the resource list and locate the RDS resource. Describe RDS Instance Config & Parameters. ||
| Inspect_Playbook_Results | Go through the output of the steps above, inspect the results and check whether each value is above its threshold. | TargetResponseTime = 5 (ELB) |
||| TargetConnectionErrorCount = 0 (ELB) |
||| UnHealthyHostCount = 0 (ELB) |
||| ELB5XXCount = 0 (ELB) |
||| ELB500Count = 0 (ELB) |
||| ELB502Count = 0 (ELB) |
||| ELB503Count = 0 (ELB) |
||| ELB504Count = 0 (ELB) |
||| Target4XXCount = 0 (ELB) |
||| Target5XXCount = 0 (ELB) |
||| CPUUtilization = 80 (ECS) |

4. Wait until all steps have completed successfully.

### 3.3 Building the "Investigate-Application-From-Alarm" Playbook.

So far we have two separate playbooks. The first gathers the list of resources associated with the application. The second queries those resources and investigates the relevant logs and metrics.

In this step we will automate our **playbooks** further by creating a parent **playbook** that orchestrates the two investigative **playbooks**. We will also add a step that sends a notification to our developers and system owners.

![Section4 ](/Images/section4-architecture-graphics3.png)

Follow the instructions below to build the parent playbook.

> **Note:** Select one of the step-by-step guides below to build the parent playbook using either the AWS console or a CloudFormation template.

Click here for CloudFormation Console deployment step

Download the template [here.](/Code/templates/playbook_investigate_application.yml "Resources template")

If you decide to deploy the stack from the console, follow these steps:

1. Follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template.
2. Use `waopslab-playbook-investigate-application` as the **Stack Name**, as this is referenced by other stacks later in the lab.
3. In the parameter input screen, under **PlaybookIAMRole** enter the ARN of the **playbook** IAM role (defined in the previous step), and under **NotificationEmail** enter the email address that should receive **playbook** notifications.

829 | 830 |
Click here for CloudFormation CLI deployment step (Preferred way)

1. From the Cloud9 terminal, change to the templates folder as shown:

```
cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
```

2. Then run the command below, replacing `AutomationRoleArn` with the ARN of the **playbook** IAM role:

```
aws cloudformation create-stack --stack-name waopslab-playbook-investigate-application \
    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
    --template-body file://playbook_investigate_application.yml
```

Example:

```
aws cloudformation create-stack --stack-name waopslab-playbook-investigate-application \
    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/xxxx-playbook-role \
    --template-body file://playbook_investigate_application.yml
```

**Note:** If you use named profiles with the AWS CLI, adjust the command accordingly.

3. Confirm that the stack has installed correctly. You can do this by running the **describe-stacks** command as follows:

```
aws cloudformation describe-stacks --stack-name waopslab-playbook-investigate-application
```

Locate the **StackStatus** and confirm it is set to **CREATE_COMPLETE**.

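The status check can also be scripted. The following is a minimal sketch, assuming the JSON printed by `aws cloudformation describe-stacks` has been captured as a string. The helper name `stack_status` and the trimmed sample payload are illustrative and not part of the lab.

```python
import json


def stack_status(describe_stacks_json: str, stack_name: str) -> str:
    """Return the StackStatus of stack_name from describe-stacks JSON output."""
    payload = json.loads(describe_stacks_json)
    for stack in payload.get("Stacks", []):
        if stack.get("StackName") == stack_name:
            return stack["StackStatus"]
    raise KeyError(f"stack {stack_name!r} not found in output")


# Hypothetical, trimmed-down sample of the CLI output, for illustration only.
SAMPLE_OUTPUT = json.dumps({
    "Stacks": [
        {
            "StackName": "waopslab-playbook-investigate-application",
            "StackStatus": "CREATE_COMPLETE",
        }
    ]
})

if __name__ == "__main__":
    print(stack_status(SAMPLE_OUTPUT, "waopslab-playbook-investigate-application"))
```

In practice you would feed the function the real CLI output and repeat the check until the status stops reporting `CREATE_IN_PROGRESS`.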
865 | 866 |
Click here for Console step-by-step guide

1. From the AWS Systems Manager console, click **Documents** as shown below. Once you are there, click **Create Automation**.

![Section4 ](/Images/section4-create-automation.png)

2. Next, enter `Playbook-Investigate-Application-From-Alarm` in the **Name** field and paste the notes shown below into the **Description** box. This provides a description of the **playbook**. Systems Manager supports notes written in markdown, so feel free to format them as required.

```
# What does this **playbook** do?

This **playbook** will run **Playbook-Gather-Resources** to gather Application resources monitored by Canary.

Then subsequently run **Playbook-Investigate-Application-Resources** to Investigate the resources for issues.

Outputs of the investigation will be sent to SNS Topic Subscriber

```

3. In the **Assume role** field, enter the ARN of the IAM role we created in the previous step.

4. Under **Input Parameters**, enter `AlarmARN` as the **Parameter name**. Set the type to `String` and **Required** to `Yes`. This defines a parameter in our playbook, which allows the ARN of the CloudWatch Alarm to be passed to the main step that runs the action.

5. Add another parameter by clicking the **Add a parameter** link. Enter `SNSTopicARN` as the **Parameter name**. Set the type to `String` and **Required** to `Yes`. This defines a second parameter in our playbook, so that we can send a notification to the owner and developers.

![Section4 ](/Images/section4-create-automation-parameter-input-2.png)

6. Click **Add Step** and create the first step, with Action type `aws:executeAutomation` and StepName `PlaybookGatherAppResourcesCanaryCloudWatchAlarm`.

7. Specify `Playbook-Gather-Resources` as the **Document name** under Inputs, and under **Additional inputs** specify `RuntimeParameters` with `{"AlarmARN":'{{AlarmARN}}'}` as its value (refer to the screenshot below). This step runs the `Playbook-Gather-Resources` **playbook** which we created previously.

![Section4 ](/Images/section4-create-automation-parameter-input-2-step1.png)

8. Once this step is defined, add another step by clicking **Add Step** at the bottom of the section.

9. For this second step, specify the **Step name** as `PlaybookInvestigateAppResourcesELBECSRDS` and the action type as `aws:executeAutomation`.

10. Specify `Playbook-Investigate-Application-Resources` as the **Document name** and `RuntimeParameters` as `Resources: '{{PlaybookGatherAppResourcesCanaryCloudWatchAlarm.Output}}'`. This takes the output of the first step and passes it to the second **playbook**, which runs the investigation of the associated resources.

![Section4 ](/Images/section4-create-automation-parameter-input-2-step2.png)

11. For the last step, take the investigation output from the second step and send it to the SNS topic where our owner, developers and admins are subscribed.

12. Specify the **Step name** as `AWSPublishSNSNotification` and the action type as `aws:executeAutomation`.

13. Specify `AWS-PublishSNSNotification` as the **Document name** and `RuntimeParameters` as shown below. This takes the output of the second step, which contains the summary data of the investigation, and passes it to `AWS-PublishSNSNotification`, which publishes it to the SNS topic specified in the parameters so that subscribers receive it by email.

```
TopicArn: '{{SNSTopicARN}}'
Message: '{{ PlaybookInvestigateAppResourcesELBECSRDS.Output }}'
```

![Section4 ](/Images/section4-create-automation-parameter-input-2-step3.png)

14. Our **playbook** will run investigative tasks and send the result to an SNS topic that our systems administrators and engineers subscribe to. To do this, we need to create an SNS topic for the **playbook** to send notifications to. Follow the instructions in this [link](https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html) to create a Standard SNS topic, and name it `PlaybookNotificationSNSTopic`.

15. Once you have created the topic, subscribe your email address to it using the instructions [here](https://docs.aws.amazon.com/sns/latest/dg/sns-email-notifications.html).

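For reference, the three steps configured in the console correspond roughly to the Automation document structure sketched below. This is built only from the step names and inputs given in this section and is not the content of the lab's CloudFormation template. `build_parent_playbook` is a hypothetical helper, the role ARN is a placeholder, and the runtime parameters simply mirror the string form used in the console steps.

```python
import json


def build_parent_playbook(assume_role_arn: str) -> dict:
    """Assemble the parent playbook as a schema 0.3 Automation document."""
    gather = "PlaybookGatherAppResourcesCanaryCloudWatchAlarm"
    investigate = "PlaybookInvestigateAppResourcesELBECSRDS"
    return {
        "schemaVersion": "0.3",
        "assumeRole": assume_role_arn,
        "parameters": {
            "AlarmARN": {"type": "String"},
            "SNSTopicARN": {"type": "String"},
        },
        "mainSteps": [
            {   # Step 1: find the resources behind the alarm.
                "name": gather,
                "action": "aws:executeAutomation",
                "inputs": {
                    "DocumentName": "Playbook-Gather-Resources",
                    "RuntimeParameters": {"AlarmARN": "{{AlarmARN}}"},
                },
            },
            {   # Step 2: investigate the resources found in step 1.
                "name": investigate,
                "action": "aws:executeAutomation",
                "inputs": {
                    "DocumentName": "Playbook-Investigate-Application-Resources",
                    "RuntimeParameters": {"Resources": "{{%s.Output}}" % gather},
                },
            },
            {   # Step 3: publish the investigation summary to SNS.
                "name": "AWSPublishSNSNotification",
                "action": "aws:executeAutomation",
                "inputs": {
                    "DocumentName": "AWS-PublishSNSNotification",
                    "RuntimeParameters": {
                        "TopicArn": "{{SNSTopicARN}}",
                        "Message": "{{%s.Output}}" % investigate,
                    },
                },
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN, for illustration only.
    doc = build_parent_playbook("arn:aws:iam::000000000000:role/xxxx-playbook-role")
    print(json.dumps(doc, indent=2))
```

Seeing the document as data makes the chaining explicit: each step reads the `Output` of the step before it.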
### 3.4 Executing investigation Playbook.

You can now run the **playbook** to discover the result of the investigation.

1. Go to the **Output** section of the deployed CloudFormation stack `walab-ops-sample-application` and keep it open. You will need one of its output values in step 3.

2. Go to the Systems Manager Automation document we created in the previous step, `Playbook-Investigate-Application-From-Alarm`.

3. Run the **playbook**, passing the alarm ARN as the **AlarmARN** input value, along with the **SNSTopicARN**.

* You can get the **AlarmARN** from the email that you received from CloudWatch Alarm, as described in step **3.1 Building the "Gather-Resources" Playbook.** in this lab.
* To get the value for **SNSTopicARN**, go to the CloudFormation console output of the `walab-ops-sample-application` stack and copy the value of **OutputSystemEventTopicArn**.

![Section3](/Images/section4-create-automation-playbook-test-run-playbook.png)

4. When the **playbook** has completed, an email will be sent to you containing a summary of the investigation, as shown.

![Section3 ](/Images/section4-create-automation-playbook-test-run-playbook-email-summary.png)

5. Copy the message section and paste it into a JSON linter tool such as [jsonlint.com](http://jsonlint.com) to make the structure easier to read. The result of the **playbook** investigation might vary slightly, but the overall findings should be similar to the screenshot below.

![Section3 ](/Images/section4-create-automation-playbook-test-run-playbook-summary.png)

6. In the generated report you should see a large number of **ELB504Count** errors and a high **TargetResponseTime** from the load balancer. This explains the delay reported by our canary alarm.

If you then look at the ECS summary, you will notice that the **TaskRunningCount** is only 1, with a relatively high average **CPUUtilization**. The script averages the maximum values reported for the ECS service over the last 6 minutes. If you do not see a CPUUtilization value in the JSON, you can confirm it by going to the ECS service console and clicking the **Metrics** tab.

![Section3 ](/Images/section3-create-automation-playbook-test-run-playbook-cpu.png)

Therefore, the immediate cause of the latency is likely to be a resource constraint at the application API level running in ECS. If we increase the number of tasks in the ECS service, the application should be relieved of some of the CPU utilization constraint.

With all of the information provided by our **playbook** findings, we should be able to determine the next course of action to remediate the issue.

This concludes **Section 3** of this lab. Move on to the next section to build the remediation runbook.

### Step 4. Build and Run Remediation Runbook

In contrast to playbooks, **runbooks** are procedures that accomplish specific tasks to achieve an outcome. In the previous section, you identified an issue with CPU utilization, which occurs because there is only 1 ECS task running in the cluster. This could be remediated through the use of auto-scaling.

However, implementing this requires preparation and planning. When an incident occurs, operations teams should have a defined escalation path for the issue. Depending on the criticality of the system, they should also be equipped to do what is necessary to protect system availability while the escalation occurs.

In this section, you will build an automated **runbook** to remediate the CPU utilization issue by increasing the number of tasks in the ECS cluster. Your automated **runbook** will notify the owner of the workload and give them the option to intercept the scale-up action should they choose not to proceed.

#### Action items in this section:

1. You will build a **runbook** to scale up the ECS cluster, with an approval mechanism.
2. You will execute the **runbook** and observe the recovery of your application.

### 4.0 Building the "Approval-Gate" Runbooks.

In this section you will build a reusable **runbook**, which gives the owner the ability to deny or approve remediation actions within a defined waiting period. If the wait time is exceeded and a decision has not been made, the runbook will automatically approve the action, as shown.

![Section5 ](/Images/section5-create-automation-graphics1.png)

We will achieve this through the use of Systems Manager Automation documents, which work together as follows:

1. The `Approval-Gate` **runbook** executes a separate document called `Approval-Timer`.

2. The `Approval-Timer` **runbook** waits for a preconfigured amount of time and then sends an approve signal to the `Approval-Gate` **runbook**.

3. Meanwhile, the `Approval-Gate` **runbook** sends an approval request to the workload owner via a designated SNS topic.

* If the owner chooses to approve, the `Approval-Gate` **runbook** will continue to the next step.
* If the owner declines the approval, the **runbook** will fail, blocking further steps.
* However, if the owner does not respond within the preconfigured wait time, the `Approval-Timer` **runbook** will automatically approve the request.

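The three outcomes listed above can be summarised as a small decision function. This is only a model of the behaviour described in this section, not code that runs inside Systems Manager. The function name `gate_outcome`, its arguments and the `"Approve"`/`"Deny"` labels are hypothetical.

```python
from typing import Optional


def gate_outcome(owner_decision: Optional[str],
                 decided_after_s: Optional[float],
                 timer_s: float) -> str:
    """Return 'Approved' or 'Denied' for one pass through the approval gate.

    owner_decision:  'Approve', 'Deny', or None if the owner never responds.
    decided_after_s: seconds until the owner responded (None if no response).
    timer_s:         wait time after which the timer runbook auto-approves.
    """
    if owner_decision not in ("Approve", "Deny", None):
        raise ValueError(f"unknown decision: {owner_decision!r}")

    responded_in_time = (
        owner_decision is not None
        and decided_after_s is not None
        and decided_after_s < timer_s
    )
    if responded_in_time and owner_decision == "Deny":
        # The gate fails and blocks the steps that follow it.
        return "Denied"
    # Either the owner approved in time, or the timer fired first.
    return "Approved"


if __name__ == "__main__":
    print(gate_outcome("Deny", 30, 300))    # owner denies within the wait time
    print(gate_outcome(None, None, 300))    # nobody responds, timer approves
```

The design choice worth noting is the default: silence is treated as consent, so availability is protected even when the owner cannot be reached.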
995 | 996 | Follow the instructions below to build the runbook: 997 | 998 | > **Note:** Select a step-by-step guide below to build the runbook using either the AWS console or CloudFormation template. 999 | 1000 |
Click here for Console step by step

1. Go to the AWS Systems Manager console. Click **Documents** under **Shared Resources** on the left menu. Then click **Create Automation** as shown in the screenshot below:

![Section5 ](/Images/section5-create-automation.png)

2. Enter `Approval-Timer` in the **Name** field and copy the notes shown below into the **Document description** field.

```
# What does this automation do?

Automatically trigger 'Approval' Signal to an execution, after a timer lapse

## Steps

1. Sleep for X time specified on the parameter input
2. Automatically signal 'Approval' to the Execution specified in parameter input
```

3. In the **Assume role** field, enter the IAM role ARN we created in the previous section **3.0 Prepare Automation Document IAM Role**.

4. Expand the **Input Parameters** section and enter `Timer` as the **Parameter name**. Set the type to `String` and **Required** to `Yes`.

5. Then add another parameter, this time called `AutomationExecutionId`, of type `String`, and set **Required** to `Yes`. Once you are done, your configuration should look like the screenshot below.

![Section4 ](/Images/section4-approve-timer-input-param.png)

6. In the **Step 1** section, specify `SleepTimer` as the **Step name** and select `aws:sleep` as the **Action type**.

7. Expand the **Inputs** section of the step, and specify `{{Timer}}` as the **Duration**.

![Section4 ](/Images/section4-approve-timer-step1.png)

8. Click **Add step**, specify `ApproveExecution` as the **Step name** and select `aws:executeAwsApi` as the **Action type**.

9. Expand the **Inputs** section of the step, and specify `ssm` in the **Service** field and `SendAutomationSignal` in the **API** field.

10. Under **Additional inputs** specify the values below.

* `Approve` as the **SignalType**
* `{{AutomationExecutionId}}` as the **AutomationExecutionId**.

Once you are done, your configuration should look like the screenshots below.

![Section5 ](/Images/section5-create-automation-step2.png)

![Section5 ](/Images/section5-create-automation-step2-input.png)

11. Click **Create automation** once you are done.

Next, you will create the `Approval-Gate` **runbook**, which is responsible for running the `Approval-Timer` **runbook** asynchronously. Follow the steps below to complete the configuration:

1. From the AWS Systems Manager console, select **Documents** under **Shared Resources** on the left menu. Then click **Create Automation** as shown in the screenshot below:

![Section5 ](/Images/section5-create-automation.png)

2. Next, enter `Approval-Gate` in the **Name** field and add the notes shown below to the **Document description** field.

```
# What does this automation do?

Place a gate before your desired step to create approval mechanism.
Automation will trigger an asynchronous timer that will automatically approve once the time has lapsed.
Automation will then send approval / deny request to the designated SNS Topic.
When deny is triggered by approver, the step will fail and block the following step from executing.

Note: Please ensure to have onFailure set to abort in your automation document.

## Steps

1. Trigger an asynchronous timer that will automatically approve once the time has lapsed.
2. Send approval / deny request to the designated SNS Topic.

```

3. In the **Assume role** field, enter the IAM role ARN we created in the previous section **3.0 Prepare Automation Document IAM Role**.

4. Expand the **Input Parameters** section and enter the following:

* `Timer` as the **Parameter name**, set the type to `String` and **Required** to `Yes`.
* `NotificationMessage` as the **Parameter name**, set the type to `String` and **Required** to `Yes`.
* `NotificationTopicArn` as the **Parameter name**, set the type to `String` and **Required** to `Yes`.
* `ApproverRoleArn` as the **Parameter name**, set the type to `String` and **Required** to `Yes`.

5. Expand **Step 1** and create a step named `executeAutoApproveTimer` with action type `aws:executeScript`.

6. Expand **Inputs**, set the **Runtime** to `Python3.6` and paste the code below into the script section. Note that this code snippet executes the `Approval-Timer` **runbook** you created asynchronously.

```
import boto3

def script_handler(event, context):
    # Start the Approval-Timer runbook. The call returns as soon as the
    # execution is started, so the timer runs alongside this runbook.
    client = boto3.client('ssm')
    response = client.start_automation_execution(
        DocumentName='Approval-Timer',
        Parameters={
            'Timer': [ event['Timer'] ],
            'AutomationExecutionId' : [ event['AutomationExecutionId'] ]
        }
    )
    return None
```

7. Expand **Additional Inputs**, select `InputPayload` under **Input Name**, and add the text shown below to **Input Value**:

```
AutomationExecutionId: '{{automation:EXECUTION_ID}}'
Timer: '{{Timer}}'
```

Once you have completed this step, your **Step 1** configuration should look like the screenshot below.

![Section4 ](/Images/section4-create-approval-gate-step1.png)

8. Click **Add step** to create **Step 2**.

9. Create a step named `ApproveOrDeny` with action type `aws:approve`.

10. Expand **Inputs** and specify the values below under **Approvers**, replacing `AutomationRoleArn` with the ARN of the **AutomationRole** you took note of in section **3.0 Prepare Automation Document IAM Role**.

```
[ '{{ApproverRoleArn}}', 'AutomationRoleArn' ]
```

Example:

```
[ '{{ApproverRoleArn}}', 'arn:aws:iam::xxxxx:role/AutomationRole' ]
```

11. Expand **Additional Inputs** and specify the following values:

* `NotificationArn` as the **Input name**, and `{{NotificationTopicArn}}` as the **Input value**
* `Message` as the **Input name**, and `{{NotificationMessage}}` as the **Input value**
* `MinRequiredApprovals` as the **Input name**, and `1` as the **Input value**

12. Expand **Common properties** and change the following properties to the values below (leave the remaining properties as they are):

* `Continue` for **On failure**
* `false` for **Is critical**

Once you have completed this step, your **Step 2** configuration should look like the screenshot below.

![Section4 ](/Images/section4-create-approval-gate-step2.png)

13. Click **Add step** to create **Step 3**.

14. Create a step named `getApprovalStatus` with action type `aws:executeAwsApi`.

15. Expand **Inputs** and specify `ssm` in the **Service** field, and `DescribeAutomationStepExecutions` in the **API** field.

16. Expand **Additional Inputs** and specify the values below:

* `AutomationExecutionId` as the **Input Name**, and `{{automation:EXECUTION_ID}}` as the **Input value**
* `Filters` as the **Input Name**, and copy the values below as the **Input value**

```
- Key: StepName
  Values:
    - ApproveOrDeny
```

17. Expand **Outputs** and specify the values below:

* `approvalStatusVariable` as the **Name**
* `$.StepExecutions[0].Outputs.ApprovalStatus[0]` as the **Selector**
* `String` as the **Type**

Once you have completed this step, your **Step 3** configuration should look like the screenshot below.

![Section4 ](/Images/section4-create-approval-gate-step3.png)

18. Click **Create automation** to complete the configuration.

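The `Timer` parameter ends up as the `Duration` input of the `aws:sleep` step, which takes an ISO 8601 duration string. Before running the runbooks it can help to check what a given value means in seconds. The sketch below is a hypothetical helper, `duration_seconds`, that handles only the day/hour/minute/second subset of the format (for example `PT5M`), which is an assumption about the values you are likely to use here.

```python
import re

# Matches P[nD][T[nH][nM][nS]] only; weeks, months and years are not handled.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_seconds(value: str) -> int:
    """Convert a simple ISO 8601 duration such as 'PT5M' to seconds."""
    match = _DURATION.match(value)
    # Reject strings with no components at all, e.g. 'P', 'PT' or 'P1DT'.
    if not match or value in ("P", "PT") or value.endswith("T"):
        raise ValueError(f"unsupported duration: {value!r}")
    parts = {name: int(num) for name, num in match.groupdict(default="0").items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


if __name__ == "__main__":
    print(duration_seconds("PT5M"))  # five minutes, expressed in seconds
```

A value that raises `ValueError` here is not necessarily invalid for Systems Manager. It only means the value falls outside the subset this sketch understands.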
1178 | 1179 |
Click here for CloudFormation deployment steps

Download the template [here.](/Code/templates/runbook_approval_gate.yml "Resources template")

If you decide to deploy the stack from the console, follow these steps:

1. Follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template.
2. Use `waopslab-runbook-approval-gate` as the **Stack Name**, as this is referenced by other stacks later in the lab.

1190 | 1191 |
Click here for CloudFormation CLI deployment step (Preferred way)

1. From the Cloud9 terminal, change to the templates folder as shown:

```
cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
```

2. Run the command below, replacing `AutomationRoleArn` with the ARN of the **AutomationRole** you took note of in section **3.0 Prepare Automation Document IAM Role**.

```
aws cloudformation create-stack --stack-name waopslab-runbook-approval-gate \
    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
    --template-body file://runbook_approval_gate.yml
```

With your AutomationRole ARN in place, your command will look similar to the following example:

```
aws cloudformation create-stack --stack-name waopslab-runbook-approval-gate \
    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/xxxx-runbook-role \
    --template-body file://runbook_approval_gate.yml
```

3. Confirm that the stack has installed correctly. You can do this by running the **describe-stacks** command below. Locate the **StackStatus** and confirm it is set to **CREATE_COMPLETE**.

```
aws cloudformation describe-stacks --stack-name waopslab-runbook-approval-gate
```

### 4.1 Building the "ECS-Scale-Up" runbook.

![Section5 ](/Images/section5-create-automation-graphics2.png)

Next, you are going to build the ECS-Scale-Up **runbook**, which will complete the following:

1. Run the `Approval-Gate` **runbook** which you created previously.
2. Wait for the `Approval-Gate` **runbook** to complete.
3. Once the `Approval-Gate` **runbook** completes successfully, increase the number of ECS tasks in the cluster.

Follow the steps below to build the runbook.

> **Note:** Select one of the step-by-step guides below to build the runbook using either the AWS console or a CloudFormation template.

Click here for Console step by step

1. Go to the AWS Systems Manager console. Click **Documents** under **Shared Resources** on the left menu. Then click **Create Automation** as shown in the screenshot below.

![Section5 ](/Images/section5-create-automation.png)

2. Next, enter `Runbook-ECS-Scale-Up` in the **Name** field and add the notes shown below to the **Document description** field:

```
# What does this automation do?

Scale the desired task count of a given ECS service up to a specified number, with an approval process.
The automation will trigger the Approval-Gate runbook before executing.

## Steps

1. Trigger Approval-Gate
2. Scale the ECS service to the desired task count
```

3. In the **Assume role** field, enter the IAM role ARN we created in the previous section **3.0 Prepare Automation Document IAM Role**.

4. Expand the **Input Parameters** section and enter the following:

* `ECSDesiredCount` as the **Parameter name**, set the type to `Integer` and **Required** to `Yes`.
* `ECSClusterName` as the **Parameter name**, set the type to `String` and **Required** to `Yes`.
* `ECSServiceName` as the **Parameter name**, set the type to `String` and **Required** to `Yes`.
* `NotificationTopicArn` as the **Parameter name**, set the type to `String` and **Required** to `Yes`.
* `NotificationMessage` as the **Parameter name**, set the type to `String` and **Required** to `Yes`.
* `ApproverRoleArn` as the **Parameter name**, set the type to `String` and **Required** to `Yes`.
* `Timer` as the **Parameter name**, set the type to `String` and **Required** to `Yes`.

5. Expand **Step 1** and create a step named `executeApprovalGate` with action type `aws:executeAutomation`.

6. Expand **Inputs**, then set the **Document name** to `Approval-Gate`.

7. Expand **Additional inputs** and select `RuntimeParameters` as the **Input Name**.

8. Paste in the text below as the **Input Value**:

```
{
    "Timer":'{{Timer}}',
    "NotificationMessage":'{{NotificationMessage}}',
    "NotificationTopicArn":'{{NotificationTopicArn}}',
    "ApproverRoleArn":'{{ApproverRoleArn}}'
}
```

9. Click **Add Step** to create the second step.

10. Specify `updateECSServiceDesiredCount` as the **Step Name** and select `aws:executeAwsApi` as the Action type.

11. Expand **Inputs** and configure the following values:

* `ecs` as **Service**
* `UpdateService` as **Api**

12. Expand **Additional inputs** and configure the following values:

* `forceNewDeployment` as the **Input Name** and `true` as **Input Value**
* `desiredCount` as the **Input Name** and `{{ECSDesiredCount}}` as **Input Value**
* `service` as the **Input Name** and `{{ECSServiceName}}` as **Input Value**
* `cluster` as the **Input Name** and `{{ECSClusterName}}` as **Input Value**

13. Click **Create automation** once complete.

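The `updateECSServiceDesiredCount` step calls the ECS `UpdateService` API with the four inputs configured above. For comparison, the same call made from a script with boto3 might look like the sketch below. The helper names `update_service_kwargs` and `scale_service` are hypothetical, and the example assumes boto3 and valid AWS credentials are available when `scale_service` is actually invoked.

```python
def update_service_kwargs(cluster: str, service: str, desired_count: int) -> dict:
    """Build the UpdateService arguments used by the runbook step."""
    if desired_count < 0:
        raise ValueError("desired_count must not be negative")
    return {
        "cluster": cluster,
        "service": service,
        "desiredCount": desired_count,
        "forceNewDeployment": True,
    }


def scale_service(cluster: str, service: str, desired_count: int) -> dict:
    """Call ECS UpdateService; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without the SDK

    ecs = boto3.client("ecs")
    return ecs.update_service(
        **update_service_kwargs(cluster, service, desired_count)
    )


if __name__ == "__main__":
    # Prints the request only; nothing is sent to AWS here.
    print(update_service_kwargs("mysecretword-cluster", "mysecretword-service", 100))
```

Unlike the runbook, this script has no approval gate in front of it, which is the main reason the lab wraps the call in an Automation document.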
1310 | 1311 |
Click here for CloudFormation Console deployment step

Download the template [here.](/Code/templates/runbook_scale_ecs_service.yml "Resources template")

If you decide to deploy the stack from the console, complete the following steps:

1. Follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template.
2. Use `waopslab-runbook-scale-ecs-service` as the **Stack Name**, as this is referenced by other stacks later in the lab.

1322 | 1323 |
Click here for CloudFormation CLI deployment step (Preferred way)

1. From the Cloud9 terminal, change to the templates folder:

```
cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
```

2. Then run the command below, replacing `AutomationRoleArn` with the ARN of the **AutomationRole** you took note of in section **3.0 Prepare Automation Document IAM Role**.

```
aws cloudformation create-stack --stack-name waopslab-runbook-scale-ecs-service \
    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
    --template-body file://runbook_scale_ecs_service.yml
```

Example:

```
aws cloudformation create-stack --stack-name waopslab-runbook-scale-ecs-service \
    --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/AutomationRole \
    --template-body file://runbook_scale_ecs_service.yml
```

3. Confirm that the stack has installed correctly. You can do this by running the **describe-stacks** command below. Locate the **StackStatus** and confirm it is set to **CREATE_COMPLETE**.

```
aws cloudformation describe-stacks --stack-name waopslab-runbook-scale-ecs-service
```

### 4.2 Executing remediation Runbook.

Now, let's run the **runbook** you created above to remediate the issue.

1. Go to the AWS CloudFormation console.

2. Click on the stack named `walab-ops-sample-application`.

3. Click on the **Outputs** tab and take note of the following output values. You will need these values to execute the runbook.

    * OutputECSCluster
    * OutputECSService
    * OutputSystemOwnersTopicArn

    ![ Section4 ](/Images/section4-output.png)

4. If you are currently using an IAM user or role to log into your AWS Console, take note of its ARN. You will need this ARN when executing the **runbook** to restrict who can approve or deny the request.

    To find your current IAM user ARN, go to the IAM console and click **Users** in the left side menu, then click on your **User** name.
    For an IAM role, go to the IAM console and click **Roles** in the left side menu, then click on the name of the **Role** you are using.

    You will see something similar to the example below. Take note of the ARN value, and proceed to the next step.

    ![ Section4 ](/Images/section4-iam.png)

5. Go to the Systems Manager Automation console, click on **Documents** under **Shared Resources**, then locate and click the automation document called `Runbook-ECS-Scale-Up`.

6. Then click **Execute automation**.

7. Fill in the **Input parameters** with the values below.

    ![ Section4 ](/Images/section4-scale-up.png)

    * For **ECSServiceName**, enter the value of **OutputECSService** you noted in step 3.
    * For **ECSClusterName**, enter the value of **OutputECSCluster** you noted in step 3.
    * For **ApproverArn**, enter the ARN value you noted in step 4.
    * For **ECSDesiredCount**, enter `100` to increase the task count to 100.
    * For **NotificationMessage**, enter any message that can help the approver make an informed decision when approving or denying the requested action.

        For example:
        ```
        Hello, your mysecretword app is experiencing performance degradation. To maintain a quality customer experience, we will manually scale up the supporting cluster. This action will be taken approximately 10 minutes after this message is generated unless you deny the action within that period.
        ```

    * For **NotificationTopicArn**, enter the value of **OutputSystemOwnersTopicArn** you noted in step 3.
    * For **Timer**, you can specify `PT5M` (5 minutes) or any other value in ISO 8601 duration format.

8. Click **Execute** to run the **runbook**.

9. Once the **runbook** is running, you will receive an email with instructions to approve or deny the request at the email address subscribed to the owners SNS topic.
    Follow the link in the email, signed in as the user or role of the **ApproverArn** you entered in the input parameters. The link will take you to the SSM Console, where you can approve or deny the request.

    ![ Section4 ](/Images/section4-approveordeny.png)

    If you approve or ignore the email, the request will be approved automatically once the Timer set in the runbook expires.
    If you deny, the **runbook** will fail and no action will be taken.

10. Once the **runbook** completes, the ECS task count will have increased to the value specified.

11. Go to the ECS console, click on **Clusters**, and select `mysecretword-cluster`.

12. Click on the `mysecretword-service` **Service**, and you will see the number of running tasks increase to 100 and the average CPUUtilization decrease.

    ![ Section4 ](/Images/section4-scale-up2.png)

    ![ Section4 ](/Images/section4-scale-up3.png)

13. Subsequently, you will see the API response time return to normal and the CloudWatch Alarm return to an OK state.

    ![ Section4 ](/Images/section4-normal.png)

    You can check both using your CloudWatch Console, following the steps you ran in section **2.1 Observing the alarm being triggered**.


#### Congratulations!
You have now completed the **Automating operations with Playbooks and Runbooks** lab. Follow the **Teardown** section below to clean up the lab resources.


## Teardown
In this section you will delete all resources related to the lab environment.

1. Run the following command to navigate to the script folder.

    ```
    cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/scripts/
    ```

2. Run the `teardown_resources.sh` script to delete all resources related to the lab.

    ```
    bash teardown_resources.sh
    ```

## Summary
In this lab you learnt how to:
- Build and run automated playbooks to support your investigations
- Build and run automated runbooks to remediate specific faults
- Enable traceability of operations activities in your environment


--------------------------------------------------------------------------------