├── Code
│   ├── scripts
│   │   ├── test.json
│   │   ├── simulate_request.sh
│   │   ├── build_application.sh
│   │   └── teardown_resources.sh
│   ├── .DS_Store
│   ├── src
│   │   ├── package.json
│   │   ├── Dockerfile
│   │   └── app.js
│   └── templates
│       ├── automation_role.yml
│       ├── runbook_scale_ecs_service.yml
│       ├── playbook_investigate_application.yml
│       ├── runbook_approval_gate.yml
│       ├── playbook_gather_resources.yml
│       ├── base_resources.yml
│       ├── base_app.yml
│       └── playbook_investigate_application_resources.yml
├── .DS_Store
├── Images
│   ├── section4-iam.png
│   ├── section3-alarm.png
│   ├── section3-email.png
│   ├── section3-canary.png
│   ├── section4-normal.png
│   ├── section4-output.png
│   ├── section4-scale-up.png
│   ├── section4-scale-up2.png
│   ├── section4-scale-up3.png
│   ├── section2-dns-outputs.png
│   ├── section2-email-confirm.png
│   ├── section3-alarm-detail.png
│   ├── section3-alarm-email.png
│   ├── section3-canary-detail.png
│   ├── section3-stackoutput.png
│   ├── section3-steps-explain.png
│   ├── section4-approveordeny.png
│   ├── section2-base-app-build.png
│   ├── section2-base-bootstrap.png
│   ├── section3-automationrole.png
│   ├── section3-canary-monitor.png
│   ├── section2-base-application.png
│   ├── section2-ecr-repo-confirm.png
│   ├── section4-create-automation.png
│   ├── section5-create-automation.png
│   ├── section2-environment-open-ide.png
│   ├── section4-approve-timer-step1.png
│   ├── section3-gather-resources-stepid.png
│   ├── section4-architecture-graphics1.png
│   ├── section4-architecture-graphics2.png
│   ├── section4-architecture-graphics3.png
│   ├── section5-create-automation-step1.png
│   ├── section5-create-automation-step2.png
│   ├── section2-base-app-create-complete.png
│   ├── section3-failure-traffic-requests.png
│   ├── section3-investigate-resourcelist.png
│   ├── section3-success-traffic-requests.png
│   ├── section4-approve-timer-input-param.png
│   ├── section4-create-automation-addstep.png
│   ├── section3-playbook-gather-resource-tab.png
│   ├── section4-create-approval-gate-step1.png
│   ├── section4-create-approval-gate-step2.png
│   ├── section4-create-approval-gate-step3.png
│   ├── section5-create-automation-graphics1.png
│   ├── section5-create-automation-graphics2.png
│   ├── section2-base-resources-create-complete.png
│   ├── section4-create-automation-additionals.png
│   ├── section5-create-automation-step2-input.png
│   ├── section5-create-automation2-step1-input.png
│   ├── section3-testing-canary-alarm-architecture.png
│   ├── section4-create-automation-parameter-input.png
│   ├── section4-create-automation-playbook-role.png
│   ├── section5-create-automation-parameter-input.png
│   ├── section4-create-automation-parameter-input-2.png
│   ├── section4-create-automation-playbook-execute.png
│   ├── section4-create-automation-playbook-execute2.png
│   ├── section4-create-automation-playbook-owned-by-me.png
│   ├── section4-create-automation-playbook-run-output.png
│   ├── section4-create-automation-playbook-test-email.png
│   ├── section4-create-automation-parameter-input-2-step1.png
│   ├── section4-create-automation-parameter-input-2-step2.png
│   ├── section4-create-automation-parameter-input-2-step3.png
│   ├── section4-create-automation-playbook-execute-output.png
│   ├── section4-create-automation-playbook-test-run-playbook.png
│   ├── section3-create-automation-playbook-test-run-playbook-cpu.png
│   ├── section4-create-automation-playbook-test-execute-playbook.png
│   ├── section4-create-automation-playbook-test-run-playbook-summary.png
│   ├── section4-create-automation-playbook-test-execute-playbook-observe.png
│   ├── section4-create-automation-playbook-test-execute-playbook-summary.png
│   ├── section4-create-automation-playbook-test-run-playbook-email-summary.png
│   └── section4-create-automation-playbook-test-execute-playbook-email-summary.png
├── CODE_OF_CONDUCT.md
├── LICENSE
├── CONTRIBUTING.md
└── README.md
/Code/scripts/test.json:
--------------------------------------------------------------------------------
{"Name":"Test User","Text":"This Message is a Test!"}
--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/.DS_Store
--------------------------------------------------------------------------------
/Code/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Code/.DS_Store
--------------------------------------------------------------------------------
/Images/section4-iam.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-iam.png
--------------------------------------------------------------------------------
/Images/section3-alarm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-alarm.png
--------------------------------------------------------------------------------
/Images/section3-email.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-email.png
--------------------------------------------------------------------------------
/Images/section3-canary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-canary.png
--------------------------------------------------------------------------------
/Images/section4-normal.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-normal.png
--------------------------------------------------------------------------------
/Images/section4-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-output.png
--------------------------------------------------------------------------------
/Images/section4-scale-up.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-scale-up.png
--------------------------------------------------------------------------------
/Images/section4-scale-up2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-scale-up2.png
--------------------------------------------------------------------------------
/Images/section4-scale-up3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-scale-up3.png
--------------------------------------------------------------------------------
/Code/scripts/simulate_request.sh:
--------------------------------------------------------------------------------
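#!/bin/bash
# Continuously load-test the application's /encrypt endpoint with ApacheBench (ab).
# Usage: ./simulate_request.sh <ALB DNS name>
# Run from the scripts directory so that test.json resolves.

# Abort with a usage message if the ALB DNS name is missing.
ALBURL="${1:?Usage: $0 <ALB DNS name>}"

# Loop forever; each run POSTs test.json with 3000 concurrent clients.
while :
do
  ab -p test.json -T application/json -c 3000 -n 60000000 -v 4 "http://${ALBURL}/encrypt"
done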
--------------------------------------------------------------------------------
/Images/section2-dns-outputs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-dns-outputs.png
--------------------------------------------------------------------------------
/Images/section2-email-confirm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-email-confirm.png
--------------------------------------------------------------------------------
/Images/section3-alarm-detail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-alarm-detail.png
--------------------------------------------------------------------------------
/Images/section3-alarm-email.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-alarm-email.png
--------------------------------------------------------------------------------
/Images/section3-canary-detail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-canary-detail.png
--------------------------------------------------------------------------------
/Images/section3-stackoutput.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-stackoutput.png
--------------------------------------------------------------------------------
/Images/section3-steps-explain.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-steps-explain.png
--------------------------------------------------------------------------------
/Images/section4-approveordeny.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-approveordeny.png
--------------------------------------------------------------------------------
/Images/section2-base-app-build.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-base-app-build.png
--------------------------------------------------------------------------------
/Images/section2-base-bootstrap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-base-bootstrap.png
--------------------------------------------------------------------------------
/Images/section3-automationrole.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-automationrole.png
--------------------------------------------------------------------------------
/Images/section3-canary-monitor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-canary-monitor.png
--------------------------------------------------------------------------------
/Images/section2-base-application.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-base-application.png
--------------------------------------------------------------------------------
/Images/section2-ecr-repo-confirm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-ecr-repo-confirm.png
--------------------------------------------------------------------------------
/Images/section4-create-automation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation.png
--------------------------------------------------------------------------------
/Images/section5-create-automation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation.png
--------------------------------------------------------------------------------
/Images/section2-environment-open-ide.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-environment-open-ide.png
--------------------------------------------------------------------------------
/Images/section4-approve-timer-step1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-approve-timer-step1.png
--------------------------------------------------------------------------------
/Images/section3-gather-resources-stepid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-gather-resources-stepid.png
--------------------------------------------------------------------------------
/Images/section4-architecture-graphics1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-architecture-graphics1.png
--------------------------------------------------------------------------------
/Images/section4-architecture-graphics2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-architecture-graphics2.png
--------------------------------------------------------------------------------
/Images/section4-architecture-graphics3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-architecture-graphics3.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-step1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-step1.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-step2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-step2.png
--------------------------------------------------------------------------------
/Images/section2-base-app-create-complete.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-base-app-create-complete.png
--------------------------------------------------------------------------------
/Images/section3-failure-traffic-requests.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-failure-traffic-requests.png
--------------------------------------------------------------------------------
/Images/section3-investigate-resourcelist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-investigate-resourcelist.png
--------------------------------------------------------------------------------
/Images/section3-success-traffic-requests.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-success-traffic-requests.png
--------------------------------------------------------------------------------
/Images/section4-approve-timer-input-param.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-approve-timer-input-param.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-addstep.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-addstep.png
--------------------------------------------------------------------------------
/Images/section3-playbook-gather-resource-tab.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-playbook-gather-resource-tab.png
--------------------------------------------------------------------------------
/Images/section4-create-approval-gate-step1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-approval-gate-step1.png
--------------------------------------------------------------------------------
/Images/section4-create-approval-gate-step2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-approval-gate-step2.png
--------------------------------------------------------------------------------
/Images/section4-create-approval-gate-step3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-approval-gate-step3.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-graphics1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-graphics1.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-graphics2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-graphics2.png
--------------------------------------------------------------------------------
/Images/section2-base-resources-create-complete.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section2-base-resources-create-complete.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-additionals.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-additionals.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-step2-input.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-step2-input.png
--------------------------------------------------------------------------------
/Images/section5-create-automation2-step1-input.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation2-step1-input.png
--------------------------------------------------------------------------------
/Images/section3-testing-canary-alarm-architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-testing-canary-alarm-architecture.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-parameter-input.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-parameter-input.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-role.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-role.png
--------------------------------------------------------------------------------
/Images/section5-create-automation-parameter-input.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section5-create-automation-parameter-input.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-parameter-input-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-parameter-input-2.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-execute.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-execute.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-execute2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-execute2.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-owned-by-me.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-owned-by-me.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-run-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-run-output.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-test-email.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-email.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-parameter-input-2-step1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-parameter-input-2-step1.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-parameter-input-2-step2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-parameter-input-2-step2.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-parameter-input-2-step3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-parameter-input-2-step3.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-execute-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-execute-output.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-test-run-playbook.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-run-playbook.png
--------------------------------------------------------------------------------
/Images/section3-create-automation-playbook-test-run-playbook-cpu.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section3-create-automation-playbook-test-run-playbook-cpu.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-test-execute-playbook.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-execute-playbook.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-test-run-playbook-summary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-run-playbook-summary.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-test-execute-playbook-observe.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-execute-playbook-observe.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-test-execute-playbook-summary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-execute-playbook-summary.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-test-run-playbook-email-summary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-run-playbook-email-summary.png
--------------------------------------------------------------------------------
/Images/section4-create-automation-playbook-test-execute-playbook-email-summary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/achieving-operational-excellence-using-automated-playbook-and-runbook/main/Images/section4-create-automation-playbook-test-execute-playbook-email-summary.png
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.
--------------------------------------------------------------------------------
/Code/src/package.json:
--------------------------------------------------------------------------------
1 | {
2 |   "name": "app",
3 |   "version": "1.0.0",
4 |   "description": "",
5 |   "main": "app.js",
6 |   "dependencies": {
7 |     "aws-sdk": "^2.850.0",
8 |     "aws-xray-sdk": "^3.2.0",
9 |     "body-parser": "^1.19.0",
10 |     "express": "^4.17.1",
11 |     "mysql": "^2.18.1",
12 |     "zlib": "^1.0.5"
13 |   },
14 |   "devDependencies": {},
15 |   "scripts": {
16 |     "test": "echo \"Error: no test specified\" && exit 1"
17 |   },
18 |   "author": "",
19 |   "license": "ISC"
20 | }
21 |
--------------------------------------------------------------------------------
/Code/src/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM node:12-slim
2 |
3 | # Create app directory
4 | WORKDIR /usr/src/app
5 |
6 | # Install app dependencies
7 | # A wildcard is used to ensure both package.json AND package-lock.json are copied
8 | # where available (npm@5+)
9 | COPY package*.json ./
10 | # ENV NODE_ENV=production
11 | # If you are building your code for production
12 | # RUN npm ci --only=production
13 | ENV NODE_ENV=production
14 | RUN npm install
15 |
16 | # Bundle app source
17 | COPY . .
18 |
19 | EXPOSE 80
20 | CMD [ "node", "app.js" ]
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | this software and associated documentation files (the "Software"), to deal in
5 | the Software without restriction, including without limitation the rights to
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7 | the Software, and to permit persons to whom the Software is furnished to do so.
8 |
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 |
16 |
--------------------------------------------------------------------------------
/Code/templates/automation_role.yml:
--------------------------------------------------------------------------------
1 | AWSTemplateFormatVersion: '2010-09-09'
2 | Resources:
3 |   AutomationRole:
4 |     Type: AWS::IAM::Role
5 |     Properties:
6 |       AssumeRolePolicyDocument:
7 |         Version: '2012-10-17'
8 |         Statement:
9 |           - Effect: Allow
10 |             Principal:
11 |               Service:
12 |                 - ssm.amazonaws.com
13 |                 - ec2.amazonaws.com
14 |             Action: sts:AssumeRole
15 |       Policies:
16 |         - PolicyName: PassRole
17 |           PolicyDocument:
18 |             Statement:
19 |               - Effect: Allow
20 |                 Action: 'iam:PassRole'
21 |                 Resource: '*'
22 |         - PolicyName: SNSPublish
23 |           PolicyDocument:
24 |             Statement:
25 |               - Effect: Allow
26 |                 Action: 'sns:Publish'
27 |                 Resource: '*'
28 |       ManagedPolicyArns:
29 |         - arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole
30 |         - arn:aws:iam::aws:policy/CloudWatchReadOnlyAccess
31 |         - arn:aws:iam::aws:policy/CloudWatchLogsReadOnlyAccess
32 |         - arn:aws:iam::aws:policy/AmazonRDSReadOnlyAccess
33 |         - arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
34 |         - arn:aws:iam::aws:policy/AmazonECS_FullAccess
35 |         - arn:aws:iam::aws:policy/CloudWatchSyntheticsReadOnlyAccess
36 |       Path: "/"
37 |       RoleName: AutomationRole
38 |
--------------------------------------------------------------------------------
/Code/templates/runbook_scale_ecs_service.yml:
--------------------------------------------------------------------------------
1 | Parameters:
2 |   PlaybookIAMRole:
3 |     Type: String
4 |
5 | Resources:
6 |   ScaleECSWithApproval:
7 |     Type: "AWS::SSM::Document"
8 |     Properties:
9 |       DocumentType: Automation
10 |       Name: Runbook-ECS-Scale-Up
11 |       Content:
12 |         schemaVersion: '0.3'
13 |         assumeRole: !Ref PlaybookIAMRole
14 |         parameters:
15 |           ECSClusterName:
16 |             type: String
17 |           ECSServiceName:
18 |             type: String
19 |           ECSDesiredCount:
20 |             type: Integer
21 |           Timer:
22 |             type: String
23 |             default: PT10M
24 |           NotificationTopicArn:
25 |             type: String
26 |           NotificationMessage:
27 |             type: String
28 |           ApproverArn:
29 |             type: String
30 |         mainSteps:
31 |           - name: ExecuteApprovalGateWithTimer
32 |             action: 'aws:executeAutomation'
33 |             inputs:
34 |               DocumentName: Approval-Gate
35 |               RuntimeParameters:
36 |                 Timer: '{{Timer}}'
37 |                 NotificationTopicArn: '{{NotificationTopicArn}}'
38 |                 NotificationMessage: '{{NotificationMessage}}'
39 |                 ApproverArn: '{{ApproverArn}}'
40 |           - name: UpdateECSServiceDesiredCount
41 |             action: aws:executeAwsApi
42 |             inputs:
43 |               Service: ecs
44 |               Api: UpdateService
45 |               service: '{{ECSServiceName}}'
46 |               forceNewDeployment: true
47 |               desiredCount: '{{ECSDesiredCount}}'
48 |               cluster: '{{ECSClusterName}}'
--------------------------------------------------------------------------------
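The `Timer` parameter in the runbook above defaults to `PT10M`, which is an ISO 8601 duration (ten minutes). As a rough illustration of what that value means, the sketch below converts the time-only subset of such durations into seconds. The function name `duration_to_seconds` is hypothetical and is not part of this repository or of Systems Manager; it only handles the `PT#H#M#S` shapes.

```python
import re

# Time-only subset of ISO 8601 durations, e.g. PT10M, PT1H30M, PT45S.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_seconds(value: str) -> int:
    """Convert a duration such as 'PT10M' into a number of seconds."""
    match = _DURATION.match(value)
    # Reject strings that do not match, and a bare 'PT' with no components.
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    print(duration_to_seconds("PT10M"))
```

A helper like this could be used to sanity-check a `Timer` value before starting the runbook; the runbook itself passes the string through unchanged.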
/Code/templates/playbook_investigate_application.yml:
--------------------------------------------------------------------------------
1 | Parameters:
2 |   PlaybookIAMRole:
3 |     Type: String
4 |
5 | Resources:
6 |   PlaybookInvestigateAlarm:
7 |     Type: "AWS::SSM::Document"
8 |     Properties:
9 |       DocumentType: Automation
10 |       Name: Playbook-Investigate-Application-From-Alarm
11 |       Content:
12 |         description: |2-
13 |           # What does this playbook do?
14 |
15 |           This playbook executes **Playbook-Gather-Resources** to gather the application resources monitored by the canary.
16 |
17 |           It then executes **Playbook-Investigate-Application-Resources** to investigate those resources for issues.
18 |
19 |           The outputs of the investigation are sent to the SNS topic subscribers.
20 |         schemaVersion: '0.3'
21 |         assumeRole: !Ref PlaybookIAMRole
22 |         parameters:
23 |           AlarmARN:
24 |             type: String
25 |           SNSTopicARN:
26 |             type: String
27 |         mainSteps:
28 |           - name: gatherResources
29 |             action: 'aws:executeAutomation'
30 |             inputs:
31 |               DocumentName: Playbook-Gather-Resources
32 |               RuntimeParameters:
33 |                 AlarmARN: '{{AlarmARN}}'
34 |           - name: investigateAppResources
35 |             action: 'aws:executeAutomation'
36 |             inputs:
37 |               DocumentName: Playbook-Investigate-Application-Resources
38 |               RuntimeParameters:
39 |                 Resources: '{{gatherResources.Output}}'
40 |           - name: AWSPublishSNSNotification
41 |             action: 'aws:executeAutomation'
42 |             inputs:
43 |               DocumentName: AWS-PublishSNSNotification
44 |               RuntimeParameters:
45 |                 TopicArn: '{{SNSTopicARN}}'
46 |                 Message: '{{ investigateAppResources.Output }}'
47 |
--------------------------------------------------------------------------------
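The playbook above chains three steps by passing each step's output into the next through `{{ ... }}` references. The toy model below illustrates only that data flow. It is not how Systems Manager resolves variables internally; `resolve`, `run_chain`, and the placeholder output strings are all hypothetical names chosen for this sketch.

```python
import re

# Matches references such as {{AlarmARN}} or {{ investigateAppResources.Output }}.
_REF = re.compile(r"\{\{\s*([\w.:]+)\s*\}\}")


def resolve(template: str, values: dict) -> str:
    """Replace every {{name}} reference with its value, failing on unknown names."""
    def _substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(key)
        return str(values[key])
    return _REF.sub(_substitute, template)


def run_chain(params: dict) -> str:
    """Mimic the three-step flow: gather, investigate, then build the SNS message."""
    values = dict(params)
    # Step 1: gatherResources receives the alarm ARN (output is a stand-in string).
    alarm = resolve("{{AlarmARN}}", values)
    values["gatherResources.Output"] = "resources-for:" + alarm
    # Step 2: investigateAppResources receives the output of step 1.
    resources = resolve("{{gatherResources.Output}}", values)
    values["investigateAppResources.Output"] = "findings-for:" + resources
    # Step 3: the notification message is the output of step 2.
    return resolve("{{ investigateAppResources.Output }}", values)


if __name__ == "__main__":
    print(run_chain({"AlarmARN": "arn:example"}))
```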
/Code/scripts/build_application.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Build Script
4 | LABEL='latest'
5 | ECR_REPONAME='walab-ops-sample-application'
6 | SAMPLE_APPNAME=$ECR_REPONAME
7 | MAIN_STACK='walab-ops-base-resources'
8 | SYSOPSEMAIL=$2
9 | SYSOWNEREMAIL=$3
10 |
11 | sudo yum install jq -y -q
12 | AWS_REGION=$(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region')
13 | AWS_ACCOUNT=$(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.accountId')
14 | RESOURCEID=$(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.instanceId')
15 |
16 |
17 | echo '#################################################'
18 | echo 'Script will deploy application with below details'
19 | echo '#################################################'
20 | echo 'Region: ' $AWS_REGION
21 | echo 'Account: '$AWS_ACCOUNT
22 | echo 'Repo Name: '$ECR_REPONAME
23 | echo 'Label: '$LABEL
24 |
25 | echo '##############################'
26 | echo 'Building Application Container'
27 | echo '##############################'
28 | aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com
29 | docker build -t $ECR_REPONAME ../src/
30 | docker tag $ECR_REPONAME:latest $AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPONAME:$LABEL
31 | docker push $AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPONAME:$LABEL
32 |
33 | echo '########################'
34 | echo 'Deploy Application Stack'
35 | echo '########################'
36 | aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com
37 | sleep 15
38 |
39 | aws cloudformation create-stack --stack-name $ECR_REPONAME \
40 | --template-body file://../templates/base_app.yml \
41 | --parameters ParameterKey=BaselineVpcStack,ParameterValue=$MAIN_STACK \
42 | ParameterKey=ECRImageURI,ParameterValue=$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPONAME:$LABEL \
43 | ParameterKey=SystemOpsNotificationEmail,ParameterValue=$SYSOPSEMAIL \
44 | ParameterKey=SystemOwnerNotificationEmail,ParameterValue=$SYSOWNEREMAIL \
45 | --capabilities CAPABILITY_NAMED_IAM \
46 | --tags Key=Application,Value=OpsExcellence-Lab
47 |
48 | echo '#########################################'
49 | echo 'Waiting for Application Stack to complete'
50 | echo '#########################################'
51 | aws cloudformation wait stack-create-complete --stack-name $ECR_REPONAME
52 |
53 | echo '#########################################'
54 | echo 'Application create complete'
55 | echo '#########################################'
--------------------------------------------------------------------------------
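The build script above reads the region and account ID from the instance identity document and combines them with the repository name and label to form the ECR image URI. The sketch below shows the same derivation on a JSON string, assuming the document has already been fetched. The function name `ecr_image_uri` and the sample account ID are hypothetical.

```python
import json


def ecr_image_uri(identity_document: str, repo: str, label: str = "latest") -> str:
    """Build <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<label> from the document."""
    doc = json.loads(identity_document)
    registry = f"{doc['accountId']}.dkr.ecr.{doc['region']}.amazonaws.com"
    return f"{registry}/{repo}:{label}"


if __name__ == "__main__":
    # Placeholder document; real values come from the instance metadata service.
    sample = '{"accountId": "123456789012", "region": "us-east-1", "instanceId": "i-0example"}'
    print(ecr_image_uri(sample, "walab-ops-sample-application"))
```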
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 |
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 |
9 |
10 | ## Reporting Bugs/Feature Requests
11 |
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 |
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 |
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 |
22 |
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 |
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 |
30 | To send us a pull request, please:
31 |
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 |
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 |
42 |
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 |
46 |
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 |
52 |
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 |
56 |
57 | ## Licensing
58 |
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 |
--------------------------------------------------------------------------------
/Code/templates/runbook_approval_gate.yml:
--------------------------------------------------------------------------------
1 | Parameters:
2 |   PlaybookIAMRole:
3 |     Type: String
4 |
5 | Resources:
6 |   AutomaticApproveWithTimer:
7 |     Type: "AWS::SSM::Document"
8 |     Properties:
9 |       DocumentType: Automation
10 |       Name: Approval-Timer
11 |       Content:
12 |         schemaVersion: '0.3'
13 |         assumeRole: !Ref PlaybookIAMRole
14 |         parameters:
15 |           AutomationExecutionId:
16 |             type: String
17 |           Timer:
18 |             type: String
19 |             default: PT10M
20 |         mainSteps:
21 |           - name: SleepTimer
22 |             action: 'aws:sleep'
23 |             inputs:
24 |               Duration: '{{Timer}}'
25 |           - name: ApproveExecution
26 |             action: 'aws:executeAwsApi'
27 |             inputs:
28 |               Api: SendAutomationSignal
29 |               Service: ssm
30 |               Payload:
31 |                 Comment:
32 |                   - 'Automatically approved by Approval-Timer'
33 |               AutomationExecutionId: '{{AutomationExecutionId}}'
34 |               SignalType: Approve
35 |   ApprovalGateWithTimer:
36 |     Type: "AWS::SSM::Document"
37 |     Properties:
38 |       DocumentType: Automation
39 |       Name: Approval-Gate
40 |       Content:
41 |         schemaVersion: '0.3'
42 |         assumeRole: !Ref PlaybookIAMRole
43 |         parameters:
44 |           Timer:
45 |             type: String
46 |             default: PT10M
47 |           NotificationTopicArn:
48 |             type: String
49 |           NotificationMessage:
50 |             type: String
51 |           ApproverArn:
52 |             type: String
53 |         outputs:
54 |           - getApprovalStatus.approvalStatusVariable
55 |         mainSteps:
56 |           - name: executeAutoApproveTimer
57 |             action: 'aws:executeScript'
58 |             inputs:
59 |               Runtime: python3.6
60 |               Handler: handler
61 |               InputPayload:
62 |                 AutomationExecutionId: '{{automation:EXECUTION_ID}}'
63 |                 Timer: '{{Timer}}'
64 |               Script: |-
65 |                 import boto3
66 |                 def handler(event, context):
67 |                   client = boto3.client('ssm')
68 |                   response = client.start_automation_execution(
69 |                     DocumentName='Approval-Timer',
70 |                     Parameters={
71 |                       'Timer': [ event['Timer'] ],
72 |                       'AutomationExecutionId' : [ event['AutomationExecutionId'] ]
73 |                     }
74 |                   )
75 |                   return None
76 |           - name: ApproveOrDeny
77 |             action: 'aws:approve'
78 |             onFailure: Continue
79 |             isCritical: false
80 |             inputs:
81 |               NotificationArn: '{{NotificationTopicArn}}'
82 |               Message: '{{NotificationMessage}}'
83 |               MinRequiredApprovals: 1
84 |               Approvers:
85 |                 - '{{ApproverArn}}'
86 |                 - !Ref PlaybookIAMRole
87 |           - name: getApprovalStatus
88 |             action: 'aws:executeAwsApi'
89 |             maxAttempts: 1
90 |             inputs:
91 |               Service: ssm
92 |               Api: DescribeAutomationStepExecutions
93 |               AutomationExecutionId: '{{automation:EXECUTION_ID}}'
94 |               Filters:
95 |                 - Key: StepName
96 |                   Values:
97 |                     - ApproveOrDeny
98 |             outputs:
99 |               - Name: approvalStatusVariable
100 |                 Selector: '$.StepExecutions[0].Outputs.ApprovalStatus[0]'
101 |                 Type: String
102 |
--------------------------------------------------------------------------------
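The two documents above implement an approval gate that is decided by whichever comes first: a human approver responding, or the `Approval-Timer` execution sending an `Approve` signal after the `Timer` elapses. The sketch below models only that race, using a local `threading.Event` in place of the Systems Manager signal. The function `approval_gate` and its return shape are hypothetical.

```python
import threading


def approval_gate(manual_decision: threading.Event, decision: dict, timer_seconds: float):
    """Return (status, source): the approver's decision if one arrives before the
    timer expires, otherwise an automatic approval attributed to the timer."""
    if manual_decision.wait(timeout=timer_seconds):
        return decision["status"], "approver"
    return "Approved", "timer"


if __name__ == "__main__":
    # Nobody responds: the timer approves after the (very short) wait.
    print(approval_gate(threading.Event(), {}, timer_seconds=0.01))

    # An approver rejects before the timer expires.
    responded = threading.Event()
    responded.set()
    print(approval_gate(responded, {"status": "Rejected"}, timer_seconds=5))
```

In the runbook itself the wait is the `aws:approve` step, and the timer is a separate automation execution started by the `executeAutoApproveTimer` script.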
/Code/scripts/teardown_resources.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | ECR_REPONAME='walab-ops-sample-application'
4 | SAMPLE_APPNAME=$ECR_REPONAME
5 | CANARY_RESULT_BUCKET=$(aws cloudformation describe-stacks --stack-name $SAMPLE_APPNAME | jq -r '.Stacks[0].Outputs[] | select(.OutputKey == "OutputCanaryResultsBucket") | .OutputValue')
6 | MAIN_STACK='walab-ops-base-resources'
7 |
8 |
9 |
10 |
11 | echo '############'
12 | echo 'Cleanup Repo'
13 | echo '############'
14 | echo $ECR_REPONAME
15 | aws ecr delete-repository --repository-name $ECR_REPONAME --force
16 |
17 | echo '####################'
18 | echo 'Cleanup Canary Bucket'
19 | echo '####################'
20 | echo $CANARY_RESULT_BUCKET
21 | aws s3 rm s3://$CANARY_RESULT_BUCKET --recursive
22 |
23 | echo '####################'
24 | echo 'Cleanup Canary '
25 | echo '####################'
26 | CANARY_LAMBDA=$(aws synthetics describe-canaries --query "Canaries[?Name == 'mysecretword-canary'].EngineArn" | sed '1d;$d' | sed -r 's/\s+//g' | tr -d '",' | cut -d':' -f7)
27 | CANARY_SECGROUP=$(aws lambda get-function-configuration --function-name $CANARY_LAMBDA --query VpcConfig.SecurityGroupIds[0] | tr -d '",')
28 | CANARY_NICS=$(aws ec2 describe-network-interfaces --filters Name=group-id,Values=$CANARY_SECGROUP --query NetworkInterfaces[].NetworkInterfaceId | sed ':a;N;$!ba;s/\n/ /g' | sed -r 's/\s+//g' | sed -r 's/,/ /g' | tr -d '[,' | tr -d '],' | tr -d '",')
29 | echo 'Delete ' $CANARY_LAMBDA
30 | aws lambda delete-function --function-name $CANARY_LAMBDA
31 | echo 'Sleep for 5 mins before deleting the Network Interface:' $CANARY_NICS
32 | sleep 300
33 | for t in ${CANARY_NICS[@]}; do
34 | echo 'Deleting ENI' $t
35 | aws ec2 delete-network-interface --network-interface-id $t
36 | done
37 | echo 'Sleep for 1 min before deleting the SEC Group:' $CANARY_SECGROUP
38 | sleep 60
39 | echo 'Delete Security Group'
40 | aws ec2 delete-security-group --group-id $CANARY_SECGROUP
41 | echo 'Delete Canary'
42 | aws synthetics stop-canary --name mysecretword-canary
43 | aws synthetics delete-canary --name mysecretword-canary
44 |
45 | echo '##########################'
46 | echo 'Deleting Application Stack'
47 | echo '##########################'
48 | aws cloudformation delete-stack --stack-name $SAMPLE_APPNAME
49 | aws cloudformation wait stack-delete-complete --stack-name $SAMPLE_APPNAME
50 |
51 | echo '################################'
52 | echo 'Deleting Playbook/Runbook Stacks'
53 | echo '################################'
54 |
55 |
56 |
57 | aws cloudformation delete-stack --stack-name waopslab-runbook-approval-gate
58 | aws cloudformation wait stack-delete-complete --stack-name waopslab-runbook-approval-gate
59 |
60 | aws cloudformation delete-stack --stack-name waopslab-automation-role
61 | aws cloudformation wait stack-delete-complete --stack-name waopslab-automation-role
62 |
63 | aws cloudformation delete-stack --stack-name waopslab-runbook-scale-ecs-service
64 | aws cloudformation wait stack-delete-complete --stack-name waopslab-runbook-scale-ecs-service
65 |
66 | aws cloudformation delete-stack --stack-name waopslab-playbook-gather-resources
67 | aws cloudformation wait stack-delete-complete --stack-name waopslab-playbook-gather-resources
68 |
69 | aws cloudformation delete-stack --stack-name waopslab-playbook-investigate-resources
70 | aws cloudformation wait stack-delete-complete --stack-name waopslab-playbook-investigate-resources
71 |
72 | aws cloudformation delete-stack --stack-name waopslab-playbook-investigate-application
73 | aws cloudformation wait stack-delete-complete --stack-name waopslab-playbook-investigate-application
74 |
75 | echo '##########################'
76 | echo 'Deleting Base Resources'
77 | echo '##########################'
78 | aws cloudformation delete-stack --stack-name $MAIN_STACK
79 | aws cloudformation wait stack-delete-complete --stack-name $MAIN_STACK
80 |
81 | echo '#########################################'
82 | echo 'Application Teardown Complete'
83 | echo '#########################################'
84 |
85 |
86 |
--------------------------------------------------------------------------------
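The teardown script above recovers the canary's Lambda function name by taking the seventh colon-separated field of the canary's `EngineArn` (`cut -d':' -f7`). The sketch below does the same split with basic validation. The function name `lambda_name_from_arn` and the sample ARN are hypothetical.

```python
def lambda_name_from_arn(arn: str) -> str:
    """Return the function name from arn:aws:lambda:<region>:<account>:function:<name>[:<version>]."""
    parts = arn.split(":")
    if len(parts) < 7 or parts[2] != "lambda" or parts[5] != "function":
        raise ValueError(f"not a Lambda function ARN: {arn!r}")
    # Index 6 is the seventh field, matching `cut -d':' -f7` in the script.
    return parts[6]


if __name__ == "__main__":
    # Placeholder ARN for illustration only.
    print(lambda_name_from_arn(
        "arn:aws:lambda:us-east-1:123456789012:function:cwsyn-example-canary:1"
    ))
```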
/Code/templates/playbook_gather_resources.yml:
--------------------------------------------------------------------------------
1 | Parameters:
2 |   PlaybookIAMRole:
3 |     Type: String
4 |
5 | Resources:
6 |   PlaybookGatherAppResourceAlarm:
7 |     Type: "AWS::SSM::Document"
8 |     Properties:
9 |       DocumentType: Automation
10 |       Name: Playbook-Gather-Resources
11 |       Content:
12 |         schemaVersion: '0.3'
13 |         assumeRole: "{{AutomationAssumeRole}}"
14 |         parameters:
15 |           AlarmARN:
16 |             description: (Required) The ARN of the alarm that triggered the incident.
17 |             type: String
18 |           AutomationAssumeRole:
19 |             type: String
20 |             default: !Ref PlaybookIAMRole
21 |             description: (Optional) The ARN of the role that allows Automation to perform the actions on your behalf.
22 |         outputs:
23 |           - Gather_Resources_For_Alarm.Resources
24 |         mainSteps:
25 |           - name: Gather_Resources_For_Alarm
26 |             action: aws:executeScript
27 |             description: Gather the AWS resources related to the alarm, based on its tag
28 |             outputs:
29 |               - Name: Resources
30 |                 Selector: $.Payload.ApplicationStackResources
31 |                 Type: String
32 |             inputs:
33 |               Runtime: python3.6
34 |               Handler: handler
35 |               InputPayload:
36 |                 CloudWatchAlarmARN: '{{AlarmARN}}'
37 |               Script: |-
38 |                 import json
39 |                 import re
40 |                 from datetime import datetime
41 |                 import boto3
42 |                 import os
43 |
44 |                 def arn_deconstruct(arn):
45 |                   # arn:aws:cloudwatch:us-east-1:754323466686:alarm:mysecretword-canary-alarm
46 |                   arnlist = arn.split(":")
47 |                   service=arnlist[2]
48 |                   region=arnlist[3]
49 |                   accountid=arnlist[4]
50 |                   servicetype=arnlist[5]
51 |                   name=arnlist[6]
52 |
53 |                   return {
54 |                     "Service": service,
55 |                     "Region": region,
56 |                     "AccountId": accountid,
57 |                     "Type": servicetype,
58 |                     "Name": name
59 |                   }
60 |
61 |                 def locate_alarm_source(alarm):
62 |                   cwclient = boto3.client('cloudwatch', region_name = alarm['Region'] )
63 |                   alarm_source = {}
64 |                   alarm_detail = cwclient.describe_alarms(AlarmNames=[alarm['Name']])
65 |
66 |                   if len(alarm_detail['MetricAlarms']) > 0:
67 |                     metric_alarm = alarm_detail['MetricAlarms'][0]
68 |                     namespace = metric_alarm['Namespace']
69 |
70 |                     # Condition if the namespace is CloudWatch Synthetics
71 |                     if namespace == 'CloudWatchSynthetics':
72 |                       if 'Dimensions' in metric_alarm:
73 |                         dimensions = metric_alarm['Dimensions']
74 |                         for i in dimensions:
75 |                           if i['Name'] == 'CanaryName':
76 |                             source_name = i['Value']
77 |                             alarm_source['Type'] = namespace
78 |                             alarm_source['Name'] = source_name
79 |                             alarm_source['Region'] = alarm['Region']
80 |                             alarm_source['AccountId'] = alarm['AccountId']
81 |
82 |                   result = alarm_source
83 |                   return result
84 |
85 |                   # #Condition for CompositeAlarms
86 |                   # if len(alarm_detail['CompositeAlarms']) > 0:
87 |
88 |                 def locate_canary_endpoint(canaryname,region):
89 |                   result = None
90 |                   synclient = boto3.client('synthetics', region_name = region )
91 |                   res = synclient.get_canary(Name=canaryname)
92 |                   canary = res['Canary']
93 |                   if 'Tags' in canary:
94 |                     if 'TargetEndpoint' in canary['Tags']:
95 |                       target_endpoint = canary['Tags']['TargetEndpoint']
96 |                       result = target_endpoint
97 |
98 |                   return result
99 |
100 |
101 |                 def locate_app_tag_value(resource):
102 |                   result = None
103 |
104 |                   if resource['Type'] == 'CloudWatchSynthetics':
105 |                     synclient = boto3.client('synthetics', region_name = resource['Region'] )
106 |                     res = synclient.get_canary(Name=resource['Name'])
107 |                     canary = res['Canary']
108 |                     if 'Tags' in canary:
109 |                       if 'Application' in canary['Tags']:
110 |                         apptag_val = canary['Tags']['Application']
111 |                         result = apptag_val
112 |
113 |                   return result
114 |
115 |                 def locate_app_resources_by_tag(tag,region):
116 |                   result = None
117 |
118 |                   # Search CloudFormation stacks for the tag
119 |                   cfnclient = boto3.client('cloudformation', region_name = region )
120 |                   stack_list = cfnclient.list_stacks(StackStatusFilter=['CREATE_COMPLETE','ROLLBACK_COMPLETE','UPDATE_COMPLETE','UPDATE_ROLLBACK_COMPLETE','IMPORT_COMPLETE','IMPORT_ROLLBACK_COMPLETE'] )
121 |                   for stack in stack_list['StackSummaries']:
122 |                     app_resources_list = []
123 |                     stack_name = stack['StackName']
124 |                     stack_details = cfnclient.describe_stacks(StackName=stack_name)
125 |                     stack_info = stack_details['Stacks'][0]
126 |                     if 'Tags' in stack_info:
127 |                       for t in stack_info['Tags']:
128 |                         if t['Key'] == 'Application' and t['Value'] == tag:
129 |                           app_stack_name = stack_info['StackName']
130 |                           app_resources = cfnclient.describe_stack_resources(StackName=app_stack_name)
131 |                           for resource in app_resources['StackResources']:
132 |                             app_resources_list.append(
133 |                               {
134 |                                 'PhysicalResourceId' : resource['PhysicalResourceId'],
135 |                                 'Type': resource['ResourceType']
136 |                               }
137 |                             )
138 |                           result = app_resources_list
139 |
140 |                   return result
141 |                 def handler(event, context):
142 |                   result = {}
143 |                   arn = event['CloudWatchAlarmARN']
144 |                   alarm = arn_deconstruct(arn)
145 |                   # Locate tag from CloudWatch Alarm
146 |
147 |
148 |                   alarm_source = locate_alarm_source(alarm) # Identify the alarm source
149 |                   tag_value = locate_app_tag_value(alarm_source) # Identify the tag from the source
150 |
151 |                   if alarm_source['Type'] == 'CloudWatchSynthetics':
152 |                     endpoint = locate_canary_endpoint(alarm_source['Name'],alarm_source['Region'])
153 |                     result['CanaryEndpoint'] = endpoint
154 |
155 |                   # Locate the CloudFormation stack with the tag
156 |                   resources = locate_app_resources_by_tag(tag_value,alarm['Region'])
157 |                   result['ApplicationStackResources'] = json.dumps(resources)
158 |
159 |                   return result
--------------------------------------------------------------------------------
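The `locate_app_resources_by_tag` function above finds the application's stack by looking for an `Application` tag whose value matches the one on the canary. The sketch below isolates that matching step on plain dictionaries shaped like entries of a `describe_stacks` response, so it runs without AWS access. The function name `stacks_with_tag` and the second sample stack are hypothetical.

```python
def stacks_with_tag(stacks, key, value):
    """Return the names of stacks that carry the tag key=value."""
    return [
        stack["StackName"]
        for stack in stacks
        if any(t["Key"] == key and t["Value"] == value for t in stack.get("Tags", []))
    ]


if __name__ == "__main__":
    # Sample data only; real entries come from CloudFormation.
    sample_stacks = [
        {"StackName": "walab-ops-sample-application",
         "Tags": [{"Key": "Application", "Value": "OpsExcellence-Lab"}]},
        {"StackName": "unrelated-stack",
         "Tags": [{"Key": "Application", "Value": "Other"}]},
        {"StackName": "untagged-stack"},
    ]
    print(stacks_with_tag(sample_stacks, "Application", "OpsExcellence-Lab"))
```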
/Code/templates/base_resources.yml:
--------------------------------------------------------------------------------
1 | AWSTemplateFormatVersion: '2010-09-09'
2 |
3 | Description: >
4 | Well Architected Operational Excellence Lab
5 |
6 | Parameters:
7 | Cloud9CidrBlock:
8 | Description: The CIDR block range for your Cloud9 IDE VPC
9 | Type: String
10 | Default: 10.43.0.0/28
11 | GitRepositoryURL:
12 | Description: The Git repository URL for the project we are cloning
13 | Type: String
14 | Default: https://github.com/awslabs/aws-well-architected-labs.git
15 |
16 |
17 | Resources:
18 | #------------------------------------------------------------
19 | # Create a VPC with two public and two private subnets
20 | #------------------------------------------------------------
21 | VPC:
22 | Type: 'AWS::EC2::VPC'
23 | Properties:
24 | CidrBlock: 172.31.0.0/16
25 | Tags:
26 | - Key: Name
27 | Value: !Sub "${AWS::StackName}-VPC"
28 | EnableDnsHostnames: true
29 | EnableDnsSupport: true
30 |
31 | PublicSubnet1:
32 | Type: 'AWS::EC2::Subnet'
33 | Properties:
34 | VpcId: !Ref VPC
35 | AvailabilityZone: !Select
36 | - '0'
37 | - !GetAZs ''
38 | CidrBlock: 172.31.1.0/24
39 | MapPublicIpOnLaunch: true
40 |
41 | PublicSubnet2:
42 | Type: 'AWS::EC2::Subnet'
43 | Properties:
44 | VpcId: !Ref VPC
45 | AvailabilityZone: !Select
46 | - '1'
47 | - !GetAZs ''
48 | CidrBlock: 172.31.3.0/24
49 | MapPublicIpOnLaunch: true
50 |
51 | PrivateSubnet1:
52 | Type: 'AWS::EC2::Subnet'
53 | Properties:
54 | VpcId: !Ref VPC
55 | AvailabilityZone: !Select
56 | - '0'
57 | - !GetAZs ''
58 | CidrBlock: 172.31.2.0/24
59 | MapPublicIpOnLaunch: false
60 |
61 | PrivateSubnet2:
62 | Type: 'AWS::EC2::Subnet'
63 | Properties:
64 | VpcId: !Ref VPC
65 | AvailabilityZone: !Select
66 | - '1'
67 | - !GetAZs ''
68 | CidrBlock: 172.31.4.0/24
69 | MapPublicIpOnLaunch: false
70 |
71 | #-------------------------------------------------------------
72 | # Create an IGW and attach to the created VPC
73 | # Create a NAT GW with an associated public IP address.
74 | #-------------------------------------------------------------
75 |
76 | IGW:
77 | Type: AWS::EC2::InternetGateway
78 | Properties:
79 | Tags:
80 | - Key: Name
81 | Value: !Sub "${AWS::StackName}-InternetGateway"
82 |
83 | IGWAttach:
84 | Type: AWS::EC2::VPCGatewayAttachment
85 | Properties:
86 | VpcId: !Ref VPC
87 | InternetGatewayId: !Ref IGW
88 |
89 | NatGateway:
90 | Type: "AWS::EC2::NatGateway"
91 | DependsOn: NatPublicIP
92 | Properties:
93 | AllocationId: !GetAtt NatPublicIP.AllocationId
94 | SubnetId: !Ref PublicSubnet1
95 |
96 | NatPublicIP:
97 | Type: "AWS::EC2::EIP"
98 | DependsOn: VPC
99 | Properties:
100 | Domain: vpc
101 |
102 | #-------------------------------------------------------------
103 | # Create public route table and attach to the public subnets
104 | #-------------------------------------------------------------
105 |
106 | PublicRouteTable1:
107 | Type: 'AWS::EC2::RouteTable'
108 | Properties:
109 | VpcId: !Ref VPC
110 | Tags:
111 | - Key: Name
112 | Value: !Sub "${AWS::StackName}-Public-RouteTable1"
113 |
114 |
115 | PublicRouteTable2:
116 | Type: 'AWS::EC2::RouteTable'
117 | Properties:
118 | VpcId: !Ref VPC
119 | Tags:
120 | - Key: Name
121 | Value: !Sub "${AWS::StackName}-Public-RouteTable2"
122 |
123 | PublicRoute1:
124 | Type: 'AWS::EC2::Route'
125 | DependsOn:
126 | - IGW
127 | - IGWAttach
128 | Properties:
129 | RouteTableId: !Ref PublicRouteTable1
130 | DestinationCidrBlock: 0.0.0.0/0
131 | GatewayId: !Ref IGW
132 |
133 | PublicRoute2:
134 | Type: 'AWS::EC2::Route'
135 | DependsOn:
136 | - IGW
137 | - IGWAttach
138 | Properties:
139 | RouteTableId: !Ref PublicRouteTable2
140 | DestinationCidrBlock: 0.0.0.0/0
141 | GatewayId: !Ref IGW
142 |
143 | PublicSubnet1RouteTableAssociation1:
144 | Type: 'AWS::EC2::SubnetRouteTableAssociation'
145 | Properties:
146 | SubnetId: !Ref PublicSubnet1
147 | RouteTableId: !Ref PublicRouteTable1
148 |
149 | PublicSubnet1RouteTableAssociation2:
150 | Type: 'AWS::EC2::SubnetRouteTableAssociation'
151 | Properties:
152 | SubnetId: !Ref PublicSubnet2
153 | RouteTableId: !Ref PublicRouteTable2
154 |
155 | #-------------------------------------------------------------
156 | # Create private route tables and attach to the private subnets
157 | #-------------------------------------------------------------
158 |
159 | PrivateRouteTable1:
160 | Type: 'AWS::EC2::RouteTable'
161 | Properties:
162 | VpcId: !Ref VPC
163 |
164 |
165 | PrivateRouteTable2:
166 | Type: 'AWS::EC2::RouteTable'
167 | Properties:
168 | VpcId: !Ref VPC
169 |
170 | PrivateRoute1:
171 | Type: 'AWS::EC2::Route'
172 | DependsOn: IGW
173 | Properties:
174 | RouteTableId: !Ref PrivateRouteTable1
175 | DestinationCidrBlock: 0.0.0.0/0
176 | NatGatewayId: !Ref NatGateway
177 |
178 | PrivateRoute2:
179 | Type: 'AWS::EC2::Route'
180 | DependsOn: IGW
181 | Properties:
182 | RouteTableId: !Ref PrivateRouteTable2
183 | DestinationCidrBlock: 0.0.0.0/0
184 | NatGatewayId: !Ref NatGateway
185 |
186 | PrivateSubnet1RouteTableAssociation1:
187 | Type: 'AWS::EC2::SubnetRouteTableAssociation'
188 | Properties:
189 | SubnetId: !Ref PrivateSubnet1
190 | RouteTableId: !Ref PrivateRouteTable1
191 |
192 | PrivateSubnet1RouteTableAssociation2:
193 | Type: 'AWS::EC2::SubnetRouteTableAssociation'
194 | Properties:
195 | SubnetId: !Ref PrivateSubnet2
196 | RouteTableId: !Ref PrivateRouteTable2
197 |
198 | # ------------------
199 | # ECR Repository
200 | # ------------------
201 |
202 | AppContainerRepository:
203 | Type: AWS::ECR::Repository
204 | Properties:
205 | RepositoryName: walab-ops-sample-application
206 |
207 | Cloud9:
208 | Type: AWS::Cloud9::EnvironmentEC2
209 | Properties:
210 | AutomaticStopTimeMinutes: 30
211 | Description: Well Architected Operational Excellence lab workspace
212 | InstanceType: t2.small
213 | ImageId: amazonlinux-2-x86_64
214 | Name: !Sub "WellArchitectedOps-${AWS::StackName}"
215 | Repositories:
216 | - PathComponent: /aws-well-architected-labs
217 | RepositoryUrl: !Ref GitRepositoryURL
218 | SubnetId: !Ref PublicSubnet1
219 |
220 | Outputs:
221 | Cloud9DevEnvUrl:
222 | Description: Cloud9 Development Environment
223 | Value: !Sub "https://${AWS::Region}.console.aws.amazon.com/cloud9/ide/${Cloud9}"
224 |
225 |
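226 | # The outputs below belong to the same Outputs section as Cloud9DevEnvUrl; a second top-level Outputs key would be a duplicate mapping key.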
227 | OutputVPC:
228 | Description: Baseline VPC
229 | Value: !Ref VPC
230 | Export:
231 | Name: !Sub "${AWS::StackName}-VpcId"
232 | OutputVPCCidrBlock:
233 | Description: Baseline VPC Cidr Block
234 | Value: !GetAtt VPC.CidrBlock
235 | Export:
236 | Name: !Sub "${AWS::StackName}-VpcCidrBlock"
237 | OutputPublicSubnet1:
238 | Description: Public Subnet 1 VPC
239 | Value: !Ref PublicSubnet1
240 | Export:
241 | Name: !Sub "${AWS::StackName}-PublicSubnet1"
242 | OutputPublicSubnet2:
243 | Description: Public Subnet 2 VPC
244 | Value: !Ref PublicSubnet2
245 | Export:
246 | Name: !Sub "${AWS::StackName}-PublicSubnet2"
247 | OutputPrivateSubnet1:
248 | Description: Private Subnet 1 VPC
249 | Value: !Ref PrivateSubnet1
250 | Export:
251 | Name: !Sub "${AWS::StackName}-PrivateSubnet1"
252 | OutputPrivateSubnet2:
253 | Description: Private Subnet 2 VPC
254 | Value: !Ref PrivateSubnet2
255 | Export:
256 | Name: !Sub "${AWS::StackName}-PrivateSubnet2"
257 | OutputAppContainerRepository:
258 | Description: Application ECR Repository
259 | Value: !Ref AppContainerRepository
260 | Export:
261 | Name: !Sub "${AWS::StackName}-AppContainerRepository"
262 |
--------------------------------------------------------------------------------
/Code/src/app.js:
--------------------------------------------------------------------------------
1 | 'use strict';
2 | const AWS = require('aws-sdk');
3 | const AWSXRay = require('aws-xray-sdk');
4 | const kmsClient = AWSXRay.captureAWSClient(new AWS.KMS({region: process.env.REGION }));
5 | const secretsmanager = AWSXRay.captureAWSClient(new AWS.SecretsManager({region: process.env.REGION }));
6 | const express = require('express');
7 | const router = express.Router();
8 | const bodyParser = require("body-parser");
9 | var mysql = AWSXRay.captureMySQL(require('mysql'));
10 | const zlib = require('zlib');
11 |
12 |
13 | // Constants
14 | const PORT = 80;
15 | const HOST = '0.0.0.0';
16 |
17 | // App
18 | const app = express();
19 | app.use(AWSXRay.express.openSegment('mysecretapp-api'));
20 |
21 |
22 | app.use(bodyParser.urlencoded({ extended: false }));
23 | app.use(bodyParser.json());
24 |
25 |
26 | const DBHOST = process.env.DBHOST;
27 | const KeyId = process.env.KeyId;
28 | const DBSecret = process.env.DBSecret;
29 |
30 | function hydrateDBCreds( DBSecret ){
31 | var promise = new Promise(function(resolve,reject){
32 | var params = {
33 | SecretId: DBSecret
34 | };
35 | secretsmanager.getSecretValue(params, function(err, data) {
36 | if (err){
37 | console.log(err);
38 | reject(err);
39 | }
40 | else{
41 | var secString = data['SecretString']
42 | var secObj = JSON.parse(secString)
43 | process.env.DBUSER = secObj['username']
44 | process.env.DBPASS = secObj['password']
45 | resolve ( data );
46 | }
47 | // successful response
48 | });
49 | });
50 | return promise;
51 | };
52 |
53 | function encryptData( KeyId, Plaintext ){
54 | var promise = new Promise(function(resolve,reject){
55 | kmsClient.encrypt({ KeyId, Plaintext }, (err, data) => {
56 | if (err) {
57 | console.log(err)
58 | reject(err); // an error occurred
59 | }
60 | else {
61 | const { CiphertextBlob } = data;
62 | resolve ( CiphertextBlob );
63 | };
64 | });
65 |
66 | });
67 | return promise;
68 | };
69 |
70 |
71 | function decryptData( KeyId, CiphertextBlob ){
72 | var promise = new Promise(function(resolve,reject){
73 | kmsClient.decrypt({ CiphertextBlob, KeyId }, (err, data) => {
74 | if (err) {
75 | console.log(err)
76 | reject(err); // an error occurred
77 | }
78 | else {
79 | const { Plaintext } = data;
80 | resolve ( Plaintext.toString() );
81 | };
82 | });
83 |
84 | });
85 | return promise;
86 | };
87 |
106 |
107 |
108 |
109 | function createDB(DBSecret){
110 |
111 | var promise = new Promise(function(resolve,reject){
112 | try
113 | {
114 | var con = mysql.createConnection({host: DBHOST,user: process.env.DBUSER ,password: process.env.DBPASS});
115 | var sql = "CREATE DATABASE IF NOT EXISTS mydb";
116 | con.query(sql, function (err, result) {
117 | if (err) {
118 | con.end();
119 | reject(err);
120 | }
121 | else{
122 | con.end();
123 | resolve("database create done");
124 | }
125 | });
126 | }
127 | catch(err){
128 | reject(err);
129 | }
130 | });
131 | return promise;
132 | };
133 |
134 | function createTable(){
135 | var promise = new Promise(function(resolve,reject){
136 | try
137 | {
138 | var con = mysql.createConnection({host: DBHOST,user: process.env.DBUSER ,password: process.env.DBPASS,database: "mydb"});
139 | var sql = "CREATE TABLE IF NOT EXISTS peoplesecret (name VARCHAR(244) NOT NULL, secret TEXT, PRIMARY KEY (name) )";
140 | con.query(sql, function (err, result) {
141 | if (err) {
142 | con.end();
143 | reject(err);
144 | }
145 | else{
146 | con.end();
147 | resolve("table create done");
148 | }
149 | });
150 | }
151 | catch(err){
152 | reject(err);
153 | }
154 | });
155 | return promise;
156 | };
157 |
158 | function storeSecret(Payload){
159 |
160 | var promise = new Promise(function(resolve,reject){
161 | try{
162 | var con = mysql.createConnection({host: DBHOST,user: process.env.DBUSER ,password: process.env.DBPASS,database: "mydb"});
163 | var sql = "INSERT INTO peoplesecret (name, secret) VALUES (?, ?) ON DUPLICATE KEY UPDATE secret = ?"; // placeholders: the driver escapes request values instead of concatenating them into the SQL text
164 | con.query(sql, [Payload['Name'], Payload['Text'], Payload['Text']], function (err, result) {
165 | if (err) {
166 | con.end();
167 | reject(err);
168 | }
169 | else{
170 | con.end();
171 | resolve("1 record inserted");
172 | }
173 | });
174 | }
175 | catch(err){
176 | reject(err);
177 | }
178 | });
179 | return promise;
180 | };
181 |
182 | function getSecret(Payload){
183 |
184 | var promise = new Promise(function(resolve,reject){
185 | try{
186 | var con = mysql.createConnection({host: DBHOST,user: process.env.DBUSER ,password: process.env.DBPASS,database: "mydb"});
187 | var sql = "SELECT secret from peoplesecret WHERE name = ?"; // placeholder: the driver escapes the requested name
188 | con.query(sql, [Payload['Name']], function (err, result) {
189 | if (err) {
190 | con.end();
191 | reject(err);
192 | }
193 | else{
194 | if(result.length < 1){
195 | con.end();
196 | reject('no record found');
197 | }
198 | else{
199 | con.end();
200 | resolve(result[0].secret);
201 | }
202 | }
203 | });
204 | }
205 | catch(err){
206 | reject(err)
207 | }
208 | });
209 | return promise;
210 | };
211 |
212 |
213 |
214 | router.get('/', (req, res) => {
215 | res.status(200).send( 'OK' );
216 | });
217 |
218 | router.post('/encrypt', (req, res) => {
219 | const Payload = {
220 | 'Name': req.body.Name,
221 | 'Text': req.body.Text
222 | }
223 |
224 | // Track which stage is running so a single catch can return the matching error.
225 | var failure = { status: 400, message: 'Failed Hydrating Credentials' };
226 |
227 | // Flat promise chain: each step starts only after the previous one resolves,
228 | // and the 200 response is sent only after the record has been stored.
229 | hydrateDBCreds(DBSecret)
230 | .then(function(){
231 | failure = { status: 400, message: 'Data encryption failed, check logs for more details' };
232 | return encryptData(KeyId, Payload['Text']);
233 | })
234 | .then(function(encryptedData){
235 | Payload['Text'] = zlib.gzipSync(JSON.stringify(encryptedData)).toString('base64');
236 | failure = { status: 500, message: 'oops something went wrong !' };
237 | //Prepare Database & Table
238 | return createDB();
239 | })
240 | .then(function(){
241 | return createTable();
242 | })
243 | .then(function(){
244 | //Insert Record
245 | return storeSecret(Payload);
246 | })
247 | .then(function(result){
248 | console.log(result);
249 | res.status(200).send({
250 | 'Message':'Data encrypted and stored, keep your key safe',
251 | 'Key' : KeyId
252 | });
253 | })
254 | .catch(function(err){
255 | console.log(err);
256 | res.status(failure.status).send( {'Message': failure.message });
257 | });
285 | });
286 |
287 | router.get('/decrypt', (req, res) => {
288 | const Payload = {
289 | 'Name': req.body.Name,
290 | 'Key': req.body.Key
291 | }
292 | hydrateDBCreds(DBSecret)
293 | .then(function(response){
294 | getSecret(Payload)
295 | .then(function(response) {
296 | var secretText = response;
297 | const originalObj = JSON.parse(zlib.unzipSync(Buffer.from(secretText, 'base64')));
298 | var buf = Buffer.from(originalObj, 'utf8');
299 | decryptData(Payload['Key'],buf)
300 | .then(function(response){
301 | var output = {
302 | 'Text':response,
303 | };
304 | res.status(200).send( output );
305 | })
306 | .catch(function(err){
307 | res.status(400).send( {'Message':'Data decryption failed, make sure you have the correct key' });
308 | });
309 | return response;
310 | })
311 | .catch(function(err) {
312 | res.status(400).send( {'Message':'Failed getting secret text, check the user name' });
313 | });
314 | return response;
315 | })
316 | .catch(function(err){
317 | console.log(err);
318 | res.status(400).send( {'Message':'Failed Hydrating Credentials' });
319 | })
320 | });
321 |
322 | app.use("/",router);
323 | app.use(AWSXRay.express.closeSegment());
324 |
325 |
326 | app.listen(PORT, HOST);
327 | console.log(`Running on http://${HOST}:${PORT}`);
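
The `storeSecret` and `getSecret` helpers in this file send request values to the database. As a hedged illustration of the placeholder-based query style, the sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL. The function names `open_demo_db`, `store_secret`, and `get_secret` are hypothetical and not part of the application, and `INSERT OR REPLACE` is SQLite's approximation of MySQL's `ON DUPLICATE KEY UPDATE`.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with the same table layout as app.js."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE IF NOT EXISTS peoplesecret "
        "(name VARCHAR(244) NOT NULL, secret TEXT, PRIMARY KEY (name))"
    )
    return con


def store_secret(con, name, text):
    """Insert or overwrite a secret; the ? placeholders keep input out of the SQL text."""
    con.execute(
        "INSERT OR REPLACE INTO peoplesecret (name, secret) VALUES (?, ?)",
        (name, text),
    )
    con.commit()


def get_secret(con, name):
    """Return the stored secret for a name, or raise LookupError if none exists."""
    row = con.execute(
        "SELECT secret FROM peoplesecret WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise LookupError("no record found")
    return row[0]


if __name__ == "__main__":
    con = open_demo_db()
    store_secret(con, "Test User", "first value")
    store_secret(con, "Test User", "second value")  # same key: the row is replaced
    print(get_secret(con, "Test User"))
    # A value containing quotes is stored as data rather than parsed as SQL.
    store_secret(con, "O'Brien", "x' OR '1'='1")
    print(get_secret(con, "O'Brien"))
```

Because the values travel separately from the statement text, a name or secret containing quote characters cannot change the shape of the query.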
--------------------------------------------------------------------------------
/Code/templates/base_app.yml:
--------------------------------------------------------------------------------
1 | Parameters:
2 | BaselineVpcStack:
3 | Type: String
4 | ECRImageURI:
5 | Type: String
6 | SystemOpsNotificationEmail:
7 | Type: String
8 | SystemOwnerNotificationEmail:
9 | Type: String
10 |
11 | Outputs:
12 | OutputApplicationEndpoint:
13 | Description: Application Endpoint
14 | Value: !GetAtt ALB.DNSName
15 | OutputECSService:
16 | Description: App Task ECS Service
17 | Value: !GetAtt ECSService.Name
18 | OutputECSCluster:
19 | Description: App Task ECS Cluster
20 | Value: !Ref ECSCluster
21 | OutputSystemOwnersTopicArn:
22 | Description: Arn of the SNS Topic for System Owners
23 | Value: !Ref SystemOwnersTopic
24 | OutputSystemEventTopicArn:
25 | Description: Arn of the SNS Topic for System Events
26 | Value: !Ref SystemEventTopic
27 | SyntheticsCanaryDurationAlarmArn:
28 | Description: Arn CloudWatch Alarm
29 | Value: !GetAtt SyntheticsCanaryDurationAlarm.Arn
30 | OutputCanaryResultsBucket:
31 | Description: Canary Result Bucket
32 | Value: !Ref ResultsBucket
33 |
34 | Resources:
35 | #----------------------------------------------------------------------------------------
36 | # Build load balancer.
37 | #----------------------------------------------------------------------------------------
38 | ALB:
39 | Type: AWS::ElasticLoadBalancingV2::LoadBalancer
40 | Properties:
41 | SecurityGroups:
42 | - !Ref ELBSecurityGroup
43 | Subnets:
44 | -
45 | Fn::ImportValue:
46 | !Sub "${BaselineVpcStack}-PublicSubnet1"
47 | -
48 | Fn::ImportValue:
49 | !Sub "${BaselineVpcStack}-PublicSubnet2"
50 | Tags:
51 | - Key: Name
52 | Value: !Join [ "-", [ !Ref AWS::StackName, "ExternalALB"]]
53 | - Key: Application
54 | Value: "OpsExcellence-Lab"
55 | LoadBalancerAttributes:
56 | - Key: idle_timeout.timeout_seconds
57 | Value: 30
58 | Scheme: internal
59 |
60 | ALBTargetGroup:
61 | Type: AWS::ElasticLoadBalancingV2::TargetGroup
62 | Properties:
63 | TargetType: ip
64 | HealthCheckEnabled: True
65 | HealthCheckIntervalSeconds: 60
66 | HealthCheckPath: /
67 | HealthCheckPort: 80
68 | HealthCheckProtocol: HTTP
69 | HealthCheckTimeoutSeconds: 30
70 | HealthyThresholdCount: 3
71 | UnhealthyThresholdCount: 5
72 | TargetGroupAttributes:
73 | - Key: deregistration_delay.timeout_seconds
74 | Value: 0
75 | VpcId:
76 | Fn::ImportValue:
77 | !Sub "${BaselineVpcStack}-VpcId"
78 | Port: 80
79 | Protocol: HTTP
80 | Tags:
81 | - Key: Name
82 | Value: !Join [ "-", [ !Ref AWS::StackName, "ExternalALBTargetGroup"]]
83 | - Key: Application
84 | Value: "OpsExcellence-Lab"
85 |
86 | ALBListener:
87 | Type: AWS::ElasticLoadBalancingV2::Listener
88 | Properties:
89 | LoadBalancerArn: !Ref ALB
90 | Port: 80
91 | Protocol: HTTP
92 | DefaultActions:
93 | - Type: forward
94 | TargetGroupArn: !Ref ALBTargetGroup
95 |
96 | #----------------------------------------------------------------------------------------
97 | # Build load balancer security group.
98 | #----------------------------------------------------------------------------------------
99 | ELBSecurityGroup:
100 | Type: AWS::EC2::SecurityGroup
101 | Properties:
102 | GroupDescription: Enable HTTP from the Internet
103 | VpcId:
104 | Fn::ImportValue:
105 | !Sub "${BaselineVpcStack}-VpcId"
106 | SecurityGroupIngress:
107 | - IpProtocol: tcp
108 | FromPort: 80
109 | ToPort: 80
110 | CidrIp: '0.0.0.0/0'
111 | Tags:
112 | - Key: Name
113 | Value: !Join [ "-", [ !Ref AWS::StackName, "ExternalELBSecurityGroup"]]
114 | - Key: Application
115 | Value: "OpsExcellence-Lab"
116 |
117 | #----------------------------------------------------------------------------------------
118 | # Build ECS Resources.
119 | #----------------------------------------------------------------------------------------
120 |
121 | ContainerSecGroup:
122 | Type: AWS::EC2::SecurityGroup
123 | Properties:
124 | GroupDescription: !Join ['', [!Ref 'AWS::StackName', -ContainerSecGroup]]
125 | VpcId:
126 | Fn::ImportValue:
127 | !Sub "${BaselineVpcStack}-VpcId"
128 | SecurityGroupIngress:
129 | - IpProtocol: tcp
130 | FromPort: 80
131 | ToPort: 80
132 | SourceSecurityGroupId: !Ref ELBSecurityGroup
133 |
134 | ECSCluster:
135 | Type: AWS::ECS::Cluster
136 | Properties:
137 | ClusterName: mysecretword-cluster
138 | CapacityProviders:
139 | - FARGATE
140 | DefaultCapacityProviderStrategy:
141 | - CapacityProvider: FARGATE
142 | Weight: 1
143 |
144 | ECSService:
145 | DependsOn: ALB
146 | Type: AWS::ECS::Service
147 | Properties:
148 | Cluster: !Ref ECSCluster
149 | ServiceName: mysecretword-service
150 | DeploymentConfiguration:
151 | MaximumPercent: 200
152 | MinimumHealthyPercent: 100
153 | DesiredCount: 1
154 | HealthCheckGracePeriodSeconds: 60
155 | LoadBalancers:
156 | - ContainerName: mysecretword-app
157 | ContainerPort: 80
158 | TargetGroupArn: !Ref ALBTargetGroup
159 | TaskDefinition: !Ref TaskDefinition
160 | LaunchType: FARGATE
161 | NetworkConfiguration:
162 | AwsvpcConfiguration:
163 | AssignPublicIp: ENABLED
164 | Subnets:
165 | - Fn::ImportValue:
166 | !Sub "${BaselineVpcStack}-PrivateSubnet1"
167 | - Fn::ImportValue:
168 | !Sub "${BaselineVpcStack}-PrivateSubnet2"
169 | SecurityGroups:
170 | - !Ref ContainerSecGroup
171 |
172 |
173 | TaskDefinition:
174 | Type: AWS::ECS::TaskDefinition
175 | Properties:
176 | Family: !Join ['', [!Ref 'AWS::StackName', -app]]
177 | TaskRoleArn: !Ref ECSTaskRole
178 | ExecutionRoleArn: !Ref ECSTaskExecutionRole
179 | NetworkMode: awsvpc
180 | RequiresCompatibilities:
181 | - FARGATE
182 | Cpu: 256
183 | Memory: 0.5GB
184 | ContainerDefinitions:
185 | - Name: mysecretword-app
186 | Essential: 'true'
187 | Image: !Ref ECRImageURI
188 | LogConfiguration:
189 | LogDriver: awslogs
190 | Options:
191 | awslogs-group: !Ref 'ECSCloudWatchLogsGroup'
192 | awslogs-region: !Ref 'AWS::Region'
193 | awslogs-stream-prefix: !Join ['', [!Ref 'AWS::StackName', -app]]
194 | Environment:
195 | - Name: DBHOST
196 | Value: !GetAtt RDS.Endpoint.Address
197 | - Name: KeyId
198 | Value: !Ref KMSKey
199 | - Name: DBSecret
200 | Value: !Ref RDSSecret
201 | - Name: REGION
202 | Value: !Ref AWS::Region
203 | PortMappings:
204 | - ContainerPort: 80
205 |
206 |
207 | ECSCloudWatchLogsGroup:
208 | Type: AWS::Logs::LogGroup
209 | Properties:
210 | LogGroupName: !Join ['', [!Ref 'AWS::StackName', -app-loggroup]]
211 | RetentionInDays: 365
212 |
213 | #----------------------------------------------------------------------------------------
214 | # Build ECS IAM Roles.
215 | #----------------------------------------------------------------------------------------
216 |
217 | ECSServiceRole:
218 | Type: AWS::IAM::Role
219 | Properties:
220 | AssumeRolePolicyDocument:
221 | Version: 2008-10-17
222 | Statement:
223 | - Sid: ''
224 | Effect: Allow
225 | Principal:
226 | Service: ecs.amazonaws.com
227 | Action: 'sts:AssumeRole'
228 | ManagedPolicyArns:
229 | - 'arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceRole'
230 |
231 | ECSTaskRole:
232 | Type: AWS::IAM::Role
233 | Properties:
234 | RoleName: !Join ['', [!Ref 'AWS::StackName', -ECSTaskRole]]
235 | AssumeRolePolicyDocument:
236 | Statement:
237 | - Effect: Allow
238 | Principal:
239 | Service: ecs-tasks.amazonaws.com
240 | Action: 'sts:AssumeRole'
241 | Path: /
242 | Policies:
243 | - PolicyName: KMSAccess
244 | PolicyDocument:
245 | Version: 2012-10-17
246 | Statement:
247 | - Effect: Allow
248 | Action: '*'
249 | Resource: !GetAtt KMSKey.Arn
250 | - PolicyName: SMAccess
251 | PolicyDocument:
252 | Version: 2012-10-17
253 | Statement:
254 | - Effect: Allow
255 | Action: 'secretsmanager:GetSecretValue'
256 | Resource: '*'
257 | - PolicyName: CloudWatchLogs
258 | PolicyDocument:
259 | Version: 2012-10-17
260 | Statement:
261 | - Effect: Allow
262 | Action: 'logs:*'
263 | Resource: !GetAtt ECSCloudWatchLogsGroup.Arn
264 | - PolicyName: Xray
265 | PolicyDocument:
266 | Version: 2012-10-17
267 | Statement:
268 | - Effect: Allow
269 | Action: 'xray:*'
270 | Resource: '*'
271 | ### This needs to be restricted
272 |
273 | ECSTaskExecutionRole:
274 | Type: AWS::IAM::Role
275 | Properties:
276 | RoleName: !Join ['', [!Ref 'AWS::StackName', -ECSTaskExecutionRole]]
277 | AssumeRolePolicyDocument:
278 | Statement:
279 | - Effect: Allow
280 | Principal:
281 | Service: ecs-tasks.amazonaws.com
282 | Action: 'sts:AssumeRole'
283 | ManagedPolicyArns:
284 | - 'arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'
285 |
286 | #----------------------------------------------------------------------------------------
287 | # Build RDS Instance.
288 | #----------------------------------------------------------------------------------------
289 |
290 | RDS:
291 | Type: AWS::RDS::DBInstance
292 | Properties:
293 | AllocatedStorage: 5
294 | DBInstanceClass: db.t2.micro
295 | Engine: MySQL
296 | MasterUsername: !Join ['', ['{{resolve:secretsmanager:', !Ref RDSSecret, ':SecretString:username}}' ]]
297 | MasterUserPassword: !Join ['', ['{{resolve:secretsmanager:', !Ref RDSSecret, ':SecretString:password}}' ]]
298 | DBSubnetGroupName: !Ref RDSSubnetGroup
299 | VPCSecurityGroups:
300 | - !Ref RDSSecGroup
301 | MultiAZ: False
302 |
303 |
304 | RDSSecGroup:
305 | Type: AWS::EC2::SecurityGroup
306 | Properties:
307 | GroupDescription: !Join ['', [!Ref 'AWS::StackName', -RDSSecGroup]]
308 | VpcId:
309 | Fn::ImportValue:
310 | !Sub "${BaselineVpcStack}-VpcId"
311 | SecurityGroupIngress:
312 | - IpProtocol: tcp
313 | FromPort: 3306
314 | ToPort: 3306
315 | CidrIp:
316 | Fn::ImportValue:
317 | !Sub "${BaselineVpcStack}-VpcCidrBlock"
318 |
319 | RDSSubnetGroup:
320 | Type: "AWS::RDS::DBSubnetGroup"
321 | Properties:
322 | DBSubnetGroupDescription: "Subnet Group"
323 | SubnetIds:
324 | - Fn::ImportValue:
325 | !Sub "${BaselineVpcStack}-PrivateSubnet1"
326 | - Fn::ImportValue:
327 | !Sub "${BaselineVpcStack}-PrivateSubnet2"
328 |
329 | RDSSecret:
330 | Type: AWS::SecretsManager::Secret
331 | Properties:
332 | Description: 'This is the secret for my RDS instance'
333 | GenerateSecretString:
334 | SecretStringTemplate: '{"username": "masteradmin"}'
335 | GenerateStringKey: 'password'
336 | PasswordLength: 16
337 | ExcludeCharacters: '"@/\'
338 |
339 | #----------------------------------------------------------------------------------------
340 | # Build Parameter KMS Key
341 | #----------------------------------------------------------------------------------------
342 |
343 | KMSKey:
344 | Type: "AWS::KMS::Key"
345 | Properties:
346 | KeyPolicy:
347 | Version: 2012-10-17
348 | Id: key--1
349 | Statement:
350 | - Sid: Enable IAM User Permissions
351 | Effect: Allow
352 | Principal:
353 | AWS: !Join
354 | - ""
355 | - - "arn:aws:iam::"
356 | - !Ref "AWS::AccountId"
357 | - ":root"
358 | Action: "kms:*"
359 | Resource: "*"
360 |
361 | #----------------------------------------------------------------------------------------
362 | # Canary
363 | #----------------------------------------------------------------------------------------
364 | CloudWatchSyntheticsRole:
365 | Type: AWS::IAM::Role
366 | Properties:
367 | Description: CloudWatch Synthetics lambda execution role for running canaries
368 | AssumeRolePolicyDocument:
369 | Version: 2012-10-17
370 | Statement:
371 | - Effect: Allow
372 | Principal:
373 | Service: lambda.amazonaws.com
374 | Action: sts:AssumeRole
375 | Condition: {}
376 |
377 | RolePermissions:
378 | Type: AWS::IAM::Policy
379 | Properties:
380 | Roles:
381 | - Ref: CloudWatchSyntheticsRole
382 | PolicyName: CloudWatchSyntheticsPolicy
383 | PolicyDocument:
384 | Version: 2012-10-17
385 | Statement:
386 | - Effect: Allow
387 | Action:
388 | - s3:PutObject
389 | - s3:GetBucketLocation
390 | Resource:
391 | - Fn::Sub: arn:aws:s3:::${ResultsBucket}/*
392 | - Effect: Allow
393 | Action:
394 | - logs:CreateLogStream
395 | - logs:PutLogEvents
396 | - logs:CreateLogGroup
397 | Resource:
398 | - '*'
399 | - Effect: Allow
400 | Action:
401 | - s3:ListAllMyBuckets
402 | Resource: '*'
403 | - Effect: Allow
404 | Resource: '*'
405 | Action: cloudwatch:PutMetricData
406 | Condition:
407 | StringEquals:
408 | cloudwatch:namespace: CloudWatchSynthetics
409 | - Effect: Allow
410 | Resource: '*'
411 | Action:
412 | - ec2:*
413 |
414 | ResultsBucket:
415 | Type: AWS::S3::Bucket
416 | Properties:
417 | BucketEncryption:
418 | ServerSideEncryptionConfiguration:
419 | - ServerSideEncryptionByDefault:
420 | SSEAlgorithm: AES256
421 | DeletionPolicy: Retain
422 |
423 | CanarySecurityGroup:
424 | Type: AWS::EC2::SecurityGroup
425 | Properties:
426 | GroupDescription: Canary Sec Group
427 | VpcId:
428 | Fn::ImportValue:
429 | !Sub "${BaselineVpcStack}-VpcId"
430 | Tags:
431 | - Key: Name
432 | Value: !Join [ "-", [ !Ref AWS::StackName, "CanarySecurityGroup"]]
433 | - Key: Application
434 | Value: "OpsExcellence-Lab"
435 |
436 | SyntheticsCanary:
437 | Type: 'AWS::Synthetics::Canary'
438 | Properties:
439 | Name: mysecretword-canary
440 | ExecutionRoleArn: !GetAtt CloudWatchSyntheticsRole.Arn
441 | Code:
442 | Handler: apiCanaryBlueprint.handler
443 | Script:
444 | |
445 | var synthetics = require('Synthetics');
446 | const log = require('SyntheticsLogger');
447 | apiCanaryBlueprint = async function (ms) {
448 |
449 | // Handle validation for positive scenario
450 | const validateSuccessfull = async function(res) {
451 | return new Promise((resolve, reject) => {
452 | if (res.statusCode < 200 || res.statusCode > 299) {
453 | throw res.statusCode + ' ' + res.statusMessage;
454 | }
455 |
456 | let responseBody = '';
457 | res.on('data', (d) => {
458 | responseBody += d;
459 | });
460 |
461 | res.on('end', () => {
462 | // Add validation on 'responseBody' here if required.
463 | resolve();
464 | });
465 | });
466 | };
467 |
468 |
469 | let requestOptionsStep1 = {
470 | hostname: process.env.CANARY_ENDPOINT,
471 | method: 'POST',
472 | path: '/encrypt',
473 | port: '80',
474 | protocol: 'http:',
475 | body: "{\"Name\":\"Test User\",\"Text\":\"This Message is a Test!\"}",
476 | headers: {"Content-Type":"application/json"}
477 | };
478 | requestOptionsStep1['headers']['User-Agent'] = [synthetics.getCanaryUserAgentString(), requestOptionsStep1['headers']['User-Agent']].join(' ');
479 |
480 | let stepConfig1 = {
481 | includeRequestHeaders: true,
482 | includeResponseHeaders: true,
483 | includeRequestBody: true,
484 | includeResponseBody: true,
485 | restrictedHeaders: [],
486 | continueOnHttpStepFailure: true
487 | };
488 |
489 | await synthetics.executeHttpStep('Verify', requestOptionsStep1, validateSuccessfull, stepConfig1);
490 |
491 |
492 | };
493 | exports.handler = async () => {
494 | return await apiCanaryBlueprint();
495 | };
496 | ArtifactS3Location:
497 | Fn::Join:
498 | - ''
499 | - - s3://
500 | - Ref: ResultsBucket
501 | RuntimeVersion: syn-nodejs-puppeteer-3.5
502 | Schedule:
503 | Expression: 'rate(1 minute)'
504 | DurationInSeconds: 0
505 | RunConfig:
506 | TimeoutInSeconds: 60
507 | EnvironmentVariables: { "CANARY_ENDPOINT" : !GetAtt ALB.DNSName }
508 | VPCConfig:
509 | SecurityGroupIds:
510 | - !Ref CanarySecurityGroup
511 | SubnetIds:
512 | -
513 | Fn::ImportValue:
514 | !Sub "${BaselineVpcStack}-PrivateSubnet1"
515 | -
516 | Fn::ImportValue:
517 | !Sub "${BaselineVpcStack}-PrivateSubnet2"
518 | VpcId:
519 | Fn::ImportValue:
520 | !Sub "${BaselineVpcStack}-VpcId"
521 | FailureRetentionPeriod: 30
522 | SuccessRetentionPeriod: 30
523 | StartCanaryAfterCreation: true
524 | Tags:
525 | - Key: Name
526 | Value: !Join [ "-", [ !Ref AWS::StackName, "Canary"]]
527 | - Key: Application
528 | Value: "OpsExcellence-Lab"
529 | - Key: TargetEndpoint
530 | Value: !GetAtt ALB.DNSName
531 |
532 | SyntheticsCanaryDurationAlarm:
533 | Type: AWS::CloudWatch::Alarm
534 | Properties:
535 | AlarmDescription: Canary Alarm for My Secret Word
536 | AlarmName: mysecretword-canary-duation-alarm
537 | AlarmActions:
538 | - !Ref SystemEventTopic
539 | ComparisonOperator: GreaterThanOrEqualToThreshold
540 | EvaluationPeriods: 12
541 | DatapointsToAlarm: 3
542 | Dimensions:
543 | - Name: CanaryName
544 | Value: mysecretword-canary
545 | - Name: StepName
546 | Value: Verify
547 | Namespace: "CloudWatchSynthetics"
548 | MetricName: "Duration"
549 | Period: 30
550 | Statistic: Average
551 | Threshold: 5000
552 | TreatMissingData: ignore
553 |
554 | SystemEventTopic:
555 | Type: AWS::SNS::Topic
556 | Properties:
557 | TopicName: SystemEventTopic
558 | Subscription:
559 | - Endpoint: !Ref SystemOpsNotificationEmail
560 | Protocol: "Email"
561 |
562 | SystemOwnersTopic:
563 | Type: AWS::SNS::Topic
564 | Properties:
565 | TopicName: SystemOwnersTopic
566 | Subscription:
567 | - Endpoint: !Ref SystemOwnerNotificationEmail
568 | Protocol: "Email"
569 |
--------------------------------------------------------------------------------
/Code/templates/playbook_investigate_application_resources.yml:
--------------------------------------------------------------------------------
1 | Parameters:
2 | PlaybookIAMRole:
3 | Type: String
4 |
5 | Resources:
6 | PlaybookInvestigateAlarm:
7 | Type: "AWS::SSM::Document"
8 | Properties:
9 | DocumentType: Automation
10 | Name: Playbook-Investigate-Application-Resources
11 | Content:
12 | schemaVersion: '0.3'
13 | assumeRole: "{{AutomationAssumeRole}}"
14 | parameters:
15 | Resources:
16 | description: (Required) The Stringified Resources list from Gather Resource Alarm Output.
17 | type: String
18 | AutomationAssumeRole:
19 | type: String
20 | default: !Ref PlaybookIAMRole
21 | description: (Optional) The ARN of the role that allows Automation to perform the actions on your behalf.
22 | outputs:
23 | - Inspect_Playbook_Results.Result
24 | mainSteps:
25 | - name: Gather_ELB_Statistics
26 | action: aws:executeScript
27 | description: Gather ELB Statistics
28 | outputs:
29 | - Name: Result
30 | Selector: $.Payload.Result
31 | Type: String
32 | inputs:
33 | Runtime: python3.6
34 | Handler: handler
35 | InputPayload:
36 | Resourceslist: '{{Resources}}'
37 | Script: |-
38 | import json
39 | import re
40 | from datetime import datetime,timedelta
41 | import boto3
42 | import os
43 |
44 | def arn_deconstruct(arn):
45 | arnlist = arn.split(":")
46 |
47 | service=arnlist[2]
48 | region=arnlist[3]
49 | accountid=arnlist[4]
50 | resources = arnlist[5].split("/")
51 | servicetype = resources[0]
52 | servicemode = resources[1]
53 | resourcename = resources[2]
54 | resourceid = resources[3]
55 |
56 | return {
57 | "Service": service,
58 | "Region": region,
59 | "AccountId": accountid,
60 | "Type": servicetype,
61 | "Mode" : servicemode,
62 | "Name" : resourcename,
63 | "Id" : resourceid
64 | }
65 |
66 |
67 | def get_related_metrics(elb):
68 | cwclient = boto3.client('cloudwatch', region_name = elb['Region'] )
69 | if elb['Mode'] == 'app':
70 | response = cwclient.list_metrics(
71 | Namespace='AWS/ApplicationELB',
72 | Dimensions=[
73 | {
74 | 'Name':'LoadBalancer',
75 | 'Value': '{}/{}/{}'.format(elb['Mode'],elb['Name'],elb['Id'])
76 | }
77 | ]
78 | )
79 | return(response['Metrics'])
80 |
81 |
82 | def get_stat(elb,metricname,stat):
83 | cwclient = boto3.client('cloudwatch', region_name = elb['Region'] )
84 |
85 | if elb['Mode'] == 'app':
86 | response = cwclient.get_metric_statistics(
87 | Namespace='AWS/ApplicationELB',
88 | MetricName=metricname,
89 | StartTime=datetime.now() - timedelta(minutes=60),
90 | EndTime=datetime.now(),
91 | Period=60,
92 | Dimensions=[
93 | {
94 | 'Name':'LoadBalancer',
95 | 'Value': '{}/{}/{}'.format(elb['Mode'],elb['Name'],elb['Id'])
96 | }
97 | ],
98 | Statistics=[stat]
99 | )
100 |
101 | x = []
102 | result = {}
103 | if len(response['Datapoints']) > 0:
104 | for i in response['Datapoints']:
105 | x.append(i[stat])
106 | result['OverallValue'] = cal_average(x)
107 | else:
108 | result['OverallValue'] = None
109 | result['Statistics'] = stat
110 | result['TimeWindow'] = 60
111 | return(result)
112 |
113 | def find_elb_resource(res):
114 | result = None
115 | r = json.loads(res['Resourceslist'])
116 | for i in r:
117 | if i['Type'] == 'AWS::ElasticLoadBalancingV2::Listener':
118 | result = i['PhysicalResourceId']
119 | return result
120 |
121 | def cal_average(num):
122 | sum_num = 0
123 | for t in num:
124 | sum_num = sum_num + t
125 |
126 | avg = sum_num / len(num)
127 | return avg
128 |
129 | def myconverter(o):
130 | if isinstance(o, datetime):
131 | return o.__str__()
132 |
133 | def handler(event, context):
134 |
135 | arn = find_elb_resource(event)
136 | result = {}
137 |
138 | if arn is not None:
139 | elb = arn_deconstruct(arn)
140 |
141 | metricslist = get_related_metrics(elb)
142 | result['TargetResponseTime'] = get_stat(elb,'TargetResponseTime','Average')
143 | result['Target2XXCount'] = get_stat(elb,'HTTPCode_Target_2XX_Count','Sum')
144 | result['Target3XXCount'] = get_stat(elb,'HTTPCode_Target_3XX_Count','Sum')
145 | result['Target4XXCount'] = get_stat(elb,'HTTPCode_Target_4XX_Count','Sum')
146 | result['Target5XXCount'] = get_stat(elb,'HTTPCode_Target_5XX_Count','Sum')
147 | result['TargetConnectionErrorCount'] = get_stat(elb,'TargetConnectionErrorCount','Sum')
148 | result['UnHealthyHostCount'] = get_stat(elb,'UnHealthyHostCount','Average')
149 | result['ActiveConnectionCount'] = get_stat(elb,'ActiveConnectionCount','Sum')
150 | result['ELB3XXCount'] = get_stat(elb,'HTTPCode_ELB_3XX_Count','Sum')
151 | result['ELB4XXCount'] = get_stat(elb,'HTTPCode_ELB_4XX_Count','Sum')
152 | result['ELB5XXCount'] = get_stat(elb,'HTTPCode_ELB_5XX_Count','Sum')
153 | result['ELB500Count'] = get_stat(elb,'HTTPCode_ELB_500_Count','Sum')
154 | result['ELB502Count'] = get_stat(elb,'HTTPCode_ELB_502_Count','Sum')
155 | result['ELB503Count'] = get_stat(elb,'HTTPCode_ELB_503_Count','Sum')
156 | result['ELB504Count'] = get_stat(elb,'HTTPCode_ELB_504_Count','Sum')
157 |
158 | serialized_result = json.dumps(result, default = myconverter )
159 | result['Result'] = json.dumps(json.loads(serialized_result))
160 |
161 | return result
162 | - name: Gather_RDS_Config
163 | action: aws:executeScript
164 | description: Gather RDS Configurations
165 | outputs:
166 | - Name: Result
167 | Selector: $.Payload.Result
168 | Type: String
169 | inputs:
170 | Runtime: python3.6
171 | Handler: handler
172 | InputPayload:
173 | Resourceslist: '{{Resources}}'
174 | Script: |-
175 | import json
176 | import re
177 | from datetime import datetime,timedelta
178 | import boto3
179 | import os
180 |
181 | def arn_deconstruct(arn):
182 | arnlist = arn.split(":")
183 |
184 | service=arnlist[2]
185 | region=arnlist[3]
186 | accountid=arnlist[4]
187 | resources = arnlist[5].split("/")
188 | servicetype = resources[0]
189 | servicemode = resources[1]
190 | resourcename = resources[2]
191 | resourceid = resources[3]
192 |
193 | return {
194 | "Service": service,
195 | "Region": region,
196 | "AccountId": accountid,
197 | "Type": servicetype,
198 | "Mode" : servicemode,
199 | "Name" : resourcename,
200 | "Id" : resourceid
201 | }
202 |
203 |
204 |
205 | def get_rds_config(rdsname):
206 | rdsclient = boto3.client('rds')
207 |
208 | res = rdsclient.describe_db_instances(
209 | DBInstanceIdentifier=rdsname
210 | )
211 | result = res['DBInstances'][0]
212 |
213 | return(result)
214 |
215 | def get_rds_parameters(rdsparamgroups):
216 | result = []
217 | rdsclient = boto3.client('rds')
218 |
219 | for i in rdsparamgroups:
220 | name = i['DBParameterGroupName']
221 | res = rdsclient.describe_db_parameters(
222 | DBParameterGroupName=name
223 | )
224 | x = {
225 | 'DBParamGroup' : i,
226 | 'Parameters' : res['Parameters']
227 | }
228 | result.append(x)
229 |
230 | return result
231 |
232 |
233 | def find_rds_resource(res):
234 | result = None
235 | r = json.loads(res['Resourceslist'])
236 | for i in r:
237 | if i['Type'] == 'AWS::RDS::DBInstance':
238 | result = i['PhysicalResourceId']
239 | return result
240 |
241 | def cal_average(num):
242 | sum_num = 0
243 | for t in num:
244 | sum_num = sum_num + t
245 |
246 | avg = sum_num / len(num)
247 | return avg
248 |
249 | def myconverter(o):
250 | if isinstance(o, datetime):
251 | return o.__str__()
252 |
253 | def handler(event, context):
254 | param = None
255 | result = {}
256 |
257 | rdsrsname = find_rds_resource(event)
258 | rdsconfig = get_rds_config(rdsrsname)
259 |
260 | if len(rdsconfig['DBParameterGroups']) > 0:
261 | param = get_rds_parameters(rdsconfig['DBParameterGroups'])
262 |
263 | result['Result'] = json.dumps({
264 | 'config' : json.loads(json.dumps(rdsconfig,default = myconverter)),
265 | 'parameters' : param
266 | })
267 |
268 | return result
269 | - name: Gather_RDS_Statistics
270 | action: aws:executeScript
271 | description: Gather RDS Statistics
272 | outputs:
273 | - Name: Result
274 | Selector: $.Payload.Result
275 | Type: String
276 | inputs:
277 | Runtime: python3.6
278 | Handler: handler
279 | InputPayload:
280 | Resourceslist: '{{Resources}}'
281 | Script: |-
282 | import json
283 | import re
284 | from datetime import datetime,timedelta
285 | import boto3
286 | import os
287 |
288 | def arn_deconstruct(arn):
289 | arnlist = arn.split(":")
290 |
291 | service=arnlist[2]
292 | region=arnlist[3]
293 | accountid=arnlist[4]
294 | resources = arnlist[5].split("/")
295 | servicetype = resources[0]
296 | servicemode = resources[1]
297 | resourcename = resources[2]
298 | resourceid = resources[3]
299 |
300 | return {
301 | "Service": service,
302 | "Region": region,
303 | "AccountId": accountid,
304 | "Type": servicetype,
305 | "Mode" : servicemode,
306 | "Name" : resourcename,
307 | "Id" : resourceid
308 | }
309 |
310 |
311 | def get_related_metrics(rdsname):
312 | cwclient = boto3.client('cloudwatch')
313 | response = cwclient.list_metrics(
314 | Namespace='AWS/RDS',
315 | Dimensions=[
316 | {
317 | 'Name':'DBInstanceIdentifier',
318 | 'Value': rdsname
319 | }
320 | ]
321 | )
322 | return(response['Metrics'])
323 |
324 |
325 | def get_stat(rdsname,metricname,stat):
326 | cwclient = boto3.client('cloudwatch')
327 |
328 | response = cwclient.get_metric_statistics(
329 | Namespace='AWS/RDS',
330 | MetricName=metricname,
331 | StartTime=datetime.now() - timedelta(minutes=60),
332 | EndTime=datetime.now(),
333 | Period=60,
334 | Dimensions=[
335 | {
336 | 'Name':'DBInstanceIdentifier',
337 | 'Value': rdsname
338 | }
339 | ],
340 | Statistics=[stat]
341 | )
342 |
343 | x = []
344 | result = {}
345 | if len(response['Datapoints']) > 0:
346 | for i in response['Datapoints']:
347 | x.append(i[stat])
348 | result['OverallValue'] = cal_average(x)
349 | else:
350 | result['OverallValue'] = None
351 | result['Statistics'] = stat
352 | result['TimeWindow'] = 60
353 | return(result)
354 |
355 |
356 | def find_rds_resource(res):
357 | result = None
358 | r = json.loads(res['Resourceslist'])
359 | for i in r:
360 | if i['Type'] == 'AWS::RDS::DBInstance':
361 | result = i['PhysicalResourceId']
362 | return result
363 |
364 | def cal_average(num):
365 | sum_num = 0
366 | for t in num:
367 | sum_num = sum_num + t
368 |
369 | avg = sum_num / len(num)
370 | return avg
371 |
372 | def myconverter(o):
373 | if isinstance(o, datetime):
374 | return o.__str__()
375 |
376 | def handler(event, context):
377 |
378 | rdsrsname = find_rds_resource(event)
379 | metrics = get_related_metrics(rdsrsname)
380 | result = {}
381 | output = {}
382 |
383 | result['BinLogDiskUsage'] = get_stat(rdsrsname,'BinLogDiskUsage','Sum')
384 | result['BurstBalance'] = get_stat(rdsrsname,'BurstBalance','Average')
385 | result['CPUUtilization'] = get_stat(rdsrsname,'CPUUtilization','Average')
386 | result['CPUCreditUsage'] = get_stat(rdsrsname,'CPUCreditUsage','Sum')
387 | result['CPUCreditBalance'] = get_stat(rdsrsname,'CPUCreditBalance','Maximum')
388 | result['DatabaseConnections'] = get_stat(rdsrsname,'DatabaseConnections','Sum')
389 | result['DiskQueueDepth'] = get_stat(rdsrsname,'DiskQueueDepth','Maximum')
390 | result['FailedSQLServerAgentJobsCount'] = get_stat(rdsrsname,'FailedSQLServerAgentJobsCount','Average')
391 | result['FreeableMemory'] = get_stat(rdsrsname,'FreeableMemory','Maximum')
392 | result['MaximumUsedTransactionIDs'] = get_stat(rdsrsname,'MaximumUsedTransactionIDs','Maximum')
393 | result['NetworkReceiveThroughput'] = get_stat(rdsrsname,'NetworkReceiveThroughput','Average')
394 | result['NetworkTransmitThroughput'] = get_stat(rdsrsname,'NetworkTransmitThroughput','Average')
395 | result['OldestReplicationSlotLag'] = get_stat(rdsrsname,'OldestReplicationSlotLag','Maximum')
396 | result['ReadIOPS'] = get_stat(rdsrsname,'ReadIOPS','Average')
397 | result['ReadLatency'] = get_stat(rdsrsname,'ReadLatency','Average')
398 | result['ReadThroughput'] = get_stat(rdsrsname,'ReadThroughput','Average')
399 | result['ReplicaLag'] = get_stat(rdsrsname,'ReplicaLag','Average')
400 | result['ReplicationSlotDiskUsage'] = get_stat(rdsrsname,'ReplicationSlotDiskUsage','Maximum')
401 | result['SwapUsage'] = get_stat(rdsrsname,'SwapUsage','Maximum')
402 | result['TransactionLogsDiskUsage'] = get_stat(rdsrsname,'TransactionLogsDiskUsage','Maximum')
403 | result['TransactionLogsGeneration'] = get_stat(rdsrsname,'TransactionLogsGeneration','Average')
404 | result['ReplicationSlotDiskUsage'] = get_stat(rdsrsname,'ReplicationSlotDiskUsage','Maximum')
405 | result['WriteIOPS'] = get_stat(rdsrsname,'WriteIOPS','Average')
406 | result['WriteLatency'] = get_stat(rdsrsname,'WriteLatency','Average')
407 | result['WriteThroughput'] = get_stat(rdsrsname,'WriteThroughput','Average')
408 | output['Result'] = json.dumps(result)
409 |
410 | return output
411 | - name: Gather_ECS_Statistics
412 | action: aws:executeScript
413 | description: Gather ECS Service CloudWatch metrics
414 | outputs:
415 | - Name: Result
416 | Selector: $.Payload.Result
417 | Type: String
418 | inputs:
419 | Runtime: python3.6
420 | Handler: handler
421 | InputPayload:
422 | Resourceslist: '{{Resources}}'
423 | Script: |-
424 | import json
425 | import re
426 | from datetime import datetime,timedelta
427 | import boto3
428 | import os
429 |
430 | def arn_deconstruct(arn):
431 | arnlist = arn.split(":")
432 |
433 | service=arnlist[2]
434 | region=arnlist[3]
435 | accountid=arnlist[4]
436 | resources = arnlist[5].split("/")
437 | servicetype = resources[0]
438 | clustername = resources[1]
439 | servicename = resources[2]
440 |
441 | return {
442 | "Service": service,
443 | "Region": region,
444 | "AccountId": accountid,
445 | "Type": servicetype,
446 | "ClusterName" : clustername,
447 | "ServiceName" : servicename
448 | }
449 |
450 |
451 | def get_related_metrics(res):
452 | cwclient = boto3.client('cloudwatch', region_name = res['Region'] )
453 |
454 | response = cwclient.list_metrics(
455 | Namespace='AWS/ECS',
456 | Dimensions=[
457 | {
458 | 'Name':'ServiceName',
459 | 'Value': res['ServiceName']
460 | },
461 | {
462 | 'Name':'ClusterName',
463 | 'Value': res['ClusterName']
464 | }
465 | ]
466 | )
467 | return(response['Metrics'])
468 |
469 |
470 | def get_stat(res,metricname,stat):
471 | cwclient = boto3.client('cloudwatch', region_name = res['Region'] )
472 |
473 | response = cwclient.get_metric_statistics(
474 | Namespace='AWS/ECS',
475 | MetricName=metricname,
476 | StartTime=datetime.now() - timedelta(minutes=6),
477 | EndTime=datetime.now(),
478 | Period=60,
479 | Dimensions=[
480 | {
481 | 'Name':'ServiceName',
482 | 'Value': res['ServiceName']
483 | },
484 | {
485 | 'Name':'ClusterName',
486 | 'Value': res['ClusterName']
487 | }
488 | ],
489 | Statistics=[stat]
490 | )
491 |
492 | x = []
493 | result = {}
494 | if len(response['Datapoints']) > 0:
495 | for i in response['Datapoints']:
496 | x.append(i[stat])
497 | result['OverallValue'] = cal_average(x)
498 | else:
499 | result['OverallValue'] = None
500 | result['Statistics'] = stat
501 | result['TimeWindow'] = 60
502 | # result['Datapoints'] = response['Datapoints']
503 | return(result)
504 |
505 |
506 | def find_ecsservice_resource(res):
507 | result = None
508 | r = json.loads(res['Resourceslist'])
509 | for i in r:
510 | if i['Type'] == 'AWS::ECS::Service':
511 | result = i['PhysicalResourceId']
512 | return result
513 |
514 | def cal_average(num):
515 | sum_num = 0
516 | for t in num:
517 | sum_num = sum_num + t
518 |
519 | avg = sum_num / len(num)
520 | return avg
521 |
522 | def myconverter(o):
523 | if isinstance(o, datetime):
524 | return o.__str__()
525 |
526 | def handler(event, context):
527 |
528 | arn = find_ecsservice_resource(event)
529 | result = {}
530 |
531 | if arn is not None:
532 | ecsservice = arn_deconstruct(arn)
533 | result = {}
534 | output = {}
535 | result['CPUUtilization'] = get_stat(ecsservice,'CPUUtilization','Maximum')
536 | result['MemoryUtilization'] = get_stat(ecsservice,'MemoryUtilization','Maximum')
537 | serialized_result = json.dumps(result,default = myconverter )
538 | result = json.loads(serialized_result)
539 | output['Result']=json.dumps(result)
540 |
541 | result = output
542 |
543 | return result
544 | - name: Gather_ECS_Error_Logs
545 | action: aws:executeScript
546 | description: Search and gather error in ECS logs
547 | outputs:
548 | - Name: Result
549 | Selector: $.Payload.Result
550 | Type: String
551 | inputs:
552 | Runtime: python3.6
553 | Handler: handler
554 | InputPayload:
555 | Resourceslist: '{{Resources}}'
556 | Script: |-
557 | import json
558 | import re
559 | from datetime import datetime,timedelta
560 | import boto3
561 | import os
562 | import time
563 |
564 | def arn_deconstruct(arn):
565 | arnlist = arn.split(":")
566 |
567 | service=arnlist[2]
568 | region=arnlist[3]
569 | accountid=arnlist[4]
570 | resources = arnlist[5].split("/")
571 | servicetype = resources[0]
572 | servicemode = resources[1]
573 | resourcename = resources[2]
574 |
575 | return {
576 | "Service": service,
577 | "Region": region,
578 | "AccountId": accountid,
579 | "Type": servicetype,
580 | "Mode" : servicemode,
581 | "Name" : resourcename
582 | }
583 |
584 |
585 | def find_ecs_resource(res):
586 | result = {}
587 |
588 | r = json.loads(res['Resourceslist'])
589 | for i in r:
590 | if i['Type'] == 'AWS::ECS::Cluster':
591 | result['ECSCluster'] = i['PhysicalResourceId']
592 | if i['Type'] == 'AWS::ECS::Service':
593 | result['ECSService'] = i['PhysicalResourceId']
594 |
595 | return result
596 |
597 | def find_ecs_logs(ecsclsname,ecssvcname,region):
598 | result = []
599 |
600 | ecsclient = boto3.client('ecs', region_name = region )
601 | ecssvcres = ecsclient.describe_services(
602 | cluster=ecsclsname,
603 | services=[ ecssvcname ]
604 | )
605 |
606 | if len(ecssvcres['services']) > 0:
607 | taskdef = ecssvcres['services'][0]['taskDefinition']
608 | taskdefres = ecsclient.describe_task_definition(
609 | taskDefinition=taskdef
610 | )
611 |
612 | contdef = taskdefres['taskDefinition']['containerDefinitions']
613 |
614 | for i in contdef:
615 | result.append(i['logConfiguration'])
616 |
617 | return result
618 |
619 |
620 | def find_error_in_logs(loglist):
621 | result = []
622 | loggroups = []
623 | logsclient = boto3.client('logs')
624 |
625 | for i in loglist:
626 | options = i['options']
627 | if 'awslogs-group' in options:
628 | loggroups.append(options['awslogs-group'])
629 | now = int(datetime.now().timestamp())
630 |
631 | res = logsclient.start_query(
632 | logGroupNames=loggroups,
633 | startTime = now - 3000,
634 | endTime = now,
635 | queryString = "fields @message | filter @message like \"Error:\" | limit 5"
636 | )
637 |
638 | response = None
639 | while response == None or response['status'] == 'Running':
640 | time.sleep(1)
641 | response = logsclient.get_query_results(
642 | queryId= res['queryId']
643 | )
644 |
645 | if 'results' in response:
646 | if len(response['results']) > 0:
647 | for i in response['results']:
648 | for x in i:
649 | if x['field'] == '@ptr':
650 | pointer = x['value']
651 | recdetail = logsclient.get_log_record(
652 | logRecordPointer=pointer
653 | )
654 |
655 | result.append(recdetail['logRecord'])
656 |
657 | return result
658 |
659 |
660 | def cal_average(num):
661 | sum_num = 0
662 | for t in num:
663 | sum_num = sum_num + t
664 |
665 | avg = sum_num / len(num)
666 | return avg
667 |
668 | def myconverter(o):
669 | if isinstance(o, datetime):
670 | return o.__str__()
671 |
672 | def handler(event, context):
673 | result = {}
674 | x = []
675 | res = find_ecs_resource(event)
676 | ecssvc = arn_deconstruct(res['ECSService'])
677 | loglist = find_ecs_logs(res['ECSCluster'],ecssvc['Name'],ecssvc['Region'])
678 |
679 | x = find_error_in_logs(loglist)
680 |
681 | if len(x) > 0:
682 | result['Result'] = json.dumps(x)
683 | else:
684 | result['Result'] = "None"
685 |
686 | return result
687 | - name: Gather_ECS_Config
688 | action: aws:executeScript
689 | description: Gather ECS Configurations
690 | outputs:
691 | - Name: Result
692 | Selector: $.Payload.Result
693 | Type: String
694 | inputs:
695 | Runtime: python3.6
696 | Handler: handler
697 | InputPayload:
698 | Resourceslist: '{{Resources}}'
699 | Script: |-
700 | import json
701 | import re
702 | from datetime import datetime,timedelta
703 | import boto3
704 | import os
705 |
706 |
707 |
708 | def arn_deconstruct(arn):
709 | arnlist = arn.split(":")
710 |
711 | service=arnlist[2]
712 | region=arnlist[3]
713 | accountid=arnlist[4]
714 | resources = arnlist[5].split("/")
715 | servicetype = resources[0]
716 | clustername = resources[1]
717 | servicename = resources[2]
718 |
719 | return {
720 | "Service": service,
721 | "Region": region,
722 | "AccountId": accountid,
723 | "Type": servicetype,
724 | "ClusterName" : clustername,
725 | "ServiceName" : servicename
726 | }
727 |
728 | def get_ecs_service_config(res):
729 | ecsclient = boto3.client('ecs')
730 |
731 | response = ecsclient.describe_services(
732 | cluster= res['ClusterName'],
733 | services=[ res['ServiceName'] ]
734 | )
735 |
736 | if len(response['services']) > 0:
737 | result = response['services'][0]
738 |
739 | return(result)
740 |
741 | def get_scaling_policy(res):
742 | result = []
743 | aaclient = boto3.client('application-autoscaling')
744 |
745 | response = aaclient.describe_scaling_policies(
746 | ServiceNamespace = 'ecs',
747 | ResourceId = 'service/{}/{}'.format(res['ClusterName'],res['ServiceName'])
748 | )
749 |
750 | if len(response['ScalingPolicies']) > 0:
751 | result = response['ScalingPolicies']
752 |
753 | return(result)
754 |
755 | def find_ecsservice_resource(res):
756 | result = None
757 | r = json.loads(res['Resourceslist'])
758 | for i in r:
759 | if i['Type'] == 'AWS::ECS::Service':
760 | result = i['PhysicalResourceId']
761 | return result
762 |
763 | def cal_average(num):
764 | sum_num = 0
765 | for t in num:
766 | sum_num = sum_num + t
767 |
768 | avg = sum_num / len(num)
769 | return avg
770 |
771 | def myconverter(o):
772 | if isinstance(o, datetime):
773 | return o.__str__()
774 |
775 | def handler(event, context):
776 |
777 |
778 | arn = find_ecsservice_resource(event)
779 | ecsres = arn_deconstruct(arn)
780 | result = {}
781 | output = {}
782 |
783 | if ecsres is not None:
784 | ecssvccfg = json.dumps(get_ecs_service_config(ecsres),default = myconverter )
785 |
786 | result = json.loads(ecssvccfg)
787 | result['scalingpolicies'] = json.loads(json.dumps( get_scaling_policy(ecsres),default = myconverter ))
788 |
789 | output['Result'] = json.dumps(result,default = myconverter )
790 | return output
791 | - name: Inspect_Playbook_Results
792 | action: aws:executeScript
793 | description: Inspect Results
794 | outputs:
795 | - Name: Result
796 | Selector: $.Payload.Result
797 | Type: String
798 | inputs:
799 | Runtime: python3.6
800 | Handler: handler
801 | InputPayload:
802 | ELBStatistics: '{{Gather_ELB_Statistics.Result}}'
803 | RDSConfig: '{{Gather_RDS_Config.Result}}'
804 | RDSStatistics: '{{Gather_RDS_Statistics.Result}}'
805 | ECSStatistics: '{{Gather_ECS_Statistics.Result}}'
806 | ECSErrorLogs: '{{Gather_ECS_Error_Logs.Result}}'
807 | ECSConfig: '{{Gather_ECS_Config.Result}}'
808 | Script: |-
809 | import json
810 | import re
811 | from datetime import datetime,timedelta
812 | import boto3
813 | import os
814 |
815 | def inspect_elb_stats(elbstat):
816 |
817 | result = {}
818 | stat = json.loads(elbstat)
819 |
820 | #Benchmark Max Values
821 | TargetResponseTime = 5
822 | TargetConnectionErrorCount = 0
823 | UnHealthyHostCount = 0
824 | ELB5XXCount = 0
825 | ELB500Count = 0
826 | ELB502Count = 0
827 | ELB503Count = 0
828 | ELB504Count = 0
829 | Target4XXCount = 0
830 | Target5XXCount = 0
831 |
832 | if stat['TargetResponseTime']['OverallValue'] is not None and stat['TargetResponseTime']['OverallValue'] > TargetResponseTime:
833 | result['TargetResponseTime'] = stat['TargetResponseTime']['OverallValue']
834 |
835 | if stat['TargetConnectionErrorCount']['OverallValue'] is not None and stat['TargetConnectionErrorCount']['OverallValue'] > TargetConnectionErrorCount:
836 | result['TargetConnectionErrorCount'] = stat['TargetConnectionErrorCount']['OverallValue']
837 |
838 | if stat['UnHealthyHostCount']['OverallValue'] is not None and stat['UnHealthyHostCount']['OverallValue'] > UnHealthyHostCount :
839 | result['UnHealthyHostCount'] = stat['UnHealthyHostCount']['OverallValue']
840 |
841 | if stat['ELB5XXCount']['OverallValue'] is not None and stat['ELB5XXCount']['OverallValue'] > ELB5XXCount :
842 | result['ELB5XXCount'] = stat['ELB5XXCount']['OverallValue']
843 |
844 | if stat['ELB500Count']['OverallValue'] is not None and stat['ELB500Count']['OverallValue'] > ELB500Count :
845 | result['ELB500Count'] = stat['ELB500Count']['OverallValue']
846 |
847 | if stat['ELB502Count']['OverallValue'] is not None and stat['ELB502Count']['OverallValue'] > ELB502Count:
848 | result['ELB502Count'] = stat['ELB502Count']['OverallValue']
849 |
850 | if stat['ELB503Count']['OverallValue'] is not None and stat['ELB503Count']['OverallValue'] > ELB503Count:
851 | result['ELB503Count'] = stat['ELB503Count']['OverallValue']
852 |
853 | if stat['ELB504Count']['OverallValue'] is not None and stat['ELB504Count']['OverallValue'] > ELB504Count:
854 | result['ELB504Count'] = stat['ELB504Count']['OverallValue']
855 |
856 | if stat['Target4XXCount']['OverallValue'] is not None and stat['Target4XXCount']['OverallValue'] > Target4XXCount :
857 | result['Target4XXCount'] = stat['Target4XXCount']['OverallValue']
858 |
859 | if stat['Target5XXCount']['OverallValue'] is not None and stat['Target5XXCount']['OverallValue'] > Target5XXCount :
860 | result['Target5XXCount'] = stat['Target5XXCount']['OverallValue']
861 |
862 | return result
863 |
864 | def inspect_rds_stats():
865 | #Benchmark Values
866 | DatabaseConnections = 150
867 |
868 |
869 | def inspect_ecs_logs(ecslogs):
870 | #Benchmark Max Values
871 | Count = 0
872 |
873 | result = []
874 | print(ecslogs)
875 |
876 | if ecslogs is not None:
877 | stat = json.loads(ecslogs)
878 | if len(stat) > 0 :
879 | result = stat
880 |
881 | return result
882 |
883 |
884 | def inspect_ecs_stats(ecstat):
885 |
886 | result = {}
887 | stat = json.loads(ecstat)
888 |
889 | #Benchmark Max Values
890 | CPUUtilization = 80
891 |
892 | if stat['CPUUtilization']['OverallValue'] is not None and stat['CPUUtilization']['OverallValue'] > CPUUtilization:
893 | result['CPUUtilization'] = stat['CPUUtilization']['OverallValue']
894 |
895 | return result
896 |
897 | def inspect_ecs_config(ecsconf):
898 |
899 | result = {}
900 | conf = json.loads(ecsconf)
901 |
902 | if 'runningCount' in conf:
903 | result['TaskRunningCount'] = conf['runningCount']
904 |
905 | if 'desiredCount' in conf:
906 | result['TaskDesiredCount'] = conf['desiredCount']
907 |
908 | if 'pendingCount' in conf:
909 | result['TaskPendingCount'] = conf['pendingCount']
910 |
911 | if 'launchType' in conf:
912 | result['LaunchType'] = conf['launchType']
913 |
914 |
915 | return result
916 |
917 | def myconverter(o):
918 | if isinstance(o, datetime):
919 | return o.__str__()
920 |
921 | def handler(event, context):
922 |
923 | result = {}
924 | output = {}
925 |
926 | elbstat = event['ELBStatistics']
927 | output['ELB'] = inspect_elb_stats(elbstat)
928 |
929 | ecsstat = event['ECSStatistics']
930 | ecslogs = event['ECSErrorLogs']
931 | ecsconf = event['ECSConfig']
932 |
933 | output['ECS'] = inspect_ecs_stats(ecsstat)
934 | output['ECS']['CurrentConfig'] =inspect_ecs_config(ecsconf)
935 |
936 |
937 | if ecslogs != "None":
938 | output['ECS']['Logs'] = inspect_ecs_logs(ecslogs)
939 |
940 | x = json.dumps(output, default = myconverter )
941 |
942 | result['Result'] = x
943 |
944 |
945 | return result
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Achieving Operational Excellence Using Automated Playbook and Runbook
2 |
3 | ℹ️ You will run this lab in your own AWS account. Please follow directions at the end of the lab to remove resources to avoid future costs.
4 |
5 | ## Introduction
6 |
7 | This lab is derived from one of the Operational Excellence labs, [Automating operations with Playbooks and Runbooks](https://wellarchitectedlabs.com/operational-excellence/200_labs/), in the AWS Well-Architected Labs.
8 |
9 | Manually running your [runbooks](https://wa.aws.amazon.com/wat.concept.runbook.en.html) and [playbooks](https://wa.aws.amazon.com/wat.concept.playbook.en.html) for operational activities has a number of drawbacks:
10 |
11 | * Activities are prone to errors & difficult to trace.
12 | * Manual activities do not allow your operational practice to scale in line with your business requirements.
13 |
14 | In contrast, implementing automation in these activities has the following benefits:
15 |
16 | * Improved reliability by preventing the introduction of errors through manual processes.
17 | * Increased scalability by allowing non-linear resource investment to operate your workload.
18 | * Increased traceability on your operation through log collection of the automation activity.
19 | * Improved incident response by reducing idle time and automatically triggering activity based on known events.
20 |
21 |
23 |
24 |
25 | At a glance, both **runbooks** and **playbooks** appear to be similar documents that technical users can use to perform operational activities. However, there is an essential difference between them:
26 |
27 | * A [playbook](https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.playbook.en.html) documents processes that guide you through activities to investigate an issue. For example, gathering applicable information, identifying potential sources of failure, isolating faults, or determining the root cause of issues. Playbooks can follow multiple paths and yield more than one outcome.
28 |
29 | * A [runbook](https://wa.aws.amazon.com/wat.concept.runbook.en.html) contains the procedures necessary to achieve a specific outcome. For example, creating a user, rolling back a configuration, or scaling a resource to resolve the identified issue.
30 |
31 |
32 |
33 | This hands-on lab will guide you through the steps to automate your operational activities using runbooks and playbooks built with AWS tools.
34 |
35 | We will show how you can build automated runbooks and playbooks to investigate and remediate application issues using the following AWS services:
36 |
37 | * [Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html)
38 | * [Simple Notification Service](https://aws.amazon.com/sns/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc)
39 | * [Amazon CloudWatch synthetic monitoring](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html)
40 |
41 | ## Prerequisites:
42 |
43 | * An [AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html) that you are able to use for testing. The account should not be used for production purposes.
44 | * An [IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) in your AWS account with full access to [CloudFormation](https://aws.amazon.com/cloudformation/), [Amazon ECS](https://aws.amazon.com/ecs/), [Amazon RDS](https://aws.amazon.com/rds/), [Amazon Virtual Private Cloud (VPC)](https://aws.amazon.com/vpc/), [AWS Identity and Access Management (IAM)](https://aws.amazon.com/iam/), and [AWS Cloud9](https://aws.amazon.com/cloud9/).
45 |
46 | ## Costs
47 |
48 | NOTE: If you complete this lab, you will be billed for any applicable AWS resources used that are not covered by the [AWS Free Tier](https://aws.amazon.com/free/).
49 |
50 | This lab walks you through automating operational activities using playbooks and runbooks.
51 | ## Content
52 |
53 | - [Step 1. Deploy the sample application environment](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#step-1-Deploy-the-sample-application-environment)
54 | - [Step 2. Simulate an Application Issue](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#step-2-Simulate-an-Application-Issue)
55 | - [Step 3. Build and Run an Investigative Playbook](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#step-3-Build-and-Run-an-Investigative-Playbook)
56 | - [Step 4. Build and Run Remediation Runbook](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#step-4-Build-and-Run-Remediation-Runbook)
57 | - [Teardown](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#Teardown)
58 | - [Summary](https://github.com/aws-samples/build-and-operate-a-secure-and-successful-cloud-operations-model#Summary)
59 |
60 | ### Step 1. Deploy the sample application environment
61 | In this section, you will prepare a sample application. The application is an API hosted inside a Docker container, using [Amazon Elastic Container Service (ECS)](https://aws.amazon.com/ecs/). The container is accessed via an [Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html).
62 |
63 | The API is a private microservice within your [Amazon Virtual Private Cloud (VPC)](https://aws.amazon.com/vpc/). Communication to the API can only be done privately through routes within the VPC subnet. In our lab example, the business owner has agreed to run the API over HTTP protocol to simplify the implementation.
64 |
65 | The API has two actions, which encrypt and decrypt information. Each is triggered by a REST POST call to the */encrypt* or */decrypt* method, as appropriate.
66 |
67 | * The *encrypt* action will allow you to pass a secret message along with a 'Name' key as the identifier and it will return a 'Secret Key Id' that you can use later to decrypt your message.
68 | * The *decrypt* action allows you to then decrypt the secret message passing along the 'Name' key and 'Secret Key Id' you obtained before to get your secret message.
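
The two calls described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the exact JSON field names are not given in this section, so `Name`, `Text`, and `Key` are hypothetical, and `BASE_URL` is a placeholder for the load balancer address of your own deployment. The sketch builds the requests without sending them, because the API is only reachable from inside the VPC.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the load balancer DNS name of your deployment.
BASE_URL = "http://example-alb.internal"


def build_request(action, payload):
    """Build (but do not send) a JSON POST request for the /encrypt or /decrypt method."""
    if action not in ("encrypt", "decrypt"):
        raise ValueError("action must be 'encrypt' or 'decrypt'")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The field names below are assumptions made for illustration only.
encrypt_req = build_request("encrypt", {"Name": "alice", "Text": "my secret message"})
decrypt_req = build_request("decrypt", {"Name": "alice", "Key": "<Secret Key Id from encrypt>"})

# To actually send a request from a host inside the VPC:
# with urllib.request.urlopen(encrypt_req, timeout=10) as resp:
#     print(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect the payload before running it against the real endpoint.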
69 |
70 | Both actions will make write and read calls to the application database hosted in [Amazon Relational Database Service (RDS)](https://aws.amazon.com/rds/), where the encrypted messages are stored.
71 |
72 | The following step-by-step instructions will provision the application that you will use with your **runbooks** and **playbooks**.
73 |
74 | Explore the contents of the CloudFormation script to learn more about the environment and application.
75 |
76 | You will use this sample application as a sandbox to simulate an application performance issue, and then start your **runbooks** and **playbooks** to investigate and remediate it automatically.
77 |
78 | #### Action items in this section:
79 |
80 | 1. You will prepare the [Cloud9](https://aws.amazon.com/cloud9/) workspace launched with a new VPC.
81 | 2. You will run the application build script from the Cloud9 console to build the sample application as shown in the diagram below.
82 |
83 | 
84 |
85 |
86 | ### 1.0 Prepare Cloud9 workspace.
87 |
88 | In this first step you will provision a [CloudFormation](https://aws.amazon.com/cloudformation/) stack that builds a Cloud9 workspace along with the VPC for the sample application. This Cloud9 workspace will be used to run the provisioning script of the sample application. You can choose to deploy the stack in one of the regions below.
89 |
90 | 1. Click on the link below to deploy the stack. This will take you to the CloudFormation console in your account. Use `walab-ops-base-resources` as the stack name, and take the default values for all options.
91 |
92 | * **us-west-2** : [here](https://console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/create/review?stackName=walab-ops-base-resources&templateURL=https://aws-well-architected-labs-singapore.s3.ap-southeast-1.amazonaws.com/Operations/200_Automating_operations_with_playbooks_and_runbooks/base_resources.yml)
93 | * **ap-southeast-2** : [here](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-2#/stacks/create/review?stackName=walab-ops-base-resources&templateURL=https://aws-well-architected-labs-singapore.s3.ap-southeast-1.amazonaws.com/Operations/200_Automating_operations_with_playbooks_and_runbooks/base_resources.yml)
94 | * **ap-southeast-1** : [here](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-1#/stacks/create/review?stackName=walab-ops-base-resources&templateURL=https://aws-well-architected-labs-singapore.s3.ap-southeast-1.amazonaws.com/Operations/200_Automating_operations_with_playbooks_and_runbooks/base_resources.yml)
95 |
96 | 2. Once the template is deployed, wait until the CloudFormation Stack reaches the **CREATE_COMPLETE** state.
97 |
98 | 
99 |
100 |
101 | ### 1.1 Run the build application script.
102 |
103 | Next, run the build script from the Cloud9 workspace to build and deploy your application environment, as follows:
104 |
105 | 1. From the main console, access the **Cloud9** service.
106 | 2. Click the **Environments** section in the left menu, locate the environment named `WellArchitectedOps-walab-ops-base-resources` as shown below, then click **Open**.
107 |
108 | 
109 |
110 | 3. Your environment will bootstrap the lab repository. You should see terminal output similar to the following:
111 |
112 | 
113 |
114 | When the bootstrap script finishes you will see a folder called `aws-well-architected-labs`.
115 |
116 | 4. In the IDE terminal console, change directory to the working folder where the build script is located:
117 |
118 | ```
119 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/scripts/
120 | ```
121 |
122 | 5. Copy and paste the command below, replacing `sysops@domain.com` with an email address representing the system operators team and `owner@domain.com` with an email address representing the business owner. The application will send its notifications to these addresses.
123 |
124 |
125 | ```
126 | bash build_application.sh walab-ops-base-resources sysops@domain.com owner@domain.com
127 | ```
128 |
129 |
130 | > The `build_application.sh` script will build and deploy your sample application, along with the architecture that hosts it.
131 | The application architecture will have capabilities to notify systems operators and owners, leveraging [Amazon Simple Notification Service](https://aws.amazon.com/sns/).
132 | You can use the same email address for `sysops@domain.com` and `owner@domain.com` if you need to, but ensure that you have both values specified.
133 |
134 | If you have deployed Amazon ECS in your account before, you may encounter an `InvalidInput` error with the message "AWSServiceRoleForECS has been taken" while running the `build_application.sh` script. You can safely ignore this message, as the script will continue despite the error.
135 |
136 | 6. The above command builds and provisions the application stack. The script should take about 20 minutes to finish.
137 |
138 | 
139 |
140 | > The `build_application.sh` script will build the application Docker image and push it to [Amazon ECR](https://aws.amazon.com/ecr/), where it is used by [Amazon ECS](https://aws.amazon.com/ecs/). Once the build script completes, another CloudFormation stack containing the application resources (ECS, RDS, ALB, and others) will be deployed.
141 |
142 | 7. In the CloudFormation console, you should see a new stack being deployed called `walab-ops-sample-application`. Wait until the stack reaches **CREATE_COMPLETE** state and proceed to the next step.
143 |
144 | 
145 |
146 | ### 1.2. Confirm the application status.
147 |
148 | Once the application is successfully deployed, go to your [CloudFormation console](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-2) and locate the stack named `walab-ops-sample-application`.
149 |
150 | 1. Confirm that the stack is in the **CREATE_COMPLETE** state.
151 | 2. Open the **Outputs** tab to find the output details that will be required later.
152 | 3. Take note of the DNS value specified under **OutputApplicationEndpoint** in the Outputs.
153 |
154 | The screenshot below shows the output from the CloudFormation stack:
155 |
156 | 
157 |
158 | 4. Check for an email sent to the system operator and owner addresses you specified when running the `build_application.sh` script. These addresses are also visible in the CloudFormation stack parameters **SystemOpsNotificationEmail** and **SystemOwnerNotificationEmail**.
159 |
160 | 5. Click the `confirm subscription` link in each email to subscribe.
161 |
162 | 
163 |
164 | > Two emails will be sent to your address. Please make sure you subscribe to **both** of them.
165 |
166 | ### 1.3. Test the application.
167 |
168 | In this section, you will be testing the encrypt API action from the deployed application.
169 |
170 | The application will take a JSON payload with `Name` as the identifier and `Text` key as the value of the secret message.
171 |
172 | The application will encrypt the value under `Text` key with a designated KMS key and store the encrypted text in the RDS database with `Name` as the primary key.
173 |
174 | > **Note:** For simplicity, the sample application reuses the same KMS key for each record generated.
175 |
176 |
177 |
178 |
179 | 1. In the **Cloud9** terminal, run the commands below, replacing `ApplicationEndpoint` with the **OutputApplicationEndpoint** value from the previous step. The second command runs [curl](https://curl.se/) to send a POST request with the secret message payload `{"Name":"Bob","Text":"Run your operations as code"}` to the API.
180 |
181 | ```
182 | ALBEndpoint="ApplicationEndpoint"
183 | ```
184 |
185 | ```
186 | curl --header "Content-Type: application/json" --request POST --data '{"Name":"Bob","Text":"Run your operations as code"}' $ALBEndpoint/encrypt
187 | ```
188 |
189 | 2. Once you run this command, you should see output as follows:
190 |
191 | ```
192 | {"Message":"Data encrypted and stored, keep your key save","Key":"EncryptKey"}
193 | ```
194 |
195 | 3. Take note of the encryption key value under **Key**.
196 |
197 | 4. Run the command below to test the decrypt API, replacing `EncryptKey` with the key you noted in the previous step.
198 |
199 |
200 | ```
201 | curl --header "Content-Type: application/json" --request GET --data '{"Name":"Bob","Key":"EncryptKey"}' $ALBEndpoint/decrypt
202 |
203 | ```
204 |
205 | 5. Once you run the command you should see the following output:
206 |
207 | ```
208 | {"Text":"Run your operations as code"}
209 | ```
210 |
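If you prefer to script these calls, the two curl commands above can be approximated in Python. The sketch below only builds the requests and does not send them. The helper names `build_encrypt_request` and `build_decrypt_request` are hypothetical, and the `http://` scheme is an assumption based on curl's default behaviour when no scheme is given.

```python
import json
import urllib.request


# Hypothetical helpers that mirror the curl commands in this section.
def build_encrypt_request(endpoint, name, text):
    """Build the POST request for <endpoint>/encrypt."""
    body = json.dumps({"Name": name, "Text": text}).encode("utf-8")
    return urllib.request.Request(
        f"http://{endpoint}/encrypt",  # assumed scheme: plain HTTP
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_decrypt_request(endpoint, name, key):
    """Build the GET request (carrying a JSON body) for <endpoint>/decrypt."""
    body = json.dumps({"Name": name, "Key": key}).encode("utf-8")
    return urllib.request.Request(
        f"http://{endpoint}/decrypt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


# Placeholder endpoint: replace it with your OutputApplicationEndpoint value.
req = build_encrypt_request("ApplicationEndpoint", "Bob", "Run your operations as code")
print(req.get_method(), req.full_url)
# To send the request: urllib.request.urlopen(req).read()
```

Keeping request construction separate from sending makes it easy to inspect the payload before it reaches the API.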
211 |
212 | ## Congratulations!
213 |
214 | You have now completed the first section of the Lab.
215 |
216 | You should have a sample application API which we will use for the remainder of the lab.
217 |
218 | ### Step 2. Simulate an Application Issue
219 | Understanding the health of your workload is an essential component of Operational Excellence. Defining metrics and thresholds, together with appropriate alerts, ensures that issues can be acknowledged and remediated within an appropriate timeframe.
220 |
221 | In this section of the lab, you will simulate a performance issue within the API. An Amazon CloudWatch Synthetics canary monitor continuously checks the API response time to detect an issue.
222 |
223 | In this example, should the API take longer than 6 seconds to respond, an alert will be created, triggering a notification email.
224 |
225 | #### Action items in this section:
226 |
227 | 1. You will run a script that will send a large amount of traffic to the API.
228 | 2. You will observe and confirm the issue through AWS monitoring tools.
229 |
230 | The following resources have been deployed to perform these actions.
231 |
232 | 
233 |
234 | ### 2.0 Sending traffic to the application
235 |
236 | In this section, you will send multiple concurrent requests to the application, simulating a large surge of incoming traffic. This will overwhelm the API and gradually increase the response time of the application. As a result, the response time measured by the canary monitor exceeds the set threshold, triggering the CloudWatch alarm to send a notification.
237 |
238 | Follow the steps below to continue:
239 |
240 | 1. From the **Cloud9** terminal, run the command shown below to change directory to the working script folder:
241 |
242 | ```
243 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/scripts/
244 | ```
245 |
246 | 2. Confirm that the folder contains `test.json` and that the file contains the following text:
247 |
248 | ```
249 | {"Name":"Test User","Text":"This Message is a Test!"}
250 | ```
251 |
252 | 3. Go to the CloudFormation console and take note of the **OutputApplicationEndpoint** value under the **Outputs** tab of the `walab-ops-sample-application` stack. This is the DNS endpoint of the Application Load Balancer.
253 |
254 |
255 | 
256 |
257 |
258 | 4. Make sure you have tested the application previously (section **1.3 Test the application**), so that the `ALBEndpoint` variable is set in your terminal. Then run the command below:
259 |
260 | ```
261 | bash simulate_request.sh $ALBEndpoint
262 | ```
263 |
264 | This script uses [Apache Benchmark](https://httpd.apache.org/docs/2.4/programs/ab.html) to send 60,000,000 requests, 3,000 concurrent requests at a time.
265 |
266 | When you run the command you will see the output gradually change from a consistently successful 200 response to include 504 time-out responses.
267 |
268 | The requests generated by the script are overwhelming the application API and result in occasional timeouts by your load balancer.
269 |
270 | Keep the command running in the background as you proceed through the lab.
271 |
272 | 
273 |
274 | 
275 |
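To illustrate what "a fixed number of concurrent requests at a time" means, here is a minimal Python sketch of a concurrency-limited request loop that tallies status codes. It is not what `simulate_request.sh` does internally (that script uses Apache Benchmark). The names `run_load` and `fake_send` are hypothetical, and `fake_send` stands in for a real HTTP call so the example runs without a network.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_load(send, total_requests, concurrency):
    """Call send(i) for each request index, at most `concurrency` at a time,
    and tally the HTTP status codes that come back."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(send, range(total_requests)))


def fake_send(request_index):
    """Hypothetical stand-in for an HTTP call: every fifth request times out."""
    return 504 if request_index % 5 == 4 else 200


tally = run_load(fake_send, total_requests=100, concurrency=10)
print(tally)
```

In the lab, the share of 504 responses grows over time as the API becomes saturated. The fixed ratio used here is only for illustration.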
276 |
277 | ### 2.1 Observing the alarm being triggered.
278 |
279 | 1. After approximately 6 minutes, an alarm will be triggered in response to the generated activity. You will receive an email indicating that the CloudWatch alarm has been triggered.
280 |
281 | 
282 |
283 | 2. Check and confirm the alarm by going to the CloudWatch console.
284 |
285 | 3. Click on the Alarms section on the left menu.
286 |
287 | 4. Locate the alarm called `mysecretword-canary-duration-alarm`, which should be in the alarm state.
288 |
289 | 
290 |
291 | 5. Click on the alarm to display the CloudWatch metrics that the alarm is based on.
292 |
293 | 6. The alarm is based on the `Duration` metric emitted by the `mysecretword-canary` CloudWatch Synthetics canary monitor. The `Duration` metric measures how long it takes for the canary requests to receive a response from the application.
294 |
295 | 7. The alarm is triggered whenever the value of the `Duration` metric is above 6 seconds within a 1 minute duration. The latest threshold will be 5000 for 3 datapoints within 6 minutes.
296 |
297 | 
298 |
299 | 8. On the left menu, click on **Application monitoring** and then **Synthetics Canaries**, and locate the canary monitor named `mysecretword-canary`.
300 |
301 | 
302 |
303 | 9. Click on the canary and then select the **Configuration** tab.
304 |
305 | 10. From here you will see the canary configuration and a snippet of the canary script.
306 |
307 | 11. In the canary script section, scroll down to the section that contains `let requestOptionStep1` as shown in the screenshot below. This is the configuration that controls the destination of the request (hostname, path and payload body).
308 |
309 | 
310 |
311 | 12. Click on the **Monitoring** tab.
312 |
313 | 13. From here you will see the visualization of the metrics that the canary monitor generates.
314 |
315 | 14. Locate the `Duration` metric that is being used to trigger the CloudWatch alarm.
316 |
317 | 15. You will see the average duration of the canary requests, representing the time taken to complete each request. A value above 6000 ms signifies that the request has taken more than 6 seconds to receive a response from the application, indicating a performance issue in the API.
318 |
319 | 
320 |
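The "3 datapoints within 6 minutes" condition described in this section can be sketched as a simple "M out of N" check. This is a simplification of how CloudWatch evaluates alarms: it assumes one datapoint per minute and ignores missing data. The function name `alarm_state` is hypothetical. The threshold is left as a parameter because it depends on how the alarm is configured in your stack, and 6000 ms is used below only to mirror the 6-second figure mentioned above.

```python
def alarm_state(durations_ms, threshold_ms, datapoints_to_alarm=3, evaluation_periods=6):
    """Return "ALARM" when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints exceed `threshold_ms`, otherwise "OK"."""
    window = durations_ms[-evaluation_periods:]
    breaching = sum(1 for duration in window if duration > threshold_ms)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical per-minute Duration values, in milliseconds.
healthy = [900, 1100, 1000, 950, 1200, 1050]
degraded = [1000, 1200, 7000, 8000, 900, 6500]

print(alarm_state(healthy, threshold_ms=6000))
print(alarm_state(degraded, threshold_ms=6000))
```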
321 | You have now completed the second section of the lab.
322 |
323 | You should still have `simulate_request.sh` running in the background, simulating a large influx of traffic to your API. This causes the application to respond slowly and time out periodically. The CloudWatch alarm will keep triggering, and performance issue notifications will be sent to your system operator to prompt them into action.
324 |
325 | > This concludes **Section 2** of this lab. In the next section, we will build an automated **playbook** to assist investigation of the issue.
326 |
327 | ### Step 3. Build and Run an Investigative Playbook
328 | The efficiency of issue resolution within an Operations team is directly linked to the tenure and experience of its members. An Operator with prior knowledge of a particular issue has a head start in reaching resolution, because they already understand which logs and metrics were used in previous situations. While this experience is valuable to an Operations group, it also represents a single point of failure and a scalability challenge.
329 |
330 | This is where [playbooks](https://wa.aws.amazon.com/wat.concept.playbook.en.html) become important. Playbooks are a documented set of predefined steps, which are run to identify an issue. The result of each step can be used to either call more steps to run, or alternatively to trigger manual intervention.
331 |
332 | Automating **playbook** activities wherever possible is critical to reducing the time to respond to an incident.
333 |
334 | The AWS Cloud offers multiple services you can use to build an automated playbook, one of which is AWS Systems Manager.
335 |
336 | AWS Systems Manager offers an automation document capability (known within Systems Manager as [runbooks](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html)), which allows for the creation of a series of executable steps to orchestrate your investigation and remediation.
337 | AWS Systems Manager Automation Documents allow a user to run custom scripts, call AWS service APIs, or even run remote commands on cloud or on-premises compute instances.
338 |
339 | In this section, you will focus on creating an automated **playbook** to assist your investigation as a Systems Operator.
340 |
341 | #### Action items in this section:
342 |
343 | 1. You will build a **playbook** to gather information about the workload and query the relevant metrics and logs.
344 | 2. You will run the automation document to investigate your issue.
345 |
346 | ### 3.0 Prepare Automation Document IAM Role
347 |
348 | The Systems Manager Automation Document you are building needs permissions to run the investigation and remediation steps. You will create an IAM role that the document assumes to perform the **playbook** activities. To simplify the deployment process, a CloudFormation template has been provided that you can deploy via the console or the AWS CLI. Please choose one of the following two deployment options:
349 |
350 |
351 | **CloudFormation console deployment steps**
352 |
353 | 1. Download the template [here](/Code/templates/automation_role.yml "Resources template").
354 | 2. Follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template.
355 | 3. Use `waopslab-automation-role` as the **Stack Name**, as this is referenced by other stacks later in the lab.
356 |
357 |
358 |
359 |
360 | **CloudFormation CLI deployment steps (preferred)**
361 |
362 | **Note:** To deploy from the command line, ensure that you have installed and configured AWS CLI with the appropriate credentials.
363 |
364 | 1. From the **Cloud9** terminal change to the appropriate folder as shown:
365 |
366 | ```
367 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
368 | ```
369 |
370 | 2. Then run the command listed below:
371 |
372 | ```
373 | aws cloudformation create-stack --stack-name waopslab-automation-role \
374 | --capabilities CAPABILITY_NAMED_IAM \
375 | --template-body file://automation_role.yml
376 | ```
377 |
378 | 3. Confirm that the stack has been created correctly. You can do this by running the **describe-stacks** command:
379 |
380 | ```
381 | aws cloudformation describe-stacks --stack-name waopslab-automation-role
382 | ```
383 |
384 | Locate the **StackStatus** and confirm it is set to **CREATE_COMPLETE**
385 |
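The status check above can also be scripted. The sketch below assumes you have captured the JSON printed by `describe-stacks`, which contains a `Stacks` list whose entries carry a `StackStatus` field. The function name `stack_status` and the trimmed sample output are hypothetical.

```python
import json


def stack_status(describe_stacks_output):
    """Return the StackStatus of the first stack in describe-stacks JSON output."""
    stacks = json.loads(describe_stacks_output)["Stacks"]
    return stacks[0]["StackStatus"]


# Trimmed, hypothetical example of the JSON printed by describe-stacks.
sample_output = """
{
  "Stacks": [
    {
      "StackName": "waopslab-automation-role",
      "StackStatus": "CREATE_COMPLETE"
    }
  ]
}
"""

status = stack_status(sample_output)
print(status)
```

If you would rather block until the stack is ready, the AWS CLI also provides `aws cloudformation wait stack-create-complete --stack-name waopslab-automation-role`.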
386 |
387 | 1. Once you have deployed the CloudFormation stack above, go to the IAM Console.
388 |
389 | 2. On the side menu, click on **Roles** and locate the IAM role named **AutomationRole**.
390 |
391 | 3. Take note of the ARN of the role, as we will need it later in the lab.
392 |
393 |
394 | 
395 |
396 | ### 3.1 Building the "Gather-Resources" Playbook.
397 |
398 | In preparation for the investigation, you need to know all the services and resources associated with the issue. The email notification does not contain any information about these resources. To gather this information, we will build a **playbook** that acquires all related resources, using the CloudWatch alarm ARN as a reference.
399 |
400 | Codifying your **playbook** with AWS Systems Manager allows for maximum code reusability. This reduces the overhead of rewriting code that has an identical objective.
401 |
402 | 
403 |
404 |
405 | > **Note:** Follow these steps to build and run the playbook. Choose one of the guides below to deploy using a CloudFormation template (via the console or the AWS CLI) or step by step in the Systems Manager console.
406 |
407 |
408 | **CloudFormation console deployment steps**
409 |
410 | Download the template [here](/Code/templates/playbook_gather_resources.yml "Resources template").
411 |
412 |
413 | If you decide to deploy the stack from the console, follow the steps below:
414 |
415 | 1. Follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template.
416 | 2. Use `waopslab-playbook-gather-resources` as the **Stack Name**, as this is referenced by other stacks later in the lab.
417 |
418 |
419 |
420 |
421 | **CloudFormation CLI deployment steps (preferred)**
422 |
423 | **Note:** To deploy from the command line, ensure that you have installed and configured AWS CLI with the appropriate credentials.
424 |
425 | 1. From the **Cloud9** terminal, change to the templates folder as shown:
426 |
427 | ```
428 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
429 | ```
430 |
431 | 2. Then run the command below, replacing `AutomationRoleArn` with the ARN of the **AutomationRole** you noted in step **3.0 Prepare Automation Document IAM Role**.
432 |
433 | ```
434 | aws cloudformation create-stack --stack-name waopslab-playbook-gather-resources \
435 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
436 | --template-body file://playbook_gather_resources.yml
437 | ```
438 |
439 | Example:
440 |
441 |
442 | ```
443 | aws cloudformation create-stack --stack-name waopslab-playbook-gather-resources \
444 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/AutomationRole \
445 | --template-body file://playbook_gather_resources.yml
446 | ```
447 |
448 | **Note:** If you use named profiles with the AWS CLI, adjust the command accordingly.
449 |
450 | 3. Confirm that the stack has been created correctly. Run the **describe-stacks** command below, locate the **StackStatus**, and confirm it is set to **CREATE_COMPLETE**.
451 |
452 | ```
453 | aws cloudformation describe-stacks --stack-name waopslab-playbook-gather-resources
454 | ```
455 |
456 |
457 |
458 |
459 |
460 | **Systems Manager console step-by-step**
461 |
462 | 1. Go to the AWS Systems Manager console. Click **Documents** under **Shared Resources** on the left menu. Then click **Create Automation** as shown in the screenshot below:
463 |
464 | 
465 |
466 | 2. Enter `Playbook-Gather-Resources` in the **Name** field and copy the notes shown below into the **Document description** field.
467 |
468 | ```
469 | # What does this **playbook** do?
470 |
471 | Query the CloudWatch Synthetics Canary and look for all resources related to the application based on its Application tag. This **playbook** takes the ARN of the CloudWatch Alarm triggered by the canary as input.
472 |
473 | Note : Application resources must be deployed using CloudFormation and properly tagged accordingly.
474 |
475 | ## Actions taken in this playbook.
476 | 1. Describe CloudWatch Alarm ARN and identify the Canary resource.
477 | 2. Describe the Canary resource to gather the value of 'Application' tag
478 | 3. Gather CloudFormation Stack with the same value of 'Application' tag.
479 | 4. List all resources in CloudFormation Stack.
480 | 5. Parse list of resources into String Output.
481 | ```
482 |
483 | 3. In the **Assume role** field, enter the IAM role ARN we created in the previous section **3.0 Prepare Automation Document IAM Role**.
484 |
485 | 
486 |
487 |
488 | 4. Expand the **Input Parameters** section and enter `AlarmARN` as the **Parameter name**. Set the type as `String` and **Required** as `Yes`. This will define a Parameter within our playbook, so that the value of the CloudWatch Alarm ARN can be passed into the playbook to run the action.
489 |
490 | 
491 |
492 | 5. In the **Step 1** section, specify `Gather_Resources_For_Alarm` as the **Step name** and select `aws:executeScript` as the **Action type**.
493 |
494 | 6. Under **Inputs**, set `Python3.6` as the **Runtime** and specify `script_handler` as the **Handler**.
495 | 7. Paste the Python code below into the **Script** section.
496 |
497 | 
498 |
499 | ```
500 | import json
501 | import re
502 | from datetime import datetime
503 | import boto3
504 | import os
505 |
506 | def arn_deconstruct(arn):
507 |     arnlist = arn.split(":")
508 |     service=arnlist[2]
509 |     region=arnlist[3]
510 |     accountid=arnlist[4]
511 |     servicetype=arnlist[5]
512 |     name=arnlist[6]
513 |     return {
514 |         "Service": service,
515 |         "Region": region,
516 |         "AccountId": accountid,
517 |         "Type": servicetype,
518 |         "Name": name
519 |     }
520 |
521 | def locate_alarm_source(alarm):
522 |     cwclient = boto3.client('cloudwatch', region_name = alarm['Region'] )
523 |     alarm_source = {}
524 |     alarm_detail = cwclient.describe_alarms(AlarmNames=[alarm['Name']])
525 |
526 |     if len(alarm_detail['MetricAlarms']) > 0:
527 |         metric_alarm = alarm_detail['MetricAlarms'][0]
528 |         namespace = metric_alarm['Namespace']
529 |
530 |         # Condition if Namespace is CloudWatch Synthetics
531 |         if namespace == 'CloudWatchSynthetics':
532 |             if 'Dimensions' in metric_alarm:
533 |                 dimensions = metric_alarm['Dimensions']
534 |                 for i in dimensions:
535 |                     if i['Name'] == 'CanaryName':
536 |                         source_name = i['Value']
537 |                         alarm_source['Type'] = namespace
538 |                         alarm_source['Name'] = source_name
539 |                         alarm_source['Region'] = alarm['Region']
540 |                         alarm_source['AccountId'] = alarm['AccountId']
541 |
542 |     result = alarm_source
543 |     return result
544 |
545 | def locate_canary_endpoint(canaryname,region):
546 |     result = None
547 |     synclient = boto3.client('synthetics', region_name = region )
548 |     res = synclient.get_canary(Name=canaryname)
549 |     canary = res['Canary']
550 |     if 'Tags' in canary:
551 |         if 'TargetEndpoint' in canary['Tags']:
552 |             target_endpoint = canary['Tags']['TargetEndpoint']
553 |             result = target_endpoint
554 |     return result
555 |
556 |
557 | def locate_app_tag_value(resource):
558 |     result = None
559 |     if resource.get('Type') == 'CloudWatchSynthetics':
560 |         synclient = boto3.client('synthetics', region_name = resource['Region'] )
561 |         res = synclient.get_canary(Name=resource['Name'])
562 |         canary = res['Canary']
563 |         if 'Tags' in canary:
564 |             if 'Application' in canary['Tags']:
565 |                 apptag_val = canary['Tags']['Application']
566 |                 result = apptag_val
567 |     return result
568 |
569 | def locate_app_resources_by_tag(tag,region):
570 |     result = None
571 |
572 |     # Search CloudFormation Stacks for tag
573 |     cfnclient = boto3.client('cloudformation', region_name = region )
574 |     stack_list = cfnclient.list_stacks(StackStatusFilter=['CREATE_COMPLETE','ROLLBACK_COMPLETE','UPDATE_COMPLETE','UPDATE_ROLLBACK_COMPLETE','IMPORT_COMPLETE','IMPORT_ROLLBACK_COMPLETE'] )
575 |     for stack in stack_list['StackSummaries']:
576 |         app_resources_list = []
577 |         stack_name = stack['StackName']
578 |         stack_details = cfnclient.describe_stacks(StackName=stack_name)
579 |         stack_info = stack_details['Stacks'][0]
580 |         if 'Tags' in stack_info:
581 |             for t in stack_info['Tags']:
582 |                 if t['Key'] == 'Application' and t['Value'] == tag:
583 |                     app_stack_name = stack_info['StackName']
584 |                     app_resources = cfnclient.describe_stack_resources(StackName=app_stack_name)
585 |                     for resource in app_resources['StackResources']:
586 |                         app_resources_list.append(
587 |                             {
588 |                                 'PhysicalResourceId' : resource['PhysicalResourceId'],
589 |                                 'Type': resource['ResourceType']
590 |                             }
591 |                         )
592 |                     result = app_resources_list
593 |
594 |     return result
595 |
596 | def script_handler(event, context):
597 |     result = {}
598 |     arn = event['CloudWatchAlarmARN']
599 |     alarm = arn_deconstruct(arn)
600 |     # Locate tag from CloudWatch Alarm
601 |     alarm_source = locate_alarm_source(alarm) # Identify Alarm Source
602 |     tag_value = locate_app_tag_value(alarm_source) #Identify tag from source
603 |
604 |     if alarm_source.get('Type') == 'CloudWatchSynthetics':
605 |         endpoint = locate_canary_endpoint(alarm_source['Name'],alarm_source['Region'])
606 |         result['CanaryEndpoint'] = endpoint
607 |
608 |     # Locate cloudformation with tag
609 |     resources = locate_app_resources_by_tag(tag_value,alarm['Region'])
610 |     result['ApplicationStackResources'] = json.dumps(resources)
611 |
612 |     return result
613 | ```
614 |
615 | 8. Under **Additional inputs** specify the input value to the step, passing in the parameter we created previously. To do this, specify below values:
616 |
617 | * `InputPayload` as the **Input name**
618 | * `CloudWatchAlarmARN: '{{AlarmARN}}'` as the **Input Value**.
619 |
620 | 9. Under **Outputs** specify below values:
621 |
622 | * `Resources` as **Name**
623 | * `$.Payload.ApplicationStackResources` as **Selector**
624 | * `String` as **Type**
625 |
626 | 10. Once your settings match the screenshot below, click on **Create Automation**
627 |
628 | 
629 |
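To see how the `AlarmARN` parameter reaches the script, the sketch below runs the ARN-parsing step on its own, without any AWS calls. The example ARN is hypothetical: it combines the placeholder account ID and the alarm name used elsewhere in this lab. Passing `maxsplit=6` is a small variation on the lab script, added here so that an alarm name containing a colon is kept whole.

```python
def arn_deconstruct(arn):
    """Split a CloudWatch alarm ARN into its colon-separated parts."""
    # maxsplit=6 keeps any ':' inside the alarm name intact
    # (a variation on the lab script, which uses a plain split).
    parts = arn.split(":", 6)
    return {
        "Service": parts[2],
        "Region": parts[3],
        "AccountId": parts[4],
        "Type": parts[5],
        "Name": parts[6],
    }


# Hypothetical ARN built from the placeholder values used in this lab.
example_arn = (
    "arn:aws:cloudwatch:ap-southeast-2:000000000000:"
    "alarm:mysecretword-canary-duration-alarm"
)

# The InputPayload mapping delivers the parameter to the handler like this.
event = {"CloudWatchAlarmARN": example_arn}
alarm = arn_deconstruct(event["CloudWatchAlarmARN"])
print(alarm)
```

The `Region` and `Name` values are what the script then uses to look up the alarm and its canary.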
630 |
631 |
632 |
633 | Once the automation document is created, you can test it.
634 |
635 | 1. You can find the newly created document under the **Owned by me** tab of the **Documents** section in the Systems Manager console.
636 |
637 | 
638 |
639 | 2. Click on the **playbook** called `Playbook-Gather-Resources` and click on **Execute Automation** to run your playbook.
640 | 3. Paste in the CloudWatch Alarm ARN (you can find this ARN in the email notification from section **2.1 Observing the alarm being triggered**) and click on **Execute** to test the playbook.
641 |
642 | 
643 |
644 | 4. Once the **playbook** run has completed successfully, click on the **Step Id** to see the final message and output of the step. The output lists all the resources of the application.
645 |
646 | 
647 |
648 | 5. **Copy** the Resources list output from the section highlighted in the screenshot below. This list consists of all the resources defined in the CloudFormation stack related to our application. It includes the Elastic Load Balancer, ECS, and RDS resource IDs, which we can now use to further investigate the underlying issue.
649 |
650 | 
651 |
652 | 6. **Paste** the output into a temporary location, such as a text editor, for now. You will need this value in the next step.
653 |
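The copied output is a JSON string holding a list of `PhysicalResourceId` and `Type` pairs, as produced by the script in the previous step. As a rough illustration of how that list can be narrowed down to the ELB, ECS, and RDS resources, consider the sketch below. The function name `pick_resources`, the sample resource IDs, and the exact resource types in the sample are assumptions, since the real values depend on your stack.

```python
import json

# CloudFormation resource type prefixes for the services of interest.
WANTED_PREFIXES = (
    "AWS::ElasticLoadBalancingV2::",
    "AWS::ECS::",
    "AWS::RDS::",
)


def pick_resources(resources_output, prefixes=WANTED_PREFIXES):
    """Return only the resources whose Type starts with one of `prefixes`."""
    resources = json.loads(resources_output)
    return [r for r in resources if r["Type"].startswith(prefixes)]


# Hypothetical sample shaped like the playbook's Resources output.
sample_resources = json.dumps([
    {"PhysicalResourceId": "example-load-balancer", "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer"},
    {"PhysicalResourceId": "example-ecs-service", "Type": "AWS::ECS::Service"},
    {"PhysicalResourceId": "example-db-instance", "Type": "AWS::RDS::DBInstance"},
    {"PhysicalResourceId": "example-kms-key", "Type": "AWS::KMS::Key"},
])

for resource in pick_resources(sample_resources):
    print(resource["Type"], resource["PhysicalResourceId"])
```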
654 | ### 3.2 Building the "Investigate-Application-Resources" Playbook.
655 |
656 | In the previous step, you created a **playbook** that finds all related AWS resources in the application.
657 | In this step, you will create a **playbook** that interrogates those resources and captures recent metrics and logs, to look for insights and better understand the root cause of the issue.
658 |
659 | In practice, a **playbook** can take many different actions to investigate, depending on the scenario presented by the issue. The purpose of this lab is to showcase how you can use a **playbook** to aid investigation, rather than to advise on a specific action path.
660 |
661 | Therefore, in this lab we will assume an example scenario. The **playbook** will look at the metrics and logs of the ELB, ECS, and RDS services in the resource list. It will then highlight the metrics and logs that are outside the normal operational thresholds.
662 |
663 |
664 | 
665 |
666 | Please follow the below instructions to build this playbook:
667 |
668 | > **Note:** We will deploy this **playbook** via a CloudFormation template to simplify deployment. Please follow the steps below to deploy the template via the CLI or the console.
669 |
670 |
671 |
672 | **CloudFormation console deployment steps**
673 |
674 | Download the template [here](/Code/templates/playbook_investigate_application_resources.yml "Resources template").
675 |
676 |
677 | If you decide to deploy the stack from the console, follow the steps below:
678 |
679 | 1. Please follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template.
680 | 2. Use `waopslab-playbook-investigate-resources` as the **Stack Name**, as this is referenced by other stacks later in the lab.
681 |
682 |
683 |
684 |
685 |
686 | **CloudFormation CLI deployment steps (preferred)**
687 |
688 |
689 | 1. From the Cloud9 terminal, change to the required folder as shown:
690 |
691 | ```
692 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
693 | ```
694 |
695 | 2. Run the command below, replacing `AutomationRoleArn` with the ARN of the **AutomationRole** you noted in step **3.0 Prepare Automation Document IAM Role**.
696 |
697 | ```
698 | aws cloudformation create-stack --stack-name waopslab-playbook-investigate-resources \
699 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
700 | --template-body file://playbook_investigate_application_resources.yml
701 | ```
702 | Example:
703 |
704 | ```
705 | aws cloudformation create-stack --stack-name waopslab-playbook-investigate-resources \
706 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/AutomationRole \
707 | --template-body file://playbook_investigate_application_resources.yml
708 | ```
709 |
710 | 3. Confirm that the stack has been created correctly. You can do this by running the **describe-stacks** command as follows:
711 |
712 | ```
713 | aws cloudformation describe-stacks --stack-name waopslab-playbook-investigate-resources
714 | ```
715 |
716 | 4. Locate the **StackStatus** and confirm it is set to **CREATE_COMPLETE**
717 |
718 |
719 |
720 | When the document is created, you can go ahead and run a quick test.
721 |
722 | You can find the newly created document under the **Owned by me** tab of the **Documents** section in the Systems Manager console.
723 |
724 | 1. Click on the **playbook** called `Playbook-Investigate-Application-Resources` and click on **Execute Automation** to run the playbook.
725 |
726 | 2. Under **Resources**, paste in the resources list you noted from the output of the previous **playbook** (refer to section **3.1 Building the "Gather-Resources" Playbook**) and click on **Execute**.
727 |
728 | 
729 |
730 | 3. Under **Executed Steps** you should be able to see each of the steps the **playbook** ran. If you view the content of the document, you can see the code and find out what each step does.
731 |
732 | 
733 |
734 | For simplicity, the table below lists the description and outputs of each step.
735 |
736 |
737 |
738 |
739 |
740 | | Step Name | Description | Output list |
741 | |------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------|
742 | | Gather_ELB_Statistics | Go through the resource list and locate the ELB. Query data from the ELB CloudWatch metrics, looking at metrics from the last 60 minutes. | TargetResponseTime (Average) |
743 | ||| HTTPCode_Target_2XX_Count (Sum) |
744 | ||| HTTPCode_Target_3XX_Count (Sum) |
745 | ||| HTTPCode_Target_4XX_Count (Sum) |
746 | ||| HTTPCode_Target_5XX_Count (Sum) |
747 | ||| TargetConnectionErrorCount (Sum) |
748 | ||| UnHealthyHostCount (Average) |
749 | ||| ActiveConnectionCount (Sum) |
750 | ||| HTTPCode_ELB_3XX_Count (Sum) |
751 | ||| HTTPCode_ELB_4XX_Count (Sum) |
752 | ||| HTTPCode_ELB_5XX_Count (Sum) |
753 | ||| HTTPCode_ELB_500_Count (Sum) |
754 | ||| HTTPCode_ELB_502_Count (Sum) |
755 | ||| HTTPCode_ELB_503_Count (Sum) |
756 | ||| HTTPCode_ELB_504_Count (Sum) |
757 | | Gather_RDS_Statistics | Go through resource list and locate the RDS resource. Query data from the RDS CloudWatch metrics, looking at metrics from the last 60 minutes. | BinLogDiskUsage (Sum) |
759 | ||| BurstBalance (Average) |
760 | ||| CPUUtilization (Average) |
761 | ||| CPUCreditUsage (Sum) |
762 | ||| CPUCreditBalance (Maximum) |
763 | ||| DatabaseConnections (Sum) |
764 | ||| DiskQueueDepth (Maximum) |
765 | ||| FailedSQLServerAgentJobsCount (Average) |
766 | ||| FreeableMemory (Maximum) |
767 | ||| MaximumUsedTransactionIDs (Maximum) |
768 | ||| NetworkReceiveThroughput (Average) |
769 | ||| OldestReplicationSlotLag (Average) |
770 | ||| ReadIOPS (Average) |
771 | ||| ReadLatency (Average) |
772 | ||| ReadThroughput (Maximum) |
773 | ||| ReplicaLag (Average) |
774 | ||| ReplicationSlotDiskUsage (Maximum) |
775 | ||| SwapUsage (Maximum) |
776 | ||| TransactionLogsDiskUsage (Maximum) |
777 | ||| TransactionLogsGeneration (Average) |
779 | ||| WriteIOPS (Average) |
780 | ||| WriteLatency (Average) |
781 | ||| WriteThroughput (Average) |
782 | | Gather_ECS_Statistics | Go through the resource list and locate the ECS resource. Query data from the ECS CloudWatch metrics, looking at metrics from the last 6 minutes. | CPUUtilization (Maximum) |
783 | ||| MemoryUtilization (Maximum) |
784 | | Gather_ECS_Error_Logs | Go through the resource list and locate the ECS Service. Search in CloudWatch logs for any Error occurrence. ||
785 | | Gather_ECS_Config | Go through the resource list and locate the ECS resource. Describe the ECS service configuration. ||
786 | | Gather_RDS_Config | Go through the resource list and locate the RDS resource. Describe RDS Instance Config & Parameters. ||
787 | | Inspect_Playbook_Results | Go through the outputs of the steps above and check whether any value exceeds its threshold. | TargetResponseTime = 5 (ELB) |
788 | ||| TargetConnectionErrorCount = 0 (ELB) |
789 | ||| UnHealthyHostCount = 0 (ELB) |
790 | ||| ELB5XXCount = 0 (ELB) |
791 | ||| ELB500Count = 0 (ELB) |
792 | ||| ELB502Count = 0 (ELB) |
793 | ||| ELB503Count = 0 (ELB) |
794 | ||| ELB504Count = 0 (ELB) |
795 | ||| Target4XXCount = 0 (ELB) |
796 | ||| Target5XXCount = 0 (ELB) |
797 | ||| CPUUtilization = 80 (ECS) |
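
The threshold check in the `Inspect_Playbook_Results` step can be pictured with the short sketch below. It is not the code from the playbook: the function name `find_breaches`, the sample observations, and the reading of "above the threshold" as a strict greater-than comparison are all assumptions made for illustration. Only the threshold values are taken from the table.

```python
# Threshold values copied from the Inspect_Playbook_Results row of the table.
THRESHOLDS = {
    "TargetResponseTime": 5,
    "TargetConnectionErrorCount": 0,
    "UnHealthyHostCount": 0,
    "ELB5XXCount": 0,
    "ELB500Count": 0,
    "ELB502Count": 0,
    "ELB503Count": 0,
    "ELB504Count": 0,
    "Target4XXCount": 0,
    "Target5XXCount": 0,
    "CPUUtilization": 80,
}


def find_breaches(observed, thresholds=THRESHOLDS):
    """Return {metric: (value, threshold)} for each observed metric above its threshold."""
    breaches = {}
    for name, value in observed.items():
        limit = thresholds.get(name)
        # Metrics without a configured threshold are ignored (assumption).
        if limit is not None and value > limit:
            breaches[name] = (value, limit)
    return breaches


# Hypothetical observations, loosely modelled on the incident in this lab.
sample = {
    "TargetResponseTime": 9.2,
    "ELB504Count": 37,
    "UnHealthyHostCount": 0,
    "CPUUtilization": 97.5,
}
print(find_breaches(sample))
```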
798 |
799 |
800 | 4. Wait until all steps are completed successfully.
801 |
802 |
803 |
804 | ### 3.3 Building the "Investigate-Application-From-Alarm" Playbook.
805 |
806 | So far we have 2 separate playbooks. The first playbook gathers the list of resources associated with the application. The second playbook queries the relevant resources and investigates the appropriate logs and metrics.
807 |
808 | In this step we will automate our **playbooks** further by creating a parent **playbook** that orchestrates the 2 investigative **playbooks**. We will also add a step that sends a notification to our Developers and System Owners.
809 |
810 | 
811 |
812 | Follow the instructions below to build the parent Playbook.
813 |
814 | > **Note:** Select a step-by-step guide below to build the parent playbook using either the AWS console or a CloudFormation template.
815 |
816 |
817 | Click here for CloudFormation Console deployment step
818 |
819 | Download the template [here.](/Code/templates/playbook_investigate_application.yml "Resources template")
820 |
821 |
822 | If you decide to deploy the stack from the console, follow these steps:
823 |
824 | 1. Please follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template.
825 | 2. Use `waopslab-playbook-investigate-application` as the **Stack Name**, as this is referenced by other stacks later in the lab.
826 | 3. In the parameter input screen, under **PlaybookIAMRole** enter the ARN of the **playbook** IAM role (defined in the previous step), and under **NotificationEmail** enter the email address that should receive **playbook** notifications.
827 |
828 |
829 |
830 |
831 | Click here for CloudFormation CLI deployment step (Preferred way)
832 |
833 |
834 | 1. From the Cloud9 terminal, change to the required folder as shown:
835 |
836 | ```
837 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
838 | ```
839 |
840 | 2. Then run the command below, replacing `AutomationRoleArn` with the ARN of the **playbook** IAM role:
841 |
842 | ```
843 | aws cloudformation create-stack --stack-name waopslab-playbook-investigate-application \
844 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
845 | --template-body file://playbook_investigate_application.yml
846 | ```
847 | Example:
848 |
849 | ```
850 | aws cloudformation create-stack --stack-name waopslab-playbook-investigate-application \
851 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/xxxx-playbook-role \
852 | --template-body file://playbook_investigate_application.yml
853 | ```
854 |
855 | **Note:** If you use named profiles with the AWS CLI, adjust the command accordingly (for example, by adding the `--profile` option).
856 |
857 | Confirm that the stack has installed correctly. You can do this by running the **describe-stacks** command as follows:
858 | ```
859 | aws cloudformation describe-stacks --stack-name waopslab-playbook-investigate-application
860 | ```
861 |
862 | Locate the **StackStatus** and confirm it is set to **CREATE_COMPLETE**.
863 |
864 |
865 |
866 |
867 | Click here for Console step-by-step guide
868 |
869 | 1. From the AWS Systems Manager console, click on **Documents** as shown below. Once you are there, click on **Create Automation**.
870 |
871 | 
872 |
873 | 2. Next, enter `Playbook-Investigate-Application-From-Alarm` in the **Name** field and paste the notes shown below into the **Description** box. This provides a description of the **playbook**. Systems Manager supports markdown in the description, so feel free to format it as required.
874 |
875 |
876 | ```
877 | # What does this **playbook** do?
878 |
879 | This **playbook** will run **Playbook-Gather-Resources** to gather Application resources monitored by Canary.
880 |
881 | Then subsequently run **Playbook-Investigate-Application-Resources** to Investigate the resources for issues.
882 |
883 | Outputs of the investigation will be sent to SNS Topic Subscriber
884 |
885 | ```
886 |
887 | 3. In the **Assume role** field, enter the ARN of the IAM role we created in the previous step.
888 |
889 | 4. Under **Input Parameters**, enter `AlarmARN` as the **Parameter name**. Set the type to `String` and **Required** to `Yes`. This defines a parameter in our playbook so that the ARN of the CloudWatch Alarm can be passed to the main step that runs the action.
890 |
891 | 5. Add another parameter by clicking on the **Add a parameter** link. Enter `SNSTopicARN` as the **Parameter name**. Set the type to `String` and **Required** to `Yes`. This defines a second parameter in our playbook so that we can send a notification to the Owner and Developer.
892 |
893 | 
894 |
895 |
896 | 6. Click **Add Step** and create the first step with the action type `aws:executeAutomation` and the step name `PlaybookGatherAppResourcesCanaryCloudWatchAlarm`.
897 |
898 | 7. Specify `Playbook-Gather-Resources` as the **Document name** under Inputs, and under **Additional inputs** specify `RuntimeParameters` with `{"AlarmARN":'{{AlarmARN}}'}` as its value (refer to the screenshot below). This step runs the `Playbook-Gather-Resources` **playbook** that we created previously.
899 |
900 | 
901 |
902 | 8. Once this step is defined, add another step by clicking on **Add Step** at the bottom of the section.
903 |
904 | 9. For this second step, specify the **Step name** as `PlaybookInvestigateAppResourcesELBECSRDS` and an action type of `aws:executeAutomation`.
905 |
906 | 10. Specify `Playbook-Investigate-Application-Resources` as the **Document name** and `RuntimeParameters` as `Resources: '{{PlaybookGatherAppResourcesCanaryCloudWatchAlarm.Output}}'`. This takes the output of the first step and passes it to the second **playbook**, which investigates the associated resources.
907 |
908 | 
909 |
910 | 11. For the last step, take the investigation output from the second step and send it to the SNS topic to which our owners, developers and admins are subscribed.
911 |
912 | 12. Specify the **Step name** as `AWSPublishSNSNotification` and the action type as `aws:executeAutomation`.
913 | 13. Specify `AWS-PublishSNSNotification` as the **Document name** and `RuntimeParameters` as shown below. This passes the output of the second step, which contains summary data of the investigation, to `AWS-PublishSNSNotification`, which publishes it to the SNS topic specified in the parameters so that subscribers receive it by email.
914 |
915 |
916 | ```
917 | TopicArn: '{{SNSTopicARN}}'
918 | Message: '{{ PlaybookInvestigateAppResourcesELBECSRDS.Output }}'
919 | ```
920 |
921 | 
922 |
923 | 14. Our **playbook** runs investigative tasks and sends the result to an SNS topic to which our systems administrators and engineers subscribe. For this we need an SNS topic for the **playbook** to send notifications to. Follow the instructions in this [link](https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html) to create a Standard SNS topic and name it `PlaybookNotificationSNSTopic`.
924 |
925 | 15. Once you've created the topic, subscribe your email address to it using the instructions [here](https://docs.aws.amazon.com/sns/latest/dg/sns-email-notifications.html).
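
To see how the three console steps fit together, the sketch below assembles roughly equivalent Automation document content as a Python dictionary and prints it as JSON. It is an approximation written for this walkthrough, not the content of `playbook_investigate_application.yml`. The helper names `build_parent_playbook` and `run_document` are hypothetical, and the role ARN is the placeholder used in the CLI example.

```python
import json


def build_parent_playbook(assume_role_arn: str) -> dict:
    """Assemble Automation document content that mirrors the console steps."""

    def run_document(step_name, document_name, runtime_parameters):
        # Every step in this playbook uses aws:executeAutomation to run another document.
        return {
            "name": step_name,
            "action": "aws:executeAutomation",
            "inputs": {
                "DocumentName": document_name,
                "RuntimeParameters": runtime_parameters,
            },
        }

    return {
        "schemaVersion": "0.3",
        "assumeRole": assume_role_arn,
        # No defaults are given, so both values must be supplied at execution time.
        "parameters": {
            "AlarmARN": {"type": "String"},
            "SNSTopicARN": {"type": "String"},
        },
        "mainSteps": [
            run_document(
                "PlaybookGatherAppResourcesCanaryCloudWatchAlarm",
                "Playbook-Gather-Resources",
                {"AlarmARN": "{{AlarmARN}}"},
            ),
            run_document(
                "PlaybookInvestigateAppResourcesELBECSRDS",
                "Playbook-Investigate-Application-Resources",
                {"Resources": "{{PlaybookGatherAppResourcesCanaryCloudWatchAlarm.Output}}"},
            ),
            run_document(
                "AWSPublishSNSNotification",
                "AWS-PublishSNSNotification",
                {
                    "TopicArn": "{{SNSTopicARN}}",
                    "Message": "{{PlaybookInvestigateAppResourcesELBECSRDS.Output}}",
                },
            ),
        ],
    }


# Placeholder ARN, as in the CLI example; replace it with your own role ARN.
document = build_parent_playbook("arn:aws:iam::000000000000:role/xxxx-playbook-role")
print(json.dumps(document, indent=2))
```

Each step consumes the `Output` of the step before it, which is why the order of `mainSteps` matters.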
926 |
927 |
928 |
929 | ### 3.4 Executing investigation Playbook.
930 |
931 | You can now run the **playbook** to discover the result of the investigation.
932 |
933 | 1. Go to the **Output** section of the deployed CloudFormation stack `walab-ops-sample-application` and take note of the output values you will need in the steps below.
934 |
935 | 2. Go to the Systems Manager Automation document we just created in the previous step, `Playbook-Investigate-Application-From-Alarm`.
936 |
937 | 3. Run the **playbook**, passing the alarm ARN as the **AlarmARN** input value, along with the **SNSTopicARN**.
938 |
939 | * You can get the **AlarmARN** from the email that you received from the CloudWatch Alarm, as described in step **3.1 Building the "Gather-Resources" Playbook.** in this lab.
940 | * To get the value for **SNSTopicARN**, go to the CloudFormation console output of the `walab-ops-sample-application` stack and copy the value of **OutputSystemEventTopicArn**.
941 |
942 | 
943 |
944 |
945 | 4. When the **playbook** has completed, an email will be sent to you containing a summary of the investigation completed by the playbook, as shown.
946 |
947 | 
948 |
949 | 5. Copy the message section and paste it into a JSON linter tool such as [jsonlint.com](http://jsonlint.com) to give it a clearer structure. The result of the **playbook** investigation might vary slightly, but the overall findings should be similar to the screenshot below.
950 |
951 | 
952 |
953 | 6. In the generated report you should see a large **ELB504Count** error count and a high **TargetResponseTime** from the load balancer. This explains the delay reported by our canary alarm.
954 |
955 | If you then look at the ECS summary, you will notice that the **TaskRunningCount** is only 1, with a relatively high average **CPUUtilization**. The script calculates the average of the maximum values for the ECS service over the last 6 minutes. If you do not see a CPUUtilization value in the JSON, you can confirm this by going to the ECS service console and clicking on the **Metrics** tab.
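
As a rough illustration of "the average of the maximum values", the sketch below averages the `Maximum` field of a list of CloudWatch-style datapoints. The function name `average_of_maximums` and the six sample values are hypothetical, and the one-minute period is an assumption. The playbook's own script may be structured differently.

```python
from statistics import mean


def average_of_maximums(datapoints):
    """Average the per-period Maximum values from CloudWatch-style datapoints."""
    maximums = [point["Maximum"] for point in datapoints]
    # Return None when no datapoints were reported for the window.
    return mean(maximums) if maximums else None


# Six hypothetical one-minute datapoints for the ECS service CPUUtilization metric.
sample = [{"Maximum": value} for value in (96.0, 99.0, 100.0, 98.0, 97.0, 98.0)]
cpu_average = average_of_maximums(sample)
print(cpu_average)
```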
956 |
957 | 
958 |
959 | Therefore, it is likely that the immediate cause of the latency is a resource constraint at the application API level running in ECS. If we increase the number of tasks in the ECS service, the load is spread across more tasks, which should relieve some of the CPU utilization pressure.
960 |
961 | With all of the information provided by our **playbook** findings, we should be able to determine the next course of action to remediate the issue.
962 |
963 | This concludes **Section 3** of this lab. Move on to the next section to build the remediation runbook.
964 |
965 | ### Step 4. Build and Run Remediation Runbook
966 |
967 | In contrast to playbooks, **runbooks** are procedures that accomplish specific tasks to achieve an outcome. In the previous section, you identified an issue with CPU utilization, which occurs because there is only 1 ECS task running in the cluster. This could be remediated through the use of auto-scaling.
968 |
969 | However, implementing this requires preparation and planning. When an incident occurs, operations teams should have a defined escalation path for the issue. Depending on the criticality of the system they should also be equipped to do what is necessary to ensure system availability is protected while the escalation occurs.
970 |
971 | In this section, you will build an automated **runbook** to remediate the CPU utilization issue by increasing the number of tasks in the ECS cluster. Your automated **runbook** will notify the owner of the workload and give them the option to intercept the scale-up action should they choose not to proceed.
972 |
973 | #### Action items in this section:
974 |
975 | 1. You will build a **runbook** to scale up the ECS cluster, with the approval mechanism.
976 | 2. You will execute the **runbook** and observe the recovery of your application.
977 |
978 | ### 4.0 Building the "Approval-Gate" Runbooks.
979 |
980 | In this section you will build a reusable **runbook**, which gives the owner the ability to deny or approve remediation actions within a defined waiting period. If the wait time is exceeded and a decision has not been made, the runbook will automatically approve the action as shown.
981 |
982 | 
983 |
984 | We will achieve this through the use of a Systems Manager Automation document, which we will build using the following steps:
985 |
986 | 1. The `Approval-Gate` **runbook** executes a separate document called `Approval-Timer`.
987 |
988 | 2. The `Approval-Timer` **runbook** then waits for a preconfigured amount of time and sends an approve signal to the `Approval-Gate` **runbook**.
989 |
990 | 3. Meanwhile, the `Approval-Gate` **runbook** sends an approval request to the workload owner via a designated SNS topic.
991 |
992 | * If the owner chooses to approve, the `Approval-Gate` **runbook** will continue to the next step.
993 | * If the owner declines the approval, the **runbook** will fail, blocking further steps.
994 | * However, if the owner does not respond within the preconfigured wait time, the `Approval-Timer` **runbook** will automatically approve the request.
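
The three outcomes listed above amount to a race between the owner's decision and the timer. The sketch below models that race in plain Python, without any AWS calls. The function name `gate_outcome`, its arguments, and the returned strings are all hypothetical and exist only to make the logic concrete.

```python
from typing import Optional


def gate_outcome(timer_seconds: int,
                 owner_decision: Optional[str] = None,
                 owner_delay_seconds: Optional[int] = None) -> str:
    """Model which signal reaches the approval step first.

    owner_decision is "Approve", "Deny", or None when the owner never responds.
    """
    owner_in_time = (
        owner_decision is not None
        and owner_delay_seconds is not None
        and owner_delay_seconds < timer_seconds
    )
    if not owner_in_time:
        # The timer runbook sends the approve signal on the owner's behalf.
        return "approved by timer"
    if owner_decision == "Approve":
        # The gate completes and the calling runbook continues.
        return "approved by owner"
    # The gate fails, which blocks the steps that follow it.
    return "denied by owner"


print(gate_outcome(600))                  # owner never responds
print(gate_outcome(600, "Deny", 120))     # owner denies after 2 minutes
print(gate_outcome(600, "Approve", 30))   # owner approves after 30 seconds
```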
995 |
996 | Follow the instructions below to build the runbook:
997 |
998 | > **Note:** Select a step-by-step guide below to build the runbook using either the AWS console or CloudFormation template.
999 |
1000 |
1001 | Click here for Console step by step
1002 |
1003 | 1. Go to the AWS Systems Manager console. Click **Documents** under **Shared Resources** on the left menu. Then click **Create Automation** as shown in the screenshot below:
1004 |
1005 | 
1006 |
1007 | 2. Enter `Approval-Timer` in the **Name** field and copy the notes shown below into the **Document description** field.
1008 |
1009 | ```
1010 | # What does this automation do?
1011 |
1012 | Automatically trigger 'Approval' Signal to an execution, after a timer lapse
1013 |
1014 | ## Steps
1015 |
1016 | 1. Sleep for X time specified on the parameter input
1017 | 2. Automatically signal 'Approval' to the Execution specified in parameter input
1018 | ```
1019 |
1020 | 3. In the **Assume role** field, enter the IAM role ARN we created in the previous section **3.0 Prepare Automation Document IAM Role**.
1021 |
1022 | 4. Expand the **Input Parameters** section and enter `Timer` as the **Parameter name**. Set the type as `String` and **Required** as `Yes`.
1023 |
1024 | 5. Then add another parameter this time called `AutomationExecutionId`, of type `String` and set **Required** to `Yes`. Once you are done, your configuration should look like the screenshot below.
1025 |
1026 | 
1027 |
1028 | 6. In the **Step 1** section, specify `SleepTimer` as the **Step name** and select `aws:sleep` as the **Action type**.
1029 |
1030 | 7. Expand the **Inputs** section of the step, and specify `{{Timer}}` as the **Duration**.
1031 |
1032 | 
1033 |
1034 |
1035 | 8. Click on **Add step**, specify `ApproveExecution` as the **Step name**, and select `aws:executeAwsApi` as the **Action type**.
1036 |
1037 | 9. Expand the **Inputs** section of the step, and specify `ssm` in the **Service** field and `SendAutomationSignal` in the **API** field.
1038 |
1039 | 10. Under **Additional inputs** specify the values below.
1040 |
1041 | * `Approve` as the **SignalType**
1042 | * `{{AutomationExecutionId}}` as the **AutomationExecutionId**.
1043 |
1044 | Once you are done, your configuration should look like the screenshot below.
1045 |
1046 | 
1047 |
1048 | 
1049 |
1050 | 11. Click on **Create automation** once you are done.
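
For reference, the console configuration above corresponds roughly to the Automation document content sketched below as a Python dictionary. This is an approximation for illustration rather than an export of the real document, and the role ARN is a placeholder.

```python
import json

# Approximate content of the Approval-Timer document built in the console steps.
approval_timer_document = {
    "schemaVersion": "0.3",
    "assumeRole": "arn:aws:iam::000000000000:role/xxxx-playbook-role",  # placeholder ARN
    "parameters": {
        "Timer": {"type": "String"},
        "AutomationExecutionId": {"type": "String"},
    },
    "mainSteps": [
        {
            # Step 1: wait for the duration passed in the Timer parameter.
            "name": "SleepTimer",
            "action": "aws:sleep",
            "inputs": {"Duration": "{{Timer}}"},
        },
        {
            # Step 2: send an Approve signal to the execution that is waiting.
            "name": "ApproveExecution",
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "ssm",
                "Api": "SendAutomationSignal",
                "AutomationExecutionId": "{{AutomationExecutionId}}",
                "SignalType": "Approve",
            },
        },
    ],
}

print(json.dumps(approval_timer_document, indent=2))
```

The `aws:sleep` action expects its `Duration` as an ISO 8601 duration, so a `Timer` value such as `PT10M` would wait for ten minutes.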
1051 |
1052 | Next, you will create the `Approval-Gate` **runbook**, which is responsible for running the `Approval-Timer` **runbook** asynchronously. Follow the steps below to complete the configuration:
1053 |
1054 | 1. From the AWS Systems Manager console, select **Documents** under **Shared Resources** on the left menu. Then click **Create Automation** as shown in the screenshot below:
1055 |
1056 | 
1057 |
1058 | 2. Next, enter `Approval-Gate` in the **Name** field and add the notes shown below to the **Document description** field.
1059 |
1060 | ```
1061 | # What does this automation do?
1062 |
1063 | Place a gate before your desired step to create approval mechanism.
1064 | Automation will trigger an asynchronous timer that will automatically approve once the time has lapsed.
1065 | Automation will then send approval / deny request to the designated SNS Topic.
1066 | When deny is triggered by approver, the step will fail and block the following step from executing.
1067 |
1068 | Note: Please ensure to have onFailure set to abort in your automation document.
1069 |
1070 | ## Steps
1071 |
1072 | 1. Trigger an asynchronous timer that will automatically approve once the time has lapsed.
1073 | 2. Send approval / deny request to the designated SNS Topic.
1074 |
1075 | ```
1076 |
1077 | 3. In the **Assume role** field, enter the IAM role ARN we created in the previous section **3.0 Prepare Automation Document IAM Role**.
1078 |
1079 | 4. Expand the **Input Parameters** section and enter the following:
1080 |
1081 | * `Timer` as the **Parameter name**, with the type set to `String` and **Required** set to `Yes`.
1082 | * `NotificationMessage` as the **Parameter name**, with the type set to `String` and **Required** set to `Yes`.
1083 | * `NotificationTopicArn` as the **Parameter name**, with the type set to `String` and **Required** set to `Yes`.
1084 | * `ApproverRoleArn` as the **Parameter name**, with the type set to `String` and **Required** set to `Yes`.
1085 |
1086 | 5. Expand **Step 1** and create a step named `executeAutoApproveTimer` with the action type `aws:executeScript`.
1087 |
1088 | 6. Expand **Inputs**, then set the **Runtime** to `Python3.6` and paste the code below into the script section. This code snippet starts the `Approval-Timer` **runbook** you created asynchronously, that is, without waiting for it to finish.
1089 |
1090 | ```
1091 | import boto3
1092 | def script_handler(event, context):
1093 |     client = boto3.client('ssm')
1094 |     response = client.start_automation_execution(  # starts the timer and returns immediately
1095 |         DocumentName='Approval-Timer',
1096 |         Parameters={
1097 |             'Timer': [ event['Timer'] ],  # each parameter value is passed as a list of strings
1098 |             'AutomationExecutionId' : [ event['AutomationExecutionId'] ]
1099 |         }
1100 |     )
1101 |     return None
1102 | ```
1103 |
1104 | 7. Expand **Additional Inputs**, then select `InputPayload` under **Input Name**, and add the text shown below to **Input Value**:
1105 |
1106 | ```
1107 | AutomationExecutionId: '{{automation:EXECUTION_ID}}'
1108 | Timer: '{{Timer}}'
1109 | ```
1110 | Once you have completed this step, your **Step 1** configuration should look like the screenshot below.
1111 |
1112 |
1113 |
1114 | 8. Click **Add step** to create **Step 2**.
1115 |
1116 | 9. Create a step named `ApproveOrDeny` with the action type `aws:approve`.
1117 |
1118 | 10. Expand **Inputs** and specify the values below under **Approvers**, replacing `AutomationRoleArn` with the ARN of the **AutomationRole** you took note of in section **3.0 Prepare Automation Document IAM Role**.
1119 |
1120 | ```
1121 | [ '{{ApproverRoleArn}}', 'AutomationRoleArn' ]
1122 | ```
1123 |
1124 | Example:
1125 |
1126 | ```
1127 | [ '{{ApproverRoleArn}}', 'arn:aws:iam::xxxxx:role/AutomationRole' ]
1128 | ```
1129 |
1130 |
1131 | 11. Expand **Additional Inputs** and specify the following values:
1132 |
1133 | * `NotificationArn` as the **Input name**, and `{{NotificationTopicArn}}` as the **Input value**
1134 | * `Message` as the **Input name**, and `{{NotificationMessage}}` as the **Input value**
1135 | * `MinRequiredApprovals` as the **Input name**, and `1` as the **Input value**
1136 |
1137 | 12. Expand **Common properties** and change the following properties to the values below (leave the rest unchanged):
1138 |
1139 | * `Continue` for **On failure**
1140 | * `false` for **Is critical**
1141 |
1142 | Once you have completed this step, your **Step 2** configuration should look like the screenshot below.
1143 |
1144 | 
1145 |
1146 |
1147 |
1148 | 13. Click **Add step** to create **Step 3**
1149 |
1150 | 14. Create a step named `getApprovalStatus` and action type `aws:executeAwsApi`
1151 |
1152 | 15. Expand **Inputs** and specify `ssm` in the **Service** field, and `DescribeAutomationStepExecutions` in the **API** field.
1153 |
1154 | 16. Expand **Additional Inputs** and specify below values:
1155 |
1156 | * `AutomationExecutionId` as the **Input Name**, and `{{automation:EXECUTION_ID}}` as the **Input value**
1157 | * `Filters` as the **Input Name**, and copy below values as the **Input value**
1158 |
1159 | ```
1160 | - Key: StepName
1161 |   Values:
1162 |     - ApproveOrDeny
1163 | ```
1164 | 17. Expand **Outputs** and specify below values:
1165 |
1166 | * `approvalStatusVariable` as the **Name**
1167 | * `$.StepExecutions[0].Outputs.ApprovalStatus[0]` as the **Selector**
1168 | * `String` as the **Type**
1169 |
1170 | Once you have completed this step, your **Step 3** configuration should look like the screenshot below.
1171 |
1172 | 
1173 |
1174 |
1175 | 18. Click on **Create automation** to complete the configuration.
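
The **Selector** used in the `getApprovalStatus` step is a path into the API response. The sketch below applies the same path to a trimmed, hypothetical `DescribeAutomationStepExecutions` response to show which value ends up in `approvalStatusVariable`. The function name `approval_status` and the sample data are assumptions made for illustration.

```python
def approval_status(response: dict) -> str:
    """Apply the selector $.StepExecutions[0].Outputs.ApprovalStatus[0] to a response."""
    return response["StepExecutions"][0]["Outputs"]["ApprovalStatus"][0]


# Trimmed, hypothetical response for the approval step of one execution.
sample_response = {
    "StepExecutions": [
        {
            "StepName": "ApproveOrDeny",
            "Action": "aws:approve",
            "Outputs": {"ApprovalStatus": ["Approved"]},
        }
    ]
}

print(approval_status(sample_response))
```

Because the selector reads `StepExecutions[0]`, the **Filters** input needs to narrow the result down to the approval step, otherwise the first entry may be a different step.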
1176 |
1177 |
1178 |
1179 |
1180 | Click here for CloudFormation deployment steps
1181 |
1182 | Download the template [here.](/Code/templates/runbook_approval_gate.yml "Resources template")
1183 |
1184 | If you decide to deploy the stack from the console, follow these steps:
1185 |
1186 | 1. Please follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template.
1187 | 2. Use `waopslab-runbook-approval-gate` as the **Stack Name**, as this is referenced by other stacks later in the lab.
1188 |
1189 |
1190 |
1191 |
1192 | Click here for CloudFormation CLI deployment step (Preferred way)
1193 |
1194 | 1. From the Cloud9 terminal, change to the templates folder as shown:
1195 |
1196 | ```
1197 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
1198 | ```
1199 |
1200 |
1201 | 2. Run the command below, replacing `AutomationRoleArn` with the ARN of the **AutomationRole** you took note of in section **3.0 Prepare Automation Document IAM Role**.
1202 |
1203 | ```
1204 | aws cloudformation create-stack --stack-name waopslab-runbook-approval-gate \
1205 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
1206 | --template-body file://runbook_approval_gate.yml
1207 | ```
1208 |
1209 | With your AutomationRole Arn in place your command will look similar to the following example:
1210 |
1211 | ```
1212 | aws cloudformation create-stack --stack-name waopslab-runbook-approval-gate \
1213 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/xxxx-runbook-role \
1214 | --template-body file://runbook_approval_gate.yml
1215 | ```
1216 |
1217 | 3. Confirm that the stack has installed correctly. Run the **describe-stacks** command below, locate the **StackStatus** and confirm it is set to **CREATE_COMPLETE**.
1218 |
1219 | ```
1220 | aws cloudformation describe-stacks --stack-name waopslab-runbook-approval-gate
1221 | ```
1222 |
1223 |
1224 |
1225 | ### 4.1 Building the "ECS-Scale-Up" runbook.
1226 |
1227 | 
1228 |
1229 | Next, you are going to build the ECS-Scale-Up **runbook** which will complete the following:
1230 |
1231 | 1. Run the `Approval-Gate` **runbook** which you created previously.
1232 | 2. Wait for the `Approval-Gate` **runbook** to complete.
1233 | 3. Once the `Approval-Gate` **runbook** completes successfully, the runbook will increase the number of ECS tasks in the cluster.
1234 |
1235 | Follow the steps below to build the runbook.
1236 |
1237 | > **Note:** Select a step-by-step guide below to build the runbook using either the AWS console or CloudFormation template.
1238 |
1239 |
1240 | Click here for Console step by step
1241 |
1242 | 1. Go to the AWS Systems Manager console. Click **Documents** under **Shared Resources** on the left menu. Then click **Create Automation** as shown in the screenshot below.
1243 |
1244 | 
1245 |
1246 | 2. Next, enter `Runbook-ECS-Scale-Up` in the **Name** field and add the notes shown below to the **Document description** field:
1247 |
1248 | ```
1249 | # What does this automation do?
1250 |
1251 | Scale up a given ECS service task desired count to certain number, with approval process.
1252 | The automation will trigger Approval-Gate runbook, before executing.
1253 |
1254 | ## Steps
1255 |
1256 | 1. Trigger Approval-Gate
1257 | 2. Scale ECS Service by number of service
1258 | ```
1259 |
1260 | 3. In the **Assume role** field, enter the IAM role ARN we created in the previous section **3.0 Prepare Automation Document IAM Role**.
1261 |
1262 | 4. Expand the **Input Parameters** section and enter the following.
1263 |
1264 | * `ECSDesiredCount` as the **Parameter name**, with the type set to `Integer` and **Required** set to `Yes`.
1265 | * `ECSClusterName` as the **Parameter name**, with the type set to `String` and **Required** set to `Yes`.
1266 | * `ECSServiceName` as the **Parameter name**, with the type set to `String` and **Required** set to `Yes`.
1267 | * `NotificationTopicArn` as the **Parameter name**, with the type set to `String` and **Required** set to `Yes`.
1268 | * `NotificationMessage` as the **Parameter name**, with the type set to `String` and **Required** set to `Yes`.
1269 | * `ApproverRoleArn` as the **Parameter name**, with the type set to `String` and **Required** set to `Yes`.
1270 | * `Timer` as the **Parameter name**, with the type set to `String` and **Required** set to `Yes`.
1271 |
1272 |
1273 | 5. Expand **Step 1** and create a step named `executeApprovalGate` with the action type `aws:executeAutomation`.
1274 |
1275 | 6. Expand **Inputs**, then set the **Document name** as `Approval-Gate`.
1276 |
1277 | 7. Expand **Additional inputs** and select `RuntimeParameters` as the **Input Name**
1278 |
1279 | 8. Paste the following as the **Input Value**:
1280 |
1281 | ```
1282 | {
1283 | "Timer":'{{Timer}}',
1284 | "NotificationMessage":'{{NotificationMessage}}',
1285 | "NotificationTopicArn":'{{NotificationTopicArn}}',
1286 | "ApproverRoleArn":'{{ApproverRoleArn}}'
1287 | }
1288 | ```
1289 |
1290 | 9. Click **Add Step** to create the second step.
1291 |
1292 | 10. Specify `updateECSServiceDesiredCount` as **Step Name** and select `aws:executeAwsApi` as Action type.
1293 |
1294 | 11. Expand **Inputs** and configure the following values:
1295 |
1296 | * `ecs` as **Service**
1297 | * `UpdateService` as **Api**
1298 |
1299 | 12. Expand **Additional inputs** and configure the following values:
1300 |
1301 | * `forceNewDeployment` as the **Input Name** and `true` as **Input Value**
1302 | * `desiredCount` as the **Input Name** and `{{ECSDesiredCount}}` as **Input Value**
1303 | * `service` as the **Input Name** and `{{ECSServiceName}}` as **Input Value**
1304 | * `cluster` as the **Input Name** and `{{ECSClusterName}}` as **Input Value**
1305 |
1306 | 13. Click on **Create automation** once complete.
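
The `updateECSServiceDesiredCount` step calls the ECS `UpdateService` API with the four inputs listed above. As a minimal sketch, the helper below builds the same arguments in Python. The function name `build_update_service_args` and the cluster and service names are hypothetical, and the block makes no AWS calls itself.

```python
def build_update_service_args(cluster: str, service: str, desired_count: int) -> dict:
    """Build the UpdateService arguments used by the scale-up step."""
    if desired_count < 0:
        raise ValueError("desired_count must not be negative")
    return {
        "cluster": cluster,
        "service": service,
        "desiredCount": desired_count,
        "forceNewDeployment": True,
    }


# Hypothetical names; use the OutputECSCluster and OutputECSService stack outputs instead.
args = build_update_service_args("example-cluster", "example-service", 100)
print(args)

# With AWS credentials configured, the same arguments could be passed to boto3:
#   import boto3
#   boto3.client("ecs").update_service(**args)
```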
1307 |
1308 |
1309 |
1310 |
1311 |
1312 | Click here for CloudFormation Console deployment step
1313 |
1314 | Download the template [here.](/Code/templates/runbook_scale_ecs_service.yml "Resources template")
1315 |
1316 | If you decide to deploy the stack from the console, ensure that you complete the following steps:
1317 |
1318 | 1. Please follow this [guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) for information on how to deploy the CloudFormation template.
1319 | 2. Use `waopslab-runbook-scale-ecs-service` as the **Stack Name**, as this is referenced by other stacks later in the lab.
1320 |
1321 |
1322 |
1323 |
1324 | Click here for CloudFormation CLI deployment step (Preferred way)
1325 |
1326 | 1. From the Cloud9 terminal, run the command to get into the working script folder.
1327 |
1328 | ```
1329 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/templates
1330 | ```
1331 |
1332 | 2. Then run the command below, replacing `AutomationRoleArn` with the ARN of the **AutomationRole** you took note of in section **3.0 Prepare Automation Document IAM Role**.
1333 |
1334 | ```
1335 | aws cloudformation create-stack --stack-name waopslab-runbook-scale-ecs-service \
1336 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=AutomationRoleArn \
1337 | --template-body file://runbook_scale_ecs_service.yml
1338 | ```
1339 | Example:
1340 |
1341 | ```
1342 | aws cloudformation create-stack --stack-name waopslab-runbook-scale-ecs-service \
1343 | --parameters ParameterKey=PlaybookIAMRole,ParameterValue=arn:aws:iam::000000000000:role/AutomationRole \
1344 | --template-body file://runbook_scale_ecs_service.yml
1345 | ```
1346 |
1347 | 3. Confirm that the stack has installed correctly. Run the **describe-stacks** command below, locate the **StackStatus** and confirm it is set to **CREATE_COMPLETE**.
1348 |
1349 |
1350 | ```
1351 | aws cloudformation describe-stacks --stack-name waopslab-runbook-scale-ecs-service
1352 | ```
1353 |
1354 |
1355 |
1356 | ### 4.2 Executing remediation Runbook.
1357 |
1358 | Now, let's run the **runbook** you created above to remediate the issue.
1359 |
1360 | 1. Go to the AWS CloudFormation console.
1361 |
1362 | 2. Click on the stack named `walab-ops-sample-application`.
1363 |
1364 | 3. Click on the **Output** tab, and take note of the following output values. You will need these values to execute the runbook.
1365 |
1366 | * OutputECSCluster
1367 | * OutputECSService
1368 | * OutputSystemOwnersTopicArn
1369 |
1370 | 
1371 |
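The same output values can also be read from the terminal. This is a sketch only, assuming the stack name and output keys listed above; `stack_output` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the lab): print one CloudFormation output value.
# $1 = stack name, $2 = output key
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
    --output text
}

# Usage:
#   stack_output walab-ops-sample-application OutputECSCluster
#   stack_output walab-ops-sample-application OutputECSService
#   stack_output walab-ops-sample-application OutputSystemOwnersTopicArn
```
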
1372 | 4. If you are currently using an IAM user or role to log into your AWS Console, take note of its ARN.
1373 | You will need this ARN when executing the **runbook** to restrict who can approve or deny the request.
1374 |
1375 | To find your current IAM user ARN, go to the IAM console and click **Users** on the left side menu, then click on your **User** name.
1376 | For an IAM role, go to the IAM console and click **Roles** on the left side menu, then click on the name of the **Role** you are using.
1377 |
1378 | You will see something similar to the example below. Take note of the ARN value, and proceed to the next step.
1379 |
1380 | 
1381 |
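If you work from a terminal with the same credentials, the identity in use can also be printed with the AWS CLI. This is a minimal sketch; `current_caller_arn` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the lab): print the ARN of the calling identity.
current_caller_arn() {
  aws sts get-caller-identity --query 'Arn' --output text
}

# Usage:
#   current_caller_arn
# Note: for an assumed role this prints an STS ARN of the form
#   arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
# which is not the same string as the IAM role ARN shown in the IAM console.
```

For an IAM user the printed value matches the ARN shown in the IAM console. For a role session, compare it with the role ARN in the console before using it as the approver.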
1382 | 5. Go to the Systems Manager Automation console, click on **Documents** under **Shared Resources**, then locate and click the automation document called `Runbook-ECS-Scale-Up`.
1383 |
1384 | 6. Then click **Execute automation**.
1385 |
1386 | 7. Fill in the **Input parameters** with the values below.
1387 |
1388 | 
1389 |
1390 | * For **ECSServiceName**, enter the value of **OutputECSService** you noted in step 3.
1391 | * For **ECSClusterName**, enter the value of **OutputECSCluster** you noted in step 3.
1392 | * For **ApproverArn**, enter the ARN value you noted in step 4.
1393 | * For **ECSDesiredCount**, enter `100` to increase the task count to 100.
1394 | * For **NotificationMessage**, enter any message that helps the approver make an informed decision when approving or denying the requested action.
1395 |
1396 | For example:
1397 | ```
1398 | Hello, your mysecretword app is experiencing performance degradation. To maintain a quality customer experience, we will manually scale up the supporting cluster. This action will be taken approximately 10 minutes after this message is generated, unless you deny it within that period.
1399 | ```
1400 |
1401 | * For **NotificationTopicArn**, enter the value of **OutputSystemOwnersTopicArn** you noted in step 3.
1402 | * For **Timer**, specify `PT5M` (5 minutes) or another value in ISO 8601 duration format.
1403 |
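The console form above corresponds to a single `start-automation-execution` call. The sketch below assumes the runbook takes exactly the seven input parameters listed above and no others; `start_scale_up` is a hypothetical helper name, and the desired count and timer are hard-coded to the values used in this lab.

```shell
# Hypothetical helper (not part of the lab): start Runbook-ECS-Scale-Up from the CLI.
# $1 = ECS service name      $2 = ECS cluster name   $3 = approver ARN
# $4 = SNS topic ARN         $5 = notification message (must not contain double quotes)
start_scale_up() {
  local params
  # Automation parameters are passed as a JSON map of string lists.
  params="$(printf '{"ECSServiceName":["%s"],"ECSClusterName":["%s"],"ApproverArn":["%s"],"NotificationTopicArn":["%s"],"NotificationMessage":["%s"],"ECSDesiredCount":["100"],"Timer":["PT5M"]}' \
    "$1" "$2" "$3" "$4" "$5")"
  aws ssm start-automation-execution \
    --document-name "Runbook-ECS-Scale-Up" \
    --parameters "$params" \
    --query 'AutomationExecutionId' \
    --output text
}

# Usage (placeholder values):
#   start_scale_up "<OutputECSService>" "<OutputECSCluster>" "<ApproverArn>" \
#     "<OutputSystemOwnersTopicArn>" "Scaling up mysecretword to 100 tasks"
```

The command prints the automation execution ID, which identifies the run in the Systems Manager console.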
1404 | 8. Click **Execute** to run the **runbook**.
1405 |
1406 | 9. Once the **runbook** is running, you will receive an email with instructions to approve or deny the request, at the email address subscribed to the owners SNS topic.
1407 | Follow the link in the email while signed in as the user or role whose ARN you entered for **ApproverArn** in the input parameters. The link takes you to the Systems Manager console, where you can approve or deny the request.
1408 |
1409 |
1410 | 
1411 |
1412 | If you approve, or ignore the email, the request will automatically be approved after the Timer set in the runbook expires.
1413 | If you deny, the **runbook** will fail and no action will be taken.
1414 |
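The approval can also be inspected and answered from the CLI. This is a sketch under assumptions: both helper names are hypothetical, and the execution ID must be that of the execution that is actually waiting on the approval step. If the runbook delegates approval to a nested automation, that is the nested execution's ID, as shown on its detail page in the console.

```shell
# Hypothetical helpers (not part of the lab).

# Print the current status of an automation execution.
# $1 = automation execution ID
execution_status() {
  aws ssm get-automation-execution \
    --automation-execution-id "$1" \
    --query 'AutomationExecution.AutomationExecutionStatus' \
    --output text
}

# Send an Approve signal to an execution waiting for approval.
# Use --signal-type Reject instead to deny the request.
# $1 = automation execution ID
approve_execution() {
  aws ssm send-automation-signal \
    --automation-execution-id "$1" \
    --signal-type Approve
}
```

The signal is accepted only when the caller is one of the approvers configured for the step, which in this lab is the ARN entered as **ApproverArn**.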
1415 | 10. Once the **runbook** completes, you can see that the ECS task count increased to the value specified.
1416 |
1417 | 11. Go to the ECS console, click on **Clusters**, and select `mysecretword-cluster`.
1418 |
1419 | 12. Click on the `mysecretword-service` **Service**, and you will see the number of running tasks increase to 100 and the average CPUUtilization decrease.
1420 |
1421 | 
1422 |
1423 | 
1424 |
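The task counts shown in the ECS console can also be polled from the terminal. This is a minimal sketch, assuming the cluster and service names used above; `ecs_task_counts` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the lab): print desired and running task counts.
# $1 = ECS cluster name, $2 = ECS service name
ecs_task_counts() {
  aws ecs describe-services \
    --cluster "$1" \
    --services "$2" \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text
}

# Usage:
#   ecs_task_counts mysecretword-cluster mysecretword-service
```

The two numbers are printed on one line, desired count first. The running count should climb toward the desired count as new tasks start.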
1425 | 13. Subsequently, you will see the API response time return to normal and the CloudWatch Alarm return to an OK state.
1426 |
1427 | 
1428 |
1429 | You can check both using your CloudWatch Console, following the steps you ran in section **2.1 Observing the alarm being triggered**.
1430 |
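A quick terminal check of the alarm state is also possible. The sketch below lists every metric alarm currently in the `ALARM` state in the active Region rather than naming the lab's alarm, because the alarm name is not given in this section. `alarms_in_alarm` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the lab): list metric alarms currently in ALARM state.
alarms_in_alarm() {
  aws cloudwatch describe-alarms \
    --state-value ALARM \
    --query 'MetricAlarms[].AlarmName' \
    --output text
}

# Usage:
#   alarms_in_alarm
```

Once the service has recovered, the lab's alarm should no longer appear in this list.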
1431 |
1432 | #### Congratulations!
1433 | You have now completed the **Automating operations with Playbooks and Runbooks** lab. Continue to the Teardown section below to clean up the lab resources.
1434 |
1435 |
1436 | ## Teardown
1437 | In this section you will delete all resources related to the lab environment.
1438 |
1439 | 1. Run the following command to navigate to the script folder.
1440 |
1441 | ```
1442 | cd ~/environment/aws-well-architected-labs/static/Operations/200_Automating_operations_with_playbooks_and_runbooks/Code/scripts/
1443 | ```
1444 |
1445 | 2. Run the `teardown_resources.sh` script to delete all resources related to the lab.
1446 | ```
1447 | bash teardown_resources.sh
1448 | ```
1449 | ## Summary
1450 | In this lab you learnt how to:
1451 | - Build and run automated playbooks to support your investigations
1452 | - Build and run automated runbooks to remediate specific faults
1453 | - Enable traceability of operations activities in your environment
1454 |
1455 |
1456 |
--------------------------------------------------------------------------------