├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── _img ├── WAT-CreateLens.png ├── WAT-CustomLenses.png ├── WAT-DraftState.png ├── WAT-Publish.png ├── WAT-Published.png ├── WAT-Selectable.png └── gt-well-architected.png └── wafr-operational-readiness-lens ├── orr-v1.3.2-PUBLISHED.json ├── orr-v1.3.3-PUBLISHED.json ├── orr-v1.3.4-PUBLISHED.json ├── orr-v1.3.5-PUBLISHED.json └── orr-v1.3.6-PUBLISHED.json /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. 
Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT No Attribution 2 | 3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so. 10 | 11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
17 | 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AWS Operational Readiness Review (ORR) 2 | 3 | ![AWS: Well-Architected](https://img.shields.io/badge/AWS-Well--Architected-green) 4 | 5 | ## Introduction 6 | | | | 7 | |-----|---------------| 8 | | ![AWS Well-Architected Logo](_img/gt-well-architected.png) | The __AWS Operational Readiness Review (ORR)__ acts as a sanity & safety check for a new workload built on AWS services, assessed before the new workload goes live in production. The ORR is complementary to the [Well-Architected Framework Review (WAFR)](https://aws.amazon.com/architecture/well-architected), and provides additional readiness checks to assist a team preparing to support a new AWS workload or service. | 9 | 10 | --- 11 | 12 | ## Use 13 | 14 | The ORR is designed as a [custom lens for the AWS Well-Architected Tool](https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses-custom.html). 15 | 16 | The [AWS Well-Architected Tool](https://docs.aws.amazon.com/wellarchitected/latest/userguide/intro.html) is a service in the cloud that provides a consistent process for 17 | measuring your architecture using AWS best practices. AWS Well-Architected Tool helps you throughout the product lifecycle by assisting with documenting the decisions that you make, 18 | providing recommendations for improving your workload based on best practices, and guiding you in making your workloads more reliable, secure, efficient, and cost-effective. 19 | 20 | To add the ORR as a [custom lens for the AWS Well-Architected Tool](https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses-custom.html), follow the steps 21 | under ["New Installation"](#new-installation), below. 22 | 23 | --- 24 | 25 | ## New Installation 26 | 27 | 1. 
Check the `wafr-operational-readiness-lens` directory for a file named "orr-(version)-PUBLISHED.json", and select the highest version available. File names that include the PUBLISHED tag have been tested by the repository maintainer to pass all validation checks that the AWS Well-Architected Tool service console performs at time of import. Any files in the `wafr-operational-readiness-lens` directory without "PUBLISHED" in the name are work-in-progress versions that may not import correctly without minor adjustments. 28 | 29 | 2. Once you've identified the file, upload it to the Well-Architected Tool console by going to https://us-east-1.console.aws.amazon.com/wellarchitected/ (or your preferred region's equivalent) and then going to the Custom lenses page. Select "Create custom lens". 30 | 31 | ![Well-Architected Tool Custom Lens](_img/WAT-CustomLenses.png) 32 | 33 | 3. Click the "Choose File" button to navigate to the `orr-(VERSION)-PUBLISHED.json` file that you identified earlier. Click "Submit". 34 | 35 | ![Well-Architected Create custom lens](_img/WAT-CreateLens.png) 36 | 37 | > If you get any error messages at the top of the screen when you import the file, please reach out to the repository maintainer or make the appropriate modifications to the file yourself. If you definitely imported the file with "PUBLISHED" in its name and are still getting an error message, check the error message carefully. You cannot have two custom lenses with the same unique identifier, and the error message will say so. See below for upgrade instructions. 38 | 39 | > If you get an error message about questions not passing validation, and you are definitely using a 'PUBLISHED' lens, then please reach out to the repository maintainer for assistance. 40 | 41 | 4. Once the lens has been imported, it should have a status of "DRAFT". 42 | 43 | ![Well-Architected Draft Lens](_img/WAT-DraftState.png) 44 | 45 | 5. Click the "Actions" -> "Publish lens" option.
46 | 47 | ![Well-Architected Draft Lens](_img/WAT-Publish.png) 48 | 49 | 6. Give the custom lens a version name. It's easiest to copy the version number from the file name, such as v1.3.2, but any version will do since you are only importing for use in your own account. Click Publish. 50 | 51 | ![Well-Architected Draft Lens](_img/WAT-Published.png) 52 | 53 | 54 | --- 55 | 56 | ## Upgrading an existing lens 57 | 58 | Once you've imported a lens, you may want to check periodically for updates. 59 | You can check the `wafr-operational-readiness-lens` directory of the repository to see if there's a newer version available (as of writing, the highest PUBLISHED version in that directory is 1.3.6). 60 | If there is, download the file and go back to the Well-Architected Tool Console. Go to "Custom lenses" and select the `AWS Internal Operational Readiness Review` custom lens item. 61 | Click "Edit" and then click "Choose File" and navigate to the new lens version you are interested in. If everything goes well, you'll get a green "Imported successfully" message. 62 | Then click "Publish lens". 63 | 64 | Give the new lens version a version name and publish it. 65 | 66 | The next time you go to the Workloads tab, you'll be prompted to upgrade any workload review that was using the old version of the lens. As part of the upgrade process, 67 | you'll be asked to save the current state of the review as a "milestone" so that you can go back and reference it at any time. 68 | 69 | --- 70 | 71 | ## Using the ORR as part of a workload review 72 | 73 | This custom ORR lens is designed to be conducted after the Well-Architected Review but weeks prior to the workload going live in production. 74 | 75 | While the Well-Architected Lens is very prescriptive with an explicit list of items to be checked off, the ORR is more free-form and designed to be used as a discussion guide. 76 | The ORR's purpose is to begin a discussion on higher-risk topics. 77 | 78 | A full review typically takes 3-4 hours.
Each question ends with a "(H)", "(M)", or "(L)", which maps to "High", "Medium", and "Low" respectively. If you have limited 79 | time for the review, you can defer the Low questions, and then the Medium questions if necessary, to focus on the High questions. 80 | 81 | The questions specified are meant to begin a conversation. Ask yourself follow-up questions, such as "Why?", "What happens if _____ does occur?", 82 | "Is this documented? Has it been tested?", to uncover further details and identify whether an item truly poses a risk to the business. 83 | Talk through scenarios, how they can occur, and what mechanisms are (or are not) in place to address them. 84 | 85 | --- 86 | 87 | ## Beginning a new review 88 | 89 | Once the ORR is installed as a new custom lens from the prior steps, it can be used in a workload evaluation. 90 | 91 | Go to the "Workloads" tab of the [Well-Architected Tool](https://console.aws.amazon.com/wellarchitected) Console and click "Define workload". 92 | Fill out the name, description, review owner, and other prompted fields from the console. When done, click "Next". 93 | 94 | You must have the "AWS Well-Architected Framework" as a selected lens, but you can also select the "AWS Operational Readiness Review" to add it to the question list. 95 | Then click "Define workload" and then the "Start reviewing" dropdown menu on the next page. From the dropdown menu you can start with either a Well-Architected Framework review or the 96 | Operational Readiness Review. 97 | 98 | ![Well-Architected ORR Lens Selection](_img/WAT-Selectable.png) 99 | 100 | The questionnaire is broken into three main pillars: Architecture, Release Quality & Procedures, and Incident & Event Management. The first pillar focuses on general architecture 101 | questions of the application itself, as well as topics such as forecasting for the event, testing being done, known failure modes, and other related topics.
The second pillar 102 | focuses on how updates are deployed to the application before, during, and after the event. The third pillar focuses on items such as monitoring alarms, success/failure criteria, 103 | and on-call procedures. 104 | 105 | Each question has side-bar "Helpful resources" to help expand upon the question's choices. Additionally, as mentioned above, each question has an (H), (M), or (L) at the end to denote whether 106 | it's considered a High/Medium/Low risk question. If you are short on time for the review, focusing on the High questions can be an effective way to cut down the length. 107 | 108 | Again, this lens is meant to guide a discussion in order to uncover potential risks prior to the go-live date for the application and spur discussion on ways to 109 | mitigate those risks within the confines of the application. While there are some "Suggested Improvements" for each question choice when you generate the report, they should be taken 110 | as advisory for your application. The decision to implement any given recommendation should take additional factors into consideration, such as the available funding, the architecture 111 | of the application, the risk to the business, and how much time is left to make such changes. 112 | 113 | --- 114 | 115 | ## Feedback, Suggestions, and Support 116 | 117 | If you run into issues, please double-check the official [AWS Well-Architected Tool documentation](https://docs.aws.amazon.com/wellarchitected/latest/userguide/intro.html) for updated guidance on importing a new custom lens, upgrading a custom lens, or beginning a review. 118 | 119 | If you have any feedback about the lens, suggestions for new questions, or further ideas on remediation text, please raise it as a [GitHub Issue](https://github.com/awslabs/operational-readiness-review-custom-war-lens/issues). 120 | 121 | If you would like to help improve this lens, please see the [CONTRIBUTING](CONTRIBUTING.md) file at the root level of the repository.
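
As a convenience for step 1 under "New Installation", the rule "select the highest version available among files named `orr-(version)-PUBLISHED.json`" can be sketched in Python. This is a minimal sketch, not part of the repository: the function name `latest_published_lens` and the assumption that versions always have three numeric parts (as in `v1.3.6`) are illustrative choices. Versions are compared numerically rather than as strings, so a hypothetical `v1.3.10` would rank above `v1.3.6`.

```python
import re
from typing import Iterable, Optional

# Matches names such as "orr-v1.3.6-PUBLISHED.json" and captures the three
# numeric version parts. The three-part shape is an assumption based on the
# file names listed in the repository tree.
PUBLISHED_PATTERN = re.compile(r"^orr-v(\d+)\.(\d+)\.(\d+)-PUBLISHED\.json$")


def latest_published_lens(file_names: Iterable[str]) -> Optional[str]:
    """Return the PUBLISHED lens file name with the highest version, or None.

    Hypothetical helper: files without "PUBLISHED" in the expected position
    are ignored, mirroring the README's advice to skip work-in-progress files.
    """
    candidates = []
    for name in file_names:
        match = PUBLISHED_PATTERN.match(name)
        if match:
            # Compare versions as integer tuples, not as strings.
            version = tuple(int(part) for part in match.groups())
            candidates.append((version, name))
    if not candidates:
        return None
    return max(candidates)[1]
```

To use it against a local clone, pass in the names from the lens directory, for example `latest_published_lens(p.name for p in pathlib.Path("wafr-operational-readiness-lens").iterdir())`, and then upload the returned file through the console as described above.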
122 | 123 | -------------------------------------------------------------------------------- /_img/WAT-CreateLens.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/operational-readiness-review-custom-war-lens/85b14fa95d84d5d5e8c8b6d4f13f64acdb501464/_img/WAT-CreateLens.png -------------------------------------------------------------------------------- /_img/WAT-CustomLenses.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/operational-readiness-review-custom-war-lens/85b14fa95d84d5d5e8c8b6d4f13f64acdb501464/_img/WAT-CustomLenses.png -------------------------------------------------------------------------------- /_img/WAT-DraftState.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/operational-readiness-review-custom-war-lens/85b14fa95d84d5d5e8c8b6d4f13f64acdb501464/_img/WAT-DraftState.png -------------------------------------------------------------------------------- /_img/WAT-Publish.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/operational-readiness-review-custom-war-lens/85b14fa95d84d5d5e8c8b6d4f13f64acdb501464/_img/WAT-Publish.png -------------------------------------------------------------------------------- /_img/WAT-Published.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/operational-readiness-review-custom-war-lens/85b14fa95d84d5d5e8c8b6d4f13f64acdb501464/_img/WAT-Published.png -------------------------------------------------------------------------------- /_img/WAT-Selectable.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/awslabs/operational-readiness-review-custom-war-lens/85b14fa95d84d5d5e8c8b6d4f13f64acdb501464/_img/WAT-Selectable.png -------------------------------------------------------------------------------- /_img/gt-well-architected.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/operational-readiness-review-custom-war-lens/85b14fa95d84d5d5e8c8b6d4f13f64acdb501464/_img/gt-well-architected.png -------------------------------------------------------------------------------- /wafr-operational-readiness-lens/orr-v1.3.2-PUBLISHED.json: -------------------------------------------------------------------------------- 1 | { 2 | "schemaVersion": "2021-11-01", 3 | "name": "AWS Operational Readiness Review", 4 | "description": "This Well-Architected lens is an adaptation of the AWS Operational Readiness Review program, a set of questions designed to capture and help correct common failure points. It is intended to aid customer and account teams in capturing potential issues prior to go-live.", 5 | "pillars": [ 6 | { 7 | "id": "architecture", 8 | "name": "01 - Architecture", 9 | "questions": [ 10 | { 11 | "id": "well_architected", 12 | "title": "AWS Well-Architected Framework Review (H)", 13 | "description": "Have you addressed the issues identified as the result of an AWS Well-Architected Framework Review (WAFR)?", 14 | "choices": [ 15 | { 16 | "id": "wafr", 17 | "title": "AWS Well-Architected Framework Review (WAFR) has been completed on this workload.", 18 | "improvementPlan": { 19 | "displayText": "Complete a WAFR.
Your AWS Account team can help!", 20 | "url": "https://aws.amazon.com/well-architected-tool/" 21 | }, 22 | "helpfulResource": { 23 | "displayText": "AWS Well-Architected", 24 | "url": "https://aws.amazon.com/architecture/well-architected" 25 | } 26 | }, 27 | { 28 | "id": "no_high_risk", 29 | "title": "There are no outstanding or unresolved high-risk findings from the AWS Well-Architected Framework Review (WAFR).", 30 | "improvementPlan": { 31 | "displayText": "Resolve the WAFR high-risk findings and re-assess.", 32 | "url": "https://www.wellarchitectedlabs.com" 33 | }, 34 | "helpfulResource": { 35 | "displayText": "AWS Well-Architected", 36 | "url": "https://aws.amazon.com/architecture/well-architected" 37 | } 38 | }, 39 | { 40 | "id": "no_medium_risk", 41 | "title": "There are no outstanding or unresolved medium-risk findings from the AWS Well-Architected Framework Review (WAFR).", 42 | "improvementPlan": { 43 | "displayText": "Resolve the WAFR medium-risk findings and re-assess.", 44 | "url": "https://www.wellarchitectedlabs.com" 45 | }, 46 | "helpfulResource": { 47 | "displayText": "AWS Well-Architected", 48 | "url": "https://aws.amazon.com/architecture/well-architected" 49 | } 50 | } 51 | ], 52 | "riskRules": [ 53 | { 54 | "condition": "!wafr || !no_high_risk", 55 | "risk": "HIGH_RISK" 56 | }, 57 | { 58 | "condition": "wafr && no_high_risk && !no_medium_risk", 59 | "risk": "MEDIUM_RISK" 60 | }, 61 | { 62 | "condition": "default", 63 | "risk": "NO_RISK" 64 | } 65 | ] 66 | }, 67 | { 68 | "id": "architecture_architecture_diagram", 69 | "title": "Architecture Diagram (H)", 70 | "description": "Please provide a diagram of your system or application architecture, both at the infrastructure level and at the data/network flow level. 
Note the locations in the notes section below.", 71 | "choices": [ 72 | { 73 | "id": "architecture_diagram", 74 | "title": "Architecture diagram provided.", 75 | "helpfulResource": { 76 | "displayText": "An architecture diagram shows the (multi)-regional setup of the underlying infrastructure, relevant ELBs, ASGs, how they are split across AZs, etc." 77 | }, 78 | "improvementPlan": { 79 | "displayText": "A review of the architecture diagram is highly recommended prior to go-live in order to sanity-check that there are no visible single points of failure." 80 | } 81 | }, 82 | { 83 | "id": "data_flow_diagram", 84 | "title": "Data/network flow diagram provided.", 85 | "helpfulResource": { 86 | "displayText": "A data or network flow diagram shows the flow of data through the system in order to identify external dependencies, internal single points of failure, or performance bottlenecks." 87 | }, 88 | "improvementPlan": { 89 | "displayText": "A review of the network/data flow diagram is highly recommended prior to go-live in order to ensure there are no noticeable bottlenecks or external dependencies that could impact service operation." 90 | } 91 | } 92 | ], 93 | "riskRules": [ 94 | { 95 | "condition": "data_flow_diagram && architecture_diagram", 96 | "risk": "NO_RISK" 97 | }, 98 | { 99 | "condition": "(!data_flow_diagram && architecture_diagram) || (!architecture_diagram && data_flow_diagram)", 100 | "risk": "MEDIUM_RISK" 101 | }, 102 | { 103 | "condition": "default", 104 | "risk": "HIGH_RISK" 105 | } 106 | ] 107 | }, 108 | { 109 | "id": "architecture_services_used", 110 | "title": "AWS Services Used (H)", 111 | "description": "What AWS services, in which accounts and regions, is your workload using? Please break it out by component (Application/Workload/Service-- logical units). 
", 112 | "choices": [ 113 | { 114 | "id": "onlyAWS", 115 | "title": "All AWS services in-use are known and documented by account and region.", 116 | "helpfulResource": { 117 | "displayText": "All AWS services in-use are known and documented by account and region." 118 | }, 119 | "improvementPlan": { 120 | "displayText": "The customer and the account team should leverage Cost Explorer or CUDOS to identify what services are in use across what accounts and regions. The application team should be consulted for calls to externally managed dependencies." 121 | } 122 | }, 123 | { 124 | "id": "thirdParty", 125 | "title": "Third party dependencies exist but are known and documented.", 126 | "helpfulResource": { 127 | "displayText": "Third party dependencies are services that are run, managed, or controlled by a team that exists outside of the customer's organization and outside of AWS." 128 | }, 129 | "improvementPlan": { 130 | "displayText": "Third party dependencies should be documented, including known SLAs for uptime and escalation paths that can be engaged in the event of an issue." 131 | } 132 | } 133 | ], 134 | "riskRules": [ 135 | { 136 | "condition": "(onlyAWS && !thirdParty)", 137 | "risk": "NO_RISK" 138 | }, 139 | { 140 | "condition": "(!onlyAWS || thirdParty)", 141 | "risk": "MEDIUM_RISK" 142 | }, 143 | { 144 | "condition": "default", 145 | "risk": "HIGH_RISK" 146 | } 147 | ] 148 | }, 149 | { 150 | "id": "architecture_api_matrix", 151 | "title": "Impacted API Matrix (H)", 152 | "description": "Provide a table with all customer-facing APIs, an explanation of what each does, and the components and dependencies of your service that each API impacts. ", 153 | "choices": [ 154 | { 155 | "id": "table_provided", 156 | "title": "Matrix has provided below in the notes section.", 157 | "helpfulResource": { 158 | "displayText": "The API Matrix, or a link to it, has been noted below."
159 | }, 160 | "improvementPlan": { 161 | "displayText": "Key customer-facing APIs that are expecting high traffic should be documented along with their component pieces and dependencies, and expected traffic load if known." 162 | } 163 | } 164 | ], 165 | "riskRules": [ 166 | { 167 | "condition": "table_provided", 168 | "risk": "NO_RISK" 169 | }, 170 | { 171 | "condition": "default", 172 | "risk": "HIGH_RISK" 173 | } 174 | ] 175 | }, 176 | { 177 | "id": "architecture_failure_models", 178 | "title": "Failure Models (H)", 179 | "description": "Construct a failure model listing soft and hard failures for each of your system's components and dependencies, how they would be detected, and mitigation efforts. Note them below.", 180 | "choices": [ 181 | { 182 | "id": "soft_failures_known", 183 | "title": "Soft failures known and documented.", 184 | "helpfulResource": { 185 | "displayText": "Soft failures are failures where an application is partially operating; for example, high latency or a high percentage of response errors." 186 | }, 187 | "improvementPlan": { 188 | "displayText": "Known soft failure scenarios for the application or workload should be documented and discussed in order to identify mitigations." 189 | } 190 | }, 191 | { 192 | "id": "soft_failures_detection", 193 | "title": "Soft failure conditions are actively monitored for occurrence.", 194 | "helpfulResource": { 195 | "displayText": "Soft failures such as high latency can be monitored through p90 / p99 transaction monitoring, and response error rates can be emitted by the application. Alarms should be set for known soft-failure conditions." 196 | }, 197 | "improvementPlan": { 198 | "displayText": "Known soft failure scenarios should be discussed and documented and alarms set to detect such scenarios, if possible."
199 | } 200 | }, 201 | { 202 | "id": "soft_failures_runbooks", 203 | "title": "Known soft failure conditions have documented mitigations or playbooks", 204 | "helpfulResource": { 205 | "displayText": "On-call engineers should ideally not be scrambling to 'figure out' what to do. If a failure condition is known ahead of time, playbooks should be written that can be followed in the event of occurrence." 206 | }, 207 | "improvementPlan": { 208 | "displayText": "Known soft failure scenarios should have runbooks built that on-call engineers can leverage in the event of occurrence. Links to the relevant runbooks should be added to the detection alarms." 209 | } 210 | }, 211 | { 212 | "id": "hard_failures_known", 213 | "title": "Hard failures known and documented", 214 | "helpfulResource": { 215 | "displayText": "Hard failures are failures where an application is entirely non-operational." 216 | }, 217 | "improvementPlan": { 218 | "displayText": "Known hard failure scenarios should be discussed and documented and alarms set to detect such scenarios, if possible." 219 | } 220 | }, 221 | { 222 | "id": "hard_failures_detection", 223 | "title": "Known hard failure conditions are actively monitored for occurrence.", 224 | "helpfulResource": { 225 | "displayText": "Hard failure conditions can be monitored through things like health checks, or instance status checks." 226 | }, 227 | "improvementPlan": { 228 | "displayText": "Known hard failure scenarios should be discussed and documented and alarms set to detect such scenarios, if possible." 229 | } 230 | }, 231 | { 232 | "id": "hard_failures_runbooks", 233 | "title": "Known hard failure conditions have documented mitigations or playbooks", 234 | "helpfulResource": { 235 | "displayText": "On-call engineers should ideally not be scrambling to 'figure out' what to do. If a failure condition is known ahead of time, playbooks should be written that can be followed in the event of occurrence."
236 | }, 237 | "improvementPlan": { 238 | "displayText": "Known hard failure scenarios should have runbooks built that on-call engineers can leverage in the event of occurrence. Links to the relevant runbooks should be added to the detection alarms." 239 | } 240 | } 241 | ], 242 | "riskRules": [ 243 | { 244 | "condition": "hard_failures_runbooks && hard_failures_detection && hard_failures_known && soft_failures_runbooks && soft_failures_detection && soft_failures_known", 245 | "risk": "NO_RISK" 246 | }, 247 | { 248 | "condition": "hard_failures_known && soft_failures_known && (!hard_failures_detection || !soft_failures_detection || !hard_failures_runbooks || !soft_failures_runbooks)", 249 | "risk": "MEDIUM_RISK" 250 | }, 251 | { 252 | "condition": "default", 253 | "risk": "HIGH_RISK" 254 | } 255 | ] 256 | }, 257 | { 258 | "id": "architecture_plane_redundency", 259 | "title": "Control & Data Plane Redundancy (H)", 260 | "description": "What level of redundancy does your application or service support for its control and data plane components?", 261 | "choices": [ 262 | { 263 | "id": "ctrl_regionally_redundant", 264 | "title": "Control plane is redundant within a region.", 265 | "helpfulResource": { 266 | "displayText": "Control plane is a management layer of software, generally dictating where and how the data should move." 267 | }, 268 | "improvementPlan": { 269 | "displayText": "Most applications and services can, and should, be run out of multiple AWS availability zones rather than being limited to a single AZ. Work with the account team to discuss options to spread out within a region." 270 | } 271 | }, 272 | { 273 | "id": "ctrl_globally_redundent", 274 | "title": "Control plane is redundant between regions.", 275 | "helpfulResource": { 276 | "displayText": "Control plane is a management layer of software, generally dictating where and how the data should move."
277 | }, 278 | "improvementPlan": { 279 | "displayText": "Multi-region control plane support is less common but can be achieved with custom software. Work with the account team to identify if this is a possibility given the architecture and cost considerations." 280 | } 281 | }, 282 | { 283 | "id": "data_regionally_redundant", 284 | "title": "Data plane is redundant within a region.", 285 | "helpfulResource": { 286 | "displayText": "Data plane is the layer that is responsible for routing and handling data that passes through the network." 287 | }, 288 | "improvementPlan": { 289 | "displayText": "Most applications and services can, and should, be run out of multiple AWS availability zones rather than being limited to a single AZ. Work with the account team to discuss options to spread out within a region." 290 | } 291 | }, 292 | { 293 | "id": "data_globally_redundant", 294 | "title": "Data plane is redundant between regions.", 295 | "helpfulResource": { 296 | "displayText": "Data plane is the layer that is responsible for routing and handling data that passes through the network." 297 | }, 298 | "improvementPlan": { 299 | "displayText": "Multi-region data plane support is possible for many architectures with design choices such as infrastructure-as-code, DNS-based routing, and software choices such as DynamoDB Global Tables. Work with the account team to identify if this is a possibility given the application architecture and cost considerations."
300 | } 301 | } 302 | ], 303 | "riskRules": [ 304 | { 305 | "condition": "data_regionally_redundant && ctrl_regionally_redundant", 306 | "risk": "NO_RISK" 307 | }, 308 | { 309 | "condition": "default", 310 | "risk": "HIGH_RISK" 311 | } 312 | ] 313 | }, 314 | { 315 | "id": "architecture_retry_timeouts", 316 | "title": "Retries & Socket Timeouts (H)", 317 | "description": "Have you reviewed your retry and socket timeouts?", 318 | "choices": [ 319 | { 320 | "id": "reviewed", 321 | "title": "Retry count and socket timeouts reviewed.", 322 | "helpfulResource": { 323 | "displayText": "If misconfigured, multiple retries could push the system above the timeout value even if the call could succeed on the final retry." 324 | }, 325 | "improvementPlan": { 326 | "displayText": "Review application code to check for number of retries, the timeout per retry, and the length of the overall timeout on either the client side or receiving end." 327 | } 328 | } 329 | ], 330 | "riskRules": [ 331 | { 332 | "condition": "reviewed", 333 | "risk": "NO_RISK" 334 | }, 335 | { 336 | "condition": "default", 337 | "risk": "HIGH_RISK" 338 | } 339 | ] 340 | }, 341 | { 342 | "id": "architecture_health_checks", 343 | "title": "Health Checks (H)", 344 | "description": "Have you configured DNS and load balancer health checks?", 345 | "choices": [ 346 | { 347 | "id": "lb_health_checks", 348 | "title": "Load balancer health checks implemented", 349 | "helpfulResource": { 350 | "displayText": "Health checks at the load balancer level ensure that traffic is routed only to healthy hosts supporting the application."
351 | }, 352 | "improvementPlan": { 353 | "displayText": "Ensure that ELB Health Checks are enabled for event-relevant ELBs.", 354 | "url": "https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-add-elb-healthcheck.html" 355 | } 356 | }, 357 | { 358 | "id": "dns_health_checks", 359 | "title": "DNS Health Checks", 360 | "helpfulResource": { 361 | "displayText": "DNS health checks allow for more granular control of how and which endpoints (hosts, ELBs, CloudFront distributions) receive application traffic. One benefit can be failing over to an alternative implementation (on-prem, other-cloud, other-region) in the event of an issue." 362 | }, 363 | "improvementPlan": { 364 | "displayText": "DNS health checks should be enabled where applicable." 365 | } 366 | } 367 | ], 368 | "riskRules": [ 369 | { 370 | "condition": "lb_health_checks && dns_health_checks", 371 | "risk": "NO_RISK" 372 | }, 373 | { 374 | "condition": "(lb_health_checks && !dns_health_checks) || (!lb_health_checks && dns_health_checks)", 375 | "risk": "MEDIUM_RISK" 376 | }, 377 | { 378 | "condition": "default", 379 | "risk": "HIGH_RISK" 380 | } 381 | ] 382 | }, 383 | { 384 | "id": "architecture_failure_testing", 385 | "title": "Failure Testing (H)", 386 | "description": "Have you tested single-AZ and single-host failures to ensure that automated fail-away occurs where expected?", 387 | "choices": [ 388 | { 389 | "id": "single_host", 390 | "title": "Single host failure has been successfully tested", 391 | "helpfulResource": { 392 | "displayText": "Ensure relevant applications are leveraging autoscaling groups or container orchestration that allows for automatic replacement of failed nodes." 393 | }, 394 | "improvementPlan": { 395 | "displayText": "Single host failure can be tested by either terminating an in-use EC2 instance at random or by causing a BSOD/Kernel Panic from within the OS."
396 | } 397 | }, 398 | { 399 | "id": "single_az", 400 | "title": "Single availability-zone failure has been successfully tested.", 401 | "helpfulResource": { 402 | "displayText": "Ensure that applications are leveraging autoscaling groups across multiple AZs, RDS instances are configured for Multi-AZ deployments, ELBs are configured across multiple AZs, and ElastiCache instances are configured for multiple AZs with redundancy." 403 | }, 404 | "improvementPlan": { 405 | "displayText": "Single AZ failures can be tested by leveraging VPC NACLs to block all traffic going to those subnets. " 406 | } 407 | } 408 | ], 409 | "riskRules": [ 410 | { 411 | "condition": "single_host && single_az", 412 | "risk": "NO_RISK" 413 | }, 414 | { 415 | "condition": "(single_host && !single_az) || (!single_host && single_az)", 416 | "risk": "MEDIUM_RISK" 417 | }, 418 | { 419 | "condition": "default", 420 | "risk": "HIGH_RISK" 421 | } 422 | ] 423 | }, 424 | { 425 | "id": "architecture_demand_estimates", 426 | "title": "Demand Estimations (H)", 427 | "description": "What are your forecasted estimates for customer demand? Have you tested your services to ensure they can handle your estimates? Are there specific performance requirements the application must meet? Note what those demand expectations are.", 428 | "choices": [ 429 | { 430 | "id": "firmly_known", 431 | "title": "Customer demand well known or throttled.", 432 | "helpfulResource": { 433 | "displayText": "The level of demand is well known, or throttling mechanisms exist in order to limit the level of demand to sustainable levels. " 434 | }, 435 | "improvementPlan": { 436 | "displayText": "Engage with business leadership to understand the expected level of traffic, or if throttling is acceptable. In the event of neither option being available, consider pre-warming any autoscaling groups or static nodes and then scaling down once load is understood."
437 | } 438 | } 439 | ], 440 | "riskRules": [ 441 | { 442 | "condition": "firmly_known", 443 | "risk": "NO_RISK" 444 | }, 445 | { 446 | "condition": "default", 447 | "risk": "HIGH_RISK" 448 | } 449 | ] 450 | }, 451 | { 452 | "id": "architecture_load_testing", 453 | "title": "Load & Penetration Testing (H)", 454 | "description": "Have you performed multiple rounds of load testing to discover and address any unexpected performance bottlenecks and establish known breaking points? Has penetration testing been completed to detect security vulnerabilities? Note any issues that have arisen in the course of testing.", 455 | "choices": [ 456 | { 457 | "id": "load_known", 458 | "title": "The anticipated load and capacity plan are documented. Provide a link to the load documentation and capacity plan in the notes.", 459 | "helpfulResource": { 460 | "displayText": "Document the expected traffic load / traffic requirements for the service." 461 | }, 462 | "improvementPlan": { 463 | "displayText": "Document the expected load and capacity plan in order to ensure stability."
464 | } 465 | }, 466 | { 467 | "id": "lb_1x", 468 | "title": "Tested to expected load / capacity requirement.", 469 | "helpfulResource": { 470 | "displayText": "Load testing was done at or beyond 1x the expected traffic load / traffic requirements." 471 | }, 472 | "improvementPlan": { 473 | "displayText": "Assuming that load levels are known, the application stack should be tested to the expected load in order to ensure stability.", 474 | "url": "https://aws.amazon.com/solutions/implementations/distributed-load-testing-on-aws/" 475 | } 476 | }, 477 | { 478 | "id": "lb_2x", 479 | "title": "Two times expected load / requirement.", 480 | "helpfulResource": { 481 | "displayText": "Load testing was done at or beyond 2x the expected traffic load / traffic requirements." 482 | }, 483 | "improvementPlan": { 484 | "displayText": "Assuming that load levels are known, the application stack should be tested to 2x the expected load in order to ensure stability." 485 | } 486 | }, 487 | { 488 | "id": "lb_3x", 489 | "title": "Three times expected load / requirement.", 490 | "helpfulResource": { 491 | "displayText": "Load testing was done at or beyond 3x the expected traffic load / traffic requirements." 492 | }, 493 | "improvementPlan": { 494 | "displayText": "Assuming that load levels are known, the application stack should be tested to 3x the expected load in order to ensure stability." 495 | } 496 | }, 497 | { 498 | "id": "lb_xx", 499 | "title": "Load testing performed to break-point.", 500 | "helpfulResource": { 501 | "displayText": "Load testing was pressed to the breaking point above and beyond any specific targets. Note the break-point." 502 | }, 503 | "improvementPlan": { 504 | "displayText": "If expected user loads are not known, or if the operators are especially cautious, the application/service should be tested and pushed until it, or any other component, breaks down. 
This establishes the true 'upper limit' of what the stack can support and gives the operators a dry-run test of what such failure may look like once the application is live." 505 | } 506 | }, 507 | { 508 | "id": "pen", 509 | "title": "Security penetration testing has been completed.", 510 | "helpfulResource": { 511 | "displayText": "Test the AWS environment against defined security standards.", 512 | "url": "https://aws.amazon.com/security/penetration-testing/" 513 | }, 514 | "improvementPlan": { 515 | "displayText": "Conduct penetration testing against defined security standards.", 516 | "url": "https://aws.amazon.com/security/penetration-testing/" 517 | } 518 | } 519 | ], 520 | "riskRules": [ 521 | { 522 | "condition": "load_known && lb_1x && lb_2x && lb_3x && pen", 523 | "risk": "NO_RISK" 524 | }, 525 | { 526 | "condition": "load_known && lb_1x && pen && (!lb_2x || !lb_3x || !lb_xx)", 527 | "risk": "MEDIUM_RISK" 528 | }, 529 | { 530 | "condition": "default", 531 | "risk": "HIGH_RISK" 532 | } 533 | ] 534 | }, 535 | { 536 | "id": "architecture_defensive_throttling", 537 | "title": "Defensive Throttling (H)", 538 | "description": "What defensive mechanisms are you using to protect your service from customers?", 539 | "choices": [ 540 | { 541 | "id": "cdn", 542 | "title": "Content delivery network", 543 | "helpfulResource": { 544 | "displayText": "A content delivery network protects a service by distributing its assets to a variety of locations closer to the user's locations, limiting the number of requests that need to be serviced by the origin." 
545 | }, 546 | "improvementPlan": { 547 | "displayText": "Speak with the account team on implementing a CDN in front of your application, such as Amazon CloudFront, in order to cache assets.", 548 | "url": "https://aws.amazon.com/cloudfront/" 549 | } 550 | }, 551 | { 552 | "id": "waf", 553 | "title": "Web application firewall", 554 | "helpfulResource": { 555 | "displayText": "A web application firewall helps to protect a service from common attacks such as distributed denial of service and injection attacks, and can filter traffic by the geographic location of users or by known bad IPs." 556 | }, 557 | "improvementPlan": { 558 | "displayText": "Speak with the account team on implementing a web application firewall, such as AWS WAF, in order to protect the service from common attack patterns and exploits.", 559 | "url": "https://aws.amazon.com/waf/" 560 | } 561 | }, 562 | { 563 | "id": "autoscaling", 564 | "title": "Autoscaling compute nodes", 565 | "helpfulResource": { 566 | "displayText": "Autoscaling allows for scaling the compute nodes to meet customer demand. While this does not prevent users from hitting the origin service, it does allow the origin service to attempt to meet the demand." 567 | }, 568 | "improvementPlan": { 569 | "displayText": "Applications should leverage Amazon EC2 Auto Scaling and load testing should be done in order to verify that the application, and all its downstream dependencies, can handle scaling out to the maximum configured.", 570 | "url": "https://aws.amazon.com/ec2/autoscaling/" 571 | } 572 | }, 573 | { 574 | "id": "queueing", 575 | "title": "User sessions are queued", 576 | "helpfulResource": { 577 | "displayText": "An alternative to autoscaling, limiting the number of concurrent users that can interact with the application or system, a waiting room based system, is a highly effective way to ensure that the system does not get overwhelmed by requests."
578 | }, 579 | "improvementPlan": { 580 | "displayText": "If appropriate to the application type, sit down with the account team to discuss methods to include a waiting room or lobby to the application's user flow. " 581 | } 582 | }, 583 | { 584 | "id": "async_execution", 585 | "title": "User requests are received and processed asynchronously.", 586 | "helpfulResource": { 587 | "displayText": "An alternative to queueing user sessions, asynchronous execution of user requests involves accepting the user provided input and adding it to a queueing mechanism or other storage and then processing them as resources allow with a notification back to the user (such as an email) when their request has been completed." 588 | }, 589 | "improvementPlan": { 590 | "displayText": "If appropriate to the application type, work with the account team to discuss methods to modify the application workflow to be asynchronous and prompt the user for a callback-style mechanism to inform them of when their request has been completed." 591 | } 592 | } 593 | ], 594 | "riskRules": [ 595 | { 596 | "condition": "autoscaling && waf && cdn", 597 | "risk": "NO_RISK" 598 | }, 599 | { 600 | "condition": "autoscaling && (!waf || !cdn)", 601 | "risk": "MEDIUM_RISK" 602 | }, 603 | { 604 | "condition": "default", 605 | "risk": "HIGH_RISK" 606 | } 607 | ] 608 | }, 609 | { 610 | "id": "architecture_data_corruption", 611 | "title": "Data Corruption Recovery (H)", 612 | "description": "In the case of logical or physical data corruption, how do you detect, recover, and verify your data and service?", 613 | "choices": [ 614 | { 615 | "id": "risk_mitigated", 616 | "title": "Relevant risk has been mitigated", 617 | "helpfulResource": { 618 | "displayText": "Detecting data corruption is dependent upon how the data is collected and what type of data it is. Regular expressions, comparing against JSON schemas, or type checking can detect malformed input data. 
What may be more difficult is detecting inputs that are technically correct but junk. Recovering data depends on sanitization or on having backups." 619 | }, 620 | "improvementPlan": { 621 | "displayText": "Refer to question's notes on the exact data integrity concerns and consult with the account team TAM & SA to identify potential remediations or mitigations." 622 | } 623 | } 624 | ], 625 | "riskRules": [ 626 | { 627 | "condition": "risk_mitigated", 628 | "risk": "NO_RISK" 629 | }, 630 | { 631 | "condition": "default", 632 | "risk": "HIGH_RISK" 633 | } 634 | ] 635 | }, 636 | { 637 | "id": "architecture_rpo_rto", 638 | "title": "Recovery Objectives (M)", 639 | "description": "Assuming your service or workload suffers data loss, have you defined your Recovery Point Objective? If many varying RPOs exist per component, define them in notes.", 640 | "choices": [ 641 | { 642 | "id": "rto_defined", 643 | "title": "RTO has been defined", 644 | "helpfulResource": { 645 | "displayText": "Recovery Time Objective is a measure of the amount of acceptable downtime per incident, for example five minutes, 30 minutes, an hour, a day, etc." 646 | }, 647 | "improvementPlan": { 648 | "displayText": "The account team, application and ops teams should work together with the business team in order to identify a supportable recovery time objective based upon design and customer agreements." 649 | } 650 | }, 651 | { 652 | "id": "rto_verified", 653 | "title": "RTO has been verified through a dry-run or game-day exercise", 654 | "helpfulResource": { 655 | "displayText": "A theoretical RTO is a good starting point, but until the teams have verified their ability to support it, it is difficult to rely upon." 656 | }, 657 | "improvementPlan": { 658 | "displayText": "A game-day, dry-run or similar exercise should be used to ensure that all relevant teams know what actions need to be taken in the event of an outage. Drafting a written runbook may be useful for documentation purposes. 
A hot or cold stand-by environment may also be useful in order to achieve faster RTO by evacuating the primary environment." 659 | } 660 | }, 661 | { 662 | "id": "rpo_defined", 663 | "title": "RPO has been defined", 664 | "helpfulResource": { 665 | "displayText": "The Recovery Point Objective is a measure of how much data loss is acceptable following each incident. For example, is needing to restore from backups taken the day prior sufficient, or do the backups need to be taken more frequently? " 666 | }, 667 | "improvementPlan": { 668 | "displayText": "The account team, application and ops teams should work together with the business team in order to identify a supportable recovery point objective based upon design and customer agreements." 669 | } 670 | }, 671 | { 672 | "id": "backups_verified", 673 | "title": "Backups supporting RPO have been verified.", 674 | "helpfulResource": { 675 | "displayText": "Unverified backups cannot be relied upon. Any backups taken in support of recovery efforts should be periodically tested in order to ensure viability." 676 | }, 677 | "improvementPlan": { 678 | "displayText": "Risk has been mitigated, no further improvements needed. " 679 | } 680 | } 681 | ], 682 | "riskRules": [ 683 | { 684 | "condition": "rto_defined && rto_verified && rpo_defined && backups_verified", 685 | "risk": "NO_RISK" 686 | }, 687 | { 688 | "condition": "(rto_defined && rpo_defined) && (!rto_verified || !backups_verified)", 689 | "risk": "MEDIUM_RISK" 690 | }, 691 | { 692 | "condition": "default", 693 | "risk": "HIGH_RISK" 694 | } 695 | ] 696 | }, 697 | { 698 | "id": "architecture_gradeful_recovery", 699 | "title": "Graceful Recovery (H)", 700 | "description": "In the case of a large scale service failure, if your service recovers before its dependencies, does it fail in a way that is acceptable (i.e. 
a form of graceful degradation, such as an error) or in an unacceptable/unpredictable way?", 701 | "choices": [ 702 | { 703 | "id": "health_checks", 704 | "title": "Application performs health checks against dependent services.", 705 | "helpfulResource": { 706 | "displayText": "The application performs health checks against its dependent services in order to proactively catch issues and keep unnecessary load off of those systems. " 707 | }, 708 | "improvementPlan": { 709 | "displayText": "The application should " 710 | } 711 | }, 712 | { 713 | "id": "error_handling", 714 | "title": "Application catches errors and exceptions correctly.", 715 | "helpfulResource": { 716 | "displayText": "The application's logic code includes error handling and exception catching for issues that arise due to problems with external dependencies." 717 | }, 718 | "improvementPlan": { 719 | "displayText": "Have a discussion with the account team on if this risk can or should be further mitigated prior to go-live." 720 | } 721 | }, 722 | { 723 | "id": "alternate_codepaths", 724 | "title": "An alternate codepath exists that can be executed in the event of dependency issues.", 725 | "helpfulResource": { 726 | "displayText": "The application has alternative logic and code paths that can be leveraged in the event that its dependencies have issues, such as queueing or otherwise storing transactions to be replayed later." 727 | }, 728 | "improvementPlan": { 729 | "displayText": "Work with the account team to mitigate the risk prior to go live." 730 | } 731 | }, 732 | { 733 | "id": "tested", 734 | "title": "The application has tested a scenario where its dependencies are non-functional.", 735 | "helpfulResource": { 736 | "displayText": "The application has undergone a specific test to see how it responds when its dependencies are non-functional, testing error conditions, and to test alternative codepaths if they exist. The testing was successful."
737 | }, 738 | "improvementPlan": { 739 | "displayText": "Work with the account team to mitigate the risk prior to go live." 740 | } 741 | } 742 | ], 743 | "riskRules": [ 744 | { 745 | "condition": "tested && alternate_codepaths && error_handling && health_checks", 746 | "risk": "NO_RISK" 747 | }, 748 | { 749 | "condition": "(tested || alternate_codepaths || error_handling || health_checks) && (!tested || !alternate_codepaths || !error_handling || !health_checks)", 750 | "risk": "MEDIUM_RISK" 751 | }, 752 | { 753 | "condition": "default", 754 | "risk": "HIGH_RISK" 755 | } 756 | ] 757 | }, 758 | { 759 | "id": "architecture_dependency_retry", 760 | "title": "Dependency Retry/Backoff (M)", 761 | "description": "What is the retry/back-off strategy for each of your service's dependencies?", 762 | "choices": [ 763 | { 764 | "id": "aws_services", 765 | "title": "Code calling out to AWS APIs implements proper retry and backoff strategies.", 766 | "helpfulResource": { 767 | "displayText": "It is an AWS best practice, and a required practice for large applications, to properly catch ThrottlingExceptions and implement retry, backoff, and jitter strategies." 768 | }, 769 | "improvementPlan": { 770 | "displayText": "While some of the AWS SDKs will properly capture ThrottlingExceptions and automatically handle retries and backoff conditions, those retries are limited to a set number of attempts and those errors could still be raised to the application. Those errors should be caught and handled appropriately with fail-safe code paths." 771 | } 772 | }, 773 | { 774 | "id": "third_parties", 775 | "title": "Code calling out to non-AWS APIs implements retry and backoff strategies.", 776 | "helpfulResource": { 777 | "displayText": "It is a recommended best practice to implement retries and backoff strategies for dependencies which emit dedicated throttling errors."
778 | }, 779 | "improvementPlan": { 780 | "displayText": "Assuming that the third party APIs properly emit throttling or other performance errors, those errors should be caught and sane-failure modes should be enacted, either slowing down the rate that the application talks to that third party or skipping over it entirely. Similar failure conditions should be implemented for the third party service being hard-down." 781 | } 782 | } 783 | ], 784 | "riskRules": [ 785 | { 786 | "condition": "aws_services && third_parties", 787 | "risk": "NO_RISK" 788 | }, 789 | { 790 | "condition": "default", 791 | "risk": "HIGH_RISK" 792 | } 793 | ] 794 | }, 795 | { 796 | "id": "architecture_mulitaccount_stragy", 797 | "title": "Multi-account Strategy (M)", 798 | "description": "Does your system use at least one AWS account for every stage and fault container (Region or zone) pair?", 799 | "choices": [ 800 | { 801 | "id": "acct_separated_by_stage", 802 | "title": "Stages of development separated by account", 803 | "helpfulResource": { 804 | "displayText": "Separating Development/Staging/Production from one another through account boundaries is a highly effective way to segregate failure domains as well as security borders." 805 | }, 806 | "improvementPlan": { 807 | "displayText": "If time allows, development and staging should be broken out from production into separate accounts. 
This reduces the blast radius of security events and makes it easier to manage limits & API throttling across the environments.", 808 | "url": "https://docs.aws.amazon.com/managedservices/latest/userguide/malz-net-arch-section.html" 809 | } 810 | }, 811 | { 812 | "id": "acct_separated_by_region", 813 | "title": "Deployed regions are separated by account.", 814 | "helpfulResource": { 815 | "displayText": "For highly critical applications, separating the production account for us-east-1 from the production account supporting us-west-2 can be beneficial in order to prevent multi-region failures due to scripts running awry or a security incident." 816 | }, 817 | "improvementPlan": { 818 | "displayText": "If multi-region durability is a major concern, an architecture review with the account team may be warranted in order to discuss how to break out a multi-region production environment to a multi-account architecture.", 819 | "url": "https://docs.aws.amazon.com/managedservices/latest/userguide/malz-net-arch-section.html" 820 | } 821 | } 822 | ], 823 | "riskRules": [ 824 | { 825 | "condition": "acct_separated_by_stage && acct_separated_by_region", 826 | "risk": "NO_RISK" 827 | }, 828 | { 829 | "condition": "default", 830 | "risk": "HIGH_RISK" 831 | } 832 | ] 833 | }, 834 | { 835 | "id": "architecture_mutliaccount_credentials", 836 | "title": "Multiaccount Credentials (M)", 837 | "description": "If you're using multiple AWS accounts, how are your accounts and credentials segregated?", 838 | "choices": [ 839 | { 840 | "id": "credentials_separated", 841 | "title": "Relevant risk has been mitigated", 842 | "helpfulResource": { 843 | "displayText": "Separation of accounts and the credentials to access them are an important component of minimizing access. Engineers should not typically have access to systems that they are not responsible for. Greater-than-required levels of access are prime opportunities for accidental impact or for security incidents." 
844 | }, 845 | "improvementPlan": { 846 | "displayText": "Work with the account team to segregate levels of access to teams that need it. AD Groups & SAML can be used to limit the accounts that engineers have access to, if using SSO, and hard-coded AWS credentials should be segregated to the teams that may need them, or at the executive level for root credentials." 847 | } 848 | } 849 | ], 850 | "riskRules": [ 851 | { 852 | "condition": "credentials_separated", 853 | "risk": "NO_RISK" 854 | }, 855 | { 856 | "condition": "default", 857 | "risk": "HIGH_RISK" 858 | } 859 | ] 860 | }, 861 | { 862 | "id": "architecture_shared_resources_redundancy", 863 | "title": "Resources shared across regions (M)", 864 | "description": "Do you currently use any resources that are shared across redundant zones? For example, if your service is regionally redundant, are any resources shared across those regions?", 865 | "choices": [ 866 | { 867 | "id": "regional_isolation", 868 | "title": "No cross-region resources exist", 869 | "helpfulResource": { 870 | "displayText": "An example of this setup would be a multi-region setup where each region is fully isolated, likely requiring that a USA user be serviced from a specific USA region." 871 | }, 872 | "improvementPlan": { 873 | "displayText": "Improving this item depends on the exact business objectives and applicable laws & regulations. It may be desirable to have a siloed architecture due to legal requirements regarding user data." 874 | } 875 | }, 876 | { 877 | "id": "geographically_redundant", 878 | "title": "All cross-region resources are redundantly available.", 879 | "helpfulResource": { 880 | "displayText": "An example would be redundant databases geographically separated that can support any user connecting from anywhere." 881 | }, 882 | "improvementPlan": { 883 | "displayText": "Improving this item depends on the exact business objectives and applicable laws and regulations. 
It may be desirable for the resources to be siloed in order to have a firmly defined blast radius during an outage, or to abide by local laws. DynamoDB Global Tables would be one example of a multi-region database." 884 | } 885 | } 886 | ], 887 | "riskRules": [ 888 | { 889 | "condition": "regional_isolation || geographically_redundant", 890 | "risk": "NO_RISK" 891 | }, 892 | { 893 | "condition": "default", 894 | "risk": "HIGH_RISK" 895 | } 896 | ] 897 | }, 898 | { 899 | "id": "architecture_certificates", 900 | "title": "Software Certificates (L)", 901 | "description": "Are SSL/TLS Certificates used within your system stored in AWS Certificate Manager? If not, where are they stored? Non-TLS certificates and API keys are handled in another question. ", 902 | "choices": [ 903 | { 904 | "id": "exp_alarm", 905 | "title": "Certificate expiration has notifications or alarms in place in the days leading up.", 906 | "helpfulResource": { 907 | "displayText": "Alarms are in place in order to alert operators to soon-to-be expired TLS certificates." 908 | }, 909 | "improvementPlan": { 910 | "displayText": "CloudWatch alarms should be created that alarm on soon-to-be-expiring certificates for key certificates.", 911 | "url": "https://docs.aws.amazon.com/acm/latest/userguide/monitoring-and-logging.html" 912 | } 913 | }, 914 | { 915 | "id": "automated_renewal", 916 | "title": "The renewal of certificates is an automated process.", 917 | "helpfulResource": { 918 | "displayText": "Automated systems handle the renewal of TLS certificates with alerts to operators in the event of failure."
919 | }, 920 | "improvementPlan": { 921 | "displayText": "Automated workflows (scripts, SSM Runbooks, other) should be leveraged to generate replacement certificates and stage them for deployment.", 922 | "url": "https://docs.aws.amazon.com/acm/latest/userguide/managed-renewal.html" 923 | } 924 | }, 925 | { 926 | "id": "automated_deployments", 927 | "title": "Renewed certificates are automatically deployed to relevant systems.", 928 | "helpfulResource": { 929 | "displayText": "Automated systems are leveraged in order to automatically switch out the old TLS certificates with new ones, with alarms in place to alert operators in the event of failure." 930 | }, 931 | "improvementPlan": { 932 | "displayText": "Automated workflows (scripts, SSM Runbooks, config management) should be leveraged to deploy staged replacement certificates to relevant resources." 933 | } 934 | } 935 | ], 936 | "riskRules": [ 937 | { 938 | "condition": "exp_alarm && automated_renewal && automated_deployments", 939 | "risk": "NO_RISK" 940 | }, 941 | { 942 | "condition": "!exp_alarm || !automated_renewal || !automated_deployments", 943 | "risk": "MEDIUM_RISK" 944 | }, 945 | { 946 | "condition": "default", 947 | "risk": "HIGH_RISK" 948 | } 949 | ] 950 | } 951 | ] 952 | }, 953 | { 954 | "id": "release_quality", 955 | "name": "02 - Release Quality & Procedures", 956 | "questions": [ 957 | { 958 | "id": "releases_deployment", 959 | "title": "Deployment Mechanisms (H)", 960 | "description": "What mechanisms are you utilizing to deploy to your production systems?", 961 | "choices": [ 962 | { 963 | "id": "automated_in_place", 964 | "title": "In-place updates are made to production systems using automation", 965 | "helpfulResource": { 966 | "displayText": "Automated deployments are done to existing systems, updating them in place. " 967 | }, 968 | "improvementPlan": { 969 | "displayText": "Have a discussion with the account team on if this risk can or should be further mitigated prior to go-live. 
What does the rollback plan look like for a failed deployment or bad release?", 970 | "url": "https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/welcome.html" 971 | } 972 | }, 973 | { 974 | "id": "blue_green", 975 | "title": "Blue/Green deployments are leveraged.", 976 | "helpfulResource": { 977 | "displayText": "Automated systems handle blue/green deployments to production systems. " 978 | }, 979 | "improvementPlan": { 980 | "displayText": "Blue/Green deployments are not applicable" 981 | } 982 | } 983 | ], 984 | "riskRules": [ 985 | { 986 | "condition": "blue_green && !automated_in_place", 987 | "risk": "NO_RISK" 988 | }, 989 | { 990 | "condition": "automated_in_place && !blue_green", 991 | "risk": "MEDIUM_RISK" 992 | }, 993 | { 994 | "condition": "default", 995 | "risk": "HIGH_RISK" 996 | } 997 | ] 998 | }, 999 | { 1000 | "id": "releases_change_management", 1001 | "title": "Change Management (H)", 1002 | "description": "Do you have a mechanism to ensure all code changes (software, configuration, infrastructure, and operational tooling) to production systems are reviewed and approved by someone other than the code author?", 1003 | "choices": [ 1004 | { 1005 | "id": "risk_mitigated", 1006 | "title": "Relevant risk has been mitigated", 1007 | "helpfulResource": { 1008 | "displayText": "The risk presented here has been fully mitigated with no lingering questions or concerns that need to be followed up on." 1009 | }, 1010 | "improvementPlan": { 1011 | "displayText": "Risk has been mitigated, no further improvements needed. 
" 1012 | } 1013 | } 1014 | ], 1015 | "riskRules": [ 1016 | { 1017 | "condition": "risk_mitigated", 1018 | "risk": "NO_RISK" 1019 | }, 1020 | { 1021 | "condition": "default", 1022 | "risk": "HIGH_RISK" 1023 | } 1024 | ] 1025 | }, 1026 | { 1027 | "id": "releases_manual_changes", 1028 | "title": "Manual Changes (H)", 1029 | "description": "What changes (software, configuration, and infrastructure) do you need to perform manually? Have these changes been modeled in a pre-approved template prior to executing?", 1030 | "choices": [ 1031 | { 1032 | "id": "no_manual_changes", 1033 | "title": "No manual changes performed", 1034 | "helpfulResource": { 1035 | "displayText": "Manual changes introduce risk of missed steps, typos, or other human errors." 1036 | }, 1037 | "improvementPlan": { 1038 | "displayText": "Any manual tasks should be documented and turned into automated scripts or other automated processes to remove the opportunity for human error. Any task that cannot be automated should be documented in a clearly defined step-by-step process in order to reduce the opportunity for errors." 1039 | } 1040 | } 1041 | ], 1042 | "riskRules": [ 1043 | { 1044 | "condition": "no_manual_changes", 1045 | "risk": "NO_RISK" 1046 | }, 1047 | { 1048 | "condition": "default", 1049 | "risk": "HIGH_RISK" 1050 | } 1051 | ] 1052 | }, 1053 | { 1054 | "id": "releases_canaries", 1055 | "title": "Deployment Canaries (H)", 1056 | "description": "Does your service have canaries that call your service through its public endpoints to validate happy path functionality for all public and private APIs, critical customer scenarios, and UIs in all regions?", 1057 | "choices": [ 1058 | { 1059 | "id": "availability_canaries", 1060 | "title": "Canaries are leveraged to monitor application availability post deployment.", 1061 | "helpfulResource": { 1062 | "displayText": "Even after a 'technically' successful deployment, things could go wrong that lead to the application being unavailable to the consumer. 
Canaries can check for those errors by loading the application." 1063 | }, 1064 | "improvementPlan": { 1065 | "displayText": "Synthetic canaries should be in place to continually monitor the uptime and stability of the application in order to ensure that successful deployments do not impact availability." 1066 | } 1067 | }, 1068 | { 1069 | "id": "performance_canaries", 1070 | "title": "Canaries are leveraged to monitor application performance post deployment.", 1071 | "helpfulResource": { 1072 | "displayText": "Even if the application is up and running, performance canaries can inform operators if the last deployment had a noticeable impact on user experience." 1073 | }, 1074 | "improvementPlan": { 1075 | "displayText": "If transaction tracing or other performance monitoring is available, synthetic transactions or canaries should be set up in order to monitor the performance characteristics of the application and compare that data against historical averages to ensure that a deployment does not significantly degrade the user experience."
1076 | } 1077 | } 1078 | ], 1079 | "riskRules": [ 1080 | { 1081 | "condition": "availability_canaries && performance_canaries", 1082 | "risk": "NO_RISK" 1083 | }, 1084 | { 1085 | "condition": "(performance_canaries && !availability_canaries) || (!performance_canaries && availability_canaries)", 1086 | "risk": "MEDIUM_RISK" 1087 | }, 1088 | { 1089 | "condition": "default", 1090 | "risk": "HIGH_RISK" 1091 | } 1092 | ] 1093 | }, 1094 | { 1095 | "id": "releases_staged_deployments", 1096 | "title": "Staged Deployments (M)", 1097 | "description": "Are your deployments first staged in a pre-production or staging environment before reaching a production environment?", 1098 | "choices": [ 1099 | { 1100 | "id": "preprod_testing", 1101 | "title": "Deployments are tested in pre-production environments", 1102 | "helpfulResource": { 1103 | "displayText": "Whether through automated or manual systems, deployments should be tested in pre-production environments prior to being admitted to the production environment." 1104 | }, 1105 | "improvementPlan": { 1106 | "displayText": "Pre-production testing and the development of a testing matrix are critical for ensuring minimal user impact." 1107 | } 1108 | }, 1109 | { 1110 | "id": "automated_progression", 1111 | "title": "Deployments automatically proceed using the same code from one environment to the next.", 1112 | "helpfulResource": { 1113 | "displayText": "In order to ease the burden on engineers, testing pipelines should automatically test and then promote changes from one environment to the next." 1114 | }, 1115 | "improvementPlan": { 1116 | "displayText": "Have a discussion with the account team about whether this risk can or should be further mitigated prior to go-live."
1117 | } 1118 | } 1119 | ], 1120 | "riskRules": [ 1121 | { 1122 | "condition": "preprod_testing", 1123 | "risk": "NO_RISK" 1124 | }, 1125 | { 1126 | "condition": "default", 1127 | "risk": "HIGH_RISK" 1128 | } 1129 | ] 1130 | }, 1131 | { 1132 | "id": "releases_onebox_deployments", 1133 | "title": "One Box Deployments (M)", 1134 | "description": "Are your deployments tested in a production one-box stage before deploying to the region's production stage? How does the deploy escalate to the full deployment?", 1135 | "choices": [ 1136 | { 1137 | "id": "onebox", 1138 | "title": "Relevant risk has been mitigated", 1139 | "helpfulResource": { 1140 | "displayText": "The risk presented here has been fully mitigated with no lingering questions or concerns that need to be followed up on." 1141 | }, 1142 | "improvementPlan": { 1143 | "displayText": "Risk has been mitigated, no further improvements needed. " 1144 | } 1145 | }, 1146 | { 1147 | "id": "one_az", 1148 | "title": "Relevant risk has been mitigated", 1149 | "helpfulResource": { 1150 | "displayText": "The risk presented here has been fully mitigated with no lingering questions or concerns that need to be followed up on." 1151 | }, 1152 | "improvementPlan": { 1153 | "displayText": "Risk has been mitigated, no further improvements needed. " 1154 | } 1155 | }, 1156 | { 1157 | "id": "one_region", 1158 | "title": "Relevant risk has been mitigated", 1159 | "helpfulResource": { 1160 | "displayText": "The risk presented here has been fully mitigated with no lingering questions or concerns that need to be followed up on." 1161 | }, 1162 | "improvementPlan": { 1163 | "displayText": "Risk has been mitigated, no further improvements needed. 
" 1164 | } 1165 | } 1166 | ], 1167 | "riskRules": [ 1168 | { 1169 | "condition": "onebox && one_az && one_region", 1170 | "risk": "NO_RISK" 1171 | }, 1172 | { 1173 | "condition": "!onebox", 1174 | "risk": "MEDIUM_RISK" 1175 | }, 1176 | { 1177 | "condition": "default", 1178 | "risk": "HIGH_RISK" 1179 | } 1180 | ] 1181 | }, 1182 | { 1183 | "id": "releases_deployment_rollback", 1184 | "title": "Automated Deployment Rollback (M)", 1185 | "description": "Do your deployments automatically rollback incorrect deployments before they breach SLAs?", 1186 | "choices": [ 1187 | { 1188 | "id": "manual_rollback", 1189 | "title": "Manual rollbacks can be initiated by operators", 1190 | "helpfulResource": { 1191 | "displayText": "Simple rollback mechanisms allow for operators to make the call on whether a given deployment is going to succeed or not after problems arise." 1192 | }, 1193 | "improvementPlan": { 1194 | "displayText": "If manual rollback is not being used currently because it is not supported, then the fail-forward plan should be clearly documented " 1195 | } 1196 | }, 1197 | { 1198 | "id": "auto_rollback", 1199 | "title": "Automatic rollbacks are initiated by monitoring systems.", 1200 | "helpfulResource": { 1201 | "displayText": "Automated rollback mechanisms free up operator time and allows for faster response to a deployment that is not going as planned. This requires deployment metrics to be configured, such as canary alarms." 1202 | }, 1203 | "improvementPlan": { 1204 | "displayText": "In blue/green deployment environments, automatically reverting back to the previous environment can be a safe choice in order to minimize user-impact. Assuming that the app supports rolling back to a previous version, discuss with the account team what parts of the manual steps can be turned into automation and how that automation can be triggered." 
1205 | } 1206 | } 1207 | ], 1208 | "riskRules": [ 1209 | { 1210 | "condition": "auto_rollback", 1211 | "risk": "NO_RISK" 1212 | }, 1213 | { 1214 | "condition": "manual_rollback", 1215 | "risk": "MEDIUM_RISK" 1216 | }, 1217 | { 1218 | "condition": "default", 1219 | "risk": "HIGH_RISK" 1220 | } 1221 | ] 1222 | }, 1223 | { 1224 | "id": "releases_deployment_gating", 1225 | "title": "Deployment Gating (M)", 1226 | "description": "Do you prevent your system from deploying to too many hosts at once?", 1227 | "choices": [ 1228 | { 1229 | "id": "risk_mitigated", 1230 | "title": "Deployments gated to static number or percent of hosts", 1231 | "helpfulResource": { 1232 | "displayText": "Limiting deployments to a set number of hosts, or a certain percentage, ensures that there is a healthy capacity of available nodes at a given time." 1233 | }, 1234 | "improvementPlan": { 1235 | "displayText": "If using ECS, SSM or CodeDeploy, review relevant feature sets to limit the number of nodes that a given command/upgrade/deployment is against at a time." 1236 | } 1237 | } 1238 | ], 1239 | "riskRules": [ 1240 | { 1241 | "condition": "risk_mitigated", 1242 | "risk": "NO_RISK" 1243 | }, 1244 | { 1245 | "condition": "default", 1246 | "risk": "HIGH_RISK" 1247 | } 1248 | ] 1249 | }, 1250 | { 1251 | "id": "releases_performance_impact", 1252 | "title": "Performance Impact (M)", 1253 | "description": "If your deployment drastically alters latency and throughput measurements, what mechanisms do you use to detect this change before deploying to production?", 1254 | "choices": [ 1255 | { 1256 | "id": "test_profiling", 1257 | "title": "Performance is tested as part of standard pre-production tests.", 1258 | "helpfulResource": { 1259 | "displayText": "Key performance metrics are monitored as part of the pre-production testing phase. Major (negative) alterations to the test result in failed tests." 
1260 | }, 1261 | "improvementPlan": { 1262 | "displayText": "The performance of an application is nearly as important as the availability of the application. Pre-production tests should not only test code quality and functionality but also the performance of the functionality in order to avoid introducing a negative user experience." 1263 | } 1264 | }, 1265 | { 1266 | "id": "production_canaries", 1267 | "title": "Canaries continually test common API calls.", 1268 | "helpfulResource": { 1269 | "displayText": "Automated canaries in production can continually test common API calls and measure their user-facing performance, such as latency and time to completion, and alarm if that performance deviates from expectations." 1270 | }, 1271 | "improvementPlan": { 1272 | "displayText": "Implement automated API calls that exercise the most common, or most critical, code paths as a user would, and monitor the latency and time it takes for those code paths to complete. Sustained major deviations from that baseline should alert operators to a potential problem." 1273 | } 1274 | }, 1275 | { 1276 | "id": "transaction_tracing", 1277 | "title": "Per-transaction performance metrics are logged.", 1278 | "helpfulResource": { 1279 | "displayText": "Each transaction has its performance metrics reported such that percentile metrics can be captured and monitored for downward trends." 1280 | }, 1281 | "improvementPlan": { 1282 | "displayText": "Work with the account team to receive an introduction to AWS X-Ray, an SDK and service that enables transaction tracing."
1283 | } 1284 | } 1285 | ], 1286 | "riskRules": [ 1287 | { 1288 | "condition": "test_profiling && production_canaries && transaction_tracing", 1289 | "risk": "NO_RISK" 1290 | }, 1291 | { 1292 | "condition": "(test_profiling && (!production_canaries || !transaction_tracing)) || (transaction_tracing && (!production_canaries || !test_profiling)) || (production_canaries && (!test_profiling || !transaction_tracing))", 1293 | "risk": "MEDIUM_RISK" 1294 | }, 1295 | { 1296 | "condition": "default", 1297 | "risk": "HIGH_RISK" 1298 | } 1299 | ] 1300 | }, 1301 | { 1302 | "id": "releases_validation", 1303 | "title": "Deployment Validation (L)", 1304 | "description": "Do your deployments run on-host validation tests to verify that the software has started successfully and is responding correctly to health checks?", 1305 | "choices": [ 1306 | { 1307 | "id": "risk_mitigated", 1308 | "title": "On-host validation", 1309 | "helpfulResource": { 1310 | "displayText": "Post-deployment validation is critical to ensure that the deployed software executes correctly and functions as intended. Examples include verifying that certain files are in place, services are in a running state, and configuration is as expected." 1311 | }, 1312 | "improvementPlan": { 1313 | "displayText": "Depending on deployment methodology (updating existing instances vs blue/green), consider adding a validation step to existing deployment mechanisms to validate seemingly successful deployments. In the event an error is found, that error needs to be propagated to operators and the deployment marked as failed." 1314 | } 1315 | } 1316 | ], 1317 | "riskRules": [ 1318 | { 1319 | "condition": "risk_mitigated", 1320 | "risk": "NO_RISK" 1321 | }, 1322 | { 1323 | "condition": "default", 1324 | "risk": "HIGH_RISK" 1325 | } 1326 | ] 1327 | }, 1328 | { 1329 | "id": "releases_canary_errors", 1330 | "title": "Independent Canary Errors (L)", 1331 | "description": "Do you publish your canary errors to an independent metric?
Subsequently, do you alarm on this metric?", 1332 | "choices": [ 1333 | { 1334 | "id": "canaries_exist", 1335 | "title": "Canary alarms have been configured", 1336 | "helpfulResource": { 1337 | "displayText": "Canary alarms can include things such as heartbeat checks, broken link detection, " 1338 | }, 1339 | "improvementPlan": { 1340 | "displayText": "Speak with the account team about getting a deep dive on CloudWatch Synthetics and its canary capabilities.", 1341 | "url": "https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries_Create.html" 1342 | } 1343 | }, 1344 | { 1345 | "id": "canary_ops", 1346 | "title": "Canary Metrics have alarms tied to them that engage the operations team.", 1347 | "helpfulResource": { 1348 | "displayText": "The primary purpose of canary alarms is to proactively catch downward trends in application or service health. As such, those alarms should notify operator teams so that they can investigate potential issues." 1349 | }, 1350 | "improvementPlan": { 1351 | "displayText": "Failing canaries should engage relevant teams (content teams for broken links, operators for heartbeats / API failures, etc.) at an appropriate severity level, such as tickets for broken links or paging for failed heartbeats.
" 1352 | } 1353 | } 1354 | ], 1355 | "riskRules": [ 1356 | { 1357 | "condition": "canaries_exist && canary_ops", 1358 | "risk": "NO_RISK" 1359 | }, 1360 | { 1361 | "condition": "canaries_exist && !canary_ops", 1362 | "risk": "MEDIUM_RISK" 1363 | }, 1364 | { 1365 | "condition": "default", 1366 | "risk": "HIGH_RISK" 1367 | } 1368 | ] 1369 | }, 1370 | { 1371 | "id": "releases_traffic_draining", 1372 | "title": "Traffic Draining (L)", 1373 | "description": "How is traffic diverted from a host before shutting down the processes?", 1374 | "choices": [ 1375 | { 1376 | "id": "elb_draining", 1377 | "title": "ELB Connection Draining is turned on and configured.", 1378 | "helpfulResource": { 1379 | "displayText": "Host draining dis-allows new connections to the backend node prior to it being taken out of commission, this reduces the user-facing impact to scale-in & maintenance operations." 1380 | }, 1381 | "improvementPlan": { 1382 | "displayText": "ELB Connection Draining should be enabled if required within the ELB console and an appropriate time-out set. ALB and NLB refer to Connection Draining as 'Deregistration Delay'. " 1383 | } 1384 | } 1385 | ], 1386 | "riskRules": [ 1387 | { 1388 | "condition": "elb_draining", 1389 | "risk": "NO_RISK" 1390 | }, 1391 | { 1392 | "condition": "default", 1393 | "risk": "HIGH_RISK" 1394 | } 1395 | ] 1396 | }, 1397 | { 1398 | "id": "releases_custom_amis", 1399 | "title": "Custom AMIs (L)", 1400 | "description": "If your system uses custom EC2 AMIs, do you have dedicated AMIs for non-production AWS accounts and production AWS accounts?", 1401 | "choices": [ 1402 | { 1403 | "id": "prebuilt_amis", 1404 | "title": "AMIs are leveraged in order to prebake in software and configuration", 1405 | "helpfulResource": { 1406 | "displayText": "By pre-baking software installation and configuration, customers get closer to immutable infrastructure and improve deployment times by ensuring there is minimal stand-up time required." 
1407 | }, 1408 | "improvementPlan": { 1409 | "displayText": "Common software (security agents or server packages) can be baked into the AMI along with their configuration. This ensures " 1410 | } 1411 | }, 1412 | { 1413 | "id": "distinct_amis", 1414 | "title": "Distinct AMIs are used for production vs non-production environments.", 1415 | "helpfulResource": { 1416 | "displayText": "Leveraging distinct AMIs for each environment ensures that accidental deletions or changes to lower environments do not impact higher ones." 1417 | }, 1418 | "improvementPlan": { 1419 | "displayText": "Risk has been mitigated, no further improvements needed. " 1420 | } 1421 | }, 1422 | { 1423 | "id": "regenerated", 1424 | "title": "AMIs are regenerated on a consistent basis in order to keep them up to date.", 1425 | "helpfulResource": { 1426 | "displayText": "Security updates are regularly released for operating systems, regenerating the AMIs on a consistent basis improves deployment times and improves security posture." 1427 | }, 1428 | "improvementPlan": { 1429 | "displayText": "Risk has been mitigated, no further improvements needed. 
" 1430 | } 1431 | } 1432 | ], 1433 | "riskRules": [ 1434 | { 1435 | "condition": "regenerated && distinct_amis && prebuilt_amis", 1436 | "risk": "NO_RISK" 1437 | }, 1438 | { 1439 | "condition": "(prebuilt_amis && (distinct_amis && !regenerated)) || (prebuilt_amis && (regenerated && !distinct_amis))", 1440 | "risk": "MEDIUM_RISK" 1441 | }, 1442 | { 1443 | "condition": "default", 1444 | "risk": "HIGH_RISK" 1445 | } 1446 | ] 1447 | }, 1448 | { 1449 | "id": "automation_of_test_coverage", 1450 | "title": "Test Coverage (H)", 1451 | "description": "What tests and code scans are in place with adequate code coverage?", 1452 | "choices": [ 1453 | { 1454 | "id": "test_int", 1455 | "title": "Automated Integration Tests are implemented and run during the release process with sufficient coverage.", 1456 | "improvementPlan": { 1457 | "displayText": "Ensure you have comprehensive integration test coverage." 1458 | } 1459 | }, 1460 | { 1461 | "id": "test_unit", 1462 | "title": "Automated Unit Tests are implemented and run during the release process with sufficient coverage.", 1463 | "improvementPlan": { 1464 | "displayText": "Ensure you have comprehensive unit test coverage." 1465 | } 1466 | }, 1467 | { 1468 | "id": "smoke", 1469 | "title": "Automated tests are run after deployment to ensure success.", 1470 | "improvementPlan": { 1471 | "displayText": "Add deployment smoke testing to the deployment pipeline to verify successful deployment." 1472 | } 1473 | }, 1474 | { 1475 | "id": "scan", 1476 | "title": "Automated code vunerability scanning and linting are performed during CICD.", 1477 | "improvementPlan": { 1478 | "displayText": "Add code vunerability scanning and linting to the deployment pipeline." 1479 | } 1480 | }, 1481 | { 1482 | "id": "manual", 1483 | "title": "Manual Test Suites and Code Scans are run.", 1484 | "improvementPlan": { 1485 | "displayText": "Ensure you have comprehensive test coverage." 
1486 | } 1487 | } 1488 | ], 1489 | "riskRules": [ 1490 | { 1491 | "condition": "test_int && test_unit && scan && smoke", 1492 | "risk": "NO_RISK" 1493 | }, 1494 | { 1495 | "condition": "!test_int && !test_unit && !scan && !smoke && manual", 1496 | "risk": "MEDIUM_RISK" 1497 | }, 1498 | { 1499 | "condition": "default", 1500 | "risk": "HIGH_RISK" 1501 | } 1502 | ] 1503 | } 1504 | ] 1505 | }, 1506 | { 1507 | "id": "event_management", 1508 | "name": "03 - Incident & Event Management", 1509 | "questions": [ 1510 | { 1511 | "id": "underlying_dependencies", 1512 | "title": "Underlying Dependencies (H)", 1513 | "description": "What are the underlying dependencies for the service? Include a list of other services, systems, and infrastructure outside the boundary of this application. Explain how your service will be impacted based on a failure of each of your dependencies.", 1514 | "choices": [ 1515 | { 1516 | "id": "table_provided", 1517 | "title": "A link to the list of underlying dependencies is provided below in the notes section.", 1518 | "helpfulResource": { 1519 | "displayText": "The list of underlying dependencies, or a link to it, has been noted below." 1520 | }, 1521 | "improvementPlan": { 1522 | "displayText": "The key list of underlying dependencies should be inventoried and documented." 1523 | } 1524 | }, 1525 | { 1526 | "id": "impact_provided", 1527 | "title": "The impact of underlying dependency failure is provided below in the notes section.", 1528 | "helpfulResource": { 1529 | "displayText": "The impact of each underlying dependency failure, or a link to it, has been noted below." 1530 | }, 1531 | "improvementPlan": { 1532 | "displayText": "The list of impacts of underlying dependency failure should be inventoried and documented."
1533 | } 1534 | } 1535 | ], 1536 | "riskRules": [ 1537 | { 1538 | "condition": "table_provided && impact_provided", 1539 | "risk": "NO_RISK" 1540 | }, 1541 | { 1542 | "condition": "default", 1543 | "risk": "HIGH_RISK" 1544 | } 1545 | ] 1546 | }, 1547 | { 1548 | "id": "event_kpis", 1549 | "title": "Operational KPIs (H)", 1550 | "description": "What operational goals or KPIs (latency, throughput/TPS, etc.) have you identified for your service?", 1551 | "choices": [ 1552 | { 1553 | "id": "kpis_documented", 1554 | "title": "KPIs are known and documented in the notes section.", 1555 | "helpfulResource": { 1556 | "displayText": "Knowing your KPIs is an important part of understanding whether you are meeting the needs of the users during an event. KPIs could include uptime/availability, number of active users or sessions, number of transactions per second, amount of time each transaction takes or the amount of latency a user is experiencing." 1557 | }, 1558 | "improvementPlan": { 1559 | "displayText": "Work with business leadership to understand the KPIs that will determine whether the event is a success, and then work to implement those KPIs as metrics and alarms." 1560 | } 1561 | } 1562 | ], 1563 | "riskRules": [ 1564 | { 1565 | "condition": "kpis_documented", 1566 | "risk": "NO_RISK" 1567 | }, 1568 | { 1569 | "condition": "default", 1570 | "risk": "HIGH_RISK" 1571 | } 1572 | ] 1573 | }, 1574 | { 1575 | "id": "event_oncall_rotation", 1576 | "title": "On-call Rotation (H)", 1577 | "description": "Do you have an on-call rotation configured? Are there runbooks written for use by the oncalls? Have those runbooks been tested and validated?
In the event the oncall needs help, what are your escalation procedures?", 1578 | "choices": [ 1579 | { 1580 | "id": "oncall_rotation", 1581 | "title": "Oncall rotation is agreed upon and known.", 1582 | "helpfulResource": { 1583 | "displayText": "A clear and well documented on-call plan ensures that each engineer knows who is responsible and who should be engaged in the event of an issue." 1584 | }, 1585 | "improvementPlan": { 1586 | "displayText": "Risk has been mitigated, no further improvements needed. " 1587 | } 1588 | }, 1589 | { 1590 | "id": "alarm_engagement", 1591 | "title": "Oncall engineers are automatically engaged by metric alarms / ticketing systems.", 1592 | "helpfulResource": { 1593 | "displayText": "High-impact alarms should automatically engage relevant oncall engineers, or cut tickets which will then engage the oncall. This ensures that there is minimal delay from the time an issue begins to the time operators are engaged. Highly-critical alarms should be configured to engage engineers within five minutes." 1594 | }, 1595 | "improvementPlan": { 1596 | "displayText": "High-impact alarms should automatically engage relevant oncall engineers, or cut tickets which will then engage the oncall. Highly-critical alarms should be configured to engage engineers within five minutes." 1597 | } 1598 | }, 1599 | { 1600 | "id": "oncall_sla", 1601 | "title": "Oncall engineers have an SLA for check-in time once an engagement ", 1602 | "helpfulResource": { 1603 | "displayText": "As part of the oncall process and documentation, expectations should be clearly defined and understood by the oncall engineers. This includes an expected SLA on when they need to check in by once an event begins and they are notified." 1604 | }, 1605 | "improvementPlan": { 1606 | "displayText": "Discuss and implement, or update, a document for oncall responsibilities and expectations." 
1607 | } 1608 | }, 1609 | { 1610 | "id": "runbooks_written", 1611 | "title": "Oncall engineers have documented runbooks they can rely on", 1612 | "helpfulResource": { 1613 | "displayText": "Runbooks can be a critical component of a successful incident response, taking the guesswork out of how engineers should respond and empowering more junior engineers." 1614 | }, 1615 | "improvementPlan": { 1616 | "displayText": "Frequently executed commands should be documented as runbooks or, ideally, written as scripts that can be executed with minimal human involvement." 1617 | } 1618 | }, 1619 | { 1620 | "id": "runbooks_validated", 1621 | "title": "Runbooks have been validated for permissions issues, typos, etc., and are re-validated on a schedule.", 1622 | "helpfulResource": { 1623 | "displayText": "Runbooks, like any other form of documentation, can be subject to bit rot. It is important that they be validated and re-validated over time as changes are made to the system and engineer permissions." 1624 | }, 1625 | "improvementPlan": { 1626 | "displayText": "Runbooks should be tested and validated in non-production environments to identify typos, permissions issues, or errors, and the runbooks should be re-validated on a schedule or following major system changes." 1627 | } 1628 | }, 1629 | { 1630 | "id": "escalation_documented", 1631 | "title": "Escalation procedures for the oncall are documented.", 1632 | "helpfulResource": { 1633 | "displayText": "Engineers will inevitably come across an issue that they do not know how to handle, or multiple issues will trigger simultaneously. In such events, it is important that engineers know how to escalate for assistance." 1634 | }, 1635 | "improvementPlan": { 1636 | "displayText": "As part of the oncall process, escalation procedures should be clearly documented so that, in the event the oncall is overwhelmed with simultaneous issues or there is a knowledge gap, they know the most effective way to gain assistance."
1637 | } 1638 | } 1639 | ], 1640 | "riskRules": [ 1641 | { 1642 | "condition": "escalation_documented && oncall_sla && runbooks_validated && runbooks_written && alarm_engagement && oncall_rotation", 1643 | "risk": "NO_RISK" 1644 | }, 1645 | { 1646 | "condition": "(oncall_rotation && runbooks_written && alarm_engagement) && (!escalation_documented || !runbooks_validated) ", 1647 | "risk": "MEDIUM_RISK" 1648 | }, 1649 | { 1650 | "condition": "default", 1651 | "risk": "HIGH_RISK" 1652 | } 1653 | ] 1654 | }, 1655 | { 1656 | "id": "event_alarms", 1657 | "title": "Alarms & Runbooks (H)", 1658 | "description": "What automated alarms do you have for your system? Do you have a runbook/SOP documented for how to investigate and troubleshoot each?", 1659 | "choices": [ 1660 | { 1661 | "id": "alarms_exist", 1662 | "title": "Alarms documented", 1663 | "helpfulResource": { 1664 | "displayText": "The risk presented here has been fully mitigated with no lingering questions or concerns that need to be followed up on." 1665 | }, 1666 | "improvementPlan": { 1667 | "displayText": "Risk has been mitigated, no further improvements needed. " 1668 | } 1669 | }, 1670 | { 1671 | "id": "runbooks_exist", 1672 | "title": "Alarms documented", 1673 | "helpfulResource": { 1674 | "displayText": "The risk presented here has been fully mitigated with no lingering questions or concerns that need to be followed up on." 1675 | }, 1676 | "improvementPlan": { 1677 | "displayText": "Risk has been mitigated, no further improvements needed. " 1678 | } 1679 | }, 1680 | { 1681 | "id": "alarms_include_runbook_links", 1682 | "title": "Alarms documented", 1683 | "helpfulResource": { 1684 | "displayText": "The risk presented here has been fully mitigated with no lingering questions or concerns that need to be followed up on." 1685 | }, 1686 | "improvementPlan": { 1687 | "displayText": "Risk has been mitigated, no further improvements needed. 
" 1688 | } 1689 | }, 1690 | { 1691 | "id": "alarms_auto_trigger_runbooks", 1692 | "title": "Alarms documented", 1693 | "helpfulResource": { 1694 | "displayText": "The risk presented here has been fully mitigated with no lingering questions or concerns that need to be followed up on." 1695 | }, 1696 | "improvementPlan": { 1697 | "displayText": "Risk has been mitigated, no further improvements needed. " 1698 | } 1699 | } 1700 | ], 1701 | "riskRules": [ 1702 | { 1703 | "condition": "alarms_auto_trigger_runbooks && alarms_include_runbook_links && runbooks_exist && alarms_exist", 1704 | "risk": "NO_RISK" 1705 | }, 1706 | { 1707 | "condition": "(alarms_exist && runbooks_exist) && (!alarms_include_runbook_links || !alarms_auto_trigger_runbooks)", 1708 | "risk": "MEDIUM_RISK" 1709 | }, 1710 | { 1711 | "condition": "default", 1712 | "risk": "HIGH_RISK" 1713 | } 1714 | ] 1715 | }, 1716 | { 1717 | "id": "event_automated_data_gathering", 1718 | "title": "Automated Data Gathering (L)", 1719 | "description": "Following or during an event, ", 1720 | "choices": [ 1721 | { 1722 | "id": "risk_mitigated", 1723 | "title": "Automated data gathering is used in the event of an alarm or event", 1724 | "helpfulResource": { 1725 | "displayText": "When alarms auto-cut tickets to engineers, it can be very useful for the automation to also include cloudwatch metrics or log files from the affected resource(s). This frees up engineer time from needing to collect that data by hand, allowing them to focus on the 'why' of the event." 
1726 | }, 1727 | "improvementPlan": { 1728 | "displayText": "Work with the account team to get a deep dive on Systems Manager - Incident Manager for inspiration and to see how relevant resource metrics can be collected" 1729 | } 1730 | } 1731 | ], 1732 | "riskRules": [ 1733 | { 1734 | "condition": "risk_mitigated", 1735 | "risk": "NO_RISK" 1736 | }, 1737 | { 1738 | "condition": "default", 1739 | "risk": "HIGH_RISK" 1740 | } 1741 | ] 1742 | }, 1743 | { 1744 | "id": "event_host_fs_alarms", 1745 | "title": "On-host Filesystem Alarms (H)", 1746 | "description": "Do you monitor and alarm on your hosts for file system, index node, and file descriptor utilization? What about CPU utilization, memory utilization, or other resource metrics?", 1747 | "choices": [ 1748 | { 1749 | "id": "fs_capacity", 1750 | "title": "Filesystem space capacity", 1751 | "helpfulResource": { 1752 | "displayText": "For non-autoscaled systems, monitoring and alarming on the level of free capacity on the filesystem can be critical for catching problems early and preventing a crash." 1753 | }, 1754 | "improvementPlan": { 1755 | "displayText": "Non-autoscaled systems that ingest data should have filesystem capacity alarmed on in order to avoid a situation where the drive is filled to capacity, resulting in a crash." 1756 | } 1757 | }, 1758 | { 1759 | "id": "inode_capacity", 1760 | "title": "Filesystem inode capacity", 1761 | "helpfulResource": { 1762 | "displayText": "EXT4-based systems in particular may want to monitor inode capacity, given that the EXT4 inode count is fixed at filesystem creation time and cannot be expanded. Inode exhaustion is a concern on systems that deal with many small files rather than fewer medium to large files." 1763 | }, 1764 | "improvementPlan": { 1765 | "displayText": "On systems which primarily deal with many small files (logging servers, for example), monitoring inode capacity can be critical for ensuring application availability.
" 1766 | } 1767 | }, 1768 | { 1769 | "id": "fd_utilization", 1770 | "title": "File descriptor utilization.", 1771 | "helpfulResource": { 1772 | "displayText": "File descriptors are a measurement of how many files can be opened at a single time. " 1773 | }, 1774 | "improvementPlan": { 1775 | "displayText": "File descriptor utilization should be metriced and alarmed on if the server in question handles many services or if one of those services primarily opens and closes many files simultaneously." 1776 | } 1777 | }, 1778 | { 1779 | "id": "cpu_utilization", 1780 | "title": "CPU utilization", 1781 | "helpfulResource": { 1782 | "displayText": "CPU Utilization is a default metric that is available within EC2, and its monitoring can be critical to detecting problems such as processing running out of control. This is a common scale-out metric for autoscaling groups." 1783 | }, 1784 | "improvementPlan": { 1785 | "displayText": "Non-autoscaled systems should alarm on CPU utilization around 75-80%, or use CW Synthetics for dynamic alarms, in order to detect a server that is being overwhelmed." 1786 | } 1787 | }, 1788 | { 1789 | "id": "mem_utilization", 1790 | "title": "Memory utilization", 1791 | "helpfulResource": { 1792 | "displayText": "Memory utilization is not a default metric that is available within EC2. Its monitoring can be critical to detecting problems such as processing running out of control. This is a common scale-out metric for autoscaling groups." 1793 | }, 1794 | "improvementPlan": { 1795 | "displayText": "Non-autoscaled systems should alarm on memory utilization, either statically defined thresholds or Cloudwatch synthetics, in orrder to detect memory leaks or being under-provisioned." 
1796 | } 1797 | } 1798 | ], 1799 | "riskRules": [ 1800 | { 1801 | "condition": "mem_utilization && cpu_utilization && fd_utilization && inode_capacity && fs_capacity", 1802 | "risk": "NO_RISK" 1803 | }, 1804 | { 1805 | "condition": "(mem_utilization && cpu_utilization) && (!inode_capacity || !fs_capacity)", 1806 | "risk": "MEDIUM_RISK" 1807 | }, 1808 | { 1809 | "condition": "default", 1810 | "risk": "HIGH_RISK" 1811 | } 1812 | ] 1813 | }, 1814 | { 1815 | "id": "event_database_alarms", 1816 | "title": "Database Alarms", 1817 | "description": "Do you have alarms on database (relational and non-relational) utilization? Do you alarm with enough room to troubleshoot and mitigate the issue before it becomes customer-impacting?", 1818 | "choices": [ 1819 | { 1820 | "id": "risk_mitigated", 1821 | "title": "Database utilization and performance are alarmed on", 1822 | "helpfulResource": { 1823 | "displayText": "Proactively identifying a downward trend on critical databases is important to ensure that a problem does not become a user-impacting issue." 1824 | }, 1825 | "improvementPlan": { 1826 | "displayText": "Application-critical databases should be proactively monitored for utilization or performance metrics that could indicate a problem, such as average query time increasing or memory utilization reaching critical levels." 1827 | } 1828 | } 1829 | ], 1830 | "riskRules": [ 1831 | { 1832 | "condition": "risk_mitigated", 1833 | "risk": "NO_RISK" 1834 | }, 1835 | { 1836 | "condition": "default", 1837 | "risk": "HIGH_RISK" 1838 | } 1839 | ] 1840 | }, 1841 | { 1842 | "id": "event_jvm_metrics", 1843 | "title": "JVM Metrics & Alarms", 1844 | "description": "Do you monitor and alarm on your JVM metrics?
Note which metrics are monitored, and where and how they are published.", 1845 | "choices": [ 1846 | { 1847 | "id": "risk_mitigated", 1848 | "title": "JVM metrics monitored and alarmed on as needed.", 1849 | "helpfulResource": { 1850 | "displayText": "Applicable JVM metrics could include, but are not limited to: memory utilization, garbage collection, heap usage, thread summary." 1851 | }, 1852 | "improvementPlan": { 1853 | "displayText": "Relevant JVM metrics (thread summary, memory utilization, garbage collection, heap usage) should be collected as metrics and alarmed on. The CloudWatch Agent can collect the JVM metrics via its collectd plugin." 1854 | } 1855 | } 1856 | ], 1857 | "riskRules": [ 1858 | { 1859 | "condition": "risk_mitigated", 1860 | "risk": "NO_RISK" 1861 | }, 1862 | { 1863 | "condition": "default", 1864 | "risk": "HIGH_RISK" 1865 | } 1866 | ] 1867 | }, 1868 | { 1869 | "id": "event_frontend_alarms", 1870 | "title": "Frontend Alarms (M)", 1871 | "description": "Do you monitor and alarm on your frontend components (DNS systems, load balancers, VIPs, proxy services, etc.)?", 1872 | "choices": [ 1873 | { 1874 | "id": "risk_mitigated", 1875 | "title": "Frontend components alarmed on for availability and critical performance metrics.", 1876 | "helpfulResource": { 1877 | "displayText": "Monitoring and alarming of critical-path frontend components, by the owning team, are critical to ensuring the availability of downstream applications and services." 1878 | }, 1879 | "improvementPlan": { 1880 | "displayText": "In-scope critical-path components should be alarmed on in order to detect issues and remediate them efficiently."
1881 | } 1882 | } 1883 | ], 1884 | "riskRules": [ 1885 | { 1886 | "condition": "risk_mitigated", 1887 | "risk": "NO_RISK" 1888 | }, 1889 | { 1890 | "condition": "default", 1891 | "risk": "HIGH_RISK" 1892 | } 1893 | ] 1894 | }, 1895 | { 1896 | "id": "event_delayed_consistency", 1897 | "title": "Delayed Consistency Alarms (L)", 1898 | "description": "Do you monitor and alarm on any delayed convergence of eventually consistent processes in your system?", 1899 | "choices": [ 1900 | { 1901 | "id": "risk_mitigated", 1902 | "title": "Latency of eventual consistency is measured and alarmed on", 1903 | "helpfulResource": { 1904 | "displayText": "Eventual consistency is a common feature of distributed systems where data is stored on multiple nodes but only one node accepts the incoming write to start with. For example, Amazon DynamoDB reads are eventually consistent by default." 1905 | }, 1906 | "improvementPlan": { 1907 | "displayText": "If possible, the latency of eventually consistent systems should be measured and alarmed on. A delay in convergence could pose a risk to user experience or data consistency, or could indicate an otherwise silent problem in the system." 1908 | } 1909 | } 1910 | ], 1911 | "riskRules": [ 1912 | { 1913 | "condition": "risk_mitigated", 1914 | "risk": "NO_RISK" 1915 | }, 1916 | { 1917 | "condition": "default", 1918 | "risk": "HIGH_RISK" 1919 | } 1920 | ] 1921 | }, 1922 | { 1923 | "id": "event_transaction_tracing", 1924 | "title": "Transaction Tracing (M)", 1925 | "description": "How do you trace a customer request through components in your service?", 1926 | "choices": [ 1927 | { 1928 | "id": "risk_mitigated", 1929 | "title": "Transaction tracing implemented.", 1930 | "helpfulResource": { 1931 | "displayText": "Transaction tracing allows for the finely-tuned monitoring of each transaction as it moves through the system, capturing individual return codes at each step, as well as the latency of each step and which nodes received which requests.
This makes it much easier to identify exactly where a problem may have occurred for a given request." 1932 | }, 1933 | "improvementPlan": { 1934 | "displayText": "Work with the account team to get a deep dive on AWS X-Ray and integrating the X-Ray SDK into the relevant applications.", 1935 | "url": "https://aws.amazon.com/xray/" 1936 | } 1937 | } 1938 | ], 1939 | "riskRules": [ 1940 | { 1941 | "condition": "risk_mitigated", 1942 | "risk": "NO_RISK" 1943 | }, 1944 | { 1945 | "condition": "default", 1946 | "risk": "HIGH_RISK" 1947 | } 1948 | ] 1949 | }, 1950 | { 1951 | "id": "event_mutating_access", 1952 | "title": "Mutating Access (M)", 1953 | "description": "Have you limited the individuals who have mutating access to production systems and AWS accounts?", 1954 | "choices": [ 1955 | { 1956 | "id": "seniors_only", 1957 | "title": "Production access is limited to (senior) personnel on relevant teams under normal scenarios.", 1958 | "helpfulResource": { 1959 | "displayText": "Limiting mutating access to production accounts is critical for ensuring accidental changes are not made." 1960 | }, 1961 | "improvementPlan": { 1962 | "displayText": "If limiting to automation-only is not possible, mutating access to production should be limited to engineers on teams who have a need to make changes, potentially further restricted to senior engineers on those teams." 1963 | } 1964 | }, 1965 | { 1966 | "id": "automation_only", 1967 | "title": "Mutating changes to production systems are limited to automated systems only under normal scenarios.", 1968 | "helpfulResource": { 1969 | "displayText": "Limiting access to automated systems only ensures that each and every change can be tested and validated in lower environments prior to moving into production."
1970 | }, 1971 | "improvementPlan": { 1972 | "displayText": "CI/CD pipelines that promote changes from lower environments to higher ones are one method to ensure that each change is tested." 1973 | } 1974 | } 1975 | ], 1976 | "riskRules": [ 1977 | { 1978 | "condition": "seniors_only || automation_only", 1979 | "risk": "NO_RISK" 1980 | }, 1981 | { 1982 | "condition": "default", 1983 | "risk": "HIGH_RISK" 1984 | } 1985 | ] 1986 | }, 1987 | { 1988 | "id": "event_queue_backlog", 1989 | "title": "Queue Backlog (M)", 1990 | "description": "Does your system queue requests? If so, do you take preventive measures to ensure that your queue does not exceed a pre-determined size?", 1991 | "choices": [ 1992 | { 1993 | "id": "backlog_alarm", 1994 | "title": "Alarms configured to alert on excessive queue depth.", 1995 | "helpfulResource": { 1996 | "displayText": "Monitoring the backlog of a queue is critical to ensuring that all systems, both upstream and downstream, are functioning as expected. Sibling metrics for the number of items being worked off the queue and the number of items being put on the queue can then be critical pieces in identifying the source of an issue when it arises." 1997 | }, 1998 | "improvementPlan": { 1999 | "displayText": "Ensure that all queueing systems in the application report their respective backlogs, with alarms configured for thresholds of concern or anomaly detection alarms configured for dynamic thresholds." 2000 | } 2001 | } 2002 | ], 2003 | "riskRules": [ 2004 | { 2005 | "condition": "backlog_alarm", 2006 | "risk": "NO_RISK" 2007 | }, 2008 | { 2009 | "condition": "default", 2010 | "risk": "HIGH_RISK" 2011 | } 2012 | ] 2013 | }, 2014 | { 2015 | "id": "event_expiring_materials", 2016 | "title": "Expiring Materials (M)", 2017 | "description": "Does your system or host contain any materials that expire?
If so, what monitoring and alarming are in place to prevent disruptive expirations?", 2018 | "choices": [ 2019 | { 2020 | "id": "risk_mitigated", 2021 | "title": "Relevant risk has been mitigated", 2022 | "helpfulResource": { 2023 | "displayText": "Examples could include software validation certifications, time-bound licenses, or API keys." 2024 | }, 2025 | "improvementPlan": { 2026 | "displayText": "If materials expire on a regular schedule, alarms should be in place ahead of those expiration times to ensure that they can be rotated in time, and a runbook should be developed for what to do in the event of rotation failure." 2027 | } 2028 | } 2029 | ], 2030 | "riskRules": [ 2031 | { 2032 | "condition": "risk_mitigated", 2033 | "risk": "NO_RISK" 2034 | }, 2035 | { 2036 | "condition": "default", 2037 | "risk": "HIGH_RISK" 2038 | } 2039 | ] 2040 | } 2041 | ] 2042 | } 2043 | ] 2044 | } --------------------------------------------------------------------------------