├── .gitignore ├── LICENSE ├── README.md ├── apps └── lecho │ ├── lecho.sh │ ├── lecho.yaml │ └── src │ └── index.js ├── index.ipynb ├── jupyter-docker-run.sh ├── reliability ├── rel.ipynb ├── rel01.ipynb ├── rel02.ipynb ├── rel03.ipynb ├── rel04.ipynb ├── rel05.ipynb ├── rel06.ipynb ├── rel07.ipynb ├── rel08.ipynb └── rel09.ipynb ├── security ├── admin-user │ ├── admin-user.sh │ └── admin-user.yaml ├── audit-trail │ ├── audit-trail.sh │ └── audit-trail.yaml ├── no-root │ ├── README.md │ ├── events │ │ ├── CreateFunction.IAMUser.json │ │ └── GetStackPolicy.Root.json │ ├── no-root.sh │ ├── no-root.yaml │ ├── package.json │ ├── src │ │ └── index.js │ └── test │ │ └── event1.json ├── sec01.ipynb ├── security.sh └── security.yaml └── shared-resources ├── code-bucket.sh ├── code-bucket.yaml ├── coe-topic.sh ├── coe-topic.yaml ├── hosted-zone.sh ├── hosted-zone.yaml ├── logs-bucket.sh ├── logs-bucket.yaml ├── tls-cert.sh └── tls-cert.yaml /.gitignore: -------------------------------------------------------------------------------- 1 | node_modules 2 | target 3 | .ipynb_checkpoints 4 | .Trash-1000 5 | jupyter/.ipynb_checkpoints 6 | 7 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. 
For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Reference Implementations for AWS Well Architected Framework 2 | 3 | This project aims to help developers build using the AWS Well Architected Framework. It provides reference implementations and guidance to turn the questions and guidelines from the whitepaper into executables. The framework is designed around non-functional requirements and is generally applicable to most customers.
This project adds implementation suggestions, artifacts and examples that have been successful with customers, partners and communities. 4 | 5 | The references are presented as Jupyter notebooks containing both the texts and code, making them easy to follow with any AWS account and level of expertise. You should be able to dive as deep as makes sense for your case, just reading through the documents, running them on your account or cherry-picking individual components or configurations. 6 | 7 | Each answer here is expected to be: 8 | 9 | * **Satisficing** over optimizing: Pushing the requirements to their expected levels, not to the state-of-the-art. They may be the same in some cases, but the answers and code here aim to be as good as needed, even if not the ultimate best. 10 | 11 | * **Objective** over subjective: Each requirement should be demonstrated quantitatively (e.g. latency < 250ms) or, where data and definitions are missing, qualitatively (low latency). Expressing requirements subjectively (high performance, 100% secure, etc.) creates ambiguities and makes compliance hard to assess. 12 | 13 | * **Durable** over ephemeral: Implementation decisions are frequently lost, sometimes as soon as the requirement is met and the developer starts the next ticket. The notebooks in this repository can be used as a lightweight Architectural Decision Record. 14 | 15 | * **Shared** over private: Being open and collaborative is great not only for software. Sharing architectural decisions and resources is important and we hope you'll enjoy these. 16 | 17 | Get started at [AWS Well Architected Reference Implementations](index.ipynb) 18 | 19 | 20 | -------------------------------------------------------------------------------- /apps/lecho/lecho.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 3 | FILE=$(basename "${BASH_SOURCE[0]}") 4 | STACK_NAME="${FILE%.*}" 5 | ZIP="${DIR}/target/${STACK_NAME}.zip" 6 | TEMPLATE_FILE="$DIR/${STACK_NAME}.yaml" 7 | TEMPLATE_FILE_OUT="${DIR}/target/${STACK_NAME}.out.yaml" 8 | CODE_BUCKET=$(aws cloudformation describe-stack-resources \ 9 | --stack-name "code-bucket" \ 10 | --query "StackResources[?LogicalResourceId =='CodeBucket'].PhysicalResourceId" \ 11 | --output text) 12 | 13 | rm -rf "${DIR}/target" 14 | mkdir "${DIR}/target" 15 | 16 | cd "${DIR}/src" 17 | zip -r "$ZIP" ./* 18 | cd "${DIR}" 19 | zip -ur "$ZIP" node_modules 20 | unzip -l "$ZIP" 21 | 22 | aws cloudformation package \ 23 | --template-file "$TEMPLATE_FILE" \ 24 | --s3-bucket "$CODE_BUCKET" \ 25 | --output-template-file "$TEMPLATE_FILE_OUT" 26 | 27 | aws cloudformation delete-stack \ 28 | --stack-name "$STACK_NAME" 29 | 30 | aws cloudformation wait stack-delete-complete \ 31 | --stack-name "$STACK_NAME" 32 | 33 | aws cloudformation create-stack \ 34 | --template-body "file://$TEMPLATE_FILE_OUT" \ 35 | --stack-name "$STACK_NAME" \ 36 | --capabilities CAPABILITY_IAM 37 | -------------------------------------------------------------------------------- /apps/lecho/lecho.yaml: -------------------------------------------------------------------------------- 1 | Description: Sample Echo Application 2 | Resources: 3 | FnLechoRole: 4 | Type: "AWS::IAM::Role" 5 | Properties: 6 | AssumeRolePolicyDocument: 7 | Version: "2012-10-17" 8 | Statement: 9 | - Effect: "Allow" 10 | Principal: 11 | Service: 12 | - "lambda.amazonaws.com" 13 | Action: 14 | - "sts:AssumeRole" 15 | ManagedPolicyArns: 
["arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"] 16 | FnLecho: 17 | Type: AWS::Lambda::Function 18 | Properties: 19 | Handler: index.handler 20 | Runtime: nodejs6.10 21 | Code: ./target/lecho.zip 22 | MemorySize: 128 23 | Timeout: 5 24 | Role: !GetAtt FnLechoRole.Arn 25 | EchoAPI: 26 | Type: AWS::ApiGateway::RestApi 27 | Properties: 28 | Name: "EchoAPI" 29 | FailOnWarnings: true 30 | EchoResource: 31 | Type: AWS::ApiGateway::Resource 32 | Properties: 33 | RestApiId: !Ref EchoAPI 34 | ParentId: !GetAtt EchoAPI.RootResourceId 35 | PathPart: '{proxy+}' 36 | EchoMethod: 37 | Type: AWS::ApiGateway::Method 38 | Properties: 39 | RestApiId: !Ref EchoAPI 40 | ResourceId: !Ref EchoResource 41 | HttpMethod: ANY 42 | AuthorizationType: NONE 43 | MethodResponses: 44 | - StatusCode: 200 45 | Integration: 46 | Type: AWS_PROXY 47 | IntegrationHttpMethod: POST 48 | Uri: !Sub arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${FnLecho.Arn}/invocations 49 | EchoDeployment: 50 | DependsOn: "EchoMethod" 51 | Type: "AWS::ApiGateway::Deployment" 52 | Properties: 53 | RestApiId: !Ref EchoAPI 54 | StageName: echoDummyDeploymentStage 55 | EchoStage: 56 | Type: AWS::ApiGateway::Stage 57 | Properties: 58 | StageName: echoStage 59 | RestApiId: !Ref EchoAPI 60 | DeploymentId: !Ref EchoDeployment 61 | FnLechoPermission: 62 | Type: AWS::Lambda::Permission 63 | Properties: 64 | FunctionName: !GetAtt FnLecho.Arn 65 | Action: lambda:InvokeFunction 66 | Principal: apigateway.amazonaws.com 67 | SourceArn: !Join [ "", [ "arn:aws:execute-api:", !Ref "AWS::Region",":", !Ref "AWS::AccountId", ":",!Ref "EchoAPI","/*/*/*" ]] 68 | Outputs: 69 | EchoDistributionDomainName: 70 | Value: !GetAtt EchoDomainName.DistributionDomainName -------------------------------------------------------------------------------- /apps/lecho/src/index.js: -------------------------------------------------------------------------------- 1 | "use strict"; 2 | 3 | 4 | exports.handler = (event, context, callback) => { 5 | console.log(event); 6 | console.log(context); 7 | const response = { 8 | statusCode: 200, 9 | body: JSON.stringify({ 10 | "event":event, 11 | "context":context 12 | }) 13 | }; 14 | callback(null,response); 15 | }; -------------------------------------------------------------------------------- /index.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# AWS Well Architected Reference Implementations\n", 8 | "\n", 9 | "\"The Well-Architected framework has been developed to help cloud architects build the most secure, high-performing, resilient, and efficient infrastructure possible for their applications. This framework provides a consistent approach for customers and partners to evaluate architectures, and provides guidance to help implement designs that will scale with your application needs over time.\" - https://aws.amazon.com/architecture/well-architected/\n", 10 | "\n", 11 | "The AWS Well Architected Framework evaluates application architectures by their non functional requirements, or \"pillars\": Security, Reliability, Performance Efficiency, Cost Optimizaiton, and Operational Excellence. The framework is not meant to be exhaustive or guarantee of success, but provide the basic foundations that should be considered in cloud-native applications. 
However, the framework is not prescriptive: each software organization will build different implementations.\n", 12 | "\n", 13 | "This project picks up from there, offering reference implementations for each question as specifically as possible. Some will be more process-oriented, but a lot of it comes down to code. That is why they are presented as Jupyter notebooks, providing the proper context and code for each reference.\n", 14 | "\n", 15 | "It is helpful to be familiar with the AWS Well Architected Framework before getting started, but that is not strictly required. You'll probably navigate from the implementations here to the requirements there as you build your own architecture and answers.\n" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "## Security" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "collapsed": true 30 | }, 31 | "outputs": [], 32 | "source": [] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "## Reliability\n", 39 | "\n", 40 | "### Foundations\n", 41 | "\n", 42 | "REL 1: How are you managing AWS service limits for your accounts?\n", 43 | "\n", 44 | "REL 2: How are you planning your network topology on AWS? \n", 45 | "\n", 46 | "### Change Management\n", 47 | "\n", 48 | "REL 3: How does your system adapt to changes in demand?\n", 49 | "\n", 50 | "REL 4: How are you monitoring AWS resources? \n", 51 | "\n", 52 | "REL 5: How are you executing change?\n", 53 | "\n", 54 | "### Failure Management\n", 55 | "\n", 56 | "REL 6: How are you backing up your data?\n", 57 | "\n", 58 | "REL 7: How does your system withstand component failures?\n", 59 | "\n", 60 | "REL 8: How are you testing your resiliency?\n", 61 | "\n", 62 | "REL 9: How are you planning for disaster recovery?\n", 63 | "\n", 64 | "\n" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": { 71 | "collapsed": true 72 | }, 73 | "outputs": [], 74 | "source": [] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": { 80 | "collapsed": true 81 | }, 82 | "outputs": [], 83 | "source": [] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "## Performance Efficiency" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Cost Optimization" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "## Operational Excellence" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": { 110 | "collapsed": true 111 | }, 112 | "outputs": [], 113 | "source": [] 114 | } 115 | ], 116 | "metadata": { 117 | "kernelspec": { 118 | "display_name": "Python 3", 119 | "language": "python", 120 | "name": "python3" 121 | }, 122 | "language_info": { 123 | "codemirror_mode": { 124 | "name": "ipython", 125 | "version": 3 126 | }, 127 | "file_extension": ".py", 128 | "mimetype": "text/x-python", 129 | "name": "python", 130 | "nbconvert_exporter": "python", 131 | "pygments_lexer": "ipython3", 132 | "version": "3.6.1" 133 | } 134 | }, 135 | "nbformat": 4, 136 | "nbformat_minor": 2 137 | } 138 | -------------------------------------------------------------------------------- /jupyter-docker-run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | docker run -it --rm \ 4 | -p 8888:8888 \ 5 | -v 
${PWD}/jupyter/:/home/jovyan/work \ 6 | jupyter/minimal-notebook \ 7 | start.sh jupyter lab --NotebookApp.notebook_dir=/home/jovyan/work 8 | -------------------------------------------------------------------------------- /reliability/rel.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Reliability Pillar\n", 8 | "\n", 9 | "When business is conducted mostly online, as it is today, reliability is invaluable. For applications such as amazon.com, Netflix and many others, minutes of downtime translate directly to millions in lost revenue and conversions. Losing or mishandling data can be even more catastrophic. \n", 10 | "\n", 11 | "The Reliability pillar focuses on designing systems that tolerate and recover from all sorts of undesirable circumstances. Applications may face failing components, partitioned networks, unreachable services… but the show must go on. With the cloud, applications can instantly provision resources, shift traffic and correct infrastructure automatically using APIs. However, it is up to developers to explore the many ways things can go wrong, as well as extrapolate and prevent them.\n", 12 | "\n", 13 | "Reliability is a broad concept. For the purposes of this chapter, we'll discuss three specific aspects of reliability:\n", 14 | "\n", 15 | "*\tAvailability: How to keep serving through component failures.\n", 16 | "*\tScalability: How to keep performance under varying load.\n", 17 | "*\tCorrectness: How to recover from application errors and other failures.\n", 18 | "\n", 19 | "It is important to note that these aspects apply just as much to the software as to its development process. Site Reliability Engineers are now employed worldwide with the single task of keeping software delivery flowing smoothly, from requirements to announcements. Reliability is not only a system administration task, but a shared responsibility across the company.\n", 20 | "And it is not only your software that is changing, but your dependencies and integrations as well. From small libraries to external web services, software evolves together. Let's consider how to use stable interfaces, semantic versioning and dynamic routing so that our mistakes don't break other people's code and other people's mistakes don't break ours.\n", 21 | "\n", 22 | "Each section of this chapter focuses on keeping availability, scalability and correctness in different situations: \n", 23 | "\n", 24 | "*\tReliability Foundations: How to stay reliable in steady state. \n", 25 | "*\tChange Management: How to roll out new versions and features. \n", 26 | "*\tFailure Management: How to tolerate component failures with minimal consequences.\n", 27 | "\n", 28 | "## Best Practices\n", 29 | "\n", 30 | "### Foundations\n", 31 | "\n", 32 | "Building reliable systems out of unreliable components is a key aspect of well-architected systems. A reliable system must not only keep itself available and healthy, but also account for the reliability of its dependencies and integrations. And not only through steady state, but while undergoing software updates and component failures.\n", 33 | "Reliability is not a boolean property, to be had or not, but a perception coming mostly from availability, scalability and correctness. Amazon's e-commerce feels reliable because it is always up, fulfilling orders correctly, from the first shopper to the current hundreds of millions. 
And more than considering these aspects in steady state, well-architected applications are built to stay reliable under software updates and component failures.\n", 34 | "\n", 35 | "### Availability\n", 36 | "\n", 37 | "The first aspect of reliability is availability, usually measured as “uptime”, the portion of time in which the service is up and responsive. Uptime is so important a metric that it is frequently made explicit in a Service Level Agreement (SLA) between service providers and customers. For example, the Amazon S3 SLA states that customers are entitled to compensation if the service availability is lower than 99.9% per month, or more than 43.2 minutes of downtime. This is the Standard storage SLA, but customers can opt for the “Standard - Infrequent Access” feature for cheaper storage in exchange for less strict availability, at a 99.0% SLA. Amazon S3 actually operates much above this level and has an incredible availability record. The Amazon Route 53 SLA is even stricter, stating compensation after five minutes of service outage, but not a single one has happened so far, largely due to its global network of redundant DNS resolvers.\n", 38 | "\n", 39 | "Even when it is not under formal scrutiny, it is helpful to openly share availability data with customers. AWS publishes a global service health dashboard with the general availability of services at https://status.aws.amazon.com. However, as services scale to several environments and failure zones, a single global availability indicator may be too succinct. For example, the AWS Personal Health Dashboard offers more detailed information for local or minor events that may be impactful to individual customer accounts. \n", 40 | "\n", 41 | "### Scalability\n", 42 | "\n", 43 | "A reliable system must be not only up, but responding to requests and emitting notifications in a timely manner as it scales, whatever that means in your business context. Online advertisement exchanges, for example, typically wait at most 80-120 milliseconds for responses to real-time bidding. Regardless of whether you have a crowd of users and a firehose of data, bidders need to respond in time to stay in business. Scalability means keeping acceptable performance, in terms of latency or throughput, while other metrics fluctuate. Most typically, these other metrics are concurrent users and volume of data.\n", 44 | "\n", 45 | "The final performance metrics are the result of possibly complex invocation paths with many components. From the selection of data structures to network conditions, everything can impact latency and throughput. So even on known codebases the resulting scalability is very unpredictable and requires reliable measurements from monitoring of production environments and load simulations.\n", 46 | "\n",
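"As a concrete illustration, a sketch of retrieving a latency percentile with the AWS CLI (the load balancer name and dates are placeholders):\n",
"\n",
"```bash\n",
"# p99 latency of a Classic Load Balancer over one day, in 5-minute periods\n",
"aws cloudwatch get-metric-statistics \\\n",
"  --namespace AWS/ELB --metric-name Latency \\\n",
"  --dimensions Name=LoadBalancerName,Value=my-elb \\\n",
"  --extended-statistics p99 \\\n",
"  --start-time 2018-01-01T00:00:00Z --end-time 2018-01-02T00:00:00Z \\\n",
"  --period 300\n",
"```\n",
"\n",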
"### Correctness\n", 48 | "\n", 49 | "Even if available and responsive, a system won't be considered reliable if it responds with application errors, inconsistent data and all sorts of bugs. However, sometimes it can be very hard to determine what correct means and, even then, it changes. Embracing the changing nature of software requirements can be tricky and much beyond the scope of this book, but we'll discuss some techniques to help cope with change in this section.\n", 50 | "\n", 51 | "The availability, scalability and correctness of applications are discussed in the following questions of the reliability pillar:\n", 52 | "\n", 53 | "[REL 1: How are you managing AWS service limits for your accounts?](rel01.ipynb)\n", 54 | "\n", 55 | "[REL 2: How are you planning your network topology on AWS?](rel02.ipynb)\n", 56 | "\n", 57 | "### Change Management\n", 58 | "\n", 59 | "[REL 3: How does your system adapt to changes in demand?](rel03.ipynb) \n", 60 | "\n", 61 | "[REL 4: How are you monitoring AWS resources?](rel04.ipynb)\n", 62 | "\n", 63 | "[REL 5: How are you executing change?](rel05.ipynb)\n", 64 | "\n", 65 | "### Failure Management\n", 66 | "\n", 67 | "[REL 6: How are you backing up your data?](rel06.ipynb)\n", 68 | "\n", 69 | "[REL 7: How does your system withstand component failures?](rel07.ipynb)\n", 70 | "\n", 71 | "[REL 8: How are you testing your resiliency?](rel08.ipynb)\n", 72 | "\n", 73 | "[REL 9: How are you planning for disaster recovery?](rel09.ipynb)\n" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "## Summary\n", 81 | "\n", 82 | "Reliability is key for business continuity and prosperity, sometimes just as much as security. With the concepts and techniques presented in this chapter you can build architectures that are resilient to expected and unexpected changes or even disasters. With security and reliability in place, we can focus on making the system more efficient in terms of performance and costs.\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [] 91 | } 92 | ], 93 | "metadata": { 94 | "kernelspec": { 95 | "display_name": "Python 3", 96 | "language": "python", 97 | "name": "python3" 98 | }, 99 | "language_info": { 100 | "codemirror_mode": { 101 | "name": "ipython", 102 | "version": 3 103 | }, 104 | "file_extension": ".py", 105 | "mimetype": "text/x-python", 106 | "name": "python", 107 | "nbconvert_exporter": "python", 108 | "pygments_lexer": "ipython3", 109 | "version": "3.6.4" 110 | } 111 | }, 112 | "nbformat": 4, 113 | "nbformat_minor": 2 114 | } 115 | -------------------------------------------------------------------------------- /reliability/rel01.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# REL 1: How are you managing AWS service limits for your accounts?\n", 8 | "\n" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "Service limits are reviewed frequently and automatically. Physically speaking there is no such thing as “unlimited” capacity, but Amazon's scale is astonishing. According to Gartner research, AWS capacity in 2015 was estimated to be 10 times bigger than the next 14 competitors, and speculation puts the aggregate count in the range of millions of servers. Letting applications scale unlimitedly would be very risky for both customers and Amazon, so all services are bound to service limits. For example, S3 buckets are limited to 100 by default and on-demand EC2 instances vary from 1 to 20, according to instance type. Check the [AWS Service Limits](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) page on the General Reference of the AWS Documentation for the complete list of default limits.\n", 16 | "\n",
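"A minimal check of one such limit against current usage, a sketch using only standard CLI calls:\n",
"\n",
"```bash\n",
"# Default on-demand EC2 instance limit for this account\n",
"aws ec2 describe-account-attributes --attribute-names max-instances \\\n",
"  --query 'AccountAttributes[0].AttributeValues[0].AttributeValue' --output text\n",
"\n",
"# Current number of running instances, to compare against the limit\n",
"aws ec2 describe-instances --filters Name=instance-state-name,Values=running \\\n",
"  --query 'length(Reservations[].Instances[])'\n",
"```\n",
"\n",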
17 | "These and other limits listed in the AWS Service Limits page are just defaults; you can ask for a limit increase at any time. This is a simple process, but it requires a support ticket and is subject to the response time of the contracted support level. Business and Enterprise Support customers can be automatically notified by Trusted Advisor when limits are getting close. This can be done on the web console, API or CLI, but there is also a published solution entitled [AWS Limit Monitor](https://aws.amazon.com/answers/account-management/limit-monitor/) on the AWS Answers website for automatically monitoring service limits more closely.\n", 18 | "\n", 19 | "Service limits should be monitored not only for AWS, but for all integrations and dependencies, as they will all be subject to physical or contractual limitations.\n", 20 | "\n" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [] 29 | } 30 | ], 31 | "metadata": { 32 | "kernelspec": { 33 | "display_name": "Python 3", 34 | "language": "python", 35 | "name": "python3" 36 | }, 37 | "language_info": { 38 | "codemirror_mode": { 39 | "name": "ipython", 40 | "version": 3 41 | }, 42 | "file_extension": ".py", 43 | "mimetype": "text/x-python", 44 | "name": "python", 45 | "nbconvert_exporter": "python", 46 | "pygments_lexer": "ipython3", 47 | "version": "3.6.4" 48 | } 49 | }, 50 | "nbformat": 4, 51 | "nbformat_minor": 2 52 | } 53 | -------------------------------------------------------------------------------- /reliability/rel02.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# REL 2: How are you planning your network topology on AWS? " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Network paths are redundant and monitored. For web and mobile applications this is managed automatically by AWS, as every Availability Zone is reliably connected to redundant Transit Centers and to other AZs in the same region. With speculation that over 70% of US internet traffic passes through the Virginia region at peak, chances are you are just a couple of hops away. The internet is a LAN! Even connecting outside of AWS, from TCP retransmission to BGP failover, the internet builds upon redundancy of paths to be reliable, and Amazon manages all of that.\n", 15 | "\n", 16 | "Hybrid architectures and on-premises integrations are not so simple, and network issues are probably the most common cause of failure. Although we'd love to have every system on the cloud, some workloads may be harder to migrate, for example:\n", 17 | "\n", 18 | "*\tHybrid CPUs. AWS has long and continuously cooperated with Intel on the development of more efficient EC2 instance families. Applications depending on other specific hardware platforms, such as mainframes, are more complex to migrate.\n", 19 | "*\tUnfriendly Licensing. Licenses bound to physical sockets or server counts are hard to apply to an elastic infrastructure. Further issues, such as hardware locks and license servers, may also require a hybrid architecture.\n", 20 | "*\tData Access. 
Legal regulations, data security departments and others may keep data in silos that are simply not accessible from the internet.\n", 21 | "\n", 22 | "In these hybrid scenarios it is very important to consider how the application tolerates network failures as its “partitions” keep operating isolated. Message queues, relational database read replicas, peer-to-peer and other asynchronous protocols usually handle brief losses of connectivity gracefully. However, if the application requires RPC or any other synchronous protocol, a reliable connection to on-premises data centers becomes required.\n", 23 | "\n", 24 | "Virtual Private Cloud, the service for software-defined networking, lets customers define their own network topologies as appropriate for each organization. Connectivity to on-premises data centers can be set up over the internet using Virtual Private Gateways to establish VPNs, or through dedicated links with AWS Direct Connect. \n", 25 | "\n", 26 | "Dedicated private connections can help reduce internet data-out costs and offer consistent network performance and direct connectivity to all AWS services. AWS Direct Connect itself is not a large cost, with port rates starting at a few cents per hour, but the cost of establishing the physical connection depends on local telecommunications providers. Visit the “Network Partner Solutions” page for more information about Direct Connect and other connectivity partners." 27 | ] 28 | } 29 | ], 30 | "metadata": { 31 | "kernelspec": { 32 | "display_name": "Python 3", 33 | "language": "python", 34 | "name": "python3" 35 | }, 36 | "language_info": { 37 | "codemirror_mode": { 38 | "name": "ipython", 39 | "version": 3 40 | }, 41 | "file_extension": ".py", 42 | "mimetype": "text/x-python", 43 | "name": "python", 44 | "nbconvert_exporter": "python", 45 | "pygments_lexer": "ipython3", 46 | "version": "3.6.4" 47 | } 48 | }, 49 | "nbformat": 4, 50 | "nbformat_minor": 2 51 | } 52 | -------------------------------------------------------------------------------- /reliability/rel03.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# REL 3: How does your system adapt to changes in demand?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Resource provisioning is managed by the service or automated. The goal of resource management is to consume the minimum resources to keep the application running reliably, but no less than that. Scaling resources, such as EC2 Instances or Kinesis Streams, requires a single action, but it takes an understanding of the system dynamics to balance between provisioning too little, causing bottlenecks, and too much, causing waste. \n", 15 | "\n", 16 | "The level of resource management is different in each service. Some developers just want to execute some code in response to events, such as HTTP requests. AWS Lambda provides that and automatically manages containers, instances, network interfaces and so on. Others want to run the container, but not manage resources, and that is provided by AWS Fargate. One layer down, Amazon ECS lets you not only run, but manage container clusters in detail. And so on, down to individual compute instances on Amazon EC2, even reaching bare metal if desired.\n", 17 | "\n",
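"For example, the highest of these levels takes a single call to run code with no infrastructure management at all (a sketch mirroring this repository's lecho app; the role ARN is a placeholder):\n",
"\n",
"```bash\n",
"# A single call deploys the code; Lambda provisions everything underneath\n",
"aws lambda create-function --function-name lecho \\\n",
"  --runtime nodejs6.10 --handler index.handler \\\n",
"  --zip-file fileb://target/lecho.zip \\\n",
"  --role arn:aws:iam::123456789012:role/lecho-role\n",
"```\n",
"\n",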
18 | "It pays off to climb this abstraction ladder. The pricing structure of the services is built so that it is attractive to let Amazon do the heavy lifting of resource management while you worry about your business logic. Scaling will be different in each context, but in the absence of further requirements, prefer implementing at the highest level of resource management:\n", 19 | "\n", 20 | "* Serverless became the popular jargon for all services that require little to no provisioning and management of infrastructure, like AWS Lambda, Amazon API Gateway and Amazon Cognito. There are still servers of course, just not under customer management. As explained by AWS Serverless Developer Advocate Paul Johnston: “’Serverless’ is just a name. We could have called it ‘Jeff’”.\n", 21 | "\n", 22 | "* Scheduled Scaling is very simple to set up and can be sufficient for applications where variation in demand is predictable. If users arrive in the morning and leave at night, or if there is a big holiday or news event coming up, just let the schedule handle it. Auto Scaling Scheduled Actions, CloudWatch Events and Elastic Beanstalk Scheduled Scaling can help you program actions without managing your own job scheduling software, such as cron.\n", 23 | "\n", 24 | "* Auto Scaling refers to adjusting provisioning in reaction to CloudWatch Alarms, binding an action to a metric statistic threshold. One simple example, suitable for many applications, would be “Add 1 server when the average CPU utilization is above 80% for 5 minutes” and “Remove 1 server when the maximum memory utilization is below 10% for 10 minutes” (see the sketch at the end of this section). The provisioning actions are executed when the alarm is triggered, so applications can scale to unexpected fluctuations in demand. CloudWatch aggregates data in time using statistics like minimum, maximum, count, samples and average. When using the average statistic, it is always important to consider the metric variance. When the metric varies too much, as is usual in small environments, the average may not be stable enough for auto scaling.\n", 25 | "\n", 26 | "* Predictive Scaling can be used to get the best of scheduled and auto scaling. For some applications it may be difficult to define the scaling rules as they become too many and possibly conflicting; for example, the CPU metric may be too high while the memory metric is too low. For components that take more time to provision, requests must be made further in advance so that demand does not outrun capacity. It may be helpful to train a machine learning model with actual provisioning history and get anticipated scaling suggestions. This feature is not natively provided by AWS; it was pioneered by Netflix's project Scryer and is growing in implementations and possibilities.\n", 27 | "\n", 28 | "Realistically, applications are architected composing the best resource management for each component. For example, delivery of static files and simple computations can be totally serverless, while database storage may use auto scaling and some components may not even be horizontally scalable at all. \n", 29 | "\n", 30 | "Vertical scaling is anticipated and monitored. Relational database writes, directories, resource managers and other single-master services frequently require vertical scaling. When load can't be balanced, switching to a larger server may need to be scheduled due to downtime risk. As this can take more time, it requires more margin in provisioning to keep up with future demand before replacement. \n", 31 | "\n",
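"Returning to the auto scaling example above, a minimal sketch of binding a CloudWatch alarm to a scaling policy (the group name and thresholds are assumptions):\n",
"\n",
"```bash\n",
"# Policy: add 1 instance to the group when triggered\n",
"POLICY_ARN=$(aws autoscaling put-scaling-policy \\\n",
"  --auto-scaling-group-name my-asg --policy-name cpu-scale-out \\\n",
"  --adjustment-type ChangeInCapacity --scaling-adjustment 1 \\\n",
"  --query PolicyARN --output text)\n",
"\n",
"# Alarm: average CPU above 80% for 5 minutes triggers the policy\n",
"aws cloudwatch put-metric-alarm --alarm-name my-asg-high-cpu \\\n",
"  --namespace AWS/EC2 --metric-name CPUUtilization \\\n",
"  --dimensions Name=AutoScalingGroupName,Value=my-asg \\\n",
"  --statistic Average --period 300 --evaluation-periods 1 \\\n",
"  --threshold 80 --comparison-operator GreaterThanThreshold \\\n",
"  --alarm-actions \"$POLICY_ARN\"\n",
"```\n",
"\n",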
32 | "Capacity tests are executed periodically. However well code behaves in development, it may surprise in production. Even a simple change in a regular expression or SQL query can become a performance bottleneck, not only in your code but in any of its dependencies and libraries. Not only does the code change unpredictably, but CPU and network allocation and other limitations may fluctuate randomly. As the components of performance are so unpredictable, the only way to assess the resulting system scalability is testing it from the outside.\n", 33 | "\n", 34 | "* Be thorough but not exhaustive\n", 35 | "* Replay production traffic\n", 36 | "* Learn from outliers (percentiles)\n", 37 | "* Provision according to variance\n", 38 | "\n", 39 | "Just like most types of tests, it is not viable to be exhaustive, but it pays off to be thorough. It may be impossible to exercise all invocation paths, but frequently testing the most important and common use cases is an important step for reliability and not hard to implement. AWS does not currently offer a load testing service or tool, but you can easily integrate open source tools into CodePipeline. \n" 40 | ] 41 | } 42 | ], 43 | "metadata": { 44 | "kernelspec": { 45 | "display_name": "Python 3", 46 | "language": "python", 47 | "name": "python3" 48 | }, 49 | "language_info": { 50 | "codemirror_mode": { 51 | "name": "ipython", 52 | "version": 3 53 | }, 54 | "file_extension": ".py", 55 | "mimetype": "text/x-python", 56 | "name": "python", 57 | "nbconvert_exporter": "python", 58 | "pygments_lexer": "ipython3", 59 | "version": "3.6.4" 60 | } 61 | }, 62 | "nbformat": 4, 63 | "nbformat_minor": 2 64 | } 65 | -------------------------------------------------------------------------------- /reliability/rel04.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# REL 4: How are you monitoring AWS resources?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Metrics, logs and events are monitored. Monitoring is invaluable for learning and improving systems, especially as architectures are decomposed into fine-grained services and become more complex. Monitoring provides not only the visibility to ensure that resources are at healthy utilization, but also a history to understand the behavior of normal and abnormal conditions. \n", 15 | "\n", 16 | "Amazon CloudWatch is the central monitoring service at AWS; it manages three types of monitoring resources: metrics, events and logs.\n", 17 | "\n", 18 | "Metrics are time series recorded from monitoring both AWS resources and custom resources. CloudWatch automatically aggregates and stores metric data, but the resolution and availability of metric data change with time:\n", 19 | "\n", 20 | "| Data point period | Available for |\n| --- | --- |\n", 21 | "| less than 60 seconds | 3 hours (high-resolution custom metrics) |\n", 22 | "| 60 seconds (1 minute) | 15 days |\n", 23 | "| 300 seconds (5 minutes) | 63 days |\n", 24 | "| 3600 seconds (1 hour) | 455 days (15 months) |\n", 25 | "\n", 26 | "Aggregating and expiring data automatically makes metric storage extremely efficient, with fine visibility into the present and an aggregate history of measurements. As the documentation clarifies:\n", 27 | "\n", 28 | "“For example, if you collect data using a period of 1 minute, the data remains available for 15 days with 1-minute resolution. 
After 15 days this data is still available, but is aggregated and is retrievable only with a resolution of 5 minutes. After 63 days, the data is further aggregated and is available with a resolution of 1 hour. If you need availability of metrics longer than these periods, you can use the GetMetricStatistics API to retrieve the datapoints for offline or different storage.”\n", 29 | "\n", 30 | "CloudWatch serves metric data as statistics over different aggregation periods. Each statistic provides a different perspective on the behavior of the metric and its alarms. Think of them as lenses into the system, using the correct one to detect and correct each circumstance of interest. The available statistics are Average, Minimum, Maximum, Sum, Sample Count and Percentiles.\n", 31 | "\n", 32 | "The average is the widely used arithmetic mean, the sum of values divided by the number of samples, computed automatically by CloudWatch. The key to using the average statistic is understanding its variability, especially in small environments. Consider for example the average CPU utilization of 4 application servers with different loads. \n", 33 | "The variance will be so high that the metric is not stable enough to create an alarm or even interpret. The more data and the more consistent the resource utilization, the more reliable the average will be.\n", 34 | "\n", 35 | "Maximum and minimum are straightforward, but the latter is particularly useful for monitoring resource exhaustion. When “available memory”, “available threads in pool”, “available connections in pool”, or any other resource availability reaches zero, it is probably a cause for alarm. The Sum statistic is useful to determine total volumes, for example the sum of requests to a load balancer or the total duration of lambda executions. The Sample Count statistic simply tells how many data points were aggregated; the interesting thing is that the sampling rate is usually constant, so variations in sample counts can indicate network partitioning and other issues.\n", 36 | "\n", 37 | "Percentiles are helpful to understand the variation of metrics and catch outliers that would be hard to see using averages or other statistics. Especially with latency, they can reveal garbage collections or other circumstances that affect only a few of the requests in the higher percentiles.\n", 38 | "\n", 39 | "By default, CloudWatch records metrics related to AWS resources, such as EC2 instance CPU utilization, ELB latency, count of S3 objects and so on. Custom metrics can be created for resources that CloudWatch can't probe from the outside: any kind of measurement can be sent to CloudWatch using the PutMetricData API. This is very useful to create alarms on resources not managed by AWS, such as application server thread pools and connection pools, and to monitor business metrics, such as transaction volumes, or literally any quantity. \n", 40 | "\n",
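"Publishing a custom data point is a single call; a minimal sketch with an arbitrary namespace and metric name:\n",
"\n",
"```bash\n",
"# Publish a custom metric, e.g. available connections in an application pool\n",
"aws cloudwatch put-metric-data --namespace MyApp \\\n",
"  --metric-name PoolConnectionsAvailable --value 42\n",
"```\n",
"\n",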
41 | "Logs are grouped streams of text emitted from application execution. The metrics and events provide a limited overview of system behavior, but logs can reveal the true application circumstances from the inside. Having them all in CloudWatch helps to correlate data to form a better picture of events and how to handle them. Another key feature of CloudWatch Logs is triggering lambda functions in response to log content, alarming on interesting application conditions such as exceptions or even simple variable dumps.\n", 42 | "\n", 43 | "Events notify customers of occurrences in their AWS resources, or of custom events. CloudWatch Events covers many interesting events, such as instance creation, auto scaling group changes or even S3 object-level operations.\n", 44 | "\n", 45 | "CloudWatch Events is also widely used for automation, as schedules can be defined as event sources. This lets you execute lambda functions on a specific schedule and take virtually any action without running your own job server.\n", 46 | "\n", 47 | "Responses are continuously automated. Coding the detection of and responses to system operation events is the mantra for improving reliability. Any software team can easily speak about technical debt and the issues it causes, from those that customers often hit to those that developers hope never happen. Prioritizing and automating the handling of changes, expected and unexpected, is a way to gradually and continuously improve reliability.\n", 48 | "\n", 49 | "The more frequent and risky a change is, the earlier it should be automated, so application deployment is clearly the first target. But it goes much further: resource provisioning, configuration changes, failover, substitution, decommissioning... all can be automated so that precious human attention is dedicated only to the issues intriguing enough to require it. Or those that have not been automated yet.\n", 50 | "\n", 51 | "Leverage third-party tools. Monitoring has a rich partner ecosystem with additional features that go from collection to visualization and predictive analytics. Visit the monitoring section of the AWS Marketplace for a list of partner tools to improve monitoring analytics.\n" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [] 60 | } 61 | ], 62 | "metadata": { 63 | "kernelspec": { 64 | "display_name": "Python 3", 65 | "language": "python", 66 | "name": "python3" 67 | }, 68 | "language_info": { 69 | "codemirror_mode": { 70 | "name": "ipython", 71 | "version": 3 72 | }, 73 | "file_extension": ".py", 74 | "mimetype": "text/x-python", 75 | "name": "python", 76 | "nbconvert_exporter": "python", 77 | "pygments_lexer": "ipython3", 78 | "version": "3.6.4" 79 | } 80 | }, 81 | "nbformat": 4, 82 | "nbformat_minor": 2 83 | } 84 | -------------------------------------------------------------------------------- /reliability/rel05.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# REL 5: How are you executing change?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Software is delivered continuously. The faster new features are delivered to users, the earlier new data, learnings and profits come. However, as architectures and teams grow in size and complexity, the more needs to be done and the longer it takes to release a new feature reliably. Continuous delivery, for our purposes, refers to the set of techniques used to automate the software pipeline and get changes to users quickly and reliably.\n", 15 | "\n", 16 | "The cloud puts all the effort of acquiring and managing infrastructure behind an API so infrastructure can be expressed as code. That allows every code push to be immediately deployed to production, but that is not necessarily a good idea. Continuous delivery requires the proper safeguards on testing, monitoring and automation to be reliable. Each organization will have different repositories and processes for code and data. AWS CodePipeline lets you model your delivery process as a workflow and manage its execution as changes flow. A sample pipeline can be viewed in the management console, but remember pipelines can also be created using AWS CloudFormation, the command line or any language with the SDKs.\n", 17 | "\n",
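"For instance, a whole stack of infrastructure can be created or updated from a template with one command; a sketch mirroring this repository's lecho app:\n",
"\n",
"```bash\n",
"# Create or update the stack described by a packaged template\n",
"aws cloudformation deploy --template-file target/lecho.out.yaml \\\n",
"  --stack-name lecho --capabilities CAPABILITY_IAM\n",
"```\n",
"\n",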
18 | "Software delivery is extremely heterogeneous: each language, framework, legacy and business can be updated quite differently. AWS CodeStar is a service to help build pipelines for common technology stacks, such as Java with Spring or Node.js with Express. The CodeStar new project wizard guides you from connecting source repositories on AWS CodeCommit or GitHub to deployment in several services.\n", 19 | "\n", 20 | "Continuous delivery is not only tools, but the process of how to apply them and the people that they serve. The following techniques are frequently used to help software evolve fast, reliably and safely. They are not strict rules but building blocks, applied successfully in many architectures, each time with its own contextual variations. The techniques reinforce each other, but are hardly adopted all at once. As an architecture evolves, so does its development process, helped by some or all of the following: Collective Ownership, Zero Regressions, Continuous Integration, Feature Flags, Microservices, Infrastructure as Code, Immutable Infrastructure, Blue-Green Deployments and Canary Releases. \n", 21 | "\n", 22 | "## Collective Ownership\n", 23 | "\n", 24 | "The cornerstone of continuous delivery is the commitment to build for the customer. An outage or issue is not a development or operations problem. It's a business commitment to the customer first and foremost. As Amazon CTO Werner Vogels declared:\n", 25 | "\n", 26 | "“The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.”\n", 27 | "\n", 28 | "Collective ownership is the key to balancing the pace of innovation. In an ideal world every service would be launched at the same time in all regions and with support for AWS CloudTrail and AWS CloudFormation. And that is getting faster, but it is still more efficient to serve some customers first than to have all of them wait. The same logic can be applied to mitigate technical debt and other issues that arise when teams are segregated into hard roles. Everyone is responsible for business continuity in some aspect and should share the same priorities: keeping the software up and improving it as fast as possible, in the direction of the customer needs, while preventing and eliminating bugs as early as possible.\n", 29 | "\n", 30 | "## Fix before build\n", 31 | "\n", 32 | "The fastest possible delivery would be to publish every code change directly to production. But it is very hard to determine when a bug has been introduced and what impact it will have. 
18 | "Software delivery is extremely heterogeneous: each language, framework, legacy and business can be updated quite differently. AWS CodeStar is a service that helps build pipelines for common technology stacks, such as Java with Spring or Node.js with Express. The CodeStar new project wizard guides you from connecting source repositories on AWS CodeCommit or GitHub to deployment in several services.\n", 19 | " \n", 20 | "Continuous delivery is not only about tools, but also about the process of applying them and the people they serve. The following techniques are frequently used to help software evolve fast, reliably and safely. They are not strict rules but building blocks, applied successfully in many architectures, each time with its own context variations. The techniques reinforce each other, but are hardly adopted all at once. As an architecture evolves, so does its development process, helped by some or all of the following: Collective Ownership, Zero Regressions, Continuous Integration, Feature Flags, Microservices, Infrastructure as Code, Immutable Infrastructure, Blue-Green Deployments and Canary Releases. \n", 21 | "\n", 22 | "## Collective Ownership\n", 23 | "\n", 24 | "The cornerstone of continuous delivery is the commitment to build for the customer. An outage or issue is not a development or operations problem. It’s a business commitment to the customer first and foremost. As Amazon CTO Werner Vogels declared:\n", 25 | "\n", 26 | "“The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.”\n", 27 | "\n", 28 | "Collective ownership is key to balancing the pace of innovation. In an ideal world every service would be launched at the same time in all regions and with support for AWS CloudTrail and AWS CloudFormation. And that is getting faster, but it is still more efficient to serve some customers first than to have everyone wait. The same logic can be applied to mitigating technical debt and other issues that arise when teams are segregated into hard roles. Everyone is responsible for business continuity in some aspect and should share the same priorities: keeping the software up and improving it as fast as possible, in the direction of customer needs, while preventing and eliminating bugs as early as possible.\n", 29 | "\n", 30 | "## Fix before build\n", 31 | "\n", 32 | "The fastest possible delivery would be to publish every code change directly to production. But it is very hard to determine when a bug has been introduced and what impact it will have. What is very likely is that the longer bugs take to fix, the more they will cost and the more they will breed. And as they do, the schedule, cost and quality of the software process get more and more unreliable. The “broken windows” theory suggests that if everyone helps prevent small infractions, such as breaking the build or uncaught exceptions, the bigger issues can be significantly reduced. No developer wants to be the only one doing things wrong, but no one wants to be the only one caring to do them right either. \n", 33 | "To keep software reliable, bugs must be prevented thoroughly, detected early and corrected quickly. The many different types of automated software testing are beyond the scope of this discussion, but they are the first line of defense in bug prevention. At each commit, the pipeline can use AWS CodeBuild or other partner test tools to automatically verify whether the code is ready to proceed, according to the automated test suite.\n", 34 | " \n", 35 | "## Continuous Integration\n", 36 | "\n", 37 | "The longer developers work alone on their own local copies of the code, the harder it is to merge back to the master branch or trunk. Ideally, coding tasks would be short enough to merge frequently and without stepping on each other’s toes. But tasks such as large refactorings, epic features or irreproducible bugs can keep a developer away from the team for a long time. \n", 38 | "\n", 39 | "Continuous Integration (CI) is about getting feedback from automated checks on the codebase as often as possible, so everyone can build faster and safer. The “State of DevOps Report” indicates that the adoption of CI helps to reduce change failures by up to 5 times, but the real difference is in the time from commit to deploy, which can be up to 440 times shorter, with deployments done up to 46 times more often.\n", 40 | "\n", 41 | "Continuous integration starts with the shared repository. AWS CodeCommit provides private managed git repositories and can thus be used with any CI tools, besides the native integration with AWS Developer Tools. GitHub is also widely supported and is more suitable for open source projects. Once a repository is set up, it is important to decide on a branching strategy, so that everybody works on the same flow. Here again each team has its own quirks, but usually settles around one of two general strategies:\n", 42 | "\n", 43 | "* Trunk-based Development is when everyone codes directly on the master branch. It cuts the branching issues by doing it minimally or not at all. As the codebase is single and integrated often, extra care must be taken not to break it for other developers and even users, a good idea in the first place. The culture and tools for comprehensive automated testing are an important safety net for reliable trunk-based development. \n", 44 | "\n", 45 | "* Git Workflow is the popular name of the branching model described by Vincent Driessen in 2010 and adopted by many teams and tools. The master branch still holds the production-ready code, but a develop branch aggregates changes while the new release is under development. Each of those changes is developed in a separate feature branch, merged back to the develop branch when done. This way unfinished code gets integrated in develop and can be thoroughly tested before merging to master and proceeding in the pipeline to production.\n", 46 | "\n", 47 | "In both strategies, automated validation should begin as soon as code is pushed into the repository, ideally to any branch. From there the pipeline starts a new cycle of compiling, testing, packaging and deploying. AWS CodePipeline manages the execution of the build workflow, but the actual computation of each build step is handled by AWS CodeBuild or one of the integrated partner tools. \n", 48 | "\n", 49 | "Using AWS CodeBuild, the build specification file (buildspec.yml) declares the list of commands to be executed at each of its lifecycle phases (install, pre_build, build, post_build) and the resulting artifacts. Here is a sample build specification, for Java using Maven in this case, but the commands could be replaced to run anything.\n", 50 | " \n",
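The sample below is a reconstruction, a minimal sketch for a Maven project; the artifact path would vary per project:

```yaml
# buildspec.yml: minimal sketch for a Java/Maven build on CodeBuild.
version: 0.2
phases:
  install:
    commands:
      - echo Nothing to install for this sketch
  pre_build:
    commands:
      - mvn clean
  build:
    commands:
      - mvn package
  post_build:
    commands:
      - echo Build completed on `date`
artifacts:
  files:
    - target/*.jar
```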
51 | "Teams frequently adopt further practices to improve the development process. Like any other scheduled task, nightly builds can be triggered by a Lambda function and a CloudWatch Events schedule to ensure at least a daily pace of integration. \n", 52 | "\n", 53 | "Integrating frequently, even when there are no code changes, can be important to detect issues with dependencies and integrations. For example, the Amazon Inspector service agent can detect vulnerabilities known to the Common Vulnerabilities and Exposures database. Integrating that rules package in your build pipeline can prevent you from deploying a dependency with a known CVE, such as Heartbleed, Shellshock, POODLE and many others. Some package managers may even let you upgrade dependency versions automatically, and that may be a good idea as well. \n", 54 | "\n", 55 | "CodeBuild can also cache dependencies on S3, significantly improving build times: each build environment is created empty, so without a cache all dependencies would be fetched over and over unnecessarily.\n",
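The cache is declared on the build project itself. A minimal sketch, reusing the code bucket exported by this repository's shared resources; the service role and build image are hypothetical placeholders:

```yaml
# Sketch: a CodeBuild project caching dependencies on S3.
BuildProject:
  Type: AWS::CodeBuild::Project
  Properties:
    ServiceRole: !GetAtt BuildRole.Arn     # hypothetical role
    Source: { Type: CODEPIPELINE }
    Artifacts: { Type: CODEPIPELINE }
    Environment:
      Type: LINUX_CONTAINER
      ComputeType: BUILD_GENERAL1_SMALL
      Image: aws/codebuild/java:openjdk-8  # image name may vary
    Cache:
      Type: S3
      Location: !Sub
        - '${Bucket}/build-cache'
        - Bucket: !ImportValue code-bucket
```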
 56 | "\n", 57 | "Once each build is successful, CodeBuild will upload the artifact files to S3 and pass the key name to the following stages in the pipeline. Build results are also emitted to CloudWatch Events, so they can trigger Lambda functions for any kind of post-processing or notification, such as e-mail or a chat channel. Artifacts then keep progressing through the pipeline to further tests, approvals and deployments. But not necessarily new releases.\n", 58 | "\n", 59 | "## Feature Flags\n", 60 | "\n", 61 | "A new software release can be a significant event for the business, involving announcements, promotions, sales and more. But it does not need to be a huge event operationally. The code can be there, working perfectly for a long time, just disabled for all or most customers by feature flags.\n", 62 | "For example, a new feature designed for Christmas may hardly make sense before that, but its code can be deployed as early as Halloween. As nobody wants to watch logs on Christmas, the new feature can be tested and deployed, but kept hidden. The business release then becomes a separate event from the software deployment, which probably happened long before.\n", 63 | "The actual implementation may range from a simple feature flag and an if statement to dynamic properties of the system. There may be value in partial releases, whitelisting new features and services gradually to larger groups of users. Many AWS services are announced first in a “preview” period before general availability. To use them, customers need to submit an application and may be contacted by the service team for feedback. \n",
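Feature flags apply to infrastructure as well: a CloudFormation condition can gate an entire resource behind a parameter. A minimal sketch, with hypothetical parameter and resource names:

```yaml
# Sketch: gating a resource behind a feature-flag parameter.
Parameters:
  EnableChristmasFeature:
    Type: String
    AllowedValues: ['true', 'false']
    Default: 'false'
Conditions:
  ChristmasEnabled: !Equals [!Ref EnableChristmasFeature, 'true']
Resources:
  ChristmasQueue:
    Type: AWS::SQS::Queue
    Condition: ChristmasEnabled   # only created when the flag is on
```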
 64 | "## Microservices\n", 65 | "As business grows, so do software development teams and the complexity of their coordination. To sustain the vertiginous pace of growth of Amazon, and of countless startups, software teams must be able to work independently. \n", 66 | "\"Adding human resources to a late software project makes it later\" – Brooks's Law\n", 67 | "This is so much of an issue that back in 2002 Jeff Bezos started to restructure Amazon.com around what he called “two pizza teams”: autonomous groups of around a dozen people, a crew small enough for effective teamwork and to be fed with a couple of pizzas. This decomposition of teams and services was challenging, but its benefits went far beyond reducing synchronization and complexity. Faster onboarding of new engineers and effective team building are key to accelerating the pace of innovation. In 2008, Amazon was already decomposed into hundreds of services, forming a famously dense dependency graph.\n", 68 | " \n", 69 | "\n", 70 | "\n", 71 | "Netflix, Uber, eBay and many other fast-scaling enterprises have adopted this strategy of fine-grained services and responsibilities to scale faster, both in terms of business and of software architecture. Considering that each of those services will be under continuous improvement and delivery, provisioning infrastructure by hand quickly becomes unviable.\n", 72 | "\n", 73 | "## Infrastructure as Code\n", 74 | "\n", 75 | "Deep automation is essential to deliver software at such scale and rate. A new software version or feature may require more than new code: it may also need new infrastructure components. In traditional on-premises IT, infrastructure components are frequently restricted to servers (application, messaging and database) and perhaps a load balancer or a network storage appliance. Bringing in a new resource type, like a NoSQL database or a machine learning cluster, may pose a significant business commitment and risk. On the cloud, new features may bring in new resource types, or replace them, with little to no consequences.\n", 76 | "\n", 77 | "As the name implies, Amazon Web Services are offered over a public web API. This allows infrastructure resources to be managed with code and to inherit many benefits from code development: configuration management, version control, automated testing, costless duplication and so on. It also allows different approaches to coding and abstractions, as fits different programming languages and styles. Automations can be as simple as a shell script using the AWS CLI, or written in most modern languages using the AWS SDKs. However, those scripts quickly grow in complexity and may end up consuming as much attention as the application they manage. Instead of managing resources imperatively, it is more effective for infrastructure code to declare resources and have an interpreter manage them. \n", 78 | "\n", 79 | "AWS CloudFormation creates and manages resources based on templates declared in either JSON or YAML. The template is a recipe for the application architecture and, once built and tested, can be used to create multiple instances of it. It can also take parameters, execute mappings and functions, share outputs and offer many other features for managing infrastructure as code. Here is a sample template file using YAML:\n", 80 | " \n", 81 | "\n",
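The sample referenced above can be reconstructed as a minimal sketch; the parameter and queue names are illustrative, not from this repository:

```yaml
# Sketch: a minimal CloudFormation template with a parameter and an output.
AWSTemplateFormatVersion: '2010-09-09'
Description: Minimal example of infrastructure as code
Parameters:
  Stage:
    Type: String
    Default: dev
Resources:
  AppQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: !Sub app-queue-${Stage}
Outputs:
  AppQueueUrl:
    Value: !Ref AppQueue
```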
 82 | "CloudFormation stacks can be managed by the console, SDK or CLI just like other services. During development and testing those tools help debugging; then, when ready, templates are published to the code repository, usually together with the application code. This can trigger another cycle of your continuous delivery pipeline to take changes to production. This way not only new application code can be automatically provisioned, but also the infrastructure elements it depends on, such as caches, databases and many others. See the “CloudFormation Type Reference” page for the complete list of resources and properties.\n", 83 | "\n", 84 | "## Immutable Infrastructure\n", 85 | "\n", 86 | "As applications are decomposed into a complex set of dependencies, it may be difficult to determine whether a change will have availability impact or other negative consequences. Some application servers can hot-deploy new application packages, others can't. Some databases need downtime to change schema, others don't have a schema at all. Some AWS resources can be changed in place, others may need replacement. Once an environment is up and running, it is safer not to change it at all.\n", 87 | "\n", 88 | "Instead of manipulating resources receiving live traffic, consider creating a new “clone” environment and redirecting traffic to it once it is stable and ready. With infrastructure declared as code, the new environment can be provisioned instantly, and as it still has no traffic, the extra cost should be negligible. After stabilizing and passing all health checks, the new environment can start receiving live traffic. Once all traffic has shifted to the new environment and it is performing safe and sound, the old environment can be safely decommissioned. Or, even better, kept for a while to simplify rollback in case a bug slips into production.\n", 89 | "Immutability and infrastructure as code have important implications for security as well. As all changes are applied automatically, there should be no need for SSH or other management access. Particularly in production environments, a login should even raise an alarm: if someone logged in, it is either an automation issue or an attack, and system administrators should be notified of both. Immutability also makes it possible to reuse or respawn old environments for auditing, or for understanding the origin of issues, even long after they were fixed.\n", 90 | "\n", 91 | "## Blue Green Deployment\n", 92 | "\n", 93 | "When a bug gets deployed or anything goes wrong, traffic can be rolled back to the old environment to quickly prevent further damage. How quick “quickly” is depends on the failure detection and routing capabilities of each system. It is common to rely on DNS record changes for failover, pointing resolution to a secondary environment and requiring only a reconnection from clients. While this is simple to implement, DNS changes may take a couple of minutes to propagate, depending on the records' Time to Live (TTL) and on how DNS caches expire. The AWS Elastic Beanstalk “Swap Environment URLs” feature does exactly that with a single invocation.\n", 94 | "\n", 95 | "Reduced failover times can be obtained using “heartbeat” protocols and client-side fault detection, as featured in MariaDB clients and Netflix Eureka. But however fast and reliable failover can be, it is even better to prevent it. Towards that, development processes can go further than the “zero regressions” policy suggested earlier and also adopt a policy of data compatibility: the new version must be compatible, and able to execute concurrently, with the old version against the same data. Databases that do not require a predefined schema, such as Amazon DynamoDB, are very helpful for complying with such a policy, as compatibility moves entirely to application code, under the developers' control.\n", 96 | "\n", 97 | "Having several application environments with the same data is helpful not only for rolling out changes, but also for enabling experimentation. You can have alternative versions of the application running at the same time and monitor metrics to find out which is better, whatever “better” means for the experiment. This could be used to optimize for performance, sales, conversions or any other business criteria. It is instrumental for adopting lean methodologies and making decisions based not on speculation but on experience and data from actual users. \n", 98 | "\n", 99 | "## Canary Release\n", 100 | "\n", 101 | "A canary release, like the canaries once used in coal mines, increases reliability by anticipating the detection of issues. Instead of deploying a new environment directly into the firehose of production traffic, canary releases get reduced and/or simulated traffic first. Once monitoring indicates the new environment is reliable, or more successful in an experiment, it can gradually receive more traffic. The actual implementation may depend on how safe and how fast deployments must be. Some common alternatives are:\n", 102 | "\n", 103 | "* Synthetic requests can be designed and fired at the application to check whether it behaves reliably. Modern tools let developers craft requests in several protocols and fire them from a distributed cluster of worker nodes. However, as this usually generates fake data, synthetic canary environments are usually discarded, and a new copy with the vetted version is deployed into production afterwards.\n", 104 | "\n", 105 | "* Recorded traffic from live production can be replayed to the canaries. Beyond the scenarios tested by the developer with synthetic requests, replayed traffic exposes the application to the creativity of users in breaking software. More than that, it can ensure bugs were corrected properly and help to enforce the zero regressions policy. Recorded canaries are also usually discarded, as the same data is already in production and transactions would be duplicated. \n", 106 | "\n", 107 | "* Live canaries receive more traffic according to behavior. While they keep up with reliability metrics, traffic is periodically shifted to the new environment, or otherwise rolled back.\n", 108 | "\n", 109 | "Although these change management techniques were presented in the reliability context, their benefits extend much further. The reliable and simultaneous execution of several environments can be used not only to roll out new versions, but also to perform business experiments. Traffic can be shifted not only according to reliability, but also according to other technical metrics, such as latency or throughput, or even business goals, such as conversions or revenue."
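One common way to implement this gradual shifting is DNS weighted routing. A minimal sketch with Route 53 weighted records, reusing the zone-id export from this repository's shared resources; the record name and environment endpoints are hypothetical:

```yaml
# Sketch: weighted DNS records sending ~10% of traffic to a canary.
BlueRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !ImportValue zone-id
    Name: app.wa.julio.cloud.
    Type: CNAME
    TTL: '60'
    SetIdentifier: blue
    Weight: 90
    ResourceRecords: [blue-env.wa.julio.cloud]
CanaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !ImportValue zone-id
    Name: app.wa.julio.cloud.
    Type: CNAME
    TTL: '60'
    SetIdentifier: canary
    Weight: 10
    ResourceRecords: [green-env.wa.julio.cloud]
```

Raising the canary's weight, and lowering the other, shifts traffic gradually; setting a weight to zero rolls it back.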
110 | ] 111 | } 112 | ], 113 | "metadata": { 114 | "kernelspec": { 115 | "display_name": "Python 3", 116 | "language": "python", 117 | "name": "python3" 118 | }, 119 | "language_info": { 120 | "codemirror_mode": { 121 | "name": "ipython", 122 | "version": 3 123 | }, 124 | "file_extension": ".py", 125 | "mimetype": "text/x-python", 126 | "name": "python", 127 | "nbconvert_exporter": "python", 128 | "pygments_lexer": "ipython3", 129 | "version": "3.6.4" 130 | } 131 | }, 132 | "nbformat": 4, 133 | "nbformat_minor": 2 134 | } 135 | -------------------------------------------------------------------------------- /reliability/rel06.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# REL 6: How are you backing up your data?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Data is backed up to S3 and Glacier. These storage services were designed for the highest durability (99.999999999%) and long-term storage. Files there are automatically copied across multiple isolated facilities in the same region and can be set up for cross-region replication. Hosting yet another copy yourself hardly improves durability and is probably unnecessary.\n", 15 | "\n", 16 | "Backups are unlikely to be restored, so it usually pays off to store them in S3 Standard – Infrequent Access storage. Objects in this storage class have the same durability as Standard storage for around half the cost, but a lower availability SLA (99.9%) and a minimum billing of 128KB and 30 days. \n", 17 | "\n", 18 | "Some applications, particularly in regulated institutions, may have to retain data for a long time. In healthcare, for example, HIPAA compliance requires 6 years of retention, but local government and industry standards may require a decade or even more. Glacier can help with such permanent archival, with the same durability as S3 but at a storage cost of less than half a penny per GB-month. Retrieval operations are a bit different, with three options for balancing duration and cost: Expedited, Standard and Bulk.\n", 19 | "\n", 20 | "Data can be moved from one storage class to another as it gets “cold”. A fresh backup may be stored in S3 Standard, to be used for testing or debugging. Once it becomes unlikely to be needed, it can move to Standard – IA. When it needs permanent archival, it can be frozen into Glacier, and finally expired and deleted after that. Using S3 lifecycle policies, these rules can be easily defined and applied automatically.\n"
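A minimal sketch of such a lifecycle in CloudFormation; the transition and expiration windows below are illustrative and should be tuned to your retention requirements:

```yaml
# Sketch: backups cool down to Standard-IA, freeze into Glacier,
# and expire after roughly seven years.
BackupBucket:
  Type: AWS::S3::Bucket
  Properties:
    LifecycleConfiguration:
      Rules:
        - Id: cool-down-backups
          Status: Enabled
          Transitions:
            - StorageClass: STANDARD_IA
              TransitionInDays: 30
            - StorageClass: GLACIER
              TransitionInDays: 365
          ExpirationInDays: 2555
```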
 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [] 29 | } 30 | ], 31 | "metadata": { 32 | "kernelspec": { 33 | "display_name": "Python 3", 34 | "language": "python", 35 | "name": "python3" 36 | }, 37 | "language_info": { 38 | "codemirror_mode": { 39 | "name": "ipython", 40 | "version": 3 41 | }, 42 | "file_extension": ".py", 43 | "mimetype": "text/x-python", 44 | "name": "python", 45 | "nbconvert_exporter": "python", 46 | "pygments_lexer": "ipython3", 47 | "version": "3.6.4" 48 | } 49 | }, 50 | "nbformat": 4, 51 | "nbformat_minor": 2 52 | } 53 | -------------------------------------------------------------------------------- /reliability/rel07.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# REL 7: How does your system withstand component failures?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Predictable failure modes are considered for every component type.\n", 15 | "Failures happen not only all the time, but also in the most interesting ways. Although some outages are quite unpredictable, many of them can be expected, such as network partitioning, insufficient capacity, race conditions and many others. \n", 16 | "\n", 17 | "The most common and efficient failover mechanism is replacement with a redundant spare, particularly on the cloud, where spares can be instantly provisioned. To avoid a single point of failure, each component should be replaceable with minimal consequences, but that depends a lot on its architecture. Each system will handle failover a bit differently according to its data management, but usually around one of the following strategies:\n", 18 | "\n", 19 | "Multi-master services can receive requests on any node; like web servers, for example, all nodes behave equally and, when one fails, clients will get the same responses from a different one without even noticing. That is usually the case for stateless services, but it gets more complicated when data needs to be replicated for redundancy. When a network failure happens and data can’t be replicated, services must choose between responding with errors or with stale data. In other words, when a network partition happens, distributed systems must choose between availability and consistency. This key principle is known as the CAP theorem, proposed by Eric Brewer in 1998. The decision depends on each business, but as outages can have significant implications, it is often easier to manage consistency. Given enough nodes and connections, algorithms such as Consistent Hashing, Vector Clocks and Sloppy Quorum can be used to design distributed storage with minimal inconsistent states. Amazon DynamoDB uses these techniques to let you choose the consistency mode at read time, so it’s up to you whether to take the first response or wait for consistency. But for customer-managed systems and other databases, replication type and latency need to be considered case by case.\n", 20 | "\n", 21 | "If the replication group is within a single region, latency across Availability Zones is predictable and very low, typically less than a couple of milliseconds. Replication can then be synchronous, waiting for copies to complete before sending a response to the client.
It is as though the transaction happened at the same time on all servers, and it is certainly consistent. This can be effective for a small number of servers, but as data grows, replication time can become larger than the client timeout and lead to an outage.\n", 22 | "\n", 23 | "If data is replicated to different regions, or to on-premises servers over the internet, latency can be unpredictable and too large to wait for synchronous replication. In these cases, it may be better to keep replication running asynchronously in the background and respond immediately to requests, even if with stale data for a while.\n", 24 | "\n", 25 | "Modern distributed systems may use a mix of synchronous and asynchronous replication to reach the best of both worlds, but each takes a different approach. Databases such as DataStax Cassandra use “multi-datacenter replication” and similar features to store data across the globe. When consistency is desired but not provided by the system, even with the possible availability issues, it is not uncommon to resort to a consistency tier, like Netflix S3mper does for S3 listings.\n", 26 | "\n", 27 | "Unpredictable failure modes are investigated and prevented. Even the most resilient systems may eventually face an outage, due to bugs, human error and other unpredictable conditions. Sometimes these conditions are so rare that it is not even worth working around them after fixing. But if that happens twice or more, it should be investigated and prevented, possibly with the same process and tools defined for security incident response (SEC 12).\n", 28 | "\n" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [] 37 | } 38 | ], 39 | "metadata": { 40 | "kernelspec": { 41 | "display_name": "Python 3", 42 | "language": "python", 43 | "name": "python3" 44 | }, 45 | "language_info": { 46 | "codemirror_mode": { 47 | "name": "ipython", 48 | "version": 3 49 | }, 50 | "file_extension": ".py", 51 | "mimetype": "text/x-python", 52 | "name": "python", 53 | "nbconvert_exporter": "python", 54 | "pygments_lexer": "ipython3", 55 | "version": "3.6.4" 56 | } 57 | }, 58 | "nbformat": 4, 59 | "nbformat_minor": 2 60 | } 61 | -------------------------------------------------------------------------------- /reliability/rel08.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# REL 8: How are you testing your resiliency?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "“There is no compression algorithm for experience” – AWS CEO Andy Jassy\n", 15 | "Game Days are executed periodically. Handling turbulent conditions is an important skill for a system administrator, but keeping systems up and running matters even more. Gaining experience in handling failures under controlled scenarios is not only a very helpful exercise for engineering skills, but also exposes bottlenecks and design issues early on.\n", 16 | "\n", 17 | "It may be difficult for small organizations to invest deeply in resiliency testing, but at least some periodic game days are important.
\n", 18 | "\n", 19 | "## Chaos Engineering.\n", 20 | "\n", 21 | "“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” – Chaos Engineering Principles\n", 22 | "\n", 23 | "Chaos engineering is a new way to increase confidence in distributed systems. It is not only about killing resources with chaos monkey in production, but systematically designing and executing experiments to improve reliability. It is about preventing, solving and learning from incidents, instead of hoping they won’t happen. You can think of chaos engineering as the garbage collector of technical debt.\n", 24 | "\n", 25 | "The fundamental tenet of chaos engineering is embracing that distributed systems failures are unpredictable and complex, particulary for fine-grained microservices under high traffic. So instead of waiting for that bad day when a server fails, let’s prepare for server failures and inject them to make sure. And do the for the many different components we use and the mysterious ways they can fail. Once a failure is revealed, it can be monitored and prevented. There are many tools to help chaos engineering, but at core it is this continuous process of revealing unknown issues earlier in controlled experiments.\n", 26 | "\n" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [] 35 | } 36 | ], 37 | "metadata": { 38 | "kernelspec": { 39 | "display_name": "Python 3", 40 | "language": "python", 41 | "name": "python3" 42 | }, 43 | "language_info": { 44 | "codemirror_mode": { 45 | "name": "ipython", 46 | "version": 3 47 | }, 48 | "file_extension": ".py", 49 | "mimetype": "text/x-python", 50 | "name": "python", 51 | "nbconvert_exporter": "python", 52 | "pygments_lexer": "ipython3", 53 | "version": "3.6.4" 54 | } 55 | }, 56 | "nbformat": 4, 57 | "nbformat_minor": 2 58 | } 59 | -------------------------------------------------------------------------------- /reliability/rel09.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# REL 9: How are you planning for disaster recovery?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "All the ideas above are great for improving reliability and preventing outages, but what about the day they actually do happen? That is when a tested disaster recovery plan pays off. Like insurance, you’d rather waste that resource than using, but it is important to have the cost and coverage that works better for your scenario.\n", 15 | "\n", 16 | "For the purpose of this chapter, a disaster can be defined as an unexpected irrecoverable failure. Systems are down and not restarting, data is being lost, people are stressed. A disaster recovery plan balances the speed and cost of getting back up. Two DR metrics are frequently discussed:\n", 17 | "\n", 18 | "Recovery Time Objective (RTO): How long it should take for systems to back up. For example, a recovering a database backup from cold storage could take a couple hours, but only a couple minutes to recover from a secondary replica.\n", 19 | "\n", 20 | "Recovery Point Objective (RPO): How much data can be lost, usually in a time unit. For example, if a daily backup is taken, up to 24 hours of data can be lost in a disaster. 
In this model the data-loss window spans the downtime plus the time since the last copy, so the two metrics must be planned together. \n", 21 | "\n", 22 | "It may be very expensive to get these numbers down in traditional IT, as it requires independent facilities and redundant networking. The Amazon Global Infrastructure improves that significantly by offering at least two segregated Availability Zones per region and several regions worldwide at similar costs. However, it is still up to the architect to design the disaster recovery scenario appropriate for their demand and budget. The “Using Amazon Web Services for Disaster Recovery” whitepaper suggests the following alternatives:\n", 23 | "\n", 24 | "## Backup and Restore\n", 25 | "\n", 26 | "Losing data can be truly catastrophic, so having a recent and tested backup is probably the most common advice in IT. Amazon S3 can store the backup files reliably and efficiently, especially using Standard – IA and Glacier. However, there are other services and tools that can help get the data there. EBS snapshots are copies of EBS volumes stored on S3, taken incrementally to minimize time and cost. Although snapshots are incremental, they are independent: EBS will keep all the data required to restore any snapshot, and data blocks are safely deleted only when there are no more snapshots referencing them.\n", 27 | "\n", 28 | "This approach pays only for data storage, as no redundant servers are in place, but it may take a while to run Glacier data retrievals and resurrect servers from snapshots. This can cost a lot in terms of RPO, as data received since the last backup may be lost, and of RTO, as large backups take long to restore.\n", 29 | "\n",
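Snapshot schedules themselves can be automated. A hedged sketch using the Data Lifecycle Manager resource type; the execution role and the tag values below are hypothetical placeholders:

```yaml
# Sketch: take a daily snapshot of backup-tagged EBS volumes, keep the last 7.
SnapshotPolicy:
  Type: AWS::DLM::LifecyclePolicy
  Properties:
    Description: Daily EBS snapshots for backup-tagged volumes
    State: ENABLED
    ExecutionRoleArn: !GetAtt SnapshotRole.Arn   # hypothetical role
    PolicyDetails:
      ResourceTypes: [VOLUME]
      TargetTags:
        - Key: Backup
          Value: 'true'
      Schedules:
        - Name: daily
          CreateRule:
            Interval: 24
            IntervalUnit: HOURS
            Times: ['03:00']
          RetainRule:
            Count: 7
          CopyTags: true
```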
30 | "## Pilot Light\n", 31 | "\n", 32 | "Like the small everlasting flame in gas heaters, a pilot light strategy keeps only the core of the system running redundantly and brings up the rest when needed. Usually, this means having the data continuously replicated, but no application servers or upper tiers. It still may take some time to start application servers and get back to users, but data loss will be restricted to the time since the last successful replication.\n", 33 | "\n", 34 | "This approach is more expensive than backup and restore, as replication incurs costs, but can reduce RPO down to the replication frequency. However, complex systems may take a long time to bootstrap and stabilize, leading to excessive RTO.\n", 35 | "\n", 36 | "## Warm Standby\n", 37 | "\n", 38 | "Keeping a small clone environment for disaster recovery can make recovery much faster. The on-demand resource provisioning in auto-scaling or serverless architectures makes the cost of running a standby environment negligible. When traffic shifts, the standby environment can scale to accommodate the full load. \n", 39 | "The RPO for warm standby is the same, as that is defined by the replication type and frequency. But RTO can be much smaller, as resources are already up and running. However, some performance degradation and errors may still be perceived while the new environment scales. Particularly for high-traffic systems, throwing a firehose of traffic at a cold environment can trigger yet more failures.\n", 40 | "\n", 41 | "## Multi-Site\n", 42 | "\n", 43 | "“Production” can be composed of many environments at full capacity and load balanced. In this case, the traffic from a failed environment is divided across the remaining copies, and even large-scale events can be managed. The viability of multi-site replication and of running multiple environments is highly dependent on the system architecture, but for systems that do support multi-master high availability, both RTO and RPO can be kept minimal.\n" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [] 52 | } 53 | ], 54 | "metadata": { 55 | "kernelspec": { 56 | "display_name": "Python 3", 57 | "language": "python", 58 | "name": "python3" 59 | }, 60 | "language_info": { 61 | "codemirror_mode": { 62 | "name": "ipython", 63 | "version": 3 64 | }, 65 | "file_extension": ".py", 66 | "mimetype": "text/x-python", 67 | "name": "python", 68 | "nbconvert_exporter": "python", 69 | "pygments_lexer": "ipython3", 70 | "version": "3.6.4" 71 | } 72 | }, 73 | "nbformat": 4, 74 | "nbformat_minor": 2 75 | } 76 | -------------------------------------------------------------------------------- /security/admin-user/admin-user.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 3 | FILE=$(basename "${BASH_SOURCE[0]}") 4 | STACK_NAME="${FILE%.*}" 5 | TEMPLATE_FILE="$DIR/${STACK_NAME}.yaml" 6 | 7 | ADMIN_PASSWORD="changemeplease" 8 | 9 | aws cloudformation deploy \ 10 | --stack-name "$STACK_NAME" \ 11 | --template-file "$TEMPLATE_FILE" \ 12 | --capabilities CAPABILITY_NAMED_IAM \ 13 | --parameter-overrides "AdminPassword=$ADMIN_PASSWORD" 14 | 15 | -------------------------------------------------------------------------------- /security/admin-user/admin-user.yaml: -------------------------------------------------------------------------------- 1 | Description: Admin User 2 | Parameters: 3 | AdminPassword: 4 | Type: String 5 | Resources: 6 | AdminGroup: 7 | Type: AWS::IAM::Group 8 | Properties: 9 | GroupName: Administrators 10 | ManagedPolicyArns: ['arn:aws:iam::aws:policy/PowerUserAccess'] 11 | AdminUser: 12 | Type: AWS::IAM::User 13 | Properties: 14 | Groups: [!Ref AdminGroup] # reference the group so CloudFormation creates it first 15 | UserName: admin 16 | LoginProfile: 17 | Password: !Ref AdminPassword 18 | PasswordResetRequired: false -------------------------------------------------------------------------------- /security/audit-trail/audit-trail.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 3 | FILE=$(basename "${BASH_SOURCE[0]}") 4 | STACK_NAME="${FILE%.*}" 5 | TEMPLATE_FILE="$DIR/${STACK_NAME}.yaml" 6 | 7 | aws cloudformation deploy \ 8 | --stack-name "$STACK_NAME" \ 9 | --template-file "$TEMPLATE_FILE" \ 10 | --capabilities CAPABILITY_IAM 11 | -------------------------------------------------------------------------------- /security/audit-trail/audit-trail.yaml: -------------------------------------------------------------------------------- 1 | Description: Log AWS API calls for auditing 2 | Resources: 3 | AuditGroup: 4 | Type: "AWS::Logs::LogGroup" 5 | AuditRole: 6 | Type: "AWS::IAM::Role" 7 | Properties: 8 | AssumeRolePolicyDocument: 9 | Version: "2012-10-17" 10 | Statement: 11 | - 12 | Effect: "Allow" 13 | Principal: 14 | Service: 15 | - "cloudtrail.amazonaws.com" 16 | Action: 17 | - "sts:AssumeRole" 18 | Policies: 19 | - 20 | PolicyName: "root" 21 | PolicyDocument: 22 | Version: "2012-10-17" 23 | Statement: 24 | - 25 | Effect: "Allow" 26 | Action: "logs:CreateLogStream" 27 | Resource: 28 | - !Sub ${AuditGroup.Arn} 29 | - 30 | Effect: "Allow" 31 |
Action: "logs:PutLogEvents" 32 | Resource: 33 | - !Sub ${AuditGroup.Arn} 34 | AuditTrail: 35 | Type: "AWS::CloudTrail::Trail" 36 | Properties: 37 | CloudWatchLogsLogGroupArn: !GetAtt AuditGroup.Arn 38 | CloudWatchLogsRoleArn: !GetAtt AuditRole.Arn 39 | S3BucketName: !ImportValue logs-bucket 40 | IsLogging: true 41 | Outputs: 42 | AuditGroupOut: 43 | Description: Log group for trail logs 44 | Value: !Ref AuditGroup 45 | Export: 46 | Name: audit-loggroup 47 | -------------------------------------------------------------------------------- /security/no-root/README.md: -------------------------------------------------------------------------------- 1 | Fires an event when root logs in 2 | https://aws.amazon.com/blogs/security/how-to-receive-notifications-when-your-aws-accounts-root-access-keys-are-used/ 3 | 4 | 5 | const aws = require("aws-sdk"); 6 | const sns = new aws.SNS(); 7 | const publish = sns.publish(params).promise(); 8 | 9 | console.log(eventStr); 10 | var params = { 11 | Message: `${eventStr}`, 12 | TopicArn: `${coeTopicArn}` 13 | }; 14 | return parms; -------------------------------------------------------------------------------- /security/no-root/events/CreateFunction.IAMUser.json: -------------------------------------------------------------------------------- 1 | { 2 | "eventVersion": "1.05", 3 | "userIdentity": { 4 | "type": "IAMUser", 5 | "principalId": "AIDAIOTCGLU7ZBPBV2UHY", 6 | "arn": "arn:aws:iam::030555009967:user/dalek", 7 | "accountId": "030555009967", 8 | "accessKeyId": "ASIAJPKOV5B3PGCTGVWQ", 9 | "userName": "dalek", 10 | "sessionContext": { 11 | "attributes": { 12 | "mfaAuthenticated": "false", 13 | "creationDate": "2017-09-01T12:23:16Z" 14 | } 15 | }, 16 | "invokedBy": "cloudformation.amazonaws.com" 17 | }, 18 | "eventTime": "2017-09-01T12:23:37Z", 19 | "eventSource": "lambda.amazonaws.com", 20 | "eventName": "CreateFunction20150331", 21 | "awsRegion": "us-east-1", 22 | "sourceIPAddress": "cloudformation.amazonaws.com", 23 | "userAgent": "cloudformation.amazonaws.com", 24 | "requestParameters": { 25 | "functionName": "no-root-20170901T122307Z-FnNoRoot-WR1BBPFIN3Y4", 26 | "role": "arn:aws:iam::030555009967:role/no-root-20170901T122307Z-FnNoRootRole-NIM84MBRL7OU", 27 | "memorySize": 128, 28 | "resourceLock": false, 29 | "code": { 30 | "s3Key": "bbc79a64c6913addaa60e9f59f22d9c7", 31 | "s3Bucket": "code-bucket-20170827t195324z-codebucket-tp760wrhohvy" 32 | }, 33 | "timeout": 3, 34 | "publish": false, 35 | "description": "", 36 | "handler": "index.handler", 37 | "tags": { "lambda:createdBy": "SAM" }, 38 | "runtime": "nodejs6.10" 39 | }, 40 | "responseElements": { 41 | "role": "arn:aws:iam::030555009967:role/no-root-20170901T122307Z-FnNoRootRole-NIM84MBRL7OU", 42 | "handler": "index.handler", 43 | "memorySize": 128, 44 | "runtime": "nodejs6.10", 45 | "functionArn": "arn:aws:lambda:us-east-1:030555009967:function:no-root-20170901T122307Z-FnNoRoot-WR1BBPFIN3Y4", 46 | "functionName": "no-root-20170901T122307Z-FnNoRoot-WR1BBPFIN3Y4", 47 | "codeSize": 375, 48 | "version": "$LATEST", 49 | "tracingConfig": { "mode": "PassThrough" }, 50 | "description": "", 51 | "lastModified": "2017-09-01T12:23:37.108+0000", 52 | "codeSha256": "MT6N9QJk3jAuKrfzUsqVgs5cVwOfB3J2yBazuA/oiCA=", 53 | "timeout": 3 54 | }, 55 | "requestID": "61151697-8f10-11e7-8d01-07358f3b8c9e", 56 | "eventID": "6bc4a7a8-a996-4fec-8a15-e9b8c6776d9a", 57 | "eventType": "AwsApiCall", 58 | "recipientAccountId": "030555009967" 59 | } 60 | -------------------------------------------------------------------------------- 
/security/no-root/events/GetStackPolicy.Root.json: -------------------------------------------------------------------------------- 1 | { 2 | "eventVersion": "1.05", 3 | "userIdentity": { 4 | "type": "Root", 5 | "principalId": "030555009967", 6 | "arn": "arn:aws:iam::030555009967:root", 7 | "accountId": "030555009967", 8 | "accessKeyId": "ASIAI6IV3J2GG4ARAZPQ", 9 | "userName": "julio", 10 | "sessionContext": { 11 | "attributes": { 12 | "mfaAuthenticated": "false", 13 | "creationDate": "2017-09-01T10:50:42Z" 14 | } 15 | } 16 | }, 17 | "eventTime": "2017-09-01T12:30:51Z", 18 | "eventSource": "cloudformation.amazonaws.com", 19 | "eventName": "GetStackPolicy", 20 | "awsRegion": "us-east-1", 21 | "sourceIPAddress": "72.21.198.68", 22 | "userAgent": "console.amazonaws.com", 23 | "requestParameters": { 24 | "stackName": "arn:aws:cloudformation:us-east-1:030555009967:stack/no-root-20170901T122307Z/519a60d0-8f10-11e7-b952-50d5ca632682" 25 | }, 26 | "responseElements": null, 27 | "requestID": "640d03c9-8f11-11e7-a29a-1b9bd0c778df", 28 | "eventID": "a5fac7bb-9bad-4858-a61b-65358f73d2b7", 29 | "eventType": "AwsApiCall", 30 | "recipientAccountId": "030555009967" 31 | } 32 | -------------------------------------------------------------------------------- /security/no-root/no-root.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 3 | FILE=$(basename "${BASH_SOURCE[0]}") 4 | STACK_NAME="${FILE%.*}" 5 | ZIP="${DIR}/target/no-root.zip" 6 | TEMPLATE_FILE="$DIR/${STACK_NAME}.yaml" 7 | TEMPLATE_FILE_OUT="${DIR}/target/${STACK_NAME}.out.yaml" 8 | CODE_BUCKET=$(aws cloudformation describe-stack-resources \ 9 | --stack-name "code-bucket" \ 10 | --query "StackResources[?LogicalResourceId =='CodeBucket'].PhysicalResourceId" \ 11 | --output text) 12 | 13 | rm -rf ./target 14 | mkdir ./target 15 | 16 | cd ${DIR}/src 17 | zip -r $ZIP ./* 18 | cd ${DIR} 19 | zip -ur $ZIP node_modules 20 | unzip -vl $ZIP 21 | 22 | aws cloudformation package \ 23 | --template-file "$TEMPLATE_FILE" \ 24 | --s3-bucket "$CODE_BUCKET" \ 25 | --output-template-file "$TEMPLATE_FILE_OUT" 26 | 27 | 28 | aws cloudformation deploy \ 29 | --template-file "$TEMPLATE_FILE_OUT" \ 30 | --stack-name "$STACK_NAME" \ 31 | --capabilities CAPABILITY_IAM 32 | -------------------------------------------------------------------------------- /security/no-root/no-root.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion : '2010-09-09' 2 | Description: Notifies root access 3 | Resources: 4 | FnNoRootRole: 5 | Type: "AWS::IAM::Role" 6 | Properties: 7 | AssumeRolePolicyDocument: 8 | Version: "2012-10-17" 9 | Statement: 10 | - Effect: "Allow" 11 | Principal: 12 | Service: 13 | - "lambda.amazonaws.com" 14 | Action: 15 | - "sts:AssumeRole" 16 | ManagedPolicyArns: ["arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"] 17 | Policies: 18 | - 19 | PolicyName: "AllowPublishToCOE" 20 | PolicyDocument: 21 | Version: '2012-10-17' 22 | Statement: 23 | - Effect: Allow 24 | Action: sns:Publish 25 | Resource: '*' #TODO: Restrict to COETopic 26 | FnNoRoot: 27 | Type: AWS::Lambda::Function 28 | Properties: 29 | Handler: index.handler 30 | Runtime: nodejs6.10 31 | Code: ./target/no-root.zip 32 | MemorySize: 128 33 | Timeout: 5 34 | Role: !GetAtt FnNoRootRole.Arn 35 | Environment: 36 | Variables: 37 | COE_TOPIC: !ImportValue coe-topic 38 | FnNoRootPermission: 39 | Type: 
AWS::Lambda::Permission 40 | Properties: 41 | Action: 'lambda:InvokeFunction' 42 | FunctionName: !Ref FnNoRoot 43 | Principal: !Join [ ".", [ "logs", !Ref "AWS::Region", "amazonaws.com" ] ] 44 | NoRootFilter: 45 | Type: AWS::Logs::SubscriptionFilter 46 | DependsOn: FnNoRootPermission 47 | Properties: 48 | DestinationArn: !GetAtt FnNoRoot.Arn 49 | FilterPattern: '{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != "AwsServiceEvent" }' 50 | LogGroupName: !ImportValue audit-loggroup 51 | -------------------------------------------------------------------------------- /security/no-root/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "no-root", 3 | "version": "1.0.0", 4 | "description": "Fires an event when root logs in https://aws.amazon.com/blogs/security/how-to-receive-notifications-when-your-aws-accounts-root-access-keys-are-used/", 5 | "main": "src/index.js", 6 | "directories": { 7 | "test": "test" 8 | }, 9 | "scripts": { 10 | "package": "pack-zip", 11 | "test": "echo \"Error: no test specified\" && exit 1" 12 | }, 13 | "dependencies": { 14 | "bluebird": "^3.5.1" 15 | } 16 | } 17 | -------------------------------------------------------------------------------- /security/no-root/src/index.js: -------------------------------------------------------------------------------- 1 | "use strict"; 2 | 3 | const Promise = require("bluebird"); 4 | const zlib = Promise.promisifyAll(require('zlib')); 5 | const aws = require("aws-sdk"); 6 | const sns = new aws.SNS(); 7 | const coeTopicArn = process.env.COE_TOPIC; 8 | 9 | // Build an SNS message from the unzipped CloudWatch Logs payload. 10 | const toMessage = (unzipped) => { 11 | const decoded = unzipped.toString('utf8'); 12 | const logData = JSON.parse(decoded); 13 | const logEvents = logData.logEvents; 14 | console.log(`Publishing ${logEvents.length} events`); 15 | const logEventsStr = JSON.stringify(logEvents); 16 | return { 17 | Message: `${logEventsStr}`, 18 | TopicArn: `${coeTopicArn}` 19 | }; 20 | }; 21 | 22 | exports.handler = (event, context, callback) => { 23 | // Subscription filter events arrive base64-encoded and gzipped. 24 | const eventData = event.awslogs.data; 25 | const eventBuf = Buffer.from(eventData, 'base64'); 26 | 27 | zlib.gunzipAsync(eventBuf) 28 | .then((unzipped) => toMessage(unzipped)) 29 | .then((msg) => sns.publish(msg).promise()) 30 | .then(data => callback(null, data)) 31 | .catch(e => callback(e)); 32 | }; -------------------------------------------------------------------------------- /security/no-root/test/event1.json: -------------------------------------------------------------------------------- 1 | { "awslogs": { "data":
"H4sIAAAAAAAAAO1aW1MbxxL+Kyo9M2TuF73tEeBSbDBGgBMHFzVXvCeSluyu4jgu/vvp3eUiB/sYHLuyKe8LQts9PT2t/r7tmZ7342WsKnsRj99dxvFkvJMdZ+f7u/N59mR3vDUu3q5iCY8xw0IIjI2RCh4viosnZbG+BIldh7xGdWnzBcqa/1sBIk+MPt45PHwpT/fm3Yh5XUa7/Iux8+miWIfjZvj5ukLRVjUioF+tXeXL/LLOi9VevqhjWY0nv4xXBSqLokYHxRF8dAJEfppTcTLXT39mhzvj1+1ku7/HVd0MeT/OA8zJmNSCUUWVMURorQjhlAsmBOVKGqEk51gQRYSQVFIjhMGGUAyu1DmEqLZLWC0RBAtMGZEc862b0IH592fj2Mx4Cn6Cx2fjydmYbGNxNt46G6+rWM4CSPP6HUhAt4ZgtzrNKlqdyzJf+fzSLmahFWzGqFWwZWcVPif2bTXJ7XIy2dSalDe2rPfFelV/0pL34PfT+O5aIZvPstnuqzn7z+6PT/ZOX04FzW79PrDLztP/rhd50T6uYDSscVqs6vhH3S3I1nWZu3Udq+77MtlsXb9p1uxtHbuJkl1UsTXhIROaX3YHZK2IYqIQIYjQYyImjE+oeXU2vrq62roO7HG+/JgmNxOqX7U2W7V5sS59p+ibxEpFuWxn2rZL+2exgsht+2J5N+B2fTuxyTcX57X1v3bZ00XrbXUUL25+09sU7SLRzjY7zEIoISjdr67UNhF0m1CzTRi7DWR2ASY7x4pVVSziRzwq429rSLVDW4JXTcp3wawal24dvUmAD9c3ufXsw6Rox/5wDZsfmFAksISRVx5DFKNCxtiEQNtTLZniDFL2qnWlugQ/4+4iLrtYTFbrxeLOydlOt1zsWbQuNRbJtUWrAnzFyggJQOHxLtrXg6QUJFGBEfWYIc6tQC7pgByT3LPANQn6btDxDViyt1V2mU/tYnEdLQBMDgrZ/8n3K1jN36MAMlBAvylAT6j6KhTwLK/qFv5HsUP2wAAPYIDovfYeb1q0LAkUkqAAFZEA0fcYQBnCwQsO81KOOKEMWQF/XBDYY6mDsrI3DEAHBvhOGGAoAr6UApRhgm1YdNFxpBRTPkXlHHX3KABg5yUACTx2BHEnJdLEgY0gglc2YsNwbyiADRTwPVLAgP4Hoj8SQzcsaisU4sZT6ZkUQbD76Fc0UgbAV15qxGn0yGhjwA1iDJcyBNwf9PMB/b1Gv8Df5BRgQP8D0M+4N8SpzQMArSVBOIQoI8ANM3sP/dwJzg0mSERnEU82IucUQ8lw5r2TSjLTG/SLAf3fCfqbA4DpG7u6iPM4VP4PQ38gKiX2KPRHFqQlwiDOofznRnLksAooeSuscEGp6HuDfjmgv9fo/0YdgAH9Dzz8N97ERx3+W2+Z5827X1lAP5UOWREFIklGzJT00cbeoF/9PfTzfw36T/hTeXp0cqqmP77Yef7iX4R+9gn0L4tVXhcQ1osHIz9b2HL5rZCfvZyP2mb3S1v7N6NpRwSdN2VZlNMidN6c2kUe2vDs/uFj2wG/09rvUq6bc/T7reqolY4CMIeHeE9GYGUdR2dna4ypmh2cZs9mO+d7z4/O5yf7+9nRz9eSka1vlCDH69gOu5Elmy9iGNXFqIJJqvRu1NBX0+tf1ZPRfly6WI6W66q+lcfVetl4BVNXEXR+mR3MT/b2ZtPZ7sHxeXO5YGuUPcuO9rdGz5++/gzv3XjTrPVjK3gcV0nHefigUemC1yg6r5UE8tFM3eOqQL0ONkaEpVSIC8OBMj1FWFHLCWEk+v60KfRQqfSdq77iPmVoVH5BtcKZjvhR1QrWVDkSKdJBNRsWz5BJEiOWONWBOgZQ7A0DmKFa6TcD6An7RJdiqFaGamWDqxK1jnvhN9nPRoaYi1IYJrmN/D5X8eB50AaJ6ANqz1edoxQpAtsq46QF+uoLV7HhZmW/uerrnqoO1coX9FW8iZ5unqx+tqsKhYr2zgBtsORRAzvY5JiEIosxceyI0643DDBcrOw3Awwnq//oXoVKTs2j+ipCOhw1vPC1ShJxKQXsVWA45ywwmkiyifYG/cOlyu8F/UNX9QvQb2MymycVn79SrblzJCWoELiiiDtrkQbEo+RTcIk4RjTvDfqH+5T9Rv9wo+ofrfyd4/Cyf0zlbz1JKhCKkpEYKAA7ZJgPyELJT2MSUWj79dD/+up/664blDg6AAA=" } } -------------------------------------------------------------------------------- /security/sec01.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "SEC1" 8 | ] 9 | } 10 | ], 11 | "metadata": { 12 | "kernelspec": { 13 | "display_name": "Python 3", 14 | "language": "python", 15 | "name": "python3" 16 | }, 17 | "language_info": { 18 | "codemirror_mode": { 19 | "name": "ipython", 20 | "version": 3 21 | }, 22 | "file_extension": ".py", 23 | "mimetype": "text/x-python", 24 | "name": "python", 25 | "nbconvert_exporter": "python", 26 | "pygments_lexer": "ipython3", 27 | "version": "3.6.4" 28 | } 29 | }, 30 | "nbformat": 4, 31 | "nbformat_minor": 2 32 | } 33 | -------------------------------------------------------------------------------- /security/security.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 3 | FILE=$(basename "${BASH_SOURCE[0]}") 4 | STACK_NAME="${FILE%.*}" 5 | TEMPLATE_FILE="$DIR/${STACK_NAME}.yaml" 6 | TEMPLATE_FILE_OUT="${DIR}/target/${STACK_NAME}.out.yaml" 7 | CODE_BUCKET=$(aws cloudformation describe-stack-resources \ 8 | --stack-name 
"code-bucket" \ 9 | --query "StackResources[?LogicalResourceId =='CodeBucket'].PhysicalResourceId" \ 10 | --output text) 11 | ADMIN_PASSWORD="changemeplease" 12 | 13 | rm -rf ./target 14 | mkdir ./target 15 | 16 | aws cloudformation package \ 17 | --template-file "${DIR}/no-root/no-root.yaml" \ 18 | --s3-bucket "$CODE_BUCKET" \ 19 | --output-template-file "${DIR}/no-root/target/no-root.yaml" 20 | 21 | aws cloudformation package \ 22 | --template-file "$TEMPLATE_FILE" \ 23 | --s3-bucket "$CODE_BUCKET" \ 24 | --output-template-file "$TEMPLATE_FILE_OUT" 25 | 26 | aws cloudformation create-stack \ 27 | --template-body "file://$TEMPLATE_FILE_OUT" \ 28 | --stack-name "$STACK_NAME" \ 29 | --capabilities CAPABILITY_NAMED_IAM \ 30 | --parameters "ParameterKey=AdminPassword,ParameterValue=$ADMIN_PASSWORD" 31 | -------------------------------------------------------------------------------- /security/security.yaml: -------------------------------------------------------------------------------- 1 | Description: Security Resources 2 | Parameters: 3 | AdminPassword: 4 | Type: String 5 | Resources: 6 | AdminUser: 7 | Type: AWS::CloudFormation::Stack 8 | Properties: 9 | TemplateURL: ./admin-user/admin-user.yaml 10 | Parameters: 11 | AdminPassword: !Ref AdminPassword 12 | AuditTrail: 13 | Type: AWS::CloudFormation::Stack 14 | Properties: 15 | TemplateURL: ./audit-trail/audit-trail.yaml 16 | NoRoot: 17 | Type: AWS::CloudFormation::Stack 18 | DependsOn: AuditTrail 19 | Properties: 20 | TemplateURL: ./no-root/target/no-root.out.yaml -------------------------------------------------------------------------------- /shared-resources/code-bucket.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash -e 2 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 3 | FILE=$(basename "${BASH_SOURCE[0]}") 4 | STACK_NAME="${FILE%.*}" 5 | TEMPLATE_FILE="$DIR/${STACK_NAME}.yaml" 6 | 7 | aws cloudformation deploy \ 8 | --stack-name "$STACK_NAME" \ 9 | --template-file "$TEMPLATE_FILE" 10 | -------------------------------------------------------------------------------- /shared-resources/code-bucket.yaml: -------------------------------------------------------------------------------- 1 | Description: Store code for AWS Lambda functions 2 | Resources: 3 | CodeBucket: 4 | Type: "AWS::S3::Bucket" 5 | Outputs: 6 | CodeBucketOut: 7 | Description: Code Bucket Name 8 | Value: !Ref CodeBucket 9 | Export: 10 | Name: code-bucket 11 | -------------------------------------------------------------------------------- /shared-resources/coe-topic.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 3 | FILE=$(basename "${BASH_SOURCE[0]}") 4 | STACK_NAME="${FILE%.*}" 5 | TEMPLATE_FILE="$DIR/${STACK_NAME}.yaml" 6 | 7 | aws cloudformation deploy \ 8 | --stack-name "$STACK_NAME" \ 9 | --template-file "$TEMPLATE_FILE" 10 | -------------------------------------------------------------------------------- /shared-resources/coe-topic.yaml: -------------------------------------------------------------------------------- 1 | Description: Notify Cause of Error 2 | Resources: 3 | COEQueue: 4 | Type: "AWS::SQS::Queue" 5 | COETopic: 6 | Type: AWS::SNS::Topic 7 | Properties: 8 | Subscription: 9 | - 10 | Endpoint: !GetAtt COEQueue.Arn 11 | Protocol: "sqs" 12 | COEQueuePolicy: 13 | Type: AWS::SQS::QueuePolicy 14 | Properties: 15 | PolicyDocument: 16 | Statement: 17 | - Sid: 
AllowFromCOETopic 18 | Effect: Allow 19 | Principal: "*" 20 | Action: 21 | - sqs:SendMessage 22 | Resource: "*" 23 | Condition: 24 | ArnEquals: 25 | aws:SourceArn: !Ref COETopic 26 | Queues: 27 | - !Ref COEQueue 28 | Outputs: 29 | COETopicOut: 30 | Description: Cause Of Error topic name 31 | Value: !Ref COETopic 32 | Export: 33 | Name: coe-topic 34 | -------------------------------------------------------------------------------- /shared-resources/hosted-zone.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 3 | FILE=$(basename "${BASH_SOURCE[0]}") 4 | STACK_NAME="${FILE%.*}" 5 | TEMPLATE_FILE="$DIR/${STACK_NAME}.yaml" 6 | 7 | DOMAIN_NAME="wa.julio.cloud" 8 | 9 | aws cloudformation deploy \ 10 | --stack-name "$STACK_NAME" \ 11 | --template-file "$TEMPLATE_FILE" \ 12 | --parameter-overrides "DomainName=$DOMAIN_NAME" 13 | -------------------------------------------------------------------------------- /shared-resources/hosted-zone.yaml: -------------------------------------------------------------------------------- 1 | Description: DNS Hosted Zone 2 | Parameters: 3 | DomainName: 4 | Type: String 5 | Resources: 6 | HostedZone: 7 | Type: AWS::Route53::HostedZone 8 | Properties: 9 | Name: !Ref DomainName 10 | Outputs: 11 | ZoneID: 12 | Value: !Ref HostedZone 13 | Export: 14 | Name: zone-id -------------------------------------------------------------------------------- /shared-resources/logs-bucket.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash -e 2 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 3 | FILE=$(basename "${BASH_SOURCE[0]}") 4 | STACK_NAME="${FILE%.*}" 5 | TEMPLATE_FILE="$DIR/${STACK_NAME}.yaml" 6 | 7 | aws cloudformation deploy \ 8 | --stack-name "$STACK_NAME" \ 9 | --template-file "$TEMPLATE_FILE" 10 | -------------------------------------------------------------------------------- /shared-resources/logs-bucket.yaml: -------------------------------------------------------------------------------- 1 | Description: Store logs 2 | Resources: 3 | AuditBucket: 4 | Type: "AWS::S3::Bucket" 5 | AuditBucketPolicy: 6 | Type: "AWS::S3::BucketPolicy" 7 | Properties: 8 | Bucket: !Ref AuditBucket 9 | PolicyDocument: 10 | Statement: 11 | - 12 | Effect: "Allow" 13 | Principal: 14 | Service: 15 | - "cloudtrail.amazonaws.com" 16 | Action: 17 | - "s3:GetBucketAcl" 18 | Resource: !GetAtt AuditBucket.Arn 19 | - 20 | Effect: "Allow" 21 | Principal: 22 | Service: 23 | - "cloudtrail.amazonaws.com" 24 | Action: 25 | - "s3:PutObject" 26 | Resource: !Join [ "", [!GetAtt AuditBucket.Arn,"/AWSLogs/", !Ref "AWS::AccountId" ,"/*"]] 27 | Condition: 28 | StringEquals: 29 | s3:x-amz-acl: bucket-owner-full-control 30 | Outputs: 31 | AuditBucketName: 32 | Description: Logs Bucket Name 33 | Value: !Ref AuditBucket 34 | Export: 35 | Name: logs-bucket 36 | -------------------------------------------------------------------------------- /shared-resources/tls-cert.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" 3 | FILE=$(basename "${BASH_SOURCE[0]}") 4 | STACK_NAME="${FILE%.*}" 5 | TEMPLATE_FILE="$DIR/${STACK_NAME}.yaml" 6 | 7 | DOMAIN_NAME="wa.julio.cloud" 8 | 9 | aws cloudformation deploy \ 10 | --stack-name "$STACK_NAME" \ 11 | --template-file "$TEMPLATE_FILE"\ 12 | --parameter-overrides 
"DomainName=$DOMAIN_NAME" 13 | -------------------------------------------------------------------------------- /shared-resources/tls-cert.yaml: -------------------------------------------------------------------------------- 1 | Description: TLS Certificate 2 | Parameters: 3 | DomainName: 4 | Type: String 5 | Resources: 6 | TLSCertificate: 7 | Type: AWS::CertificateManager::Certificate 8 | Properties: 9 | DomainName: !Sub '*.${DomainName}' 10 | SubjectAlternativeNames: [!Ref DomainName] 11 | Outputs: 12 | TLSCertificateId: 13 | Description: Certificate ID 14 | Value: !Ref TLSCertificate 15 | Export: 16 | Name: certificate-arn 17 | --------------------------------------------------------------------------------