├── LICENSE ├── README.md └── starter-project ├── README.md ├── part1.md ├── part2.md ├── part3.md ├── part4.md ├── part5.md ├── part6.md ├── part7.md ├── part8.md └── part9.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Josh Arrington 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # The DevOps Engineer Upgrade 2 | 3 | This repo is a collection of tutorials and advice on how to become a DevOps Engineer. 4 | 5 | Topics: 6 | - [Starter Project](starter-project/) 7 | - [Resume Advice](resume-advice/) 8 | - [FAQ](#FAQ) 9 | 10 | ## FAQ 11 | 12 | ### What is a DevOps engineer? 
13 | The DevOps Engineer role has a wide range of definitions, owing to the fluid meaning of the term and the variety of ways the role is implemented at individual companies. It can mean anything from a Systems Administrator/Ops type person who is expected to use some automation, to a near full-blown Software Engineer who is building automation tools to help the development life cycle. Typically it's something in between those two, with some common tool chains and practices being used by practitioners. 14 | 15 | ### What is Site Reliability Engineering? 16 | Site Reliability Engineering, often referred to as SRE, is a close cousin of the DevOps role, or can be thought of as one framework of execution within the DevOps culture. This term was popularized by Google in the book [Site Reliability Engineering: How Google Runs Production Systems](https://sre.google/sre-book/table-of-contents/), which is a collection of essays on the topic by Google engineers. 17 | 18 | Much like the DevOps Engineer role, the SRE role can mean different things to different people and companies. It is often closely associated with Platform Engineering, but is generally regarded as a more development-heavy role when compared to DevOps, owing to the Google tenet of spending no more than 50% of an engineer's time on operations tasks, while the rest is focused on development. 19 | 20 | ### What skills are required for the role? 21 | DevOps Engineer is typically a more senior role, and therefore requires more experience than your basic Software Engineer or Systems Engineer/Ops type role. Junior DevOps roles often require some prior experience in the IT field. Most people who advance into DevOps Engineer roles have several years of experience on either the development or operations side of the fence, and often a bit of both.
22 | 23 | There are a lot of skills that are useful in the practice of DevOps Engineering, but most of them focus on the deployment and operation of services through the use of software engineering techniques. There is a semi-decent list of skills shared in this [DevOps Roadmap](https://github.com/kamranahmedse/developer-roadmap#devops-roadmap) (2019). No one is likely to know all these skills, so pick one or two from each area and start incorporating them into your projects and work as it makes sense to do so. 24 | 25 | ### I heard DevOps is a culture movement and not a role. What gives? 26 | Originally, DevOps was a portmanteau of the words development and operations, as a way to break down the silos between Development and Operations. Over time it evolved into a way to describe a culture around looking at service delivery as an end-to-end system with different teams (Service Owners, Business Analysts, Developers, QA, Operations, Security, etc) working together closely to rapidly deliver change, safely and securely. At its heart, DevOps is a way of thinking about deployment and operations in a more seamless and toil-free way that enables the business to succeed in its IT goals. 27 | 28 | In addition, the term has evolved into a job role in the industry. It is commonly a way for companies and individuals to identify roles and engineers who require/possess the mindset and set of skills required to operate inside a DevOps culture. 29 | 30 | ### How do I get started in DevOps? 31 | First, realize that if you are new to the IT industry, starting in DevOps is very difficult. Gaining several years of experience in either development or operations first will make this journey much easier. Educate yourself on what DevOps entails. Read books and articles, and participate in online communities to get a feel for what it really means to "do DevOps". 32 | 33 | Secondly, try to implement some of these ideas into your day-to-day job duties where possible, starting small at first.
Some ideas: automate a repetitive task, script a deployment, or use infrastructure as code and check it into version control. 34 | 35 | If learning these skills in your current role isn't feasible, work on personal projects, websites, or home labs. These help build your skill set and a portfolio to show potential hiring managers and recruiters. A collection of ideas on projects is being developed here in this repo under [Starter Project](starter-project/) as a way to get you started. 36 | 37 | Most importantly, never stop learning. There is no point at which you should not be trying to learn something new in this field; it is ever changing and always evolving. -------------------------------------------------------------------------------- /starter-project/README.md: -------------------------------------------------------------------------------- 1 | # Starter Project 2 | 3 | Because a DevOps Engineer is not an entry-level position, you need to be able to demonstrate your skills to employers to make it in this field. You will need to show off your skills in systems management, networking, development, and release management. The projects in this repo will give you some foundations to build on and let you build a portfolio to exhibit your new-found skillset. 4 | 5 | This is *not* a comprehensive guide on everything you need to know; it is only a rough outline to get you familiar with much of the typical work in this field. It is left intentionally vague in some cases to encourage exploration and self-reliance. Please feel free to open an issue if something is too vague or unapproachable. 6 | 7 | A few tips: 8 | 1. Read the section in its entirety before starting on it. 9 | 2. Take your time. 10 | 3. At the bottom of each section will be a list of acceptance criteria. You should do your best to meet every item. 11 | 4.
Feel free to [submit an issue](https://docs.github.com/en/github/managing-your-work-on-github/creating-an-issue) to this repo for assistance. 12 | 13 | At the end of this project, you will have a fully built, cloud native service using modern tools and technologies. You will be able to show it off to employers to give yourself a leg up over other candidates. Let's get started! 14 | 15 | # Part 0 - Requirements 16 | 17 | 1. Get a Github account. You need somewhere to save your code and to show it off! If you've never used git before, [start here](https://product.hubspot.com/blog/git-and-github-tutorial-for-beginners). Feel free to switch out for Gitlab if you prefer, just remember to adjust the instructions below to your needs. 18 | 2. Get a good code editor. If you don't already have a favorite, VS Code is a good choice and works on all platforms. 19 | 3. Set aside some time several times per week to work on this. Don't let it stagnate. 20 | 4. Any computer you have access to will be able to perform all these steps. You do not need special equipment. 21 | 5. A credit card. Costs will be minimal but you will probably need to spend a little money here, <$100 USD. 
22 | 23 | --- 24 | 25 | [Continue to Part 1](part1.md) 26 | 27 | --- 28 | 29 | # Table of Contents 30 | 31 | - [Part 1 - Build Something](part1.md) 32 | - [Part 2 - Deploy it Manually](part2.md) 33 | - [Part 3 - Infrastructure as Code](part3.md) 34 | - [Part 4 - Continuous Integration and Automated Deployment](part4.md) 35 | - [Part 5 - Monitoring and Logging](part5.md) 36 | - [Part 6 - Extending your dashboards with service metrics](part6.md) 37 | - [Part 7 - High Availability](part7.md) 38 | - [Part 8 - Disaster Recovery](part8.md) 39 | - [Part 9 - Document, Document, Document](part9.md) 40 | -------------------------------------------------------------------------------- /starter-project/part1.md: -------------------------------------------------------------------------------- 1 | # Part 1 - Build Something 2 | 3 | If you are going to support a development organization, you need to understand what it takes to build software. This will not get you the full exposure of a software engineering position but will help you get familiar with software development in general. 4 | 5 | You will need to build a web service that performs some task. This can be anything you want! I recommend building something relevant to your interests. Choose any language or framework you feel most comfortable with. 6 | 7 | Go as far and wide as you want for this service. Keep it as just a simple HTTP endpoint or go big and build a nice HTML/CSS/JS UI around it too! Bonus points if you separate the two into separate services. 8 | 9 | Your service should have the following basic parts, however: 10 | 11 | 1. Should talk HTTP. This ensures the service you build will be ready for the next parts. It also lets anybody interact with it! 12 | 2. Should be stored in a Github repo. 13 | 3. Should follow the principles of the [twelve factor app](https://12factor.net/). Importantly, never commit credentials to your repo! 14 | 4. Intentionally leave out one or two features you would like to implement.
We'll follow up on those later. Create issues in Github to track the progress of those missing features. 15 | 16 | If you are stuck on what to build, here are a few project ideas to get you started. Feel free to modify them as you wish. Or build multiple to start getting comfortable! 17 | 18 | 1. Warm up: A basic HTTP service that performs basic arithmetic. The service should accept a POST with two or more numbers and perform an operation against them as specified by the user. 19 | 2. Build a basic blog engine. The website will consist of one page where blog posts are displayed and another where blog posts are submitted. The blog contents will be stored in a database. 20 | 3. Build an image hosting website. The website will allow users to upload an image that will then be hosted by your website, similar to Imgur. You should process the images so that you have a thumbnail, the original image, and an image with a watermark on top of it. All of these should be available to the user after uploading. The user should also have a way to delete the image later. 21 | 22 | Not sure which language or framework to choose? Use Python and Flask. [See here for a tutorial](https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-i-hello-world). 23 | 24 | Ensure the project is well documented and includes a README to introduce readers to it and explains how to run it. 25 | 26 | ### Acceptance criteria 27 | 1. Your service exposes an HTTP endpoint and performs some logic based on user input. 28 | 2. Your application code is saved in a Github repo. 29 | 3. Includes a README.md at the root level with screenshots and instructions on how to build and run it. 30 | 4. Follows the principles in [The Twelve Factor App](https://12factor.net). 
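
The warm-up idea above (an arithmetic service that accepts a POST) could start out something like the sketch below. It is only one possible shape, not a reference solution: it uses Python's standard library rather than Flask so it runs without installing anything, and the JSON field names (`operation`, `numbers`), the operation names, and the `PORT` environment variable are all assumptions made for illustration.

```python
import json
import operator
import os
from functools import reduce
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical operation names; pick whatever vocabulary suits your API.
OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def calculate(operation, numbers):
    """Apply the named operation left to right across two or more numbers."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    if len(numbers) < 2:
        raise ValueError("at least two numbers are required")
    return reduce(OPERATIONS[operation], numbers)


class ArithmeticHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, e.g. {"operation": "add", "numbers": [1, 2, 3]}
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            result = calculate(payload["operation"], payload["numbers"])
            status, body = 200, {"result": result}
        except (ValueError, KeyError, TypeError, ZeroDivisionError) as exc:
            # Bad JSON, missing fields, or invalid input all map to a 400.
            status, body = 400, {"error": str(exc)}
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def run():
    # Twelve-factor style: take the port from the environment, not the code.
    port = int(os.environ.get("PORT", "8000"))
    HTTPServer(("0.0.0.0", port), ArithmeticHandler).serve_forever()
```

Calling `run()` starts the server. Keeping `calculate` separate from the HTTP handler makes it easy to unit test, which pays off once you add CI in Part 4.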
31 | 32 | --- 33 | 34 | [Continue to Part 2](part2.md) -------------------------------------------------------------------------------- /starter-project/part2.md: -------------------------------------------------------------------------------- 1 | # Part 2 - Deploy It Manually 2 | 3 | So you have a web service you've built on your own? Great! Now you need to deploy it somewhere and make it accessible to everyone. 4 | 5 | NOTE: This will cost money. Ensure you understand the limits of the free tier of the provider you choose and turn off or delete any resources you create when you are finished with them. 6 | 7 | - Spin up a cloud hosting account. You can use any cloud provider (GCP, Azure, DigitalOcean, Linode, etc) but AWS is the most popular. You will need to add a credit card. Costs for AWS should be very minimal as we'll try to keep everything in the free tier. 8 | - Create a new server in that environment. Depending on which cloud provider you chose, you may need to also set up networking. 9 | - For AWS, this is a good place to start: https://aws.amazon.com/ec2/getting-started/ 10 | - Install the required software on your server to run the application. This includes the database, the web server, the application server if necessary, and so on. For Python and Flask, see https://flask.palletsprojects.com/en/2.0.x/tutorial/deploy/ 11 | - Register a domain. This can be specific to your service or more broad. I recommend that every engineer own a domain on which to host their projects along with a portfolio or resume. Namecheap is a good choice. 12 | - Get a DNS host. If you used Namecheap, this is included for free. 13 | - Choose how you want your project to be accessed. I recommend keeping it simple and making it a subdomain of your new domain, e.g. `myproject.mydomain.com` 14 | - Ensure you can reach your service from the internet. 15 | - Now we need to add HTTPS! For this, just use [Let's Encrypt](https://letsencrypt.org/).
16 | - Ensure all of the steps you took were documented in your project. 17 | - Show it off to your friends, coworkers, whatever! Celebrate. 18 | 19 | ### Acceptance Criteria 20 | 1. Your service is accessible over the internet. 21 | 2. You can access it via a nice URL (myproject.mydomain.com) 22 | 3. Traffic to and from your service is encrypted. Bonus points for a good score on [SSL Labs Server Test](https://www.ssllabs.com/ssltest/). 23 | 24 | --- 25 | 26 | [Continue to Part 3](part3.md) -------------------------------------------------------------------------------- /starter-project/part3.md: -------------------------------------------------------------------------------- 1 | # Part 3 - Infrastructure as Code 2 | 3 | Delete everything you just deployed manually. All the servers, all the network configuration. Everything. We will rebuild! 4 | 5 | We're going to take all the lessons learned from Part 2 and make them repeatable, auditable, and executable. We're talking about Infrastructure as Code (IaC)! If you're not familiar with IaC, go read [this article](https://stackify.com/what-is-infrastructure-as-code-how-it-works-best-practices-tutorials/). 6 | 7 | Generally speaking, in any modern deployment, you will have two "tiers" of IaC. 8 | - One for configuring the infrastructure layer -- everything below the server OS. This includes networking, the servers themselves, security controls, databases, third party services, and so on. Typically, this will be one or more of the following tools: 9 | - Terraform 10 | - AWS CloudFormation 11 | - GCP Deployment Manager 12 | - Azure Resource Manager 13 | - Another for configuring the deployment targets. This includes, but is not limited to, the server OS and Kubernetes objects. Typical tools include: 14 | - Ansible 15 | - Chef 16 | - Puppet 17 | - Helm 18 | 19 | For your project, you will use [Terraform](https://www.terraform.io/) and [Ansible](https://www.ansible.com/).
These are free tools that have wide community and enterprise support. Whichever cloud provider you chose to work with will likely have support in Terraform. 20 | 21 | 1. Create separate repos for your terraform and ansible projects. 22 | 2. Start working on terraform. Create all the infrastructure-level components from Part 2. Networking, servers, etc. If your project uses a database, use your cloud provider's hosted database offering. 23 | - It's okay to _not_ configure the OS of your server at this stage. Just get an empty server deployed for now. 24 | - Your goal should be to never have to touch the GUI in your cloud provider's console, save setting up some initial credentials to allow access from Terraform. 25 | - Keep your terraform code reasonably simple at this stage. Avoid using third-party modules so you can learn how Terraform and your cloud provider work. 26 | - Do not commit your terraform state to your git repository. 27 | - Why? Terraform state may contain secrets which you wouldn't want to expose. Add it to your `.gitignore` 28 | - Instead, use a remote backend or just keep it locally for now. For example, if you use AWS, you can use the S3 backend. See https://www.terraform.io/docs/language/settings/backends/index.html 29 | 3. Now work on ansible. You will need to configure the entire OS from Ansible to get your software deployed. To keep it simple, ansible should download your project you developed in Part 1 from github. 30 | - Your initial setup can use a static inventory file -- some IP addresses or hostnames you manually enter into a text file. 31 | - Configure the OS with the appropriate software to run your application 32 | - I highly recommend reviewing [this directory layout](https://dev.to/tmidi/ansible-directory-layout-5edj). Ansible is sometimes too flexible in terms of how you organize your file structure. This helped me immensely when trying to wrap my head around it. 
33 | - Consider applying various security settings based on the CIS Benchmark for your OS of choice. 34 | - The software you created in Part 1 should start automatically. You shouldn't need to manually log in to the server (except during troubleshooting). It should also start automatically in the event of a reboot. 35 | - Once you are happy with the configuration, switch to using dynamic inventory. This will be important later as the IP addresses and hostnames of our servers will not be known in advance. This is the case for most modern software deployments. 36 | - For AWS EC2, see https://devopscube.com/setup-ansible-aws-dynamic-inventory/ 37 | - For DigitalOcean, see https://www.digitalocean.com/community/tools/do-ansible-inventory 38 | 39 | When everything is configured and you can successfully access your site, run `terraform destroy`. Take a deep breath. Then run `terraform apply` and apply ansible to your new servers. If you did this correctly, your new site should be running again with just a couple of commands from your local development environment. 40 | 41 | ### Acceptance Criteria 42 | 1. All terraform code is checked in to a Github repository. 43 | 2. Your terraform state should _not_ be in the Github repository. 44 | 3. Your ansible code is in a different repository. 45 | 4. The server can tolerate a reboot and come back online without intervention. 46 | 5. A strong resolve in the wake of smoldering servers. 47 | 48 | --- 49 | 50 | [Continue to Part 4](part4.md) -------------------------------------------------------------------------------- /starter-project/part4.md: -------------------------------------------------------------------------------- 1 | # Part 4 - Continuous Integration and Automated Deployment 2 | 3 | Read about [Continuous Integration](https://www.cloudbees.com/continuous-delivery/continuous-integration).
4 | 5 | During development of modern software, it is common to use [git flow](https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow) or some variant of it for teams to collaborate on changes to their projects. 6 | 7 | When changes are made to a git repo, we can automatically perform some actions based on what is being done. Every major git repo host will support either their own Continuous Integration tooling or allow you to hook in your own (such as Jenkins). Your goal should be to never log in to a server yourself to deploy new code. 8 | 9 | Using [Github Actions](https://github.com/features/actions), perform the following: 10 | 11 | 1. When a pull request is submitted against your project's repository, run various tests against your code to verify it is not broken. For example: 12 | - Unit tests 13 | - Code linting tests (black for Python formatting, mypy for python type checking) 14 | - Security auditing tests (bandit for python) 15 | 2. When a pull request is approved and merged to the main (or master) branch, perform those tests again. 16 | 3. When a tag is pushed to your repo, deploy the software to your hosting environment. 17 | - There are a few ways to go about this. For now, just use a basic semantic versioning schema. For example: pushing tag `v0.1.0` would deploy to your hosting environment. 18 | - Ensure tags that do not fit the schema you set (in the above example, tags starting with `v`) do not deploy. 19 | - The deployment methodology between each company and team may differ. For example, a merge to master may deploy to a development environment. Or a pull request may deploy that branch to an entirely new environment created just for that PR. For now, keep it simple. 20 | 4. For this project, simply run ansible as your deployment step. You should be able to define in your ansible vars which tag from your git repo to deploy. 21 | - Ensure any secrets (ssh keys, api keys) are not available to the public. 
Ensure these keys are not available to any CI tools (including actions) in pull requests. This prevents credential leakage. 22 | 5. Develop a new feature and deploy it! 23 | 24 | ### Acceptance Criteria 25 | 1. Pull requests against your repo get scanned or tested automatically. 26 | 2. Merges to your main/master branch are scanned or tested in the same way. 27 | 3. Tags pushed to your repo are deployed automatically. 28 | 29 | --- 30 | 31 | [Continue to Part 5](part5.md) -------------------------------------------------------------------------------- /starter-project/part5.md: -------------------------------------------------------------------------------- 1 | # Part 5 - Monitoring and Logging 2 | 3 | Imagine your service is long lived and you have many happy, paying customers. How do you ensure they _remain_ happy, paying customers? It is a foregone conclusion that the most popular services on the web are up 24/7 but it takes a lot of engineering work to keep it that way. You need to ensure that your services remain available and performant and that issues are resolved quickly. 4 | 5 | In Part 5, you will be deploying monitoring and logging services. For the purposes of this project, they can exist on a single server. However, they should exist on a separate one from the one hosting your service. 6 | 7 | There are many commercial products that can help you here. Popular examples are DataDog, New Relic, AWS CloudWatch, and Dynatrace. These are worth exploring and some have a free tier. Open source examples include the Elastic stack, Zabbix, Graphite, and Influxdb. However, for the purposes of this project, we will be deploying [Prometheus](https://prometheus.io/) for metrics collection, [Loki](https://grafana.com/oss/loki/) for logs, and [Grafana](https://grafana.com/) for visualizations. 8 | 9 | I highly recommend reading up on the [official Prometheus documentation](https://prometheus.io/docs/prometheus/latest/getting_started/). 
[This overview] is also a very good high-level overview of the Prometheus ecosystem. 10 | 11 | Perform the following: 12 | 1. Install [node exporter](https://github.com/prometheus/node_exporter) on your application server. Expect that this will be installed on every server you deploy in the future. 13 | 2. Add a new server deployment to your Terraform repository. This will be the server we use for our monitoring and logging services. 14 | 3. Using Ansible, install Prometheus on it. 15 | 4. Configure Prometheus to scrape the node exporter endpoint on your application server and monitoring server. 16 | - Ensure you have configured the prometheus server to have network access to the application server's node exporter endpoint. 17 | - Depending on the cloud provider you chose, this may change how you approach it. If the networking layer for your cloud provider offers a firewall attached to each server instance (AWS Security Group, GCP Firewall Rules), use terraform to configure the firewall on the instance. You will need to configure the OS-level firewall if your cloud provider does not have firewall rules managed outside of the server instance. 18 | - Node exporter should _only_ be accessible to your monitoring server and, perhaps, your public IP. The wide internet should not have access. 19 | 5. Using Ansible, install Alertmanager. 20 | 6. Configure Prometheus to send alerts to Alertmanager. 21 | 7. Configure some basic alerts in Prometheus for your application server. For example, if disk usage is greater than 90% or if any server is unreachable. 22 | 8. Configure Alertmanager to forward alerts to you 23 | - TODO: Make some recommendations on easy ways to receive alerts. Idea: Create a free personal slack workspace. 24 | 9. Using Ansible, install Grafana. 25 | 10. Install the [community node exporter dashboard](https://grafana.com/grafana/dashboards/1860). This will get you some visualizations quickly. 26 | - You don't need to do this part with Ansible. 
But if you did, it would be copying the Dashboard json to the correct location and restarting Grafana. 27 | 11. Using Ansible, Install Loki. 28 | 12. Using Ansible, Install Promtail on both servers. Configure it to send the following to Loki: 29 | - Application server's access and error logs 30 | - Application server's audit logs (SSH logins, sudo access denied, etc) 31 | - Grafana logs 32 | - Prometheus logs 33 | - Alertmanager logs 34 | 35 | I recommend configuring `file_sd_configs` in Promtail and placing a service-specific config file in the specified directory from each of the above's ansible playbook. This lets the individual services in ansible configure their logging themselves. 36 | 13. Configure Grafana so that you can view logs from Loki. 37 | 38 | Loki reference 39 | - https://grafana.com/go/webinar/intro-to-loki-like-prometheus-but-for-logs/?pg=oss-loki&plcmt=hero-txt 40 | - https://grafana.com/docs/loki/latest/clients/promtail/ 41 | 42 | 43 | ### Acceptance Criteria 44 | 1. You should be able to deploy each component (Prometheus, Alertmanager, Grafana) to a separate server in Ansible with minimal additional effort (hint: keep them in separate playbooks/roles) 45 | 2. Every server has basic stats available in Grafana (CPU/Memory/Disk Usage) 46 | 3. Application logs (access logs, error logs) should be accessible from Grafana. 47 | 4. Only the monitoring server and a technician (you) should have access to any exporters. 48 | 5. Only Grafana and a technician (you) should have access to the Prometheus and Alertmanager endpoints. 49 | 6. Of course, this should all be configured with code. You should be able to destroy these servers and have a full environment back up in under 10 minutes. 
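
After a destroy-and-rebuild, one quick way to confirm that Prometheus is scraping every server again is to ask it for the built-in `up` metric through its HTTP query API. The sketch below is a minimal illustration, not part of the original project: the base URL `http://localhost:9090` is an assumption (substitute wherever your Prometheus listens), and the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: adjust to the address of your own Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"


def down_targets(response):
    """Return the instance labels whose `up` sample is not 1."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    return sorted(
        sample["metric"].get("instance", "<unknown>")
        for sample in response["data"]["result"]
        # Each sample value is a [timestamp, "string value"] pair.
        if float(sample["value"][1]) != 1.0
    )


def query_up(base_url=PROMETHEUS_URL):
    """Run the instant query `up` against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Running `down_targets(query_up())` from a machine that is allowed to reach Prometheus returns an empty list when every scrape target is healthy. The same check also doubles as a test of your firewall rules: from anywhere else, the request should fail.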
50 | 51 | --- 52 | 53 | [Continue to Part 6](part6.md) -------------------------------------------------------------------------------- /starter-project/part6.md: -------------------------------------------------------------------------------- 1 | # Part 6 - Extending your dashboards with service metrics 2 | 3 | Part 5 had a lot of interesting goodies about monitoring your _servers_ but nothing about the _services_ that run on top of them. In this section, you'll be expanding your service by adding metrics for Prometheus to scrape. Prometheus provides libraries for almost any popular language and framework. They let you define your own custom metrics and also expose some metrics about your application without any additional work required by you. 4 | 5 | If you are using Python+Flask, you could use https://pypi.org/project/prometheus-flask-exporter/. 6 | 7 | Additionally, we'll want to have a dashboard full of database metrics if you chose to use a database in your project. Use a Prometheus exporter appropriate for your database vendor. For example, you can use [Postgres Exporter](https://github.com/prometheus-community/postgres_exporter) for PostgreSQL databases. 8 | 9 | ### Acceptance Criteria 10 | 1. An interesting dashboard with your application's metrics. Should include at a minimum: 11 | - Request latency for each route 12 | - Number of times each route is called 13 | - Total requests per minute 14 | - Memory usage of the application server process 15 | 16 | --- 17 | 18 | [Continue to Part 7](part7.md) -------------------------------------------------------------------------------- /starter-project/part7.md: -------------------------------------------------------------------------------- 1 | # Part 7 - High Availability 2 | 3 | In production, a service must be able to withstand an outage of a single component of the stack. This is particularly true in many cloud scenarios where a single server can be destroyed at a moment's notice.
4 | 5 | Perform the following: 6 | 7 | 1. Deploy two or more identical servers that host your service. If you're on a major cloud provider such as AWS, ensure a server can be recovered automatically in the event of a server failure. For AWS, this means using an autoscaling group. 8 | 2. Set up a load balancer to check the health of your service on each server and split the traffic between the servers. 9 | 3. Configure your HTTPS certificate to terminate at the load balancer instead of at the individual server. If you're using AWS, use ACM. 10 | 4. If your service requires a login or has any session, ensure that any request can be serviced by _any_ server behind your load balancer. (Investigate where the best place to store session information is for your particular framework and language. Redis or your DB are good options) 11 | 5. Ensure no state on any particular server lives outside of a single request. This means any file uploads are moved off the server after processing, for example. 12 | 13 | You should now have a robust service that can withstand a single application server outage. Destroy to your heart's content! 14 | 15 | Setting this up isn't just useful for server outages, however. You can take advantage of the high availability to deploy new versions of your application in a rolling deployment. If you're using an autoscaling group in AWS, this would mean either creating a new autoscaling group or modifying the parameters of the autoscaling group and issuing an instance refresh. If your particular cloud provider does not have something akin to an autoscaling group, this can be an ansible playbook or script that fetches the existing hosts dynamically and installs the new software onto each server. Make sure the servers are not receiving traffic when this occurs: take them out of the load balancer first! 16 | 17 | ### Acceptance Criteria 18 | 1. Your service can withstand a single server outage and not be impacted. 19 | 2.
Your application logs should contain the _real_ IP of the client and not of the load balancer. (See X-Forwarded-For headers) 20 | 3. You can deploy a new version of your service without any downtime. 21 | 22 | --- 23 | 24 | [Continue to Part 8](part8.md) -------------------------------------------------------------------------------- /starter-project/part8.md: -------------------------------------------------------------------------------- 1 | # Part 8 - Disaster Recovery 2 | 3 | Now that we have a resilient service, we have to talk about the part that was blatantly left out: the database. What happens if _this_ server fails irrecoverably? You lose all your customers! 4 | 5 | 6 | You need to have a backup plan in place for your database (if your service requires one) and be able to recover from a backup. 7 | 8 | Perform the following: 9 | 10 | 1. Using your cloud vendor's native tools, craft a plan to back up your database server on a nightly, weekly, and monthly basis. 11 | 2. Manually trigger a backup or wait for an automatic one to occur. 12 | 3. Delete your database. 13 | 4. Recover your database. 14 | 5. Document the backup and recovery procedure. Ensure you can use your infrastructure as code tooling to perform it. 15 | 6. Consider how you would send those backups to somewhere else in case you ever lost access to your cloud account. 16 | 17 | Disaster can strike your service at many different levels and you should be aware of the ways to recover from them. Not only could you lose your database but you could lose access to your cloud accounts, have your cloud accounts deleted, forget to renew your credit card information, the whole gamut. 18 | 19 | Your infrastructure as code is a critical part of disaster recovery. It is effectively executable documentation on how to deploy your services and infrastructure. In the event you lost all of your infrastructure, how long would it take you to recover today? 20 | 21 | ### Acceptance Criteria 22 | 1.
You have backup plans configured for your database. 23 | 2. Your backup and recovery procedures are documented. 24 | 25 | --- 26 | 27 | [Continue to Part 9](part9.md) -------------------------------------------------------------------------------- /starter-project/part9.md: -------------------------------------------------------------------------------- 1 | # Part 9 - Document, Document, Document 2 | 3 | At the end of your project, anyone should be able to view your repositories and their documentation and spin up their own copy of your service in its entirety without any additional assistance. If you intend for potential employers to understand what you did, you need to make it easy and obvious for them. 4 | 5 | --------------------------------------------------------------------------------