# SRE Cheat Sheet

## Reliability
Reliability is a measure of how well a service lives up to its users' expectations.
- What to promise, and to whom?
- What metrics to measure?
- How much reliability is good enough?

### Principles
1. Reliability is the most important feature
2. Users, not monitoring, decide reliability
3. 100% is the wrong target in almost all cases
   - To reach 99.9%, you need a seasoned software engineering team
   - To reach 99.99%, you need a well-trained operations team with a focus on automation
   - To reach 99.999%, you need to sacrifice the speed at which features are released
```
NOTE:
1. 100% reliability is the wrong target. If you are running your service more reliably than you need to, you may be slowing down development
2. It's more expensive to make already reliable services more reliable. At some point the incremental cost of added reliability grows exponentially
3. Set ambitious but achievable targets, based on how the service actually performs and agreed by all stakeholders
4. Taking reliability to extremes is unproductive and costly. The service should be "reliable enough"
```

### How to make services reliable?
#### Rolling out changes gradually
- Deploy in small, incremental changes
- Feature toggles
- Canary deployments, with easy rollback, that initially affect only a small percentage of users
#### Remove single points of failure
- Multi-AZ deployments
- Set up DR in a geographically isolated region
#### Reduce TTD (Time-To-Detect)
- Catch issues faster with automated alerting and monitoring
- Monitor SLO compliance and error budget burn
#### Reduce TTR (Time-To-Resolution)
- Fix outages quicker
- Knowledge sharing via playbooks
- Automate outage mitigation steps, such as draining traffic from one region to another
#### Increase TTF / TBF (Time-To-Failure / Time-Between-Failures)
- Make services fault tolerant by running them in multiple AZs
- Automate manual mitigation steps
#### Improve operational efficiency
- Post-mortems of outages
- Standardized infrastructure
- Collect data on regions with poor reliability and make an extra effort to improve them
```
|------------|---------------|
Issue        TTD             TTR
             Time-To-Detect  Time-To-Resolution
```

## Measuring Reliability
How is reliability measured?
### SLI
An SLI is a service level indicator — a carefully defined quantitative measure of the user experience / reliability of a service.

Simply,
```
SLI = good events / valid events
```
#### Characteristics of a good SLI
- Has a predictable, linear relationship with user happiness (low variance)
![alt text](https://github.com/anshudutta/sre-cheat-sheet/blob/master/SLI-Metric.png)
- Shows the service is working as users expect it to
- Is aggregated over a long time horizon

#### Ways of measuring SLI
- Request logs
- Exported metrics
- Front-end load balancer metrics
- Synthetic clients
- Client-side instrumentation

#### Types of SLI
- Request latency
- Error rate = (500 responses / total requests) per second
- Time between failures - frequency of errors occurring over a period of time
- Availability = uptime / (uptime + downtime)
- Durability (data will be retained over a period of time; a measure of data loss)

##### Request-Response systems
- Availability - proportion of valid requests served successfully
- Latency - proportion of valid requests served faster than a threshold
- Quality - proportion of valid requests served without degraded quality
##### Data-Processing systems
- Freshness - proportion of valid data updated more recently than a threshold,
i.e. is the system serving stale data?
e.g. for a batch processing system, it is the time since the last successful run
- Correctness - proportion of valid data producing correct output
- Coverage - proportion of valid data processed successfully
- Throughput - proportion of time where the data processing rate is faster than a threshold

#### Example
In a fictional gaming application, users buy in-game currency via in-app purchases. Requests to the Play Store are only visible from the client.
We see between 0.1 and 1 completed purchases every second; this spikes to 10 purchases per second after the release of a new area, as players try to meet its requirements.

![alt text](https://github.com/anshudutta/sre-cheat-sheet/blob/master/SLI_Example.png)

Valid events - HTTPS requests from a user agent (browser or mobile client) for path `/api/getSKUs` or `/api/completePurchase`

- Availability SLI - proportion of requests for path `/api/getSKUs` or `/api/completePurchase` that do not return status code 500, measured at the load balancer
- Latency SLI - proportion of requests for paths `/api/getSKUs` or `/api/completePurchase` served within 3 seconds (let's say this threshold is based on historical data), measured at the load balancer
- Quality SLI - proportion of requests for path `/api/getSKUs` or `/api/completePurchase` served without degraded quality, measured client-side using a synthetic client or client-side instrumentation

#### Managing complex systems
- Start by thinking about user journeys
- SLIs and metrics are different: an SLI tells you *something* is broken, a metric tells you *what* is broken
- Keep the number of SLIs down to 1-3 per user journey
- Not all metrics make good SLIs
- A higher number of SLIs increases the load on operations
- A higher number of SLIs also lowers the signal-to-noise ratio, as they tend to give conflicting signals
- Aggregate similar user journeys to keep the number of SLIs down

### SLO
An SLO is a service level objective: a target value or range of values for a service level, as measured by an SLI.
SLOs are a fundamental tool for helping your organization strike a good balance between releasing new features and staying reliable for your users.
They also help your teams communicate the expectations of a service through objective data.

#### Questions that SLOs help answer:
- If reliability is a feature, when do you prioritize it versus other features?
- How fast is too fast for rolling out features?
- What is the right level of reliability for your system?
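The SLI definitions earlier (good events / valid events, measured at the load balancer) can be sketched as a small calculation over request logs. This is an illustrative sketch, not a real monitoring API: the `Request` record, its field names, and the sample data are all assumptions.

```python
# Illustrative sketch: computing availability and latency SLIs from
# load-balancer request logs. The record shape and sample data are
# invented for this example.
from dataclasses import dataclass

@dataclass
class Request:
    path: str
    status: int
    latency_ms: float

# Valid events: requests for the two purchase-flow paths
VALID_PATHS = {"/api/getSKUs", "/api/completePurchase"}

def availability_sli(logs):
    """Proportion of valid requests not answered with a 500."""
    valid = [r for r in logs if r.path in VALID_PATHS]
    good = [r for r in valid if r.status != 500]
    return len(good) / len(valid) if valid else 1.0

def latency_sli(logs, threshold_ms=3000):
    """Proportion of valid requests served within the threshold."""
    valid = [r for r in logs if r.path in VALID_PATHS]
    good = [r for r in valid if r.latency_ms <= threshold_ms]
    return len(good) / len(valid) if valid else 1.0

logs = [
    Request("/api/getSKUs", 200, 120.0),
    Request("/api/completePurchase", 500, 90.0),
    Request("/api/completePurchase", 200, 4500.0),
    Request("/healthz", 200, 5.0),  # not a valid event for these SLIs
]
print(availability_sli(logs))  # 2/3 ≈ 0.667 (one 500 among three valid requests)
print(latency_sli(logs))       # 2/3 ≈ 0.667 (one valid request took > 3s)
```

Note how the health-check request is excluded up front: restricting the denominator to valid events is what keeps the SLI tied to real user journeys.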

A service level objective usually defines a target level for an SLI so that one can continue to provide a reliable service:
```
lower bound ≤ SLI ≤ upper bound --> for a defined period of time
```
#### An SLO
- Should be stronger than your SLA, to catch issues before they violate customer expectations
- Is an internal promise to meet customer expectations

#### NOTE:
1. Defining an SLO is an iterative process; it needs to be reviewed periodically as business needs and customers change
```
|-----------|-----------------|
Initial     Follow-up         Periodic
```
2. Edge cases
Not everything is linear; there are many edge cases in different organizations that don't conform to a single SLO for everything.
```
Examples
1. Companies might shift from 3 9s to 4 9s during the Black Friday shopping frenzy to cater to high demand
2. Outage duration can impact customer happiness. The following may affect different customers differently:
   - A single 4-hour outage
   - Four 1-hour outages
   - A constant rate of 0.5% errors
3. Not all users care about latency the same way. Bots may require lower latency than humans, so it is reasonable to have different SLOs for different users
```
#### SLO Targets
1. Just high enough to keep customers happy
2. Ambitious but achievable

Example
- Latency SLO
![alt text](https://github.com/anshudutta/sre-cheat-sheet/blob/master/Server_Latency.png)
  - 99% of requests complete in under 1000ms over a 28-day window
  - 95% of requests complete in under 750ms over a 28-day window
  - 90% of requests complete in under 500ms over a 28-day window
  - 50% of requests complete in under 200ms over a 28-day window
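A multi-threshold latency SLO like the one above can be checked mechanically over a window of observed latencies. This is a minimal sketch; the helper names and the sample latency data are assumptions made for illustration.

```python
# Sketch: checking a multi-threshold latency SLO over one measurement window.
# The (percent, threshold) pairs mirror the example SLO above; the sample
# latencies are invented.
def fraction_under(latencies_ms, threshold_ms):
    """Fraction of requests that completed in under the threshold."""
    return sum(1 for l in latencies_ms if l < threshold_ms) / len(latencies_ms)

# (required fraction of requests, latency threshold in ms)
LATENCY_SLO = [(0.99, 1000), (0.95, 750), (0.90, 500), (0.50, 200)]

def slo_met(latencies_ms):
    """True if every threshold of the latency SLO is satisfied."""
    return all(fraction_under(latencies_ms, t) >= p for p, t in LATENCY_SLO)

# A hypothetical 28-day window of 100 request latencies
window = [100] * 60 + [300] * 30 + [600] * 6 + [800] * 3 + [1200]
print(slo_met(window))       # True: 99% < 1000ms, 96% < 750ms, 90% < 500ms, 60% < 200ms
print(slo_met([900] * 100))  # False: 0% of requests finish under 750ms
```

Stacking several thresholds like this captures the tail of the latency distribution far better than a single average would.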

- Availability SLO
![alt text](https://github.com/anshudutta/sre-cheat-sheet/blob/master/Server_error.png)
  - 99.5% of responses are good over a period of 28 days
  - 99.95% of responses are good over a period of 28 days

#### The Happiness Test
The test states that a service needs target SLOs that capture the performance and availability levels that, if barely met, would keep a typical customer happy.
Simply put, if your service is performing exactly at its target SLOs, your average user would be happy with that performance.
If it were any less reliable, you'd no longer be meeting their expectations and they would become unhappy.

If your service meets its target SLOs, you have happy customers. If it misses them, you have sad customers.

#### SLO Gaps
- 100% coverage for complex systems is unrealistic
- Pay for rare failure modes out of your error budget
- Exclude factors outside your control from the SLI
- Do a cost-benefit analysis

#### SLO Dashboard
![alt text](https://github.com/anshudutta/sre-cheat-sheet/blob/master/SLO%20Dashboard.png)

### SLA
Any company providing a service needs Service Level Agreements, or SLAs. These are the agreements you make with your customers about the reliability of your service. An SLA has to have consequences if it is violated; otherwise there is no point in making one. If your customers are paying for something and you violate an SLA, there need to be consequences, such as giving your customers partial refunds or extra service credits.

If you are only alerted to issues after they have violated your SLA, that can be a very costly service to run. Therefore, it is in your best interest to catch an issue before it breaches your SLA so that you have time to fix it. These thresholds are your SLOs, service level objectives.
They should always be stronger than your SLAs, because customers are usually impacted before the SLA is actually breached, and violating SLAs requires costly compensation.

```
|---------------|--------|-----------
:)             SLO  :(  SLA   Penalty
```
An SLA
- Must have consequences if violated
- Is an agreement with your customers about the reliability of your service

## Error budget
An error budget is a measure of how much service unreliability is allowed without breaking the SLO, i.e. how much downtime is allowed before you have unhappy customers.
The error budget helps you figure out how much room you have for mistakes that can make the service unreliable.

Error budgets are the tool SRE uses to balance service reliability with the pace of innovation. Changes are a major source of instability, representing roughly 70% of outages, and development work for features competes with development work for stability.
The error budget forms a control mechanism for diverting attention to stability as needed.

```
Error Budget = 1 - SLO
```
### Illustration
28-day error budget
- 99.9% = 40 minutes (enough time for humans to react)
- 99.99% = 4 minutes
(Incident response has to be automated.
Otherwise, make sure a change propagates gradually so that not all parts of the system
are exposed to it at once, giving time for human intervention)
- 99.999% = 24 seconds (restrict the rate of change so that only 1% of the system changes at any given point in time)

```
Example

A 99.9% SLO service has a 0.1% error budget.

If the service receives 1,000,000 requests in four weeks, a 99.9% availability SLO gives us a budget of 1,000 errors over that period.

Downtime = 0.001 * 28 * 24 * 60 minutes = 40.32 minutes

This is just about enough time for
- your monitoring systems to surface an issue, and
- a human to investigate and fix it.

And that only allows for one such incident per month.

This unavailability can be the result of bad pushes by product teams, planned maintenance, hardware failures, etc.
```

### Benefits
- Common incentives for devs and SREs
- The dev team can self-manage risk
- Unrealistic goals become unattractive

If errors consumed < error budget
- Devs can push changes more frequently
- SREs can proactively work on increasing reliability

If errors consumed > error budget
- Changes need to be stopped until the system is stable again

One simple approach is to keep releasing features until the error budget is exhausted, then focus development on reliability improvements until the budget refills.
### Error Budget Policy
Describes how the organisation decides to trade off reliability vs. features when the SLO indicates the service is not reliable enough.
- Clearly describes how and when it should be applied
- Is applied consistently
- Documents the consequences of NOT applying it
- Documents thresholds for escalation, e.g.
  - after X hours of error budget burned
  - paging developers after the SLO is violated
```
Example
• Threshold 1: Automated alerts notify SRE of an at-risk SLO
• Threshold 2: SREs conclude they need help to defend the SLO and escalate to devs
• Threshold 3: The 30-day error budget is exhausted and the root cause has not been found; feature releases are blocked and the dev team dedicates more resources
• Threshold 4: The 90-day error budget is exhausted and the root cause has not been found; SRE escalates to executive leadership to obtain more engineering time for reliability work
```
## Reliability Risks
Analyze if
your error budget is realistic
- Be constructively pessimistic
- Model the error budget impact
- Compare and assess risks
- Prioritize fixing critical risks

Expected impact of a failure on the error budget over a period of time:
```
E ~ (TTR + TTD) * impact % / TTF
```

Risk spreadsheet - https://docs.google.com/spreadsheets/d/1XTsPG79XCCiaOEMj8K4mgPg39ZWB1l5fzDc1aDjLW2Y/view#gid=847168250

## Case Study
https://docs.google.com/document/d/1VM1z7naMpNbb9vwWbMxQ1_GUVZu2mB9RsBe9SD9HQUA
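The risk-impact formula above, E ~ (TTR + TTD) * impact % / TTF, can be turned into a tiny model for comparing risks against the budget. This is a sketch; the function name and every number in the hypothetical risk are invented for illustration.

```python
# Sketch: modeling a single risk's expected error-budget impact using
# E ~ (TTR + TTD) * impact % / TTF. All figures are invented examples.
def expected_bad_minutes_per_year(ttd_min, ttr_min, impact_pct, ttf_days):
    """Expected full-outage-equivalent minutes of downtime per year."""
    incidents_per_year = 365 / ttf_days
    return (ttd_min + ttr_min) * (impact_pct / 100) * incidents_per_year

# Annual error budget for a 99.9% availability SLO, in minutes
budget_min = (1 - 0.999) * 365 * 24 * 60  # ≈ 525.6 minutes

# Hypothetical risk: detected in 10 min, resolved in 60 min,
# affects 20% of users, expected roughly once every 30 days
risk = expected_bad_minutes_per_year(ttd_min=10, ttr_min=60,
                                     impact_pct=20, ttf_days=30)
print(f"risk consumes {risk / budget_min:.0%} of the annual budget")
```

Summing this estimate over every known risk shows quickly whether the budget is realistic, and which single risk is worth fixing first.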