# SRE Cheat Sheet

## Reliability
Reliability is a measure of how well a service lives up to its users' expectations.
- What to promise, and to whom?
- What metrics to measure?
- How much reliability is good enough?

### Principles
1. Reliability is the most important feature
2. Users, not monitoring, decide reliability
3. 100% is the wrong target in almost all cases
   - To reach 99.9%, you need a seasoned software engineering team
   - To reach 99.99%, you need a well-trained operations team with a focus on automation
   - To reach 99.999%, you need to sacrifice the speed at which features are released
```
NOTE:
1. 100% reliability is the wrong target. If you are running your service more reliably than you need to, you may be slowing down development
2. It's more expensive to make already reliable services more reliable. At some point the incremental cost of added reliability grows exponentially
3. Set ambitious but achievable targets, based on how the service actually performs and agreed by all stakeholders
4. Taking reliability to extremes is unproductive and costly. The service should be "reliable enough"
```

### How to make services reliable?
#### Rolling out changes gradually
- Deploy in small, incremental changes
- Feature toggles
- Canary deployments, with easy rollback, that initially affect only a small percentage of users
#### Remove single points of failure
- Multi-AZ deployments
- Set up DR in a geographically isolated region
#### Reduce TTD (Time-To-Detect)
- Catch issues faster with automated alerting and monitoring
- Monitor SLO compliance and error budget burn
#### Reduce TTR (Time-To-Resolution)
- Fix outages quicker
- Knowledge sharing via playbooks
- Automate outage mitigation steps, such as draining traffic from one region to another
#### Increase TTF / TBF (Time-To-Failure / Time-Between-Failures)
- Make services fault tolerant by running them in multiple AZs
- Automate manual mitigation steps
#### Improve operational efficiency
- Post-mortems of outages
- Standardized infrastructure
- Collect data on regions with poor reliability and make an extra effort to improve them
```
|------------|---------------|
Issue        TTD             TTR
             Time-To-Detect  Time-To-Resolution
```

## Measuring Reliability
How is reliability measured?
### SLI
An SLI is a service level indicator — a carefully defined quantitative measure of the user experience / reliability of a service.

Simply,
```
SLI = good events / valid events
```
#### Characteristics of a good SLI
- Has a predictable, linear relationship with user happiness (low variance)
![alt text](https://github.com/anshudutta/sre-cheat-sheet/blob/master/SLI-Metric.png)
- Shows the service is working as users expect it to
- Is aggregated over a long time horizon

#### Ways of measuring SLI
- Request logs
- Exported metrics
- Front-end load balancer metrics
- Synthetic clients
- Client-side instrumentation

#### Types of SLI
- Request latency
- Error rate = (500 responses / total requests) per second
- Time between failures - frequency of errors occurring over a period of time
- Availability = uptime / (uptime + downtime)
- Durability (data will be retained over a period of time; a measure of data loss)

##### Request-Response systems
- Availability - proportion of valid requests served successfully
- Latency - proportion of valid requests served faster than a threshold
- Quality - proportion of valid requests served without degraded quality
##### Data-Processing systems
- Freshness - proportion of valid data updated more recently than a threshold,
i.e. is the system serving stale data?
e.g. for a batch processing system, it is the time since the last successful run
- Correctness - proportion of valid data producing correct output
- Coverage - proportion of valid data processed successfully
- Throughput - proportion of time where the data processing rate is faster than a threshold

#### Example
In a fictional gaming application, users buy in-game currency via in-app purchases. Requests to the Play Store are only visible from the client.
We see between 0.1 and 1 completed purchases every second; this spikes to 10 purchases per second after the release of a new area, as players try to meet its requirements.

![alt text](https://github.com/anshudutta/sre-cheat-sheet/blob/master/SLI_Example.png)

Valid events - HTTPS requests from a user agent (browser or mobile client) for path `/api/getSKUs` or `/api/completePurchase`

- Availability SLI - proportion of requests for path `/api/getSKUs` or `/api/completePurchase` that do not return status code 500, measured at the load balancer
- Latency SLI - proportion of requests for paths `/api/getSKUs` or `/api/completePurchase` served within 3 seconds (let's say this threshold is based on historical data), measured at the load balancer
- Quality SLI - proportion of requests for path `/api/getSKUs` or `/api/completePurchase` served without degraded quality, measured client-side using a synthetic client or client-side instrumentation

#### Managing complex systems
- Start by thinking about user journeys
- SLIs and metrics are different: an SLI tells you *something* is broken, a metric tells you *what* is broken
- Keep the number of SLIs down to 1-3 per user journey
- Not all metrics make good SLIs
- A higher number of SLIs increases the load on operations
- A higher number of SLIs also lowers the signal-to-noise ratio, as they tend to give conflicting signals
- Aggregate similar user journeys to keep the number of SLIs down

### SLO
An SLO is a service level objective: a target value or range of values for a service level, as measured by an SLI.
SLOs are a fundamental tool for helping your organization strike a good balance between releasing new features and staying reliable for your users.
They also help your teams communicate the expectations of a service through objective data.

#### Questions that SLOs help answer:
- If reliability is a feature, when do you prioritize it versus other features?
- How fast is too fast for rolling out features?
- What is the right level of reliability for your system?
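The SLI definitions earlier (good events / valid events, measured at the load balancer) can be sketched as a small calculation over request logs. This is an illustrative sketch, not a real monitoring API: the `Request` record, its field names, and the sample data are all assumptions.

```python
# Illustrative sketch: computing availability and latency SLIs from
# load-balancer request logs. The record shape and sample data are
# invented for this example.
from dataclasses import dataclass

@dataclass
class Request:
    path: str
    status: int
    latency_ms: float

# Valid events: requests for the two purchase-flow paths
VALID_PATHS = {"/api/getSKUs", "/api/completePurchase"}

def availability_sli(logs):
    """Proportion of valid requests not answered with a 500."""
    valid = [r for r in logs if r.path in VALID_PATHS]
    good = [r for r in valid if r.status != 500]
    return len(good) / len(valid) if valid else 1.0

def latency_sli(logs, threshold_ms=3000):
    """Proportion of valid requests served within the threshold."""
    valid = [r for r in logs if r.path in VALID_PATHS]
    good = [r for r in valid if r.latency_ms <= threshold_ms]
    return len(good) / len(valid) if valid else 1.0

logs = [
    Request("/api/getSKUs", 200, 120.0),
    Request("/api/completePurchase", 500, 90.0),
    Request("/api/completePurchase", 200, 4500.0),
    Request("/healthz", 200, 5.0),  # not a valid event for these SLIs
]
print(availability_sli(logs))  # 2/3 ≈ 0.667 (one 500 among three valid requests)
print(latency_sli(logs))       # 2/3 ≈ 0.667 (one valid request took > 3s)
```

Note how the health-check request is excluded up front: restricting the denominator to valid events is what keeps the SLI tied to real user journeys.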

A service level objective usually defines a target level for an SLI so that one can continue to provide a reliable service:
```
lower bound ≤ SLI ≤ upper bound --> for a defined period of time
```
#### An SLO
- Should be stronger than your SLA, to catch issues before they violate customer expectations
- Is an internal promise to meet customer expectations

#### NOTE:
1. Defining an SLO is an iterative process; it needs to be reviewed periodically as business needs and customers change
```
|-----------|-----------------|
Initial     Follow-up         Periodic
```
2. Edge cases
Not everything is linear; there are many edge cases in different organizations that don't conform to a single SLO for everything.
```
Examples
1. Companies might shift from 3 9s to 4 9s during the Black Friday shopping frenzy to cater to high demand
2. Outage duration can impact customer happiness. The following may affect different customers differently:
   - A single 4-hour outage
   - Four 1-hour outages
   - A constant rate of 0.5% errors
3. Not all users care about latency the same way. Bots may require lower latency than humans, so it is reasonable to have different SLOs for different users
```
#### SLO Targets
1. Just high enough to keep customers happy
2. Ambitious but achievable

Example
- Latency SLO
![alt text](https://github.com/anshudutta/sre-cheat-sheet/blob/master/Server_Latency.png)
  - 99% of requests complete in under 1000ms over a 28-day window
  - 95% of requests complete in under 750ms over a 28-day window
  - 90% of requests complete in under 500ms over a 28-day window
  - 50% of requests complete in under 200ms over a 28-day window
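A multi-threshold latency SLO like the one above can be checked mechanically over a window of observed latencies. This is a minimal sketch; the helper names and the sample latency data are assumptions made for illustration.

```python
# Sketch: checking a multi-threshold latency SLO over one measurement window.
# The (percent, threshold) pairs mirror the example SLO above; the sample
# latencies are invented.
def fraction_under(latencies_ms, threshold_ms):
    """Fraction of requests that completed in under the threshold."""
    return sum(1 for l in latencies_ms if l < threshold_ms) / len(latencies_ms)

# (required fraction of requests, latency threshold in ms)
LATENCY_SLO = [(0.99, 1000), (0.95, 750), (0.90, 500), (0.50, 200)]

def slo_met(latencies_ms):
    """True if every threshold of the latency SLO is satisfied."""
    return all(fraction_under(latencies_ms, t) >= p for p, t in LATENCY_SLO)

# A hypothetical 28-day window of 100 request latencies
window = [100] * 60 + [300] * 30 + [600] * 6 + [800] * 3 + [1200]
print(slo_met(window))       # True: 99% < 1000ms, 96% < 750ms, 90% < 500ms, 60% < 200ms
print(slo_met([900] * 100))  # False: 0% of requests finish under 750ms
```

Stacking several thresholds like this captures the tail of the latency distribution far better than a single average would.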

- Availability SLO
![alt text](https://github.com/anshudutta/sre-cheat-sheet/blob/master/Server_error.png)
  - 99.5% of responses are good over a period of 28 days
  - 99.95% of responses are good over a period of 28 days

#### The Happiness Test
The test states that a service needs target SLOs that capture the performance and availability levels that, if barely met, would keep a typical customer happy.
Simply put, if your service is performing exactly at its target SLOs, your average user would be happy with that performance.
If it were any less reliable, you'd no longer be meeting their expectations and they would become unhappy.

If your service meets its target SLOs, you have happy customers. If it misses them, you have sad customers.

#### SLO Gaps
- 100% coverage for complex systems is unrealistic
- Pay for rare failure modes out of your error budget
- Exclude factors outside your control from the SLI
- Do a cost-benefit analysis

#### SLO Dashboard
![alt text](https://github.com/anshudutta/sre-cheat-sheet/blob/master/SLO%20Dashboard.png)

### SLA
Any company providing a service needs Service Level Agreements, or SLAs. These are the agreements you make with your customers about the reliability of your service. An SLA has to have consequences if it is violated; otherwise there is no point in making one. If your customers are paying for something and you violate an SLA, there need to be consequences, such as giving your customers partial refunds or extra service credits.

If you are only alerted to issues after they have violated your SLA, that can be a very costly service to run. Therefore, it is in your best interest to catch an issue before it breaches your SLA so that you have time to fix it. These thresholds are your SLOs, service level objectives.
They should always be stronger than your SLAs, because customers are usually impacted before the SLA is actually breached, and violating SLAs requires costly compensation.

```
|---------------|--------|-----------
:)             SLO  :(  SLA   Penalty
```
An SLA
- Must have consequences if violated
- Is an agreement with your customers about the reliability of your service

## Error budget
An error budget is a measure of how much service unreliability is allowed without breaking the SLO, i.e. how much downtime is allowed before you have unhappy customers.
The error budget helps you figure out how much room you have for mistakes that can make the service unreliable.

Error budgets are the tool SRE uses to balance service reliability with the pace of innovation. Changes are a major source of instability, representing roughly 70% of outages, and development work for features competes with development work for stability.
The error budget forms a control mechanism for diverting attention to stability as needed.

```
Error Budget = 1 - SLO
```
### Illustration
28-day error budget
- 99.9% = 40 minutes (enough time for humans to react)
- 99.99% = 4 minutes
(Incident response has to be automated.
Otherwise, make sure a change propagates gradually so that not all parts of the system
are exposed to it at once, giving time for human intervention)
- 99.999% = 24 seconds (restrict the rate of change so that only 1% of the system changes at any given point in time)

```
Example

A 99.9% SLO service has a 0.1% error budget.

If the service receives 1,000,000 requests in four weeks, a 99.9% availability SLO gives us a budget of 1,000 errors over that period.

Downtime = 0.001 * 28 * 24 * 60 minutes = 40.32 minutes

This is just about enough time for
- your monitoring systems to surface an issue, and
- a human to investigate and fix it.

And that only allows for one such incident per month.

This unavailability can be the result of bad pushes by product teams, planned maintenance, hardware failures, etc.
```

### Benefits
- Common incentives for devs and SREs
- The dev team can self-manage risk
- Unrealistic goals become unattractive

If errors consumed < error budget
- Devs can push changes more frequently
- SREs can proactively work on increasing reliability

If errors consumed > error budget
- Changes need to be stopped until the system is stable again

One simple approach is to keep releasing features until the error budget is exhausted, then focus development on reliability improvements until the budget refills.
### Error Budget Policy
Describes how the organisation decides to trade off reliability vs. features when the SLO indicates the service is not reliable enough.
- Clearly describes how and when it should be applied
- Is applied consistently
- Documents the consequences of NOT applying it
- Documents thresholds for escalation, e.g.
  - after X hours of error budget burned
  - paging developers after the SLO is violated
```
Example
• Threshold 1: Automated alerts notify SRE of an at-risk SLO
• Threshold 2: SREs conclude they need help to defend the SLO and escalate to devs
• Threshold 3: The 30-day error budget is exhausted and the root cause has not been found; feature releases are blocked and the dev team dedicates more resources
• Threshold 4: The 90-day error budget is exhausted and the root cause has not been found; SRE escalates to executive leadership to obtain more engineering time for reliability work
```
## Reliability Risks
Analyze if
your error budget is realistic
- Be constructively pessimistic
- Model the error budget impact
- Compare and assess risks
- Prioritize fixing critical risks

Expected impact of a failure on the error budget over a period of time:
```
E ~ (TTR + TTD) * impact % / TTF
```

Risk spreadsheet - https://docs.google.com/spreadsheets/d/1XTsPG79XCCiaOEMj8K4mgPg39ZWB1l5fzDc1aDjLW2Y/view#gid=847168250

## Case Study
https://docs.google.com/document/d/1VM1z7naMpNbb9vwWbMxQ1_GUVZu2mB9RsBe9SD9HQUA
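The risk-impact formula above, E ~ (TTR + TTD) * impact % / TTF, can be turned into a tiny model for comparing risks against the budget. This is a sketch; the function name and every number in the hypothetical risk are invented for illustration.

```python
# Sketch: modeling a single risk's expected error-budget impact using
# E ~ (TTR + TTD) * impact % / TTF. All figures are invented examples.
def expected_bad_minutes_per_year(ttd_min, ttr_min, impact_pct, ttf_days):
    """Expected full-outage-equivalent minutes of downtime per year."""
    incidents_per_year = 365 / ttf_days
    return (ttd_min + ttr_min) * (impact_pct / 100) * incidents_per_year

# Annual error budget for a 99.9% availability SLO, in minutes
budget_min = (1 - 0.999) * 365 * 24 * 60  # ≈ 525.6 minutes

# Hypothetical risk: detected in 10 min, resolved in 60 min,
# affects 20% of users, expected roughly once every 30 days
risk = expected_bad_minutes_per_year(ttd_min=10, ttr_min=60,
                                     impact_pct=20, ttf_days=30)
print(f"risk consumes {risk / budget_min:.0%} of the annual budget")
```

Summing this estimate over every known risk shows quickly whether the budget is realistic, and which single risk is worth fixing first.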