├── how_did_things_go_right
    ├── README.md
    └── how_did_things_go_right.pdf
└── the_meat_of_it
    ├── README.md
    └── the_meat_of_it.pdf


/how_did_things_go_right/README.md:
--------------------------------------------------------------------------------
# Thanks so much for your interest in this presentation!
I hope to cover some of the nuance of the content here to balance out what might at first glance seem provocative.

It’s really hard to introduce these concepts without sounding dogmatic.
It’s really easy to interpret this stuff as “We’re all doing it wrong and need to throw out everything!”

That’s not the point of this talk! What I want to point out is that the tools and methods we have aren’t designed to solve every problem.

And that’s ok! There is no panacea.

What I want to point out is how we can start to look at the pitfalls and blind spots that often go ignored in incredibly subtle ways. We need to find the stuff that’s actually making us better and highlight those things. This is different for every organization, which is what makes it so hard to talk about.

The point of this talk is to encourage a holistic view of incidents (and of the times when we aren’t having incidents). That means BOTH failure and success, and everything in between that happens during normal, everyday performance. These ideas are additive, not a replacement.

I want us to go beyond the fashionability of SRE and really think about the often-invisible things we’re doing all the time that contribute to reliability.


#### Should we stop using SLOs and Error Budgets?
No, they’re super useful! They’ve transformed what we do and how we think about reliability. However, they aren’t designed to address every problem. There are a lot of weak signals on the human side of reliability that we can’t detect this way, and we should be paying attention to those by compassionately talking with people and interviewing them.


#### Should we stop measuring availability?
No! Measuring availability can tell us a lot, but **reporting** on it at a product level can create some pretty nasty incentives that ironically end up harming availability in the long term. Explore the different dimensions that go into creating the numbers, unearth the qualitative measures that cannot be represented, and make the reports about that stuff.

Where measuring availability becomes really useful is between services behind the edge. We interact with software to form a sociotechnical system. Let’s say there are five services in a call span: each of those availability numbers is the start of a conversation about expectations between teams and services.


#### Should we stop creating action items during postmortems?
No, but know that action items are not the point of doing incident investigations. Learning is the goal! Don’t incentivize the number of action items. Incentivize people working together to share expertise and learn.


https://www.usenix.org/conference/srecon19americas/presentation/kitchens

https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/78020

https://qconnewyork.com/ny2019/presentation/how-did-things-go-right-learning-more-incidents


# References and materials for "How Did Things Go Right? Learning More from Incidents"

http://resiliencepapers.club

http://hophub.org

http://safetydifferently.com

https://www.snafucatchers.com

http://stella.report


##### The HOP Mentor - Andrea Baker
https://www.thehopmentor.com/


##### The Field Guide to Understanding 'Human Error' - Sidney Dekker
https://sidneydekker.com/books/


##### Pre-Accident Investigations - Todd Conklin
https://preaccidentpodcast.podbean.com/


##### Language Bias in Accident Investigation - Crista Vesel
https://lup.lub.lu.se/student-papers/search/publication/2971193


##### Maps, Context, and Tribal Knowledge - J Paul Reed
https://lup.lub.lu.se/student-papers/search/publication/8966930


##### Trade-offs Under Pressure - John Allspaw
https://lup.lub.lu.se/student-papers/search/publication/8084520


##### Debriefing Facilitation Guide
https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf


##### Lund University Human Factors & Systems Safety programme
http://www.humanfactors.lth.se/


##### Cognitive Systems Engineering Laboratory at Ohio State University
http://csel.org.ohio-state.edu/


# Attributions, Thanks, and Inspirations
If only every conference talk could be prefaced with "almost none of these are my original thoughts." :)


Thanks to my colleagues Nora Jones (@nora_js), Lorin Hochstein (@lhochstein), Ashish Jotwani, and Dave Hahn (@relix42).


'Recovery > prevention' - Aaron Blohowiak (@aaronblohowiak)


Much of the groundwork of resilience engineering in software has been introduced by John Allspaw (@allspaw), Richard Cook (@ri_cook), David Woods (@ddwoods2), and Johan Bergstrom (@bergstrom_johan).

- https://www.devopsdays.org/events/2018-seattle/program/john-allspaw/
- https://m.youtube.com/watch?v=2S0k12uZR14
- https://m.youtube.com/watch?v=7STcaWjJoww
- https://m.youtube.com/watch?v=Pb_zYs8G6Co


"The nines don't matter if users aren't happy" - Charity Majors (@mipsytipsy)


'failure is no longer interesting' has got to be something Todd Conklin has said before.


The bit on 'a perfect storm' was paraphrased from Andrea Baker (@thehopmentor).


'probably fine' and many dots connected by Andrew Clay Shafer (@littleidea)
- There is no Talent Shortage https://www.youtube.com/watch?v=P_sWGl7MzhU


Comments on demonstrating the inefficiencies of blameful language from Matty Stratton (@mattstratton)
- www.arresteddevops.com


Any talk Bridget Kromhout (@bridgetkromhout) has ever given, but especially 'Containers Will Not Fix Your Broken Culture'
- https://m.youtube.com/watch?v=hALDyVBVVz0


Many inspirations from the Greater Than Code podcast
- http://www.greaterthancode.com


Diagrams/models based on work from Sidney Dekker (@sidneydekkercom), Erik Hollnagel, and Kelvin Genn (@kelvingenn)
- http://www.safetydifferently.com/why-do-things-go-right/

--------------------------------------------------------------------------------
/how_did_things_go_right/how_did_things_go_right.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thishitshome/learning-from-incidents/9992d1d80163fcb9fe61b399ba92d55064dd975e/how_did_things_go_right/how_did_things_go_right.pdf

--------------------------------------------------------------------------------
/the_meat_of_it/README.md:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/the_meat_of_it/the_meat_of_it.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thishitshome/learning-from-incidents/9992d1d80163fcb9fe61b399ba92d55064dd975e/the_meat_of_it/the_meat_of_it.pdf

--------------------------------------------------------------------------------