├── meeting_notes
    ├── 2016-03-31-agenda.md
    └── 2016-02-23-kickoff-hangout.md
└── README.md

/meeting_notes/2016-03-31-agenda.md:
--------------------------------------------------------------------------------
1 | # Ops NTSB hangout #2 -- call for agenda items!
2 | ## When: Thursday 2016-03-31, 3 pm PDT (10 pm UTC)
3 | _Please add anything you would like to chat about!_
4 | 
5 | __Participants:__
6 | 
7 | * In the interest of not posting private info -- if you want an invite, DM your gmail account address to @charity in #hangops!
8 | 
9 | __Agenda items:__
10 | 
11 | * Shall we aim for ~monthly hangouts? (charity's feeling: anyone can call a hangout any time, please appoint a note taker and post notes so we can all catch up!)
12 | * Time zones are a thing. Not everyone can make every time, that's ok! Gets hard to have a productive discussion with >$x people, so encourage more, smaller, democratized hangouts. Scheduling this one to be friendly to Australians since @rhoml requested :)
13 | * Should we try crowd-sourcing postmortems for outages that weren't publicly processed!? OMG this sounds so amazing (credit to @petey5k, I think)
14 | * Good incident response policies are SO contextual. Should we post some case studies (interview or article style?). Could tag with size of company, maturity, service type, SLA, other useful contextual clues to guide people towards creating appropriate policies for their environment
15 | * Over-engineering incident response is just as damaging as under-valuing or under-executing.
16 | * Gotten lots of great feedback on the first book reports -- should we reach out and specifically solicit contributions? Any thoughts on next steps?
17 | * Lex has joined #incident_response and plugged us in SREWeekly! So exciting to see more people with new ideas show up. Any newbies want to introduce themselves and what's on their minds lately?
18 | * Other topics?
19 | 
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
1 | # OIB Mission Statement
2 | 
3 | Three-year working mission statement for an Ops Incident Board (OIB)
4 | 
5 | ### Proposal:
6 | 
7 | Over the next three years, we should work towards having an elected, term-limited board of operations/software engineering professionals who are committed to raising the bar for industry incident reports, developing tools for reporting, and disseminating best practices and knowledge.
8 | 
9 | The OIB may provide:
10 | 
11 | * **Templates for constructive postmortems and incident response**. Public git repo, perhaps a variety of templates for different types of consumers (e.g. other engineers, ecommerce customers) and degrees of badness (e.g. dipped below SLA, catastrophic data loss)
12 | * **Publicly searchable database / aggregation of incident reports**, similar to the [NTSB database](http://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx)
13 | * **Private peer review of post mortems** before public release, upon request. Requires swift turnaround, but would be valuable to a lot of people who are inexperienced with this sort of thing and would like a review of their language and/or level of specificity.
14 | * **Periodic [book reviews](https://github.com/Operations-Incident-Board/Postmortem-Report-Reviews) of recent or classic outages**, with an emphasis on what was useful about them and anything that was misleading or unclear. The OIB will aggregate these on a central website, and have a rapid-release review of major outages. The goal here is to help post-mortem authors understand the impact of what they are writing, and compare with precedent outages.
15 | * The OIB will try to **signal boost best practices or interesting hacks** that an org is using to improve their reliability or human usability, so that other organizations may adopt the same pattern (e.g. Etsy's nighttime indicator in the deploy tool). We want to encourage and amplify the beneficial networking effects of efforts like the Etsy-Twitter engineering exchange program.
16 | * Ultimately: we would love to reach a point where we can **participate in post mortem discussions** at major tech companies -- as a fly on the wall to start, asking naive questions if that goes well, or even as a trained facilitator. Tiny startups and nonprofits seem like a good place to start with this. This won't work unless it becomes a badge of quality and credibility to publish a post mortem with an OIB stamp -- this means we certify it's a high quality, appropriately transparent post mortem that the rest of the industry should pay attention to and learn from.
17 | 
18 | Before we get anywhere near the last goal, we would need to build a name and credibility in the community by developing the first five points as well as a recognizable identity. (Ideas for raising profile? Possibly a VelocityConf panel or something next year.)
19 | 
20 | Practically speaking, the OIB should probably be an odd number of people (5?) where one rotates out every year and is replaced by community nomination. It shouldn’t take more than ~2 hours of work per week per member, and there should be some continuity year over year.
21 | 
22 | _Better definition needed for high quality / certifiable post mortem. We think it’s honest? Enlightening or educational for others using similar software? We def have to agree not to disclose anything the inviting company has not authorized.
Do we publish a separate meta-analysis, subject to their approval?_
23 | 
24 | 
--------------------------------------------------------------------------------

/meeting_notes/2016-02-23-kickoff-hangout.md:
--------------------------------------------------------------------------------
1 | # Ops NTSB Meeting Notes
2 | ## 2016-02-23
3 | _[Marc note: I may have mis-transcribed some of this, all errors are mine, please feel free to correct.]_
4 | 
5 | __Participants:__
6 | 
7 | * Ben Krueger, Slalom Consulting (benjamin@xxxxx)
8 | * Charity Majors, Hound (charity@xxxxx)
9 | * Marc Hedlund, Fiasco, Inc. (marc@xxxxx)
10 | * Aaron (from Kickstarter)
11 | * Sam
12 | * Jonathan Gilbert (jong@xxxxx)
13 | * Tim McGinnis (st.lart@xxxxx)
14 | 
15 | __Intro chat; purpose__
16 | 
17 | Ben: Desire to have some shared communication about learning from failure. Shared industry knowledge about past failures is poor. (Charity mentions [SRE Weekly](http://sreweekly.com/).) Would be useful if we started with (1) a standard format, and (2) a repository for them.
18 | 
19 | Charity: are we talking about people sharing the postmortems they’re already making, or doing something new, like having an investigatory board that would get people to try new things? Are we talking about social and legal problems in a bigger way? (Do you need legal force to make this work at all?)
20 | 
21 | Ben: on the other hand, if we don’t start sharing info voluntarily, it’s likely regulation will be imposed as failures become bigger or have great impact (e.g. CA security breach laws). Things we’re doing are impacting people’s lives in a bigger way.
22 | 
23 | Charity: (1) non-profits are legally required to disclose this sort of information already. (2) small companies might be more willing to participate since they have less to lose and could build credibility with a “badge of honor/respect” because they’re participating.
24 | 
25 | __Benefits of a shared group__
26 | 
27 | Sam: after some recent outages we’ve had a lot of discussions about how to be public about some things, so that technical users can gain confidence in the reaction, but the greater industry doesn’t provide a lot of good models or guidelines. People across a group of companies could be really useful, helping to have higher confidence (e.g. Google management practice or the like). I think we would get behind almost an open standard for this sort of review and understanding. Debate about whether we were pursuing the right kind of communication.
28 | 
29 | Charity: So here are questions to ask yourself, how to ask them, starter or feeder questions, ways to make postmortems productive. List of things you should leave any postmortem with.
30 | 
31 | Sam: Useful to have persuasion around blamelessness, we could have roundtables, a lot of contribution we could make.
32 | 
33 | Charity: I love the idea of a person with fresh eyes, no background on the company but understanding of the technical issues, to get fresh perspective on the events.
34 | 
35 | __Incentivizing companies to participate__
36 | 
37 | Ben: Would it help to have PR resources to help develop a message for companies that decide to participate? How could we make companies more comfortable with this? Would companies fear engineers publishing secrets and judgments, and would having experienced PR people make more companies comfortable?
38 | 
39 | Charity: Nothing should be published by the committee that hasn’t been anonymized! It would be way better if we could do it in a non-identifiable way. Yearly report?
40 | 
41 | Ben: [Data Breach Report from Verizon](http://www.verizonenterprise.com/DBIR/2015/) might be a good model.
42 | 
43 | Marc: Would we lose value by anonymizing and losing important details?
44 | 
45 | Charity: Yeah, maybe right, and we should be de-stigmatizing this, not hiding it necessarily.
46 | 
47 | Ben: We do have companies that are doing this already but not in a standard way.
48 | 
49 | Charity: If Github is in, I’m in… (common reaction, perhaps?)
50 | 
51 | Ben: Netflix, Github, Amazon, once they do something, other companies might be interested.
52 | 
53 | Charity: Excellent public postmortems are also good for recruiting, and a good sign.
54 | 
55 | __Anonymization and concrete outputs__
56 | 
57 | Aaron: I’m on the ops team at Kickstarter. One thing I’m struck by is the parallels to the security community, there are groups that do this but they are very informal, confidential networks (you have to know the right person to get in). The benefit is that leads to a safe space and leads to more comfort. The big downside is that it’s all pretty closed, unless you know the right person to ask. Maybe this is a strong pattern to start to get people to be more comfortable?
58 | 
59 | Ben: I can see that but what would the output of that be? I think the general goal should be to generate artifacts that people can learn from; if we use FrieNDA as a model, does that give us enough of the benefit for those outside of the inner circle?
60 | 
61 | Aaron: Sure, agree, but that’s a community that has already addressed this problem in one way, albeit a closed way. One thing to keep as a concern is that the time invested to make a postmortem that could be shared with the world would be a huge amount (versus the investment for internal sharing), and we would need to have tools or practices (std formats or whatever) to reduce that investment if we expect this to be frequent.
62 | 
63 | Ben: Agree, blueprints/templates or examples would be very helpful. There’s probably a bar, too, where an incident is not significant enough to be shared with the community.
64 | 
65 | Charity: We’re looking for complex, cascading failures -- we don’t usually have outages that are caused by some one thing, but instead when things go wrong in chains or webs.
66 | 
67 | Marc: Agree, but sometimes a single remediation might be worthwhile even if the failure was minor (example of nighttime indicators in the Etsy deploy tool as a reminder that people are deploying after hours). Maybe “Pattern Language” style successful remediations would work.
68 | 
69 | Ben: Some good thresholds, maybe, similar to “plane fell out of sky” or “person died” for NTSB [Marc note: I may have mis-transcribed this.]
70 | 
71 | Sam: Maybe a study of things that work at different companies would be useful -- central knowledgebase from the real world.
72 | 
73 | Jonathan: How does this work? Do you have to contribute to learn from it?
74 | 
75 | Ben: I’d love for the output to be for the general community, not closed.
76 | 
77 | Marc: [Example of the Engineering Exchange Program](https://codeascraft.com/2012/09/10/the-engineer-exchange-program/). Even though the first one was not successful on both sides, it was perceived as successful and increased the confidence of other companies, where there was more success.
78 | 
79 | __So how can we get this going?__
80 | 
81 | Ben: So how can we get the ball rolling? Big example, signup, etc? Disagree that this is an impossible task, but we should be realistic about the things to overcome.
82 | 
83 | Charity: Could we get going with a tool that we’re generating ourselves and get things going with examples, and then build from there, ramp up the public participation over time?
84 | 
85 | Ben: Github repo of past examples would be great.
86 | 
87 | Charity: Would love to do “book reports” about existing postmortems and what we’ve learned from them.
88 | 
89 | Marc: Great idea, [Dan Luu’s examples](http://danluu.com/postmortem-lessons/) (see also [Github repo](https://github.com/danluu/post-mortems)) are awesome.
90 | 
91 | Jonathan: Yeah, if there were more that we would get from postmortems, we might have more attention to them and get more out of them.
92 | 
93 | Ben: Will start Github repo, we should work towards a postmortem format. Charity, willing to write the first book report?
94 | 
95 | Charity: Totally, on it. Also, we should make a statement about this. What’s the idealistic paragraph about what we’re trying to do with this?
96 | 
97 | Ben: Reading NTSB and other examples is super useful.
98 | 
99 | Jonathan: Also useful to have eye on customer impact.
100 | 
101 | Ben: One final thing: maybe offer a review service to a company that chooses to publish one -- maybe help companies make sure they’re doing a good job of it before publishing. That could be a good first step, especially if the offered investigation turns out to be useful to the company.
102 | 
103 | Sam: I agree with Charity’s comment about a mission statement. Also, it would be good to ask customers about what they’d want to know about from postmortems. That might give us ideas about things we can deliver. Really high-quality examples of writing and forms of discussion would be extremely useful to the community.
104 | 
105 | ACTIONS:
106 | 
107 | - [x] Mission statement - Charity (in for a rough draft, welcome collaboration)
108 | 
109 | - [ ] Examples of great and worthwhile existing public postmortems - OWNER NEEDED
110 | 
111 | - [ ] A few book reports about existing public postmortems - Charity for the first one
112 | 
113 | - [ ] Talking to customers/consumers of postmortems about what they might want to see in them - OWNER NEEDED
114 | 
115 | - [ ] Recruiting companies to get involved - OWNER NEEDED
116 | 
117 | - [ ] Develop an open sourced post-mortem report format - OWNER NEEDED
118 | 
119 | - [ ] Find a way to solicit and accept feedback from companies and individuals about: what they would like to know about when learning from an incident. - OWNER NEEDED
120 | 
121 | - [X] Create a GitHub repo to store artifacts we generate - Ben
122 | 
123 | - [ ] Should write a blog post about some of these ideas.
Goes towards recruiting other companies to participate. Depends on mission statement. - Ben
124 | 
125 | 
--------------------------------------------------------------------------------