/README.md:

# Availability reading list

This has been merged into my [systems and failure reading list][1].

[1]: https://github.com/lorin/systems-reading/#availability

/yuan14.md:

# Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm.
Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), Oct. 2014.

* [pdf]
* [video]
* [slides]

Terminology:

* Catastrophic: affects all or a majority of users

## Slides

### Key findings

* Failures are the result of a complex sequence of events
* Catastrophic failures are caused by incorrect error handling
* Many are caused by a small set of trivial bug patterns
* Aspirator: a simple rule-based static checker
  * Found 143 confirmed new bugs and bad practices

## Paper

Most failures require multiple events to trigger, so stress testing alone is not enough: it needs to be combined with other techniques such as fault injection.

The order of events also matters.

### Findings

1. A majority (77%) of failures require more than one input event to manifest, but most failures (90%) require no more than 3.
2. The specific order of events is important in 88% of the failures that require multiple input events.
3. Almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes; 84% will manifest on no more than 2 nodes.
4. 74% of failures are deterministic.
5. Among the non-deterministic failures, 53% have timing constraints only on the input events.
6. 76% of the failures print explicit failure-related error messages.
7. For a majority (84%) of the failures, all of their triggering events are logged.
8. Logs are noisy: the median number of log messages printed per failure is 824.
9. A majority of the production failures (77%) can be reproduced by a unit test.
10. Almost all catastrophic failures (92%) are the result of incorrect handling of non-fatal errors explicitly signaled in software.
11. 35% of the catastrophic failures are caused by trivial mistakes in error-handling logic -- ones that simply violate best programming practices and can be detected without system-specific knowledge.
12. In 23% of catastrophic failures, the incorrect error handling would have been exposed by testing with 100% statement coverage of the error-handling logic.

## Categories

- Complexity of failures (most aren't *that* complex)
- Role of timing (often deterministic enough that they could be reproduced in a test)
- Logs enable diagnosis opportunities (logs are good, but noisy)
- Failure reproducibility (these are catchable in tests if we know where to look)
- Catastrophic failures
  - Trivial mistakes in error handlers
  - System-specific bugs

## Opportunities for improved testing

- Starting up services: more than half of the failures require the start of some services
- Unreachable nodes: 24% of failures occur because a node is unreachable
- Configuration changes: 23% of failures are caused by config changes
  - 30% of these involve misconfigurations
  - The remaining majority involve valid changes that enable rarely-used features
- Adding a node: 15% of failures are triggered by adding a node

## Limitations

1. Representativeness of the selected systems
1. Representativeness of the selected failures
1. Size of our sample set
1. Possible observer errors

## Conclusions

- We should be able to catch most of these issues in unit testing
- Logs are useful but noisy
- We should put more effort into writing the error-handling code

### Changing how we code

### Changing how we test

- Daniel Jackson's "Small Scope Hypothesis": most bugs have small counterexamples
- Most failures can be reproduced with unit tests

[pdf]: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
[video]: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan
[slides]: https://www.usenix.org/sites/default/files/conference/protected-files/osdi14_slides_yuan-ding.pdf
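
The findings above combine two ideas: trivial error-handling mistakes (the kind of pattern a rule-based checker like Aspirator flags, e.g. an empty or swallowing exception handler) cause many catastrophic failures, and such failures are mostly reproducible by a unit test that injects the non-fatal error. A minimal sketch of both ideas in Python -- the `Replicator` class and its methods are hypothetical illustrations, not code from the paper or the studied systems:

```python
import unittest
from unittest import mock

class Replicator:
    """Hypothetical client that replicates a block over some transport."""

    def __init__(self, transport):
        self.transport = transport

    def replicate_buggy(self, block):
        # Anti-pattern from the paper's findings: the handler for an
        # explicitly signaled, non-fatal error is effectively empty,
        # so the failure is silently swallowed.
        try:
            self.transport.send(block)
        except IOError:
            pass  # TODO: handle this  <-- the trivial bug pattern
        return True  # caller wrongly believes the block was replicated

    def replicate_fixed(self, block):
        # Minimal correct handling: surface the failure to the caller.
        try:
            self.transport.send(block)
            return True
        except IOError:
            return False

class ReplicatorTest(unittest.TestCase):
    def test_send_failure_is_reported(self):
        # Fault injection: force the transport to raise the non-fatal error,
        # exercising the error-handling path (100% statement coverage of it).
        transport = mock.Mock()
        transport.send.side_effect = IOError("node unreachable")
        r = Replicator(transport)
        self.assertTrue(r.replicate_buggy(b"x"))   # bug: failure is masked
        self.assertFalse(r.replicate_fixed(b"x"))  # fix: failure surfaces
```

A static checker only needs to pattern-match the `except: pass` handler to flag the buggy version, and the unit test reproduces the failure on a single node with one injected event -- consistent with the paper's claim that simple testing catches most of these bugs.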