/README.md:

# Availability reading list

This has been merged into my [systems and failure reading list][1].

[1]: https://github.com/lorin/systems-reading/#availability

/yuan14.md:

# Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm.
Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), Oct. 2014.

* [pdf]
* [video]
* [slides]

Terminology:

* Catastrophic: affects all or a majority of users

## Slides

### Key findings

* Failures are the result of a complex sequence of events
* Catastrophic failures are caused by incorrect error handling
* Many are caused by a small set of trivial bug patterns
* Aspirator: a simple rule-based static checker
  * Found 143 confirmed new bugs and bad practices

## Paper

Most failures require multiple events to trigger, so stress testing alone is not enough: it needs to be combined with other techniques such as fault injection.

The order of events also matters.

### Findings

1. A majority (77%) of failures require more than one input event to manifest, but most failures (90%) require no more than 3.
2. The specific order of events is important in 88% of the failures that require multiple input events.
3. Almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes; 84% will manifest on no more than 2 nodes.
4. 74% of failures are deterministic.
5. Among the non-deterministic failures, 53% have timing constraints only on the input events.
6. 76% of the failures print explicit failure-related error messages.
7. For a majority (84%) of the failures, all of their triggering events are logged.
8. Logs are noisy: the median number of log messages printed per failure is 824.
9. A majority of the production failures (77%) can be reproduced by a unit test.
10. Almost all catastrophic failures (92%) are the result of incorrect handling of non-fatal errors explicitly signaled in software.
11. 35% of the catastrophic failures are caused by trivial mistakes in error-handling logic -- ones that simply violate best programming practices and can be detected without system-specific knowledge.
12. In 23% of catastrophic failures, the incorrect error handling would have been exposed by testing with 100% statement coverage of the error-handling logic.

## Categories

- Complexity of failures (most aren't *that* complex)
- Role of timing (often deterministic enough that they could be reproduced in a test)
- Logs enable diagnosis opportunities (logs are good, but noisy)
- Failure reproducibility (these are catchable in tests if we know where to look)
- Catastrophic failures
  - Trivial mistakes in error handlers
  - System-specific bugs

## Opportunities for improved testing

- Starting up services: more than half of the failures require the start of some services
- Unreachable nodes: 24% of failures occur because a node is unreachable
- Configuration changes: 23% of failures are caused by config changes
  - 30% of these involve misconfigurations
  - The remaining majority involve valid changes that enable rarely-used features
- Adding a node: 15% of failures are triggered by adding a node

## Limitations

1. Representativeness of the selected systems
1. Representativeness of the selected failures
1. Size of our sample set
1. Possible observer errors

## Conclusions

- We should be able to catch most of these issues in unit testing
- Logs are useful but noisy
- We should put more effort into writing the error-handling code

### Changing how we code

### Changing how we test

- Daniel Jackson's "Small Scope Hypothesis": most bugs have small counterexamples
- Most failures can be reproduced with unit tests

[pdf]: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
[video]: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan
[slides]: https://www.usenix.org/sites/default/files/conference/protected-files/osdi14_slides_yuan-ding.pdf
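
The findings above combine two ideas: trivial error-handling mistakes (the kind of pattern a rule-based checker like Aspirator flags, e.g. an empty or swallowing exception handler) cause many catastrophic failures, and such failures are mostly reproducible by a unit test that injects the non-fatal error. A minimal sketch of both ideas in Python -- the `Replicator` class and its methods are hypothetical illustrations, not code from the paper or the studied systems:

```python
import unittest
from unittest import mock

class Replicator:
    """Hypothetical client that replicates a block over some transport."""

    def __init__(self, transport):
        self.transport = transport

    def replicate_buggy(self, block):
        # Anti-pattern from the paper's findings: the handler for an
        # explicitly signaled, non-fatal error is effectively empty,
        # so the failure is silently swallowed.
        try:
            self.transport.send(block)
        except IOError:
            pass  # TODO: handle this  <-- the trivial bug pattern
        return True  # caller wrongly believes the block was replicated

    def replicate_fixed(self, block):
        # Minimal correct handling: surface the failure to the caller.
        try:
            self.transport.send(block)
            return True
        except IOError:
            return False

class ReplicatorTest(unittest.TestCase):
    def test_send_failure_is_reported(self):
        # Fault injection: force the transport to raise the non-fatal error,
        # exercising the error-handling path (100% statement coverage of it).
        transport = mock.Mock()
        transport.send.side_effect = IOError("node unreachable")
        r = Replicator(transport)
        self.assertTrue(r.replicate_buggy(b"x"))   # bug: failure is masked
        self.assertFalse(r.replicate_fixed(b"x"))  # fix: failure surfaces
```

A static checker only needs to pattern-match the `except: pass` handler to flag the buggy version, and the unit test reproduces the failure on a single node with one injected event -- consistent with the paper's claim that simple testing catches most of these bugs.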