├── .gitignore ├── LICENSE ├── README.md ├── calculation.ipynb └── durability.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | local_settings.py 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # Jupyter Notebook 71 | .ipynb_checkpoints 72 | 73 | # pyenv 74 | .python-version 75 | 76 | # celery beat schedule file 77 | celerybeat-schedule 78 | 79 | # SageMath parsed files 80 | *.sage.py 81 | 82 | # dotenv 83 | .env 84 | 85 | # virtualenv 86 | .venv 87 | venv/ 88 | ENV/ 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | 103 | # JetBrains 104 | .idea 105 | -------------------------------------------------------------------------------- /LICENSE: 
-------------------------------------------------------------------------------- 1 | Backblaze wants developers and organization to copy and re-use our 2 | code examples, so we make the samples available by several different 3 | licenses. One option is the MIT license (below). Other options are 4 | available here: 5 | 6 | https://www.backblaze.com/using_b2_code.html 7 | 8 | MIT License 9 | 10 | Copyright (c) 2018 Backblaze 11 | 12 | Permission is hereby granted, free of charge, to any person obtaining a copy 13 | of this software and associated documentation files (the "Software"), to deal 14 | in the Software without restriction, including without limitation the rights 15 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 16 | copies of the Software, and to permit persons to whom the Software is 17 | furnished to do so, subject to the following conditions: 18 | 19 | The above copyright notice and this permission notice shall be included in all 20 | copies or substantial portions of the Software. 21 | 22 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 23 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 24 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 25 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 26 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 27 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 28 | SOFTWARE. 29 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # erasure-coding-durability 2 | 3 | ## Overview 4 | 5 | This is a simple statistical model for calculating the probability of losing 6 | data that is stored using an erasure coding system, such as Reed-Solomon. 
7 | 8 | In erasure coding, each file stored is divided into D shards of the same length. 9 | Then, P parity shards are computed using erasure coding, resulting in D+P shards. 10 | Even if shards are lost, the original file can be recomputed from any D of the 11 | D+P shards stored. In other words, you can lose any P of the shards and still 12 | reconstruct the original file. 13 | 14 | What we would like to compute is the durability of data stored with erasure coding 15 | based on the durability of the individual shards. 16 | The durability is the probability of not losing the data over a period of time. 17 | The period of time we use here is one year, resulting in annual durability. 18 | 19 | Systems that use erasure coding to store data will replace shards that are lost. 20 | Once a shard is replaced, the data is fully intact again. Data is lost only when 21 | P+1 shards are all lost at the same time, before they are replaced. 22 | 23 | ## Assumptions 24 | 25 | To calculate the probability of loss, we need to make some assumptions: 26 | 27 | 1. Data is stored using *D* data shards and *P* parity shards, and is lost when *P+1* shards are lost. 28 | 1. The annual failure rate of each shard is *shard_annual_failure_rate*. 29 | 1. The number of days it takes to replace a failed shard is *shard_failure_days*. 30 | 1. The failures of individual shards are independent. 31 | 32 | ## Calculation 33 | 34 | The details of the calculations are in [calculation.ipynb](https://github.com/Backblaze/erasure-coding-durability/blob/master/calculation.ipynb). 
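As a rough illustration of how the four assumptions above combine, here is a minimal Python 3 sketch. It is not the code in durability.py: the function name `annual_loss_probability` is made up for this example, and it assumes Python 3.8 or newer for `math.comb`. It uses `math.expm1` and `math.log1p` as one possible way to keep precision with tiny probabilities, which is a different technique from the series expansion in durability.py.

```python
import math


def annual_loss_probability(data_shards, parity_shards,
                            annual_failure_rate, replacement_days):
    """Hypothetical helper: annual probability of losing data."""
    total_shards = data_shards + parity_shards

    # Failure rate of one shard over a single replacement period.
    rate_per_period = annual_failure_rate * replacement_days / 365.0

    # Poisson: probability of at least one failure in the period.
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    p = -math.expm1(-rate_per_period)

    # Binomial tail: probability that P+1 or more shards fail
    # within the same replacement period.
    loss_per_period = sum(
        math.comb(total_shards, k) * p ** k * (1.0 - p) ** (total_shards - k)
        for k in range(parity_shards + 1, total_shards + 1)
    )

    # Data survives the year only if it survives every period:
    # annual loss = 1 - (1 - loss_per_period) ** periods,
    # evaluated through log1p/expm1 to avoid cancellation.
    periods = 365.0 / replacement_days
    return -math.expm1(periods * math.log1p(-loss_per_period))


if __name__ == "__main__":
    loss = annual_loss_probability(17, 3, 0.00405, 6.5)
    print("annual loss probability: %.3e" % loss)
    print("annual durability: %.15f" % (1.0 - loss))
```

For 17 data shards, 3 parity shards, an annual shard failure rate of 0.00405 and a 6.5-day replacement time, the result should be close to the `annual_loss_rate` on the row marked `-->` in the sample output of durability.py.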
35 | 36 | ## Python code 37 | 38 | The Python code in 39 | [durability.py](https://github.com/Backblaze/erasure-coding-durability/blob/master/durability.py) 40 | does the calculations above, with a few tweaks 41 | to maintain precision when dealing with tiny numbers, and prints out the results 42 | for a given set of assumptions: 43 | 44 | ``` 45 | $ python durability.py 46 | usage: durability.py [-h] 47 | data_shards parity_shards annual_shard_failure_rate 48 | shard_replacement_days 49 | durability.py: error: too few arguments 50 | $ python durability.py 17 3 0.00405 6.5 51 | 52 | # 53 | # total shards: 20 54 | # replacement period (days): 6.5000 55 | # annual shard failure rate: 0.0040 56 | # 57 | 58 | |===================================================================================================================================| 59 | | failure_threshold | individual_prob | cumulative_prob | annual_loss_rate | annual_odds | durability | nines | 60 | |-----------------------------------------------------------------------------------------------------------------------------------| 61 | | 20 | 1.449e-83 | 1.449e-83 | 8.117e-82 | NEVER | 1.000000000000000 | 81 nines | 62 | | 19 | 4.019e-78 | 4.019e-78 | 2.251e-76 | NEVER | 1.000000000000000 | 75 nines | 63 | | 18 | 5.294e-73 | 5.294e-73 | 2.965e-71 | NEVER | 1.000000000000000 | 70 nines | 64 | | 17 | 4.404e-68 | 4.404e-68 | 2.466e-66 | NEVER | 1.000000000000000 | 65 nines | 65 | | 16 | 2.595e-63 | 2.595e-63 | 1.453e-61 | NEVER | 1.000000000000000 | 60 nines | 66 | | 15 | 1.151e-58 | 1.151e-58 | 6.447e-57 | NEVER | 1.000000000000000 | 56 nines | 67 | | 14 | 3.991e-54 | 3.991e-54 | 2.235e-52 | NEVER | 1.000000000000000 | 51 nines | 68 | | 13 | 1.107e-49 | 1.107e-49 | 6.197e-48 | NEVER | 1.000000000000000 | 47 nines | 69 | | 12 | 2.493e-45 | 2.493e-45 | 1.396e-43 | NEVER | 1.000000000000000 | 42 nines | 70 | | 11 | 4.609e-41 | 4.609e-41 | 2.581e-39 | NEVER | 1.000000000000000 | 38
nines | 71 | | 10 | 7.029e-37 | 7.029e-37 | 3.936e-35 | NEVER | 1.000000000000000 | 34 nines | 72 | | 9 | 8.859e-33 | 8.860e-33 | 4.962e-31 | NEVER | 1.000000000000000 | 30 nines | 73 | | 8 | 9.212e-29 | 9.213e-29 | 5.159e-27 | 5 in an octillion | 1.000000000000000 | 26 nines | 74 | | 7 | 7.860e-25 | 7.861e-25 | 4.402e-23 | 44 in a septillion | 1.000000000000000 | 22 nines | 75 | | 6 | 5.449e-21 | 5.450e-21 | 3.052e-19 | 305 in a sextillion | 1.000000000000000 | 18 nines | 76 | | 5 | 3.022e-17 | 3.022e-17 | 1.693e-15 | 2 in a quadrillion | 0.999999999999998 | 14 nines | 77 | | 4 | 1.309e-13 | 1.310e-13 | 7.354e-12 | 7 in a trillion | 0.999999999992646 | --> 11 nines | 78 | | 3 | 4.271e-10 | 4.273e-10 | 2.399e-08 | 24 in a billion | 0.999999976008104 | 7 nines | 79 | | 2 | 9.870e-07 | 9.874e-07 | 5.545e-05 | 55 in a million | 0.999944554648366 | 4 nines | 80 | | 1 | 1.440e-03 | 1.441e-03 | 7.781e-02 | 8 in a hundred | 0.922193691444580 | 1 nines | 81 | | 0 | 9.986e-01 | 1.000e+00 | 1.000e+00 | always | 0.000000000000000 | 0 nines | 82 | |===================================================================================================================================| 83 | ``` 84 | 85 | 86 | -------------------------------------------------------------------------------- /calculation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "In this document, we'll go over some probability and statistics, and then \n", 8 | "use that to model the data durability in a system storing data using erasure\n", 9 | "coding.\n", 10 | "\n", 11 | "# Probability and Statistics\n", 12 | "\n", 13 | "## Failure Rate\n", 14 | "\n", 15 | "The distinction between failure rates and the probability of failure causes\n", 16 | "lots of confusion. 
I have to think about it carefully each time I come back\n", 17 | "to the topic.\n", 18 | "\n", 19 | "The *failure rate* for a widget over a period of time is the average number \n", 20 | "of failures in that period per widget. An annual failure \n", 21 | "rate of 0.25 means that on average, there are 0.25 failures per widget.\n", 22 | "If you have 100 widgets for one year, there would be an average of 25 failures per year.\n", 23 | "\n", 24 | "A failure rate of 0.25 is frequently written as 25%.\n", 25 | "\n", 26 | "It's counter-intuitive, but failure rates for unreliable widgets can be over \n", 27 | "1.0 (100%). An annual failure rate of 12.0 (1200%) would mean that on average \n", 28 | "you would see 12 failures per year per widget. An annual failure rate of 12.0 (1200%)\n", 29 | "is the same thing as a monthly failure rate of 1.0 (100%), and is the same thing as a daily\n", 30 | "failure rate of 0.0333 (3.33%).\n", 31 | "\n", 32 | "If you're running a shop that requires 10 widgets, and the failure rate is 12,\n", 33 | "you'll go through a lot of widgets, and you'll have to keep getting replacements.\n", 34 | "Over the span of a year, you can expect to buy 120 new widgets as replacements so you\n", 35 | "can always have 10 running.\n", 36 | "\n", 37 | "## Probability of Failure\n", 38 | "\n", 39 | "Assuming that the probability of separate failures is independent, the probability of failure \n", 40 | "over a period of time is modeled with the\n", 41 | "[Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution).\n", 42 | "\n", 43 | "If you have an annual failure rate of 2.0 (200%), there may be 0, 1, 2, 3, or more\n", 44 | "failures in any year. 
The probability of exactly $k$ failures in one year is given\n", 45 | "by the formula below, where $\\lambda$ is the failure rate:\n", 46 | "\n", 47 | "$$\\text{probability of exactly k failures} = e^{-\\lambda} \\: \\frac{\\lambda^k}{k!}$$\n", 48 | "\n", 49 | "With $\\lambda = 2.0$, this formula tells us that the probability of no failures is $0.1353$, the probability of exactly one \n", 50 | "failure is $0.2707$, and so on:\n", 51 | "\n", 52 | "Number of Failures | Probability\n", 53 | "--- | ---\n", 54 | "0 | 0.1353\n", 55 | "1 | 0.2707\n", 56 | "2 | 0.2707\n", 57 | "3 | 0.1804\n", 58 | "4 | 0.0902\n", 59 | "5 | 0.0361\n", 60 | "6 | 0.0120\n", 61 | "... | ...\n", 62 | "\n", 63 | "The infinite sequence of probabilities adds up to 1.0.\n", 64 | "\n", 65 | "In practice, looking at one of the widgets in your shop, this means that there's a 13.5% chance\n", 66 | "you won't have to replace it in the year. There's a 27% chance you'll replace it once, a\n", 67 | "27% chance you'll replace it twice, an 18% chance you'll replace it three times, and so on.\n", 68 | "\n", 69 | "To calculate the probability of zero failures, you can simplify the formula:\n", 70 | "\n", 71 | "$$\\text{probability of 0 failures} = e^{-\\lambda} \\: \\frac{\\lambda^0}{0!} \\: = \\: \\: e^{-\\lambda}$$\n", 72 | "\n", 73 | "The probability of having at least one failure is the sum of entries from 1 on, which is \n", 74 | "the same as one minus the probability of 0 failures:\n", 75 | "\n", 76 | "$$\\text{probability of one or more failures} \\: = \\: 1 \\: - \\: e^{-\\lambda}$$\n", 77 | "\n", 78 | "We'll use this formula later to calculate the probability of failures given a failure rate.\n", 79 | "\n", 80 | "## Probability of $n$ Failures\n", 81 | "\n", 82 | "So what if you have $n$ widgets, and you want to know the probability that $k$ or more of them will fail?\n", 83 | "\n", 84 | "Let's look at an example: You have three widgets with an annual failure rate of 0.25 (25%).
\n", 85 | "What is the probability that 2 or more of the three widgets will fail in the year?\n", 86 | "\n", 87 | "We'll use $P$ for the probability of a given widget failing at least once in the period.\n", 88 | "Using the formula from the previous section, that probability is $0.2212$:\n", 89 | "\n", 90 | "$$P = 1 - e^{-0.25} \\approx 0.2212$$\n", 91 | "\n", 92 | "The probability of that widget not failing is $0.7788$:\n", 93 | "\n", 94 | "$$\\text{probability of not failing} = 1 - P \\approx 0.7788$$\n", 95 | "\n", 96 | "There are eight possible combinations of failure for the three widgets. The first one is that none of them fails. The probability for that is the product of the probabilities for each of the widgets not failing: $0.7788 \\times 0.7788 \\times 0.7788 = 0.4724$\n", 97 | "\n", 98 | "We can compute the probability of all eight cases by taking the probability that each\n", 99 | "widget will be OK or FAIL and multiplying them together. The resulting probabilities\n", 100 | "in the right column add up to 1.0:\n", 101 | "\n", 102 | "A | A prob | B | B prob | C | C prob | Probability\n", 103 | "--- | --- | --- | --- | --- | --- | ---\n", 104 | "ok | 0.7788 | ok | 0.7788 | ok | 0.7788 | 0.4724\n", 105 | "ok | 0.7788 | ok | 0.7788 | FAIL | 0.2212 | 0.1342\n", 106 | "ok | 0.7788 | FAIL | 0.2212 | ok | 0.7788 | 0.1342\n", 107 | "ok | 0.7788 | FAIL | 0.2212 | FAIL | 0.2212 | 0.0381\n", 108 | "FAIL | 0.2212 | ok | 0.7788 | ok | 0.7788 | 0.1342\n", 109 | "FAIL | 0.2212 | ok | 0.7788 | FAIL | 0.2212 | 0.0381\n", 110 | "FAIL | 0.2212 | FAIL | 0.2212 | ok | 0.7788 | 0.0381\n", 111 | "FAIL | 0.2212 | FAIL | 0.2212 | FAIL | 0.2212 | 0.0108\n", 112 | "\n", 113 | "To get the probability of two or more failing, we add up the probabilities of all\n", 114 | "rows that have two or more failures. The rows with exactly two failures add up\n", 115 | "to $0.1143$. The one row with three failures has a probability of $0.0108$.
Those\n", 116 | "add up to a probability of $0.1251$ for two or more failures.\n", 117 | "\n", 118 | "You'll notice that all of the rows with two failures have the same probability,\n", 119 | "which makes sense. The number of those rows is given by: $\\binom{3}{2}$, which \n", 120 | "is three. (This is called the [Binomial Coefficient](https://en.wikipedia.org/wiki/Binomial_coefficient).)\n", 121 | "\n", 122 | "So the probability of getting exactly two failures is:\n", 123 | "\n", 124 | "$$\\binom{3}{2} \\times P^2 \\times (1 - P)^1$$\n", 125 | "\n", 126 | "In general, the probability of getting exactly $k$ failures in $n$ widgets is:\n", 127 | "\n", 128 | "$$\\binom{n}{k} \\times P^k \\times (1 - P)^{(n - k)}$$\n", 129 | "\n", 130 | "If you want more information on this, you can read about the [Probability Mass Function](https://en.wikipedia.org/wiki/Binomial_distribution#Probability_mass_function) for a binomial.\n", 131 | "\n", 132 | "We can use this formula to summarize the table above by number of failures:\n", 133 | "\n", 134 | "Number of Failures | Probability\n", 135 | "--- | ---\n", 136 | "0 | 0.4724\n", 137 | "1 | 0.4025\n", 138 | "2 | 0.1143\n", 139 | "3 | 0.0108\n", 140 | "\n", 141 | "# Data Durability\n", 142 | "\n", 143 | "Now we get into calculating the durability of data stored with erasure\n", 144 | "coding, assuming a failure rate for each shard, and independent failures\n", 145 | "for each shard.\n", 146 | "\n", 147 | "First, some naming. 
We will use these names in the calculations:\n", 148 | "\n", 149 | "* $S$ is the total number of shards (data plus parity)\n", 150 | "* $R$ is the repair time for a shard in days: how long it takes to replace a shard after it fails\n", 151 | "* $A$ is the annual failure rate of one shard\n", 152 | "* $F$ is the failure rate of a shard in $R$ days\n", 153 | "* $P$ is the probability of a shard failing at least once in $R$ days\n", 154 | "* $D$ is the durability of data over $R$ days: not too many shards are lost\n", 155 | "\n", 156 | "With erasure coding, your data remains intact as long as you don't lose \n", 157 | "more shards than there are parity shards. If you do lose more, there\n", 158 | "is no way to recover the data.\n", 159 | "\n", 160 | "One of the assumptions we make is that it takes $R$ days to repair a failed\n", 161 | "shard. Let's start with a simpler problem and look at the data durability\n", 162 | "over a period of $R$ days. For a data loss to happen in this time period,\n", 163 | "at least $n$ shards would have to fail, where $n$ is one more than the number of parity shards.\n", 164 | "\n", 165 | "We will use $A$ to denote the annual failure rate of individual shards.\n", 166 | "Over one year, the chance that a shard will fail is evenly distributed over\n", 167 | "all of the $R$-day periods in the year.
We will use $F$ to denote the failure\n", 168 | "rate of one shard in an $R$-day period:\n", 169 | "\n", 170 | "$$F = A\\frac{R}{365}$$\n", 171 | "\n", 172 | "The probability of failure of a single shard in $R$ days is approximately $F$, when $F$ is small.\n", 173 | "The exact value, from the Poisson distribution, is:\n", 174 | "\n", 175 | "$$P = 1 \\: - \\: e^{-F}$$\n", 176 | "\n", 177 | "Given the probability of one shard failing, we can use the binomial distribution's \n", 178 | "probability mass function to calculate the probability of exactly $n$ of the $S$\n", 179 | "shards failing:\n", 180 | "\n", 181 | "$$\\binom{S}{n} \\: P^n \\: (1-P)^{S-n}$$\n", 182 | " \n", 183 | "We also lose data if more than $n$ shards fail in the period. To include those,\n", 184 | "we can sum the above formula for $n$ through $S$ shards, to get the probability of\n", 185 | "data loss in $R$ days:\n", 186 | "\n", 187 | "$$\\sum_{k=n}^{S} \\binom{S}{k} \\: P^k \\: (1-P)^{S-k}$$\n", 188 | " \n", 189 | "The durability in each period is the complement of that:\n", 190 | "\n", 191 | "$$D = 1 \\: - \\: \\sum_{k=n}^{S} \\binom{S}{k} \\: P^k \\: (1-P)^{S-k}$$\n", 192 | "\n", 193 | "Data survives the full year only if it survives \n", 194 | "every one of the $365/R$ periods, so the annual durability is the product of\n", 195 | "the per-period durabilities:\n", 196 | "\n", 197 | "$$D ^ {365/R}$$\n", 198 | "\n", 199 | "And that's the answer!\n" 200 | ] 201 | } 202 | ], 203 | "metadata": { 204 | "kernelspec": { 205 | "display_name": "Python 3", 206 | "language": "python", 207 | "name": "python3" 208 | }, 209 | "language_info": { 210 | "codemirror_mode": { 211 | "name": "ipython", 212 | "version": 3 213 | }, 214 | "file_extension": ".py", 215 | "mimetype": "text/x-python", 216 | "name": "python", 217 | "nbconvert_exporter": "python", 218 | "pygments_lexer": "ipython3", 219 | "version": "3.5.4" 220 | } 221 | }, 222 | "nbformat": 4, 223 | "nbformat_minor": 2 224 | } 225 |
-------------------------------------------------------------------------------- /durability.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | ###################################################################### 3 | # 4 | # File: durability.py 5 | # 6 | # Copyright 2018 Backblaze Inc. All Rights Reserved. 7 | # 8 | ###################################################################### 9 | 10 | import argparse 11 | import math 12 | import sys 13 | import unittest 14 | 15 | 16 | class Table(object): 17 | 18 | """ 19 | Knows how to display a table of data. 20 | 21 | The data is in the form of a list of dicts: 22 | 23 | [ { 'a' : 4, 'b' : 8 }, 24 | { 'a' : 5, 'b' : 9 } ] 25 | 26 | And is displayed like this: 27 | 28 | |=======| 29 | | a | b | 30 | |-------| 31 | | 4 | 8 | 32 | | 5 | 9 | 33 | |=======| 34 | """ 35 | 36 | def __init__(self, data, column_names): 37 | self.data = data 38 | self.column_titles = column_names 39 | self.column_widths = [ 40 | max(len(col), max(len(item[col]) for item in data)) 41 | for col in column_names 42 | ] 43 | 44 | def __str__(self): 45 | result = [] 46 | 47 | # Title row 48 | total_width = 1 + sum(3 + w for w in self.column_widths) 49 | result.append('|') 50 | result.append('=' * (total_width - 2)) 51 | result.append('|') 52 | result.append('\n') 53 | result.append('| ') 54 | for (col, w) in zip(self.column_titles, self.column_widths): 55 | result.append(self.pad(col, w)) 56 | result.append(' | ') 57 | result.append('\n') 58 | result.append('|') 59 | result.append('-' * (total_width - 2)) 60 | result.append('|') 61 | result.append('\n') 62 | 63 | # Data rows 64 | for item in self.data: 65 | result.append('| ') 66 | for (col, w) in zip(self.column_titles, self.column_widths): 67 | result.append(self.pad(item[col], w)) 68 | result.append(' | ') 69 | result.append('\n') 70 | result.append('|') 71 | result.append('=' * (total_width - 2)) 72 | result.append('|') 73 | 
result.append('\n') 74 | 75 | return ''.join(result) 76 | 77 | def pad(self, s, width): 78 | if len(s) < width: 79 | return (' ' * (width - len(s))) + s 80 | else: 81 | return s[:width] 82 | 83 | 84 | def print_markdown_table(data, column_names): 85 | print 86 | print ' | '.join(column_names) 87 | print ' | '.join(['---'] * len(column_names)) 88 | for item in data: 89 | print ' | '.join(item[cn] for cn in column_names) 90 | print 91 | 92 | 93 | def factorial(n): 94 | if n == 0: 95 | return 1 96 | else: 97 | return n * factorial(n - 1) 98 | 99 | 100 | def choose(n, r): 101 | """ 102 | Returns: How many ways there are to choose a subset of r things from a set of n things. 103 | 104 | Computes n! / (r! (n-r)!) exactly. Returns a Python long int. 105 | 106 | From: http://stackoverflow.com/questions/3025162/statistics-combinations-in-python 107 | """ 108 | assert n >= 0 109 | assert 0 <= r <= n 110 | 111 | c = 1L 112 | for num, denom in zip(xrange(n, n-r, -1), xrange(1, r+1, 1)): 113 | c = (c * num) // denom 114 | return c 115 | 116 | 117 | def binomial_probability(k, n, p): 118 | """ 119 | Returns: The probability of exactly k of n things happening, when the 120 | probability of each one (independently) is p.
121 | 122 | See: https://en.wikipedia.org/wiki/Binomial_distribution#Probability_mass_function 123 | """ 124 | return choose(n, k) * (p ** k) * ((1 - p) ** (n - k)) 125 | 126 | 127 | class TestBinomialProbability(unittest.TestCase): 128 | 129 | def test_binomial_probability(self): 130 | # these test cases are from the Wikipedia page 131 | self.assertAlmostEqual(0.117649, binomial_probability(0, 6, 0.3)) 132 | self.assertAlmostEqual(0.302526, binomial_probability(1, 6, 0.3)) 133 | self.assertAlmostEqual(0.324135, binomial_probability(2, 6, 0.3)) 134 | 135 | # Wolfram Alpha: (1 - 1e-6)^800 136 | self.assertAlmostEqual(0.9992003, binomial_probability(0, 800, 1.0e-6)) 137 | 138 | 139 | def probability_of_failure_for_failure_rate(f): 140 | """ 141 | Given a failure rate f, what's the probability of at least one failure? 142 | """ 143 | probability_of_no_failures = math.exp(-f) 144 | return 1.0 - probability_of_no_failures 145 | 146 | 147 | def probability_of_failure_in_any_period(p, n): 148 | """ 149 | Returns the probability that a failure (of probability p in one period) 150 | happens once or more in n periods. 151 | 152 | The probability of failure in one period is p, so the probability 153 | of not failing is (1 - p). So the probability of not 154 | failing over n periods is (1 - p) ** n, and the probability 155 | of one or more failures in n periods is: 156 | 157 | 1 - (1 - p) ** n 158 | 159 | Doing the math without losing precision is tricky. 160 | After the binomial expansion, you get (for even n): 161 | 162 | a = 1 - (1 - choose(n, 1) * p + choose(n, 2) * p**2 - choose(n, 3) * p**3 + choose(n, 4) * p**4 ... + choose(n, n) * p**n) 163 | 164 | For odd n, the last term is negative. 165 | 166 | To avoid precision loss, we don't want to compute (1 - p) if p is 167 | really tiny, so we'll cancel out the 1 168 | and get: 169 | 170 | a = choose(n, 1) * p - choose(n, 2) * p**2 ... 171 | """ 172 | if p < 0.01: 173 | # For tiny numbers, (1 - p) can lose precision.
174 | # First, compute the result for the integer part 175 | n_int = int(n) 176 | result = 0.0 177 | sign = 1 178 | for i in xrange(1, n_int + 1): 179 | p_exp_i = p ** i 180 | if p_exp_i != 0: 181 | result += sign * choose(n_int, i) * (p ** i) 182 | sign = -sign 183 | # Adjust the result to include the fractional part 184 | # What we want is: 1.0 - (1.0 - result) * ((1.0 - p) ** (n - n_int)) 185 | # Which gives this when refactored: 186 | result = 1.0 - ((1.0 - p) ** (n - n_int)) + result * ((1.0 - p) ** (n - n_int)) 187 | return result 188 | else: 189 | # For high probabilities of loss, the powers of p don't 190 | # get small faster than the coefficients get big, and weird 191 | # things happen 192 | return 1.0 - (1.0 - p) ** n 193 | 194 | 195 | class TestProbabilityOfFailureAnyPeriod(unittest.TestCase): 196 | 197 | def test_probability_of_failure(self): 198 | # Easy to check 199 | self.assertAlmostEqual(0.25, probability_of_failure_in_any_period(0.25, 1)) 200 | self.assertAlmostEqual(0.4375, probability_of_failure_in_any_period(0.25, 2)) 201 | self.assertAlmostEqual(0.0199, probability_of_failure_in_any_period(0.01, 2)) 202 | 203 | # From Wolfram Alpha, some tests with tiny probabilities: 204 | self.assertAlmostEqual(2.0, probability_of_failure_in_any_period(1e-10, 200) * 1e8) 205 | self.assertAlmostEqual(2.0, probability_of_failure_in_any_period(1e-30, 200) * 1e28) 206 | self.assertAlmostEqual(7.60690480739, probability_of_failure_in_any_period(3.47347251479e-103, 2190) * 1e100) 207 | 208 | # Check fractional exponents 209 | self.assertAlmostEqual(0.1339746, probability_of_failure_in_any_period(0.25, 0.5)) 210 | self.assertAlmostEqual(0.0345647, probability_of_failure_in_any_period(0.01, 3.5)) 211 | 212 | 213 | SCALE_TABLE = [ 214 | (1, 'ten'), 215 | (2, 'a hundred'), 216 | (3, 'a thousand'), 217 | (6, 'a million'), 218 | (9, 'a billion'), 219 | (12, 'a trillion'), 220 | (15, 'a quadrillion'), 221 | (18, 'a quintillion'), 222 | (21, 'a sextillion'), 223 | (24, 
'a septillion'), 224 | (27, 'an octillion') 225 | ] 226 | 227 | 228 | def pretty_probability(p): 229 | """ 230 | Takes a number between 0 and 1 and prints it as a probability in 231 | the form "5 in a million" 232 | """ 233 | if abs(p - 1.0) < 0.01: 234 | return 'always' 235 | for (power, name) in SCALE_TABLE: 236 | count = p * (10.0 ** power) 237 | if count >= 0.90: 238 | return '%d in %s' % (round(count), name) 239 | return 'NEVER' 240 | 241 | 242 | def count_nines(loss_rate): 243 | """ 244 | Returns the number of nines after the decimal point before some other digit happens. 245 | """ 246 | nines = 0 247 | power_of_ten = 0.1 248 | while True: 249 | if power_of_ten < loss_rate: 250 | return nines 251 | power_of_ten /= 10.0 252 | nines += 1 253 | if power_of_ten == 0.0: 254 | return 0 255 | 256 | 257 | def do_scenario(total_shards, min_shards, annual_shard_failure_rate, shard_replacement_days): 258 | """ 259 | Calculates the cumulative failure rates for different numbers of 260 | failures, starting with the most possible, down to 0. 261 | 262 | The first probability in the table will be extremely improbable, 263 | because it is the case where ALL of the shards fail. The next 264 | line in the table is the case where either all of the shards fail, 265 | or all but one fail. The final row in the table is the case where 266 | somewhere between all fail and none fail, which always happens, so 267 | the probability is one. 
268 | """ 269 | 270 | num_periods = 365.0 / shard_replacement_days 271 | failure_rate_per_period = annual_shard_failure_rate / num_periods 272 | 273 | print 274 | print '#' 275 | print '# total shards:', total_shards 276 | print '# replacement period (days): %6.4f' % (shard_replacement_days) 277 | print '# annual shard failure rate: %6.4f' % (annual_shard_failure_rate) 278 | print '#' 279 | print 280 | 281 | failure_probability_per_period = 1.0 - math.exp(-failure_rate_per_period) 282 | data = [] 283 | period_cumulative_prob = 0.0 284 | for failed_shards in xrange(total_shards, -1, -1): 285 | period_failure_prob = binomial_probability(failed_shards, total_shards, failure_probability_per_period) 286 | period_cumulative_prob += period_failure_prob 287 | annual_loss_prob = probability_of_failure_in_any_period(period_cumulative_prob, num_periods) 288 | nines = '%d nines' % count_nines(annual_loss_prob) 289 | if failed_shards == total_shards - min_shards + 1: 290 | nines = "--> " + nines 291 | data.append({ 292 | 'individual_prob' : ('%10.3e' % period_failure_prob), 293 | 'failure_threshold' : str(failed_shards), 294 | 'cumulative_prob' : ('%10.3e' % period_cumulative_prob), 295 | 'cumulative_odds' : pretty_probability(period_cumulative_prob), 296 | 'annual_loss_rate' : ('%10.3e' % annual_loss_prob), 297 | 'annual_odds' : pretty_probability(annual_loss_prob), 298 | 'durability' : '%17.15f' % (1.0 - annual_loss_prob), 299 | 'nines' : nines 300 | }) 301 | 302 | print Table(data, ['failure_threshold', 303 | 'individual_prob', 304 | 'cumulative_prob', 305 | 'annual_loss_rate', 306 | 'annual_odds', 307 | 'durability', 308 | 'nines' 309 | ]) 310 | print 311 | 312 | return dict( 313 | (item['failure_threshold'], item) 314 | for item in data 315 | ) 316 | 317 | 318 | def example(): 319 | """ 320 | This is the example in the explanation. 
321 | """ 322 | # Make the table of probabilities of k failures with a failure rate of 2.0: 323 | p = 2.0 324 | data = [ 325 | { 'k': str(k), 'p': '%6.4f' % (math.exp(-p) * p**k / factorial(k),) } 326 | for k in xrange(7) 327 | ] 328 | print_markdown_table(data, ['k', 'p']) 329 | 330 | print 'Probability of n Failing' 331 | annual_rate = 0.25 332 | p_one_failing = probability_of_failure_for_failure_rate(annual_rate) 333 | print 'probability of one failing: %6.4f' % p_one_failing 334 | print 'probability of none failing: %6.4f' % (1 - p_one_failing) 335 | print 'probability of three not failing: %6.4f' % (1 - p_one_failing) ** 3 336 | print 'probability of two or more failing: %6.4f' % (binomial_probability(2, 3, p_one_failing) + binomial_probability(3, 3, p_one_failing)) 337 | print 338 | probs = {'ok': (1 - p_one_failing), 'FAIL': p_one_failing} 339 | data = [] 340 | total_prob = 0.0 341 | for a in ['ok', 'FAIL']: 342 | for b in ['ok', 'FAIL']: 343 | for c in ['ok', 'FAIL']: 344 | data.append({ 345 | 'A': a, 346 | 'A prob': '%6.4f' % probs[a], 347 | 'B': b, 348 | 'B prob': '%6.4f' % probs[b], 349 | 'C': c, 350 | 'C prob': '%6.4f' % probs[c], 351 | 'Probability': '%6.4f' % (probs[a] * probs[b] * probs[c]) 352 | }) 353 | total_prob += probs[a] * probs[b] * probs[c] 354 | print_markdown_table(data, ['A', 'A prob', 'B', 'B prob', 'C', 'C prob', 'Probability']) 355 | print 'sum of probabilities: %6.4f' % total_prob 356 | print 357 | 358 | data = [ 359 | {'Number of Failures': str(k), 'Probability': '%6.4f' % binomial_probability(k, 3, p_one_failing)} 360 | for k in xrange(4) 361 | ] 362 | print_markdown_table(data, ['Number of Failures', 'Probability']) 363 | 364 | 365 | def main(): 366 | if sys.argv[1:] == ['test']: 367 | del sys.argv[1] 368 | unittest.main() 369 | elif sys.argv[1:] == ['example']: 370 | example() 371 | else: 372 | parser = argparse.ArgumentParser() 373 | parser.add_argument('data_shards', type=int), 374 | parser.add_argument('parity_shards', 
type=int), 375 | parser.add_argument('annual_shard_failure_rate', type=float), 376 | parser.add_argument('shard_replacement_days', type=float) 377 | args = parser.parse_args() 378 | total_shards = args.data_shards + args.parity_shards 379 | min_shards = args.data_shards 380 | do_scenario(total_shards, min_shards, args.annual_shard_failure_rate, args.shard_replacement_days) 381 | 382 | 383 | if __name__ == '__main__': 384 | main() 385 | --------------------------------------------------------------------------------