├── .gitignore
├── LICENSE
├── README.md
├── calculation.ipynb
└── durability.py


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | env/
 12 | build/
 13 | develop-eggs/
 14 | dist/
 15 | downloads/
 16 | eggs/
 17 | .eggs/
 18 | lib/
 19 | lib64/
 20 | parts/
 21 | sdist/
 22 | var/
 23 | wheels/
 24 | *.egg-info/
 25 | .installed.cfg
 26 | *.egg
 27 | 
 28 | # PyInstaller
 29 | #  Usually these files are written by a python script from a template
 30 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 31 | *.manifest
 32 | *.spec
 33 | 
 34 | # Installer logs
 35 | pip-log.txt
 36 | pip-delete-this-directory.txt
 37 | 
 38 | # Unit test / coverage reports
 39 | htmlcov/
 40 | .tox/
 41 | .coverage
 42 | .coverage.*
 43 | .cache
 44 | nosetests.xml
 45 | coverage.xml
 46 | *.cover
 47 | .hypothesis/
 48 | 
 49 | # Translations
 50 | *.mo
 51 | *.pot
 52 | 
 53 | # Django stuff:
 54 | *.log
 55 | local_settings.py
 56 | 
 57 | # Flask stuff:
 58 | instance/
 59 | .webassets-cache
 60 | 
 61 | # Scrapy stuff:
 62 | .scrapy
 63 | 
 64 | # Sphinx documentation
 65 | docs/_build/
 66 | 
 67 | # PyBuilder
 68 | target/
 69 | 
 70 | # Jupyter Notebook
 71 | .ipynb_checkpoints
 72 | 
 73 | # pyenv
 74 | .python-version
 75 | 
 76 | # celery beat schedule file
 77 | celerybeat-schedule
 78 | 
 79 | # SageMath parsed files
 80 | *.sage.py
 81 | 
 82 | # dotenv
 83 | .env
 84 | 
 85 | # virtualenv
 86 | .venv
 87 | venv/
 88 | ENV/
 89 | 
 90 | # Spyder project settings
 91 | .spyderproject
 92 | .spyproject
 93 | 
 94 | # Rope project settings
 95 | .ropeproject
 96 | 
 97 | # mkdocs documentation
 98 | /site
 99 | 
100 | # mypy
101 | .mypy_cache/
102 | 
103 | # JetBrains
104 | .idea
105 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Backblaze wants developers and organizations to copy and re-use our
 2 | code examples, so we make the samples available by several different
 3 | licenses.  One option is the MIT license (below).  Other options are
 4 | available here:
 5 | 
 6 |     https://www.backblaze.com/using_b2_code.html
 7 | 
 8 | MIT License
 9 | 
10 | Copyright (c) 2018 Backblaze
11 | 
12 | Permission is hereby granted, free of charge, to any person obtaining a copy
13 | of this software and associated documentation files (the "Software"), to deal
14 | in the Software without restriction, including without limitation the rights
15 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
16 | copies of the Software, and to permit persons to whom the Software is
17 | furnished to do so, subject to the following conditions:
18 | 
19 | The above copyright notice and this permission notice shall be included in all
20 | copies or substantial portions of the Software.
21 | 
22 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
23 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
24 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
25 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
26 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
27 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
28 | SOFTWARE.
29 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # erasure-coding-durability
 2 | 
 3 | ## Overview
 4 | 
 5 | This is a simple statistical model for calculating the probability of losing
 6 | data that is stored using an erasure coding system, such as Reed-Solomon.
 7 | 
 8 | In erasure coding, each file stored is divided into D shards of the same length.
 9 | Then, P parity shards are computed using erasure coding, resulting in D+P shards.
10 | Even if shards are lost, the original file can be recomputed from any D of the
11 | D+P shards stored.  In other words, you can lose any P of the shards and still 
12 | reconstruct the original file.
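
To make the "any D of the D+P shards" property concrete, here is a minimal sketch of the simplest possible case: a single XOR parity shard (P = 1). This is not Reed-Solomon, which supports P > 1 and needs finite-field arithmetic; XOR parity is swapped in only because it fits in a few lines. The function names `encode_with_xor_parity` and `recover_one_lost_shard` are hypothetical and are not part of this repository.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_with_xor_parity(data, num_data_shards):
    """Split data into D equal-length shards and append one XOR parity shard."""
    shard_len = -(-len(data) // num_data_shards)  # ceiling division
    padded = data.ljust(shard_len * num_data_shards, b"\0")
    shards = [
        padded[i * shard_len:(i + 1) * shard_len]
        for i in range(num_data_shards)
    ]
    parity = bytes(shard_len)  # all zeros
    for shard in shards:
        parity = xor_bytes(parity, shard)
    return shards + [parity]


def recover_one_lost_shard(shards):
    """Rebuild the single missing shard (marked as None) from the others."""
    missing = shards.index(None)
    shard_len = len(next(s for s in shards if s is not None))
    rebuilt = bytes(shard_len)
    for i, shard in enumerate(shards):
        if i != missing:
            rebuilt = xor_bytes(rebuilt, shard)
    result = list(shards)
    result[missing] = rebuilt
    return result


shards = encode_with_xor_parity(b"hello world!", 4)  # D = 4, P = 1
damaged = list(shards)
damaged[2] = None  # lose any one shard
restored = recover_one_lost_shard(damaged)
```

Because the XOR of all D+1 shards is zero, any one missing shard equals the XOR of the remaining D. Losing two shards at once is unrecoverable here, which is the P+1 condition with P = 1.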
13 | 
14 | What we would like to compute is the durability of data stored with erasure coding
15 | based on the durability of the individual shards.
16 | The durability is the probability of not losing the data over a period of time.
17 | The period of time we use here is one year, resulting in annual durability.
18 |   
19 | Systems that use erasure coding to store data will replace shards that are lost.
20 | Once a shard is replaced, the data is fully intact again.  Data is lost only when
21 | P+1 shards are all lost at the same time, before they are replaced.
22 | 
23 | ## Assumptions
24 | 
25 | To calculate the probability of loss, we need to make some assumptions:
26 | 
27 | 1. Data is stored using *D* data shards and *P* parity shards, and is lost when *P+1* shards are lost.
 28 | 1. The annual failure rate of each shard is *annual_shard_failure_rate*.
 29 | 1. The number of days it takes to replace a failed shard is *shard_replacement_days*.
30 | 1. The failures of individual shards are independent.
31 | 
32 | ## Calculation
33 | 
34 | The details of the calculations are in [calculation.ipynb](https://github.com/Backblaze/erasure-coding-durability/blob/master/calculation.ipynb).
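
As a rough illustration of how the assumptions combine, the following sketch computes the annual probability of data loss in three steps: a Poisson step for one shard in one replacement period, a binomial tail for P+1 or more shards failing in that period, and a product over all periods in the year. The function name `annual_loss_probability` is hypothetical. The sketch uses `math.expm1` and `math.log1p` to keep precision with tiny numbers, which is an assumption of this example and not the series expansion that `durability.py` uses. It needs Python 3.8 or later for `math.comb`.

```python
import math


def annual_loss_probability(data_shards, parity_shards,
                            annual_shard_failure_rate,
                            shard_replacement_days):
    """Probability of losing data within one year (hypothetical helper)."""
    total_shards = data_shards + parity_shards
    periods_per_year = 365.0 / shard_replacement_days

    # Failure rate of one shard within one replacement period.
    rate_per_period = annual_shard_failure_rate / periods_per_year

    # Poisson: probability of at least one failure, 1 - exp(-rate).
    p = -math.expm1(-rate_per_period)

    # Binomial tail: P+1 or more of the shards fail in the same period.
    loss_per_period = sum(
        math.comb(total_shards, k) * p ** k * (1.0 - p) ** (total_shards - k)
        for k in range(parity_shards + 1, total_shards + 1)
    )
    if loss_per_period >= 1.0:
        return 1.0

    # Loss in any period of the year: 1 - (1 - loss_per_period) ** periods.
    return -math.expm1(periods_per_year * math.log1p(-loss_per_period))


loss = annual_loss_probability(17, 3, 0.00405, 6.5)
print("annual loss probability: %.3e" % loss)
print("annual durability: %.15f" % (1.0 - loss))
```

With 17 data shards, 3 parity shards, an annual shard failure rate of 0.00405, and a 6.5-day replacement time, the result should land close to the `annual_loss_rate` in the row marked `-->` in the output table in the next section.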
35 | 
36 | ## Python code
37 | 
 38 | The Python code in 
39 | [durability.py](https://github.com/Backblaze/erasure-coding-durability/blob/master/durability.py)
40 | does the calculations above, with a few tweaks
41 | to maintain precision when dealing with tiny numbers, and prints out the results
42 | for a given set of assumptions:
43 | 
44 | ```
45 | $ python durability.py
46 | usage: durability.py [-h]
47 |                      data_shards parity_shards annual_shard_failure_rate
48 |                      shard_replacement_days
49 | durability.py: error: too few arguments
 50 | $ python durability.py 17 3 0.00405 6.5
51 | 
52 | #
53 | # total shards: 20
54 | # replacement period (days): 6.5000
55 | # annual shard failure rate: 0.0040
56 | #
57 | 
58 | |===================================================================================================================================|
59 | | failure_threshold | individual_prob | cumulative_prob | annual_loss_rate |         annual_odds |        durability |        nines | 
60 | |-----------------------------------------------------------------------------------------------------------------------------------|
61 | |                20 |       1.449e-83 |       1.449e-83 |        8.117e-82 |               NEVER | 1.000000000000000 |     81 nines | 
62 | |                19 |       4.019e-78 |       4.019e-78 |        2.251e-76 |               NEVER | 1.000000000000000 |     75 nines | 
63 | |                18 |       5.294e-73 |       5.294e-73 |        2.965e-71 |               NEVER | 1.000000000000000 |     70 nines | 
64 | |                17 |       4.404e-68 |       4.404e-68 |        2.466e-66 |               NEVER | 1.000000000000000 |     65 nines | 
65 | |                16 |       2.595e-63 |       2.595e-63 |        1.453e-61 |               NEVER | 1.000000000000000 |     60 nines | 
66 | |                15 |       1.151e-58 |       1.151e-58 |        6.447e-57 |               NEVER | 1.000000000000000 |     56 nines | 
67 | |                14 |       3.991e-54 |       3.991e-54 |        2.235e-52 |               NEVER | 1.000000000000000 |     51 nines | 
68 | |                13 |       1.107e-49 |       1.107e-49 |        6.197e-48 |               NEVER | 1.000000000000000 |     47 nines | 
69 | |                12 |       2.493e-45 |       2.493e-45 |        1.396e-43 |               NEVER | 1.000000000000000 |     42 nines | 
70 | |                11 |       4.609e-41 |       4.609e-41 |        2.581e-39 |               NEVER | 1.000000000000000 |     38 nines | 
71 | |                10 |       7.029e-37 |       7.029e-37 |        3.936e-35 |               NEVER | 1.000000000000000 |     34 nines | 
72 | |                 9 |       8.859e-33 |       8.860e-33 |        4.962e-31 |               NEVER | 1.000000000000000 |     30 nines | 
73 | |                 8 |       9.212e-29 |       9.213e-29 |        5.159e-27 |   5 in an octillion | 1.000000000000000 |     26 nines | 
74 | |                 7 |       7.860e-25 |       7.861e-25 |        4.402e-23 |  44 in a septillion | 1.000000000000000 |     22 nines | 
75 | |                 6 |       5.449e-21 |       5.450e-21 |        3.052e-19 | 305 in a sextillion | 1.000000000000000 |     18 nines | 
76 | |                 5 |       3.022e-17 |       3.022e-17 |        1.693e-15 |  2 in a quadrillion | 0.999999999999998 |     14 nines | 
77 | |                 4 |       1.309e-13 |       1.310e-13 |        7.354e-12 |     7 in a trillion | 0.999999999992646 | --> 11 nines | 
78 | |                 3 |       4.271e-10 |       4.273e-10 |        2.399e-08 |     24 in a billion | 0.999999976008104 |      7 nines | 
79 | |                 2 |       9.870e-07 |       9.874e-07 |        5.545e-05 |     55 in a million | 0.999944554648366 |      4 nines | 
80 | |                 1 |       1.440e-03 |       1.441e-03 |        7.781e-02 |      8 in a hundred | 0.922193691444580 |      1 nines | 
81 | |                 0 |       9.986e-01 |       1.000e+00 |        1.000e+00 |              always | 0.000000000000000 |      0 nines | 
82 | |===================================================================================================================================|
83 | ```
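
The `nines` column counts the leading nines after the decimal point in the durability. As a small sketch of how to read it, the count can also be obtained in closed form from the annual loss rate. The helper name `nines_from_loss_rate` is hypothetical, and the logarithm is an alternative to the counting loop in `durability.py`, not a copy of it.

```python
import math


def nines_from_loss_rate(annual_loss_rate):
    """Leading nines in (1 - annual_loss_rate), via a base-10 logarithm."""
    if annual_loss_rate <= 0.0:
        raise ValueError("loss rate must be positive")
    if annual_loss_rate >= 1.0:
        return 0
    return int(math.floor(-math.log10(annual_loss_rate)))


# Loss rates taken from the table above.
for rate in (7.354e-12, 2.399e-08, 5.545e-05, 7.781e-02):
    print("%.3e -> %d nines" % (rate, nines_from_loss_rate(rate)))
```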
84 | 
85 | 
86 | 


--------------------------------------------------------------------------------
/calculation.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "In this document, we'll go over some probability and statistics, and then \n",
  8 |     "use that to model the data durability in a system storing data using erasure\n",
  9 |     "coding.\n",
 10 |     "\n",
 11 |     "# Probability and Statistics\n",
 12 |     "\n",
 13 |     "## Failure Rate\n",
 14 |     "\n",
 15 |     "The distinction between failure rates and the probability of failure causes\n",
 16 |     "lots of confusion.  I have to think about it carefully each time I come back\n",
 17 |     "to the topic.\n",
 18 |     "\n",
 19 |     "The *failure rate* for a widget over a period of time is the average number \n",
 20 |     "of failures in that period per widget.  An annual failure \n",
 21 |     "rate of 0.25 means that on average, there are 0.25 failures per widget.\n",
 22 |     "If you have 100 widgets for one year, there would be an average of 25 failures per year.\n",
 23 |     "\n",
 24 |     "A failure rate of 0.25 is frequently written as 25%.\n",
 25 |     "\n",
 26 |     "It's counter-intuitive, but failure rates for unreliable widgets can be over \n",
 27 |     "1.0 (100%).  An annual failure rate of 12.0 (1200%) would mean that on average \n",
 28 |     "you would see 12 failures per year per widget.  An annual failure rate of 12.0 (1200%)\n",
 29 |     "is the same thing as a monthly failure rate of 1.0 (100%), and is the same thing as a daily\n",
 30 |     "failure rate of 0.0333 (3.33%).\n",
 31 |     "\n",
 32 |     "If you're running a shop that requires 10 widgets, and the failure rate is 12,\n",
 33 |     "you'll go through a lot of widgets, and you'll have to keep getting replacements.\n",
 34 |     "Over the span of a year, you can expect to buy 120 new widgets as replacements so you\n",
 35 |     "can always have 10 running.\n",
 36 |     "\n",
 37 |     "## Probability of Failure\n",
 38 |     "\n",
 39 |     "Assuming that the probability of separate failures is independent, the probability of failure \n",
 40 |     "over a period of time is modeled with the\n",
 41 |     "[Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution).\n",
 42 |     "\n",
 43 |     "If you have an annual failure rate of 2.0 (200%), there may be 0, 1, 2, 3, or more\n",
 44 |     "failures in any year.  The probability of exactly $k$ failures in one year is given\n",
 45 |     "by the formula, where $\\lambda$ is the failure rate for the period (here, 2.0):\n",
 46 |     "\n",
 47 |     "$$\\text{probability of exactly k failures} = e^{-\\lambda} \\: \\frac{\\lambda^k}{k!}$$\n",
 48 |     "\n",
 49 |     "This formula tells us that the probability of no failures is $0.1353$, the probability of exactly one \n",
 50 |     "failure is $0.2707$, and so on:\n",
 51 |     "\n",
 52 |     "Number of Failures | Probability\n",
 53 |     "--- | ---\n",
 54 |     "0 | 0.1353\n",
 55 |     "1 | 0.2707\n",
 56 |     "2 | 0.2707\n",
 57 |     "3 | 0.1804\n",
 58 |     "4 | 0.0902\n",
 59 |     "5 | 0.0361\n",
 60 |     "6 | 0.0120\n",
 61 |     "... | ...\n",
 62 |     "\n",
 63 |     "If you add up the infinite sequence of probabilities, the total is 1.0.\n",
 64 |     "\n",
 65 |     "In practice, looking at one of the widgets in your shop, this means that there's a 13.5% chance\n",
 66 |     "you won't have to replace it in the year.  There's a 27% chance you'll replace it once, a\n",
 67 |     "27% chance you'll replace it twice, an 18% chance you'll replace it three times, and so on.\n",
 68 |     "\n",
 69 |     "To calculate the probability of zero failures, you can simplify the formula:\n",
 70 |     "\n",
 71 |     "$$\\text{probability of 0 failures} = e^{-\\lambda} \\: \\frac{\\lambda^0}{0!} \\: = \\: \\: e^{-\\lambda}$$\n",
 72 |     "\n",
 73 |     "The probability of having at least one failure is the sum of entries from 1 on, which is \n",
 74 |     "the same as one minus the probability of 0 failures:\n",
 75 |     "\n",
 76 |     "$$\\text{probability of one or more failures} \\: = \\: 1 \\: - \\: e^{-\\lambda}$$\n",
 77 |     "\n",
 78 |     "We'll use this formula later to calculate the probability of failures given a failure rate.\n",
 79 |     "\n",
 80 |     "## Probability of $n$ Failures\n",
 81 |     "\n",
 82 |     "So what if you have $n$ widgets, and you want to know if $k$ or more of them will fail?\n",
 83 |     "\n",
 84 |     "Let's look at an example: You have three widgets with an annual failure rate of 0.25 (25%).  \n",
 85 |     "What is the probability that 2 or more of the three widgets will fail in the year?\n",
 86 |     "\n",
 87 |     "We'll use $P$ for the probability of a given widget failing at least once in the period.\n",
 88 |     "Using the formula from the previous section, that probability is $0.2212$:\n",
 89 |     "\n",
 90 |     "$$P = 1 - e^{-0.25} \\approx 0.2212$$\n",
 91 |     "\n",
 92 |     "The probability of that widget not failing is $0.7788$:\n",
 93 |     "\n",
 94 |     "$$\\text{probability of not failing} = 1 - P \\approx 0.7788$$\n",
 95 |     "\n",
 96 |     "There are eight possible combinations of failure for the three widgets.  The first one is that none of them fail.  The probability for that is the product of the probabilities for each of the widgets not failing: $0.7788 \\times 0.7788 \\times 0.7788 = 0.4724$\n",
 97 |     "\n",
 98 |     "We can compute the probability of all eight cases by taking the probability that each\n",
 99 |     "widget will be OK or FAIL and multiplying them together.  The resulting probabilities\n",
100 |     "in the right column add up to 1.0:\n",
101 |     "\n",
102 |     "A | A prob | B | B prob | C | C prob | Probability\n",
103 |     "--- | --- | --- | --- | --- | --- | ---\n",
104 |     "ok | 0.7788 | ok | 0.7788 | ok | 0.7788 | 0.4724\n",
105 |     "ok | 0.7788 | ok | 0.7788 | FAIL | 0.2212 | 0.1342\n",
106 |     "ok | 0.7788 | FAIL | 0.2212 | ok | 0.7788 | 0.1342\n",
107 |     "ok | 0.7788 | FAIL | 0.2212 | FAIL | 0.2212 | 0.0381\n",
108 |     "FAIL | 0.2212 | ok | 0.7788 | ok | 0.7788 | 0.1342\n",
109 |     "FAIL | 0.2212 | ok | 0.7788 | FAIL | 0.2212 | 0.0381\n",
110 |     "FAIL | 0.2212 | FAIL | 0.2212 | ok | 0.7788 | 0.0381\n",
111 |     "FAIL | 0.2212 | FAIL | 0.2212 | FAIL | 0.2212 | 0.0108\n",
112 |     "\n",
113 |     "To get the probability of two or more failing, we add up the probabilities of all\n",
114 |     "rows that have two or more failures.  The rows with exactly two failures add up\n",
115 |     "to $0.1143$.  The one row with three failures has a probability of $0.0108$.  Those\n",
116 |     "add up to a probability of $0.1251$ for two or more failures.\n",
117 |     "\n",
118 |     "You'll notice that all of the rows with two failures have the same probability,\n",
119 |     "which makes sense.  The number of those rows is given by: $\\binom{3}{2}$, which \n",
120 |     "is three.  (This is called the [Binomial Coefficient](https://en.wikipedia.org/wiki/Binomial_coefficient).)\n",
121 |     "\n",
122 |     "So the probability of getting exactly two failures is:\n",
123 |     "\n",
124 |     "$$\\binom{3}{2} \\times P^2 \\times (1 - P)^1$$\n",
125 |     "\n",
126 |     "In general, the probability of getting exactly $k$ failures in $n$ widgets is:\n",
127 |     "\n",
128 |     "$$\\binom{n}{k} \\times P^k \\times (1 - P)^{(n - k)}$$\n",
129 |     "\n",
130 |     "If you want more information on this, you can read about the [Probability Mass Function](https://en.wikipedia.org/wiki/Binomial_distribution#Probability_mass_function) for a binomial.\n",
131 |     "\n",
132 |     "We can use this formula to summarize the table above by number of failures:\n",
133 |     "\n",
134 |     "Number of Failures | Probability\n",
135 |     "--- | ---\n",
136 |     "0 | 0.4724\n",
137 |     "1 | 0.4025\n",
138 |     "2 | 0.1143\n",
139 |     "3 | 0.0108\n",
140 |     "\n",
141 |     "# Data Durability\n",
142 |     "\n",
143 |     "Now we get into calculating the durability of data stored with erasure\n",
144 |     "coding, assuming a failure rate for each shard, and independent failures\n",
145 |     "for each shard.\n",
146 |     "\n",
147 |     "First, some naming.  We will use these names in the calculations:\n",
148 |     "\n",
149 |     "* $S$ is the total number of shards (data plus parity)\n",
150 |     "* $R$ is the repair time for a shard in days: how long it takes to replace a shard after it fails\n",
151 |     "* $A$ is the annual failure rate of one shard\n",
152 |     "* $F$ is the failure rate of a shard in $R$ days\n",
153 |     "* $P$ is the probability of a shard failing at least once in $R$ days\n",
154 |     "* $D$ is the durability of data over $R$ days: not too many shards are lost\n",
155 |     "\n",
156 |     "With erasure coding, your data remains intact as long as you don't lose \n",
157 |     "more shards than there are parity shards.  If you do lose more, there\n",
158 |     "is no way to recover the data.\n",
159 |     "\n",
160 |     "One of the assumptions we make is that it takes $R$ days to repair a failed\n",
161 |     "shard.  Let's start with a simpler problem and look at the data durability\n",
162 |     "over a period of $R$ days.  For a data loss to happen in this time period,\n",
163 |     "at least $n$ shards would have to fail, where $n$ is the number of parity shards plus one.\n",
164 |     "\n",
165 |     "We will use $A$ to denote the annual failure rate of individual shards.\n",
166 |     "Over one year, the chance that a shard will fail is evenly distributed over\n",
167 |     "all of the $R$-day periods in the year.  We will use $F$ to denote the failure\n",
168 |     "rate of one shard in an $R$-day period:\n",
169 |     "\n",
170 |     "$$F = A\\frac{R}{365}$$\n",
171 |     "\n",
172 |     "The probability of failure of a single shard in $R$ days is approximately $F$, when $F$ is small.\n",
173 |     "The exact value, from the Poisson distribution, is:\n",
174 |     "\n",
175 |     "$$P = 1 \\: - \\: e^{-F}$$\n",
176 |     "\n",
177 |     "Given the probability of one shard failing, we can use the binomial distribution's \n",
178 |     "probability mass function to calculate the probability of exactly $n$ of the $S$\n",
179 |     "shards failing:\n",
180 |     "\n",
181 |     "$$\\binom{S}{n} \\: P^n \\: (1-P)^{S-n}$$\n",
182 |     "        \n",
183 |     "We also lose data if more than $n$ shards fail in the period.  To include those,\n",
184 |     "we can sum the above formula for $n$ through $S$ shards, to get the probability of\n",
185 |     "data loss in $R$ days:\n",
186 |     "\n",
187 |     "$$\\sum_{k=n}^{S} \\binom{S}{k} \\: P^k \\: (1-P)^{S-k}$$\n",
188 |     "    \n",
189 |     "The durability in each period is the complement of that:\n",
190 |     "\n",
191 |     "$$D = 1 \\: - \\: \\sum_{k=n}^{S} \\binom{S}{k} \\: P^k \\: (1-P)^{S-k}$$\n",
192 |     "\n",
193 |     "Data survives the full year only if it survives every one of the $365/R$ \n",
194 |     "periods in the year, so the annual durability is the product of the per-period\n",
195 |     "durabilities:\n",
196 |     "\n",
197 |     "$$D ^ {365/R}$$\n",
198 |     "\n",
199 |     "And that's the answer!\n"
200 |    ]
201 |   }
202 |  ],
203 |  "metadata": {
204 |   "kernelspec": {
205 |    "display_name": "Python 3",
206 |    "language": "python",
207 |    "name": "python3"
208 |   },
209 |   "language_info": {
210 |    "codemirror_mode": {
211 |     "name": "ipython",
212 |     "version": 3
213 |    },
214 |    "file_extension": ".py",
215 |    "mimetype": "text/x-python",
216 |    "name": "python",
217 |    "nbconvert_exporter": "python",
218 |    "pygments_lexer": "ipython3",
219 |    "version": "3.5.4"
220 |   }
221 |  },
222 |  "nbformat": 4,
223 |  "nbformat_minor": 2
224 | }
225 | 


--------------------------------------------------------------------------------
/durability.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python2
  2 | ######################################################################
  3 | # 
  4 | # File: durability.py
  5 | # 
  6 | # Copyright 2018 Backblaze Inc. All Rights Reserved.
  7 | # 
  8 | ######################################################################
  9 | 
 10 | import argparse
 11 | import math
 12 | import sys
 13 | import unittest
 14 | 
 15 | 
 16 | class Table(object):
 17 | 
 18 |     """
 19 |     Knows how to display a table of data.
 20 | 
 21 |     The data is in the form of a list of dicts:
 22 | 
 23 |         [ { 'a' : 4, 'b' : 8 },
 24 |           { 'a' : 5, 'b' : 9 } ]
 25 | 
 26 |     And is displayed like this:
 27 | 
 28 |         |=======|
 29 |         | a | b |
 30 |         |-------|
 31 |         | 4 | 8 |
 32 |         | 5 | 9 |
 33 |         |=======|
 34 |     """
 35 | 
 36 |     def __init__(self, data, column_names):
 37 |         self.data = data
 38 |         self.column_titles = column_names
 39 |         self.column_widths = [
 40 |             max(len(col), max(len(item[col]) for item in data))
 41 |             for col in column_names
 42 |         ]
 43 | 
 44 |     def __str__(self):
 45 |         result = []
 46 | 
 47 |         # Title row
 48 |         total_width = 1 + sum(3 + w for w in self.column_widths)
 49 |         result.append('|')
 50 |         result.append('=' * (total_width - 2))
 51 |         result.append('|')
 52 |         result.append('\n')
 53 |         result.append('| ')
 54 |         for (col, w) in zip(self.column_titles, self.column_widths):
 55 |             result.append(self.pad(col, w))
 56 |             result.append(' | ')
 57 |         result.append('\n')
 58 |         result.append('|')
 59 |         result.append('-' * (total_width - 2))
 60 |         result.append('|')
 61 |         result.append('\n')
 62 | 
 63 |         # Data rows
 64 |         for item in self.data:
 65 |             result.append('| ')
 66 |             for (col, w) in zip(self.column_titles, self.column_widths):
 67 |                 result.append(self.pad(item[col], w))
 68 |                 result.append(' | ')
 69 |             result.append('\n')
 70 |         result.append('|')
 71 |         result.append('=' * (total_width - 2))
 72 |         result.append('|')
 73 |         result.append('\n')
 74 | 
 75 |         return ''.join(result)
 76 | 
 77 |     def pad(self, s, width):
 78 |         if len(s) < width:
 79 |             return (' ' * (width - len(s))) + s
 80 |         else:
 81 |             return s[:width]
 82 | 
 83 | 
 84 | def print_markdown_table(data, column_names):
 85 |     print
 86 |     print ' | '.join(column_names)
 87 |     print ' | '.join(['---'] * len(column_names))
 88 |     for item in data:
 89 |         print ' | '.join(item[cn] for cn in column_names)
 90 |     print
 91 | 
 92 | 
 93 | def factorial(n):
 94 |     if n == 0:
 95 |         return 1
 96 |     else:
 97 |         return n * factorial(n - 1)
 98 | 
 99 | 
100 | def choose(n, r):
101 |     """
102 |     Returns: How many ways there are to choose a subset of r things from a set of n things.
103 | 
104 |     Computes n! / (r! (n-r)!) exactly. Returns a python long int.
105 | 
106 |     From: http://stackoverflow.com/questions/3025162/statistics-combinations-in-python
107 |     """
108 |     assert n >= 0
109 |     assert 0 <= r <= n
110 | 
111 |     c = 1L
112 |     for num, denom in zip(xrange(n, n-r, -1), xrange(1, r+1, 1)):
113 |         c = (c * num) // denom
114 |     return c
115 | 
116 | 
117 | def binomial_probability(k, n, p):
118 |     """
119 |     Returns: The probability of exactly k of n things happening, when the
120 |              probability of each one (independently) is p.
121 | 
122 |     See: https://en.wikipedia.org/wiki/Binomial_distribution#Probability_mass_function
123 |     """
124 |     return choose(n, k) * (p ** k) * ((1 - p) ** (n - k))
125 | 
126 | 
127 | class TestBinomialProbability(unittest.TestCase):
128 | 
129 |     def test_binomial_probability(self):
130 |         # these test cases are from the Wikipedia page
131 |         self.assertAlmostEqual(0.117649, binomial_probability(0, 6, 0.3))
132 |         self.assertAlmostEqual(0.302526, binomial_probability(1, 6, 0.3))
133 |         self.assertAlmostEqual(0.324135, binomial_probability(2, 6, 0.3))
134 | 
135 |         # Wolfram Alpha: (1 - 1e-6)^800
136 |         self.assertAlmostEqual(0.9992003, binomial_probability(0, 800, 1.0e-6))
137 | 
138 | 
139 | def probability_of_failure_for_failure_rate(f):
140 |     """
141 |     Given a failure rate f, what's the probability of at least one failure?
142 |     """
143 |     probability_of_no_failures = math.exp(-f)
144 |     return 1.0 - probability_of_no_failures
145 | 
146 | 
147 | def probability_of_failure_in_any_period(p, n):
148 |     """
149 |     Returns the probability that a failure (of probability p in one period)
150 |     happens once or more in n periods.
151 | 
152 |     The probability of failure in one period is p, so the probability
153 |     of not failing is (1 - p).  So the probability of not
154 |     failing over n periods is (1 - p) ** n, and the probability
155 |     of one or more failures in n periods is:
156 | 
157 |         1 - (1 - p) ** n
158 | 
159 |     Doing the math without losing precision is tricky.
160 |     After the binomial expansion, you get (for even n):
161 | 
162 |         a = 1 - (1 - choose(n, 1) * p + choose(n, 2) * p**2 - choose(n, 3) * p**3 + choose(n, 4) * p**4 ... + choose(n, n) * p**n)
163 | 
164 |     For odd n, the last term is negative.
165 | 
166 |     To avoid precision loss, we don't want to compute (1 - p) if p
167 |     is really tiny, so we cancel out the 1 and flip the signs
168 |     to get:
169 | 
170 |         a = choose(n, 1) * p - choose(n, 2) * p**2 ...
171 |     """
172 |     if p < 0.01:
173 |         # For tiny numbers, (1 - p) can lose precision.
174 |         # First, compute the result for the integer part
175 |         n_int = int(n)
176 |         result = 0.0
177 |         sign = 1
178 |         for i in xrange(1, n_int + 1):
179 |             p_exp_i = p ** i
180 |             if p_exp_i != 0:
181 |                 result += sign * choose(n_int, i) * p_exp_i
182 |             sign = -sign
183 |         # Adjust the result to include the fractional part
184 |         # What we want is: 1.0 - (1.0 - result) * ((1.0 - p) ** (n - n_int))
185 |         # Which gives this when refactored:
186 |         result = 1.0 - ((1.0 - p) ** (n - n_int)) + result * ((1.0 - p) ** (n - n_int))
187 |         return result
188 |     else:
189 |         # For high probabilities of loss, the powers of p don't
190 |         # get small faster than the coefficients get big, and weird
191 |         # things happen
192 |         return 1.0 - (1.0 - p) ** n
193 | 
194 | 
195 | class TestProbabilityOfFailureAnyPeriod(unittest.TestCase):
196 | 
197 |     def test_probability_of_failure(self):
198 |         # Easy to check
199 |         self.assertAlmostEqual(0.25, probability_of_failure_in_any_period(0.25, 1))
200 |         self.assertAlmostEqual(0.4375, probability_of_failure_in_any_period(0.25, 2))
201 |         self.assertAlmostEqual(0.0199, probability_of_failure_in_any_period(0.01, 2))
202 | 
203 |         # From Wolfram Alpha, some tests with tiny probabilities:
204 |         self.assertAlmostEqual(2.0, probability_of_failure_in_any_period(1e-10, 200) * 1e8)
205 |         self.assertAlmostEqual(2.0, probability_of_failure_in_any_period(1e-30, 200) * 1e28)
206 |         self.assertAlmostEqual(7.60690480739, probability_of_failure_in_any_period(3.47347251479e-103, 2190) * 1e100)
207 | 
208 |         # Check fractional exponents
209 |         self.assertAlmostEqual(0.1339746, probability_of_failure_in_any_period(0.25, 0.5))
210 |         self.assertAlmostEqual(0.0345647, probability_of_failure_in_any_period(0.01, 3.5))
211 | 
212 | 
213 | SCALE_TABLE = [
214 |     (1, 'ten'),
215 |     (2, 'a hundred'),
216 |     (3, 'a thousand'),
217 |     (6, 'a million'),
218 |     (9, 'a billion'),
219 |     (12, 'a trillion'),
220 |     (15, 'a quadrillion'),
221 |     (18, 'a quintillion'),
222 |     (21, 'a sextillion'),
223 |     (24, 'a septillion'),
224 |     (27, 'an octillion')
225 |     ]
226 | 
227 | 
228 | def pretty_probability(p):
229 |     """
230 |     Takes a number between 0 and 1 and returns it formatted as odds in
231 |     the form "5 in a million"
232 |     """
233 |     if abs(p - 1.0) < 0.01:
234 |         return 'always'
235 |     for (power, name) in SCALE_TABLE:
236 |         count = p * (10.0 ** power)
237 |         if count >= 0.90:
238 |             return '%d in %s' % (round(count), name)
239 |     return 'NEVER'
240 | 
241 | 
242 | def count_nines(loss_rate):
243 |     """
244 |     Returns the number of leading nines after the decimal point in the durability (1 - loss_rate).
245 |     """
246 |     nines = 0
247 |     power_of_ten = 0.1
248 |     while True:
249 |         if power_of_ten < loss_rate:
250 |             return nines
251 |         power_of_ten /= 10.0
252 |         nines += 1
253 |         if power_of_ten == 0.0:
254 |             return 0
255 | 
256 | 
257 | def do_scenario(total_shards, min_shards, annual_shard_failure_rate, shard_replacement_days):
258 |     """
259 |     Calculates the cumulative failure rates for different numbers of
260 |     failures, starting with the most possible, down to 0.
261 | 
262 |     The first probability in the table will be extremely improbable,
263 |     because it is the case where ALL of the shards fail.  The next
264 |     line in the table is the case where either all of the shards fail,
265 |     or all but one fail.  The final row in the table is the case where
266 |     anywhere from all to none of the shards fail, which always happens,
267 |     so the probability is one.
268 |     """
269 | 
270 |     num_periods = 365.0 / shard_replacement_days
271 |     failure_rate_per_period = annual_shard_failure_rate / num_periods
272 | 
273 |     print
274 |     print '#'
275 |     print '# total shards:', total_shards
276 |     print '# replacement period (days): %6.4f' % (shard_replacement_days)
277 |     print '# annual shard failure rate: %6.4f' % (annual_shard_failure_rate)
278 |     print '#'
279 |     print
280 | 
281 |     failure_probability_per_period = 1.0 - math.exp(-failure_rate_per_period)
282 |     data = []
283 |     period_cumulative_prob = 0.0
284 |     for failed_shards in xrange(total_shards, -1, -1):
285 |         period_failure_prob = binomial_probability(failed_shards, total_shards, failure_probability_per_period)
286 |         period_cumulative_prob += period_failure_prob
287 |         annual_loss_prob = probability_of_failure_in_any_period(period_cumulative_prob, num_periods)
288 |         nines = '%d nines' % count_nines(annual_loss_prob)
289 |         if failed_shards == total_shards - min_shards + 1:
290 |             nines = "--> " + nines
291 |         data.append({
292 |             'individual_prob' : ('%10.3e' % period_failure_prob),
293 |             'failure_threshold' : str(failed_shards),
294 |             'cumulative_prob' : ('%10.3e' % period_cumulative_prob),
295 |             'cumulative_odds' : pretty_probability(period_cumulative_prob),
296 |             'annual_loss_rate' : ('%10.3e' % annual_loss_prob),
297 |             'annual_odds' : pretty_probability(annual_loss_prob),
298 |             'durability' : '%17.15f' % (1.0 - annual_loss_prob),
299 |             'nines' : nines
300 |             })
301 | 
302 |     print Table(data, ['failure_threshold',
303 |                        'individual_prob',
304 |                        'cumulative_prob',
305 |                        'annual_loss_rate',
306 |                        'annual_odds',
307 |                        'durability',
308 |                        'nines'
309 |                        ])
310 |     print
311 | 
312 |     return dict(
313 |         (item['failure_threshold'], item)
314 |         for item in data
315 |         )
316 | 
317 | 
318 | def example():
319 |     """
320 |     This is the example in the explanation.
321 |     """
322 |     # Table of Poisson probabilities of exactly k failures when the expected number of failures is 2.0:
323 |     p = 2.0
324 |     data = [
325 |         { 'k': str(k), 'p': '%6.4f' % (math.exp(-p) * p**k / factorial(k),) }
326 |         for k in xrange(7)
327 |     ]
328 |     print_markdown_table(data, ['k', 'p'])
329 | 
330 |     print 'Probability of n Failing'
331 |     annual_rate = 0.25
332 |     p_one_failing = probability_of_failure_for_failure_rate(annual_rate)
333 |     print 'probability of one failing: %6.4f' % p_one_failing
334 |     print 'probability of none failing: %6.4f' % (1 - p_one_failing)
335 |     print 'probability of three not failing: %6.4f' % ((1 - p_one_failing) ** 3)
336 |     print 'probability of two or more failing: %6.4f' % (binomial_probability(2, 3, p_one_failing) + binomial_probability(3, 3, p_one_failing))
337 |     print
338 |     probs = {'ok': (1 - p_one_failing), 'FAIL': p_one_failing}
339 |     data = []
340 |     total_prob = 0.0
341 |     for a in ['ok', 'FAIL']:
342 |         for b in ['ok', 'FAIL']:
343 |             for c in ['ok', 'FAIL']:
344 |                 data.append({
345 |                     'A': a,
346 |                     'A prob': '%6.4f' % probs[a],
347 |                     'B': b,
348 |                     'B prob': '%6.4f' % probs[b],
349 |                     'C': c,
350 |                     'C prob': '%6.4f' % probs[c],
351 |                     'Probability': '%6.4f' % (probs[a] * probs[b] * probs[c])
352 |                 })
353 |                 total_prob += probs[a] * probs[b] * probs[c]
354 |     print_markdown_table(data, ['A', 'A prob', 'B', 'B prob', 'C', 'C prob', 'Probability'])
355 |     print 'sum of probabilities: %6.4f' % total_prob
356 |     print
357 | 
358 |     data = [
359 |         {'Number of Failures': str(k), 'Probability': '%6.4f' % binomial_probability(k, 3, p_one_failing)}
360 |         for k in xrange(4)
361 |     ]
362 |     print_markdown_table(data, ['Number of Failures', 'Probability'])
363 | 
364 | 
365 | def main():
366 |     if sys.argv[1:] == ['test']:
367 |         del sys.argv[1]
368 |         unittest.main()
369 |     elif sys.argv[1:] == ['example']:
370 |         example()
371 |     else:
372 |         parser = argparse.ArgumentParser()
373 |         parser.add_argument('data_shards', type=int)
374 |         parser.add_argument('parity_shards', type=int)
375 |         parser.add_argument('annual_shard_failure_rate', type=float)
376 |         parser.add_argument('shard_replacement_days', type=float)
377 |         args = parser.parse_args()
378 |         total_shards = args.data_shards + args.parity_shards
379 |         min_shards = args.data_shards
380 |         do_scenario(total_shards, min_shards, args.annual_shard_failure_rate, args.shard_replacement_days)
381 | 
382 | 
383 | if __name__ == '__main__':
384 |     main()
385 | 


--------------------------------------------------------------------------------
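The calculation in `do_scenario` can be condensed into a standalone sketch. The version below is written for Python 3.8+ (unlike the Python 2 listing above) and is not part of the repository. `binomial_pmf` and `annual_loss_probability` are hypothetical names. It assumes that the repository's `binomial_probability` is the standard binomial probability mass function and that `probability_of_failure_in_any_period` computes 1 - (1 - p)^n; neither definition is visible in this part of the file.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k failures among n independent shards (assumed definition)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def annual_loss_probability(total_shards, min_shards,
                            annual_shard_failure_rate, shard_replacement_days):
    """Approximate probability of losing data at some point during one year."""
    periods_per_year = 365.0 / shard_replacement_days
    rate_per_period = annual_shard_failure_rate / periods_per_year

    # Poisson model: probability that a given shard fails at least once in a period.
    p_shard = 1.0 - math.exp(-rate_per_period)

    # Data is lost once fewer than min_shards survive within a single period.
    failures_to_lose = total_shards - min_shards + 1
    p_loss_period = sum(
        binomial_pmf(k, total_shards, p_shard)
        for k in range(failures_to_lose, total_shards + 1)
    )
    if p_loss_period >= 1.0:
        return 1.0

    # Assumed form of the annual roll-up: 1 - (1 - p)^n, evaluated with
    # log1p/expm1 so that very small probabilities are not rounded away.
    return -math.expm1(periods_per_year * math.log1p(-p_loss_period))
```

For example, `annual_loss_probability(20, 17, 0.01, 1.0)` would estimate the annual loss probability for 17 data shards plus 3 parity shards, a 1% annual shard failure rate, and a one-day replacement period. This corresponds to the `annual_loss_rate` value on the row that `do_scenario` marks with `-->`, provided the assumptions above match the repository's helpers.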