├── .gitignore
├── .pyup.yml
├── LICENSE
├── README.md
├── livestats
    ├── __init__.py
    └── livestats.py
└── setup.py


/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *~
3 | *.swp
4 | 


--------------------------------------------------------------------------------
/.pyup.yml:
--------------------------------------------------------------------------------
1 | # autogenerated pyup.io config file 
2 | # see https://pyup.io/docs/configuration/ for all available options
3 | 
4 | update: insecure
5 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright (c) 2013, Sean Cassidy <sean.a.cassidy@gmail.com>
 2 | All rights reserved.
 3 | 
 4 | Redistribution and use in source and binary forms, with or without
 5 | modification, are permitted provided that the following conditions are met:
 6 |     * Redistributions of source code must retain the above copyright
 7 |       notice, this list of conditions and the following disclaimer.
 8 |     * Redistributions in binary form must reproduce the above copyright
 9 |       notice, this list of conditions and the following disclaimer in the
10 |       documentation and/or other materials provided with the distribution.
11 |     * The names of this project's contributors may not be used to endorse 
12 |       or promote products derived from this software without specific prior 
13 |       written permission.
14 | 
15 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
16 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
17 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
18 | DISCLAIMED. IN NO EVENT SHALL Sean Cassidy BE LIABLE FOR ANY
19 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
20 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
21 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
22 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
23 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
25 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # LiveStats - Online Statistical Algorithms for Python
 2 | 
 3 | LiveStats generates accurate statistics when your data set is too large to fit in memory or too costly to sort. Just add your data to the LiveStats object and call its methods to get statistical estimates of your data.
 4 | 
 5 | LiveStats doesn't keep any items in memory, only estimates of the statistics. This means you can calculate statistics on an arbitrary amount of data.
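
As a rough illustration of how a statistic can be maintained without storing the data, here is a minimal sketch of a running mean and variance update (often called Welford's method). The class name `RunningStats` is hypothetical and not part of LiveStats; it only mirrors the kind of constant-memory update described above.

```python
class RunningStats:
    """Hypothetical helper: running mean and population variance in O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        # Update the mean first, then accumulate the squared deviation
        # using both the old and the new mean.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance (divides by n); NaN until two items are seen.
        return self.m2 / self.count if self.count > 1 else float("nan")


stats = RunningStats()
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.add(value)

print(stats.mean, stats.variance())
```

Only three numbers are kept no matter how many values are added, which is why the amount of input data does not affect memory use.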
 6 | 
 7 | LiveStats supports Python 2.7+ and Python 3.2+ and doesn't rely on any external Python libraries.
 8 | 
 9 | ## Example usage
10 | 
11 | First install LiveStats
12 | 
13 |     $ pip install LiveStats
14 | 
15 | When constructing a LiveStats object, pass in an array of the quantiles you wish to track. LiveStats stores 15 double values per quantile supplied.
16 | 
17 |     from livestats import livestats
18 |     from math import sqrt
19 |     import random
20 | 
21 |     test = livestats.LiveStats([0.25, 0.5, 0.75])
22 | 
23 |     data = iter(random.random, 1)
24 | 
25 |     for x in range(3):
26 |         for y in range(100):
27 |             test.add(next(data)*100)
28 | 
29 |         print("Average {}, stddev {}, quantiles {}".format(test.mean(),
30 |                 sqrt(test.variance()), test.quantiles()))
31 | 
32 | Easy.
33 | 
34 | There are plenty of other methods too, such as minimum, maximum, [kurtosis](http://en.wikipedia.org/wiki/Kurtosis), and [skewness](http://en.wikipedia.org/wiki/Skewness).
35 | 
36 | # FAQ
37 | 
38 | ## How does this work? 
39 | LiveStats uses the [P-Square Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations](http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf) and other online statistical algorithms. I also [wrote a post](https://www.seancassidy.me/on-accepting-interview-question-answers.html) on where I got this idea.
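
The heart of the P-Square algorithm is a parabolic interpolation that moves a marker one position left or right and estimates its new height from its two neighbours. The following is a minimal sketch of that single step; the function and parameter names are illustrative, and the full algorithm (marker bookkeeping, the linear fallback) is not shown.

```python
def parabolic_height(q_prev, q, q_next, n_prev, n, n_next, d):
    """Piecewise-parabolic estimate of a marker's height after moving
    its position from n to n + d, where d is +1 or -1.

    q_prev, q, q_next are the heights of the left, middle and right
    markers; n_prev, n, n_next are their positions.
    """
    outer = d / float(n_next - n_prev)
    left = (n - n_prev + d) * (q_next - q) / float(n_next - n)
    right = (n_next - n - d) * (q - q_prev) / float(n - n_prev)
    return q + outer * (left + right)


# Three markers lying on the curve y = x**2 at positions 1, 2 and 4.
# Moving the middle marker one step to the right should land close to 3**2.
new_height = parabolic_height(1.0, 4.0, 16.0, 1, 2, 4, 1)
print(new_height)
```

Because the formula fits a parabola through three markers, it reproduces quadratic data up to rounding; on real data it only gives an estimate, which is why the algorithm checks that the result stays between the neighbouring heights.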
40 | 
41 | ## How accurate is it?
42 | 
43 | Very accurate. If you run livestats.py as a script with a numeric argument, it'll run some tests with that many data points. Once you get over 10,000 elements, the error relative to the actual quantiles is well below 1%. At 10,000,000, it's this:
44 | 
45 |     Uniform:    Avg%E 1.732260e-12 Var%E 2.999999e-05 Quant%E 1.315983e-05
46 |     Expovar:    Avg%E 9.999994e-06 Var%E 1.000523e-05 Quant%E 1.741774e-05
47 |     Triangular: Avg%E 9.988727e-06 Var%E 4.839340e-12 Quant%E 0.015595
48 |     Bimodal:    Avg%E 9.999991e-06 Var%E 4.555303e-05 Quant%E 9.047849e-06
49 | 
50 | That's the percent error for the cumulative moving average, the percent error for the variance, and the average percent error across three quantiles (25th, 50th, and 75th) for four different random distributions. Pretty good.
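
For reference, a percent error of this kind can be computed by comparing an estimate with the value read from the sorted data. This is a minimal sketch with hypothetical helper names (`exact_quantile`, `percent_error`) and a made-up estimate; it is not the test harness itself.

```python
def exact_quantile(sorted_data, p):
    """Pick the element at fraction p of an already sorted list."""
    index = min(int(len(sorted_data) * p), len(sorted_data) - 1)
    return sorted_data[index]


def percent_error(approx, exact):
    """Relative error of approx against exact, expressed in percent."""
    return 100.0 * abs(approx - exact) / abs(exact)


data = list(range(1, 101))          # 1..100, already in order
exact = exact_quantile(data, 0.5)   # element at index 50
estimate = 50.5                     # hypothetical estimate, for illustration only
error = percent_error(estimate, exact)
print(exact, error)
```

Note that the relative error is undefined when the exact value is zero and becomes very large when it is close to zero, so distributions centred near zero can show inflated figures.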
51 | 
52 | 
53 | ## Why didn't you use NumPy?
54 | 
55 | I didn't think it would help that much. LiveStats doesn't work on large arrays and I wanted PyPy support, which NumPy currently lacks. I'm curious about any and all performance improvements, so please contact me if you think NumPy (or anything else) would help.
56 | 
57 | 


--------------------------------------------------------------------------------
/livestats/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxxr/LiveStats/1723a4008845fdcb7caff2936989e9e0580d22b4/livestats/__init__.py


--------------------------------------------------------------------------------
/livestats/livestats.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | 
  3 | from math import copysign,fabs,sqrt
  4 | import random, sys
  5 | 
  6 | def calcP2(qp1, q, qm1, d, np1, n, nm1):
  7 |     d = float(d)
  8 |     n = float(n)
  9 |     np1 = float(np1)
 10 |     nm1 = float(nm1)
 11 | 
 12 |     outer = d / (np1 - nm1)
 13 |     inner_left = (n - nm1 + d) * (qp1 - q ) / (np1 - n)
 14 |     inner_right = (np1 - n - d) * (q - qm1 ) / (n - nm1)
 15 | 
 16 |     return q + outer * (inner_left + inner_right)
 17 | 
 18 | class Quantile:
 19 |     LEN = 5
 20 |     def __init__(self, p):
 21 |         """ Constructs a single quantile object """
 22 |         self.dn = [0, p/2, p, (1 + p)/2, 1]
 23 |         self.npos = [1, 1 + 2*p, 1 + 4*p, 3 + 2*p, 5]
 24 |         self.pos = list(range(1, self.LEN + 1))
 25 |         self.heights = []
 26 |         self.initialized = False
 27 |         self.p = p
 28 | 
 29 |     def add(self, item):
 30 |         """ Adds another datum """
 31 |         if len(self.heights) != self.LEN:
 32 |             self.heights.append(item)
 33 |         else:
 34 |             if not self.initialized:
 35 |                 self.heights.sort()
 36 |                 self.initialized = True
 37 | 
 38 |             # find cell k
 39 |             if item < self.heights[0]:
 40 |                 self.heights[0] = item
 41 |                 k = 1
 42 |             else:
 43 |                 for i in range(1, self.LEN):
 44 |                     if self.heights[i - 1] <= item and item < self.heights[i]:
 45 |                         k = i
 46 |                         break
 47 |                 else:
 48 |                     k = 4
 49 |                     if self.heights[-1] < item:
 50 |                         self.heights[-1] = item
 51 | 
 52 |             # increment all positions greater than k
 53 |             self.pos = [j if i < k else j + 1 for i,j in enumerate(self.pos)]
 54 |             self.npos = [x + y for x,y in zip(self.npos, self.dn)]
 55 | 
 56 |             self.__adjust()
 57 | 
 58 |     def __adjust(self):
 59 |         for i in range(1, self.LEN - 1):
 60 |             n = self.pos[i]
 61 |             q = self.heights[i]
 62 | 
 63 |             d = self.npos[i] - n
 64 | 
 65 |             if (d >= 1 and self.pos[i + 1] - n > 1) or (d <= -1 and self.pos[i - 1] - n < -1):
 66 |                 d = int(copysign(1,d))
 67 | 
 68 |                 qp1 = self.heights[i + 1]
 69 |                 qm1 = self.heights[i - 1]
 70 |                 np1 = self.pos[i + 1]
 71 |                 nm1 = self.pos[i - 1]
 72 |                 qn = calcP2(qp1, q, qm1, d, np1, n, nm1)
 73 | 
 74 |                 if qm1 < qn and qn < qp1:
 75 |                     self.heights[i] = qn
 76 |                 else:
 77 |                     # use linear form
 78 |                     self.heights[i] = q + d * (self.heights[i + d] - q) / (self.pos[i + d] - n)
 79 | 
 80 |                 self.pos[i] = n + d
 81 | 
 82 |     def quantile(self):
 83 |         if self.initialized:
 84 |             return self.heights[2]
 85 |         else:
 86 |             self.heights.sort()
 87 |             l = len(self.heights)
 88 |             # make sure we don't overflow on p == 1 or underflow on p == 0
 89 |             return self.heights[int(min(max(l - 1, 0), l * self.p))]
 90 | 
 91 | 
 92 | class LiveStats:
 93 |     def __init__(self, p = [0.5]):
 94 |         """ Constructs a LiveStats object
 95 | 
 96 |         Keyword arguments:
 97 | 
 98 |         p -- A list of quantiles to track, by default, [0.5]
 99 | 
100 |         """
101 |         self.min_val = float('inf')
102 |         self.max_val = float('-inf')
103 |         self.var_m2 = 0.0
104 |         self.kurt_m4 = 0.0
105 |         self.skew_m3 = 0.0
106 |         self.average = 0.0
107 |         self.count = 0
108 |         self.tiles = {}
109 |         self.initialized = False
110 |         for i in p:
111 |             self.tiles[i] = Quantile(i)
112 | 
113 |     def add(self, item):
114 |         """ Adds another datum """
115 |         delta = item - self.average
116 | 
117 |         self.min_val = min(self.min_val, item)
118 |         self.max_val = max(self.max_val, item)
119 | 
120 |         # Average
121 |         self.average = (self.count * self.average + item) / (self.count + 1)
122 |         self.count = self.count + 1
123 | 
124 |         # Variance (except for the scale)
125 |         self.var_m2 = self.var_m2 + delta * (item - self.average)
126 | 
127 |         # tiles
128 |         for perc in list(self.tiles.values()):
129 |             perc.add(item)
130 | 
131 |         # Kurtosis
132 |         self.kurt_m4 = self.kurt_m4 + (item - self.average)**4.0
133 | 
134 |         # Skewness
135 |         self.skew_m3 = self.skew_m3 + (item - self.average)**3.0
136 | 
137 | 
138 |     def quantiles(self):
139 |         """ Returns a list of tuples of the quantile and its location """
140 |         return [(key, val.quantile()) for key, val in self.tiles.items()]
141 | 
142 |     def maximum(self):
143 |         """ Returns the maximum value given """
144 |         return self.max_val
145 | 
146 |     def mean(self):
147 |         """ Returns the cumulative moving average of the data """
148 |         return self.average
149 | 
150 |     def minimum(self):
151 |         """ Returns the minimum value given """
152 |         return self.min_val
153 | 
154 |     def num(self):
155 |         """ Returns the number of items added so far"""
156 |         return self.count
157 | 
158 |     def variance(self):
159 |         """ Returns the population variance (dividing by the item count) of the data given so far"""
160 |         if self.count > 1:
161 |             return self.var_m2 / self.count
162 |         else:
163 |             return float('NaN')
164 | 
165 |     def kurtosis(self):
166 |         """ Returns an estimate of the excess kurtosis of the data given so far"""
167 |         if self.count > 1:
168 |             return self.kurt_m4 / (self.count * self.variance()**2.0) - 3.0
169 |         else:
170 |             return float('NaN')
171 | 
172 |     def skewness(self):
173 |         """ Returns the sample skewness of the data given so far"""
174 |         if self.count > 1:
175 |             return self.skew_m3 / (self.count * self.variance()**1.5)
176 |         else:
177 |             return float('NaN')
178 | 
179 | 
180 | def bimodal( low1, high1, mode1, low2, high2, mode2 ):
181 |     toss = random.choice( (1, 2) )
182 |     if toss == 1:
183 |         return random.triangular( low1, high1, mode1 )
184 |     else:
185 |         return random.triangular( low2, high2, mode2 )
186 | 
187 | def output (tiles, data, stats, name):
188 |     data.sort()
189 |     tuples = [dict(stats.quantiles())[x] for x in tiles]  # look up by quantile so the order matches tiles
190 |     med = [data[int(len(data) * x)] for x in tiles]
191 |     pe = 0
192 |     for approx, exact in zip(tuples, med):
193 |         pe = pe + (fabs(approx - exact)/fabs(exact))
194 |     pe = 100.0 * pe / len(tiles)  # average over the quantiles, not the data points
195 |     avg = sum(data)/len(data)
196 | 
197 |     s2 = 0
198 |     for x in data:
199 |         s2 = s2 + (x - avg)**2
200 |     var = s2 / len(data)
201 | 
202 |     v_pe = 100.0*fabs(stats.variance() - var)/fabs(var)
203 |     avg_pe = 100.0*fabs(stats.mean() - avg)/fabs(avg)
204 | 
205 |     print("{}: Avg%E {} Var%E {} Quant%E {}, Kurtosis {}, Skewness {}".format(
206 |             name, avg_pe, v_pe, pe, stats.kurtosis(), stats.skewness()))
207 | 
208 | 
209 | if __name__ == '__main__':
210 |     count = int(sys.argv[1])
211 |     random.seed()
212 | 
213 |     tiles = [0.25, 0.5, 0.75]
214 | 
215 |     median = LiveStats(tiles)
216 |     test = [0.02, 0.15, 0.74, 3.39, 0.83, 22.37, 10.15, 15.43, 38.62, 15.92, 34.60,
217 |             10.28, 1.47, 0.40, 0.05, 11.39, 0.27, 0.42, 0.09, 11.37]
218 |     for i in test:
219 |         median.add(i)
220 | 
221 |     output(tiles, test, median, "Test")
222 | 
223 |     median = LiveStats(tiles)
224 |     x = list(range(count))
225 |     random.shuffle(x)
226 |     for i in x:
227 |         median.add(i)
228 | 
229 |     output(tiles, x, median, "Uniform")
230 | 
231 |     median = LiveStats(tiles)
232 |     for i in range(count):
233 |         x[i] = random.expovariate(1.0/435)
234 |         median.add(x[i])
235 | 
236 |     output(tiles, x, median, "Expovar")
237 | 
238 |     median = LiveStats(tiles)
239 |     for i in range(count):
240 |         x[i] = random.triangular(-1000*count/10, 1000*count/10, 100)
241 |         median.add(x[i])
242 | 
243 |     output(tiles, x, median, "Triangular")
244 | 
245 |     median = LiveStats(tiles)
246 |     for i in range(count):
247 |         x[i] = bimodal(0, 1000, 500, 500, 1500, 1400)
248 |         median.add(x[i])
249 | 
250 |     output(tiles, x, median, "Bimodal")
251 | 
252 | 
253 | 
254 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | from distutils.core import setup
 4 | 
 5 | setup(name='LiveStats',
 6 |       version='1.0',
 7 |       description='LiveStats solves the problem of generating accurate statistics for when your data set is too large to fit in memory, or too costly to sort.',
 8 |       author='Sean Cassidy',
 9 |       author_email='sean.a.cassidy@gmail.com',
10 |       url='https://bitbucket.org/scassidy/livestats',
11 |       packages=['livestats'],
12 |      )
13 | 


--------------------------------------------------------------------------------