├── .gitignore
├── .pyup.yml
├── LICENSE
├── README.md
├── livestats
    ├── __init__.py
    └── livestats.py
└── setup.py


/.gitignore:
--------------------------------------------------------------------------------
*.pyc
*~
*.swp

--------------------------------------------------------------------------------
/.pyup.yml:
--------------------------------------------------------------------------------
# autogenerated pyup.io config file
# see https://pyup.io/docs/configuration/ for all available options

update: insecure

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2013, Sean Cassidy
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    * The names of this project's contributors may not be used to endorse
      or promote products derived from this software without specific prior
      written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED.
IN NO EVENT SHALL Sean Cassidy BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# LiveStats - Online Statistical Algorithms for Python

LiveStats solves the problem of generating accurate statistics when your data set is too large to fit in memory, or too costly to sort. Just add your data to the LiveStats object, and query the methods on it to produce statistical estimates of your data.

LiveStats doesn't keep any items in memory, only estimates of the statistics. This means you can calculate statistics on an arbitrary amount of data.

LiveStats supports Python 2.7+ and Python 3.2+ and doesn't rely on any external Python libraries.

## Example usage

First install LiveStats

    $ pip install LiveStats

When constructing a LiveStats object, pass in a list of the quantiles you wish to track. LiveStats stores 15 double values per quantile supplied.

    from livestats import livestats
    from math import sqrt
    import random

    test = livestats.LiveStats([0.25, 0.5, 0.75])

    # endless stream of random floats in [0, 1)
    data = iter(random.random, 1)

    for x in range(3):
        for y in range(100):
            test.add(next(data)*100)

        print("Average {}, stddev {}, quantiles {}".format(test.mean(),
            sqrt(test.variance()), test.quantiles()))

Easy.
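
To make the meaning of the quantile estimates concrete, here is a minimal sketch of the exact, sort-based quantile that an estimate stands in for. `exact_quantile` is a hypothetical helper, not part of LiveStats; it assumes the same index rule the bundled test script uses (`data[int(len(data) * p)]`), with a clamp added for `p == 1`.

```python
def exact_quantile(data, p):
    """Hypothetical helper: exact quantile by sorting (not part of LiveStats)."""
    ordered = sorted(data)
    # Same index rule as the test script, clamped so p == 1.0 stays in range
    index = min(int(len(ordered) * p), len(ordered) - 1)
    return ordered[index]


sample = [15, 3, 9, 12, 6, 18, 21, 24]
exact = [(p, exact_quantile(sample, p)) for p in (0.25, 0.5, 0.75)]
print(exact)
```

Sorting needs the whole data set in memory, which is the cost LiveStats is designed to avoid.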

There are plenty of other methods too, such as minimum, maximum, [kurtosis](http://en.wikipedia.org/wiki/Kurtosis), and [skewness](http://en.wikipedia.org/wiki/Skewness).

# FAQ

## How does this work?
LiveStats uses the [P-Square Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations](http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf) and other online statistical algorithms. I also [wrote a post](https://www.seancassidy.me/on-accepting-interview-question-answers.html) on where I got this idea.

## How accurate is it?

Very accurate. If you run livestats.py as a script with a numeric argument, it'll run some tests with that many data points. Once you get over 10,000 elements, the error relative to the actual quantiles is well below 1%. At 10,000,000, it's this:

    Uniform:     Avg%E 1.732260e-12 Var%E 2.999999e-05 Quant%E 1.315983e-05
    Expovar:     Avg%E 9.999994e-06 Var%E 1.000523e-05 Quant%E 1.741774e-05
    Triangular:  Avg%E 9.988727e-06 Var%E 4.839340e-12 Quant%E 0.015595
    Bimodal:     Avg%E 9.999991e-06 Var%E 4.555303e-05 Quant%E 9.047849e-06

That's the percent error for the cumulative moving average and the variance, plus the average percent error at three quantiles (25th, 50th, and 75th), for four different random distributions. Pretty good.


## Why didn't you use NumPy?

I didn't think it would help that much. LiveStats doesn't work on large arrays and I wanted PyPy support, which NumPy currently lacks. I'm curious about any and all performance improvements, so please contact me if you think NumPy (or anything else) would help.
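
As an illustration of the "other online statistical algorithms" mentioned under *How does this work?*, the sketch below keeps a running mean and variance without storing any observations. `RunningMoments` is a hypothetical name used only for this example; the update is mathematically equivalent to the mean and variance steps in `LiveStats.add` in livestats.py, and the variance divides by the item count.

```python
class RunningMoments:
    """Hypothetical helper (not part of LiveStats): running mean and variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def add(self, x):
        self.count += 1
        delta = x - self.mean               # deviation from the old mean
        self.mean += delta / self.count     # incremental mean update
        self.m2 += delta * (x - self.mean)  # old deviation times new deviation

    def variance(self):
        # Divides by n, as LiveStats.variance() does
        return self.m2 / self.count if self.count > 1 else float("nan")


rm = RunningMoments()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rm.add(value)
print(rm.mean, rm.variance())
```

Each call to `add` touches only three numbers, so memory use stays constant however many items are fed in.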

--------------------------------------------------------------------------------
/livestats/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cxxr/LiveStats/1723a4008845fdcb7caff2936989e9e0580d22b4/livestats/__init__.py


--------------------------------------------------------------------------------
/livestats/livestats.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

from math import copysign,fabs,sqrt
import random, sys

def calcP2(qp1, q, qm1, d, np1, n, nm1):
    """ Piecewise-parabolic (P-Square) estimate of a marker's new height """
    d = float(d)
    n = float(n)
    np1 = float(np1)
    nm1 = float(nm1)

    outer = d / (np1 - nm1)
    inner_left = (n - nm1 + d) * (qp1 - q ) / (np1 - n)
    inner_right = (np1 - n - d) * (q - qm1 ) / (n - nm1)

    return q + outer * (inner_left + inner_right)

class Quantile:
    LEN = 5
    def __init__(self, p):
        """ Constructs a single quantile object """
        # increments applied to the desired marker positions on each add
        self.dn = [0, p/2, p, (1 + p)/2, 1]
        # desired marker positions
        self.npos = [1, 1 + 2*p, 1 + 4*p, 3 + 2*p, 5]
        # actual marker positions
        self.pos = list(range(1, self.LEN + 1))
        # marker heights; the first LEN items are stored as-is
        self.heights = []
        self.initialized = False
        self.p = p

    def add(self, item):
        """ Adds another datum """
        if len(self.heights) != self.LEN:
            self.heights.append(item)
        else:
            if not self.initialized:
                self.heights.sort()
                self.initialized = True

            # find cell k
            if item < self.heights[0]:
                self.heights[0] = item
                k = 1
            else:
                for i in range(1, self.LEN):
                    if self.heights[i - 1] <= item and item < self.heights[i]:
                        k = i
                        break
                else:
                    # no cell matched: item is at or above the last marker
                    k = 4
                    if self.heights[-1] < item:
                        self.heights[-1] = item

            # increment all positions greater than k
            self.pos = [j if i < k else j + 1 for i,j in enumerate(self.pos)]
            self.npos = [x + y for x,y in zip(self.npos, self.dn)]

            self.__adjust()

    def __adjust(self):
        # move the three interior markers toward their desired positions
        for i in range(1, self.LEN - 1):
            n = self.pos[i]
            q = self.heights[i]

            d = self.npos[i] - n

            if (d >= 1 and self.pos[i + 1] - n > 1) or (d <= -1 and self.pos[i - 1] - n < -1):
                d = int(copysign(1,d))

                qp1 = self.heights[i + 1]
                qm1 = self.heights[i - 1]
                np1 = self.pos[i + 1]
                nm1 = self.pos[i - 1]
                qn = calcP2(qp1, q, qm1, d, np1, n, nm1)

                if qm1 < qn and qn < qp1:
                    self.heights[i] = qn
                else:
                    # use linear form
                    self.heights[i] = q + d * (self.heights[i + d] - q) / (self.pos[i + d] - n)

                self.pos[i] = n + d

    def quantile(self):
        if self.initialized:
            return self.heights[2]
        else:
            self.heights.sort()
            l = len(self.heights)
            # make sure we don't overflow on p == 1 or underflow on p == 0
            return self.heights[int(min(max(l - 1, 0), l * self.p))]


class LiveStats:
    def __init__(self, p = [0.5]):
        """ Constructs a LiveStats object

        Keyword arguments:

        p -- A list of quantiles to track, by default, [0.5]

        """
        self.min_val = float('inf')
        self.max_val = float('-inf')
        self.var_m2 = 0.0
        self.kurt_m4 = 0.0
        self.skew_m3 = 0.0
        self.average = 0.0
        self.count = 0
        self.tiles = {}
        self.initialized = False
        for i in p:
            self.tiles[i] = Quantile(i)

    def add(self, item):
        """ Adds another datum """
        delta = item - self.average

        self.min_val = min(self.min_val, item)
        self.max_val = max(self.max_val, item)

        # Average
        self.average = (self.count * self.average + item) / (self.count + 1)
        self.count = self.count + 1

        # Variance (except for the scale)
        self.var_m2 = self.var_m2 + delta * (item - self.average)

        # tiles
        for perc in list(self.tiles.values()):
            perc.add(item)

        # Kurtosis (deviations are taken from the running mean, so approximate)
        self.kurt_m4 = self.kurt_m4 + (item - self.average)**4.0

        # Skewness (deviations are taken from the running mean, so approximate)
        self.skew_m3 = self.skew_m3 + (item - self.average)**3.0


    def quantiles(self):
        """ Returns a list of tuples of the quantile and its location """
        return [(key, val.quantile()) for key, val in self.tiles.items()]

    def maximum(self):
        """ Returns the maximum value given """
        return self.max_val

    def mean(self):
        """ Returns the cumulative moving average of the data """
        return self.average

    def minimum(self):
        """ Returns the minimum value given """
        return self.min_val

    def num(self):
        """ Returns the number of items added so far"""
        return self.count

    def variance(self):
        """ Returns the variance of the data given so far (divides by the count)"""
        if self.count > 1:
            return self.var_m2 / self.count
        else:
            return float('NaN')

    def kurtosis(self):
        """ Returns the sample kurtosis of the data given so far"""
        if self.count > 1:
            return self.kurt_m4 / (self.count * self.variance()**2.0) - 3.0
        else:
            return float('NaN')

    def skewness(self):
        """ Returns the sample skewness of the data given so far"""
        if self.count > 1:
            return self.skew_m3 / (self.count * self.variance()**1.5)
        else:
            return float('NaN')


def bimodal( low1, high1, mode1, low2, high2, mode2 ):
    toss = random.choice( (1, 2) )
    if toss == 1:
        return random.triangular( low1, high1, mode1 )
    else:
        return random.triangular( low2, high2, mode2 )

def output (tiles, data, stats, name):
    data.sort()
    tuples = [x[1] for x in stats.quantiles()]
    med = [data[int(len(data) * x)] for x in tiles]
    pe = 0
    for approx, exact in zip(tuples, med):
        pe = pe + (fabs(approx - exact)/fabs(exact))
    # NOTE: the sum is divided by len(data), not by the number of quantiles
    pe = 100.0 * pe / len(data)
    # float() avoids integer division on Python 2 when data holds ints
    avg = sum(data) / float(len(data))

    s2 = 0
    for x in data:
        s2 = s2 + (x - avg)**2
    var = s2 / len(data)

    v_pe = 100.0*fabs(stats.variance() - var)/fabs(var)
    avg_pe = 100.0*fabs(stats.mean() - avg)/fabs(avg)

    print("{}: Avg%E {} Var%E {} Quant%E {}, Kurtosis {}, Skewness {}".format(
        name, avg_pe, v_pe, pe, stats.kurtosis(), stats.skewness()))


if __name__ == '__main__':
    # usage: livestats.py <number of data points>
    count = int(sys.argv[1])
    random.seed()

    tiles = [0.25, 0.5, 0.75]

    median = LiveStats(tiles)
    test = [0.02, 0.15, 0.74, 3.39, 0.83, 22.37, 10.15, 15.43, 38.62, 15.92, 34.60,
            10.28, 1.47, 0.40, 0.05, 11.39, 0.27, 0.42, 0.09, 11.37]
    for i in test:
        median.add(i)

    output(tiles, test, median, "Test")

    median = LiveStats(tiles)
    x = list(range(count))
    random.shuffle(x)
    for i in x:
        median.add(i)

    output(tiles, x, median, "Uniform")

    median = LiveStats(tiles)
    for i in range(count):
        x[i] = random.expovariate(1.0/435)
        median.add(x[i])

    output(tiles, x, median, "Expovar")

    median = LiveStats(tiles)
    for i in range(count):
        x[i] = random.triangular(-1000*count/10, 1000*count/10, 100)
        median.add(x[i])

    output(tiles, x, median, "Triangular")

    median = LiveStats(tiles)
    for i in range(count):
        x[i] = bimodal(0, 1000, 500, 500, 1500, 1400)
        median.add(x[i])

    output(tiles, x, median, "Bimodal")




--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

from distutils.core import setup

setup(name='LiveStats',
      version='1.0',
      description='LiveStats solves the problem of generating accurate statistics for when your data set is too large to fit in memory, or too costly to sort.',
      author='Sean Cassidy',
      author_email='gward@python.net',
      url='https://bitbucket.org/scassidy/livestats',
      packages=['livestats'],
     )

--------------------------------------------------------------------------------