├── .gitignore ├── LICENSE ├── LICENSE-3rdparty.csv ├── README.md ├── img └── example_regression.png ├── piecewise ├── __init__.py ├── plotter.py └── regressor.py ├── setup.py ├── tests ├── __init__.py └── test_piecewise.py └── tox.ini /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | 3 | # tox and pyenv 4 | .python-version 5 | .tox/ 6 | 7 | # Ignore files generated during `python setup.py install` 8 | build/ 9 | dist/ 10 | piecewise.egg-info/ 11 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2017, Datadog, Inc. 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | * Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /LICENSE-3rdparty.csv: -------------------------------------------------------------------------------- 1 | Component,Origin,License,Copyright 2 | import,matplotlib,Python-2.0,Copyright (c) 2012 Matplotlib Development Team; All Rights Reserved 3 | import,numpy,BSD-3-Clause,Copyright (c) 2005-2017 NumPy Developers.; All rights reserved. 4 | import,setuptools,MIT,Copyright (c) 2016 Jason R Coombs 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # piecewise 2 | 3 | This repo accompanies [Piecewise regression: when one line simply isn’t enough](https://www.datadoghq.com/blog/engineering/piecewise-regression/), a blog post about Datadog's approach to piecewise regression. The code included here is intended to be minimal and readable; this is not a Swiss Army knife to solve all variations of piecewise regression problems. 4 | 5 | ## Installation & dependencies 6 | 7 | This package was written to work with both Python 2 and Python 3. 8 | 9 | To install this package using setup tools, clone this repo and run `python setup.py install` from within the `piecewise` root directory. 10 | 11 | The package's core `piecewise()` function for regression requires only `numpy`. 
The use of `piecewise_plot()` for plotting also depends on `matplotlib`. 12 | 13 | ## Usage 14 | 15 | Start by preparing your data as list-likes of timestamps (independent variables) and values (dependent variables). 16 | 17 | ``` 18 | import numpy as np 19 | 20 | t = np.arange(10) 21 | v = np.array( 22 | [2*i for i in range(5)] + 23 | [10-i for i in range(5, 10)] 24 | ) + np.random.normal(0, 1, 10) 25 | ``` 26 | 27 | Now, you're ready to import the `piecewise()` function and fit a piecewise linear regression. 28 | 29 | ``` 30 | from piecewise import piecewise 31 | 32 | model = piecewise(t, v) 33 | ``` 34 | 35 | `model` is a `FittedModel` object. If you are at a shell, you can print the object to see the fitted segments' domains and regression coefficients. 36 | 37 | ``` 38 | >>> model 39 | FittedModel with segments: 40 | * FittedSegment(start_t=0, end_t=5, coeffs=(-0.8576123780622642, 2.224791099812951)) 41 | * FittedSegment(start_t=5, end_t=9, coeffs=(10.975487672814133, -1.0722348284390741)) 42 | ``` 43 | 44 | Alternatively, you can use the `FittedModel`'s `segments` attribute to get at values. 45 | 46 | ``` 47 | >>> len(model.segments) 48 | 2 49 | >>> model.segments[0].coeffs 50 | (-0.8576123780622642, 2.224791099812951) 51 | ``` 52 | 53 | If you want to interpolate or extrapolate, you can use the `FittedModel`'s `predict()` function. 54 | 55 | ``` 56 | >>> model.predict(t_new=[3.5, 100]) 57 | array([ 6.92915647, -96.24799517]) 58 | ``` 59 | 60 | To see a plot, instead of getting a `FittedModel`, use `piecewise_plot()`. You may also use an existing `FittedModel`.
61 | 62 | ``` 63 | from piecewise import piecewise_plot 64 | 65 | # using an existing FittedModel 66 | piecewise_plot(t, v, model=model) 67 | 68 | # fitting a model on the fly 69 | piecewise_plot(t, v) 70 | ``` 71 | 72 | 73 | -------------------------------------------------------------------------------- /img/example_regression.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataDog/piecewise/0e3c7f50d1e9fc8a251823edab13704e368aa1d4/img/example_regression.png -------------------------------------------------------------------------------- /piecewise/__init__.py: -------------------------------------------------------------------------------- 1 | from .regressor import piecewise 2 | from .plotter import piecewise_plot 3 | 4 | __all__ = ['piecewise', 'piecewise_plot'] 5 | -------------------------------------------------------------------------------- /piecewise/plotter.py: -------------------------------------------------------------------------------- 1 | # prj 2 | from .regressor import piecewise 3 | 4 | 5 | def piecewise_plot(t, v, min_stop_frac=0.03, model=None): 6 | """ Fits a piecewise (aka "segmented") regression and creates a scatter plot 7 | of the data overlaid with the regression segments. 8 | 9 | Params: 10 | t (listlike of ints or floats): independent/predictor variable values 11 | v (listlike of ints or floats): dependent/outcome variable values 12 | min_stop_frac (float between 0 and 1): the fraction of total error that 13 | a merge must account for to be considered "too big" to keep merging; 14 | the default is usually adequate, but this may be increased to make 15 | merging more aggressive (leading to fewer segments in the result) 16 | model (FittedModel): a previously fit model, if available 17 | Returns: 18 | None. 
19 | """ 20 | # 3p 21 | # delay importing until first use (enables export in __init__.py) 22 | import matplotlib.pyplot as plt 23 | 24 | model = model or piecewise(t, v, min_stop_frac) 25 | print('Num segments: %s' % len(model.segments)) 26 | plt.plot(t, v, '.', alpha=0.6) 27 | for seg in model.segments: 28 | t_new = [seg.start_t, seg.end_t] 29 | v_hat = [seg.predict(t) for t in t_new] 30 | plt.plot(t_new, v_hat, 'k-') 31 | plt.show() 32 | 33 | # alias for backward compatibility 34 | plot_data_with_regression = piecewise_plot 35 | -------------------------------------------------------------------------------- /piecewise/regressor.py: -------------------------------------------------------------------------------- 1 | # std 2 | from collections import namedtuple 3 | import heapq 4 | 5 | # 3p 6 | import numpy as np 7 | 8 | 9 | ## Function to learn and plot piecewise regressions. 10 | 11 | 12 | def piecewise(t, v, min_stop_frac=0.03): 13 | """ Fits a piecewise (aka "segmented") regression. 14 | Params: 15 | t (listlike of ints or floats): independent/predictor variable values 16 | v (listlike of ints or floats): dependent/outcome variable values 17 | min_stop_frac (float between 0 and 1): the fraction of total error that 18 | a merge must account for to be considered big enough to stop merging; 19 | the default is usually adequate, but this may be increased to make 20 | merging more aggressive (leading to fewer segments in the result) 21 | Returns: 22 | A FittedModel object that can be used for interpolation and extrapolation. 23 | """ 24 | # Validate the inputs, and force t and v to be np.arrays sorted in 25 | # ascending t order. 26 | t, v = _preprocess(t, v) 27 | 28 | # Initialize the segments. 29 | init_segments, merges = _get_initial_segments_and_merges(t, v) 30 | seg_tracker = SegmentTracker(init_segments) 31 | 32 | # Use a min heap to track potential merges. At the top of the heap 33 | # will be the next best merge (smallest increase in error). 
34 | heapq.heapify(merges) 35 | 36 | # Greedily make the next best merge until we've merged everything together into 37 | # one segment. 38 | cum_cost, biggest_cost_increase = 0.0, 0.0 39 | 40 | # list of merges we need to undo to return to last best configuration 41 | merges_since_best = [] 42 | 43 | while len(seg_tracker) > 1: 44 | # Identify the next merge to be executed. 45 | next_merge = _get_next_merge(merges, seg_tracker) 46 | 47 | # If the next merge increases the error by a larger amount than any 48 | # merge so far, remember the current state (which might end up being the 49 | # "best"). To prevent stopping too early (for example, in cases where 50 | # there should only be one segment), use min_stop_frac to keep on 51 | # remembering the current state as the best state if no single 52 | # merge has accounted for a significant part of the total error. 53 | cum_cost += next_merge.cost 54 | cost_increase = next_merge.cost - biggest_cost_increase 55 | biggest_cost_increase = max(biggest_cost_increase, cost_increase) 56 | if biggest_cost_increase < min_stop_frac*cum_cost or \ 57 | cost_increase == biggest_cost_increase: 58 | merges_since_best = [next_merge] 59 | else: 60 | merges_since_best.append(next_merge) 61 | 62 | # Execute the next merge. 63 | # Update segments, replacing the two old ones with the one new one. 64 | seg_tracker.apply_merge(next_merge) 65 | 66 | # Add new potential merges. 67 | neighbors = seg_tracker.get_neighbors(next_merge.new_seg) 68 | for neighbor in neighbors: 69 | left_seg, right_seg = sorted([next_merge.new_seg, neighbor]) 70 | heapq.heappush(merges, _make_merge(t, v, left_seg, right_seg)) 71 | 72 | if biggest_cost_increase < min_stop_frac*cum_cost: 73 | # This path is needed for the case where there is only one segment, because 74 | # merges_since_best isn't updated after merging in the loop above. 
75 | merges_since_best = [] 76 | 77 | for merge in reversed(merges_since_best): 78 | seg_tracker.unapply_merge(merge) 79 | 80 | fitted_segments = [ 81 | FittedSegment(t[seg.start_index], t[min(seg.end_index, len(t)-1)], seg.coeffs) 82 | for seg in seg_tracker.segments 83 | ] 84 | return FittedModel(fitted_segments) 85 | 86 | 87 | ## Data structures used for representing the fitted model returned by `piecewise()`. 88 | 89 | 90 | class FittedSegment(namedtuple('FittedSegment', 91 | [ 92 | 'start_t', # (float) first t value to which this segment applies 93 | 'end_t', # (float) first t value to which this segment no longer applies 94 | 'coeffs' # (tuple of floats) regression coefficients 95 | ] 96 | )): 97 | def predict(self, t_new, out=None, where=True): 98 | return _predict(self.coeffs, t_new, out=out, where=where) 99 | 100 | 101 | class FittedModel(object): 102 | """ Completely defines the result of a piecewise regression. 103 | The `segments` attribute contains a list of FittedSegments. 104 | """ 105 | 106 | def __init__(self, fitted_segments): 107 | self.segments = fitted_segments 108 | self._starts = [fs.start_t for fs in fitted_segments] 109 | 110 | def __repr__(self): 111 | return 'FittedModel with segments:\n' + '\n'.join( 112 | ['* ' + seg.__repr__() for seg in self.segments] 113 | ) 114 | 115 | def predict(self, t_new): 116 | """ Use the segments in this model to predict the v value for new t values. 
117 | Params: 118 | t_new (scalar or array like): t values for which predictions should be made 119 | Returns: 120 | scalar or array like of predictions 121 | """ 122 | if len(self.segments) == 1: 123 | return self.segments[0].predict(t_new) 124 | 125 | t_array = np.asanyarray(t_new) 126 | seg_index = np.digitize(t_array, [s.end_t for s in self.segments[:-1]]) 127 | if seg_index.shape == (): # t_new is a scalar or 0-dimensional array 128 | return self.segments[seg_index].predict(t_new) 129 | else: 130 | v_hats = np.empty_like(t_array, dtype=np.double) 131 | for i, segment in enumerate(self.segments): 132 | segment.predict(t_array, out=v_hats, where=seg_index == i) 133 | return v_hats 134 | 135 | 136 | ## Data structures used during the fitting of the regression in `piecewise()`. 137 | 138 | 139 | # Segment represents a time range and a linear regression fit through it. 140 | Segment = namedtuple('Segment', 141 | [ 142 | 'start_index', # (int) zero-based index of start time 143 | 'end_index', # (int) zero-based index of non-inclusive end time 144 | 'coeffs', # (tuple of floats) regression coefficients 145 | 'error', # (float) the total error in the segment 146 | 'cov_data', # (np.array of floats) incremental covariance data 147 | ] 148 | ) 149 | 150 | 151 | # Merge represents a potential merge of two neighboring segments. 152 | Merge = namedtuple('Merge', 153 | [ 154 | 'cost', # (float) increase in sum of squared error that would result from executing this merge 155 | 'left_seg', # (Segment) 156 | 'right_seg', # (Segment) 157 | 'new_seg' # (Segment) the Segment that would result from merging left_seg and right_seg 158 | ] 159 | ) 160 | 161 | 162 | class SegmentTracker(object): 163 | """ Utility class for tracking the state of the piecewise regression (i.e., 164 | what are the current segments based on the set of merges that have been 165 | executed so far).
166 | """ 167 | 168 | def __init__(self, segments): 169 | # Assume segments are sorted 170 | starts = np.fromiter((s.start_index for s in segments), np.intp, count=len(segments)) 171 | 172 | # One position for each original index. 173 | # About 50% of this space is wasted, but this enables O(1) lookup and 174 | # replacement by start_index 175 | self._segments = np.empty(segments[-1].end_index, dtype=object) 176 | self._segments[starts] = segments 177 | 178 | # Valid mask. As segments are merged, this mask is updated 179 | self._valid = np.not_equal(self._segments, None) 180 | 181 | # Previous neighbor lookup 182 | self._prev = np.zeros_like(self._segments, dtype=np.intp) 183 | self._prev[starts[1:]] = starts[:-1] 184 | 185 | # Cached length. Without this, we would need to count _valid every len() 186 | self._len = len(segments) 187 | 188 | def __len__(self): 189 | return self._len 190 | 191 | def contains(self, segment): 192 | """ Returns True if segment is currently valid; False otherwise. """ 193 | # segment at start_index has not been merged away and is still the same 194 | return self._valid[segment.start_index] \ 195 | and self._segments[segment.start_index] is segment 196 | 197 | def get_prev(self, segment): 198 | """ Returns the left neighbor of segment; None if it is the first. """ 199 | if segment.start_index > 0: 200 | return self._segments[self._prev[segment.start_index]] 201 | else: 202 | return None 203 | 204 | def get_next(self, segment): 205 | """ Returns the right neighbor of segment; None if it is the last. """ 206 | if segment.end_index < len(self._segments): 207 | return self._segments[segment.end_index] 208 | else: 209 | return None 210 | 211 | def get_neighbors(self, segment): 212 | """ Returns a generator yielding the 0, 1, or 2 Segments 213 | adjacent to the given Segment.
214 | """ 215 | return ( 216 | s for s in (self.get_prev(segment), self.get_next(segment)) 217 | if s is not None 218 | ) 219 | 220 | def apply_merge(self, merge): 221 | """ Insert a new segment and remove the two existing segments 222 | from which it was created. 223 | """ 224 | right_seg, new_seg = merge.right_seg, merge.new_seg 225 | self._valid[right_seg.start_index] = False 226 | _next = self.get_next(right_seg) 227 | if _next: 228 | self._prev[_next.start_index] = new_seg.start_index 229 | self._segments[new_seg.start_index] = new_seg 230 | self._len -= 1 231 | 232 | def unapply_merge(self, merge): 233 | """ Remove a segment and reinsert the two segments 234 | from which it was created. 235 | """ 236 | right_seg, left_seg = merge.right_seg, merge.left_seg 237 | self._valid[right_seg.start_index] = True 238 | _next = self.get_next(right_seg) 239 | if _next: 240 | self._prev[_next.start_index] = right_seg.start_index 241 | self._segments[left_seg.start_index] = left_seg 242 | self._len += 1 243 | 244 | @property 245 | def segments(self): 246 | return self._segments[self._valid] 247 | 248 | 249 | ## Helper functions for doing piecewise regression. 250 | 251 | 252 | def _preprocess(t, v): 253 | """ Raises an exception if any of the inputs are not valid. 254 | Otherwise, returns t and v as np.arrays with non-finite v values dropped, sorted in ascending t order. 255 | """ 256 | # Validate the inputs. 257 | if len(t) != len(v): 258 | raise ValueError('`t` and `v` must have the same length.') 259 | t_arr, v_arr = np.asanyarray(t, dtype=np.double), np.asanyarray(v, dtype=np.double) 260 | if not np.all(np.isfinite(t_arr)): 261 | raise ValueError('All values in `t` must be finite.') 262 | finite_mask = np.isfinite(v_arr) 263 | if np.sum(finite_mask) < 2: 264 | raise ValueError('`v` must have at least 2 finite values.') 265 | t_arr, v_arr = t_arr[finite_mask], v_arr[finite_mask] 266 | 267 | # Order both arrays by t-values.
268 | sort_order = np.argsort(t_arr) 269 | t_arr, v_arr = t_arr[sort_order], v_arr[sort_order] 270 | 271 | return t_arr, v_arr 272 | 273 | 274 | def _get_initial_segments_and_merges(t, v): 275 | """ Returns a 2-tuple with the lists of initial segments and initial merges. 276 | Each Segment is of length 1, 2, or 3. They are created by using even-indexed 277 | points as seeds and attaching odd-indexed points to the neighboring seed with 278 | the closer v value. 279 | This initialization procedure exists to decrease the odds of bad initial 280 | merges. If initial segments were each a single point, then merging any two 281 | neighboring points would be equally attractive to our algorithm, because the 282 | squared error of a line fit through any pair of points is zero. However, 283 | in the case that the data looks like [1, 1, 1, 1, 10, 10, 10, 10], we would 284 | prefer to avoid the 1 and neighboring 10 from starting out in the same 285 | segment. This initialization does this by doing initial merges based on 286 | absolute difference rather than regression error. Unfortunately, there can 287 | still be suboptimal initializations, as in this case, where the two 1s will 288 | be initialized in the same segment: [19, 10, 1, 1, -8, -17] 289 | """ 290 | 291 | # creates segments from an array of start, end indices 292 | def _build_segments(ranges): 293 | # number of points in each range 294 | n = np.diff(ranges, axis=1).reshape(-1) 295 | 296 | # expand ranges in rows, using masked arrays to deal with uneven lengths.
297 | # indices like these: 298 | # [[0, 1], [1, 4], [4, 6]] 299 | # yield something like this: 300 | # [ 301 | # [v0, --, --], 302 | # [v1, v2, v3], 303 | # [v4, v5, --], 304 | # ] 305 | max_n = np.max(n) 306 | indices = np.ma.array(ranges[:,:1] + np.arange(max_n).reshape(1, -1)) 307 | for i in range(1, max_n): 308 | indices[n == i, i:] = np.ma.masked 309 | 310 | segment_t = np.ma.take(t, indices) 311 | segment_v = np.ma.take(v, indices) 312 | 313 | # sum(t), sum(v), unmasked 314 | st = np.ma.getdata(np.ma.sum(segment_t, axis=1)) 315 | sv = np.ma.getdata(np.ma.sum(segment_v, axis=1)) 316 | 317 | # mean(t), mean(v) 318 | mu_t = (st / n).reshape(-1, 1) 319 | mu_v = (sv / n).reshape(-1, 1) 320 | 321 | # distance from means 322 | dt = segment_t - mu_t 323 | dv = segment_v - mu_v 324 | 325 | # var(t), var(v) and cov(t, v), before division by n, unmasked 326 | ct = np.ma.getdata(np.ma.sum(dt ** 2, axis=1)) 327 | cv = np.ma.getdata(np.ma.sum(dv ** 2, axis=1)) 328 | ctv = np.ma.getdata(np.ma.sum(dt * dv, axis=1)) 329 | 330 | # slope and intercept 331 | # for single point segments (ct == 0), assume slope = 0, intercept = mean(v) 332 | nonzero_ct = ct > 0 333 | slope = np.where(nonzero_ct, ctv, 0.0) 334 | slope = np.divide(slope, ct, out=slope, where=nonzero_ct).reshape(-1, 1) 335 | intercept = mu_v - slope * mu_t 336 | 337 | # sum of squared errors 338 | # if n < 3: error = 0 339 | # elif ct == 0: error = cv 340 | # else: error = cv - ctv ** 2 / ct 341 | 342 | nonzero_error = n >= 3 343 | nonzero_ct &= nonzero_error 344 | error = np.where(nonzero_ct, ctv, 0.0) # 0, 0, ctv 345 | np.square(error, out=error, where=nonzero_ct) # 0, 0, ctv ** 2 346 | np.divide(error, ct, out=error, where=nonzero_ct) # 0, 0, ctv ** 2 / ct 347 | np.subtract(cv, error, out=error, where=nonzero_error) # 0, cv, cv - ctv ** 2 / ct 348 | 349 | 350 | return [ 351 | Segment( 352 | ranges[i, 0], ranges[i, 1], (intercept[i, 0], slope[i, 0]), error[i], 353 | cov_data 354 | ) 355 | for i, cov_data in 
enumerate(np.c_[n, st, sv, ct, cv, ctv]) 356 | ] 357 | 358 | # If there are multiple values at the same t, average them and treat them 359 | # like a single point during initialization. This ensures that all the 360 | # points with the same t are assigned to the same linear segment. 361 | unique_t = np.unique(t, return_index=True)[1] 362 | even_n = len(unique_t) % 2 == 0 363 | index_ranges = np.c_[unique_t, np.r_[unique_t[1:], len(t)]] 364 | 365 | # unique t is pretty common, optimize for that 366 | averages = v[index_ranges[:,0]] 367 | long_ranges = np.diff(index_ranges, axis=1).reshape(-1) > 1 368 | if long_ranges.any(): 369 | averages[long_ranges] = np.fromiter( 370 | (v[idx[0]:idx[1]].mean() for idx in index_ranges[long_ranges]), 371 | np.double, count=long_ranges.sum() 372 | ) 373 | 374 | # Pair every other t with the t on its left or on its right, based on which 375 | # is closer. 376 | pair_left = np.less(*np.abs( 377 | np.ediff1d(averages, to_end=np.inf if even_n else None) 378 | ).reshape(-1, 2).T) 379 | np.copyto( 380 | index_ranges[:-1:2, 1], 381 | index_ranges[1::2, 1], 382 | where=pair_left 383 | ) 384 | np.copyto( 385 | index_ranges[2::2, 0], 386 | index_ranges[1:-1:2, 0], 387 | where=~pair_left[:-1 if even_n else None] 388 | ) 389 | 390 | # initial segment ranges are at even indices 391 | segment_ranges = index_ranges[::2] 392 | segments = _build_segments(segment_ranges) 393 | 394 | # merge every consecutive segment 395 | merge_ranges = np.c_[segment_ranges[:-1,0], segment_ranges[1:,1]] 396 | merge_segments = _build_segments(merge_ranges) 397 | 398 | merges = [ 399 | Merge( 400 | new_seg.error - segments[i].error - segments[i + 1].error, 401 | segments[i], segments[i + 1], new_seg 402 | ) 403 | for i, new_seg in enumerate(merge_segments) 404 | ] 405 | 406 | return segments, merges 407 | 408 | 409 | def _get_next_merge(merges, segment_tracker): 410 | """ Returns the valid Merge that has the lowest cost. 
411 | Params: 412 | merges: a heapified list of Merges 413 | segment_tracker: a SegmentTracker with the currently valid segments; 414 | any Merge referencing a Segment not in the tracker is no longer valid 415 | """ 416 | while True: 417 | next_merge = heapq.heappop(merges) 418 | if (segment_tracker.contains(next_merge.left_seg) and 419 | segment_tracker.contains(next_merge.right_seg)): 420 | return next_merge 421 | 422 | 423 | def _make_segment(t, v, left_seg, right_seg): 424 | """ Returns a Segment that is the merge of left_seg and right_seg, 425 | starting at left_seg.start_index and ending at the non-inclusive 426 | right_seg.end_index. 427 | """ 428 | start_index = left_seg.start_index 429 | end_index = right_seg.end_index 430 | cov_data = _merge_cov_data(left_seg.cov_data, right_seg.cov_data) 431 | coeffs, error = _fit_line(t, v, start_index, end_index, cov_data) 432 | return Segment(start_index, end_index, coeffs, error, cov_data) 433 | 434 | 435 | def _make_merge(t, v, left_seg, right_seg): 436 | """ Returns a Merge combining the left_seg and right_seg Segments. 437 | """ 438 | new_seg = _make_segment(t, v, left_seg, right_seg) 439 | cost = new_seg.error - left_seg.error - right_seg.error 440 | return Merge(cost, left_seg, right_seg, new_seg) 441 | 442 | 443 | def _merge_cov_data(d1, d2): 444 | """ Merge covariance data from two segments into a new one. 
445 | See also: 446 | https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm 447 | https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online 448 | """ 449 | d3 = d1 + d2 450 | n1 = d1[0] 451 | n2 = d2[0] 452 | n12 = n1 * n2 453 | n3 = d3[0] 454 | deltat = (d1[1] * n2 - d2[1] * n1) / n12 455 | deltav = (d1[2] * n2 - d2[2] * n1) / n12 456 | d3[3] += deltat ** 2 * n12 / n3 457 | d3[4] += deltav ** 2 * n12 / n3 458 | d3[5] += deltat * deltav * n12 / n3 459 | return d3 460 | 461 | def _fit_line(t, v, start_index, end_index, cov_data): 462 | """ Fits an OLS regression for the set of t and v values in the given index 463 | range. Returns (coefficients of line, sum of squared error). 464 | """ 465 | 466 | # based on scipy.stats.linregress 467 | mu_t, mu_v = cov_data[1:3] / cov_data[0] 468 | ct, cv, ctv = cov_data[3:] 469 | if ct != 0: 470 | slope = ctv / ct 471 | intercept = mu_v - slope * mu_t 472 | error = cv - ctv ** 2 / ct 473 | else: 474 | slope, intercept, error = 0.0, mu_v, cv 475 | 476 | return ((intercept, slope), error) 477 | 478 | 479 | def _predict(coeffs, t, out=None, where=True): 480 | """ Given OLS coefficients, predict the corresponding v values for the given 481 | t values.
482 | """ 483 | # if out is None, numpy allocates an empty one 484 | out = np.multiply(t, coeffs[1], out=out, where=where) 485 | if np.isscalar(out): 486 | # t was either a scalar or a 0-dimensional array 487 | # returning a scalar is consistent with numpy arithmetic operations 488 | return out + coeffs[0] 489 | else: 490 | np.add(out, coeffs[0], out=out, where=where) 491 | return out 492 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import find_packages, setup 2 | 3 | setup( 4 | name='piecewise', 5 | version='0.1', 6 | description='Piecewise linear regression', 7 | url='http://github.com/datadog/piecewise', 8 | author='Stephen Kappel', 9 | author_email='stephen@datadoghq.com', 10 | license='BSD-3-Clause', 11 | packages=['piecewise'], 12 | install_requires=[ 13 | 'numpy>=1.10.0' 14 | ], 15 | extras_require = { 16 | 'plotting': ['matplotlib>=1.4.3'], 17 | } 18 | ) 19 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataDog/piecewise/0e3c7f50d1e9fc8a251823edab13704e368aa1d4/tests/__init__.py -------------------------------------------------------------------------------- /tests/test_piecewise.py: -------------------------------------------------------------------------------- 1 | # std 2 | import unittest 3 | 4 | # 3p 5 | import numpy as np 6 | 7 | # prj 8 | from piecewise import piecewise 9 | 10 | 11 | class TestPiecewise(unittest.TestCase): 12 | 13 | def test_single_line(self): 14 | """ When the data follows a single linear path with Gaussian noise, then 15 | only one segment should be found. 16 | """ 17 | # Generate some data. 
18 | np.random.seed(1) 19 | intercept = -45.0 20 | slope = 0.7 21 | t = np.arange(2000) 22 | v = intercept + slope*t + np.random.normal(0, 1, 2000) 23 | # Fit the piecewise regression. 24 | model = piecewise(t, v) 25 | # A single segment should be found, encompassing the whole domain with 26 | # coefficients approximately equal to those used to generate the data. 27 | np.testing.assert_equal(len(model.segments), 1) 28 | seg = model.segments[0] 29 | np.testing.assert_equal(seg.start_t, 0) 30 | np.testing.assert_equal(seg.end_t, 1999) 31 | np.testing.assert_almost_equal(seg.coeffs[0], intercept, decimal=0) 32 | np.testing.assert_almost_equal(seg.coeffs[1], slope, decimal=0) 33 | 34 | def test_single_line_with_nans(self): 35 | """ Some nans in the data shouldn't break the regression, and leading and 36 | trailing nans should lead to exclusion of the corresponding t values from 37 | the segment domain. 38 | """ 39 | # Generate some data, and introduce nans. 40 | np.random.seed(1) 41 | intercept = -45.0 42 | slope = 0.7 43 | t = np.arange(2000) 44 | v = intercept + slope*t + np.random.normal(0, 1, 2000) 45 | v[[0, 24, 400, 401, 402, 1000, 1999]] = np.nan 46 | # Fit the piecewise regression. 47 | model = piecewise(t, v) 48 | # A single segment should be found, encompassing the whole domain (excluding 49 | # the leading and trailing nans) with coefficients approximately equal to 50 | # those used to generate the data. 51 | np.testing.assert_equal(len(model.segments), 1) 52 | seg = model.segments[0] 53 | np.testing.assert_equal(seg.start_t, 1) 54 | np.testing.assert_equal(seg.end_t, 1998) 55 | np.testing.assert_almost_equal(seg.coeffs[0], intercept, decimal=0) 56 | np.testing.assert_almost_equal(seg.coeffs[1], slope, decimal=0) 57 | 58 | def test_five_segments(self): 59 | """ If there are multiple distinct segments, piecewise() should be able to 60 | find the proper breakpoints between them. 61 | """ 62 | # Generate some data. 
63 | t = np.arange(1900, 2000) 64 | v = t % 20 65 | # Fit the piecewise regression. 66 | model = piecewise(t, v) 67 | # There should be five segments, each with a slope of 1. 68 | np.testing.assert_equal(len(model.segments), 5) 69 | for segment in model.segments: 70 | np.testing.assert_almost_equal(segment.coeffs[1], 1.0) 71 | # The segments should be in time order and each should cover 20 units of the 72 | # domain. 73 | np.testing.assert_equal(model.segments[0].start_t, 1900) 74 | np.testing.assert_equal(model.segments[1].start_t, 1920) 75 | np.testing.assert_equal(model.segments[2].start_t, 1940) 76 | np.testing.assert_equal(model.segments[3].start_t, 1960) 77 | np.testing.assert_equal(model.segments[4].start_t, 1980) 78 | 79 | def test_messy_ts(self): 80 | """ Unevenly-spaced, out-of-order, float t-values should work. 81 | """ 82 | # Generate some step-function data. 83 | t = [1.0, 0.2, 0.5, 0.4, 2.3, 1.1] 84 | v = [5, 0, 0, 0, 5, 5] 85 | # Fit the piecewise regression. 86 | model = piecewise(t, v) 87 | # There should be two constant-valued segments. 88 | np.testing.assert_equal(len(model.segments), 2) 89 | seg1, seg2 = model.segments 90 | 91 | np.testing.assert_equal(seg1.start_t, 0.2) 92 | np.testing.assert_equal(seg1.end_t, 1.0) 93 | np.testing.assert_almost_equal(seg1.coeffs[0], 0) 94 | np.testing.assert_almost_equal(seg1.coeffs[1], 0) 95 | 96 | np.testing.assert_equal(seg2.start_t, 1.0) 97 | np.testing.assert_equal(seg2.end_t, 2.3) 98 | np.testing.assert_almost_equal(seg2.coeffs[0], 5) 99 | np.testing.assert_almost_equal(seg2.coeffs[1], 0) 100 | 101 | def test_non_unique_ts(self): 102 | """ A dataset with multiple values with the same t should not break the 103 | code, and all points with the same t should be assigned to the same 104 | segment. 105 | """ 106 | # Generate some data. 
107 | t1 = [t for t in range(100)] 108 | v1 = [v for v in np.random.normal(3, 1, 100)] 109 | t2 = [t for t in range(99, 199)] 110 | v2 = [v for v in np.random.normal(20, 1, 100)] 111 | t = t1 + t2 112 | v = v1 + v2 113 | # Fit the piecewise regression. 114 | model = piecewise(t, v) 115 | # There should be two segments, and the split shouldn't be in the middle 116 | # of t=99. 117 | np.testing.assert_equal(len(model.segments), 2) 118 | seg1, seg2 = model.segments 119 | assert seg1.end_t == seg2.start_t 120 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = py{27,35,36,37} 3 | 4 | [testenv] 5 | # at least one environment with minimum numpy version and without matplotlib 6 | deps = 7 | py35: numpy==1.10.0 8 | !py35: numpy 9 | !py35: matplotlib 10 | commands = 11 | python -m unittest {posargs:tests.test_piecewise} 12 | --------------------------------------------------------------------------------
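The `_merge_cov_data()` and `_fit_line()` helpers in `piecewise/regressor.py` let a merged segment be refit from the summary statistics of its two halves, without revisiting the underlying points. The following is a minimal pure-Python sketch of that idea, not the package's code: the names `cov_stats`, `merge_cov_stats`, and `fit_line_from_stats` are hypothetical, and plain tuples stand in for the NumPy `cov_data` arrays. It assumes the same six-field layout as the package: `(n, sum_t, sum_v, ct, cv, ctv)`.

```python
def cov_stats(ts, vs):
    """Return (n, sum_t, sum_v, ct, cv, ctv) for one segment of points.

    ct, cv, and ctv are the sums of squared/cross deviations from the
    segment means (variance and covariance before division by n).
    """
    n = len(ts)
    sum_t, sum_v = sum(ts), sum(vs)
    mu_t, mu_v = sum_t / n, sum_v / n
    ct = sum((t - mu_t) ** 2 for t in ts)
    cv = sum((v - mu_v) ** 2 for v in vs)
    ctv = sum((t - mu_t) * (v - mu_v) for t, v in zip(ts, vs))
    return (n, sum_t, sum_v, ct, cv, ctv)


def merge_cov_stats(d1, d2):
    """Combine the statistics of two disjoint segments (pairwise update)."""
    n1, n2 = d1[0], d2[0]
    n3 = n1 + n2
    # differences between the two segment means
    delta_t = d1[1] / n1 - d2[1] / n2
    delta_v = d1[2] / n1 - d2[2] / n2
    weight = n1 * n2 / n3
    return (
        n3,
        d1[1] + d2[1],
        d1[2] + d2[2],
        d1[3] + d2[3] + delta_t ** 2 * weight,
        d1[4] + d2[4] + delta_v ** 2 * weight,
        d1[5] + d2[5] + delta_t * delta_v * weight,
    )


def fit_line_from_stats(d):
    """Return ((intercept, slope), sum of squared error) from the statistics."""
    n, sum_t, sum_v, ct, cv, ctv = d
    mu_t, mu_v = sum_t / n, sum_v / n
    if ct == 0:
        # all t values identical: fall back to a flat line through mean(v)
        return (mu_v, 0.0), cv
    slope = ctv / ct
    return (mu_v - slope * mu_t, slope), cv - ctv ** 2 / ct


# Two neighboring segments that lie on the same line v = 2 * t.
left = cov_stats([0, 1, 2], [0.0, 2.0, 4.0])
right = cov_stats([3, 4], [6.0, 8.0])
merged = merge_cov_stats(left, right)

# Statistics computed directly over all five points, for comparison.
direct = cov_stats([0, 1, 2, 3, 4], [0.0, 2.0, 4.0, 6.0, 8.0])
coeffs, error = fit_line_from_stats(merged)
```

Because the merged statistics match the directly computed ones, the cost of a candidate merge (the new segment's error minus the two old errors) can be evaluated in constant time per merge, which is what keeps the greedy heap-based loop in `piecewise()` cheap.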