├── .gitignore
├── LICENSE
├── LICENSE-3rdparty.csv
├── README.md
├── img
    └── example_regression.png
├── piecewise
    ├── __init__.py
    ├── plotter.py
    └── regressor.py
├── setup.py
├── tests
    ├── __init__.py
    └── test_piecewise.py
└── tox.ini


/.gitignore:
--------------------------------------------------------------------------------
 1 | *.pyc
 2 | 
 3 | # tox and pyenv
 4 | .python-version
 5 | .tox/
 6 | 
 7 | # Ignore files generated during `python setup.py install`
 8 | build/
 9 | dist/
10 | piecewise.egg-info/
11 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | BSD 3-Clause License
 2 | 
 3 | Copyright (c) 2017, Datadog, Inc.
 4 | All rights reserved.
 5 | 
 6 | Redistribution and use in source and binary forms, with or without
 7 | modification, are permitted provided that the following conditions are met:
 8 | 
 9 | * Redistributions of source code must retain the above copyright notice, this
10 |   list of conditions and the following disclaimer.
11 | 
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 |   this list of conditions and the following disclaimer in the documentation
14 |   and/or other materials provided with the distribution.
15 | 
16 | * Neither the name of the copyright holder nor the names of its
17 |   contributors may be used to endorse or promote products derived from
18 |   this software without specific prior written permission.
19 | 
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 | 


--------------------------------------------------------------------------------
/LICENSE-3rdparty.csv:
--------------------------------------------------------------------------------
1 | Component,Origin,License,Copyright
2 | import,matplotlib,Python-2.0,Copyright (c) 2012 Matplotlib Development Team; All Rights Reserved
3 | import,numpy,BSD-3-Clause,Copyright (c) 2005-2017 NumPy Developers.; All rights reserved.
4 | import,setuptools,MIT,Copyright (c) 2016 Jason R Coombs <jaraco@jaraco.com>
5 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # piecewise
 2 | 
 3 | This repo accompanies [Piecewise regression: when one line simply isn’t enough](https://www.datadoghq.com/blog/engineering/piecewise-regression/), a blog post about Datadog's approach to piecewise regression. The code included here is intended to be minimal and readable; this is not a Swiss Army knife to solve all variations of piecewise regression problems.
 4 | 
 5 | ## Installation & dependencies
 6 | 
 7 | This package was written to work with both Python 2 and Python 3.
 8 | 
  9 | To install this package using setuptools, clone this repo and run `python setup.py install` from within the `piecewise` root directory.
10 | 
11 | The package's core `piecewise()` function for regression requires only `numpy`. The use of `piecewise_plot()` for plotting depends also on `matplotlib`.
12 | 
13 | ## Usage
14 | 
15 | Start by preparing your data as list-likes of timestamps (independent variables) and values (dependent variables).
16 | 
17 | ```
18 | import numpy as np
19 | 
20 | t = np.arange(10)
21 | v = np.array(
22 |     [2*i for i in range(5)] +
23 |     [10-i for i in range(5, 10)]
24 | ) + np.random.normal(0, 1, 10)
25 | ```
26 | 
27 | Now, you're ready to import the `piecewise()` function and fit a piecewise linear regression.
28 | 
29 | ```
30 | from piecewise import piecewise
31 | 
32 | model = piecewise(t, v)
33 | ```
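
As the comments in `regressor.py` describe, the fit works bottom-up: it starts from many tiny segments and greedily merges the neighboring pair whose merge adds the least squared error. The sketch below illustrates that merge cost in plain Python. `sse_of_line` and `merge_cost` are illustrative names, not part of this package's API, and the sketch recomputes sums from scratch rather than using the incremental covariance updates the package relies on.

```python
def sse_of_line(ts, vs):
    """Sum of squared errors of the least-squares line through the points."""
    n = len(ts)
    mu_t = sum(ts) / n
    mu_v = sum(vs) / n
    ct = sum((t - mu_t) ** 2 for t in ts)
    cv = sum((v - mu_v) ** 2 for v in vs)
    ctv = sum((t - mu_t) * (v - mu_v) for t, v in zip(ts, vs))
    if ct == 0:
        # All t values coincide: fall back to a flat line through mean(v).
        return cv
    return cv - ctv ** 2 / ct


def merge_cost(left, right):
    """Increase in squared error from replacing two neighbors with one line."""
    (left_t, left_v), (right_t, right_v) = left, right
    merged = sse_of_line(left_t + right_t, left_v + right_v)
    return merged - sse_of_line(left_t, left_v) - sse_of_line(right_t, right_v)


# Two pieces of the same line merge at (almost exactly) zero cost.
same_line = merge_cost(([0, 1, 2], [0, 2, 4]), ([3, 4, 5], [6, 8, 10]))

# A change in slope between the pieces makes the merge expensive.
kink = merge_cost(([0, 1, 2], [0, 2, 4]), ([3, 4, 5], [3, 2, 1]))

print(same_line, kink)
```

Cheap merges are the ones the algorithm takes first; a sudden jump in merge cost is the signal that two segments really are distinct.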
34 | 
 35 | `model` is a `FittedModel` object. If you are at a shell, you can print the object to see the fitted segments' domains and regression coefficients.
36 | 
37 | ```
38 | >>> model
39 | FittedModel with segments:
40 | * FittedSegment(start_t=0, end_t=5, coeffs=(-0.8576123780622642, 2.224791099812951))
41 | * FittedSegment(start_t=5, end_t=9, coeffs=(10.975487672814133, -1.0722348284390741))
42 | ```
43 | 
44 | Alternatively, you can use the `FittedModel`'s `segments` attribute to get at values.
45 | 
46 | ```
47 | >>> len(model.segments)
48 | 2
49 | >>> model.segments[0].coeffs
50 | (-0.8576123780622642, 2.224791099812951)
51 | ```
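
Each segment's `coeffs` tuple is ordered `(intercept, slope)`, which is how `_fit_line()` in `regressor.py` returns it. As a rough sketch of pulling breakpoints and slopes out of a fit, the example below uses a stand-in `FittedSegment` namedtuple with the same three fields, so it runs without the package installed. The numbers are copied from the output above.

```python
from collections import namedtuple

# Stand-in with the same fields as the package's FittedSegment.
FittedSegment = namedtuple('FittedSegment', ['start_t', 'end_t', 'coeffs'])

segments = [
    FittedSegment(0, 5, (-0.8576123780622642, 2.224791099812951)),
    FittedSegment(5, 9, (10.975487672814133, -1.0722348284390741)),
]

# Interior breakpoints are where one segment hands over to the next.
breakpoints = [seg.end_t for seg in segments[:-1]]

# coeffs is (intercept, slope).
intercepts = [seg.coeffs[0] for seg in segments]
slopes = [seg.coeffs[1] for seg in segments]

print(breakpoints, slopes)
```

With a real fit, replace `segments` with `model.segments`.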
52 | 
53 | If you want to interpolate or extrapolate, you can use the `FittedModel`'s `predict()` function.
54 | 
55 | ```
56 | >>> model.predict(t_new=[3.5, 100])
57 | array([  6.92915647, -96.24799517])
58 | ```
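
To make the lookup concrete: `predict()` picks a segment for each new `t` by comparing it against the segments' end points, then evaluates that segment's line. Values past the last breakpoint fall to the last segment, which is what makes extrapolation work. The sketch below mimics this for a single value using only the standard library. `predict_one` is a hypothetical helper, not part of the package, and the segments are plain tuples copied from the output above.

```python
from bisect import bisect_right

# (start_t, end_t, (intercept, slope)), as printed by the fitted model above.
segments = [
    (0, 5, (-0.8576123780622642, 2.224791099812951)),
    (5, 9, (10.975487672814133, -1.0722348284390741)),
]


def predict_one(segments, t_new):
    """Evaluate the line of the segment that t_new falls into."""
    # Only the interior end points act as breakpoints; anything beyond the
    # last one is handled by the final segment.
    ends = [end_t for _, end_t, _ in segments[:-1]]
    idx = bisect_right(ends, t_new)
    intercept, slope = segments[idx][2]
    return intercept + slope * t_new


print(predict_one(segments, 3.5))   # about 6.929
print(predict_one(segments, 100))   # about -96.248
```

A value that sits exactly on a breakpoint (here `t = 5`) is assigned to the segment on its right.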
59 | 
 60 | To see a plot, use `piecewise_plot()` instead of `piecewise()`. You can also pass in an existing `FittedModel`.
61 | 
62 | ```
63 | from piecewise import piecewise_plot
64 | 
65 | # using an existing FittedModel
66 | piecewise_plot(t, v, model=model)
67 | 
68 | # fitting a model on the fly
69 | piecewise_plot(t, v)
70 | ```
71 | 
72 | <img src="/img/example_regression.png" width="400px">
73 | 


--------------------------------------------------------------------------------
/img/example_regression.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataDog/piecewise/0e3c7f50d1e9fc8a251823edab13704e368aa1d4/img/example_regression.png


--------------------------------------------------------------------------------
/piecewise/__init__.py:
--------------------------------------------------------------------------------
1 | from .regressor import piecewise
2 | from .plotter import piecewise_plot
3 | 
4 | __all__ = ['piecewise', 'piecewise_plot']
5 | 


--------------------------------------------------------------------------------
/piecewise/plotter.py:
--------------------------------------------------------------------------------
 1 | # prj
 2 | from .regressor import piecewise
 3 | 
 4 | 
 5 | def piecewise_plot(t, v, min_stop_frac=0.03, model=None):
 6 |     """ Fits a piecewise (aka "segmented") regression and creates a scatter plot
 7 |     of the data overlaid with the regression segments.
 8 | 
 9 |     Params:
10 |         t (listlike of ints or floats): independent/predictor variable values
11 |         v (listlike of ints or floats): dependent/outcome variable values
12 |         min_stop_frac (float between 0 and 1): the fraction of total error that
13 |             a merge must account for to be considered "too big" to keep merging;
14 |             the default is usually adequate, but this may be increased to make
15 |             merging more aggressive (leading to fewer segments in the result)
16 |         model (FittedModel): a previously fit model, if available
17 |     Returns:
18 |         None.
19 |     """
20 |     # 3p
21 |     # delay importing until first use (enables export in __init__.py)
22 |     import matplotlib.pyplot as plt
23 |     
24 |     model = model or piecewise(t, v, min_stop_frac)
25 |     print('Num segments: %s' % len(model.segments))
26 |     plt.plot(t, v, '.', alpha=0.6)
27 |     for seg in model.segments:
28 |         t_new = [seg.start_t, seg.end_t]
29 |         v_hat = [seg.predict(t) for t in t_new]
30 |         plt.plot(t_new, v_hat, 'k-')
31 |     plt.show()
32 | 
33 | # alias for backward compatibility
34 | plot_data_with_regression = piecewise_plot
35 | 


--------------------------------------------------------------------------------
/piecewise/regressor.py:
--------------------------------------------------------------------------------
  1 | # std
  2 | from collections import namedtuple
  3 | import heapq
  4 | 
  5 | # 3p
  6 | import numpy as np
  7 | 
  8 | 
  9 | ## Function to learn piecewise regressions.
 10 | 
 11 | 
 12 | def piecewise(t, v, min_stop_frac=0.03):
 13 |     """ Fits a piecewise (aka "segmented") regression.
 14 |     Params:
 15 |         t (listlike of ints or floats): independent/predictor variable values
 16 |         v (listlike of ints or floats): dependent/outcome variable values
 17 |         min_stop_frac (float between 0 and 1): the fraction of total error that
 18 |             a merge must account for to be considered big enough to stop merging;
 19 |             the default is usually adequate, but this may be increased to make
 20 |             merging more aggressive (leading to fewer segments in the result)
 21 |     Returns:
 22 |         A FittedModel object that can be used for interpolation and extrapolation.
 23 |     """
 24 |     # Validate the inputs, and force t and v to be np.arrays sorted in
 25 |     # ascending t order.
 26 |     t, v = _preprocess(t, v)
 27 | 
 28 |     # Initialize the segments.
 29 |     init_segments, merges = _get_initial_segments_and_merges(t, v)
 30 |     seg_tracker = SegmentTracker(init_segments)
 31 | 
 32 |     # Use a min heap to track potential merges. At the top of the heap
 33 |     # will be the next best merge (smallest increase in error).
 34 |     heapq.heapify(merges)
 35 | 
 36 |     # Greedily make the next best merge until we've merged everything together into
 37 |     # one segment.
 38 |     cum_cost, biggest_cost_increase = 0.0, 0.0
 39 | 
 40 |     # list of merges we need to undo to return to last best configuration
 41 |     merges_since_best = []
 42 | 
 43 |     while len(seg_tracker) > 1:
 44 |         # Identify the next merge to be executed.
 45 |         next_merge = _get_next_merge(merges, seg_tracker)
 46 | 
 47 |         # If the next merge increases the error by a larger amount than any
 48 |         # merge so far, remember the current state (which might end up being the
 49 |         # "best"). To prevent stopping too early (for example, in cases where
 50 |         # there should only be one segment), use min_stop_frac to keep on
 51 |         # remembering the current state as the best state if no single
 52 |         # merge has accounted for a significant part of the total error.
 53 |         cum_cost += next_merge.cost
 54 |         cost_increase = next_merge.cost - biggest_cost_increase
 55 |         biggest_cost_increase = max(biggest_cost_increase, cost_increase)
 56 |         if biggest_cost_increase < min_stop_frac*cum_cost or \
 57 |                 cost_increase == biggest_cost_increase:
 58 |             merges_since_best = [next_merge]
 59 |         else:
 60 |             merges_since_best.append(next_merge)
 61 | 
 62 |         # Execute the next merge.
 63 |         # Update segments, replacing the two old ones with the one new one.
 64 |         seg_tracker.apply_merge(next_merge)
 65 |         
 66 |         # Add new potential merges.
 67 |         neighbors = seg_tracker.get_neighbors(next_merge.new_seg)
 68 |         for neighbor in neighbors:
 69 |             left_seg, right_seg = sorted([next_merge.new_seg, neighbor])
 70 |             heapq.heappush(merges, _make_merge(t, v, left_seg, right_seg))
 71 | 
 72 |     if biggest_cost_increase < min_stop_frac*cum_cost:
 73 |         # This path is needed for the case where there is only one segment, because
 74 |         # merges_since_best isn't updated after merging in the loop above.
 75 |         merges_since_best = []
 76 |     
 77 |     for merge in reversed(merges_since_best):
 78 |         seg_tracker.unapply_merge(merge)
 79 | 
 80 |     fitted_segments = [
 81 |         FittedSegment(t[seg.start_index], t[min(seg.end_index, len(t)-1)], seg.coeffs)
 82 |         for seg in seg_tracker.segments
 83 |     ]
 84 |     return FittedModel(fitted_segments)
 85 | 
 86 | 
 87 | ## Data structures used for representing the fitted model returned by `piecewise()`.
 88 | 
 89 | 
 90 | class FittedSegment(namedtuple('FittedSegment',
 91 |     [
 92 |         'start_t',  # (float) first t value to which this segment applies
 93 |         'end_t',    # (float) first t value to which this segment no longer applies
 94 |         'coeffs'    # (tuple of floats) regression coefficients
 95 |     ]
 96 | )):
 97 |     def predict(self, t_new, out=None, where=True):
 98 |         return _predict(self.coeffs, t_new, out=out, where=where)
 99 | 
100 | 
101 | class FittedModel(object):
102 |     """ Completely defines the result of a piecewise regression.
103 |     The `segments` attribute contains a list of FittedSegments.
104 |     """
105 | 
106 |     def __init__(self, fitted_segments):
107 |         self.segments = fitted_segments
108 |         self._starts = [fs.start_t for fs in fitted_segments]
109 | 
110 |     def __repr__(self):
111 |         return 'FittedModel with segments:\n' + '\n'.join(
112 |             ['* ' + seg.__repr__() for seg in self.segments]
113 |         )
114 | 
115 |     def predict(self, t_new):
116 |         """ Use the segments in this model to predict the v value for new t values.
117 |         Params:
118 |             t_new (scalar or array like): t values for which predictions should be made
119 |         Returns:
120 |             scalar or array like of predictions
121 |         """
122 |         if len(self.segments) == 1:
123 |             return self.segments[0].predict(t_new)
124 | 
125 |         t_array = np.asanyarray(t_new)
126 |         seg_index = np.digitize(t_array, [s.end_t for s in self.segments[:-1]])
127 |         if seg_index.shape == (): # t_new is a scalar or 0-dimensional array
128 |             return self.segments[seg_index].predict(t_new)
129 |         else:
130 |             v_hats = np.empty_like(t_array, dtype=np.double)
131 |             for i, segment in enumerate(self.segments):
132 |                 segment.predict(t_array, out=v_hats, where=seg_index == i)
133 |             return v_hats
134 | 
135 | 
136 | ## Data structures used during the fitting of the regression in `piecewise()`.
137 | 
138 | 
139 | # Segment represents a time range and a linear regression fit through it.
140 | Segment = namedtuple('Segment',
141 |     [
142 |         'start_index',  # (int) zero-based index of start time
143 |         'end_index',    # (int) zero-based index of non-inclusive end time
144 |         'coeffs',       # (tuple of floats) regression coefficients
145 |         'error',        # (float) the total error in the segment
146 |         'cov_data',     # (np.array of floats) incremental covariance data
147 |     ]
148 | )
149 | 
150 | 
151 | # Merge represents a potential merge of two neighboring segments.
152 | Merge = namedtuple('Merge',
153 |     [
154 |         'cost',       # (float) increase in sum of squared error that would result from executing this merge
155 |         'left_seg',   # (Segment)
156 |         'right_seg',  # (Segment)
157 |         'new_seg'     # (Segment) the Segment that would result from combining left_seg and right_seg
158 |     ]
159 | )
160 | 
161 | 
162 | class SegmentTracker(object):
163 |     """ Utility class for tracking the state of the piecewise regression (i.e.,
164 |     what are the current segments based on the set of merges that have been
165 |     executed so far).
166 |     """
167 | 
168 |     def __init__(self, segments):
169 |         # Assume segments are sorted
170 |         starts = np.fromiter((s.start_index for s in segments), np.intp, count=len(segments))
171 | 
172 |         # One position for each original index.
173 |         # About 50% of this space is wasted, but this enables O(1) lookup and
174 |         # replacement by start_index
175 |         self._segments = np.empty(segments[-1].end_index, dtype=object)
176 |         self._segments[starts] = segments
177 | 
178 |         # Valid mask.  As segments are merged, this mask is updated
179 |         self._valid = np.not_equal(self._segments, None)
180 | 
181 |         # Previous neighbor lookup
182 |         self._prev = np.zeros_like(self._segments, dtype=np.intp)
183 |         self._prev[starts[1:]] = starts[:-1]
184 | 
185 |         # Cached length.  Without this, we would need to count _valid every len()
186 |         self._len = len(segments)
187 | 
188 |     def __len__(self):
189 |         return self._len
190 | 
191 |     def contains(self, segment):
192 |         """ Returns True if segment is currently valid; False otherwise. """
193 |         # segment at start_index has not been merged away and is still the same
194 |         return self._valid[segment.start_index] \
195 |             and self._segments[segment.start_index] is segment
196 |     
197 |     def get_prev(self, segment):
198 |         """ Returns the left neighbor of segment; None if it is the first. """
199 |         if segment.start_index > 0:
200 |             return self._segments[self._prev[segment.start_index]]
201 |         else:
202 |             return None
203 |     
204 |     def get_next(self, segment):
205 |         """ Returns the right neighbor of segment; None if it is the last. """
206 |         if segment.end_index < len(self._segments):
207 |             return self._segments[segment.end_index]
208 |         else:
209 |             return None
210 | 
211 |     def get_neighbors(self, segment):
212 |         """ Returns a generator over the 0, 1, or 2 Segments
213 |         adjacent to the given Segment.
214 |         """
215 |         return (
216 |             s for s in (self.get_prev(segment), self.get_next(segment))
217 |             if s is not None
218 |         )
219 | 
220 |     def apply_merge(self, merge):
221 |         """ Insert a new segment and remove the two existing segments
222 |         from which it was created.
223 |         """
224 |         right_seg, new_seg = merge.right_seg, merge.new_seg
225 |         self._valid[right_seg.start_index] = False
226 |         _next = self.get_next(right_seg)
227 |         if _next is not None:
228 |             self._prev[_next.start_index] = new_seg.start_index
229 |         self._segments[new_seg.start_index] = new_seg
230 |         self._len -= 1
231 |     
232 |     def unapply_merge(self, merge):
233 |         """ Remove a segment and reinsert the two segments
234 |         from which it was created.
235 |         """
236 |         right_seg, left_seg = merge.right_seg, merge.left_seg
237 |         self._valid[right_seg.start_index] = True
238 |         _next = self.get_next(right_seg)
239 |         if _next is not None:
240 |             self._prev[_next.start_index] = right_seg.start_index
241 |         self._segments[left_seg.start_index] = left_seg
242 |         self._len += 1
243 | 
244 |     @property
245 |     def segments(self):
246 |         return self._segments[self._valid]
247 | 
248 | 
249 | ## Helper functions for doing piecewise regression.
250 | 
251 | 
252 | def _preprocess(t, v):
253 |     """ Raises an exception if any of the inputs are not valid. Otherwise,
254 |     returns t and v as float arrays sorted by t, with non-finite v entries dropped.
255 |     """
256 |     # Validate the inputs.
257 |     if len(t) != len(v):
258 |         raise ValueError('`t` and `v` must have the same length.')
259 |     t_arr, v_arr = np.asanyarray(t, dtype=np.double), np.asanyarray(v, dtype=np.double)
260 |     if not np.all(np.isfinite(t_arr)):
261 |         raise ValueError('All values in `t` must be finite.')
262 |     finite_mask = np.isfinite(v_arr)
263 |     if np.sum(finite_mask) < 2:
264 |         raise ValueError('`v` must have at least 2 finite values.')
265 |     t_arr, v_arr = t_arr[finite_mask], v_arr[finite_mask]
266 | 
267 |     # Order both arrays by t-values.
268 |     sort_order = np.argsort(t_arr)
269 |     t_arr, v_arr = t_arr[sort_order], v_arr[sort_order]
270 | 
271 |     return t_arr, v_arr
272 | 
273 | 
274 | def _get_initial_segments_and_merges(t, v):
275 |     """ Returns a 2-tuple with the lists of initial segments and initial merges.
276 |     Each Segment is of length 1, 2, or 3. They are created by using even-indexed
277 |     points as seeds and attaching odd-indexed points to the neighboring seed with
278 |     the closer v value.
279 |     This initialization procedure exists to decrease the odds of bad initial
280 |     merges. If initial segments were each a single point, then merging any two
281 |     neighboring points would be equally attractive to our algorithm, because the
282 |     squared error of a line fit through any pair of points is zero. However,
283 |     in the case that the data looks like [1, 1, 1, 1, 10, 10, 10, 10], we would
284 |     prefer to avoid the 1 and neighboring 10 from starting out in the same
285 |     segment. This initialization does this by doing initial merges based on
286 |     absolute difference rather than regression error. Unfortunately, there can
287 |     still be suboptimal initializations, as in this case, where the two 1s will
288 |     be initialized in the same segment: [19, 10, 1, 1, -8, -17]
289 |     """
290 | 
291 |     # creates segments from an array of start, end indices
292 |     def _build_segments(ranges):
293 |         # number of points in each range
294 |         n = np.diff(ranges, axis=1).reshape(-1)
295 |         
296 |         # expand ranges in rows, using masked arrays to deal with uneven lengths.
297 |         # indices like these:
298 |         #   [[0, 1], [1, 4], [4, 6]]
299 |         # yield something like this:
300 |         #   [
301 |         #       [v0, --, --],
302 |         #       [v1, v2, v3],
303 |         #       [v4, v5, --],
304 |         #   ]
305 |         max_n = np.max(n)
306 |         indices = np.ma.array(ranges[:,:1] + np.arange(max_n).reshape(1, -1))
307 |         for i in range(1, max_n):
308 |             indices[n == i, i:] = np.ma.masked
309 |         
310 |         segment_t = np.ma.take(t, indices)
311 |         segment_v = np.ma.take(v, indices)
312 |     
313 |         # sum(t), sum(v), unmasked
314 |         st = np.ma.getdata(np.ma.sum(segment_t, axis=1))
315 |         sv = np.ma.getdata(np.ma.sum(segment_v, axis=1))
316 | 
317 |         # mean(t), mean(v)
318 |         mu_t = (st / n).reshape(-1, 1)
319 |         mu_v = (sv / n).reshape(-1, 1)
320 | 
321 |         # distance from means
322 |         dt = segment_t - mu_t
323 |         dv = segment_v - mu_v
324 | 
325 |         # var(t), var(v) and cov(t, v), before division by n, unmasked
326 |         ct = np.ma.getdata(np.ma.sum(dt ** 2, axis=1))
327 |         cv = np.ma.getdata(np.ma.sum(dv ** 2, axis=1))
328 |         ctv = np.ma.getdata(np.ma.sum(dt * dv, axis=1))
329 |         
330 |         # slope and intercept
331 |         # for single point segments (ct == 0), assume slope = 0, intercept = mean(v)
332 |         nonzero_ct = ct > 0
333 |         slope = np.where(nonzero_ct, ctv, 0.0)
334 |         slope = np.divide(slope, ct, out=slope, where=nonzero_ct).reshape(-1, 1)
335 |         intercept = mu_v - slope * mu_t
336 | 
337 |         # sum of squared errors
338 |         # if n < 3: error = 0
339 |         # elif ct == 0: error = cv
340 |         # else: error = cv - ctv ** 2 / ct
341 | 
342 |         nonzero_error = n >= 3
343 |         nonzero_ct &= nonzero_error
344 |         error = np.where(nonzero_ct, ctv, 0.0) # 0, 0, ctv
345 |         np.square(error, out=error, where=nonzero_ct) # 0, 0, ctv ** 2
346 |         np.divide(error, ct, out=error, where=nonzero_ct) # 0, 0, ctv ** 2 / ct
347 |         np.subtract(cv, error, out=error, where=nonzero_error) # 0, cv, cv - ctv ** 2 / ct
348 | 
349 | 
350 |         return [
351 |             Segment(
352 |                 ranges[i, 0], ranges[i, 1], (intercept[i, 0], slope[i, 0]), error[i],
353 |                 cov_data
354 |             )
355 |             for i, cov_data in enumerate(np.c_[n, st, sv, ct, cv, ctv])
356 |         ]
357 |     
358 |     # If there are multiple values at the same t, average them and treat them
359 |     # like a single point during initialization. This ensures that all the
360 |     # points with the same t are assigned to the same linear segment.
361 |     unique_t = np.unique(t, return_index=True)[1]
362 |     even_n = len(unique_t) % 2 == 0
363 |     index_ranges = np.c_[unique_t, np.r_[unique_t[1:], len(t)]]
364 |     
365 |     # unique t is pretty common, optimize for that
366 |     averages = v[index_ranges[:,0]]
367 |     long_ranges = np.diff(index_ranges, axis=1).reshape(-1) > 1
368 |     if long_ranges.any():
369 |         averages[long_ranges] = np.fromiter(
370 |             (v[idx[0]:idx[1]].mean() for idx in index_ranges[long_ranges]),
371 |             np.double, count=long_ranges.sum()
372 |         )
373 | 
374 |     # Pair every other t with the t on its left or on its right, based on which
375 |     # is closer.
376 |     pair_left = np.less(*np.abs(
377 |         np.ediff1d(averages, to_end=np.inf if even_n else None)
378 |     ).reshape(-1, 2).T)
379 |     np.copyto(
380 |         index_ranges[:-1:2, 1],
381 |         index_ranges[1::2, 1],
382 |         where=pair_left
383 |     )
384 |     np.copyto(
385 |         index_ranges[2::2, 0],
386 |         index_ranges[1:-1:2, 0],
387 |         where=~pair_left[:-1 if even_n else None]
388 |     )
389 | 
390 |     # initial segment ranges are at even indices
391 |     segment_ranges = index_ranges[::2]
392 |     segments = _build_segments(segment_ranges)
393 | 
394 |     # merge every consecutive segment
395 |     merge_ranges = np.c_[segment_ranges[:-1,0], segment_ranges[1:,1]]
396 |     merge_segments = _build_segments(merge_ranges)
397 |     
398 |     merges = [
399 |         Merge(
400 |             new_seg.error - segments[i].error - segments[i + 1].error,
401 |             segments[i], segments[i + 1], new_seg
402 |         )
403 |         for i, new_seg in enumerate(merge_segments)
404 |     ]
405 | 
406 |     return segments, merges
407 | 
408 | 
409 | def _get_next_merge(merges, segment_tracker):
410 |     """ Returns the valid Merge that has the lowest cost.
411 |     Params:
412 |         merges: a heapified list of Merges
413 |         segment_tracker: a SegmentTracker with the currently valid segments;
414 |             any Merge referencing a Segment not in the tracker is no longer valid
415 |     """
416 |     while True:
417 |         next_merge = heapq.heappop(merges)
418 |         if (segment_tracker.contains(next_merge.left_seg) and
419 |                 segment_tracker.contains(next_merge.right_seg)):
420 |             return next_merge
421 | 
422 | 
423 | def _make_segment(t, v, left_seg, right_seg):
424 |     """ Returns a Segment that is the merge of left_seg and right_seg,
425 |     starting at left_seg.start_index and ending at the non-inclusive
426 |     right_seg.end_index.
427 |     """
428 |     start_index = left_seg.start_index
429 |     end_index = right_seg.end_index
430 |     cov_data = _merge_cov_data(left_seg.cov_data, right_seg.cov_data)
431 |     coeffs, error = _fit_line(t, v, start_index, end_index, cov_data)
432 |     return Segment(start_index, end_index, coeffs, error, cov_data)
433 | 
434 | 
435 | def _make_merge(t, v, left_seg, right_seg):
436 |     """ Returns a Merge combining the left_seg and right_seg Segments.
437 |     """
438 |     new_seg = _make_segment(t, v, left_seg, right_seg)
439 |     cost = new_seg.error - left_seg.error - right_seg.error
440 |     return Merge(cost, left_seg, right_seg, new_seg)
441 | 
442 | 
443 | def _merge_cov_data(d1, d2):
444 |     """ Merge covariance data from two segments into a new one.
445 |     See also:
446 |         https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
447 |         https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online
448 |     """
449 |     d3 = d1 + d2
450 |     n1 = d1[0]
451 |     n2 = d2[0]
452 |     n12 = n1 * n2
453 |     n3 = d3[0]
454 |     deltat = (d1[1] * n2 - d2[1] * n1) / n12
455 |     deltav = (d1[2] * n2 - d2[2] * n1) / n12
456 |     d3[3] += deltat ** 2 * n12 / n3
457 |     d3[4] += deltav ** 2 * n12 / n3
458 |     d3[5] += deltat * deltav * n12 / n3
459 |     return d3
460 | 
461 | def _fit_line(t, v, start_index, end_index, cov_data):
462 |     """ Fits an OLS regression for the set of t and v values in the given index
463 |     range. Returns (coefficients of line, sum of squared error).
464 |     """
465 | 
466 |     # based on scipy.stats.linregress
467 |     mu_t, mu_v = cov_data[1:3] / cov_data[0]
468 |     ct, cv, ctv = cov_data[3:]
469 |     if ct != 0:
470 |         slope = ctv / ct
471 |         intercept = mu_v - slope * mu_t
472 |         error = cv - ctv ** 2 / ct
473 |     else:
474 |         slope, intercept, error = 0.0, mu_v, cv
475 | 
476 |     return ((intercept, slope), error)
477 | 
478 | 
479 | def _predict(coeffs, t, out=None, where=True):
480 |     """ Given OLS coefficients, predict the corresponding v values for the given
481 |     t values.
482 |     """
483 |     # if out is None, numpy allocates an empty one
484 |     out = np.multiply(t, coeffs[1], out=out, where=where)
485 |     if np.isscalar(out):
486 |         # t was either a scalar or a 0-dimensional array
487 |         # returning a scalar is consistent with numpy arithmetic operations
488 |         return out + coeffs[0]
489 |     else:
490 |         np.add(out, coeffs[0], out=out, where=where)
491 |         return out
492 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import find_packages, setup
 2 | 
 3 | setup(
 4 |     name='piecewise',
 5 |     version='0.1',
 6 |     description='Piecewise linear regression',
 7 |     url='http://github.com/datadog/piecewise',
 8 |     author='Stephen Kappel',
 9 |     author_email='stephen@datadoghq.com',
10 |     license='BSD-3-Clause',
11 |     packages=['piecewise'],
12 |     install_requires=[
13 |         'numpy>=1.10.0'
14 |     ],
15 |     extras_require = {
16 |         'plotting':  ['matplotlib>=1.4.3'],
17 |     }
18 | )
19 | 


--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataDog/piecewise/0e3c7f50d1e9fc8a251823edab13704e368aa1d4/tests/__init__.py


--------------------------------------------------------------------------------
/tests/test_piecewise.py:
--------------------------------------------------------------------------------
  1 | # std
  2 | import unittest
  3 | 
  4 | # 3p
  5 | import numpy as np
  6 | 
  7 | # prj
  8 | from piecewise import piecewise
  9 | 
 10 | 
 11 | class TestPiecewise(unittest.TestCase):
 12 | 
 13 |     def test_single_line(self):
 14 |         """ When the data follows a single linear path with Gaussian noise, then
 15 |         only one segment should be found.
 16 |         """
 17 |         # Generate some data.
 18 |         np.random.seed(1)
 19 |         intercept = -45.0
 20 |         slope = 0.7
 21 |         t = np.arange(2000)
 22 |         v = intercept + slope*t + np.random.normal(0, 1, 2000)
 23 |         # Fit the piecewise regression.
 24 |         model = piecewise(t, v)
 25 |         # A single segment should be found, encompassing the whole domain with
 26 |         # coefficients approximately equal to those used to generate the data.
 27 |         np.testing.assert_equal(len(model.segments), 1)
 28 |         seg = model.segments[0]
 29 |         np.testing.assert_equal(seg.start_t, 0)
 30 |         np.testing.assert_equal(seg.end_t, 1999)
 31 |         np.testing.assert_almost_equal(seg.coeffs[0], intercept, decimal=0)
 32 |         np.testing.assert_almost_equal(seg.coeffs[1], slope, decimal=2)
 33 | 
 34 |     def test_single_line_with_nans(self):
 35 |         """ Some nans in the data shouldn't break the regression, and leading and
 36 |         trailing nans should lead to exclusion of the corresponding t values from
 37 |         the segment domain.
 38 |         """
 39 |         # Generate some data, and introduce nans.
 40 |         np.random.seed(1)
 41 |         intercept = -45.0
 42 |         slope = 0.7
 43 |         t = np.arange(2000)
 44 |         v = intercept + slope*t + np.random.normal(0, 1, 2000)
 45 |         v[[0, 24, 400, 401, 402, 1000, 1999]] = np.nan
 46 |         # Fit the piecewise regression.
 47 |         model = piecewise(t, v)
 48 |         # A single segment should be found, encompassing the whole domain (excluding
 49 |         # the leading and trailing nans) with coefficients approximately equal to
 50 |         # those used to generate the data.
 51 |         np.testing.assert_equal(len(model.segments), 1)
 52 |         seg = model.segments[0]
 53 |         np.testing.assert_equal(seg.start_t, 1)
 54 |         np.testing.assert_equal(seg.end_t, 1998)
 55 |         np.testing.assert_almost_equal(seg.coeffs[0], intercept, decimal=0)
 56 |         np.testing.assert_almost_equal(seg.coeffs[1], slope, decimal=2)
 57 | 
 58 |     def test_five_segments(self):
 59 |         """ If there are multiple distinct segments, piecewise() should be able to
 60 |         find the proper breakpoints between them.
 61 |         """
 62 |         # Generate some data.
 63 |         t = np.arange(1900, 2000)
 64 |         v = t % 20
 65 |         # Fit the piecewise regression.
 66 |         model = piecewise(t, v)
 67 |         # There should be five segments, each with a slope of 1.
 68 |         np.testing.assert_equal(len(model.segments), 5)
 69 |         for segment in model.segments:
 70 |             np.testing.assert_almost_equal(segment.coeffs[1], 1.0)
 71 |         # The segments should be in time order and each should cover 20 units of the
 72 |         # domain.
 73 |         np.testing.assert_equal(model.segments[0].start_t, 1900)
 74 |         np.testing.assert_equal(model.segments[1].start_t, 1920)
 75 |         np.testing.assert_equal(model.segments[2].start_t, 1940)
 76 |         np.testing.assert_equal(model.segments[3].start_t, 1960)
 77 |         np.testing.assert_equal(model.segments[4].start_t, 1980)
 78 | 
 79 |     def test_messy_ts(self):
 80 |         """ Unevenly-spaced, out-of-order, float t-values should work.
 81 |         """
 82 |         # Generate some step-function data.
 83 |         t = [1.0, 0.2, 0.5, 0.4, 2.3, 1.1]
 84 |         v = [5, 0, 0, 0, 5, 5]
 85 |         # Fit the piecewise regression.
 86 |         model = piecewise(t, v)
 87 |         # There should be two constant-valued segments.
 88 |         np.testing.assert_equal(len(model.segments), 2)
 89 |         seg1, seg2 = model.segments
 90 | 
 91 |         np.testing.assert_equal(seg1.start_t, 0.2)
 92 |         np.testing.assert_equal(seg1.end_t, 1.0)
 93 |         np.testing.assert_almost_equal(seg1.coeffs[0], 0)
 94 |         np.testing.assert_almost_equal(seg1.coeffs[1], 0)
 95 | 
 96 |         np.testing.assert_equal(seg2.start_t, 1.0)
 97 |         np.testing.assert_equal(seg2.end_t, 2.3)
 98 |         np.testing.assert_almost_equal(seg2.coeffs[0], 5)
 99 |         np.testing.assert_almost_equal(seg2.coeffs[1], 0)
100 | 
101 |     def test_non_unique_ts(self):
102 |         """ A dataset with multiple values with the same t should not break the
103 |         code, and all points with the same t should be assigned to the same
104 |         segment.
105 |         """
106 |         # Seeded data in which t=99 appears in both halves.
107 |         np.random.seed(1)
108 |         t1 = list(range(100))
109 |         v1 = list(np.random.normal(3, 1, 100))
110 |         t2 = list(range(99, 199))
111 |         v2 = list(np.random.normal(20, 1, 100))
112 |         t, v = t1 + t2, v1 + v2
113 |         # Fit the piecewise regression.
114 |         model = piecewise(t, v)
115 |         # There should be two segments, and the split shouldn't be in the middle
116 |         # of t=99.
117 |         np.testing.assert_equal(len(model.segments), 2)
118 |         seg1, seg2 = model.segments
119 |         np.testing.assert_equal(seg1.end_t, seg2.start_t)
120 | 


--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
 1 | [tox]
 2 | envlist = py{27,35,36,37}
 3 | 
 4 | [testenv]
 5 | # py35 pins the minimum supported numpy and omits matplotlib; others use latest
 6 | deps =
 7 |     py35: numpy==1.10.0
 8 |     !py35: numpy
 9 |     !py35: matplotlib
10 | commands =
11 |     python -m unittest {posargs:tests.test_piecewise}
12 | 


--------------------------------------------------------------------------------