├── CITATION.cff ├── LICENSE.md ├── README.md ├── pygmmis.py ├── setup.py └── tests ├── pygmmis.png ├── test.py └── test_3D.py /CITATION.cff: -------------------------------------------------------------------------------- 1 | cff-version: 1.2.0 2 | message: "If you use this software, please cite it as below." 3 | authors: 4 | - family-names: "Melchior" 5 | given-names: "Peter" 6 | orcid: "https://orcid.org/0000-0002-8873-5065" 7 | title: "pyGMMis" 8 | url: "https://github.com/pmelchior/pygmmis" 9 | preferred-citation: 10 | type: article 11 | authors: 12 | - family-names: "Melchior" 13 | given-names: "Peter" 14 | orcid: "https://orcid.org/0000-0002-8873-5065" 15 | - family-names: "Goulding" 16 | given-names: "Andy" 17 | orcid: "https://orcid.org/0000-0003-4700-663X" 18 | doi: "10.1016/j.ascom.2018.09.013" 19 | journal: "Astronomy and Computing" 20 | start: 183 # First page number 21 | end: 194 # Last page number 22 | title: "Filling the gaps: Gaussian mixture models from noisy, truncated or incomplete samples" 23 | volume: 25 24 | year: 2018 25 | month: 10 26 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Peter Melchior 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![PyPI](https://img.shields.io/pypi/v/pygmmis.svg)](https://pypi.python.org/pypi/pygmmis/) 2 | [![License](https://img.shields.io/github/license/pmelchior/pygmmis.svg)](https://github.com/pmelchior/pygmmis/blob/master/LICENSE.md) 3 | [![DOI](https://img.shields.io/badge/DOI-10.1016%2Fj.ascom.2018.09.013-blue.svg)](https://doi.org/10.1016/j.ascom.2018.09.013) 4 | [![arXiv](https://img.shields.io/badge/arxiv-1611.05806-red.svg)](http://arxiv.org/abs/1611.05806) 5 | 6 | # pyGMMis 7 | 8 | Need a simple and powerful Gaussian-mixture code in pure python? It can be as easy as this: 9 | 10 | ```python 11 | import pygmmis 12 | gmm = pygmmis.GMM(K=K, D=D) # K components, D dimensions 13 | logL, U = pygmmis.fit(gmm, data) # logL = log-likelihood, U = association of data to components 14 | ``` 15 | However, **pyGMMis** has a few extra tricks up its sleeve. 
16 | 
17 | * It can account for independent multivariate normal measurement errors for each of the observed samples, and then recovers an estimate of the error-free distribution. This technique is known as "Extreme Deconvolution" (Bovy, Hogg & Roweis 2011).
18 | * It works with missing data (features) by setting the respective elements of the covariance matrix to a very large value, thus effectively setting the weight of the missing feature to 0.
19 | * It can deal with gaps (aka "truncated data") and variable sample completeness as long as
20 |   * you know the incompleteness over the entire feature space,
21 |   * and the incompleteness does not depend on the sample density (missing at random).
22 | * It can incorporate a "background" distribution (a uniform one is implemented) and separate signal from background, with the former being fit by the GMM.
23 | * It keeps track of which components need to be evaluated in which regions of the feature space, thereby substantially increasing the performance for fragmented data.
24 | 
25 | If you want more context and details on those capabilities, have a look at this [blog post](http://pmelchior.net/blog/gaussian-mixture-models-for-astronomy.html).
26 | 
27 | Under the hood, **pyGMMis** uses the Expectation-Maximization procedure. When dealing with sample incompleteness, it generates its best guess of the unobserved samples on the fly, given the current model fit to the observed samples.
28 | 
29 | ![Example of pyGMMis](https://raw.githubusercontent.com/pmelchior/pygmmis/master/tests/pygmmis.png)
30 | 
31 | In the example above, the true distribution is shown as contours in the left panel. We then draw 400 samples from it (red), add Gaussian noise to them (1, 2, 3 sigma contours shown in blue), and select only samples within the box but outside of the circle (blue).
32 | 
33 | The code is written in pure python (developed and tested in 2.7), parallelized with `multiprocessing`, and is capable of performing density estimation with millions of samples and thousands of model components on machines with sufficient memory.
34 | 
35 | More details are in the paper listed in the file `CITATION.cff`.
36 | 
37 | 
38 | 
39 | ## Installation and Prerequisites
40 | 
41 | You can either clone the repo and install via `python setup.py install` or get the latest release with
42 | 
43 | ```
44 | pip install pygmmis
45 | ```
46 | 
47 | Dependencies:
48 | 
49 | * numpy
50 | * scipy
51 | * multiprocessing
52 | * parmap
53 | 
54 | ## How to run the code
55 | 
56 | 1. Create a GMM object with the desired component number K and data dimensionality D:
57 |    ```gmm = pygmmis.GMM(K=K, D=D) ```
58 | 
59 | 2. Define a callback for the completeness function. It is called with `data` of shape `(N,D)` and returns the probability of each sample being observed. Two simple examples:
60 | 
61 | ```python
62 | def cutAtSix(coords):
63 |     """Selects all samples whose first coordinate is < 6"""
64 |     return (coords[:,0] < 6)
65 | 
66 | def selSlope(coords, rng=np.random):
67 |     """Selects probabilistically according to first coordinate x:
68 |     Omega = 1    for x < 0
69 |           = 1-x  for x = 0 .. 1
70 |           = 0    for x > 1
71 |     """
72 |     return np.maximum(0, np.minimum(1, 1 - coords[:,0]))
73 | ```
74 | 
75 | 3. If the samples are noisy (i.e. they have positional uncertainties), you need to provide the covariance matrix of each data sample, or a single one for all in case of i.i.d. noise (see the sketch below).
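76 | 
77 |    A minimal sketch of both conventions (the dispersion value is only an assumption for illustration); the result can be passed to `pygmmis.fit` via its `covar` argument:
78 | 
79 | ```python
80 | dispersion = 0.5                  # assumed noise level, for illustration only
81 | covar = np.eye(D) * dispersion**2 # shape (D,D): identical i.i.d. noise for all samples
82 | # or, with individual uncertainties, one covariance matrix per sample:
83 | covar = np.tile(np.eye(D) * dispersion**2, (len(data), 1, 1)) # shape (N,D,D)
84 | ```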
85 | 
86 | 4. If the samples are noisy *and* the completeness function isn't constant, you need to provide a callback function that returns an estimate of the covariance at arbitrary locations:
87 | 
88 | ```python
89 | # example 1: simply using the same covariance for all samples
90 | dispersion = 1
91 | default_covar = np.eye(D) * dispersion**2
92 | covar_cb = lambda coords: default_covar
93 | 
94 | # example 2: use the covariance of the nearest neighbor.
95 | def covar_tree_cb(coords, tree, covar):
96 |     """Return the covariance of the nearest neighbor of coords in data."""
97 |     dist, ind = tree.query(coords, k=1)
98 |     return covar[ind.flatten()]
99 | 
100 | from sklearn.neighbors import KDTree
101 | tree = KDTree(data, leaf_size=100)
102 | 
103 | from functools import partial
104 | covar_cb = partial(covar_tree_cb, tree=tree, covar=covar)
105 | ```
106 | 
107 | 5. If there is a uniform background signal, you need to define it. Because a uniform distribution is normalizable only if its support is finite, you need to decide on the footprint over which the background model is present, e.g.:
108 | 
109 | ```python
110 | footprint = data.min(axis=0), data.max(axis=0)
111 | amp = 0.3
112 | bg = pygmmis.Background(footprint, amp=amp)
113 | 
114 | # fine tuning, if desired
115 | bg.amp_min = 0.1
116 | bg.amp_max = 0.5
117 | bg.adjust_amp = False # freezes bg.amp at current value
118 | ```
119 | 
120 | 6. Select an initialization method. This tells the GMM what initial parameters it should assume. The options are `'minmax','random','kmeans','none'`. See the respective functions for details:
121 | 
122 |    * `pygmmis.initFromDataMinMax()`
123 |    * `pygmmis.initFromDataAtRandom()`
124 |    * `pygmmis.initFromKMeans()`
125 | 
126 |    For difficult situations, or if you are not happy with the convergence, you may want to experiment with your own initialization. All you have to do is set `gmm.amp`, `gmm.mean`, and `gmm.covar` to the desired values and use `init_method='none'`.
127 | 
128 | 7. Decide whether to freeze any components. This makes sense if you *know* some of the parameters of the components. You can freeze the amplitude, mean, or covariance of any component by listing them in a dictionary, e.g.:
129 | 
130 | ```python
131 | frozen={"amp": [1,2], "mean": [], "covar": [1]}
132 | ```
133 | 
134 |    This freezes the amplitudes of components 1 and 2 (NOTE: counting starts at 0), and the covariance of component 1.
135 | 
136 | 8. Run the fitter:
137 | 
138 | ```python
139 | w = 0.1    # minimum covariance regularization, same units as data
140 | cutoff = 5 # segment the data set into neighborhoods within 5 sigma around components
141 | tol = 1e-3 # tolerance on logL to terminate EM
142 | 
143 | # define RNG for deterministic behavior
144 | from numpy.random import RandomState
145 | seed = 42
146 | rng = RandomState(seed)
147 | 
148 | # run EM
149 | logL, U = pygmmis.fit(gmm, data, init_method='random',\
150 |                       sel_callback=cb, covar_callback=covar_cb, w=w, cutoff=cutoff,\
151 |                       background=bg, tol=tol, frozen=frozen, rng=rng)
152 | ```
153 | 
154 |    This runs the EM procedure until the tolerance is reached and returns the final mean log-likelihood of all samples as well as the neighborhood of each component (indices of data samples that are within cutoff of that component). To follow the fit's progress, see the logging sketch below.
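155 | 
156 |    The fitter reports per-iteration diagnostics (sample counts, background amplitude, log-likelihood) through the standard `logging` module under the logger name `"pygmmis"` (see `pygmmis.py`). A minimal sketch to make these messages visible:
157 | 
158 | ```python
159 | import logging
160 | logging.basicConfig(format='%(message)s')           # attach a root handler
161 | logging.getLogger("pygmmis").setLevel(logging.INFO) # show per-iteration status
162 | ```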
163 | 
164 | 9. Evaluate the model:
165 | 
166 | ```python
167 | # p(x) at test_coords; use as_log=True for log p(x)
168 | p = gmm(test_coords, as_log=False)
169 | N_s = 1000
170 | # draw samples from the GMM alone
171 | samples = gmm.draw(N_s)
172 | 
173 | # draw samples from the model with noise, background, and selection:
174 | # to get the missing samples instead, set invert_sel=True.
175 | # N_orig is the estimated number of samples prior to selection;
176 | # omega holds the selection probability of each returned sample
177 | obs_size = len(data)
178 | samples, covar_samples, N_orig, omega = pygmmis.draw(gmm, obs_size, sel_callback=cb,\
179 |                                                      invert_sel=False, orig_size=None,\
180 |                                                      covar_callback=covar_cb, background=bg)
181 | ```
182 | 
183 | 
184 | 
185 | For a complete example, have a look at [the test script](https://github.com/pmelchior/pygmmis/blob/master/tests/test.py). For requests and bug reports, please open an issue.
186 | 
--------------------------------------------------------------------------------
/pygmmis.py:
--------------------------------------------------------------------------------
1 | from __future__ import division
2 | import numpy as np
3 | import scipy.special, scipy.stats
4 | import ctypes
5 | 
6 | import logging
7 | logger = logging.getLogger("pygmmis")
8 | 
9 | # set up multiprocessing
10 | import multiprocessing
11 | import parmap
12 | 
13 | def createShared(a, dtype=ctypes.c_double):
14 |     """Create a shared array to be used for multiprocessing's processes.
15 | 
16 |     Taken from http://stackoverflow.com/questions/5549190/
17 | 
18 |     Works only for float, double, int, long types (e.g. no bool).
19 | 
20 |     Args:
21 |         numpy array, arbitrary shape
22 | 
23 |     Returns:
24 |         numpy array whose container is a multiprocessing.Array
25 |     """
26 |     shared_array_base = multiprocessing.Array(dtype, a.size)
27 |     shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
28 |     shared_array[:] = a.flatten()
29 |     shared_array = shared_array.reshape(a.shape)
30 |     return shared_array
31 | 
32 | # this is to allow multiprocessing pools to operate on class methods:
33 | # https://gist.github.com/bnyeggen/1086393
34 | def _pickle_method(method):
35 |     func_name = method.im_func.__name__
36 |     obj = method.im_self
37 |     cls = method.im_class
38 |     if func_name.startswith('__') and not func_name.endswith('__'): #deal with mangled names
39 |         cls_name = cls.__name__.lstrip('_')
40 |         func_name = '_' + cls_name + func_name
41 |     return _unpickle_method, (func_name, obj, cls)
42 | 
43 | def _unpickle_method(func_name, obj, cls):
44 |     for cls in cls.__mro__:
45 |         try:
46 |             func = cls.__dict__[func_name]
47 |         except KeyError:
48 |             pass
49 |         else:
50 |             break
51 |     return func.__get__(obj, cls)
52 | 
53 | import types
54 | # python 2 -> 3 adjustments
55 | try:
56 |     import copy_reg
57 | except ImportError:
58 |     import copyreg as copy_reg
59 | copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)
60 | 
61 | try:
62 |     xrange
63 | except NameError:
64 |     xrange = range
65 | 
66 | # Blatant copy from Erin Sheldon's esutil
67 | # https://github.com/esheldon/esutil/blob/master/esutil/numpy_util.py
68 | def match1d(arr1input, arr2input, presorted=False):
69 |     """
70 |     NAME:
71 |         match
72 |     CALLING SEQUENCE:
73 |         ind1,ind2 = match(arr1, arr2, presorted=False)
74 |     PURPOSE:
75 |         Match two numpy arrays. Return the indices of the matches or empty
76 |         arrays if no matches are found. This means arr1[ind1] == arr2[ind2] is
77 |         true for all corresponding pairs. arr1 must contain only unique
78 |         inputs, but arr2 may be non-unique.
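        A worked example: ind1, ind2 = match1d([3,1,2], [2,2,5]) gives
        ind1 = [2,2], ind2 = [0,1], since both entries of value 2 in arr2
        match arr1[2], and the unmatched value 5 is dropped.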
79 | If you know arr1 is sorted, set presorted=True and it will run 80 | even faster 81 | METHOD: 82 | uses searchsorted with some sugar. Much faster than old version 83 | based on IDL code. 84 | REVISION HISTORY: 85 | Created 2015, Eli Rykoff, SLAC. 86 | """ 87 | 88 | # make sure 1D 89 | arr1 = np.array(arr1input, ndmin=1, copy=False) 90 | arr2 = np.array(arr2input, ndmin=1, copy=False) 91 | 92 | # check for integer data... 93 | if (not issubclass(arr1.dtype.type,np.integer) or 94 | not issubclass(arr2.dtype.type,np.integer)) : 95 | mess="Error: only works with integer types, got %s %s" 96 | mess = mess % (arr1.dtype.type,arr2.dtype.type) 97 | raise ValueError(mess) 98 | 99 | if (arr1.size == 0) or (arr2.size == 0) : 100 | mess="Error: arr1 and arr2 must each be non-zero length" 101 | raise ValueError(mess) 102 | 103 | # make sure that arr1 has unique values... 104 | test=np.unique(arr1) 105 | if test.size != arr1.size: 106 | raise ValueError("Error: the arr1input must be unique") 107 | 108 | # sort arr1 if not presorted 109 | if not presorted: 110 | st1 = np.argsort(arr1) 111 | else: 112 | st1 = None 113 | 114 | # search the sorted array 115 | sub1=np.searchsorted(arr1,arr2,sorter=st1) 116 | 117 | # check for out-of-bounds at the high end if necessary 118 | if (arr2.max() > arr1.max()) : 119 | bad,=np.where(sub1 == arr1.size) 120 | sub1[bad] = arr1.size-1 121 | 122 | if not presorted: 123 | sub2,=np.where(arr1[st1[sub1]] == arr2) 124 | sub1=st1[sub1[sub2]] 125 | else: 126 | sub2,=np.where(arr1[sub1] == arr2) 127 | sub1=sub1[sub2] 128 | 129 | return sub1,sub2 130 | 131 | 132 | def logsum(logX, axis=0): 133 | """Computes log of the sum along give axis from the log of the summands. 134 | 135 | This method tries hard to avoid over- or underflow. 136 | See appendix A of Bovy, Hogg, Roweis (2009). 137 | 138 | Args: 139 | logX: numpy array of logarithmic summands 140 | axis (int): axis to sum over 141 | 142 | Returns: 143 | log of the sum, shortened by one axis 144 | 145 | Throws: 146 | ValueError if logX has length 0 along given axis 147 | 148 | """ 149 | floatinfo = np.finfo(logX.dtype) 150 | underflow = np.log(floatinfo.tiny) - logX.min(axis=axis) 151 | overflow = np.log(floatinfo.max) - logX.max(axis=axis) - np.log(logX.shape[axis]) 152 | c = np.where(underflow < overflow, underflow, overflow) 153 | # adjust the shape of c for addition with logX 154 | c_shape = [slice(None) for i in xrange(len(logX.shape))] 155 | c_shape[axis] = None 156 | return np.log(np.exp(logX + c[tuple(c_shape)]).sum(axis=axis)) - c 157 | 158 | 159 | def chi2_cutoff(D, cutoff=3.): 160 | """D-dimensional eqiuvalent of "n sigma" cut. 161 | 162 | Evaluates the quantile function of the chi-squared distribution to determine 163 | the limit for the chi^2 of samples wrt to GMM so that they satisfy the 164 | 68-95-99.7 percent rule of the 1D Normal distribution. 
165 | 166 | Args: 167 | D (int): dimensions of the feature space 168 | cutoff (float): 1D equivalent cut [in units of sigma] 169 | 170 | Returns: 171 | float: upper limit for chi-squared in D dimensions 172 | """ 173 | cdf_1d = scipy.stats.norm.cdf(cutoff) 174 | confidence_1d = 1-(1-cdf_1d)*2 175 | cutoff_nd = scipy.stats.chi2.ppf(confidence_1d, D) 176 | return cutoff_nd 177 | 178 | def covar_callback_default(coords, default=None): 179 | N,D = coords.shape 180 | if default.shape != (D,D): 181 | raise RuntimeError("covar_callback received improper default covariance %r" % default) 182 | # no need to copy since a single covariance matrix is sufficient 183 | # return np.tile(default, (N,1,1)) 184 | return default 185 | 186 | 187 | class GMM(object): 188 | """Gaussian mixture model with K components in D dimensions. 189 | 190 | Attributes: 191 | amp: numpy array (K,), component amplitudes 192 | mean: numpy array (K,D), component means 193 | covar: numpy array (K,D,D), component covariances 194 | """ 195 | def __init__(self, K=0, D=0): 196 | """Create the arrays for amp, mean, covar.""" 197 | self.amp = np.zeros((K)) 198 | self.mean = np.empty((K,D)) 199 | self.covar = np.empty((K,D,D)) 200 | 201 | @property 202 | def K(self): 203 | """int: number of components, depends on size of amp.""" 204 | return self.amp.size 205 | 206 | @property 207 | def D(self): 208 | """int: dimensions of the feature space.""" 209 | return self.mean.shape[1] 210 | 211 | def save(self, filename, **kwargs): 212 | """Save GMM to file. 213 | 214 | Args: 215 | filename (str): name for saved file, should end on .npz as the default 216 | of numpy.savez(), which is called here 217 | kwargs: dictionary of additional information to be stored in file. 218 | 219 | Returns: 220 | None 221 | """ 222 | np.savez(filename, amp=self.amp, mean=self.mean, covar=self.covar, **kwargs) 223 | 224 | def load(self, filename): 225 | """Load GMM from file. 226 | 227 | Additional arguments stored by save() will be ignored. 228 | 229 | Args: 230 | filename (str): name for file create with save(). 231 | 232 | Returns: 233 | None 234 | """ 235 | F = np.load(filename) 236 | self.amp = F["amp"] 237 | self.mean = F["mean"] 238 | self.covar = F["covar"] 239 | F.close() 240 | 241 | @staticmethod 242 | def from_file(filename): 243 | """Load GMM from file. 244 | 245 | Additional arguments stored by save() will be ignored. 246 | 247 | Args: 248 | filename (str): name for file create with save(). 249 | 250 | Returns: 251 | GMM 252 | """ 253 | gmm = GMM() 254 | gmm.load(filename) 255 | return gmm 256 | 257 | def draw(self, size=1, rng=np.random): 258 | """Draw samples from the GMM. 259 | 260 | Args: 261 | size (int): number of samples to draw 262 | rng: numpy.random.RandomState for deterministic draw 263 | 264 | Returns: 265 | numpy array (size,D) 266 | """ 267 | # draw indices for components given amplitudes, need to make sure: sum=1 268 | ind = rng.choice(self.K, size=size, p=self.amp/self.amp.sum()) 269 | N = np.bincount(ind, minlength=self.K) 270 | 271 | # for each component: draw as many points as in ind from a normal 272 | samples = np.empty((size, self.D)) 273 | lower = 0 274 | for k in np.flatnonzero(N): 275 | upper = lower + N[k] 276 | samples[lower:upper, :] = rng.multivariate_normal(self.mean[k], self.covar[k], size=N[k]) 277 | lower = upper 278 | return samples 279 | 280 | def __call__(self, coords, covar=None, as_log=False): 281 | """Evaluate model PDF at given coordinates. 282 | 283 | see logL() for details. 
284 | 285 | Args: 286 | coords: numpy array (D,) or (N, D) of test coordinates 287 | covar: numpy array (D, D) or (N, D, D) covariance matrix of coords 288 | as_log (bool): return log(p) instead p 289 | 290 | Returns: 291 | numpy array (1,) or (N, 1) of PDF (or its log) 292 | """ 293 | if as_log: 294 | return self.logL(coords, covar=covar) 295 | else: 296 | return np.exp(self.logL(coords, covar=covar)) 297 | 298 | def _mp_chunksize(self): 299 | # find how many components to distribute over available threads 300 | cpu_count = multiprocessing.cpu_count() 301 | chunksize = max(1, self.K//cpu_count) 302 | n_chunks = min(cpu_count, self.K//chunksize) 303 | return n_chunks, chunksize 304 | 305 | def _get_chunks(self): 306 | # split all component in ideal-sized chunks 307 | n_chunks, chunksize = self._mp_chunksize() 308 | left = self.K - n_chunks*chunksize 309 | chunks = [] 310 | n = 0 311 | for i in xrange(n_chunks): 312 | n_ = n + chunksize 313 | if left > i: 314 | n_ += 1 315 | chunks.append((n, n_)) 316 | n = n_ 317 | return chunks 318 | 319 | def logL(self, coords, covar=None): 320 | """Log-likelihood of coords given all (i.e. the sum of) GMM components 321 | 322 | Distributes computation over all threads on the machine. 323 | 324 | If covar is None, this method returns 325 | log(sum_k(p(x | k))) 326 | of the data values x. If covar is set, the method returns 327 | log(sum_k(p(y | k))), 328 | where y = x + noise and noise ~ N(0, covar). 329 | 330 | Args: 331 | coords: numpy array (D,) or (N, D) of test coordinates 332 | covar: numpy array (D, D) or (N, D, D) covariance matrix of coords 333 | 334 | Returns: 335 | numpy array (1,) or (N, 1) log(L), depending on shape of data 336 | """ 337 | # Instead log p (x | k) for each k (which is huge) 338 | # compute it in stages: first for each chunk, then sum over all chunks 339 | pool = multiprocessing.Pool() 340 | chunks = self._get_chunks() 341 | results = [pool.apply_async(self._logsum_chunk, (chunk, coords, covar)) for chunk in chunks] 342 | log_p_y_chunk = [] 343 | for r in results: 344 | log_p_y_chunk.append(r.get()) 345 | pool.close() 346 | pool.join() 347 | return logsum(np.array(log_p_y_chunk)) # sum over all chunks = all k 348 | 349 | def _logsum_chunk(self, chunk, coords, covar=None): 350 | # helper function to reduce the memory requirement of logL 351 | log_p_y_k = np.empty((chunk[1]-chunk[0], len(coords))) 352 | for i in xrange(chunk[1] - chunk[0]): 353 | k = chunk[0] + i 354 | log_p_y_k[i,:] = self.logL_k(k, coords, covar=covar) 355 | return logsum(log_p_y_k) 356 | 357 | def logL_k(self, k, coords, covar=None, chi2_only=False): 358 | """Log-likelihood of coords given only component k. 
359 | 360 | Args: 361 | k (int): component index 362 | coords: numpy array (D,) or (N, D) of test coordinates 363 | covar: numpy array (D, D) or (N, D, D) covariance matrix of coords 364 | chi2_only (bool): only compute deltaX^T Sigma_k^-1 deltaX 365 | 366 | Returns: 367 | numpy array (1,) or (N, 1) log(L), depending on shape of data 368 | """ 369 | # compute p(x | k) 370 | dx = coords - self.mean[k] 371 | if covar is None: 372 | T_k = self.covar[k] 373 | else: 374 | T_k = self.covar[k] + covar 375 | chi2 = np.einsum('...i,...ij,...j', dx, np.linalg.inv(T_k), dx) 376 | 377 | if chi2_only: 378 | return chi2 379 | 380 | # prevent tiny negative determinants to mess up 381 | (sign, logdet) = np.linalg.slogdet(T_k) 382 | log2piD2 = np.log(2*np.pi)*(0.5*self.D) 383 | return np.log(self.amp[k]) - log2piD2 - sign*logdet/2 - chi2/2 384 | 385 | class Background(object): 386 | """Background object to be used in conjuction with GMM. 387 | 388 | For a normalizable uniform distribution, a support footprint must be set. 389 | It should be sufficiently large to explain all non-clusters samples. 390 | 391 | Attributes: 392 | amp (float): mixing amplitude 393 | footprint: numpy array, (D,2) of rectangular volume 394 | adjust_amp (bool): whether amp will be adjusted as part of the fit 395 | amp_max (float): maximum value of amp allowed if adjust_amp=True 396 | """ 397 | def __init__(self, footprint, amp=0): 398 | """Initialize Background with a footprint. 399 | 400 | Args: 401 | footprint: numpy array, (D,2) of rectangular volume 402 | 403 | Returns: 404 | None 405 | """ 406 | self.amp = amp 407 | self.footprint = footprint 408 | self.adjust_amp = True 409 | self.amp_max = 1 410 | self.amp_min = 0 411 | 412 | @property 413 | def p(self): 414 | """Probability of the background model. 415 | 416 | Returns: 417 | float, equal to 1/volume, where volume is given by footprint. 418 | """ 419 | volume = np.prod(self.footprint[1] - self.footprint[0]) 420 | return 1/volume 421 | 422 | def draw(self, size=1, rng=np.random): 423 | """Draw samples from uniform background. 424 | 425 | Args: 426 | size (int): number of samples to draw 427 | rng: numpy.random.RandomState for deterministic draw 428 | 429 | Returns: 430 | numpy array (size, D) 431 | """ 432 | dx = self.footprint[1] - self.footprint[0] 433 | return self.footprint[0] + dx*rng.rand(size,len(self.footprint[0])) 434 | 435 | 436 | ############################ 437 | # Begin of fit functions 438 | ############################ 439 | 440 | def initFromDataMinMax(gmm, data, covar=None, s=None, k=None, rng=np.random): 441 | """Initialization callback for uniform random component means. 442 | 443 | Component amplitudes are set at 1/gmm.K, covariances are set to 444 | s**2*np.eye(D), and means are distributed randomly over the range that is 445 | covered by data. 446 | 447 | If s is not given, it will be set such that the volume of all components 448 | completely fills the space covered by data. 
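    In that case (see the code below), s = (V / K)^(1/D) * Gamma(D/2 + 1)^(1/D) / sqrt(pi),
    where V is the volume of the bounding box of the data, so that K spheres of
    radius s have a total volume of V.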
449 | 450 | Args: 451 | gmm: A GMM to be initialized 452 | data: numpy array (N,D) to define the range of the component means 453 | covar: ignored in this callback 454 | s (float): if set, sets component variances 455 | k (iterable): list of components to set, is None sets all components 456 | rng: numpy.random.RandomState for deterministic behavior 457 | 458 | Returns: 459 | None 460 | """ 461 | if k is None: 462 | k = slice(None) 463 | gmm.amp[k] = 1/gmm.K 464 | # set model to random positions with equally sized spheres within 465 | # volumne spanned by data 466 | min_pos = data.min(axis=0) 467 | max_pos = data.max(axis=0) 468 | gmm.mean[k,:] = min_pos + (max_pos-min_pos)*rng.rand(gmm.K, gmm.D) 469 | # if s is not set: use volume filling argument: 470 | # K spheres of radius s [having volume s^D * pi^D/2 / gamma(D/2+1)] 471 | # should completely fill the volume spanned by data. 472 | if s is None: 473 | vol_data = np.prod(max_pos-min_pos) 474 | s = (vol_data / gmm.K * scipy.special.gamma(gmm.D*0.5 + 1))**(1/gmm.D) / np.sqrt(np.pi) 475 | logger.info("initializing spheres with s=%.2f in data domain" % s) 476 | 477 | gmm.covar[k,:,:] = s**2 * np.eye(data.shape[1]) 478 | 479 | def initFromDataAtRandom(gmm, data, covar=None, s=None, k=None, rng=np.random): 480 | """Initialization callback for component means to follow data on scales > s. 481 | 482 | Component amplitudes are set to 1/gmm.K, covariances are set to 483 | s**2*np.eye(D). For each mean, a data sample is selected at random, and a 484 | multivariant Gaussian offset is added, whose variance is given by s**2. 485 | 486 | If s is not given, it will be set such that the volume of all components 487 | completely fills the space covered by data. 488 | 489 | Args: 490 | gmm: A GMM to be initialized 491 | data: numpy array (N,D) to define the range of the component means 492 | covar: ignored in this callback 493 | s (float): if set, sets component variances 494 | k (iterable): list of components to set, is None sets all components 495 | rng: numpy.random.RandomState for deterministic behavior 496 | 497 | Returns: 498 | None 499 | """ 500 | if k is None: 501 | k = slice(None) 502 | k_len = gmm.K 503 | else: 504 | try: 505 | k_len = len(gmm.amp[k]) 506 | except TypeError: 507 | k_len = 1 508 | gmm.amp[k] = 1/gmm.K 509 | # initialize components around data points with uncertainty s 510 | refs = rng.randint(0, len(data), size=k_len) 511 | D = data.shape[1] 512 | if s is None: 513 | min_pos = data.min(axis=0) 514 | max_pos = data.max(axis=0) 515 | vol_data = np.prod(max_pos-min_pos) 516 | s = (vol_data / gmm.K * scipy.special.gamma(gmm.D*0.5 + 1))**(1/gmm.D) / np.sqrt(np.pi) 517 | logger.info("initializing spheres with s=%.2f near data points" % s) 518 | 519 | gmm.mean[k,:] = data[refs] + rng.multivariate_normal(np.zeros(D), s**2 * np.eye(D), size=k_len) 520 | gmm.covar[k,:,:] = s**2 * np.eye(data.shape[1]) 521 | 522 | def initFromKMeans(gmm, data, covar=None, rng=np.random): 523 | """Initialization callback from a k-means clustering run. 524 | 525 | See Algorithm 1 from Bloemer & Bujna (arXiv:1312.5946) 526 | NOTE: The result of this call are not deterministic even if rng is set 527 | because scipy.cluster.vq.kmeans2 uses its own initialization. 
528 | 529 | Args: 530 | gmm: A GMM to be initialized 531 | data: numpy array (N,D) to define the range of the component means 532 | covar: ignored in this callback 533 | rng: numpy.random.RandomState for deterministic behavior 534 | 535 | Returns: 536 | None 537 | """ 538 | from scipy.cluster.vq import kmeans2 539 | center, label = kmeans2(data, gmm.K) 540 | for k in xrange(gmm.K): 541 | mask = (label == k) 542 | gmm.amp[k] = mask.sum() / len(data) 543 | gmm.mean[k,:] = data[mask].mean(axis=0) 544 | d_m = data[mask] - gmm.mean[k] 545 | # funny way of saying: for each point i, do the outer product 546 | # of d_m with its transpose and sum over i 547 | gmm.covar[k,:,:] = (d_m[:, :, None] * d_m[:, None, :]).sum(axis=0) / len(data) 548 | 549 | 550 | def fit(gmm, data, covar=None, R=None, init_method='random', w=0., cutoff=None, sel_callback=None, oversampling=10, covar_callback=None, background=None, tol=1e-3, miniter=1, maxiter=1000, frozen=None, split_n_merge=False, rng=np.random): 551 | """Fit GMM to data. 552 | 553 | If given, init_callback is called to set up the GMM components. Then, the 554 | EM sequence is repeated until the mean log-likelihood converges within tol. 555 | 556 | Args: 557 | gmm: an instance if GMM 558 | data: numpy array (N,D) 559 | covar: sample noise covariance; numpy array (N,D,D) or (D,D) if i.i.d. 560 | R: sample projection matrix; numpy array (N,D,D) 561 | init_method (string): one of ['random', 'minmax', 'kmeans', 'none'] 562 | defines the method to initialize the GMM components 563 | w (float): minimum covariance regularization 564 | cutoff (float): size of component neighborhood [in 1D equivalent sigmas] 565 | sel_callback: completeness callback to generate imputation samples. 566 | oversampling (int): number of imputation samples per data sample. 567 | only used if sel_callback is set. 568 | value of 1 is fine but results are noisy. Set as high as feasible. 569 | covar_callback: covariance callback for imputation samples. 570 | needs to be present if sel_callback and covar are set. 571 | background: an instance of Background if simultaneous fitting is desired 572 | tol (float): tolerance for covergence of mean log-likelihood 573 | maxiter (int): maximum number of iterations of EM 574 | frozen (iterable or dict): index list of components that are not updated 575 | split_n_merge (int): number of split & merge attempts 576 | rng: numpy.random.RandomState for deterministic behavior 577 | 578 | Notes: 579 | If frozen is a simple list, it will be assumed that is applies to mean 580 | and covariance of the specified components. It can also be a dictionary 581 | with the keys "mean" and "covar" to specify them separately. 582 | In either case, amplitudes will be updated to reflect any changes made. 583 | If frozen["amp"] is set, it will use this list instead. 584 | 585 | Returns: 586 | mean log-likelihood (float), component neighborhoods (list of ints) 587 | 588 | Throws: 589 | RuntimeError for inconsistent argument combinations 590 | """ 591 | 592 | N = len(data) 593 | # if there are data (features) missing, i.e. 
masked as np.nan, set them to zeros
594 |     # and create/set covariance elements to very large value to reduce its weight
595 |     # to effectively zero
596 |     missing = np.isnan(data)
597 |     if missing.any():
598 |         data_ = createShared(data.copy())
599 |         data_[missing] = 0 # value does not matter as long as it's not nan
600 |         if covar is None:
601 |             covar = np.zeros((gmm.D, gmm.D))
602 |             # need to create covar_callback if imputation is requested
603 |             if sel_callback is not None:
604 |                 from functools import partial
605 |                 covar_callback = partial(covar_callback_default, default=np.zeros((gmm.D, gmm.D)))
606 |         if covar.shape == (gmm.D, gmm.D):
607 |             covar_ = createShared(np.tile(covar, (N,1,1)))
608 |         else:
609 |             covar_ = createShared(covar.copy())
610 | 
611 |         large = 1e10
612 |         for d in range(gmm.D):
613 |             covar_[missing[:,d],d,d] += large
615 |     else:
616 |         data_ = createShared(data.copy())
617 |         if covar is None or covar.shape == (gmm.D, gmm.D):
618 |             covar_ = covar
619 |         else:
620 |             covar_ = createShared(covar.copy())
621 | 
622 |     # init components
623 |     if init_method.lower() not in ['random', 'minmax', 'kmeans', 'none']:
624 |         raise NotImplementedError("init_method %s not in ['random', 'minmax', 'kmeans', 'none']" % init_method)
625 |     if init_method.lower() == 'random':
626 |         initFromDataAtRandom(gmm, data_, covar=covar_, rng=rng)
627 |     if init_method.lower() == 'minmax':
628 |         initFromDataMinMax(gmm, data_, covar=covar_, rng=rng)
629 |     if init_method.lower() == 'kmeans':
630 |         initFromKMeans(gmm, data_, covar=covar_, rng=rng)
631 | 
632 |     # test if callbacks are consistent
633 |     if sel_callback is not None and covar is not None and covar_callback is None:
634 |         raise NotImplementedError("covar is set, but covar_callback is None: imputation samples inconsistent")
635 | 
636 |     # set up pool
637 |     pool = multiprocessing.Pool()
638 |     n_chunks, chunksize = gmm._mp_chunksize()
639 | 
640 |     # containers
641 |     # precautions for cases when some points are treated as outliers
642 |     # and not considered as belonging to any component
643 |     log_S = createShared(np.zeros(N))     # S = sum_k p(x|k)
644 |     log_p = [[] for k in xrange(gmm.K)]   # P = p(x|k) for x in U[k]
645 |     T_inv = [None for k in xrange(gmm.K)] # T = covar(x) + gmm.covar[k]
646 |     U = [None for k in xrange(gmm.K)]     # U = {x close to k}
647 |     p_bg = None
648 |     if background is not None:
649 |         gmm.amp *= 1 - background.amp # GMM amp + BG amp = 1
650 |         p_bg = [None] # p_bg = p(x|BG), no log because values are larger
651 |         if covar is not None:
652 |             # check if covar is diagonal and issue warning if not
653 |             mess = "background model will only consider diagonal elements of covar"
654 |             nondiag = ~np.eye(gmm.D, dtype='bool')
655 |             if covar.shape == (gmm.D, gmm.D):
656 |                 if (covar[nondiag] != 0).any():
657 |                     logger.warning(mess)
658 |             else:
659 |                 if (covar[np.tile(nondiag,(N,1,1))] != 0).any():
660 |                     logger.warning(mess)
661 | 
662 |     # check if all component parameters can be changed
663 |     changeable = {"amp": slice(None), "mean": slice(None), "covar": slice(None)}
664 |     if frozen is not None:
665 |         if all(isinstance(item, int) for item in frozen):
666 |             changeable['amp'] = changeable['mean'] = changeable['covar'] = np.in1d(xrange(gmm.K), frozen, assume_unique=True, invert=True)
667 |         elif hasattr(frozen, 'keys') and np.in1d(["amp","mean","covar"], tuple(frozen.keys()), assume_unique=True).any():
668 |             if "amp" in frozen.keys():
669 |                 changeable['amp'] = np.in1d(xrange(gmm.K), frozen['amp'], assume_unique=True,
invert=True) 670 | if "mean" in frozen.keys(): 671 | changeable['mean'] = np.in1d(xrange(gmm.K), frozen['mean'], assume_unique=True, invert=True) 672 | if "covar" in frozen.keys(): 673 | changeable['covar'] = np.in1d(xrange(gmm.K), frozen['covar'], assume_unique=True, invert=True) 674 | else: 675 | raise NotImplementedError("frozen should be list of indices or dictionary with keys in ['amp','mean','covar']") 676 | 677 | try: 678 | log_L, N, N2 = _EM(gmm, log_p, U, T_inv, log_S, data_, covar=covar_, R=R, sel_callback=sel_callback, oversampling=oversampling, covar_callback=covar_callback, w=w, pool=pool, chunksize=chunksize, cutoff=cutoff, background=background, p_bg=p_bg, changeable=changeable, miniter=miniter, maxiter=maxiter, tol=tol, rng=rng) 679 | except Exception: 680 | # cleanup 681 | pool.close() 682 | pool.join() 683 | del data_, covar_, log_S 684 | raise 685 | 686 | # should we try to improve by split'n'merge of components? 687 | # if so, keep backup copy 688 | gmm_ = None 689 | if frozen is not None and split_n_merge: 690 | logger.warning("forgoing split'n'merge because some components are frozen") 691 | else: 692 | while split_n_merge and gmm.K >= 3: 693 | 694 | if gmm_ is None: 695 | gmm_ = GMM(gmm.K, gmm.D) 696 | 697 | gmm_.amp[:] = gmm.amp[:] 698 | gmm_.mean[:] = gmm.mean[:,:] 699 | gmm_.covar[:,:,:] = gmm.covar[:,:,:] 700 | U_ = [U[k].copy() for k in xrange(gmm.K)] 701 | 702 | changing, cleanup = _findSNMComponents(gmm, U, log_p, log_S, N+N2, pool=pool, chunksize=chunksize) 703 | logger.info("merging %d and %d, splitting %d" % tuple(changing)) 704 | 705 | # modify components 706 | _update_snm(gmm, changing, U, N+N2, cleanup) 707 | 708 | # run partial EM on changeable components 709 | # NOTE: for a partial run, we'd only need the change to Log_S from the 710 | # changeable components. However, the neighborhoods can change from _update_snm 711 | # or because they move, so that operation is ill-defined. 712 | # Thus, we'll always run a full E-step, which is pretty cheap for 713 | # converged neighborhood. 714 | # The M-step could in principle be run on the changeable components only, 715 | # but there seem to be side effects in what I've tried. 716 | # Similar to the E-step, the imputation step needs to be run on all 717 | # components, otherwise the contribution of the changeable ones to the mixture 718 | # would be over-estimated. 719 | # Effectively, partial runs are as expensive as full runs. 
720 | 721 | changeable['amp'] = changeable['mean'] = changeable['covar'] = np.in1d(xrange(gmm.K), changing, assume_unique=True) 722 | log_L_, N_, N2_ = _EM(gmm, log_p, U, T_inv, log_S, data_, covar=covar_, R=R, sel_callback=sel_callback, oversampling=oversampling, covar_callback=covar_callback, w=w, pool=pool, chunksize=chunksize, cutoff=cutoff, background=background, p_bg=p_bg, maxiter=maxiter, tol=tol, prefix="SNM_P", changeable=changeable, rng=rng) 723 | 724 | changeable['amp'] = changeable['mean'] = changeable['covar'] = slice(None) 725 | log_L_, N_, N2_ = _EM(gmm, log_p, U, T_inv, log_S, data_, covar=covar_, R=R, sel_callback=sel_callback, oversampling=oversampling, covar_callback=covar_callback, w=w, pool=pool, chunksize=chunksize, cutoff=cutoff, background=background, p_bg=p_bg, maxiter=maxiter, tol=tol, prefix="SNM_F", changeable=changeable, rng=rng) 726 | 727 | if log_L >= log_L_: 728 | # revert to backup 729 | gmm.amp[:] = gmm_.amp[:] 730 | gmm.mean[:] = gmm_.mean[:,:] 731 | gmm.covar[:,:,:] = gmm_.covar[:,:,:] 732 | U = U_ 733 | logger.info ("split'n'merge likelihood decreased: reverting to previous model") 734 | break 735 | 736 | log_L = log_L_ 737 | split_n_merge -= 1 738 | 739 | pool.close() 740 | pool.join() 741 | del data_, covar_, log_S 742 | return log_L, U 743 | 744 | # run EM sequence 745 | def _EM(gmm, log_p, U, T_inv, log_S, data, covar=None, R=None, sel_callback=None, oversampling=10, covar_callback=None, background=None, p_bg=None, w=0, pool=None, chunksize=1, cutoff=None, miniter=1, maxiter=1000, tol=1e-3, prefix="", changeable=None, rng=np.random): 746 | 747 | # compute effective cutoff for chi2 in D dimensions 748 | if cutoff is not None: 749 | # note: subsequently the cutoff parameter, e.g. in _E(), refers to this: 750 | # chi2 < cutoff, 751 | # while in fit() it means e.g. "cut at 3 sigma". 752 | # These differing conventions need to be documented well. 753 | cutoff_nd = chi2_cutoff(gmm.D, cutoff=cutoff) 754 | 755 | # store chi2 cutoff for component shifts, use 0.5 sigma 756 | shift_cutoff = chi2_cutoff(gmm.D, cutoff=min(0.1, cutoff/2)) 757 | else: 758 | cutoff_nd = None 759 | shift_cutoff = chi2_cutoff(gmm.D, cutoff=0.1) 760 | 761 | if sel_callback is not None: 762 | omega = createShared(sel_callback(data).astype("float")) 763 | if np.any(omega == 0): 764 | logger.warning("Selection probability Omega = 0 for an observed sample.") 765 | logger.warning("Selection callback likely incorrect! 
Bad things will happen!") 766 | else: 767 | omega = None 768 | 769 | it = 0 770 | header = "ITER\tSAMPLES" 771 | if sel_callback is not None: 772 | header += "\tIMPUTED\tORIG" 773 | if background is not None: 774 | header += "\tBG_AMP" 775 | header += "\tLOG_L\tSTABLE" 776 | logger.info(header) 777 | 778 | # save backup 779 | gmm_ = GMM(gmm.K, gmm.D) 780 | gmm_.amp[:] = gmm.amp[:] 781 | gmm_.mean[:,:] = gmm.mean[:,:] 782 | gmm_.covar[:,:,:] = gmm.covar[:,:,:] 783 | N0 = len(data) # size of original (unobscured) data set (signal and background) 784 | N2 = 0 # size of imputed signal sample 785 | if background is not None: 786 | bg_amp_ = background.amp 787 | 788 | while it < maxiter: # limit loop in case of slow convergence 789 | log_L_, N, N2_, N0_ = _EMstep(gmm, log_p, U, T_inv, log_S, N0, data, covar=covar, R=R, sel_callback=sel_callback, omega=omega, oversampling=oversampling, covar_callback=covar_callback, background=background, p_bg=p_bg , w=w, pool=pool, chunksize=chunksize, cutoff=cutoff_nd, tol=tol, changeable=changeable, it=it, rng=rng) 790 | 791 | # check if component has moved by more than sigma/2 792 | shift2 = np.einsum('...i,...ij,...j', gmm.mean - gmm_.mean, np.linalg.inv(gmm_.covar), gmm.mean - gmm_.mean) 793 | moved = np.flatnonzero(shift2 > shift_cutoff) 794 | status_mess = "%s%d\t%d" % (prefix, it, N) 795 | if sel_callback is not None: 796 | status_mess += "\t%.2f\t%.2f" % (N2_, N0_) 797 | if background is not None: 798 | status_mess += "\t%.3f" % bg_amp_ 799 | status_mess += "\t%.3f\t%d" % (log_L_, gmm.K - moved.size) 800 | logger.info(status_mess) 801 | 802 | # convergence tests 803 | if it > miniter: 804 | if sel_callback is None: 805 | if np.abs(log_L_ - log_L) < tol * np.abs(log_L) and moved.size == 0: 806 | log_L = log_L_ 807 | logger.info("likelihood converged within relative tolerance %r: stopping here." % tol) 808 | break 809 | else: 810 | if np.abs(N0_ - N0) < tol * N0 and np.abs(N2_ - N2) < tol * N2 and moved.size == 0: 811 | log_L = log_L_ 812 | logger.info("imputation sample size converged within relative tolerance %r: stopping here." % tol) 813 | break 814 | 815 | # force update to U for all moved components 816 | if cutoff is not None: 817 | for k in moved: 818 | U[k] = None 819 | 820 | if moved.size: 821 | logger.debug("resetting neighborhoods of moving components: (" + ("%d," * moved.size + ")") % tuple(moved)) 822 | 823 | # update all important _ quantities for convergence test(s) 824 | log_L = log_L_ 825 | N0 = N0_ 826 | N2 = N2_ 827 | 828 | # backup to see if components move or if next step gets worse 829 | # note: not gmm = gmm_ ! 830 | gmm_.amp[:] = gmm.amp[:] 831 | gmm_.mean[:,:] = gmm.mean[:,:] 832 | gmm_.covar[:,:,:] = gmm.covar[:,:,:] 833 | if background is not None: 834 | bg_amp_ = background.amp 835 | 836 | it += 1 837 | 838 | return log_L, N, N2 839 | 840 | # run one EM step 841 | def _EMstep(gmm, log_p, U, T_inv, log_S, N0, data, covar=None, R=None, sel_callback=None, omega=None, oversampling=10, covar_callback=None, background=None, p_bg=None, w=0, pool=None, chunksize=1, cutoff=None, tol=1e-3, changeable=None, it=0, rng=np.random): 842 | 843 | # NOTE: T_inv (in fact (T_ik)^-1 for all samples i and components k) 844 | # is very large and is unfortunately duplicated in the parallelized _Mstep. 845 | # If memory is too limited, one can recompute T_inv in _Msums() instead. 
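    # One iteration in full: E-step over the observed samples, M-step moment
    # sums (A, M, C), and, if sel_callback is set, a second E/M pass over
    # imputed samples drawn from the current model with inverted selection;
    # _update() then combines the moments of both sets.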
846 | log_L = _Estep(gmm, log_p, U, T_inv, log_S, data, covar=covar, R=R, omega=omega, background=background, p_bg=p_bg, pool=pool, chunksize=chunksize, cutoff=cutoff, it=it) 847 | A,M,C,N,B = _Mstep(gmm, U, log_p, T_inv, log_S, data, covar=covar, R=R, p_bg=p_bg, pool=pool, chunksize=chunksize) 848 | 849 | A2 = M2 = C2 = B2 = N2 = 0 850 | 851 | # here the magic happens: imputation from the current model 852 | if sel_callback is not None: 853 | 854 | # if there are projections / missing data, we don't know how to 855 | # generate those for the imputation samples 856 | # NOTE: in principle, if there are only missing data, i.e. R is 1_D, 857 | # we could ignore missingness for data2 because we'll do an analytic 858 | # marginalization. This doesn't work if R is a non-trivial matrix. 859 | if R is not None: 860 | raise NotImplementedError("R is not None: imputation samples likely inconsistent") 861 | 862 | # create fake data with same mechanism as the original data, 863 | # but invert selection to get the missing part 864 | data2, covar2, N0, omega2 = draw(gmm, len(data)*oversampling, sel_callback=sel_callback, orig_size=N0*oversampling, invert_sel=True, covar_callback=covar_callback, background=background, rng=rng) 865 | data2 = createShared(data2) 866 | if not(covar2 is None or covar2.shape == (gmm.D, gmm.D)): 867 | covar2 = createShared(covar2) 868 | 869 | N0 = N0/oversampling 870 | U2 = [None for k in xrange(gmm.K)] 871 | 872 | if len(data2) > 0: 873 | log_S2 = np.zeros(len(data2)) 874 | log_p2 = [[] for k in xrange(gmm.K)] 875 | T2_inv = [None for k in xrange(gmm.K)] 876 | R2 = None 877 | if background is not None: 878 | p_bg2 = [None] 879 | else: 880 | p_bg2 = None 881 | 882 | log_L2 = _Estep(gmm, log_p2, U2, T2_inv, log_S2, data2, covar=covar2, R=R2, omega=None, background=background, p_bg=p_bg2, pool=pool, chunksize=chunksize, cutoff=cutoff, it=it) 883 | A2,M2,C2,N2,B2 = _Mstep(gmm, U2, log_p2, T2_inv, log_S2, data2, covar=covar2, R=R2, p_bg=p_bg2, pool=pool, chunksize=chunksize) 884 | 885 | # normalize for oversampling 886 | A2 /= oversampling 887 | M2 /= oversampling 888 | C2 /= oversampling 889 | B2 /= oversampling 890 | N2 = N2/oversampling # need floating point precision in update 891 | 892 | # check if components have outside selection 893 | sel_outside = A2 > tol * A 894 | if sel_outside.any(): 895 | logger.debug("component inside fractions: " + ("(" + "%.2f," * gmm.K + ")") % tuple(A/(A+A2))) 896 | 897 | # correct the observed likelihood for the overall normalization constant of 898 | # of the data process with selection: 899 | # logL(x | gmm) = sum_k p_k(x) / Z(gmm), with Z(gmm) = int dx sum_k p_k(x) = 1 900 | # becomes 901 | # logL(x | gmm) = sum_k Omega(x) p_k(x) / Z'(gmm), 902 | # with Z'(gmm) = int dx Omega(x) sum_k p_k(x), which we can gt by MC integration 903 | log_L -= N * np.log((omega.sum() + omega2.sum() / oversampling) / (N + N2)) 904 | 905 | _update(gmm, A, M, C, N, B, A2, M2, C2, N2, B2, w, changeable=changeable, background=background) 906 | 907 | return log_L, N, N2, N0 908 | 909 | # perform E step calculations. 
910 | # If cutoff is set, this will also set the neighborhoods U 911 | def _Estep(gmm, log_p, U, T_inv, log_S, data, covar=None, R=None, omega=None, background=None, p_bg=None, pool=None, chunksize=1, cutoff=None, it=0, rng=np.random): 912 | # compute p(i | k) for each k independently in the pool 913 | # need S = sum_k p(i | k) for further calculation 914 | log_S[:] = 0 915 | 916 | # H = {i | i in neighborhood[k]} for any k, needed for outliers below 917 | # TODO: Use only when cutoff is set 918 | H = np.zeros(len(data), dtype="bool") 919 | 920 | k = 0 921 | for log_p[k], U[k], T_inv[k] in \ 922 | parmap.starmap(_Esum, zip(xrange(gmm.K), U), gmm, data, covar, R, cutoff, pm_pool=pool, pm_chunksize=chunksize): 923 | log_S[U[k]] += np.exp(log_p[k]) # actually S, not logS 924 | H[U[k]] = 1 925 | k += 1 926 | 927 | if background is not None: 928 | p_bg[0] = background.amp * background.p 929 | if covar is not None: 930 | # This is the zeroth moment of a truncated Normal error distribution 931 | # Its calculation is simple only of the covariance is diagonal! 932 | # See e.g. Manjunath & Wilhem (2012) if not 933 | error = np.ones(len(data)) 934 | x0,x1 = background.footprint 935 | for d in range(gmm.D): 936 | if covar.shape == (gmm.D, gmm.D): # one-for-all 937 | denom = np.sqrt(2 * covar[d,d]) 938 | else: 939 | denom = np.sqrt(2 * covar[:,d,d]) 940 | # CAUTION: The erf is approximate and returns 0 941 | # Thus, we don't add the logs but multiple the value itself 942 | # underrun is not a big problem here 943 | error *= np.real(scipy.special.erf((data[:,d] - x0[d])/denom) - scipy.special.erf((data[:,d] - x1[d])/denom)) / 2 944 | p_bg[0] *= error 945 | log_S[:] = np.log(log_S + p_bg[0]) 946 | if omega is not None: 947 | log_S += np.log(omega) 948 | log_L = log_S.sum() 949 | else: 950 | # need log(S), but since log(0) isn't a good idea, need to restrict to H 951 | log_S[H] = np.log(log_S[H]) 952 | if omega is not None: 953 | log_S += np.log(omega) 954 | log_L = log_S[H].sum() 955 | 956 | return log_L 957 | 958 | # compute chi^2, and apply selections on component neighborhood based in chi^2 959 | def _Esum(k, U_k, gmm, data, covar=None, R=None, cutoff=None): 960 | # since U_k could be None, need explicit reshape 961 | d_ = data[U_k].reshape(-1, gmm.D) 962 | if covar is not None: 963 | if covar.shape == (gmm.D, gmm.D): # one-for-all 964 | covar_ = covar 965 | else: # each datum has covariance 966 | covar_ = covar[U_k].reshape(-1, gmm.D, gmm.D) 967 | else: 968 | covar_ = 0 969 | if R is not None: 970 | R_ = R[U_k].reshape(-1, gmm.D, gmm.D) 971 | 972 | # p(x | k) for all x in the vicinity of k 973 | # determine all points within cutoff sigma from mean[k] 974 | if R is None: 975 | dx = d_ - gmm.mean[k] 976 | else: 977 | dx = d_ - np.dot(R_, gmm.mean[k]) 978 | 979 | if covar is None and R is None: 980 | T_inv_k = None 981 | chi2 = np.einsum('...i,...ij,...j', dx, np.linalg.inv(gmm.covar[k]), dx) 982 | else: 983 | # with data errors: need to create and return T_ik = covar_i + C_k 984 | # and weight each datum appropriately 985 | if R is None: 986 | T_inv_k = np.linalg.inv(gmm.covar[k] + covar_) 987 | else: # need to project out missing elements: T_ik = R_i C_k R_i^R + covar_i 988 | T_inv_k = np.linalg.inv(np.einsum('...ij,jk,...lk', R_, gmm.covar[k], R_) + covar_) 989 | chi2 = np.einsum('...i,...ij,...j', dx, T_inv_k, dx) 990 | 991 | # NOTE: close to convergence, we could stop applying the cutoff because 992 | # changes to U will be minimal 993 | if cutoff is not None: 994 | indices = chi2 < cutoff 995 | 
chi2 = chi2[indices] 996 | if (covar is not None and covar.shape != (gmm.D, gmm.D)) or R is not None: 997 | T_inv_k = T_inv_k[indices] 998 | if U_k is None: 999 | U_k = np.flatnonzero(indices) 1000 | else: 1001 | U_k = U_k[indices] 1002 | 1003 | # prevent tiny negative determinants to mess up 1004 | if covar is None: 1005 | (sign, logdet) = np.linalg.slogdet(gmm.covar[k]) 1006 | else: 1007 | (sign, logdet) = np.linalg.slogdet(T_inv_k) 1008 | sign *= -1 # since det(T^-1) = 1/det(T) 1009 | 1010 | log2piD2 = np.log(2*np.pi)*(0.5*gmm.D) 1011 | return np.log(gmm.amp[k]) - log2piD2 - sign*logdet/2 - chi2/2, U_k, T_inv_k 1012 | 1013 | # get zeroth, first, second moments of the data weighted with p_k(x) avgd over x 1014 | def _Mstep(gmm, U, log_p, T_inv, log_S, data, covar=None, R=None, p_bg=None, pool=None, chunksize=1): 1015 | 1016 | # save the M sums from observed data 1017 | A = np.empty(gmm.K) # sum for amplitudes 1018 | M = np.empty((gmm.K, gmm.D)) # ... means 1019 | C = np.empty((gmm.K, gmm.D, gmm.D)) # ... covariances 1020 | N = len(data) 1021 | 1022 | # perform sums for M step in the pool 1023 | # NOTE: in a partial run, could work on changeable components only; 1024 | # however, there seem to be side effects or race conditions 1025 | k = 0 1026 | for A[k], M[k,:], C[k,:,:] in \ 1027 | parmap.starmap(_Msums, zip(xrange(gmm.K), U, log_p, T_inv), gmm, data, R, log_S, pm_pool=pool, pm_chunksize=chunksize): 1028 | k += 1 1029 | 1030 | if p_bg is not None: 1031 | q_bg = p_bg[0] / np.exp(log_S) 1032 | B = q_bg.sum() # equivalent to A_k in _Msums, but done without logs 1033 | else: 1034 | B = 0 1035 | 1036 | return A,M,C,N,B 1037 | 1038 | # compute moments for the Mstep 1039 | def _Msums(k, U_k, log_p_k, T_inv_k, gmm, data, R, log_S): 1040 | if log_p_k.size == 0: 1041 | return 0,0,0 1042 | 1043 | # get log_q_ik by dividing with S = sum_k p_ik 1044 | # NOTE: this modifies log_p_k in place, but is only relevant 1045 | # within this method since the call is parallel and its arguments 1046 | # therefore don't get updated across components. 1047 | 1048 | # NOTE: reshape needed when U_k is None because of its 1049 | # implicit meaning as np.newaxis 1050 | log_p_k -= log_S[U_k].reshape(log_p_k.size) 1051 | d = data[U_k].reshape((log_p_k.size, gmm.D)) 1052 | if R is not None: 1053 | R_ = R[U_k].reshape((log_p_k.size, gmm.D, gmm.D)) 1054 | 1055 | # amplitude: A_k = sum_i q_ik 1056 | A_k = np.exp(logsum(log_p_k)) 1057 | 1058 | # in fact: q_ik, but we treat sample index i silently everywhere 1059 | q_k = np.exp(log_p_k) 1060 | 1061 | if R is None: 1062 | d_m = d - gmm.mean[k] 1063 | else: 1064 | d_m = d - np.dot(R_, gmm.mean[k]) 1065 | 1066 | # data with errors? 
1067 | if T_inv_k is None and R is None: 1068 | # mean: M_k = sum_i x_i q_ik 1069 | M_k = (d * q_k[:,None]).sum(axis=0) 1070 | 1071 | # covariance: C_k = sum_i (x_i - mu_k)^T(x_i - mu_k) q_ik 1072 | # funny way of saying: for each point i, do the outer product 1073 | # of d_m with its transpose, multiply with pi[i], and sum over i 1074 | C_k = (q_k[:, None, None] * d_m[:, :, None] * d_m[:, None, :]).sum(axis=0) 1075 | else: 1076 | if R is None: # that means T_ik is not None 1077 | # b_ik = mu_k + C_k T_ik^-1 (x_i - mu_k) 1078 | # B_ik = C_k - C_k T_ik^-1 C_k 1079 | b_k = gmm.mean[k] + np.einsum('ij,...jk,...k', gmm.covar[k], T_inv_k, d_m) 1080 | B_k = gmm.covar[k] - np.einsum('ij,...jk,...kl', gmm.covar[k], T_inv_k, gmm.covar[k]) 1081 | else: 1082 | # F_ik = C_k R_i^T T_ik^-1 1083 | F_k = np.einsum('ij,...kj,...kl', gmm.covar[k], R_, T_inv_k) 1084 | b_k = gmm.mean[k] + np.einsum('...ij,...j', F_k, d_m) 1085 | B_k = gmm.covar[k] - np.einsum('...ij,...jk,kl', F_k, R_, gmm.covar[k]) 1086 | 1087 | #b_k = gmm.mean[k] + np.einsum('ij,...jk,...k', gmm.covar[k], T_inv_k, d_m) 1088 | #B_k = gmm.covar[k] - np.einsum('ij,...jk,...kl', gmm.covar[k], T_inv_k, gmm.covar[k]) 1089 | M_k = (b_k * q_k[:,None]).sum(axis=0) 1090 | b_k -= gmm.mean[k] 1091 | C_k = (q_k[:, None, None] * (b_k[:, :, None] * b_k[:, None, :] + B_k)).sum(axis=0) 1092 | return A_k, M_k, C_k 1093 | 1094 | 1095 | # update component with the moment matrices. 1096 | # If changeable is set, update only those components and renormalize the amplitudes 1097 | def _update(gmm, A, M, C, N, B, A2, M2, C2, N2, B2, w, changeable=None, background=None): 1098 | 1099 | # recompute background amplitude 1100 | if background is not None and background.adjust_amp: 1101 | background.amp = max(min((B + B2) / (N + N2), background.amp_max), background.amp_min) 1102 | 1103 | # amp update: 1104 | # for partial update: need to update amp for any component that is changeable 1105 | if not hasattr(changeable['amp'], '__iter__'): # it's a slice(None), not a bool array 1106 | gmm.amp[changeable['amp']] = (A + A2)[changeable['amp']] / (N + N2) 1107 | else: 1108 | # Bovy eq. 31, with correction for bg.amp if needed 1109 | if background is None: 1110 | total = 1 1111 | else: 1112 | total = 1 - background.amp 1113 | gmm.amp[changeable['amp']] = (A + A2)[changeable['amp']] / (A + A2)[changeable['amp']].sum() * (total - (gmm.amp[~changeable['amp']]).sum()) 1114 | 1115 | # mean updateL 1116 | gmm.mean[changeable['mean'],:] = (M + M2)[changeable['mean'],:]/(A + A2)[changeable['mean'],None] 1117 | 1118 | # covar updateL 1119 | # minimum covariance term? 1120 | if w > 0: 1121 | # we assume w to be a lower bound of the isotropic dispersion, 1122 | # C_k = w^2 I + ... 1123 | # then eq. 38 in Bovy et al. only ~works for N = 0 because of the 1124 | # prefactor 1 / (q_j + 1) = 1 / (A + 1) in our terminology 1125 | # On average, q_j = N/K, so we'll adopt that to correct. 
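        # i.e. w_eff = w^2 * ((N+N2)/K + 1); with the 1/(A + 1) prefactor in
        # the update below, this keeps the isotropic floor of each covariance at ~ w^2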
1126 | w_eff = w**2 * ((N+N2)/gmm.K + 1) 1127 | gmm.covar[changeable['covar'],:,:] = (C + C2 + w_eff*np.eye(gmm.D)[None,:,:])[changeable['covar'],:,:] / (A + A2 + 1)[changeable['covar'],None,None] 1128 | else: 1129 | gmm.covar[changeable['covar'],:,:] = (C + C2)[changeable['covar'],:,:] / (A + A2)[changeable['covar'],None,None] 1130 | 1131 | # draw from the model (+ background) and apply appropriate covariances 1132 | def _drawGMM_BG(gmm, size, covar_callback=None, background=None, rng=np.random): 1133 | # draw sample from model, or from background+model 1134 | if background is None: 1135 | data2 = gmm.draw(int(np.round(size)), rng=rng) 1136 | else: 1137 | # model is GMM + Background 1138 | bg_size = int(background.amp * size) 1139 | data2 = np.concatenate((gmm.draw(int(np.round(size-bg_size)), rng=rng), background.draw(int(np.round(bg_size)), rng=rng))) 1140 | 1141 | # add noise 1142 | # NOTE: When background is set, adding noise is problematic if 1143 | # scattering them out is more likely than in. 1144 | # This can be avoided when the background footprint is large compared to 1145 | # selection region 1146 | if covar_callback is not None: 1147 | covar2 = covar_callback(data2) 1148 | if covar2.shape == (gmm.D, gmm.D): # one-for-all 1149 | noise = rng.multivariate_normal(np.zeros(gmm.D), covar2, size=len(data2)) 1150 | else: 1151 | # create noise from unit covariance and then dot with eigenvalue 1152 | # decomposition of covar2 to get a the right noise distribution: 1153 | # n' = R V^1/2 n, where covar = R V R^-1 1154 | # faster than drawing one sample per each covariance 1155 | noise = rng.multivariate_normal(np.zeros(gmm.D), np.eye(gmm.D), size=len(data2)) 1156 | val, rot = np.linalg.eigh(covar2) 1157 | val = np.maximum(val,0) # to prevent univariate errors to underflow 1158 | noise = np.einsum('...ij,...j', rot, np.sqrt(val)*noise) 1159 | data2 += noise 1160 | else: 1161 | covar2 = None 1162 | return data2, covar2 1163 | 1164 | 1165 | def draw(gmm, obs_size, sel_callback=None, invert_sel=False, orig_size=None, covar_callback=None, background=None, rng=np.random): 1166 | """Draw from the GMM (and the Background) with noise and selection. 1167 | 1168 | Draws orig_size samples from the GMM and the Background, if set; calls 1169 | covar_callback if set and applies resulting covariances; the calls 1170 | sel_callback on the (noisy) samples and returns those matching ones. 1171 | 1172 | If the number is resulting samples is inconsistent with obs_size, i.e. 1173 | outside of the 68 percent confidence limit of a Poisson draw, it will 1174 | update its estimate for the original sample size orig_size. 1175 | An estimate can be provided with orig_size, otherwise it will use obs_size. 1176 | 1177 | Note: 1178 | If sel_callback is set, the number of returned samples is not 1179 | necessarily given by obs_size. 1180 | 1181 | Args: 1182 | gmm: an instance if GMM 1183 | obs_size (int): number of observed samples 1184 | sel_callback: completeness callback to generate imputation samples. 1185 | invert_sel (bool): whether to invert the result of sel_callback 1186 | orig_size (int): an estimate of the original size of the sample. 1187 | background: an instance of Background 1188 | covar_callback: covariance callback for imputation samples. 
1165 | def draw(gmm, obs_size, sel_callback=None, invert_sel=False, orig_size=None, covar_callback=None, background=None, rng=np.random):
1166 |     """Draw from the GMM (and the Background) with noise and selection.
1167 | 
1168 |     Draws orig_size samples from the GMM and the Background, if set; calls
1169 |     covar_callback if set and applies the resulting covariances; then calls
1170 |     sel_callback on the (noisy) samples and returns the matching ones.
1171 | 
1172 |     If the number of resulting samples is inconsistent with obs_size, i.e.
1173 |     outside of the 68 percent confidence limit of a Poisson draw, it will
1174 |     update its estimate for the original sample size orig_size.
1175 |     An estimate can be provided with orig_size, otherwise it will use obs_size.
1176 | 
1177 |     Note:
1178 |         If sel_callback is set, the number of returned samples is not
1179 |         necessarily given by obs_size.
1180 | 
1181 |     Args:
1182 |         gmm: an instance of GMM
1183 |         obs_size (int): number of observed samples
1184 |         sel_callback: completeness callback to generate imputation samples.
1185 |         invert_sel (bool): whether to invert the result of sel_callback
1186 |         orig_size (int): an estimate of the original size of the sample.
1187 |         background: an instance of Background
1188 |         covar_callback: covariance callback for imputation samples.
1189 |         rng: numpy.random.RandomState for deterministic behavior
1190 | 
1191 |     Returns:
1192 |         sample: numpy array (N_orig, D)
1193 |         covar_sample: numpy array (N_orig, D, D) or None if covar_callback=None
1194 |         N_orig (int): updated estimate of orig_size if sel_callback is set
1195 |         omega: numpy array of selection probabilities for the returned samples
1196 |     Throws:
1197 |         RuntimeError for inconsistent argument combinations
1198 |     """
1199 | 
1200 |     if orig_size is None:
1201 |         orig_size = int(obs_size)
1202 | 
1203 |     # draw from model (with background) and add noise.
1204 |     # TODO: may want to decide whether to add noise before selection or after
1205 |     # Here we do noise, then selection, but this is not fundamental
1206 |     data, covar = _drawGMM_BG(gmm, orig_size, covar_callback=covar_callback, background=background, rng=rng)
1207 |     omega = np.ones(len(data)) # selection probability; defaults to 1 without sel_callback
1208 |     # apply selection
1209 |     if sel_callback is not None:
1210 |         omega = sel_callback(data)
1211 |         sel = rng.rand(len(data)) < omega
1212 | 
1213 |         # check if predicted observed size is consistent with observed data
1214 |         # 68% confidence interval for Poisson variate: observed size
1215 |         alpha = 0.32
1216 |         lower = 0.5*scipy.stats.chi2.ppf(alpha/2, 2*obs_size)
1217 |         upper = 0.5*scipy.stats.chi2.ppf(1 - alpha/2, 2*obs_size + 2)
1218 |         obs_size_ = sel.sum()
1219 |         while obs_size_ > upper or obs_size_ < lower:
1220 |             orig_size = int(orig_size / obs_size_ * obs_size)
1221 |             data, covar = _drawGMM_BG(gmm, orig_size, covar_callback=covar_callback, background=background, rng=rng)
1222 |             omega = sel_callback(data)
1223 |             sel = rng.rand(len(data)) < omega
1224 |             obs_size_ = sel.sum()
1225 | 
1226 |         if invert_sel:
1227 |             sel = ~sel
1228 |         data = data[sel]
1229 |         omega = omega[sel]
1230 |         if covar_callback is not None and covar.shape != (gmm.D, gmm.D):
1231 |             covar = covar[sel]
1232 | 
1233 |     return data, covar, orig_size, omega
1234 | 
1235 | 
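# e.g. generate a mock observed catalog matched to 400 observed samples
# (sel_cb and covar_cb stand for user-supplied callbacks as described above):
#   data, covar, N_orig, omega = draw(gmm, 400, sel_callback=sel_cb, covar_callback=covar_cb, rng=rng)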
1236 | def _JS(k, gmm, log_p, log_S, U, A):
1237 |     # compute Kullback-Leibler divergence
1238 |     log_q_k = log_p[k] - log_S[U[k]]
1239 |     return np.dot(np.exp(log_q_k), log_q_k - np.log(A[k]) - log_p[k] + np.log(gmm.amp[k])) / A[k]
1240 | 
1241 | 
1242 | def _findSNMComponents(gmm, U, log_p, log_S, N, pool=None, chunksize=1):
1243 |     # find those components that are most similar
1244 |     JM = np.zeros((gmm.K, gmm.K))
1245 |     # compute log_q (posterior for k given i), but use normalized probabilities
1246 |     # to allow for merging of empty components
1247 |     log_q = [log_p[k] - log_S[U[k]] - np.log(gmm.amp[k]) for k in xrange(gmm.K)]
1248 |     for k in xrange(gmm.K):
1249 |         # don't need diagonal (can merge), and JM is symmetric
1250 |         for j in xrange(k+1, gmm.K):
1251 |             # get index list for intersection of U of k and j
1252 |             # FIXME: match1d fails if either U is empty
1253 |             # SOLUTION: merge empty U, split another
1254 |             i_k, i_j = match1d(U[k], U[j], presorted=True)
1255 |             JM[k,j] = np.dot(np.exp(log_q[k][i_k]), np.exp(log_q[j][i_j]))
1256 |     merge_jk = np.unravel_index(JM.argmax(), JM.shape)
1257 |     # if all Us are disjoint, JM is blank and merge_jk = [0,0]
1258 |     # merge the two smallest components and clean up from the bottom
1259 |     cleanup = False
1260 |     if merge_jk[0] == 0 and merge_jk[1] == 0:
1261 |         merge_jk = np.argsort(gmm.amp)[:2]
1262 |         logger.debug("neighborhoods disjoint. merging components %d and %d" % tuple(merge_jk))
1263 |         cleanup = True
1264 | 
1265 | 
1266 |     # split the one whose p(x|k) deviates most from current Gaussian
1267 |     # ask for the three worst components to avoid split being in merge_jk
1268 |     """
1269 |     JS = np.empty(gmm.K)
1270 |     k = 0
1271 |     A = gmm.amp * N
1272 |     for JS[k] in \
1273 |         parmap.map(_JS, xrange(gmm.K), gmm, log_p, log_S, U, A, pm_pool=pool, pm_chunksize=chunksize):
1274 |         k += 1
1275 |     """
1276 |     # get largest Eigenvalue, weighed by amplitude
1277 |     # Large EV implies extended object, which often is caused by covering
1278 |     # multiple clusters. This happens also for almost empty components, which
1279 |     # should rather be merged than split, hence the amplitude weights.
1280 |     # TODO: replace with linalg.eigvalsh, but eigenvalues are not always ordered
1281 |     EV = np.linalg.svd(gmm.covar, compute_uv=False)
1282 |     JS = EV[:,0] * gmm.amp
1283 |     split_l3 = np.argsort(JS)[-3:][::-1]
1284 | 
1285 |     # check that the three indices are unique
1286 |     changing = np.array([merge_jk[0], merge_jk[1], split_l3[0]])
1287 |     if split_l3[0] in merge_jk:
1288 |         if split_l3[1] not in merge_jk:
1289 |             changing[2] = split_l3[1]
1290 |         else:
1291 |             changing[2] = split_l3[2]
1292 |     return changing, cleanup
1293 | 
1294 | 
1295 | def _update_snm(gmm, changeable, U, N, cleanup):
1296 |     # reconstruct A from gmm.amp
1297 |     A = gmm.amp * N
1298 | 
1299 |     # update parameters and U
1300 |     # merge 0 and 1, store in 0, Bovy eq. 39
1301 |     gmm.amp[changeable[0]] = gmm.amp[changeable[0:2]].sum()
1302 |     if not cleanup:
1303 |         gmm.mean[changeable[0]] = np.sum(gmm.mean[changeable[0:2]] * A[changeable[0:2]][:,None], axis=0) / A[changeable[0:2]].sum()
1304 |         gmm.covar[changeable[0]] = np.sum(gmm.covar[changeable[0:2]] * A[changeable[0:2]][:,None,None], axis=0) / A[changeable[0:2]].sum()
1305 |         U[changeable[0]] = np.union1d(U[changeable[0]], U[changeable[1]])
1306 |     else:
1307 |         # if we're cleaning up the weakest components:
1308 |         # merging does not lead to valid component parameters as the original
1309 |         # ones can be anywhere. Simply adopt the second one.
1310 |         gmm.mean[changeable[0],:] = gmm.mean[changeable[1],:]
1311 |         gmm.covar[changeable[0],:,:] = gmm.covar[changeable[1],:,:]
1312 |         U[changeable[0]] = U[changeable[1]]
1313 | 
1314 |     # split 2, store in 1 and 2
1315 |     # following SVD method in Zhang 2003, with alpha=1/2, u = 1/4
1316 |     gmm.amp[changeable[1]] = gmm.amp[changeable[2]] = gmm.amp[changeable[2]] / 2
1317 |     # TODO: replace with linalg.eigvalsh, but eigenvalues are not always ordered
1318 |     _, radius2, rotation = np.linalg.svd(gmm.covar[changeable[2]])
1319 |     dl = np.sqrt(radius2[0]) * rotation[0] / 4
1320 |     gmm.mean[changeable[1]] = gmm.mean[changeable[2]] - dl
1321 |     gmm.mean[changeable[2]] = gmm.mean[changeable[2]] + dl
1322 |     gmm.covar[changeable[1:]] = np.linalg.det(gmm.covar[changeable[2]])**(1./gmm.D) * np.eye(gmm.D) # 1./gmm.D: avoid integer division
1323 |     U[changeable[1]] = U[changeable[2]].copy() # now 1 and 2 have same U
1324 | 
1325 | 
1326 | # L-fold cross-validation of the fit function.
1327 | # all parameters for fit must be supplied with kwargs.
1328 | # the rng seed will be fixed for the CV runs so that all random effects are the
1329 | # same for each run.
1330 | def cv_fit(gmm, data, L=10, **kwargs):
1331 |     N = len(data)
1332 |     lcv = np.empty(N)
1333 |     logger.info("running %d-fold cross-validation ..." % L)
1334 | 
1335 |     # CV and stacking can't have probabilistic inits that depend on
1336 |     # data or subsets thereof
1337 |     init_callback = kwargs.get("init_callback", None)
1338 |     if init_callback is not None:
1339 |         raise RuntimeError("Cross-validation can only be used consistently with init_callback=None")
1340 | 
1341 |     # make sure we know what the RNG is,
1342 |     # fix state of RNG to make behavior of fit reproducible
1343 |     rng = kwargs.get("rng", np.random)
1344 |     rng_state = rng.get_state()
1345 | 
1346 |     # need to copy the gmm when init_callback is None,
1347 |     # otherwise runs start from different init positions
1348 |     gmm0 = GMM(K=gmm.K, D=gmm.D)
1349 |     gmm0.amp[:,] = gmm.amp[:]
1350 |     gmm0.mean[:,:] = gmm.mean[:,:]
1351 |     gmm0.covar[:,:,:] = gmm.covar[:,:,:]
1352 | 
1353 |     # same for bg if present
1354 |     bg = kwargs.get("background", None)
1355 |     if bg is not None:
1356 |         bg_amp0 = bg.amp
1357 | 
1358 |     # to do L-fold CV here, need to split covar too if set
1359 |     covar = kwargs.pop("covar", None)
1360 |     for i in xrange(L):
1361 |         rng.set_state(rng_state)
1362 |         mask = np.arange(N) % L == i
1363 |         if covar is None or covar.shape == (gmm.D, gmm.D):
1364 |             fit(gmm, data[~mask], covar=covar, **kwargs)
1365 |             lcv[mask] = gmm.logL(data[mask], covar=covar)
1366 |         else:
1367 |             fit(gmm, data[~mask], covar=covar[~mask], **kwargs)
1368 |             lcv[mask] = gmm.logL(data[mask], covar=covar[mask])
1369 | 
1370 |         # undo for consistency
1371 |         gmm.amp[:,] = gmm0.amp[:]
1372 |         gmm.mean[:,:] = gmm0.mean[:,:]
1373 |         gmm.covar[:,:,:] = gmm0.covar[:,:,:]
1374 |         if bg is not None:
1375 |             bg.amp = bg_amp0
1376 | 
1377 |     return lcv
1378 | 
1379 | 
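# e.g. per-sample held-out log-likelihoods from 10-fold CV, with the fit()
# arguments passed through as keywords:
#   lcv = cv_fit(gmm, data, L=10, w=w, cutoff=5)
#   logger.info("mean held-out logL: %.4f" % lcv.mean())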
1380 | def stack(gmms, weights):
1381 |     # build stacked model by combining all gmms and applying weights to amps
1382 |     stacked = GMM(K=0, D=gmms[0].D)
1383 |     for m in xrange(len(gmms)):
1384 |         stacked.amp = np.concatenate((stacked.amp[:], weights[m]*gmms[m].amp[:]))
1385 |         stacked.mean = np.concatenate((stacked.mean[:,:], gmms[m].mean[:,:]))
1386 |         stacked.covar = np.concatenate((stacked.covar[:,:,:], gmms[m].covar[:,:,:]))
1387 |     stacked.amp /= stacked.amp.sum()
1388 |     return stacked
1389 | 
1390 | 
1391 | def stack_fit(gmms, data, kwargs, L=10, tol=1e-5, rng=np.random):
1392 |     M = len(gmms)
1393 |     N = len(data)
1394 |     lcvs = np.empty((M,N))
1395 | 
1396 |     for m in xrange(M):
1397 |         # run CV to get cross-validation likelihood
1398 |         rng_state = rng.get_state()
1399 |         lcvs[m,:] = cv_fit(gmms[m], data, L=L, **(kwargs[m]))
1400 |         rng.set_state(rng_state)
1401 |         # run normal fit on all data
1402 |         fit(gmms[m], data, **(kwargs[m]))
1403 | 
1404 |     # determine the weights that maximize the stacked estimator likelihood
1405 |     # run a tiny EM on lcvs to get them
1406 |     beta = np.ones(M)/M
1407 |     log_p_k = np.empty_like(lcvs)
1408 |     log_S = np.empty(N)
1409 |     it = 0
1410 |     logger.info("optimizing stacking weights\n")
1411 |     logger.info("ITER\tLOG_L")
1412 | 
1413 |     while it < 20:
1414 |         log_p_k[:,:] = lcvs + np.log(beta)[:,None]
1415 |         log_S[:] = logsum(log_p_k)
1416 |         log_p_k[:,:] -= log_S
1417 |         beta[:] = np.exp(logsum(log_p_k, axis=1)) / N
1418 |         logL_ = log_S.mean()
1419 |         logger.info("STACK%d\t%.4f" % (it, logL_))
1420 | 
1421 |         if it > 0 and logL_ - logL < tol:
1422 |             break
1423 |         logL = logL_
1424 |         it += 1
1425 |     return stack(gmms, beta)
1426 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import setup
 2 | 
 3 | long_description = open('README.md').read()
 4 | 
 5 | setup(
 6 |     name="pygmmis",
 7 |     version='1.2.3',
 8 |     description="Gaussian mixture model for incomplete, truncated, and noisy data",
 9 |     long_description = long_description,
10 |     long_description_content_type='text/markdown',
11 |     author="Peter Melchior",
12 |     author_email="peter.m.melchior@gmail.com",
13 |     license='MIT',
14 |     py_modules=["pygmmis"],
15 |     url="https://github.com/pmelchior/pygmmis",
16 |     classifiers=[
17 |         "Development Status :: 5 - Production/Stable",
18 |         "License :: OSI Approved :: MIT License",
19 |         "Intended Audience :: Developers",
20 |         "Intended Audience :: Science/Research",
21 |         "Operating System :: OS Independent",
22 |         "Programming Language :: Python",
23 |         "Topic :: Scientific/Engineering :: Information Analysis"
24 |     ],
25 |     install_requires=["numpy","scipy","parmap>=1.5.2"]
26 | )
27 | 
--------------------------------------------------------------------------------
/tests/pygmmis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pmelchior/pygmmis/87ad02dd607896205ccde3ca668971c6dcacd026/tests/pygmmis.png
--------------------------------------------------------------------------------
/tests/test.py:
--------------------------------------------------------------------------------
  1 | #!/bin/env python
  2 | 
  3 | import pygmmis
  4 | import numpy as np
  5 | import matplotlib.pyplot as plt
  6 | import matplotlib.patches as patches
  7 | import matplotlib.lines as lines
  8 | import matplotlib.cm
  9 | import datetime
 10 | from functools import partial
 11 | import logging
 12 | 
 13 | def plotResults(orig, data, gmm, patch=None, description=None, disp=None):
 14 |     fig = plt.figure(figsize=(6,6))
 15 |     ax = fig.add_subplot(111, aspect='equal')
 16 | 
 17 |     # plot inner and outer points
 18 |     ax.plot(orig[:,0], orig[:,1], 'o', mfc='None', mec='r', mew=1)
 19 |     missing = np.isnan(data)
 20 |     if missing.any():
 21 |         data_ = data.copy()
 22 |         data_[missing] = -5 # put at limits of plotting range
 23 |     else:
 24 |         data_ = data
 25 |     ax.plot(data_[:,0], data_[:,1], 's', mfc='b', mec='None')#, mew=1)
 26 | 
 27 |     # prediction
 28 |     B = 100
 29 |     x,y = np.meshgrid(np.linspace(-5,15,B), np.linspace(-5,15,B))
 30 |     coords = np.dstack((x.flatten(), y.flatten()))[0]
 31 | 
 32 |     # compute sum_k(p_k(x)) for all x
 33 |     p = gmm(coords).reshape((B,B))
 34 |     # for better visibility use arcsinh stretch
 35 |     p = np.arcsinh(p/1e-4)
 36 |     cs = ax.contourf(p, 10, extent=(-5,15,-5,15), cmap=plt.cm.Greys)
 37 |     for c in cs.collections:
 38 |         c.set_edgecolor(c.get_facecolor())
 39 | 
 40 |     # plot boundary
 41 |     if patch is not None:
 42 |         import copy
 43 |         if hasattr(patch, '__iter__'):
 44 |             for p in patch:
 45 |                 ax.add_artist(copy.copy(p))
 46 |         else:
 47 |             ax.add_artist(copy.copy(patch))
 48 | 
 49 |     # add description and complete data logL to plot
 50 |     logL = gmm(orig, as_log=True).mean()
 51 |     if description is not None:
 52 |         ax.text(0.05, 0.95, r'%s' % description, ha='left', va='top', transform=ax.transAxes, fontsize=20)
 53 |         ax.text(0.05, 0.89, '$\log{\mathcal{L}} = %.3f$' % logL, ha='left', va='top', transform=ax.transAxes, fontsize=20)
 54 |     else:
 55 |         ax.text(0.05, 0.95, '$\log{\mathcal{L}} = %.3f$' % logL, ha='left', va='top', transform=ax.transAxes, fontsize=20)
 56 | 
 57 |     # show size of error dispersion as Circle
 58 |     if disp is not None:
 59 | 
 60 |         circ1 = patches.Circle((12.5, -2.5), radius=disp, fc='b', ec='None', alpha=0.5)
 61 |         circ2 = patches.Circle((12.5, -2.5), radius=2*disp, fc='b', ec='None', alpha=0.3)
 62 |         circ3 = patches.Circle((12.5, -2.5), radius=3*disp, fc='b', ec='None', alpha=0.1)
 63 |         ax.add_artist(circ1)
 64 |         ax.add_artist(circ2)
 65 |         ax.add_artist(circ3)
 66 |         ax.text(12.5, -2.5, r'$\sigma$', color='w', fontsize=20, ha='center', va='center')
 67 | 
 68 |     ax.set_xlim(-5, 15)
 69 |     ax.set_ylim(-5, 15)
 70 |     ax.set_xticks([])
 71 |     ax.set_yticks([])
 72 |     fig.subplots_adjust(bottom=0.01, top=0.99, left=0.01, right=0.99)
 73 |     fig.show()
 74 | 
 75 | def plotDifferences(orig, data, gmms, avg, l, patch=None):
 76 |     fig = plt.figure(figsize=(6,6))
 77 |     ax = fig.add_subplot(111, aspect='equal')
 78 | 
 79 |     # plot inner and outer points
 80 |     #ax.plot(orig[:,0], orig[:,1], 'o', mfc='None', mec='r', mew=1)
 81 |     ax.plot(data[:,0], data[:,1], 's', mfc='b', mec='None')#, mew=1)
 82 | 
 83 |     # prediction
 84 |     B = 100
 85 |     x,y = np.meshgrid(np.linspace(-5,15,B), np.linspace(-5,15,B))
 86 |     coords = np.dstack((x.flatten(), y.flatten()))[0]
 87 | 
 88 |     # compute sum_k(p_k(x)) for all x
 89 |     pw = avg(coords).reshape((B,B))
 90 | 
 91 |     # use each run and compute weighted std
 92 |     p = np.empty((T,B,B))
 93 |     for r in range(T):
 94 |         # compute sum_k(p_k(x)) for all x
 95 |         p[r,:,:] = gmms[r](coords).reshape((B,B))
 96 | 
 97 |     p = ((p-pw[None,:,:])**2 * l[:,None, None]).sum(axis=0)
 98 |     V1 = l.sum()
 99 |     V2 = (l**2).sum()
100 |     p /= (V1 - V2/V1)
101 | 
102 |     p = np.arcsinh(np.sqrt(p)/1e-4)
103 |     cs = ax.contourf(p, 10, extent=(-5,15,-5,15), cmap=plt.cm.Greys, vmin=np.arcsinh(pw/1e-4).min(), vmax=np.arcsinh(pw/1e-4).max())
104 |     for c in cs.collections:
105 |         c.set_edgecolor(c.get_facecolor())
106 | 
107 |     # plot boundary
108 |     if patch is not None:
109 |         import copy
110 |         if hasattr(patch, '__iter__'):
111 |             for p in patch:
112 |                 ax.add_artist(copy.copy(p))
113 |         else:
114 |             ax.add_artist(copy.copy(patch))
115 | 
116 |     ax.text(0.05, 0.95, 'Dispersion', ha='left', va='top', transform=ax.transAxes, fontsize=20)
117 | 
118 |     ax.set_xlim(-5, 15)
119 |     ax.set_ylim(-5, 15)
120 |     ax.set_xticks([])
121 |     ax.set_yticks([])
122 |     fig.subplots_adjust(bottom=0.01, top=0.99, left=0.01, right=0.99)
123 |     fig.show()
124 | 
125 | def getBox(coords):
126 |     box_limits = np.array([[0,0],[10,10]])
127 |     return (coords[:,0] > box_limits[0,0]) & (coords[:,0] < box_limits[1,0]) & (coords[:,1] > box_limits[0,1]) & (coords[:,1] < box_limits[1,1])
128 | 
129 | def getHole(coords):
130 |     x,y,r = 6.5, 6., 2
131 |     return ((coords[:,0] - x)**2 + (coords[:,1] - y)**2 > r**2)
132 | 
133 | def getBoxWithHole(coords):
134 |     return getBox(coords)*getHole(coords)
135 | 
136 | def getCut(coords):
137 |     return (coords[:,0] < 6)
138 | 
139 | def getAll(coords):
140 |     return np.ones(len(coords))
141 | 
142 | def getHalf(coords, rng=np.random):
143 |     return 0.5 * np.ones(len(coords))
144 | 
145 | def getSelection(type="hole", rng=np.random):
146 |     if type == "hole":
147 |         cb = getHole
148 |         ps = patches.Circle([6.5, 6.], radius=2, fc="none", ec='k', lw=1, ls='dashed')
149 |     if type == "box":
150 |         cb = getBox
151 |         ps = patches.Rectangle([0,0], 10, 10, fc="none", ec='k', lw=1, ls='dashed')
152 |     if type == "boxWithHole":
153 |         cb = getBoxWithHole
154 |         ps = [patches.Circle([6.5, 6.], radius=2, fc="none", ec='k', lw=1, ls='dashed'),
155 |               patches.Rectangle([0,0], 10, 10, fc="none", ec='k', lw=1, ls='dashed')]
156 |     if type == "cut":
157 |         cb = getCut
158 |         ps = lines.Line2D([6, 6],[-5, 15], ls='dotted', lw=1, color='k')
159 |     if type == "all":
160 |         cb = getAll
161 |         ps = None
162 |     return cb, ps
163 | 
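# note: selection callbacks may return boolean masks (hard cuts, e.g. getBox)
# or detection probabilities in [0,1] (soft cuts, e.g. getHalf); the observed
# sample below is drawn via rng.rand(len(noisy)) < omega(noisy) either way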
164 | if __name__ == '__main__':
165 | 
166 |     # set up test
167 |     N = 400             # number of samples
168 |     K = 3               # number of components
169 |     T = 1               # number of runs
170 |     sel_type = "boxWithHole" # type of selection
171 |     disp = 0.5          # additive noise dispersion
172 |     bg_amp = 0.0        # fraction of background samples
173 |     w = 0.1             # minimum covariance regularization [data units]
174 |     cutoff = 5          # cutoff distance between components [sigma]
175 |     seed = 8365         # seed value
176 |     oversampling = 10   # for missing data: imputation samples per observed sample
177 |     # show EM iteration results
178 |     logging.basicConfig(format='%(message)s',level=logging.INFO)
179 | 
180 |     # define RNG for run
181 |     from numpy.random import RandomState
182 |     rng = RandomState(seed)
183 | 
184 |     # draw N points from 3-component GMM
185 |     D = 2
186 |     gmm = pygmmis.GMM(K=3, D=2)
187 |     gmm.amp[:] = np.array([ 0.36060026,  0.27986906,  0.206774])
188 |     gmm.amp /= gmm.amp.sum()
189 |     gmm.mean[:,:] = np.array([[ 0.08016886,  0.21300697],
190 |                               [ 0.70306351,  0.6709532 ],
191 |                               [ 0.01087670,  0.852077]])*10
192 |     gmm.covar[:,:,:] = np.array([[[ 0.08530014, -0.00314178],
193 |                                   [-0.00314178,  0.00541106]],
194 |                                  [[ 0.03053402,  0.0125736],
195 |                                   [ 0.0125736,  0.01075791]],
196 |                                  [[ 0.00258605,  0.00409287],
197 |                                   [ 0.00409287,  0.01065186]]])*100
198 | 
199 |     # data come from pure GMM model or one with background?
200 |     orig = gmm.draw(N, rng=rng)
201 |     if bg_amp == 0:
202 |         orig_bg = orig
203 |         bg = None
204 |     else:
205 |         footprint = np.array([-10,-10]), np.array([20,20])
206 |         bg = pygmmis.Background(footprint)
207 |         bg.amp = bg_amp
208 |         bg.adjust_amp = True
209 | 
210 |         bg_size = int(bg_amp/(1-bg_amp) * N)
211 |         orig_bg = np.concatenate((orig, bg.draw(bg_size, rng=rng)))
212 | 
213 |     # add isotropic errors on data
214 |     noisy = orig_bg + rng.normal(0, scale=disp, size=(len(orig_bg), D))
215 | 
216 |     # get observational selection function
217 |     omega, ps = getSelection(sel_type, rng=rng)
218 | 
219 |     # apply selection
220 |     sel = rng.rand(len(noisy)) < omega(noisy) # len(noisy) >= N if background samples were added
221 |     data = noisy[sel]
222 |     # single covariance for all samples
223 |     covar = disp**2 * np.eye(D)
224 | 
225 |     # plot data vs true model
226 |     plotResults(orig, data, gmm, patch=ps, description="Truth", disp=disp)
227 | 
228 |     # repeated runs: store results and logL
229 |     l = np.empty(T)
230 |     gmms = [pygmmis.GMM(K=K, D=D) for r in range(T)]
231 | 
232 |     # 1) EM without imputation, ignoring errors
233 |     start = datetime.datetime.now()
234 |     rng = RandomState(seed)
235 |     for r in range(T):
236 |         if bg is not None:
237 |             bg.amp = bg_amp
238 |         l[r], _ = pygmmis.fit(gmms[r], data, w=w, cutoff=cutoff, background=bg, rng=rng)
239 |     avg = pygmmis.stack(gmms, l)
240 |     print ("execution time %ds" % (datetime.datetime.now() - start).seconds)
241 |     plotResults(orig, data, avg, patch=ps, description="Standard EM")
242 | 
243 |     # 2) EM without imputation, deconvolving via Extreme Deconvolution
244 |     start = datetime.datetime.now()
245 |     rng = RandomState(seed)
246 |     for r in range(T):
247 |         if bg is not None:
248 |             bg.amp = bg_amp
249 |         l[r], _ = pygmmis.fit(gmms[r], data, covar=covar, w=w, cutoff=cutoff, background=bg, rng=rng)
250 |     avg = pygmmis.stack(gmms, l)
251 |     print ("execution time %ds" % (datetime.datetime.now() - start).seconds)
252 |     plotResults(orig, data, avg, patch=ps, description="Standard EM & noise deconvolution")
253 | 
254 |     # 3) pygmmis with imputation, ignoring errors
255 |     # We need a good initial location to explore the
256 |     # volume that is spanned by the missing part of the data
257 |     # We therefore run a standard GMM without imputation first
258 |     start = datetime.datetime.now()
259 |     rng = RandomState(seed)
260 |     for r in range(T):
261 |         if bg is not None:
262 |             bg.amp = bg_amp
263 |         pygmmis.fit(gmms[r], data, w=w, cutoff=cutoff, background=bg, rng=rng)
264 |         l[r], _ = pygmmis.fit(gmms[r], data, init_method='none', w=w, cutoff=cutoff, sel_callback=omega, oversampling=oversampling, background=bg, rng=rng)
265 |     avg = pygmmis.stack(gmms, l)
266 |     print ("execution time %ds" % (datetime.datetime.now() - start).seconds)
267 |     plotResults(orig, data, avg, patch=ps, description="$\mathtt{GMMis}$")
268 | 
269 |     # 4) pygmmis with imputation, incorporating errors
270 |     covar_cb = partial(pygmmis.covar_callback_default, default=np.eye(D)*disp**2)
271 |     start = datetime.datetime.now()
272 |     rng = RandomState(seed)
273 |     for r in range(T):
274 |         if bg is not None:
275 |             bg.amp = bg_amp
276 |         pygmmis.fit(gmms[r], data, w=w, cutoff=cutoff, background=bg, rng=rng)
277 |         l[r], _ = pygmmis.fit(gmms[r], data, covar=covar, init_method='none', w=w, cutoff=cutoff, sel_callback=omega, oversampling=oversampling, covar_callback=covar_cb, background=bg, rng=rng)
278 |     avg = pygmmis.stack(gmms, l)
279 |     print ("execution time %ds" % (datetime.datetime.now() - start).seconds)
280 |     plotResults(orig, data, avg, patch=ps, description="$\mathtt{GMMis}$ & noise deconvolution")
281 | 
282 |     if T > 1:
283 |         plotDifferences(orig, data, gmms, avg, l, patch=ps)
284 |         #plotCoverage(orig, data, avg, patch=ps, sel_callback=cb)
285 |     """
286 |     # stacked estimator: needs to do init by hand to keep it fixed
287 |     start = datetime.datetime.now()
288 |     rng = RandomState(seed)
289 |     for r in range(R):
290 |         init_cb(gmms[r], data=data, covar=covar, rng=rng)
291 |     kwargs = [dict(covar=covar, init_callback=None, w=w, cutoff=cutoff, sel_callback=cb, covar_callback=covar_cb, background=bg, rng=rng) for i in range(R)]
292 |     stacked = pygmmis.stack_fit(gmms, data, kwargs, L=10, rng=rng)
293 |     print ("execution time %ds" % (datetime.datetime.now() - start).seconds)
294 |     plotResults(orig, data, stacked, patch=ps, description="Stacked")
295 |     """
296 | 
--------------------------------------------------------------------------------
/tests/test_3D.py:
--------------------------------------------------------------------------------
  1 | import pygmmis
  2 | import numpy as np
  3 | import logging
  4 | from functools import partial
  5 | 
  6 | L = 1
  7 | 
  8 | def binSample(coords, C):
  9 |     dl = L*1./C
 10 |     N = len(coords)
 11 |     from sklearn.neighbors import KDTree
 12 |     # chebyshev metric: results in cube selection
 13 |     tree = KDTree(coords, leaf_size=N/100, metric="chebyshev")
 14 |     # sample position: center of cubes of length K
 15 |     skewer = np.arange(C)
 16 |     grid = np.meshgrid(skewer, skewer, skewer, indexing="ij")
 17 |     grid = np.dstack((grid[0].flatten(), grid[1].flatten(), grid[2].flatten()))[0]
 18 |     samples = dl*(grid + 0.5)
 19 | 
 20 |     # get counts in boxes
 21 |     c = tree.query_radius(samples, r=0.5*dl, count_only=True)
 22 |     #counts = np.zeros(K**3)
 23 |     #counts[mask] = c
 24 |     #return counts.reshape(K,K,K)
 25 |     return c.reshape(C,C,C)
 26 | 
 27 | def initCube(gmm, w=0, rng=np.random):
 28 |     #gmm.amp[:] = rng.rand(gmm.K)
 29 |     #gmm.amp /= gmm.amp.sum()
 30 |     global K
 31 |     alpha = K
 32 |     gmm.amp[:] = rng.dirichlet(alpha*np.ones(gmm.K)/K, 1)[0]
 33 |     gmm.mean[:,:] = rng.rand(gmm.K, gmm.D)
 34 |     for k in range(gmm.K):
 35 |         gmm.covar[k] = np.diag((w + rng.rand(gmm.D) / 30)**2)
 36 |     # use random rotations for each component covariance
 37 |     # from http://www.mathworks.com/matlabcentral/newsreader/view_thread/298500
 38 |     # since we don't care about parity flips we don't have to check
 39 |     # the determinant of R (and hence don't need R)
 40 |     for k in range(gmm.K):
 41 |         Q,_ = np.linalg.qr(rng.normal(size=(gmm.D, gmm.D)), mode='complete')
 42 |         gmm.covar[k] = np.dot(Q, np.dot(gmm.covar[k], Q.T))
 43 | 
 44 | def initToFillCube(gmm, omega=0.5, rng=np.random):
 45 |     gmm.amp[:] = 1./gmm.K
 46 |     # set model to random positions with equally sized spheres within
 47 |     # volume spanned by data
 48 |     min_pos = np.zeros(3)
 49 |     max_pos = np.ones(3)
 50 |     gmm.mean[:,:] = min_pos + (max_pos-min_pos)*rng.rand(gmm.K, gmm.D)
 51 |     # K spheres of radius s [having volume s^D * pi^D/2 / gamma(D/2+1)]
 52 |     # should fill fraction omega of cube
 53 |     from scipy.special import gamma
 54 |     vol_data = np.prod(max_pos-min_pos)
 55 |     s = (omega * vol_data / gmm.K * gamma(gmm.D*0.5 + 1))**(1./gmm.D) / np.sqrt(np.pi)
 56 |     gmm.covar[:,:,:] = s**2 * np.eye(gmm.D)
 57 | 
 58 | def drawWithNbh(gmm, size=1, rng=np.random):
 59 |     # draw indices for components given amplitudes, need to make sure: sum=1
 60 |     ind = rng.choice(gmm.K, size=size, p=(gmm.amp/gmm.amp.sum()))
 61 |     samples = np.empty((size, gmm.D))
 62 |     N_k = np.bincount(ind, minlength=gmm.K)
 63 |     nbh = [None for k in range(gmm.K)]
 64 |     counter = 0
 65 |     for k in range(gmm.K):
 66 |         s = N_k[k]
 67 |         samples[counter:counter+s] = rng.multivariate_normal(gmm.mean[k], gmm.covar[k], size=s)
 68 |         nbh[k] = np.arange(counter, counter+s)
 69 |         counter += s
 70 |     return samples, nbh
 71 | 
 72 | from mpl_toolkits.mplot3d import Axes3D
 73 | import matplotlib.pyplot as plt
 74 | 
 75 | def createFigure():
 76 |     fig = plt.figure()
 77 |     ax = plt.axes([0,0,1,1], projection='3d')#, aspect='equal')
 78 |     return fig, ax
 79 | 
 80 | def plotPoints(coords, ax=None, depth_shading=True, **kwargs):
 81 |     if ax is None:
 82 |         fig, ax = createFigure()
 83 | 
 84 |     #if ecolor != 'None':
 85 |     #    lw = 0.25
 86 |     sc = ax.scatter(coords[:,0], coords[:,1], coords[:,2], **kwargs)
 87 |     # get rid of pesky depth shading in absence of depthshade=False option
 88 |     if depth_shading is False:
 89 |         sc.set_edgecolors = sc.set_facecolors = lambda *args:None
 90 |     plt.show()
 91 |     return ax
 92 | 
 93 | def slopeSel(coords, rng=np.random):
 94 |     return rng.rand(len(coords)) > coords[:,0]
 95 | 
 96 | def noSel(coords, rng=np.random):
 97 |     return np.ones(len(coords), dtype="bool")
 98 | 
 99 | def insideComponent(k, gmm, coords, covar=None, cutoff=5.):
100 |     if gmm.amp[k]*K > 0.01:
101 |         return gmm.logL_k(k, coords, covar=covar, chi2_only=True) < cutoff
102 |     else:
103 |         return np.zeros(len(coords), dtype='bool')
104 | 
105 | def GMMSel(coords, gmm, covar=None, sel_gmm=None, cutoff_nd=3., rng=np.random):
106 |     # selection based on sel_gmm: a sample is kept if its chi^2 with respect
107 |     # to at least one component of sel_gmm is below cutoff_nd
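    # usage sketch (fitted_gmm stands for any fitted GMM; cf. the __main__ block below):
    #   cutoff_nd = pygmmis.chi2_cutoff(3, cutoff=1)
    #   keep = GMMSel(coords, gmm=None, sel_gmm=fitted_gmm, cutoff_nd=cutoff_nd)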
108 |     import multiprocessing, parmap
109 |     n_chunks, chunksize = sel_gmm._mp_chunksize()
110 |     inside = np.array(parmap.map(insideComponent, range(sel_gmm.K), sel_gmm, coords, covar, cutoff_nd, pm_chunksize=chunksize))
111 |     return np.max(inside, axis=0)
112 | 
113 | def max_posterior(gmm, U, coords, covar=None):
114 |     import multiprocessing, parmap
115 |     pool = multiprocessing.Pool()
116 |     n_chunks, chunksize = gmm._mp_chunksize()
117 |     log_p = [[] for k in range(gmm.K)]
118 |     log_S = np.zeros(len(coords))
119 |     H = np.zeros(len(coords), dtype="bool")
120 |     k = 0
121 |     for log_p[k], U[k], _ in \
122 |         parmap.starmap(pygmmis._Estep, zip(range(gmm.K), U), gmm, coords, covar, None, pm_pool=pool, pm_chunksize=chunksize):
123 |         log_S[U[k]] += np.exp(log_p[k]) # actually S, not logS
124 |         H[U[k]] = 1
125 |         k += 1
126 |     log_S[H] = np.log(log_S[H])
127 | 
128 |     max_q = np.zeros(len(coords))
129 |     max_k = np.zeros(len(coords), dtype='uint32')
130 |     for k in range(gmm.K):
131 |         q_k = np.exp(log_p[k] - log_S[U[k]])
132 |         max_k[U[k]] = np.where(max_q[U[k]] < q_k, k, max_k[U[k]])
133 |         max_q[U[k]] = np.maximum(max_q[U[k]], q_k)
134 |     return max_k
135 | 
136 | # from http://stackoverflow.com/questions/36740887/how-can-a-python-context-manager-try-to-execute-code
137 | def try_forever(f):
138 |     def decorated(*args, **kwargs):
139 |         while True:
140 |             try:
141 |                 return f(*args, **kwargs)
142 |             except:
143 |                 pass
144 |     return decorated
145 | 
146 | if __name__ == "__main__":
147 |     N = 10000
148 |     K = 50
149 |     D = 3
150 |     C = 50
151 |     w = 0.001
152 |     inner_cutoff = 1
153 | 
154 |     seed = 42 #np.random.randint(1, 10000)
155 |     from numpy.random import RandomState
156 |     rng = RandomState(seed)
157 |     logging.basicConfig(format='%(message)s',level=logging.INFO)
158 | 
159 |     # define selection and create Omega in cube:
160 |     # expensive, only do once
161 |     sel_callback = partial(slopeSel, rng=rng)
162 |     """
163 |     random = rng.rand(N*100, D)
164 |     sel = sel_callback(random)
165 |     omega_cube = binSample(random[sel], C).astype('float') / binSample(random, C)
166 |     del random
167 |     """
168 |     omega_cube = np.ones((C,C,C))
169 |     for c in range(C):
170 |         omega_cube[c,:,:] *= 1 - (c+0.5)/C
171 | 
172 |     count_cube = np.zeros((C,C,C))
173 |     count__cube = np.zeros((C,C,C))
174 |     count0_cube = np.zeros((C,C,C))
175 | 
176 |     R = 10
177 |     amp0 = np.empty(R*K)
178 |     frac = np.empty(R*K)
179 |     Omega = np.empty(R*K)
180 |     assoc_frac = np.empty(R*K)
181 |     posterior = np.empty(R*K)
182 | 
183 |     cutoff_nd = pygmmis.chi2_cutoff(D, cutoff=inner_cutoff)
184 |     counter = 0
185 |     for r in range(R):
186 |         print ("start")
187 |         # create original sample from GMM
188 |         gmm0 = pygmmis.GMM(K=K, D=D)
189 |         initCube(gmm0, w=w*10, rng=rng) # use larger size floor than in fit
190 |         data0, nbh0 = drawWithNbh(gmm0, N, rng=rng)
191 | 
192 |         # apply selection
193 |         sel0 = sel_callback(data0)
194 | 
195 |         # how often is each component used
196 |         comp0 = np.empty(len(data0), dtype='uint32')
197 |         for k in range(gmm0.K):
198 |             comp0[nbh0[k]] = k
199 |         count0 = np.bincount(comp0, minlength=gmm0.K)
200 | 
201 |         # compute effective Omega
202 |         comp = comp0[sel0]
203 |         count = np.bincount(comp, minlength=gmm0.K)
204 | 
205 |         frac__ = count.astype('float') / count.sum()
206 |         Omega__ = count.astype('float') / count0
207 | 
208 |         # restrict to "safe" components
209 |         safe = frac__ > 1./1 * 1./K
210 |         if safe.sum() < gmm0.K:
211 |             print ("reset to safe components")
212 |             gmm0.amp = gmm0.amp[safe]
213 |             gmm0.amp /= gmm0.amp.sum()
214 |             gmm0.mean = gmm0.mean[safe]
215 |             gmm0.covar = gmm0.covar[safe]
216 | 
217 |             # redraw data0 and sel0
218 |             data0, nbh0 = drawWithNbh(gmm0, N, rng=rng)
219 |             sel0 = sel_callback(data0)
220 | 
221 |             # recompute effective Omega and frac
222 |             # how often is each component used
223 |             comp0 = np.empty(len(data0), dtype='uint32')
224 |             for k in range(gmm0.K):
225 |                 comp0[nbh0[k]] = k
226 |             count0 = np.bincount(comp0, minlength=gmm0.K)
227 |             comp = comp0[sel0]
228 |             count = np.bincount(comp, minlength=gmm0.K)
229 | 
230 |             frac__ = count.astype('float') / count.sum()
231 |             Omega__ = count.astype('float') / count0
232 | 
233 |         frac[counter:counter+gmm0.K] = frac__
234 |         Omega[counter:counter+gmm0.K] = Omega__
235 |         amp0[counter:counter+gmm0.K] = gmm0.amp
236 |         count0_cube += binSample(data0, C)
237 | 
238 |         # which K: K0 or K/N = const?
239 |         K_ = gmm0.K #int(K*omega_cube.mean())
240 | 
241 |         # fit model after selection
242 |         data = data0[sel0]
243 | 
244 |         split_n_merge = K_/3 # 0
245 |         gmm = pygmmis.GMM(K=K_, D=3)
246 |         logL, U = pygmmis.fit(gmm, data, init_method='minmax', w=w, cutoff=5, split_n_merge=split_n_merge, rng=rng)
247 |         sample = gmm.draw(N, rng=rng)
248 |         count_cube += binSample(sample, C)
249 | 
250 |         fit_forever = try_forever(pygmmis.fit)
251 |         gmm_ = pygmmis.GMM(K=K_, D=3)
252 |         #fit_forever(gmm_, data, sel_callback=sel_callback, init_callback=init_cb, w=w, cutoff=5, split_n_merge=split_n_merge, rng=rng)
253 |         gmm_.amp[:] = gmm.amp[:]
254 |         gmm_.mean[:,:] = gmm.mean[:,:]
255 |         gmm_.covar[:,:,:] = 2*gmm.covar[:,:,:]
256 |         logL_, U_ = fit_forever(gmm_, data, sel_callback=sel_callback, init_method='none', w=w, cutoff=5, split_n_merge=split_n_merge, rng=rng)
257 |         sample_ = gmm_.draw(N, rng=rng)
258 |         """
259 |         gmm_ = gmm
260 |         logL_, U_ = logL, U
261 |         sample_ = sample
262 |         """
263 | 
264 |         count__cube += binSample(sample_, C)
265 | 
266 |         # find density threshold to be associated with any fit GMM component:
267 |         # below a threshold, the EM algorithm won't bother to put a component.
268 |         # under selection, that threshold applies to the observed sample.
269 |         #
270 |         # 1) compute fraction of observed points for each component of gmm0
271 |         for k in range(K_):
272 |             # select data that is within cutoff of any component of sel_gmm
273 |             sel__ = GMMSel(data0[nbh0[k]], gmm=None, sel_gmm=gmm_, cutoff_nd=cutoff_nd, rng=rng)
274 |             assoc_frac[k + counter] = sel__.sum() * 1./nbh0[k].size
275 | 
276 |         """
277 |         # 2) test which components have majority of points associated with
278 |         # any fit component
279 |         max_k = max_posterior(gmm, U, data0)
280 |         for k in range(K_):
281 |             posterior[k + counter] = np.bincount(max_k[comp0 == k]).max() * 1./ (comp0 == k).sum()
282 |         """
283 | 
284 |         counter += gmm0.K
285 | 
286 |     # plot average cell density as function of cell omega:
287 |     # biased estimate will avoid low-omega region and (over)compensate in
288 |     # high-omega regions
289 |     B = 10
290 |     bins = np.linspace(0,1,B+1)
291 | 
292 |     mean_rho0 = np.empty(B)
293 |     mean_rho = np.empty(B)
294 |     mean_rho_ = np.empty(B)
295 |     mean_omega = np.empty(B)
296 |     std_rho0 = np.empty(B)
297 |     std_rho = np.empty(B)
298 |     std_rho_ = np.empty(B)
299 |     std_omega = np.empty(B)
300 |     for i in range(B):
301 |         mask = (omega_cube > bins[i]) & (omega_cube <= bins[i+1])
302 |         sqrtN = np.sqrt(mask.sum())
303 |         mean_omega[i] = omega_cube[mask].mean()
304 |         std_omega[i] = omega_cube[mask].std()
305 |         mean_rho0[i] = count0_cube[mask].mean()
306 |         std_rho0[i] = count0_cube[mask].std() / sqrtN
307 |         mean_rho[i] = count_cube[mask].mean()
308 |         std_rho[i] = count_cube[mask].std() / sqrtN
309 |         mean_rho_[i] = count__cube[mask].mean()
310 |         std_rho_[i] = count__cube[mask].std() / sqrtN
311 | 
312 |     """
313 |     fig = plt.figure()
314 |     ax = fig.add_subplot(111)
315 |     ax.plot(bins, np.zeros_like(bins), ls='--', c='#888888')
316 |     ax.plot([0,1], [-1,1], ls='--', c='#888888')
317 |     angle = 36
318 |     ax.text(0.30, -1+0.47, 'uncorrected $\Omega$', color='#888888', ha='center', va='center', rotation=angle)
319 |     ax.text(0.97, -0.05, 'perfect correction', color='#888888', ha='right', va='top')
320 |     ax.errorbar(mean_omega, (mean_rho - mean_rho0)/mean_rho0, yerr=np.sqrt(std_rho**2 + std_rho0**2)/mean_rho0, fmt='b-', marker='s', label='Standard EM')
321 |     ax.errorbar(mean_omega, (mean_rho_ - mean_rho0)/mean_rho0, yerr=np.sqrt(std_rho_**2 + std_rho0**2)/mean_rho0, fmt='r-', marker='o', label='$\mathtt{GMMis}$')
322 |     ax.set_ylabel(r'$(\tilde{\rho} - \rho)/\rho$')
323 |     ax.set_xlabel('$\Omega$')
324 |     fig.subplots_adjust(bottom=0.12, right=0.97)
325 |     ax.set_xlim(0,1)
326 |     ax.set_ylim(-1,1)
327 |     leg = ax.legend(loc='upper left', frameon=False, numpoints=1)
328 |     fig.show()
329 | 
330 |     # plot associated fraction vs observed amplitude
331 |     import scipy.stats
332 |     cdf_1d = scipy.stats.norm.cdf(inner_cutoff)
333 |     confidence_1d = 1-(1-cdf_1d)*2
334 | 
335 |     fig = plt.figure()
336 |     ax = fig.add_subplot(111)
337 |     sc = ax.scatter(frac[:counter], assoc_frac[:counter], c=Omega[:counter], s=100*amp0[:counter]/amp0[:counter].mean(), marker='o', rasterized=True, cmap='RdYlBu')
338 |     xl = [-0.005, frac[:counter].max()*1.1]
339 |     yl = [0,1.0]
340 |     ax.plot(xl, [confidence_1d, confidence_1d], c='#888888', ls='--', lw=1)
341 |     ax.text(xl[1]*0.97, 0.68*0.97, '$1\sigma$ region', color='#888888', ha='right', va='top')
342 |     ax.plot([1./gmm0.K, 1./gmm0.K], yl, c='#888888', ls=':', lw=1)
343 |     ax.text(1./gmm0.K + (xl[1]-xl[0])*0.03, yl[0] + 0.03, '$1/K$', color='#888888', ha='left', va='bottom', rotation=90)
344 |     ax.set_xlim(xl)
345 |     ax.set_ylim(yl)
346 |     ax.set_xlabel('$N^o_k / N^o$')
347 |     ax.set_ylabel('$\eta_k$')
348 |     from mpl_toolkits.axes_grid1 import make_axes_locatable
349 |     divider = make_axes_locatable(ax)
350 |     cax = divider.append_axes("right", size="3%", pad=0.0)
351 |     cb = plt.colorbar(sc, cax=cax)
352 |     ticks = np.linspace(0, 1, 6)
353 |     cb.set_ticks(ticks)
354 |     cb.set_label('$\Omega_k$')
355 |     fig.subplots_adjust(bottom=0.13, right=0.90)
356 |     fig.show()
357 | 
358 | 
359 |     cmap = matplotlib.cm.get_cmap('RdYlBu')
360 |     color = np.array([cmap(20),cmap(255)])[sel0.astype('int')]
361 |     #ecolor = np.array(['r','b'])[sel0.astype('int')]
362 |     ax = plotPoints(data0, s=4, c=color, lw=0, rasterized=True, depth_shading=False)
363 |     ax.set_xlim3d(0,1)
364 |     ax.set_ylim3d(0,1)
365 |     ax.set_zlim3d(0,1)
366 | 
367 |     ax = plotPoints(sample_, s=1, alpha=0.5)
368 |     for k in range(gmm0.K):
369 |         ax.text(gmm_.mean[k,0]+0.03, gmm_.mean[k,1]+0.03, gmm_.mean[k,2]+0.03, "%d" % k, color='r', zorder=1000)
370 |     plotPoints(gmm0.mean, c='g', s=400, ax=ax, alpha=0.5, zorder=100)
371 |     plotPoints(gmm_.mean, c='r', s=400, ax=ax, alpha=0.5, zorder=100)
372 |     ax.set_xlim3d(0,1)
373 |     ax.set_ylim3d(0,1)
374 |     ax.set_zlim3d(0,1)
375 |     """
--------------------------------------------------------------------------------