├── CITATION.cff ├── LICENSE.md ├── README.md ├── pygmmis.py ├── setup.py └── tests ├── pygmmis.png ├── test.py └── test_3D.py /CITATION.cff: -------------------------------------------------------------------------------- 1 | cff-version: 1.2.0 2 | message: "If you use this software, please cite it as below." 3 | authors: 4 | - family-names: "Melchior" 5 | given-names: "Peter" 6 | orcid: "https://orcid.org/0000-0002-8873-5065" 7 | title: "pyGMMis" 8 | url: "https://github.com/pmelchior/pygmmis" 9 | preferred-citation: 10 | type: article 11 | authors: 12 | - family-names: "Melchior" 13 | given-names: "Peter" 14 | orcid: "https://orcid.org/0000-0002-8873-5065" 15 | - family-names: "Goulding" 16 | given-names: "Andy" 17 | orcid: "https://orcid.org/0000-0003-4700-663X" 18 | doi: "10.1016/j.ascom.2018.09.013" 19 | journal: "Astronomy and Computing" 20 | start: 183 # First page number 21 | end: 194 # Last page number 22 | title: "Filling the gaps: Gaussian mixture models from noisy, truncated or incomplete samples" 23 | volume: 25 24 | year: 2018 25 | month: 10 26 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Peter Melchior 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![PyPI](https://img.shields.io/pypi/v/pygmmis.svg)](https://pypi.python.org/pypi/pygmmis/) 2 | [![License](https://img.shields.io/github/license/pmelchior/pygmmis.svg)](https://github.com/pmelchior/pygmmis/blob/master/LICENSE.md) 3 | [![DOI](https://img.shields.io/badge/DOI-10.1016%2Fj.ascom.2018.09.013-blue.svg)](https://doi.org/10.1016/j.ascom.2018.09.013) 4 | [![arXiv](https://img.shields.io/badge/arxiv-1611.05806-red.svg)](http://arxiv.org/abs/1611.05806) 5 | 6 | # pyGMMis 7 | 8 | Need a simple and powerful Gaussian-mixture code in pure python? It can be as easy as this: 9 | 10 | ```python 11 | import pygmmis 12 | gmm = pygmmis.GMM(K=K, D=D) # K components, D dimensions 13 | logL, U = pygmmis.fit(gmm, data) # logL = log-likelihood, U = association of data to components 14 | ``` 15 | However, **pyGMMis** has a few extra tricks up its sleeve. 
16 | 
17 | * It can account for independent multivariate normal measurement errors for each of the observed samples, and then recovers an estimate of the error-free distribution. This technique is known as "Extreme Deconvolution" (Bovy, Hogg & Roweis 2011).
18 | * It works with missing data (features) by setting the respective elements of the covariance matrix to a very large value, thus effectively setting the weight of the missing feature to 0.
19 | * It can deal with gaps (aka "truncated data") and variable sample completeness as long as
20 |   * you know the incompleteness over the entire feature space,
21 |   * and the incompleteness does not depend on the sample density (missing at random).
22 | * It can incorporate a "background" distribution (a uniform one is implemented) and separate signal from background, with the former being fit by the GMM.
23 | * It keeps track of which components need to be evaluated in which regions of the feature space, thereby substantially increasing the performance for fragmented data.
24 | 
25 | If you want more context and details on those capabilities, have a look at this [blog post](http://pmelchior.net/blog/gaussian-mixture-models-for-astronomy.html).
26 | 
27 | Under the hood, **pyGMMis** uses the Expectation-Maximization procedure. When dealing with sample incompleteness, it generates its best guess of the unobserved samples on the fly, given the current model fit to the observed samples.
28 | 
29 | ![Example of pyGMMis](https://raw.githubusercontent.com/pmelchior/pygmmis/master/tests/pygmmis.png)
30 | 
31 | In the example above, the true distribution is shown as contours in the left panel. We then draw 400 samples from it (red), add Gaussian noise to them (1, 2, 3 sigma contours shown in blue), and select only samples within the box but outside of the circle (blue).
32 | 
33 | The code is written in pure python (developed and tested in 2.7), parallelized with `multiprocessing`, and is capable of performing density estimation with millions of samples and thousands of model components on machines with sufficient memory.
34 | 
35 | More details are in the paper listed in the file `CITATION.cff`.
36 | 
37 | 
38 | 
39 | ## Installation and Prerequisites
40 | 
41 | You can either clone the repo and install via `python setup.py install` or get the latest release with
42 | 
43 | ```
44 | pip install pygmmis
45 | ```
46 | 
47 | Dependencies:
48 | 
49 | * numpy
50 | * scipy
51 | * multiprocessing
52 | * parmap
53 | 
54 | ## How to run the code
55 | 
56 | 1. Create a GMM object with the desired component number K and data dimensionality D:
57 |    ```gmm = pygmmis.GMM(K=K, D=D) ```
58 | 
59 | 2. Define a callback for the completeness function. It is called with `data` of shape `(N,D)` and returns the probability of each sample being observed. Two simple examples:
60 | 
61 | ```python
62 | def cutAtSix(coords):
63 |     """Selects all samples whose first coordinate is < 6"""
64 |     return (coords[:,0] < 6)
65 | 
66 | def selSlope(coords, rng=np.random):
67 |     """Selects probabilistically according to first coordinate x:
68 |     Omega = 1    for x < 0
69 |           = 1-x  for x = 0 .. 1
70 |           = 0    for x > 1
71 |     """
72 |     return np.maximum(0, np.minimum(1, 1 - coords[:,0]))
73 | ```
74 | 
75 | 3. If the samples are noisy (i.e. they have positional uncertainties), you need to provide the covariance matrix of each data sample, or a single one for all in case of i.i.d. noise (see the sketch below).
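76 | 
77 |    A minimal sketch of both conventions (the dispersion value is only an assumption for illustration); the result can be passed to `pygmmis.fit` via its `covar` argument:
78 | 
79 | ```python
80 | dispersion = 0.5                  # assumed noise level, for illustration only
81 | covar = np.eye(D) * dispersion**2 # shape (D,D): identical i.i.d. noise for all samples
82 | # or, with individual uncertainties, one covariance matrix per sample:
83 | covar = np.tile(np.eye(D) * dispersion**2, (len(data), 1, 1)) # shape (N,D,D)
84 | ```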
85 | 
86 | 4. If the samples are noisy *and* the completeness function isn't constant, you need to provide a callback function that returns an estimate of the covariance at arbitrary locations:
87 | 
88 | ```python
89 | # example 1: simply using the same covariance for all samples
90 | dispersion = 1
91 | default_covar = np.eye(D) * dispersion**2
92 | covar_cb = lambda coords: default_covar
93 | 
94 | # example 2: use the covariance of the nearest neighbor.
95 | def covar_tree_cb(coords, tree, covar):
96 |     """Return the covariance of the nearest neighbor of coords in data."""
97 |     dist, ind = tree.query(coords, k=1)
98 |     return covar[ind.flatten()]
99 | 
100 | from sklearn.neighbors import KDTree
101 | tree = KDTree(data, leaf_size=100)
102 | 
103 | from functools import partial
104 | covar_cb = partial(covar_tree_cb, tree=tree, covar=covar)
105 | ```
106 | 
107 | 5. If there is a uniform background signal, you need to define it. Because a uniform distribution is normalizable only if its support is finite, you need to decide on the footprint over which the background model is present, e.g.:
108 | 
109 | ```python
110 | footprint = data.min(axis=0), data.max(axis=0)
111 | amp = 0.3
112 | bg = pygmmis.Background(footprint, amp=amp)
113 | 
114 | # fine tuning, if desired
115 | bg.amp_min = 0.1
116 | bg.amp_max = 0.5
117 | bg.adjust_amp = False # freezes bg.amp at current value
118 | ```
119 | 
120 | 6. Select an initialization method. This tells the GMM what initial parameters it should assume. The options are `'minmax','random','kmeans','none'`. See the respective functions for details:
121 | 
122 |    * `pygmmis.initFromDataMinMax()`
123 |    * `pygmmis.initFromDataAtRandom()`
124 |    * `pygmmis.initFromKMeans()`
125 | 
126 |    For difficult situations, or if you are not happy with the convergence, you may want to experiment with your own initialization. All you have to do is set `gmm.amp`, `gmm.mean`, and `gmm.covar` to the desired values and use `init_method='none'`.
127 | 
128 | 7. Decide whether to freeze any components. This makes sense if you *know* some of the parameters of the components. You can freeze the amplitude, mean, or covariance of any component by listing them in a dictionary, e.g.:
129 | 
130 | ```python
131 | frozen={"amp": [1,2], "mean": [], "covar": [1]}
132 | ```
133 | 
134 |    This freezes the amplitudes of components 1 and 2 (NOTE: counting starts at 0), and the covariance of component 1.
135 | 
136 | 8. Run the fitter:
137 | 
138 | ```python
139 | w = 0.1    # minimum covariance regularization, same units as data
140 | cutoff = 5 # segment the data set into neighborhoods within 5 sigma around components
141 | tol = 1e-3 # tolerance on logL to terminate EM
142 | 
143 | # define RNG for deterministic behavior
144 | from numpy.random import RandomState
145 | seed = 42
146 | rng = RandomState(seed)
147 | 
148 | # run EM
149 | logL, U = pygmmis.fit(gmm, data, init_method='random',\
150 |                       sel_callback=cb, covar_callback=covar_cb, w=w, cutoff=cutoff,\
151 |                       background=bg, tol=tol, frozen=frozen, rng=rng)
152 | ```
153 | 
154 |    This runs the EM procedure until the tolerance is reached and returns the final mean log-likelihood of all samples as well as the neighborhood of each component (indices of data samples that are within cutoff of that component). To follow the fit's progress, see the logging sketch below.
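155 | 
156 |    The fitter reports per-iteration diagnostics (sample counts, background amplitude, log-likelihood) through the standard `logging` module under the logger name `"pygmmis"` (see `pygmmis.py`). A minimal sketch to make these messages visible:
157 | 
158 | ```python
159 | import logging
160 | logging.basicConfig(format='%(message)s')           # attach a root handler
161 | logging.getLogger("pygmmis").setLevel(logging.INFO) # show per-iteration status
162 | ```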
163 | 
164 | 9. Evaluate the model:
165 | 
166 | ```python
167 | # p(x) at test_coords; use as_log=True for log p(x)
168 | p = gmm(test_coords, as_log=False)
169 | N_s = 1000
170 | # draw samples from the GMM alone
171 | samples = gmm.draw(N_s)
172 | 
173 | # draw samples from the model with noise, background, and selection:
174 | # to get the missing samples instead, set invert_sel=True.
175 | # N_orig is the estimated number of samples prior to selection;
176 | # omega holds the selection probability of each returned sample
177 | obs_size = len(data)
178 | samples, covar_samples, N_orig, omega = pygmmis.draw(gmm, obs_size, sel_callback=cb,\
179 |                                                      invert_sel=False, orig_size=None,\
180 |                                                      covar_callback=covar_cb, background=bg)
181 | ```
182 | 
183 | 
184 | 
185 | For a complete example, have a look at [the test script](https://github.com/pmelchior/pygmmis/blob/master/tests/test.py). For requests and bug reports, please open an issue.
186 | 
--------------------------------------------------------------------------------
/pygmmis.py:
--------------------------------------------------------------------------------
1 | from __future__ import division
2 | import numpy as np
3 | import scipy.special, scipy.stats
4 | import ctypes
5 | 
6 | import logging
7 | logger = logging.getLogger("pygmmis")
8 | 
9 | # set up multiprocessing
10 | import multiprocessing
11 | import parmap
12 | 
13 | def createShared(a, dtype=ctypes.c_double):
14 |     """Create a shared array to be used for multiprocessing's processes.
15 | 
16 |     Taken from http://stackoverflow.com/questions/5549190/
17 | 
18 |     Works only for float, double, int, long types (e.g. no bool).
19 | 
20 |     Args:
21 |         numpy array, arbitrary shape
22 | 
23 |     Returns:
24 |         numpy array whose container is a multiprocessing.Array
25 |     """
26 |     shared_array_base = multiprocessing.Array(dtype, a.size)
27 |     shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
28 |     shared_array[:] = a.flatten()
29 |     shared_array = shared_array.reshape(a.shape)
30 |     return shared_array
31 | 
32 | # this is to allow multiprocessing pools to operate on class methods:
33 | # https://gist.github.com/bnyeggen/1086393
34 | def _pickle_method(method):
35 |     func_name = method.im_func.__name__
36 |     obj = method.im_self
37 |     cls = method.im_class
38 |     if func_name.startswith('__') and not func_name.endswith('__'): #deal with mangled names
39 |         cls_name = cls.__name__.lstrip('_')
40 |         func_name = '_' + cls_name + func_name
41 |     return _unpickle_method, (func_name, obj, cls)
42 | 
43 | def _unpickle_method(func_name, obj, cls):
44 |     for cls in cls.__mro__:
45 |         try:
46 |             func = cls.__dict__[func_name]
47 |         except KeyError:
48 |             pass
49 |         else:
50 |             break
51 |     return func.__get__(obj, cls)
52 | 
53 | import types
54 | # python 2 -> 3 adjustments
55 | try:
56 |     import copy_reg
57 | except ImportError:
58 |     import copyreg as copy_reg
59 | copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)
60 | 
61 | try:
62 |     xrange
63 | except NameError:
64 |     xrange = range
65 | 
66 | # Blatant copy from Erin Sheldon's esutil
67 | # https://github.com/esheldon/esutil/blob/master/esutil/numpy_util.py
68 | def match1d(arr1input, arr2input, presorted=False):
69 |     """
70 |     NAME:
71 |         match
72 |     CALLING SEQUENCE:
73 |         ind1,ind2 = match(arr1, arr2, presorted=False)
74 |     PURPOSE:
75 |         Match two numpy arrays. Return the indices of the matches or empty
76 |         arrays if no matches are found. This means arr1[ind1] == arr2[ind2] is
77 |         true for all corresponding pairs. arr1 must contain only unique
78 |         inputs, but arr2 may be non-unique.
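        A worked example: ind1, ind2 = match1d([3,1,2], [2,2,5]) gives
        ind1 = [2,2], ind2 = [0,1], since both entries of value 2 in arr2
        match arr1[2], and the unmatched value 5 is dropped.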
79 | If you know arr1 is sorted, set presorted=True and it will run 80 | even faster 81 | METHOD: 82 | uses searchsorted with some sugar. Much faster than old version 83 | based on IDL code. 84 | REVISION HISTORY: 85 | Created 2015, Eli Rykoff, SLAC. 86 | """ 87 | 88 | # make sure 1D 89 | arr1 = np.array(arr1input, ndmin=1, copy=False) 90 | arr2 = np.array(arr2input, ndmin=1, copy=False) 91 | 92 | # check for integer data... 93 | if (not issubclass(arr1.dtype.type,np.integer) or 94 | not issubclass(arr2.dtype.type,np.integer)) : 95 | mess="Error: only works with integer types, got %s %s" 96 | mess = mess % (arr1.dtype.type,arr2.dtype.type) 97 | raise ValueError(mess) 98 | 99 | if (arr1.size == 0) or (arr2.size == 0) : 100 | mess="Error: arr1 and arr2 must each be non-zero length" 101 | raise ValueError(mess) 102 | 103 | # make sure that arr1 has unique values... 104 | test=np.unique(arr1) 105 | if test.size != arr1.size: 106 | raise ValueError("Error: the arr1input must be unique") 107 | 108 | # sort arr1 if not presorted 109 | if not presorted: 110 | st1 = np.argsort(arr1) 111 | else: 112 | st1 = None 113 | 114 | # search the sorted array 115 | sub1=np.searchsorted(arr1,arr2,sorter=st1) 116 | 117 | # check for out-of-bounds at the high end if necessary 118 | if (arr2.max() > arr1.max()) : 119 | bad,=np.where(sub1 == arr1.size) 120 | sub1[bad] = arr1.size-1 121 | 122 | if not presorted: 123 | sub2,=np.where(arr1[st1[sub1]] == arr2) 124 | sub1=st1[sub1[sub2]] 125 | else: 126 | sub2,=np.where(arr1[sub1] == arr2) 127 | sub1=sub1[sub2] 128 | 129 | return sub1,sub2 130 | 131 | 132 | def logsum(logX, axis=0): 133 | """Computes log of the sum along give axis from the log of the summands. 134 | 135 | This method tries hard to avoid over- or underflow. 136 | See appendix A of Bovy, Hogg, Roweis (2009). 137 | 138 | Args: 139 | logX: numpy array of logarithmic summands 140 | axis (int): axis to sum over 141 | 142 | Returns: 143 | log of the sum, shortened by one axis 144 | 145 | Throws: 146 | ValueError if logX has length 0 along given axis 147 | 148 | """ 149 | floatinfo = np.finfo(logX.dtype) 150 | underflow = np.log(floatinfo.tiny) - logX.min(axis=axis) 151 | overflow = np.log(floatinfo.max) - logX.max(axis=axis) - np.log(logX.shape[axis]) 152 | c = np.where(underflow < overflow, underflow, overflow) 153 | # adjust the shape of c for addition with logX 154 | c_shape = [slice(None) for i in xrange(len(logX.shape))] 155 | c_shape[axis] = None 156 | return np.log(np.exp(logX + c[tuple(c_shape)]).sum(axis=axis)) - c 157 | 158 | 159 | def chi2_cutoff(D, cutoff=3.): 160 | """D-dimensional eqiuvalent of "n sigma" cut. 161 | 162 | Evaluates the quantile function of the chi-squared distribution to determine 163 | the limit for the chi^2 of samples wrt to GMM so that they satisfy the 164 | 68-95-99.7 percent rule of the 1D Normal distribution. 
165 | 166 | Args: 167 | D (int): dimensions of the feature space 168 | cutoff (float): 1D equivalent cut [in units of sigma] 169 | 170 | Returns: 171 | float: upper limit for chi-squared in D dimensions 172 | """ 173 | cdf_1d = scipy.stats.norm.cdf(cutoff) 174 | confidence_1d = 1-(1-cdf_1d)*2 175 | cutoff_nd = scipy.stats.chi2.ppf(confidence_1d, D) 176 | return cutoff_nd 177 | 178 | def covar_callback_default(coords, default=None): 179 | N,D = coords.shape 180 | if default.shape != (D,D): 181 | raise RuntimeError("covar_callback received improper default covariance %r" % default) 182 | # no need to copy since a single covariance matrix is sufficient 183 | # return np.tile(default, (N,1,1)) 184 | return default 185 | 186 | 187 | class GMM(object): 188 | """Gaussian mixture model with K components in D dimensions. 189 | 190 | Attributes: 191 | amp: numpy array (K,), component amplitudes 192 | mean: numpy array (K,D), component means 193 | covar: numpy array (K,D,D), component covariances 194 | """ 195 | def __init__(self, K=0, D=0): 196 | """Create the arrays for amp, mean, covar.""" 197 | self.amp = np.zeros((K)) 198 | self.mean = np.empty((K,D)) 199 | self.covar = np.empty((K,D,D)) 200 | 201 | @property 202 | def K(self): 203 | """int: number of components, depends on size of amp.""" 204 | return self.amp.size 205 | 206 | @property 207 | def D(self): 208 | """int: dimensions of the feature space.""" 209 | return self.mean.shape[1] 210 | 211 | def save(self, filename, **kwargs): 212 | """Save GMM to file. 213 | 214 | Args: 215 | filename (str): name for saved file, should end on .npz as the default 216 | of numpy.savez(), which is called here 217 | kwargs: dictionary of additional information to be stored in file. 218 | 219 | Returns: 220 | None 221 | """ 222 | np.savez(filename, amp=self.amp, mean=self.mean, covar=self.covar, **kwargs) 223 | 224 | def load(self, filename): 225 | """Load GMM from file. 226 | 227 | Additional arguments stored by save() will be ignored. 228 | 229 | Args: 230 | filename (str): name for file create with save(). 231 | 232 | Returns: 233 | None 234 | """ 235 | F = np.load(filename) 236 | self.amp = F["amp"] 237 | self.mean = F["mean"] 238 | self.covar = F["covar"] 239 | F.close() 240 | 241 | @staticmethod 242 | def from_file(filename): 243 | """Load GMM from file. 244 | 245 | Additional arguments stored by save() will be ignored. 246 | 247 | Args: 248 | filename (str): name for file create with save(). 249 | 250 | Returns: 251 | GMM 252 | """ 253 | gmm = GMM() 254 | gmm.load(filename) 255 | return gmm 256 | 257 | def draw(self, size=1, rng=np.random): 258 | """Draw samples from the GMM. 259 | 260 | Args: 261 | size (int): number of samples to draw 262 | rng: numpy.random.RandomState for deterministic draw 263 | 264 | Returns: 265 | numpy array (size,D) 266 | """ 267 | # draw indices for components given amplitudes, need to make sure: sum=1 268 | ind = rng.choice(self.K, size=size, p=self.amp/self.amp.sum()) 269 | N = np.bincount(ind, minlength=self.K) 270 | 271 | # for each component: draw as many points as in ind from a normal 272 | samples = np.empty((size, self.D)) 273 | lower = 0 274 | for k in np.flatnonzero(N): 275 | upper = lower + N[k] 276 | samples[lower:upper, :] = rng.multivariate_normal(self.mean[k], self.covar[k], size=N[k]) 277 | lower = upper 278 | return samples 279 | 280 | def __call__(self, coords, covar=None, as_log=False): 281 | """Evaluate model PDF at given coordinates. 282 | 283 | see logL() for details. 
284 | 285 | Args: 286 | coords: numpy array (D,) or (N, D) of test coordinates 287 | covar: numpy array (D, D) or (N, D, D) covariance matrix of coords 288 | as_log (bool): return log(p) instead p 289 | 290 | Returns: 291 | numpy array (1,) or (N, 1) of PDF (or its log) 292 | """ 293 | if as_log: 294 | return self.logL(coords, covar=covar) 295 | else: 296 | return np.exp(self.logL(coords, covar=covar)) 297 | 298 | def _mp_chunksize(self): 299 | # find how many components to distribute over available threads 300 | cpu_count = multiprocessing.cpu_count() 301 | chunksize = max(1, self.K//cpu_count) 302 | n_chunks = min(cpu_count, self.K//chunksize) 303 | return n_chunks, chunksize 304 | 305 | def _get_chunks(self): 306 | # split all component in ideal-sized chunks 307 | n_chunks, chunksize = self._mp_chunksize() 308 | left = self.K - n_chunks*chunksize 309 | chunks = [] 310 | n = 0 311 | for i in xrange(n_chunks): 312 | n_ = n + chunksize 313 | if left > i: 314 | n_ += 1 315 | chunks.append((n, n_)) 316 | n = n_ 317 | return chunks 318 | 319 | def logL(self, coords, covar=None): 320 | """Log-likelihood of coords given all (i.e. the sum of) GMM components 321 | 322 | Distributes computation over all threads on the machine. 323 | 324 | If covar is None, this method returns 325 | log(sum_k(p(x | k))) 326 | of the data values x. If covar is set, the method returns 327 | log(sum_k(p(y | k))), 328 | where y = x + noise and noise ~ N(0, covar). 329 | 330 | Args: 331 | coords: numpy array (D,) or (N, D) of test coordinates 332 | covar: numpy array (D, D) or (N, D, D) covariance matrix of coords 333 | 334 | Returns: 335 | numpy array (1,) or (N, 1) log(L), depending on shape of data 336 | """ 337 | # Instead log p (x | k) for each k (which is huge) 338 | # compute it in stages: first for each chunk, then sum over all chunks 339 | pool = multiprocessing.Pool() 340 | chunks = self._get_chunks() 341 | results = [pool.apply_async(self._logsum_chunk, (chunk, coords, covar)) for chunk in chunks] 342 | log_p_y_chunk = [] 343 | for r in results: 344 | log_p_y_chunk.append(r.get()) 345 | pool.close() 346 | pool.join() 347 | return logsum(np.array(log_p_y_chunk)) # sum over all chunks = all k 348 | 349 | def _logsum_chunk(self, chunk, coords, covar=None): 350 | # helper function to reduce the memory requirement of logL 351 | log_p_y_k = np.empty((chunk[1]-chunk[0], len(coords))) 352 | for i in xrange(chunk[1] - chunk[0]): 353 | k = chunk[0] + i 354 | log_p_y_k[i,:] = self.logL_k(k, coords, covar=covar) 355 | return logsum(log_p_y_k) 356 | 357 | def logL_k(self, k, coords, covar=None, chi2_only=False): 358 | """Log-likelihood of coords given only component k. 
359 | 360 | Args: 361 | k (int): component index 362 | coords: numpy array (D,) or (N, D) of test coordinates 363 | covar: numpy array (D, D) or (N, D, D) covariance matrix of coords 364 | chi2_only (bool): only compute deltaX^T Sigma_k^-1 deltaX 365 | 366 | Returns: 367 | numpy array (1,) or (N, 1) log(L), depending on shape of data 368 | """ 369 | # compute p(x | k) 370 | dx = coords - self.mean[k] 371 | if covar is None: 372 | T_k = self.covar[k] 373 | else: 374 | T_k = self.covar[k] + covar 375 | chi2 = np.einsum('...i,...ij,...j', dx, np.linalg.inv(T_k), dx) 376 | 377 | if chi2_only: 378 | return chi2 379 | 380 | # prevent tiny negative determinants to mess up 381 | (sign, logdet) = np.linalg.slogdet(T_k) 382 | log2piD2 = np.log(2*np.pi)*(0.5*self.D) 383 | return np.log(self.amp[k]) - log2piD2 - sign*logdet/2 - chi2/2 384 | 385 | class Background(object): 386 | """Background object to be used in conjuction with GMM. 387 | 388 | For a normalizable uniform distribution, a support footprint must be set. 389 | It should be sufficiently large to explain all non-clusters samples. 390 | 391 | Attributes: 392 | amp (float): mixing amplitude 393 | footprint: numpy array, (D,2) of rectangular volume 394 | adjust_amp (bool): whether amp will be adjusted as part of the fit 395 | amp_max (float): maximum value of amp allowed if adjust_amp=True 396 | """ 397 | def __init__(self, footprint, amp=0): 398 | """Initialize Background with a footprint. 399 | 400 | Args: 401 | footprint: numpy array, (D,2) of rectangular volume 402 | 403 | Returns: 404 | None 405 | """ 406 | self.amp = amp 407 | self.footprint = footprint 408 | self.adjust_amp = True 409 | self.amp_max = 1 410 | self.amp_min = 0 411 | 412 | @property 413 | def p(self): 414 | """Probability of the background model. 415 | 416 | Returns: 417 | float, equal to 1/volume, where volume is given by footprint. 418 | """ 419 | volume = np.prod(self.footprint[1] - self.footprint[0]) 420 | return 1/volume 421 | 422 | def draw(self, size=1, rng=np.random): 423 | """Draw samples from uniform background. 424 | 425 | Args: 426 | size (int): number of samples to draw 427 | rng: numpy.random.RandomState for deterministic draw 428 | 429 | Returns: 430 | numpy array (size, D) 431 | """ 432 | dx = self.footprint[1] - self.footprint[0] 433 | return self.footprint[0] + dx*rng.rand(size,len(self.footprint[0])) 434 | 435 | 436 | ############################ 437 | # Begin of fit functions 438 | ############################ 439 | 440 | def initFromDataMinMax(gmm, data, covar=None, s=None, k=None, rng=np.random): 441 | """Initialization callback for uniform random component means. 442 | 443 | Component amplitudes are set at 1/gmm.K, covariances are set to 444 | s**2*np.eye(D), and means are distributed randomly over the range that is 445 | covered by data. 446 | 447 | If s is not given, it will be set such that the volume of all components 448 | completely fills the space covered by data. 
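    In that case (see the code below), s = (V / K)^(1/D) * Gamma(D/2 + 1)^(1/D) / sqrt(pi),
    where V is the volume of the bounding box of the data, so that K spheres of
    radius s have a total volume of V.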
449 | 450 | Args: 451 | gmm: A GMM to be initialized 452 | data: numpy array (N,D) to define the range of the component means 453 | covar: ignored in this callback 454 | s (float): if set, sets component variances 455 | k (iterable): list of components to set, is None sets all components 456 | rng: numpy.random.RandomState for deterministic behavior 457 | 458 | Returns: 459 | None 460 | """ 461 | if k is None: 462 | k = slice(None) 463 | gmm.amp[k] = 1/gmm.K 464 | # set model to random positions with equally sized spheres within 465 | # volumne spanned by data 466 | min_pos = data.min(axis=0) 467 | max_pos = data.max(axis=0) 468 | gmm.mean[k,:] = min_pos + (max_pos-min_pos)*rng.rand(gmm.K, gmm.D) 469 | # if s is not set: use volume filling argument: 470 | # K spheres of radius s [having volume s^D * pi^D/2 / gamma(D/2+1)] 471 | # should completely fill the volume spanned by data. 472 | if s is None: 473 | vol_data = np.prod(max_pos-min_pos) 474 | s = (vol_data / gmm.K * scipy.special.gamma(gmm.D*0.5 + 1))**(1/gmm.D) / np.sqrt(np.pi) 475 | logger.info("initializing spheres with s=%.2f in data domain" % s) 476 | 477 | gmm.covar[k,:,:] = s**2 * np.eye(data.shape[1]) 478 | 479 | def initFromDataAtRandom(gmm, data, covar=None, s=None, k=None, rng=np.random): 480 | """Initialization callback for component means to follow data on scales > s. 481 | 482 | Component amplitudes are set to 1/gmm.K, covariances are set to 483 | s**2*np.eye(D). For each mean, a data sample is selected at random, and a 484 | multivariant Gaussian offset is added, whose variance is given by s**2. 485 | 486 | If s is not given, it will be set such that the volume of all components 487 | completely fills the space covered by data. 488 | 489 | Args: 490 | gmm: A GMM to be initialized 491 | data: numpy array (N,D) to define the range of the component means 492 | covar: ignored in this callback 493 | s (float): if set, sets component variances 494 | k (iterable): list of components to set, is None sets all components 495 | rng: numpy.random.RandomState for deterministic behavior 496 | 497 | Returns: 498 | None 499 | """ 500 | if k is None: 501 | k = slice(None) 502 | k_len = gmm.K 503 | else: 504 | try: 505 | k_len = len(gmm.amp[k]) 506 | except TypeError: 507 | k_len = 1 508 | gmm.amp[k] = 1/gmm.K 509 | # initialize components around data points with uncertainty s 510 | refs = rng.randint(0, len(data), size=k_len) 511 | D = data.shape[1] 512 | if s is None: 513 | min_pos = data.min(axis=0) 514 | max_pos = data.max(axis=0) 515 | vol_data = np.prod(max_pos-min_pos) 516 | s = (vol_data / gmm.K * scipy.special.gamma(gmm.D*0.5 + 1))**(1/gmm.D) / np.sqrt(np.pi) 517 | logger.info("initializing spheres with s=%.2f near data points" % s) 518 | 519 | gmm.mean[k,:] = data[refs] + rng.multivariate_normal(np.zeros(D), s**2 * np.eye(D), size=k_len) 520 | gmm.covar[k,:,:] = s**2 * np.eye(data.shape[1]) 521 | 522 | def initFromKMeans(gmm, data, covar=None, rng=np.random): 523 | """Initialization callback from a k-means clustering run. 524 | 525 | See Algorithm 1 from Bloemer & Bujna (arXiv:1312.5946) 526 | NOTE: The result of this call are not deterministic even if rng is set 527 | because scipy.cluster.vq.kmeans2 uses its own initialization. 
528 | 529 | Args: 530 | gmm: A GMM to be initialized 531 | data: numpy array (N,D) to define the range of the component means 532 | covar: ignored in this callback 533 | rng: numpy.random.RandomState for deterministic behavior 534 | 535 | Returns: 536 | None 537 | """ 538 | from scipy.cluster.vq import kmeans2 539 | center, label = kmeans2(data, gmm.K) 540 | for k in xrange(gmm.K): 541 | mask = (label == k) 542 | gmm.amp[k] = mask.sum() / len(data) 543 | gmm.mean[k,:] = data[mask].mean(axis=0) 544 | d_m = data[mask] - gmm.mean[k] 545 | # funny way of saying: for each point i, do the outer product 546 | # of d_m with its transpose and sum over i 547 | gmm.covar[k,:,:] = (d_m[:, :, None] * d_m[:, None, :]).sum(axis=0) / len(data) 548 | 549 | 550 | def fit(gmm, data, covar=None, R=None, init_method='random', w=0., cutoff=None, sel_callback=None, oversampling=10, covar_callback=None, background=None, tol=1e-3, miniter=1, maxiter=1000, frozen=None, split_n_merge=False, rng=np.random): 551 | """Fit GMM to data. 552 | 553 | If given, init_callback is called to set up the GMM components. Then, the 554 | EM sequence is repeated until the mean log-likelihood converges within tol. 555 | 556 | Args: 557 | gmm: an instance if GMM 558 | data: numpy array (N,D) 559 | covar: sample noise covariance; numpy array (N,D,D) or (D,D) if i.i.d. 560 | R: sample projection matrix; numpy array (N,D,D) 561 | init_method (string): one of ['random', 'minmax', 'kmeans', 'none'] 562 | defines the method to initialize the GMM components 563 | w (float): minimum covariance regularization 564 | cutoff (float): size of component neighborhood [in 1D equivalent sigmas] 565 | sel_callback: completeness callback to generate imputation samples. 566 | oversampling (int): number of imputation samples per data sample. 567 | only used if sel_callback is set. 568 | value of 1 is fine but results are noisy. Set as high as feasible. 569 | covar_callback: covariance callback for imputation samples. 570 | needs to be present if sel_callback and covar are set. 571 | background: an instance of Background if simultaneous fitting is desired 572 | tol (float): tolerance for covergence of mean log-likelihood 573 | maxiter (int): maximum number of iterations of EM 574 | frozen (iterable or dict): index list of components that are not updated 575 | split_n_merge (int): number of split & merge attempts 576 | rng: numpy.random.RandomState for deterministic behavior 577 | 578 | Notes: 579 | If frozen is a simple list, it will be assumed that is applies to mean 580 | and covariance of the specified components. It can also be a dictionary 581 | with the keys "mean" and "covar" to specify them separately. 582 | In either case, amplitudes will be updated to reflect any changes made. 583 | If frozen["amp"] is set, it will use this list instead. 584 | 585 | Returns: 586 | mean log-likelihood (float), component neighborhoods (list of ints) 587 | 588 | Throws: 589 | RuntimeError for inconsistent argument combinations 590 | """ 591 | 592 | N = len(data) 593 | # if there are data (features) missing, i.e. 
masked as np.nan, set them to zeros
594 |     # and create/set covariance elements to very large value to reduce its weight
595 |     # to effectively zero
596 |     missing = np.isnan(data)
597 |     if missing.any():
598 |         data_ = createShared(data.copy())
599 |         data_[missing] = 0 # value does not matter as long as it's not nan
600 |         if covar is None:
601 |             covar = np.zeros((gmm.D, gmm.D))
602 |             # need to create covar_callback if imputation is requested
603 |             if sel_callback is not None:
604 |                 from functools import partial
605 |                 covar_callback = partial(covar_callback_default, default=np.zeros((gmm.D, gmm.D)))
606 |         if covar.shape == (gmm.D, gmm.D):
607 |             covar_ = createShared(np.tile(covar, (N,1,1)))
608 |         else:
609 |             covar_ = createShared(covar.copy())
610 | 
611 |         large = 1e10
612 |         for d in range(gmm.D):
613 |             covar_[missing[:,d],d,d] += large
615 |     else:
616 |         data_ = createShared(data.copy())
617 |         if covar is None or covar.shape == (gmm.D, gmm.D):
618 |             covar_ = covar
619 |         else:
620 |             covar_ = createShared(covar.copy())
621 | 
622 |     # init components
623 |     if init_method.lower() not in ['random', 'minmax', 'kmeans', 'none']:
624 |         raise NotImplementedError("init_method %s not in ['random', 'minmax', 'kmeans', 'none']" % init_method)
625 |     if init_method.lower() == 'random':
626 |         initFromDataAtRandom(gmm, data_, covar=covar_, rng=rng)
627 |     if init_method.lower() == 'minmax':
628 |         initFromDataMinMax(gmm, data_, covar=covar_, rng=rng)
629 |     if init_method.lower() == 'kmeans':
630 |         initFromKMeans(gmm, data_, covar=covar_, rng=rng)
631 | 
632 |     # test if callbacks are consistent
633 |     if sel_callback is not None and covar is not None and covar_callback is None:
634 |         raise NotImplementedError("covar is set, but covar_callback is None: imputation samples inconsistent")
635 | 
636 |     # set up pool
637 |     pool = multiprocessing.Pool()
638 |     n_chunks, chunksize = gmm._mp_chunksize()
639 | 
640 |     # containers
641 |     # precautions for cases when some points are treated as outliers
642 |     # and not considered as belonging to any component
643 |     log_S = createShared(np.zeros(N))     # S = sum_k p(x|k)
644 |     log_p = [[] for k in xrange(gmm.K)]   # P = p(x|k) for x in U[k]
645 |     T_inv = [None for k in xrange(gmm.K)] # T = covar(x) + gmm.covar[k]
646 |     U = [None for k in xrange(gmm.K)]     # U = {x close to k}
647 |     p_bg = None
648 |     if background is not None:
649 |         gmm.amp *= 1 - background.amp # GMM amp + BG amp = 1
650 |         p_bg = [None] # p_bg = p(x|BG), no log because values are larger
651 |         if covar is not None:
652 |             # check if covar is diagonal and issue warning if not
653 |             mess = "background model will only consider diagonal elements of covar"
654 |             nondiag = ~np.eye(gmm.D, dtype='bool')
655 |             if covar.shape == (gmm.D, gmm.D):
656 |                 if (covar[nondiag] != 0).any():
657 |                     logger.warning(mess)
658 |             else:
659 |                 if (covar[np.tile(nondiag,(N,1,1))] != 0).any():
660 |                     logger.warning(mess)
661 | 
662 |     # check if all component parameters can be changed
663 |     changeable = {"amp": slice(None), "mean": slice(None), "covar": slice(None)}
664 |     if frozen is not None:
665 |         if all(isinstance(item, int) for item in frozen):
666 |             changeable['amp'] = changeable['mean'] = changeable['covar'] = np.in1d(xrange(gmm.K), frozen, assume_unique=True, invert=True)
667 |         elif hasattr(frozen, 'keys') and np.in1d(["amp","mean","covar"], tuple(frozen.keys()), assume_unique=True).any():
668 |             if "amp" in frozen.keys():
669 |                 changeable['amp'] = np.in1d(xrange(gmm.K), frozen['amp'], assume_unique=True,
invert=True) 670 | if "mean" in frozen.keys(): 671 | changeable['mean'] = np.in1d(xrange(gmm.K), frozen['mean'], assume_unique=True, invert=True) 672 | if "covar" in frozen.keys(): 673 | changeable['covar'] = np.in1d(xrange(gmm.K), frozen['covar'], assume_unique=True, invert=True) 674 | else: 675 | raise NotImplementedError("frozen should be list of indices or dictionary with keys in ['amp','mean','covar']") 676 | 677 | try: 678 | log_L, N, N2 = _EM(gmm, log_p, U, T_inv, log_S, data_, covar=covar_, R=R, sel_callback=sel_callback, oversampling=oversampling, covar_callback=covar_callback, w=w, pool=pool, chunksize=chunksize, cutoff=cutoff, background=background, p_bg=p_bg, changeable=changeable, miniter=miniter, maxiter=maxiter, tol=tol, rng=rng) 679 | except Exception: 680 | # cleanup 681 | pool.close() 682 | pool.join() 683 | del data_, covar_, log_S 684 | raise 685 | 686 | # should we try to improve by split'n'merge of components? 687 | # if so, keep backup copy 688 | gmm_ = None 689 | if frozen is not None and split_n_merge: 690 | logger.warning("forgoing split'n'merge because some components are frozen") 691 | else: 692 | while split_n_merge and gmm.K >= 3: 693 | 694 | if gmm_ is None: 695 | gmm_ = GMM(gmm.K, gmm.D) 696 | 697 | gmm_.amp[:] = gmm.amp[:] 698 | gmm_.mean[:] = gmm.mean[:,:] 699 | gmm_.covar[:,:,:] = gmm.covar[:,:,:] 700 | U_ = [U[k].copy() for k in xrange(gmm.K)] 701 | 702 | changing, cleanup = _findSNMComponents(gmm, U, log_p, log_S, N+N2, pool=pool, chunksize=chunksize) 703 | logger.info("merging %d and %d, splitting %d" % tuple(changing)) 704 | 705 | # modify components 706 | _update_snm(gmm, changing, U, N+N2, cleanup) 707 | 708 | # run partial EM on changeable components 709 | # NOTE: for a partial run, we'd only need the change to Log_S from the 710 | # changeable components. However, the neighborhoods can change from _update_snm 711 | # or because they move, so that operation is ill-defined. 712 | # Thus, we'll always run a full E-step, which is pretty cheap for 713 | # converged neighborhood. 714 | # The M-step could in principle be run on the changeable components only, 715 | # but there seem to be side effects in what I've tried. 716 | # Similar to the E-step, the imputation step needs to be run on all 717 | # components, otherwise the contribution of the changeable ones to the mixture 718 | # would be over-estimated. 719 | # Effectively, partial runs are as expensive as full runs. 
720 | 721 | changeable['amp'] = changeable['mean'] = changeable['covar'] = np.in1d(xrange(gmm.K), changing, assume_unique=True) 722 | log_L_, N_, N2_ = _EM(gmm, log_p, U, T_inv, log_S, data_, covar=covar_, R=R, sel_callback=sel_callback, oversampling=oversampling, covar_callback=covar_callback, w=w, pool=pool, chunksize=chunksize, cutoff=cutoff, background=background, p_bg=p_bg, maxiter=maxiter, tol=tol, prefix="SNM_P", changeable=changeable, rng=rng) 723 | 724 | changeable['amp'] = changeable['mean'] = changeable['covar'] = slice(None) 725 | log_L_, N_, N2_ = _EM(gmm, log_p, U, T_inv, log_S, data_, covar=covar_, R=R, sel_callback=sel_callback, oversampling=oversampling, covar_callback=covar_callback, w=w, pool=pool, chunksize=chunksize, cutoff=cutoff, background=background, p_bg=p_bg, maxiter=maxiter, tol=tol, prefix="SNM_F", changeable=changeable, rng=rng) 726 | 727 | if log_L >= log_L_: 728 | # revert to backup 729 | gmm.amp[:] = gmm_.amp[:] 730 | gmm.mean[:] = gmm_.mean[:,:] 731 | gmm.covar[:,:,:] = gmm_.covar[:,:,:] 732 | U = U_ 733 | logger.info ("split'n'merge likelihood decreased: reverting to previous model") 734 | break 735 | 736 | log_L = log_L_ 737 | split_n_merge -= 1 738 | 739 | pool.close() 740 | pool.join() 741 | del data_, covar_, log_S 742 | return log_L, U 743 | 744 | # run EM sequence 745 | def _EM(gmm, log_p, U, T_inv, log_S, data, covar=None, R=None, sel_callback=None, oversampling=10, covar_callback=None, background=None, p_bg=None, w=0, pool=None, chunksize=1, cutoff=None, miniter=1, maxiter=1000, tol=1e-3, prefix="", changeable=None, rng=np.random): 746 | 747 | # compute effective cutoff for chi2 in D dimensions 748 | if cutoff is not None: 749 | # note: subsequently the cutoff parameter, e.g. in _E(), refers to this: 750 | # chi2 < cutoff, 751 | # while in fit() it means e.g. "cut at 3 sigma". 752 | # These differing conventions need to be documented well. 753 | cutoff_nd = chi2_cutoff(gmm.D, cutoff=cutoff) 754 | 755 | # store chi2 cutoff for component shifts, use 0.5 sigma 756 | shift_cutoff = chi2_cutoff(gmm.D, cutoff=min(0.1, cutoff/2)) 757 | else: 758 | cutoff_nd = None 759 | shift_cutoff = chi2_cutoff(gmm.D, cutoff=0.1) 760 | 761 | if sel_callback is not None: 762 | omega = createShared(sel_callback(data).astype("float")) 763 | if np.any(omega == 0): 764 | logger.warning("Selection probability Omega = 0 for an observed sample.") 765 | logger.warning("Selection callback likely incorrect! 
Bad things will happen!") 766 | else: 767 | omega = None 768 | 769 | it = 0 770 | header = "ITER\tSAMPLES" 771 | if sel_callback is not None: 772 | header += "\tIMPUTED\tORIG" 773 | if background is not None: 774 | header += "\tBG_AMP" 775 | header += "\tLOG_L\tSTABLE" 776 | logger.info(header) 777 | 778 | # save backup 779 | gmm_ = GMM(gmm.K, gmm.D) 780 | gmm_.amp[:] = gmm.amp[:] 781 | gmm_.mean[:,:] = gmm.mean[:,:] 782 | gmm_.covar[:,:,:] = gmm.covar[:,:,:] 783 | N0 = len(data) # size of original (unobscured) data set (signal and background) 784 | N2 = 0 # size of imputed signal sample 785 | if background is not None: 786 | bg_amp_ = background.amp 787 | 788 | while it < maxiter: # limit loop in case of slow convergence 789 | log_L_, N, N2_, N0_ = _EMstep(gmm, log_p, U, T_inv, log_S, N0, data, covar=covar, R=R, sel_callback=sel_callback, omega=omega, oversampling=oversampling, covar_callback=covar_callback, background=background, p_bg=p_bg , w=w, pool=pool, chunksize=chunksize, cutoff=cutoff_nd, tol=tol, changeable=changeable, it=it, rng=rng) 790 | 791 | # check if component has moved by more than sigma/2 792 | shift2 = np.einsum('...i,...ij,...j', gmm.mean - gmm_.mean, np.linalg.inv(gmm_.covar), gmm.mean - gmm_.mean) 793 | moved = np.flatnonzero(shift2 > shift_cutoff) 794 | status_mess = "%s%d\t%d" % (prefix, it, N) 795 | if sel_callback is not None: 796 | status_mess += "\t%.2f\t%.2f" % (N2_, N0_) 797 | if background is not None: 798 | status_mess += "\t%.3f" % bg_amp_ 799 | status_mess += "\t%.3f\t%d" % (log_L_, gmm.K - moved.size) 800 | logger.info(status_mess) 801 | 802 | # convergence tests 803 | if it > miniter: 804 | if sel_callback is None: 805 | if np.abs(log_L_ - log_L) < tol * np.abs(log_L) and moved.size == 0: 806 | log_L = log_L_ 807 | logger.info("likelihood converged within relative tolerance %r: stopping here." % tol) 808 | break 809 | else: 810 | if np.abs(N0_ - N0) < tol * N0 and np.abs(N2_ - N2) < tol * N2 and moved.size == 0: 811 | log_L = log_L_ 812 | logger.info("imputation sample size converged within relative tolerance %r: stopping here." % tol) 813 | break 814 | 815 | # force update to U for all moved components 816 | if cutoff is not None: 817 | for k in moved: 818 | U[k] = None 819 | 820 | if moved.size: 821 | logger.debug("resetting neighborhoods of moving components: (" + ("%d," * moved.size + ")") % tuple(moved)) 822 | 823 | # update all important _ quantities for convergence test(s) 824 | log_L = log_L_ 825 | N0 = N0_ 826 | N2 = N2_ 827 | 828 | # backup to see if components move or if next step gets worse 829 | # note: not gmm = gmm_ ! 830 | gmm_.amp[:] = gmm.amp[:] 831 | gmm_.mean[:,:] = gmm.mean[:,:] 832 | gmm_.covar[:,:,:] = gmm.covar[:,:,:] 833 | if background is not None: 834 | bg_amp_ = background.amp 835 | 836 | it += 1 837 | 838 | return log_L, N, N2 839 | 840 | # run one EM step 841 | def _EMstep(gmm, log_p, U, T_inv, log_S, N0, data, covar=None, R=None, sel_callback=None, omega=None, oversampling=10, covar_callback=None, background=None, p_bg=None, w=0, pool=None, chunksize=1, cutoff=None, tol=1e-3, changeable=None, it=0, rng=np.random): 842 | 843 | # NOTE: T_inv (in fact (T_ik)^-1 for all samples i and components k) 844 | # is very large and is unfortunately duplicated in the parallelized _Mstep. 845 | # If memory is too limited, one can recompute T_inv in _Msums() instead. 
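    # One iteration in full: E-step over the observed samples, M-step moment
    # sums (A, M, C), and, if sel_callback is set, a second E/M pass over
    # imputed samples drawn from the current model with inverted selection;
    # _update() then combines the moments of both sets.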
846 | log_L = _Estep(gmm, log_p, U, T_inv, log_S, data, covar=covar, R=R, omega=omega, background=background, p_bg=p_bg, pool=pool, chunksize=chunksize, cutoff=cutoff, it=it) 847 | A,M,C,N,B = _Mstep(gmm, U, log_p, T_inv, log_S, data, covar=covar, R=R, p_bg=p_bg, pool=pool, chunksize=chunksize) 848 | 849 | A2 = M2 = C2 = B2 = N2 = 0 850 | 851 | # here the magic happens: imputation from the current model 852 | if sel_callback is not None: 853 | 854 | # if there are projections / missing data, we don't know how to 855 | # generate those for the imputation samples 856 | # NOTE: in principle, if there are only missing data, i.e. R is 1_D, 857 | # we could ignore missingness for data2 because we'll do an analytic 858 | # marginalization. This doesn't work if R is a non-trivial matrix. 859 | if R is not None: 860 | raise NotImplementedError("R is not None: imputation samples likely inconsistent") 861 | 862 | # create fake data with same mechanism as the original data, 863 | # but invert selection to get the missing part 864 | data2, covar2, N0, omega2 = draw(gmm, len(data)*oversampling, sel_callback=sel_callback, orig_size=N0*oversampling, invert_sel=True, covar_callback=covar_callback, background=background, rng=rng) 865 | data2 = createShared(data2) 866 | if not(covar2 is None or covar2.shape == (gmm.D, gmm.D)): 867 | covar2 = createShared(covar2) 868 | 869 | N0 = N0/oversampling 870 | U2 = [None for k in xrange(gmm.K)] 871 | 872 | if len(data2) > 0: 873 | log_S2 = np.zeros(len(data2)) 874 | log_p2 = [[] for k in xrange(gmm.K)] 875 | T2_inv = [None for k in xrange(gmm.K)] 876 | R2 = None 877 | if background is not None: 878 | p_bg2 = [None] 879 | else: 880 | p_bg2 = None 881 | 882 | log_L2 = _Estep(gmm, log_p2, U2, T2_inv, log_S2, data2, covar=covar2, R=R2, omega=None, background=background, p_bg=p_bg2, pool=pool, chunksize=chunksize, cutoff=cutoff, it=it) 883 | A2,M2,C2,N2,B2 = _Mstep(gmm, U2, log_p2, T2_inv, log_S2, data2, covar=covar2, R=R2, p_bg=p_bg2, pool=pool, chunksize=chunksize) 884 | 885 | # normalize for oversampling 886 | A2 /= oversampling 887 | M2 /= oversampling 888 | C2 /= oversampling 889 | B2 /= oversampling 890 | N2 = N2/oversampling # need floating point precision in update 891 | 892 | # check if components have outside selection 893 | sel_outside = A2 > tol * A 894 | if sel_outside.any(): 895 | logger.debug("component inside fractions: " + ("(" + "%.2f," * gmm.K + ")") % tuple(A/(A+A2))) 896 | 897 | # correct the observed likelihood for the overall normalization constant of 898 | # of the data process with selection: 899 | # logL(x | gmm) = sum_k p_k(x) / Z(gmm), with Z(gmm) = int dx sum_k p_k(x) = 1 900 | # becomes 901 | # logL(x | gmm) = sum_k Omega(x) p_k(x) / Z'(gmm), 902 | # with Z'(gmm) = int dx Omega(x) sum_k p_k(x), which we can gt by MC integration 903 | log_L -= N * np.log((omega.sum() + omega2.sum() / oversampling) / (N + N2)) 904 | 905 | _update(gmm, A, M, C, N, B, A2, M2, C2, N2, B2, w, changeable=changeable, background=background) 906 | 907 | return log_L, N, N2, N0 908 | 909 | # perform E step calculations. 
910 | # If cutoff is set, this will also set the neighborhoods U 911 | def _Estep(gmm, log_p, U, T_inv, log_S, data, covar=None, R=None, omega=None, background=None, p_bg=None, pool=None, chunksize=1, cutoff=None, it=0, rng=np.random): 912 | # compute p(i | k) for each k independently in the pool 913 | # need S = sum_k p(i | k) for further calculation 914 | log_S[:] = 0 915 | 916 | # H = {i | i in neighborhood[k]} for any k, needed for outliers below 917 | # TODO: Use only when cutoff is set 918 | H = np.zeros(len(data), dtype="bool") 919 | 920 | k = 0 921 | for log_p[k], U[k], T_inv[k] in \ 922 | parmap.starmap(_Esum, zip(xrange(gmm.K), U), gmm, data, covar, R, cutoff, pm_pool=pool, pm_chunksize=chunksize): 923 | log_S[U[k]] += np.exp(log_p[k]) # actually S, not logS 924 | H[U[k]] = 1 925 | k += 1 926 | 927 | if background is not None: 928 | p_bg[0] = background.amp * background.p 929 | if covar is not None: 930 | # This is the zeroth moment of a truncated Normal error distribution 931 | # Its calculation is simple only of the covariance is diagonal! 932 | # See e.g. Manjunath & Wilhem (2012) if not 933 | error = np.ones(len(data)) 934 | x0,x1 = background.footprint 935 | for d in range(gmm.D): 936 | if covar.shape == (gmm.D, gmm.D): # one-for-all 937 | denom = np.sqrt(2 * covar[d,d]) 938 | else: 939 | denom = np.sqrt(2 * covar[:,d,d]) 940 | # CAUTION: The erf is approximate and returns 0 941 | # Thus, we don't add the logs but multiple the value itself 942 | # underrun is not a big problem here 943 | error *= np.real(scipy.special.erf((data[:,d] - x0[d])/denom) - scipy.special.erf((data[:,d] - x1[d])/denom)) / 2 944 | p_bg[0] *= error 945 | log_S[:] = np.log(log_S + p_bg[0]) 946 | if omega is not None: 947 | log_S += np.log(omega) 948 | log_L = log_S.sum() 949 | else: 950 | # need log(S), but since log(0) isn't a good idea, need to restrict to H 951 | log_S[H] = np.log(log_S[H]) 952 | if omega is not None: 953 | log_S += np.log(omega) 954 | log_L = log_S[H].sum() 955 | 956 | return log_L 957 | 958 | # compute chi^2, and apply selections on component neighborhood based in chi^2 959 | def _Esum(k, U_k, gmm, data, covar=None, R=None, cutoff=None): 960 | # since U_k could be None, need explicit reshape 961 | d_ = data[U_k].reshape(-1, gmm.D) 962 | if covar is not None: 963 | if covar.shape == (gmm.D, gmm.D): # one-for-all 964 | covar_ = covar 965 | else: # each datum has covariance 966 | covar_ = covar[U_k].reshape(-1, gmm.D, gmm.D) 967 | else: 968 | covar_ = 0 969 | if R is not None: 970 | R_ = R[U_k].reshape(-1, gmm.D, gmm.D) 971 | 972 | # p(x | k) for all x in the vicinity of k 973 | # determine all points within cutoff sigma from mean[k] 974 | if R is None: 975 | dx = d_ - gmm.mean[k] 976 | else: 977 | dx = d_ - np.dot(R_, gmm.mean[k]) 978 | 979 | if covar is None and R is None: 980 | T_inv_k = None 981 | chi2 = np.einsum('...i,...ij,...j', dx, np.linalg.inv(gmm.covar[k]), dx) 982 | else: 983 | # with data errors: need to create and return T_ik = covar_i + C_k 984 | # and weight each datum appropriately 985 | if R is None: 986 | T_inv_k = np.linalg.inv(gmm.covar[k] + covar_) 987 | else: # need to project out missing elements: T_ik = R_i C_k R_i^R + covar_i 988 | T_inv_k = np.linalg.inv(np.einsum('...ij,jk,...lk', R_, gmm.covar[k], R_) + covar_) 989 | chi2 = np.einsum('...i,...ij,...j', dx, T_inv_k, dx) 990 | 991 | # NOTE: close to convergence, we could stop applying the cutoff because 992 | # changes to U will be minimal 993 | if cutoff is not None: 994 | indices = chi2 < cutoff 995 | 
chi2 = chi2[indices] 996 | if (covar is not None and covar.shape != (gmm.D, gmm.D)) or R is not None: 997 | T_inv_k = T_inv_k[indices] 998 | if U_k is None: 999 | U_k = np.flatnonzero(indices) 1000 | else: 1001 | U_k = U_k[indices] 1002 | 1003 | # prevent tiny negative determinants to mess up 1004 | if covar is None: 1005 | (sign, logdet) = np.linalg.slogdet(gmm.covar[k]) 1006 | else: 1007 | (sign, logdet) = np.linalg.slogdet(T_inv_k) 1008 | sign *= -1 # since det(T^-1) = 1/det(T) 1009 | 1010 | log2piD2 = np.log(2*np.pi)*(0.5*gmm.D) 1011 | return np.log(gmm.amp[k]) - log2piD2 - sign*logdet/2 - chi2/2, U_k, T_inv_k 1012 | 1013 | # get zeroth, first, second moments of the data weighted with p_k(x) avgd over x 1014 | def _Mstep(gmm, U, log_p, T_inv, log_S, data, covar=None, R=None, p_bg=None, pool=None, chunksize=1): 1015 | 1016 | # save the M sums from observed data 1017 | A = np.empty(gmm.K) # sum for amplitudes 1018 | M = np.empty((gmm.K, gmm.D)) # ... means 1019 | C = np.empty((gmm.K, gmm.D, gmm.D)) # ... covariances 1020 | N = len(data) 1021 | 1022 | # perform sums for M step in the pool 1023 | # NOTE: in a partial run, could work on changeable components only; 1024 | # however, there seem to be side effects or race conditions 1025 | k = 0 1026 | for A[k], M[k,:], C[k,:,:] in \ 1027 | parmap.starmap(_Msums, zip(xrange(gmm.K), U, log_p, T_inv), gmm, data, R, log_S, pm_pool=pool, pm_chunksize=chunksize): 1028 | k += 1 1029 | 1030 | if p_bg is not None: 1031 | q_bg = p_bg[0] / np.exp(log_S) 1032 | B = q_bg.sum() # equivalent to A_k in _Msums, but done without logs 1033 | else: 1034 | B = 0 1035 | 1036 | return A,M,C,N,B 1037 | 1038 | # compute moments for the Mstep 1039 | def _Msums(k, U_k, log_p_k, T_inv_k, gmm, data, R, log_S): 1040 | if log_p_k.size == 0: 1041 | return 0,0,0 1042 | 1043 | # get log_q_ik by dividing with S = sum_k p_ik 1044 | # NOTE: this modifies log_p_k in place, but is only relevant 1045 | # within this method since the call is parallel and its arguments 1046 | # therefore don't get updated across components. 1047 | 1048 | # NOTE: reshape needed when U_k is None because of its 1049 | # implicit meaning as np.newaxis 1050 | log_p_k -= log_S[U_k].reshape(log_p_k.size) 1051 | d = data[U_k].reshape((log_p_k.size, gmm.D)) 1052 | if R is not None: 1053 | R_ = R[U_k].reshape((log_p_k.size, gmm.D, gmm.D)) 1054 | 1055 | # amplitude: A_k = sum_i q_ik 1056 | A_k = np.exp(logsum(log_p_k)) 1057 | 1058 | # in fact: q_ik, but we treat sample index i silently everywhere 1059 | q_k = np.exp(log_p_k) 1060 | 1061 | if R is None: 1062 | d_m = d - gmm.mean[k] 1063 | else: 1064 | d_m = d - np.dot(R_, gmm.mean[k]) 1065 | 1066 | # data with errors? 
1067 | if T_inv_k is None and R is None: 1068 | # mean: M_k = sum_i x_i q_ik 1069 | M_k = (d * q_k[:,None]).sum(axis=0) 1070 | 1071 | # covariance: C_k = sum_i (x_i - mu_k)^T(x_i - mu_k) q_ik 1072 | # funny way of saying: for each point i, do the outer product 1073 | # of d_m with its transpose, multiply with pi[i], and sum over i 1074 | C_k = (q_k[:, None, None] * d_m[:, :, None] * d_m[:, None, :]).sum(axis=0) 1075 | else: 1076 | if R is None: # that means T_ik is not None 1077 | # b_ik = mu_k + C_k T_ik^-1 (x_i - mu_k) 1078 | # B_ik = C_k - C_k T_ik^-1 C_k 1079 | b_k = gmm.mean[k] + np.einsum('ij,...jk,...k', gmm.covar[k], T_inv_k, d_m) 1080 | B_k = gmm.covar[k] - np.einsum('ij,...jk,...kl', gmm.covar[k], T_inv_k, gmm.covar[k]) 1081 | else: 1082 | # F_ik = C_k R_i^T T_ik^-1 1083 | F_k = np.einsum('ij,...kj,...kl', gmm.covar[k], R_, T_inv_k) 1084 | b_k = gmm.mean[k] + np.einsum('...ij,...j', F_k, d_m) 1085 | B_k = gmm.covar[k] - np.einsum('...ij,...jk,kl', F_k, R_, gmm.covar[k]) 1086 | 1087 | #b_k = gmm.mean[k] + np.einsum('ij,...jk,...k', gmm.covar[k], T_inv_k, d_m) 1088 | #B_k = gmm.covar[k] - np.einsum('ij,...jk,...kl', gmm.covar[k], T_inv_k, gmm.covar[k]) 1089 | M_k = (b_k * q_k[:,None]).sum(axis=0) 1090 | b_k -= gmm.mean[k] 1091 | C_k = (q_k[:, None, None] * (b_k[:, :, None] * b_k[:, None, :] + B_k)).sum(axis=0) 1092 | return A_k, M_k, C_k 1093 | 1094 | 1095 | # update component with the moment matrices. 1096 | # If changeable is set, update only those components and renormalize the amplitudes 1097 | def _update(gmm, A, M, C, N, B, A2, M2, C2, N2, B2, w, changeable=None, background=None): 1098 | 1099 | # recompute background amplitude 1100 | if background is not None and background.adjust_amp: 1101 | background.amp = max(min((B + B2) / (N + N2), background.amp_max), background.amp_min) 1102 | 1103 | # amp update: 1104 | # for partial update: need to update amp for any component that is changeable 1105 | if not hasattr(changeable['amp'], '__iter__'): # it's a slice(None), not a bool array 1106 | gmm.amp[changeable['amp']] = (A + A2)[changeable['amp']] / (N + N2) 1107 | else: 1108 | # Bovy eq. 31, with correction for bg.amp if needed 1109 | if background is None: 1110 | total = 1 1111 | else: 1112 | total = 1 - background.amp 1113 | gmm.amp[changeable['amp']] = (A + A2)[changeable['amp']] / (A + A2)[changeable['amp']].sum() * (total - (gmm.amp[~changeable['amp']]).sum()) 1114 | 1115 | # mean updateL 1116 | gmm.mean[changeable['mean'],:] = (M + M2)[changeable['mean'],:]/(A + A2)[changeable['mean'],None] 1117 | 1118 | # covar updateL 1119 | # minimum covariance term? 1120 | if w > 0: 1121 | # we assume w to be a lower bound of the isotropic dispersion, 1122 | # C_k = w^2 I + ... 1123 | # then eq. 38 in Bovy et al. only ~works for N = 0 because of the 1124 | # prefactor 1 / (q_j + 1) = 1 / (A + 1) in our terminology 1125 | # On average, q_j = N/K, so we'll adopt that to correct. 
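        # i.e. w_eff = w^2 * ((N+N2)/K + 1); with the 1/(A + 1) prefactor in
        # the update below, this keeps the isotropic floor of each covariance at ~ w^2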
1126 | w_eff = w**2 * ((N+N2)/gmm.K + 1) 1127 | gmm.covar[changeable['covar'],:,:] = (C + C2 + w_eff*np.eye(gmm.D)[None,:,:])[changeable['covar'],:,:] / (A + A2 + 1)[changeable['covar'],None,None] 1128 | else: 1129 | gmm.covar[changeable['covar'],:,:] = (C + C2)[changeable['covar'],:,:] / (A + A2)[changeable['covar'],None,None] 1130 | 1131 | # draw from the model (+ background) and apply appropriate covariances 1132 | def _drawGMM_BG(gmm, size, covar_callback=None, background=None, rng=np.random): 1133 | # draw sample from model, or from background+model 1134 | if background is None: 1135 | data2 = gmm.draw(int(np.round(size)), rng=rng) 1136 | else: 1137 | # model is GMM + Background 1138 | bg_size = int(background.amp * size) 1139 | data2 = np.concatenate((gmm.draw(int(np.round(size-bg_size)), rng=rng), background.draw(int(np.round(bg_size)), rng=rng))) 1140 | 1141 | # add noise 1142 | # NOTE: When background is set, adding noise is problematic if 1143 | # scattering them out is more likely than in. 1144 | # This can be avoided when the background footprint is large compared to 1145 | # selection region 1146 | if covar_callback is not None: 1147 | covar2 = covar_callback(data2) 1148 | if covar2.shape == (gmm.D, gmm.D): # one-for-all 1149 | noise = rng.multivariate_normal(np.zeros(gmm.D), covar2, size=len(data2)) 1150 | else: 1151 | # create noise from unit covariance and then dot with eigenvalue 1152 | # decomposition of covar2 to get a the right noise distribution: 1153 | # n' = R V^1/2 n, where covar = R V R^-1 1154 | # faster than drawing one sample per each covariance 1155 | noise = rng.multivariate_normal(np.zeros(gmm.D), np.eye(gmm.D), size=len(data2)) 1156 | val, rot = np.linalg.eigh(covar2) 1157 | val = np.maximum(val,0) # to prevent univariate errors to underflow 1158 | noise = np.einsum('...ij,...j', rot, np.sqrt(val)*noise) 1159 | data2 += noise 1160 | else: 1161 | covar2 = None 1162 | return data2, covar2 1163 | 1164 | 1165 | def draw(gmm, obs_size, sel_callback=None, invert_sel=False, orig_size=None, covar_callback=None, background=None, rng=np.random): 1166 | """Draw from the GMM (and the Background) with noise and selection. 1167 | 1168 | Draws orig_size samples from the GMM and the Background, if set; calls 1169 | covar_callback if set and applies resulting covariances; the calls 1170 | sel_callback on the (noisy) samples and returns those matching ones. 1171 | 1172 | If the number is resulting samples is inconsistent with obs_size, i.e. 1173 | outside of the 68 percent confidence limit of a Poisson draw, it will 1174 | update its estimate for the original sample size orig_size. 1175 | An estimate can be provided with orig_size, otherwise it will use obs_size. 1176 | 1177 | Note: 1178 | If sel_callback is set, the number of returned samples is not 1179 | necessarily given by obs_size. 1180 | 1181 | Args: 1182 | gmm: an instance if GMM 1183 | obs_size (int): number of observed samples 1184 | sel_callback: completeness callback to generate imputation samples. 1185 | invert_sel (bool): whether to invert the result of sel_callback 1186 | orig_size (int): an estimate of the original size of the sample. 1187 | background: an instance of Background 1188 | covar_callback: covariance callback for imputation samples. 
1165 | def draw(gmm, obs_size, sel_callback=None, invert_sel=False, orig_size=None, covar_callback=None, background=None, rng=np.random):
1166 |     """Draw from the GMM (and the Background) with noise and selection.
1167 | 
1168 |     Draws orig_size samples from the GMM and the Background, if set; calls
1169 |     covar_callback if set and applies the resulting covariances; then calls
1170 |     sel_callback on the (noisy) samples and returns the matching ones.
1171 | 
1172 |     If the number of resulting samples is inconsistent with obs_size, i.e.
1173 |     outside of the 68 percent confidence limit of a Poisson draw, it will
1174 |     update its estimate for the original sample size orig_size.
1175 |     An estimate can be provided with orig_size, otherwise it will use obs_size.
1176 | 
1177 |     Note:
1178 |         If sel_callback is set, the number of returned samples is not
1179 |         necessarily given by obs_size.
1180 | 
1181 |     Args:
1182 |         gmm: an instance of GMM
1183 |         obs_size (int): number of observed samples
1184 |         sel_callback: completeness callback to generate imputation samples.
1185 |         invert_sel (bool): whether to invert the result of sel_callback
1186 |         orig_size (int): an estimate of the original size of the sample.
1187 |         background: an instance of Background
1188 |         covar_callback: covariance callback for imputation samples.
1189 |         rng: numpy.random.RandomState for deterministic behavior
1190 | 
1191 |     Returns:
1192 |         sample: numpy array (N_orig, D)
1193 |         covar_sample: numpy array (N_orig, D, D) or None if covar_callback=None
1194 |         N_orig (int): updated estimate of orig_size if sel_callback is set
1195 |         omega: numpy array of selection probabilities for the returned samples
1196 |     Throws:
1197 |         RuntimeError for inconsistent argument combinations
1198 |     """
1199 | 
1200 |     if orig_size is None:
1201 |         orig_size = int(obs_size)
1202 | 
1203 |     # draw from model (with background) and add noise.
1204 |     # TODO: may want to decide whether to add noise before selection or after
1205 |     # Here we do noise, then selection, but this is not fundamental
1206 |     data, covar = _drawGMM_BG(gmm, orig_size, covar_callback=covar_callback, background=background, rng=rng)
1207 |     omega = np.ones(len(data)) # selection probability; defaults to 1 without sel_callback
1208 |     # apply selection
1209 |     if sel_callback is not None:
1210 |         omega = sel_callback(data)
1211 |         sel = rng.rand(len(data)) < omega
1212 | 
1213 |         # check if predicted observed size is consistent with observed data
1214 |         # 68% confidence interval for Poisson variate: observed size
1215 |         alpha = 0.32
1216 |         lower = 0.5*scipy.stats.chi2.ppf(alpha/2, 2*obs_size)
1217 |         upper = 0.5*scipy.stats.chi2.ppf(1 - alpha/2, 2*obs_size + 2)
1218 |         obs_size_ = sel.sum()
1219 |         while obs_size_ > upper or obs_size_ < lower:
1220 |             orig_size = int(orig_size / obs_size_ * obs_size)
1221 |             data, covar = _drawGMM_BG(gmm, orig_size, covar_callback=covar_callback, background=background, rng=rng)
1222 |             omega = sel_callback(data)
1223 |             sel = rng.rand(len(data)) < omega
1224 |             obs_size_ = sel.sum()
1225 | 
1226 |         if invert_sel:
1227 |             sel = ~sel
1228 |         data = data[sel]
1229 |         omega = omega[sel]
1230 |         if covar_callback is not None and covar.shape != (gmm.D, gmm.D):
1231 |             covar = covar[sel]
1232 | 
1233 |     return data, covar, orig_size, omega
1234 | 
1235 | 
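# e.g. generate a mock observed catalog matched to 400 observed samples
# (sel_cb and covar_cb stand for user-supplied callbacks as described above):
#   data, covar, N_orig, omega = draw(gmm, 400, sel_callback=sel_cb, covar_callback=covar_cb, rng=rng)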
1236 | def _JS(k, gmm, log_p, log_S, U, A):
1237 |     # compute Kullback-Leibler divergence
1238 |     log_q_k = log_p[k] - log_S[U[k]]
1239 |     return np.dot(np.exp(log_q_k), log_q_k - np.log(A[k]) - log_p[k] + np.log(gmm.amp[k])) / A[k]
1240 | 
1241 | 
1242 | def _findSNMComponents(gmm, U, log_p, log_S, N, pool=None, chunksize=1):
1243 |     # find those components that are most similar
1244 |     JM = np.zeros((gmm.K, gmm.K))
1245 |     # compute log_q (posterior for k given i), but use normalized probabilities
1246 |     # to allow for merging of empty components
1247 |     log_q = [log_p[k] - log_S[U[k]] - np.log(gmm.amp[k]) for k in xrange(gmm.K)]
1248 |     for k in xrange(gmm.K):
1249 |         # don't need diagonal (can merge), and JM is symmetric
1250 |         for j in xrange(k+1, gmm.K):
1251 |             # get index list for intersection of U of k and j
1252 |             # FIXME: match1d fails if either U is empty
1253 |             # SOLUTION: merge empty U, split another
1254 |             i_k, i_j = match1d(U[k], U[j], presorted=True)
1255 |             JM[k,j] = np.dot(np.exp(log_q[k][i_k]), np.exp(log_q[j][i_j]))
1256 |     merge_jk = np.unravel_index(JM.argmax(), JM.shape)
1257 |     # if all Us are disjoint, JM is blank and merge_jk = [0,0]
1258 |     # merge the two smallest components and clean up from the bottom
1259 |     cleanup = False
1260 |     if merge_jk[0] == 0 and merge_jk[1] == 0:
1261 |         merge_jk = np.argsort(gmm.amp)[:2]
1262 |         logger.debug("neighborhoods disjoint. merging components %d and %d" % tuple(merge_jk))
1263 |         cleanup = True
1264 | 
1265 | 
1266 |     # split the one whose p(x|k) deviates most from current Gaussian
1267 |     # ask for the three worst components to avoid split being in merge_jk
1268 |     """
1269 |     JS = np.empty(gmm.K)
1270 |     k = 0
1271 |     A = gmm.amp * N
1272 |     for JS[k] in \
1273 |         parmap.map(_JS, xrange(gmm.K), gmm, log_p, log_S, U, A, pm_pool=pool, pm_chunksize=chunksize):
1274 |         k += 1
1275 |     """
1276 |     # get largest Eigenvalue, weighed by amplitude
1277 |     # Large EV implies extended object, which often is caused by covering
1278 |     # multiple clusters. This happens also for almost empty components, which
1279 |     # should rather be merged than split, hence the amplitude weights.
1280 |     # TODO: replace with linalg.eigvalsh, but eigenvalues are not always ordered
1281 |     EV = np.linalg.svd(gmm.covar, compute_uv=False)
1282 |     JS = EV[:,0] * gmm.amp
1283 |     split_l3 = np.argsort(JS)[-3:][::-1]
1284 | 
1285 |     # check that the three indices are unique
1286 |     changing = np.array([merge_jk[0], merge_jk[1], split_l3[0]])
1287 |     if split_l3[0] in merge_jk:
1288 |         if split_l3[1] not in merge_jk:
1289 |             changing[2] = split_l3[1]
1290 |         else:
1291 |             changing[2] = split_l3[2]
1292 |     return changing, cleanup
1293 | 
1294 | 
1295 | def _update_snm(gmm, changeable, U, N, cleanup):
1296 |     # reconstruct A from gmm.amp
1297 |     A = gmm.amp * N
1298 | 
1299 |     # update parameters and U
1300 |     # merge 0 and 1, store in 0, Bovy eq. 39
1301 |     gmm.amp[changeable[0]] = gmm.amp[changeable[0:2]].sum()
1302 |     if not cleanup:
1303 |         gmm.mean[changeable[0]] = np.sum(gmm.mean[changeable[0:2]] * A[changeable[0:2]][:,None], axis=0) / A[changeable[0:2]].sum()
1304 |         gmm.covar[changeable[0]] = np.sum(gmm.covar[changeable[0:2]] * A[changeable[0:2]][:,None,None], axis=0) / A[changeable[0:2]].sum()
1305 |         U[changeable[0]] = np.union1d(U[changeable[0]], U[changeable[1]])
1306 |     else:
1307 |         # if we're cleaning up the weakest components:
1308 |         # merging does not lead to valid component parameters as the original
1309 |         # ones can be anywhere. Simply adopt the second one.
1310 |         gmm.mean[changeable[0],:] = gmm.mean[changeable[1],:]
1311 |         gmm.covar[changeable[0],:,:] = gmm.covar[changeable[1],:,:]
1312 |         U[changeable[0]] = U[changeable[1]]
1313 | 
1314 |     # split 2, store in 1 and 2
1315 |     # following SVD method in Zhang 2003, with alpha=1/2, u = 1/4
1316 |     gmm.amp[changeable[1]] = gmm.amp[changeable[2]] = gmm.amp[changeable[2]] / 2
1317 |     # TODO: replace with linalg.eigvalsh, but eigenvalues are not always ordered
1318 |     _, radius2, rotation = np.linalg.svd(gmm.covar[changeable[2]])
1319 |     dl = np.sqrt(radius2[0]) * rotation[0] / 4
1320 |     gmm.mean[changeable[1]] = gmm.mean[changeable[2]] - dl
1321 |     gmm.mean[changeable[2]] = gmm.mean[changeable[2]] + dl
1322 |     gmm.covar[changeable[1:]] = np.linalg.det(gmm.covar[changeable[2]])**(1./gmm.D) * np.eye(gmm.D) # 1./gmm.D: avoid integer division
1323 |     U[changeable[1]] = U[changeable[2]].copy() # now 1 and 2 have same U
1324 | 
1325 | 
1326 | # L-fold cross-validation of the fit function.
1327 | # all parameters for fit must be supplied with kwargs.
1328 | # the rng seed will be fixed for the CV runs so that all random effects are the
1329 | # same for each run.
1330 | def cv_fit(gmm, data, L=10, **kwargs):
1331 |     N = len(data)
1332 |     lcv = np.empty(N)
1333 |     logger.info("running %d-fold cross-validation ..." % L)
1334 | 
1335 |     # CV and stacking can't have probabilistic inits that depend on
1336 |     # data or subsets thereof
1337 |     init_callback = kwargs.get("init_callback", None)
1338 |     if init_callback is not None:
1339 |         raise RuntimeError("Cross-validation can only be used consistently with init_callback=None")
1340 | 
1341 |     # make sure we know what the RNG is,
1342 |     # fix state of RNG to make behavior of fit reproducible
1343 |     rng = kwargs.get("rng", np.random)
1344 |     rng_state = rng.get_state()
1345 | 
1346 |     # need to copy the gmm when init_callback is None,
1347 |     # otherwise runs start from different init positions
1348 |     gmm0 = GMM(K=gmm.K, D=gmm.D)
1349 |     gmm0.amp[:,] = gmm.amp[:]
1350 |     gmm0.mean[:,:] = gmm.mean[:,:]
1351 |     gmm0.covar[:,:,:] = gmm.covar[:,:,:]
1352 | 
1353 |     # same for bg if present
1354 |     bg = kwargs.get("background", None)
1355 |     if bg is not None:
1356 |         bg_amp0 = bg.amp
1357 | 
1358 |     # to do L-fold CV here, need to split covar too if set
1359 |     covar = kwargs.pop("covar", None)
1360 |     for i in xrange(L):
1361 |         rng.set_state(rng_state)
1362 |         mask = np.arange(N) % L == i
1363 |         if covar is None or covar.shape == (gmm.D, gmm.D):
1364 |             fit(gmm, data[~mask], covar=covar, **kwargs)
1365 |             lcv[mask] = gmm.logL(data[mask], covar=covar)
1366 |         else:
1367 |             fit(gmm, data[~mask], covar=covar[~mask], **kwargs)
1368 |             lcv[mask] = gmm.logL(data[mask], covar=covar[mask])
1369 | 
1370 |         # undo for consistency
1371 |         gmm.amp[:,] = gmm0.amp[:]
1372 |         gmm.mean[:,:] = gmm0.mean[:,:]
1373 |         gmm.covar[:,:,:] = gmm0.covar[:,:,:]
1374 |         if bg is not None:
1375 |             bg.amp = bg_amp0
1376 | 
1377 |     return lcv
1378 | 
1379 | 
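# e.g. per-sample held-out log-likelihoods from 10-fold CV, with the fit()
# arguments passed through as keywords:
#   lcv = cv_fit(gmm, data, L=10, w=w, cutoff=5)
#   logger.info("mean held-out logL: %.4f" % lcv.mean())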
1380 | def stack(gmms, weights):
1381 |     # build stacked model by combining all gmms and applying weights to amps
1382 |     stacked = GMM(K=0, D=gmms[0].D)
1383 |     for m in xrange(len(gmms)):
1384 |         stacked.amp = np.concatenate((stacked.amp[:], weights[m]*gmms[m].amp[:]))
1385 |         stacked.mean = np.concatenate((stacked.mean[:,:], gmms[m].mean[:,:]))
1386 |         stacked.covar = np.concatenate((stacked.covar[:,:,:], gmms[m].covar[:,:,:]))
1387 |     stacked.amp /= stacked.amp.sum()
1388 |     return stacked
1389 | 
1390 | 
1391 | def stack_fit(gmms, data, kwargs, L=10, tol=1e-5, rng=np.random):
1392 |     M = len(gmms)
1393 |     N = len(data)
1394 |     lcvs = np.empty((M,N))
1395 | 
1396 |     for m in xrange(M):
1397 |         # run CV to get cross-validation likelihood
1398 |         rng_state = rng.get_state()
1399 |         lcvs[m,:] = cv_fit(gmms[m], data, L=L, **(kwargs[m]))
1400 |         rng.set_state(rng_state)
1401 |         # run normal fit on all data
1402 |         fit(gmms[m], data, **(kwargs[m]))
1403 | 
1404 |     # determine the weights that maximize the stacked estimator likelihood
1405 |     # run a tiny EM on lcvs to get them
1406 |     beta = np.ones(M)/M
1407 |     log_p_k = np.empty_like(lcvs)
1408 |     log_S = np.empty(N)
1409 |     it = 0
1410 |     logger.info("optimizing stacking weights\n")
1411 |     logger.info("ITER\tLOG_L")
1412 | 
1413 |     while it < 20:
1414 |         log_p_k[:,:] = lcvs + np.log(beta)[:,None]
1415 |         log_S[:] = logsum(log_p_k)
1416 |         log_p_k[:,:] -= log_S
1417 |         beta[:] = np.exp(logsum(log_p_k, axis=1)) / N
1418 |         logL_ = log_S.mean()
1419 |         logger.info("STACK%d\t%.4f" % (it, logL_))
1420 | 
1421 |         if it > 0 and logL_ - logL < tol:
1422 |             break
1423 |         logL = logL_
1424 |         it += 1
1425 |     return stack(gmms, beta)
1426 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import setup
 2 | 
 3 | long_description = open('README.md').read()
 4 | 
 5 | setup(
 6 |     name="pygmmis",
 7 |     version='1.2.3',
 8 |     description="Gaussian mixture model for incomplete, truncated, and noisy data",
 9 |     long_description = long_description,
10 |     long_description_content_type='text/markdown',
11 |     author="Peter Melchior",
12 |     author_email="peter.m.melchior@gmail.com",
13 |     license='MIT',
14 |     py_modules=["pygmmis"],
15 |     url="https://github.com/pmelchior/pygmmis",
16 |     classifiers=[
17 |         "Development Status :: 5 - Production/Stable",
18 |         "License :: OSI Approved :: MIT License",
19 |         "Intended Audience :: Developers",
20 |         "Intended Audience :: Science/Research",
21 |         "Operating System :: OS Independent",
22 |         "Programming Language :: Python",
23 |         "Topic :: Scientific/Engineering :: Information Analysis"
24 |     ],
25 |     install_requires=["numpy","scipy","parmap>=1.5.2"]
26 | )
27 | 
--------------------------------------------------------------------------------
/tests/pygmmis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pmelchior/pygmmis/87ad02dd607896205ccde3ca668971c6dcacd026/tests/pygmmis.png
--------------------------------------------------------------------------------
/tests/test.py:
--------------------------------------------------------------------------------
  1 | #!/bin/env python
  2 | 
  3 | import pygmmis
  4 | import numpy as np
  5 | import matplotlib.pyplot as plt
  6 | import matplotlib.patches as patches
  7 | import matplotlib.lines as lines
  8 | import matplotlib.cm
  9 | import datetime
 10 | from functools import partial
 11 | import logging
 12 | 
 13 | def plotResults(orig, data, gmm, patch=None, description=None, disp=None):
 14 |     fig = plt.figure(figsize=(6,6))
 15 |     ax = fig.add_subplot(111, aspect='equal')
 16 | 
 17 |     # plot inner and outer points
 18 |     ax.plot(orig[:,0], orig[:,1], 'o', mfc='None', mec='r', mew=1)
 19 |     missing = np.isnan(data)
 20 |     if missing.any():
 21 |         data_ = data.copy()
 22 |         data_[missing] = -5 # put at limits of plotting range
 23 |     else:
 24 |         data_ = data
 25 |     ax.plot(data_[:,0], data_[:,1], 's', mfc='b', mec='None')#, mew=1)
 26 | 
 27 |     # prediction
 28 |     B = 100
 29 |     x,y = np.meshgrid(np.linspace(-5,15,B), np.linspace(-5,15,B))
 30 |     coords = np.dstack((x.flatten(), y.flatten()))[0]
 31 | 
 32 |     # compute sum_k(p_k(x)) for all x
 33 |     p = gmm(coords).reshape((B,B))
 34 |     # for better visibility use arcsinh stretch
 35 |     p = np.arcsinh(p/1e-4)
 36 |     cs = ax.contourf(p, 10, extent=(-5,15,-5,15), cmap=plt.cm.Greys)
 37 |     for c in cs.collections:
 38 |         c.set_edgecolor(c.get_facecolor())
 39 | 
 40 |     # plot boundary
 41 |     if patch is not None:
 42 |         import copy
 43 |         if hasattr(patch, '__iter__'):
 44 |             for p in patch:
 45 |                 ax.add_artist(copy.copy(p))
 46 |         else:
 47 |             ax.add_artist(copy.copy(patch))
 48 | 
 49 |     # add description and complete data logL to plot
 50 |     logL = gmm(orig, as_log=True).mean()
 51 |     if description is not None:
 52 |         ax.text(0.05, 0.95, r'%s' % description, ha='left', va='top', transform=ax.transAxes, fontsize=20)
 53 |         ax.text(0.05, 0.89, '$\log{\mathcal{L}} = %.3f$' % logL, ha='left', va='top', transform=ax.transAxes, fontsize=20)
 54 |     else:
 55 |         ax.text(0.05, 0.95, '$\log{\mathcal{L}} = %.3f$' % logL, ha='left', va='top', transform=ax.transAxes, fontsize=20)
 56 | 
 57 |     # show size of error dispersion as Circle
 58 |     if disp is not None:
 59 | 
 60 |         circ1 = patches.Circle((12.5, -2.5), radius=disp, fc='b', ec='None', alpha=0.5)
 61 |         circ2 = patches.Circle((12.5, -2.5), radius=2*disp, fc='b', ec='None', alpha=0.3)
 62 |         circ3 = patches.Circle((12.5, -2.5), radius=3*disp, fc='b', ec='None', alpha=0.1)
 63 |         ax.add_artist(circ1)
 64 |         ax.add_artist(circ2)
 65 |         ax.add_artist(circ3)
 66 |         ax.text(12.5, -2.5, r'$\sigma$', color='w', fontsize=20, ha='center', va='center')
 67 | 
 68 |     ax.set_xlim(-5, 15)
 69 |     ax.set_ylim(-5, 15)
 70 |     ax.set_xticks([])
 71 |     ax.set_yticks([])
 72 |     fig.subplots_adjust(bottom=0.01, top=0.99, left=0.01, right=0.99)
 73 |     fig.show()
 74 | 
 75 | def plotDifferences(orig, data, gmms, avg, l, patch=None):
 76 |     fig = plt.figure(figsize=(6,6))
 77 |     ax = fig.add_subplot(111, aspect='equal')
 78 | 
 79 |     # plot inner and outer points
 80 |     #ax.plot(orig[:,0], orig[:,1], 'o', mfc='None', mec='r', mew=1)
 81 |     ax.plot(data[:,0], data[:,1], 's', mfc='b', mec='None')#, mew=1)
 82 | 
 83 |     # prediction
 84 |     B = 100
 85 |     x,y = np.meshgrid(np.linspace(-5,15,B), np.linspace(-5,15,B))
 86 |     coords = np.dstack((x.flatten(), y.flatten()))[0]
 87 | 
 88 |     # compute sum_k(p_k(x)) for all x
 89 |     pw = avg(coords).reshape((B,B))
 90 | 
 91 |     # use each run and compute weighted std
 92 |     p = np.empty((T,B,B))
 93 |     for r in range(T):
 94 |         # compute sum_k(p_k(x)) for all x
 95 |         p[r,:,:] = gmms[r](coords).reshape((B,B))
 96 | 
 97 |     p = ((p-pw[None,:,:])**2 * l[:,None, None]).sum(axis=0)
 98 |     V1 = l.sum()
 99 |     V2 = (l**2).sum()
100 |     p /= (V1 - V2/V1)
101 | 
102 |     p = np.arcsinh(np.sqrt(p)/1e-4)
103 |     cs = ax.contourf(p, 10, extent=(-5,15,-5,15), cmap=plt.cm.Greys, vmin=np.arcsinh(pw/1e-4).min(), vmax=np.arcsinh(pw/1e-4).max())
104 |     for c in cs.collections:
105 |         c.set_edgecolor(c.get_facecolor())
106 | 
107 |     # plot boundary
108 |     if patch is not None:
109 |         import copy
110 |         if hasattr(patch, '__iter__'):
111 |             for p in patch:
112 |                 ax.add_artist(copy.copy(p))
113 |         else:
114 |             ax.add_artist(copy.copy(patch))
115 | 
116 |     ax.text(0.05, 0.95, 'Dispersion', ha='left', va='top', transform=ax.transAxes, fontsize=20)
117 | 
118 |     ax.set_xlim(-5, 15)
119 |     ax.set_ylim(-5, 15)
120 |     ax.set_xticks([])
121 |     ax.set_yticks([])
122 |     fig.subplots_adjust(bottom=0.01, top=0.99, left=0.01, right=0.99)
123 |     fig.show()
124 | 
125 | def getBox(coords):
126 |     box_limits = np.array([[0,0],[10,10]])
127 |     return (coords[:,0] > box_limits[0,0]) & (coords[:,0] < box_limits[1,0]) & (coords[:,1] > box_limits[0,1]) & (coords[:,1] < box_limits[1,1])
128 | 
129 | def getHole(coords):
130 |     x,y,r = 6.5, 6., 2
131 |     return ((coords[:,0] - x)**2 + (coords[:,1] - y)**2 > r**2)
132 | 
133 | def getBoxWithHole(coords):
134 |     return getBox(coords)*getHole(coords)
135 | 
136 | def getCut(coords):
137 |     return (coords[:,0] < 6)
138 | 
139 | def getAll(coords):
140 |     return np.ones(len(coords))
141 | 
142 | def getHalf(coords, rng=np.random):
143 |     return 0.5 * np.ones(len(coords))
144 | 
145 | def getSelection(type="hole", rng=np.random):
146 |     if type == "hole":
147 |         cb = getHole
148 |         ps = patches.Circle([6.5, 6.], radius=2, fc="none", ec='k', lw=1, ls='dashed')
149 |     if type == "box":
150 |         cb = getBox
151 |         ps = patches.Rectangle([0,0], 10, 10, fc="none", ec='k', lw=1, ls='dashed')
152 |     if type == "boxWithHole":
153 |         cb = getBoxWithHole
154 |         ps = [patches.Circle([6.5, 6.], radius=2, fc="none", ec='k', lw=1, ls='dashed'),
155 |               patches.Rectangle([0,0], 10, 10, fc="none", ec='k', lw=1, ls='dashed')]
156 |     if type == "cut":
157 |         cb = getCut
158 |         ps = lines.Line2D([6, 6],[-5, 15], ls='dotted', lw=1, color='k')
159 |     if type == "all":
160 |         cb = getAll
161 |         ps = None
162 |     return cb, ps
163 | 
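# note: selection callbacks may return boolean masks (hard cuts, e.g. getBox)
# or detection probabilities in [0,1] (soft cuts, e.g. getHalf); the observed
# sample below is drawn via rng.rand(len(noisy)) < omega(noisy) either way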
164 | if __name__ == '__main__':
165 | 
166 |     # set up test
167 |     N = 400             # number of samples
168 |     K = 3               # number of components
169 |     T = 1               # number of runs
170 |     sel_type = "boxWithHole" # type of selection
171 |     disp = 0.5          # additive noise dispersion
172 |     bg_amp = 0.0        # fraction of background samples
173 |     w = 0.1             # minimum covariance regularization [data units]
174 |     cutoff = 5          # cutoff distance between components [sigma]
175 |     seed = 8365         # seed value
176 |     oversampling = 10   # for missing data: imputation samples per observed sample
177 |     # show EM iteration results
178 |     logging.basicConfig(format='%(message)s',level=logging.INFO)
179 | 
180 |     # define RNG for run
181 |     from numpy.random import RandomState
182 |     rng = RandomState(seed)
183 | 
184 |     # draw N points from 3-component GMM
185 |     D = 2
186 |     gmm = pygmmis.GMM(K=3, D=2)
187 |     gmm.amp[:] = np.array([ 0.36060026,  0.27986906,  0.206774])
188 |     gmm.amp /= gmm.amp.sum()
189 |     gmm.mean[:,:] = np.array([[ 0.08016886,  0.21300697],
190 |                               [ 0.70306351,  0.6709532 ],
191 |                               [ 0.01087670,  0.852077]])*10
192 |     gmm.covar[:,:,:] = np.array([[[ 0.08530014, -0.00314178],
193 |                                   [-0.00314178,  0.00541106]],
194 |                                  [[ 0.03053402,  0.0125736],
195 |                                   [ 0.0125736,  0.01075791]],
196 |                                  [[ 0.00258605,  0.00409287],
197 |                                   [ 0.00409287,  0.01065186]]])*100
198 | 
199 |     # data come from pure GMM model or one with background?
200 |     orig = gmm.draw(N, rng=rng)
201 |     if bg_amp == 0:
202 |         orig_bg = orig
203 |         bg = None
204 |     else:
205 |         footprint = np.array([-10,-10]), np.array([20,20])
206 |         bg = pygmmis.Background(footprint)
207 |         bg.amp = bg_amp
208 |         bg.adjust_amp = True
209 | 
210 |         bg_size = int(bg_amp/(1-bg_amp) * N)
211 |         orig_bg = np.concatenate((orig, bg.draw(bg_size, rng=rng)))
212 | 
213 |     # add isotropic errors on data
214 |     noisy = orig_bg + rng.normal(0, scale=disp, size=(len(orig_bg), D))
215 | 
216 |     # get observational selection function
217 |     omega, ps = getSelection(sel_type, rng=rng)
218 | 
219 |     # apply selection
220 |     sel = rng.rand(len(noisy)) < omega(noisy) # len(noisy) >= N if background samples were added
221 |     data = noisy[sel]
222 |     # single covariance for all samples
223 |     covar = disp**2 * np.eye(D)
224 | 
225 |     # plot data vs true model
226 |     plotResults(orig, data, gmm, patch=ps, description="Truth", disp=disp)
227 | 
228 |     # repeated runs: store results and logL
229 |     l = np.empty(T)
230 |     gmms = [pygmmis.GMM(K=K, D=D) for r in range(T)]
231 | 
232 |     # 1) EM without imputation, ignoring errors
233 |     start = datetime.datetime.now()
234 |     rng = RandomState(seed)
235 |     for r in range(T):
236 |         if bg is not None:
237 |             bg.amp = bg_amp
238 |         l[r], _ = pygmmis.fit(gmms[r], data, w=w, cutoff=cutoff, background=bg, rng=rng)
239 |     avg = pygmmis.stack(gmms, l)
240 |     print ("execution time %ds" % (datetime.datetime.now() - start).seconds)
241 |     plotResults(orig, data, avg, patch=ps, description="Standard EM")
242 | 
243 |     # 2) EM without imputation, deconvolving via Extreme Deconvolution
244 |     start = datetime.datetime.now()
245 |     rng = RandomState(seed)
246 |     for r in range(T):
247 |         if bg is not None:
248 |             bg.amp = bg_amp
249 |         l[r], _ = pygmmis.fit(gmms[r], data, covar=covar, w=w, cutoff=cutoff, background=bg, rng=rng)
250 |     avg = pygmmis.stack(gmms, l)
251 |     print ("execution time %ds" % (datetime.datetime.now() - start).seconds)
252 |     plotResults(orig, data, avg, patch=ps, description="Standard EM & noise deconvolution")
253 | 
254 |     # 3) pygmmis with imputation, ignoring errors
255 |     # We need a good initial location to explore the
256 |     # volume that is spanned by the missing part of the data
257 |     # We therefore run a standard GMM without imputation first
258 |     start = datetime.datetime.now()
259 |     rng = RandomState(seed)
260 |     for r in range(T):
261 |         if bg is not None:
262 |             bg.amp = bg_amp
263 |         pygmmis.fit(gmms[r], data, w=w, cutoff=cutoff, background=bg, rng=rng)
264 |         l[r], _ = pygmmis.fit(gmms[r], data, init_method='none', w=w, cutoff=cutoff, sel_callback=omega, oversampling=oversampling, background=bg, rng=rng)
265 |     avg = pygmmis.stack(gmms, l)
266 |     print ("execution time %ds" % (datetime.datetime.now() - start).seconds)
267 |     plotResults(orig, data, avg, patch=ps, description="$\mathtt{GMMis}$")
268 | 
269 |     # 4) pygmmis with imputation, incorporating errors
270 |     covar_cb = partial(pygmmis.covar_callback_default, default=np.eye(D)*disp**2)
271 |     start = datetime.datetime.now()
272 |     rng = RandomState(seed)
273 |     for r in range(T):
274 |         if bg is not None:
275 |             bg.amp = bg_amp
276 |         pygmmis.fit(gmms[r], data, w=w, cutoff=cutoff, background=bg, rng=rng)
277 |         l[r], _ = pygmmis.fit(gmms[r], data, covar=covar, init_method='none', w=w, cutoff=cutoff, sel_callback=omega, oversampling=oversampling, covar_callback=covar_cb, background=bg, rng=rng)
278 |     avg = pygmmis.stack(gmms, l)
279 |     print ("execution time %ds" % (datetime.datetime.now() - start).seconds)
280 |     plotResults(orig, data, avg, patch=ps, description="$\mathtt{GMMis}$ & noise deconvolution")
281 | 
282 |     if T > 1:
283 |         plotDifferences(orig, data, gmms, avg, l, patch=ps)
284 |         #plotCoverage(orig, data, avg, patch=ps, sel_callback=cb)
285 |     """
286 |     # stacked estimator: needs to do init by hand to keep it fixed
287 |     start = datetime.datetime.now()
288 |     rng = RandomState(seed)
289 |     for r in range(R):
290 |         init_cb(gmms[r], data=data, covar=covar, rng=rng)
291 |     kwargs = [dict(covar=covar, init_callback=None, w=w, cutoff=cutoff, sel_callback=cb, covar_callback=covar_cb, background=bg, rng=rng) for i in range(R)]
292 |     stacked = pygmmis.stack_fit(gmms, data, kwargs, L=10, rng=rng)
293 |     print ("execution time %ds" % (datetime.datetime.now() - start).seconds)
294 |     plotResults(orig, data, stacked, patch=ps, description="Stacked")
295 |     """
296 | 
--------------------------------------------------------------------------------
/tests/test_3D.py:
--------------------------------------------------------------------------------
  1 | import pygmmis
  2 | import numpy as np
  3 | import logging
  4 | from functools import partial
  5 | 
  6 | L = 1
  7 | 
  8 | def binSample(coords, C):
  9 |     dl = L*1./C
 10 |     N = len(coords)
 11 |     from sklearn.neighbors import KDTree
 12 |     # chebyshev metric: results in cube selection
 13 |     tree = KDTree(coords, leaf_size=N/100, metric="chebyshev")
 14 |     # sample position: center of cubes of length K
 15 |     skewer = np.arange(C)
 16 |     grid = np.meshgrid(skewer, skewer, skewer, indexing="ij")
 17 |     grid = np.dstack((grid[0].flatten(), grid[1].flatten(), grid[2].flatten()))[0]
 18 |     samples = dl*(grid + 0.5)
 19 | 
 20 |     # get counts in boxes
 21 |     c = tree.query_radius(samples, r=0.5*dl, count_only=True)
 22 |     #counts = np.zeros(K**3)
 23 |     #counts[mask] = c
 24 |     #return counts.reshape(K,K,K)
 25 |     return c.reshape(C,C,C)
 26 | 
 27 | def initCube(gmm, w=0, rng=np.random):
 28 |     #gmm.amp[:] = rng.rand(gmm.K)
 29 |     #gmm.amp /= gmm.amp.sum()
 30 |     global K
 31 |     alpha = K
 32 |     gmm.amp[:] = rng.dirichlet(alpha*np.ones(gmm.K)/K, 1)[0]
 33 |     gmm.mean[:,:] = rng.rand(gmm.K, gmm.D)
 34 |     for k in range(gmm.K):
 35 |         gmm.covar[k] = np.diag((w + rng.rand(gmm.D) / 30)**2)
 36 |     # use random rotations for each component covariance
 37 |     # from http://www.mathworks.com/matlabcentral/newsreader/view_thread/298500
 38 |     # since we don't care about parity flips we don't have to check
 39 |     # the determinant of R (and hence don't need R)
 40 |     for k in range(gmm.K):
 41 |         Q,_ = np.linalg.qr(rng.normal(size=(gmm.D, gmm.D)), mode='complete')
 42 |         gmm.covar[k] = np.dot(Q, np.dot(gmm.covar[k], Q.T))
 43 | 
 44 | def initToFillCube(gmm, omega=0.5, rng=np.random):
 45 |     gmm.amp[:] = 1./gmm.K
 46 |     # set model to random positions with equally sized spheres within
 47 |     # volume spanned by data
 48 |     min_pos = np.zeros(3)
 49 |     max_pos = np.ones(3)
 50 |     gmm.mean[:,:] = min_pos + (max_pos-min_pos)*rng.rand(gmm.K, gmm.D)
 51 |     # K spheres of radius s [having volume s^D * pi^D/2 / gamma(D/2+1)]
 52 |     # should fill fraction omega of cube
 53 |     from scipy.special import gamma
 54 |     vol_data = np.prod(max_pos-min_pos)
 55 |     s = (omega * vol_data / gmm.K * gamma(gmm.D*0.5 + 1))**(1./gmm.D) / np.sqrt(np.pi)
 56 |     gmm.covar[:,:,:] = s**2 * np.eye(gmm.D)
 57 | 
 58 | def drawWithNbh(gmm, size=1, rng=np.random):
 59 |     # draw indices for components given amplitudes, need to make sure: sum=1
 60 |     ind = rng.choice(gmm.K, size=size, p=(gmm.amp/gmm.amp.sum()))
 61 |     samples = np.empty((size, gmm.D))
 62 |     N_k = np.bincount(ind, minlength=gmm.K)
 63 |     nbh = [None for k in range(gmm.K)]
 64 |     counter = 0
 65 |     for k in range(gmm.K):
 66 |         s = N_k[k]
 67 |         samples[counter:counter+s] = rng.multivariate_normal(gmm.mean[k], gmm.covar[k], size=s)
 68 |         nbh[k] = np.arange(counter, counter+s)
 69 |         counter += s
 70 |     return samples, nbh
 71 | 
 72 | from mpl_toolkits.mplot3d import Axes3D
 73 | import matplotlib.pyplot as plt
 74 | 
 75 | def createFigure():
 76 |     fig = plt.figure()
 77 |     ax = plt.axes([0,0,1,1], projection='3d')#, aspect='equal')
 78 |     return fig, ax
 79 | 
 80 | def plotPoints(coords, ax=None, depth_shading=True, **kwargs):
 81 |     if ax is None:
 82 |         fig, ax = createFigure()
 83 | 
 84 |     #if ecolor != 'None':
 85 |     #    lw = 0.25
 86 |     sc = ax.scatter(coords[:,0], coords[:,1], coords[:,2], **kwargs)
 87 |     # get rid of pesky depth shading in absence of depthshade=False option
 88 |     if depth_shading is False:
 89 |         sc.set_edgecolors = sc.set_facecolors = lambda *args:None
 90 |     plt.show()
 91 |     return ax
 92 | 
 93 | def slopeSel(coords, rng=np.random):
 94 |     return rng.rand(len(coords)) > coords[:,0]
 95 | 
 96 | def noSel(coords, rng=np.random):
 97 |     return np.ones(len(coords), dtype="bool")
 98 | 
 99 | def insideComponent(k, gmm, coords, covar=None, cutoff=5.):
100 |     if gmm.amp[k]*K > 0.01:
101 |         return gmm.logL_k(k, coords, covar=covar, chi2_only=True) < cutoff
102 |     else:
103 |         return np.zeros(len(coords), dtype='bool')
104 | 
105 | def GMMSel(coords, gmm, covar=None, sel_gmm=None, cutoff_nd=3., rng=np.random):
106 |     # selection based on sel_gmm: a sample is kept if its chi^2 with respect
107 |     # to at least one component of sel_gmm is below cutoff_nd
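    # usage sketch (fitted_gmm stands for any fitted GMM; cf. the __main__ block below):
    #   cutoff_nd = pygmmis.chi2_cutoff(3, cutoff=1)
    #   keep = GMMSel(coords, gmm=None, sel_gmm=fitted_gmm, cutoff_nd=cutoff_nd)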
108 |     import multiprocessing, parmap
109 |     n_chunks, chunksize = sel_gmm._mp_chunksize()
110 |     inside = np.array(parmap.map(insideComponent, range(sel_gmm.K), sel_gmm, coords, covar, cutoff_nd, pm_chunksize=chunksize))
111 |     return np.max(inside, axis=0)
112 | 
113 | def max_posterior(gmm, U, coords, covar=None):
114 |     import multiprocessing, parmap
115 |     pool = multiprocessing.Pool()
116 |     n_chunks, chunksize = gmm._mp_chunksize()
117 |     log_p = [[] for k in range(gmm.K)]
118 |     log_S = np.zeros(len(coords))
119 |     H = np.zeros(len(coords), dtype="bool")
120 |     k = 0
121 |     for log_p[k], U[k], _ in \
122 |         parmap.starmap(pygmmis._Estep, zip(range(gmm.K), U), gmm, coords, covar, None, pm_pool=pool, pm_chunksize=chunksize):
123 |         log_S[U[k]] += np.exp(log_p[k]) # actually S, not logS
124 |         H[U[k]] = 1
125 |         k += 1
126 |     log_S[H] = np.log(log_S[H])
127 | 
128 |     max_q = np.zeros(len(coords))
129 |     max_k = np.zeros(len(coords), dtype='uint32')
130 |     for k in range(gmm.K):
131 |         q_k = np.exp(log_p[k] - log_S[U[k]])
132 |         max_k[U[k]] = np.where(max_q[U[k]] < q_k, k, max_k[U[k]])
133 |         max_q[U[k]] = np.maximum(max_q[U[k]], q_k)
134 |     return max_k
135 | 
136 | # from http://stackoverflow.com/questions/36740887/how-can-a-python-context-manager-try-to-execute-code
137 | def try_forever(f):
138 |     def decorated(*args, **kwargs):
139 |         while True:
140 |             try:
141 |                 return f(*args, **kwargs)
142 |             except:
143 |                 pass
144 |     return decorated
145 | 
146 | if __name__ == "__main__":
147 |     N = 10000
148 |     K = 50
149 |     D = 3
150 |     C = 50
151 |     w = 0.001
152 |     inner_cutoff = 1
153 | 
154 |     seed = 42 #np.random.randint(1, 10000)
155 |     from numpy.random import RandomState
156 |     rng = RandomState(seed)
157 |     logging.basicConfig(format='%(message)s',level=logging.INFO)
158 | 
159 |     # define selection and create Omega in cube:
160 |     # expensive, only do once
161 |     sel_callback = partial(slopeSel, rng=rng)
162 |     """
163 |     random = rng.rand(N*100, D)
164 |     sel = sel_callback(random)
165 |     omega_cube = binSample(random[sel], C).astype('float') / binSample(random, C)
166 |     del random
167 |     """
168 |     omega_cube = np.ones((C,C,C))
169 |     for c in range(C):
170 |         omega_cube[c,:,:] *= 1 - (c+0.5)/C
171 | 
172 |     count_cube = np.zeros((C,C,C))
173 |     count__cube = np.zeros((C,C,C))
174 |     count0_cube = np.zeros((C,C,C))
175 | 
176 |     R = 10
177 |     amp0 = np.empty(R*K)
178 |     frac = np.empty(R*K)
179 |     Omega = np.empty(R*K)
180 |     assoc_frac = np.empty(R*K)
181 |     posterior = np.empty(R*K)
182 | 
183 |     cutoff_nd = pygmmis.chi2_cutoff(D, cutoff=inner_cutoff)
184 |     counter = 0
185 |     for r in range(R):
186 |         print ("start")
187 |         # create original sample from GMM
188 |         gmm0 = pygmmis.GMM(K=K, D=D)
189 |         initCube(gmm0, w=w*10, rng=rng) # use larger size floor than in fit
190 |         data0, nbh0 = drawWithNbh(gmm0, N, rng=rng)
191 | 
192 |         # apply selection
193 |         sel0 = sel_callback(data0)
194 | 
195 |         # how often is each component used
196 |         comp0 = np.empty(len(data0), dtype='uint32')
197 |         for k in range(gmm0.K):
198 |             comp0[nbh0[k]] = k
199 |         count0 = np.bincount(comp0, minlength=gmm0.K)
200 | 
201 |         # compute effective Omega
202 |         comp = comp0[sel0]
203 |         count = np.bincount(comp, minlength=gmm0.K)
204 | 
205 |         frac__ = count.astype('float') / count.sum()
206 |         Omega__ = count.astype('float') / count0
207 | 
208 |         # restrict to "safe" components
209 |         safe = frac__ > 1./1 * 1./K
210 |         if safe.sum() < gmm0.K:
211 |             print ("reset to safe components")
212 |             gmm0.amp = gmm0.amp[safe]
213 |             gmm0.amp /= gmm0.amp.sum()
214 |             gmm0.mean = gmm0.mean[safe]
215 |             gmm0.covar = gmm0.covar[safe]
216 | 
217 |             # redraw data0 and sel0
218 |             data0, nbh0 = drawWithNbh(gmm0, N, rng=rng)
219 |             sel0 = sel_callback(data0)
220 | 
221 |             # recompute effective Omega and frac
222 |             # how often is each component used
223 |             comp0 = np.empty(len(data0), dtype='uint32')
224 |             for k in range(gmm0.K):
225 |                 comp0[nbh0[k]] = k
226 |             count0 = np.bincount(comp0, minlength=gmm0.K)
227 |             comp = comp0[sel0]
228 |             count = np.bincount(comp, minlength=gmm0.K)
229 | 
230 |             frac__ = count.astype('float') / count.sum()
231 |             Omega__ = count.astype('float') / count0
232 | 
233 |         frac[counter:counter+gmm0.K] = frac__
234 |         Omega[counter:counter+gmm0.K] = Omega__
235 |         amp0[counter:counter+gmm0.K] = gmm0.amp
236 |         count0_cube += binSample(data0, C)
237 | 
238 |         # which K: K0 or K/N = const?
239 |         K_ = gmm0.K #int(K*omega_cube.mean())
240 | 
241 |         # fit model after selection
242 |         data = data0[sel0]
243 | 
244 |         split_n_merge = K_/3 # 0
245 |         gmm = pygmmis.GMM(K=K_, D=3)
246 |         logL, U = pygmmis.fit(gmm, data, init_method='minmax', w=w, cutoff=5, split_n_merge=split_n_merge, rng=rng)
247 |         sample = gmm.draw(N, rng=rng)
248 |         count_cube += binSample(sample, C)
249 | 
250 |         fit_forever = try_forever(pygmmis.fit)
251 |         gmm_ = pygmmis.GMM(K=K_, D=3)
252 |         #fit_forever(gmm_, data, sel_callback=sel_callback, init_callback=init_cb, w=w, cutoff=5, split_n_merge=split_n_merge, rng=rng)
253 |         gmm_.amp[:] = gmm.amp[:]
254 |         gmm_.mean[:,:] = gmm.mean[:,:]
255 |         gmm_.covar[:,:,:] = 2*gmm.covar[:,:,:]
256 |         logL_, U_ = fit_forever(gmm_, data, sel_callback=sel_callback, init_method='none', w=w, cutoff=5, split_n_merge=split_n_merge, rng=rng)
257 |         sample_ = gmm_.draw(N, rng=rng)
258 |         """
259 |         gmm_ = gmm
260 |         logL_, U_ = logL, U
261 |         sample_ = sample
262 |         """
263 | 
264 |         count__cube += binSample(sample_, C)
265 | 
266 |         # find density threshold to be associated with any fit GMM component:
267 |         # below a threshold, the EM algorithm won't bother to put a component.
268 |         # under selection, that threshold applies to the observed sample.
269 |         #
270 |         # 1) compute fraction of observed points for each component of gmm0
271 |         for k in range(K_):
272 |             # select data that is within cutoff of any component of sel_gmm
273 |             sel__ = GMMSel(data0[nbh0[k]], gmm=None, sel_gmm=gmm_, cutoff_nd=cutoff_nd, rng=rng)
274 |             assoc_frac[k + counter] = sel__.sum() * 1./nbh0[k].size
275 | 
276 |         """
277 |         # 2) test which components have majority of points associated with
278 |         # any fit component
279 |         max_k = max_posterior(gmm, U, data0)
280 |         for k in range(K_):
281 |             posterior[k + counter] = np.bincount(max_k[comp0 == k]).max() * 1./ (comp0 == k).sum()
282 |         """
283 | 
284 |         counter += gmm0.K
285 | 
286 |     # plot average cell density as function of cell omega:
287 |     # biased estimate will avoid low-omega region and (over)compensate in
288 |     # high-omega regions
289 |     B = 10
290 |     bins = np.linspace(0,1,B+1)
291 | 
292 |     mean_rho0 = np.empty(B)
293 |     mean_rho = np.empty(B)
294 |     mean_rho_ = np.empty(B)
295 |     mean_omega = np.empty(B)
296 |     std_rho0 = np.empty(B)
297 |     std_rho = np.empty(B)
298 |     std_rho_ = np.empty(B)
299 |     std_omega = np.empty(B)
300 |     for i in range(B):
301 |         mask = (omega_cube > bins[i]) & (omega_cube <= bins[i+1])
302 |         sqrtN = np.sqrt(mask.sum())
303 |         mean_omega[i] = omega_cube[mask].mean()
304 |         std_omega[i] = omega_cube[mask].std()
305 |         mean_rho0[i] = count0_cube[mask].mean()
306 |         std_rho0[i] = count0_cube[mask].std() / sqrtN
307 |         mean_rho[i] = count_cube[mask].mean()
308 |         std_rho[i] = count_cube[mask].std() / sqrtN
309 |         mean_rho_[i] = count__cube[mask].mean()
310 |         std_rho_[i] = count__cube[mask].std() / sqrtN
311 | 
312 |     """
313 |     fig = plt.figure()
314 |     ax = fig.add_subplot(111)
315 |     ax.plot(bins, np.zeros_like(bins), ls='--', c='#888888')
316 |     ax.plot([0,1], [-1,1], ls='--', c='#888888')
317 |     angle = 36
318 |     ax.text(0.30, -1+0.47, 'uncorrected $\Omega$', color='#888888', ha='center', va='center', rotation=angle)
319 |     ax.text(0.97, -0.05, 'perfect correction', color='#888888', ha='right', va='top')
320 |     ax.errorbar(mean_omega, (mean_rho - mean_rho0)/mean_rho0, yerr=np.sqrt(std_rho**2 + std_rho0**2)/mean_rho0, fmt='b-', marker='s', label='Standard EM')
321 |     ax.errorbar(mean_omega, (mean_rho_ - mean_rho0)/mean_rho0, yerr=np.sqrt(std_rho_**2 + std_rho0**2)/mean_rho0, fmt='r-', marker='o', label='$\mathtt{GMMis}$')
322 |     ax.set_ylabel(r'$(\tilde{\rho} - \rho)/\rho$')
323 |     ax.set_xlabel('$\Omega$')
324 |     fig.subplots_adjust(bottom=0.12, right=0.97)
325 |     ax.set_xlim(0,1)
326 |     ax.set_ylim(-1,1)
327 |     leg = ax.legend(loc='upper left', frameon=False, numpoints=1)
328 |     fig.show()
329 | 
330 |     # plot associated fraction vs observed amplitude
331 |     import scipy.stats
332 |     cdf_1d = scipy.stats.norm.cdf(inner_cutoff)
333 |     confidence_1d = 1-(1-cdf_1d)*2
334 | 
335 |     fig = plt.figure()
336 |     ax = fig.add_subplot(111)
337 |     sc = ax.scatter(frac[:counter], assoc_frac[:counter], c=Omega[:counter], s=100*amp0[:counter]/amp0[:counter].mean(), marker='o', rasterized=True, cmap='RdYlBu')
338 |     xl = [-0.005, frac[:counter].max()*1.1]
339 |     yl = [0,1.0]
340 |     ax.plot(xl, [confidence_1d, confidence_1d], c='#888888', ls='--', lw=1)
341 |     ax.text(xl[1]*0.97, 0.68*0.97, '$1\sigma$ region', color='#888888', ha='right', va='top')
342 |     ax.plot([1./gmm0.K, 1./gmm0.K], yl, c='#888888', ls=':', lw=1)
343 |     ax.text(1./gmm0.K + (xl[1]-xl[0])*0.03, yl[0] + 0.03, '$1/K$', color='#888888', ha='left', va='bottom', rotation=90)
344 |     ax.set_xlim(xl)
345 |     ax.set_ylim(yl)
346 |     ax.set_xlabel('$N^o_k / N^o$')
347 |     ax.set_ylabel('$\eta_k$')
348 |     from mpl_toolkits.axes_grid1 import make_axes_locatable
349 |     divider = make_axes_locatable(ax)
350 |     cax = divider.append_axes("right", size="3%", pad=0.0)
351 |     cb = plt.colorbar(sc, cax=cax)
352 |     ticks = np.linspace(0, 1, 6)
353 |     cb.set_ticks(ticks)
354 |     cb.set_label('$\Omega_k$')
355 |     fig.subplots_adjust(bottom=0.13, right=0.90)
356 |     fig.show()
357 | 
358 | 
359 |     cmap = matplotlib.cm.get_cmap('RdYlBu')
360 |     color = np.array([cmap(20),cmap(255)])[sel0.astype('int')]
361 |     #ecolor = np.array(['r','b'])[sel0.astype('int')]
362 |     ax = plotPoints(data0, s=4, c=color, lw=0, rasterized=True, depth_shading=False)
363 |     ax.set_xlim3d(0,1)
364 |     ax.set_ylim3d(0,1)
365 |     ax.set_zlim3d(0,1)
366 | 
367 |     ax = plotPoints(sample_, s=1, alpha=0.5)
368 |     for k in range(gmm0.K):
369 |         ax.text(gmm_.mean[k,0]+0.03, gmm_.mean[k,1]+0.03, gmm_.mean[k,2]+0.03, "%d" % k, color='r', zorder=1000)
370 |     plotPoints(gmm0.mean, c='g', s=400, ax=ax, alpha=0.5, zorder=100)
371 |     plotPoints(gmm_.mean, c='r', s=400, ax=ax, alpha=0.5, zorder=100)
372 |     ax.set_xlim3d(0,1)
373 |     ax.set_ylim3d(0,1)
374 |     ax.set_zlim3d(0,1)
375 |     """
--------------------------------------------------------------------------------