├── TwoNN
    ├── __init__.py
    └── twonn_dimension.py
├── README.md
└── LICENSE


/TwoNN/__init__.py:
--------------------------------------------------------------------------------
from TwoNN.twonn_dimension import twonn_dimension


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# TWO-NN

A naive Python 3 implementation of the TWO-NN algorithm for intrinsic dimension inference.

Dependencies
---
* Python >= 3.6
* NumPy >= 1.17

Usage
---
```python
import numpy as np
from TwoNN import twonn_dimension

# mock dataset: 1000 samples with 500 features
data = np.random.uniform(0, 1, size=(1000, 500))

# estimate the intrinsic dimension d
d = twonn_dimension(data)
```

References
---
E. Facco, M. d’Errico, A. Rodriguez & A. Laio, Estimating the intrinsic dimension of datasets by a minimal neighborhood information, *Scientific Reports*, 2017

(https://doi.org/10.1038/s41598-017-11873-y)


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 fmottes

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


--------------------------------------------------------------------------------
/TwoNN/twonn_dimension.py:
--------------------------------------------------------------------------------
#Author: Francesco Mottes
#Date  : 15-Oct-2019
#-----------------------------


import numpy as np


def twonn_dimension(data, return_xy=False):
    """
    Calculates intrinsic dimension of the provided data points with the TWO-NN algorithm.

    -----------
    Parameters:

    data : 2d array-like
        2d data matrix. Samples on rows and features on columns.

    return_xy : bool (default=False)
        Whether to also return the coordinate vectors used for the linear fit.

    -----------
    Returns:

    d : float
        Intrinsic dimension of the dataset according to TWO-NN.

    x : 1d array (optional)
        Array with the log(mu) values.

    y : 1d array (optional)
        Array with the -log(1 - F(mu_{sigma(i)})) values.

    -----------
    References:

    [1] E. Facco, M. d’Errico, A. Rodriguez & A. Laio
        Estimating the intrinsic dimension of datasets by a minimal neighborhood information (https://doi.org/10.1038/s41598-017-11873-y)


    """


    data = np.array(data)

    N = len(data)

    #mu = r2/r1 for each data point
    mu = []
    for i,x in enumerate(data):

        #sorted Euclidean distances from point i to every point in the dataset
        dist = np.sort(np.sqrt(np.sum((x-data)**2, axis=1)))
        #two smallest non-zero distances (skips the point itself and exact duplicates)
        r1, r2 = dist[dist>0][:2]

        mu.append((i+1,r2/r1))


    #permutation function: rank (1..N, ascending mu) -> original 1-based point index
    sigma_i = dict(zip(range(1,len(mu)+1), np.array(sorted(mu, key=lambda x: x[1]))[:,0].astype(int)))

    mu = dict(mu)

    #empirical cdf F(mu_{sigma(i)}) = i/N
    F_i = {}
    for i in mu:
        F_i[sigma_i[i]] = i/N

    #fitting coordinates
    x = np.log([mu[i] for i in sorted(mu.keys())])
    y = np.array([1-F_i[i] for i in sorted(mu.keys())])

    #avoid having log(0): drop the point with the largest mu, where 1-F = 0
    x = x[y>0]
    y = y[y>0]

    y = -1*np.log(y)

    #fit line through origin (y = d*x, no intercept) to get the dimension
    d = np.linalg.lstsq(x[:, np.newaxis], y, rcond=None)[0][0]

    if return_xy:
        return d, x, y
    else:
        return d
--------------------------------------------------------------------------------
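
A quick sanity check is to run the estimator on data whose intrinsic dimension is known. The sketch below is not part of the TwoNN package: `twonn_sketch` is a hypothetical, self-contained re-implementation of the same steps (ratio of the two nearest-neighbor distances, empirical CDF, fit through the origin), written only so the example runs without installing anything but NumPy. It assumes the dataset has no duplicate points, which `twonn_dimension` handles by discarding zero distances.

```python
import numpy as np


def twonn_sketch(data):
    """Hypothetical stand-in for twonn_dimension; assumes no duplicate points."""
    data = np.asarray(data, dtype=float)
    n = len(data)

    # mu_i = r2 / r1: ratio of second to first nearest-neighbor distance
    mu = np.empty(n)
    for i, point in enumerate(data):
        dist = np.sqrt(np.sum((data - point) ** 2, axis=1))
        dist[i] = np.inf                    # exclude the point itself
        r1, r2 = np.partition(dist, 1)[:2]  # two smallest distances, in order
        mu[i] = r2 / r1

    # empirical CDF F = i/N evaluated at the sorted mu values
    mu_sorted = np.sort(mu)
    cdf = np.arange(1, n + 1) / n

    # drop the last point, where 1 - F = 0 and the logarithm diverges
    x = np.log(mu_sorted[:-1])
    y = -np.log(1.0 - cdf[:-1])

    # least-squares slope of a line through the origin: d = (x.y) / (x.x)
    return float(np.dot(x, y) / np.dot(x, x))


# mock dataset: a 2-dimensional plane linearly embedded in 10 dimensions
rng = np.random.default_rng(0)
plane = rng.uniform(size=(500, 2))
data = plane @ rng.normal(size=(2, 10))

d_est = twonn_sketch(data)
print(d_est)  # should be close to 2, not 10
```

The estimate should track the dimension of the embedded plane and not the number of features. With the package installed, replacing `twonn_sketch(data)` with `twonn_dimension(data)` should give a comparable value on this dataset.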