├── .gitignore
├── LICENSE
├── README.md
├── density_forest.py
├── df_help.py
├── grid.py
├── node.py
├── result_plots
│   ├── combined_trees.png
│   ├── density_comp.png
│   ├── density_comp_kde.png
│   ├── density_estimation.png
│   ├── evol.png
│   └── lcurve.gif
└── tree.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.DS_Store
__pycache__/
old/
*.pyc

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017 ksanjeevan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Density Estimation Forests in Python
## Kernel Density Estimation in Random Forests

### Usage

Running `--help`:
```bash
usage: density_forest.py [-h] [-l LEAF] [-d DATA] [-g GRANULARITY]

randomforest-density: density estimation using random forests.

optional arguments:
  -h, --help            show this help message and exit
  -l LEAF, --leaf LEAF  Choose what leaf estimator to use (Gaussian ['gauss']
                        or KDE ['kde'])
  -d DATA, --data DATA  Path to data (.npy file, shape (sample_size, 2)).
  -g GRANULARITY, --granularity GRANULARITY
                        Number of divisions for the grid
```

Run the demo (it will produce all the plots seen below):

```bash
python3 density_forest.py -l kde
```

Run on your own data:

```bash
python3 density_forest.py -d data_test.npy -l gauss
```
---
### Introduction

_In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. The unobservable density function is thought of as the density according to which a large population is distributed; the data are usually thought of as a random sample from that population._

_Random forests are an ensemble learning method that operate by constructing a multitude of decision trees and combining their results for predictions.
Random decision forests correct for decision trees' habit of overfitting to their training set._

In this project, a random-forest method for density estimation is implemented in Python. What follows is a presentation of the main steps, results, tests, and comparisons.

### Random Forest Implementation

In this implementation, axis-aligned split functions (known as [stumps](https://en.wikipedia.org/wiki/Decision_stump)) are used to build binary trees by optimizing the [entropy gain](https://en.wikipedia.org/wiki/Differential_entropy) at each node. The key parameters to select for this method are the tree depth / entropy-gain threshold, the forest size, and the amount of randomness.

The optimal depth of a tree is case dependent. For that reason, we first train a small set of trees at a fixed depth (the _tune\_entropy\_threshold_ method, parameters _n_ and _depth_). Unlike forest size, where an increase never yields worse results, a lax stop condition leads to [overfitting](https://en.wikipedia.org/wiki/Overfitting). The entropy gain is strictly decreasing with depth, as can be seen in the animation below:
<p align="center"><img src="result_plots/lcurve.gif"/></p>
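To make the split search concrete, here is a minimal sketch of scoring one candidate stump. It mirrors `gauss_entropy_func` (`density_forest.py`) and the gain computation in `tree.py`; the toy `data` sample and the helper names are made up for illustration:

```python
import numpy as np

def gauss_entropy(S):
    # Gaussian differential entropy up to constant terms: log det(cov).
    return np.log(np.linalg.det(np.cov(S, rowvar=False)) + 1e-8)

def split_gain(S, value, axis):
    # Entropy gain of the axis-aligned stump "x[axis] > value".
    left, right = S[S[:, axis] <= value], S[S[:, axis] > value]
    if len(left) < 3 or len(right) < 3:
        return -np.inf  # too few points on a side to fit a Gaussian
    children = (len(left) * gauss_entropy(left) + len(right) * gauss_entropy(right)) / len(S)
    return gauss_entropy(S) - children

data = np.random.multivariate_normal([0, 0], np.eye(2), 200)  # toy sample
print(split_gain(data, value=0.0, axis=0))
```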
Optimizing the entropy-gain threshold is an [ill-posed regularization problem](https://en.wikipedia.org/wiki/Tikhonov_regularization). It is handled in this implementation by finding the elbow point of the 'maximum depth' curve (the point furthest from the line connecting the function's extremes), averaged over the _n_ trees, as we can see here:
<p align="center"><img src="result_plots/evol.png"/></p>
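The elbow search itself is the `opt_L_curve` helper in `df_help.py`. A small usage sketch, assuming the repo root is on `PYTHONPATH` (the gain values below are invented for illustration):

```python
import numpy as np
from df_help import opt_L_curve  # assumes the repo root is importable

depths = np.arange(1, 8)
gains = np.array([12.0, 7.5, 4.8, 3.9, 3.5, 3.3, 3.2])  # made-up, decreasing

# Returns the gain value at the point furthest from the chord joining the
# curve's endpoints -- used as the entropy-gain stop threshold.
threshold = opt_L_curve(depths, gains)
print(threshold)
```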
This step is expensive, since the depth is fixed with no a priori indication of where the optimal threshold is, and the number of leaves that need to be fitted grows exponentially. A better approach would be to implement an online L-curve method (such as the ones discussed [here](https://www1.icsi.berkeley.edu/~barath/papers/kneedle-simplex11.pdf)) as a first pass to avoid initial over-splitting (pending).

From _[Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/decisionForests_MSR_TR_2011_114.pdf)_:

_A key aspect of decision forests is the fact that its component trees are all randomly different from one another. This leads to de-correlation between the individual tree predictions and, in turn, to improved generalization. Forest randomness also helps achieve high robustness with respect to noisy data. Randomness is injected into the trees during the training phase. Two of the most popular ways of doing so are:_

* _[Random training data set sampling (e.g. bagging)](https://en.wikipedia.org/wiki/Bootstrap_aggregating)_
* _[Randomized node optimization](https://en.wikipedia.org/wiki/Random_subspace_method)_

_These two techniques are not mutually exclusive and could be used together._

The method is tested by sampling from a mixture of Gaussians. Randomness is introduced by randomizing the node optimization with the parameter _rho_: the fraction of the parameter search space available at each node split is proportional to it. With _rho_ = 50% and 5 trees, we see the first results below:
<p align="center"><img src="result_plots/density_estimation.png"/></p>
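A minimal sketch of that _rho_ randomization, mirroring `Tree._get_search_space` in `tree.py` (candidate splits are (grid index, axis) pairs, of which only a random _rho_-fraction gets scored; the function name here is hypothetical):

```python
import numpy as np

def sample_search_space(x_range, y_range, rho):
    # All candidate (grid index, axis) pairs inside the node's domain...
    candidates = np.array([(i, 0) for i in x_range] + [(j, 1) for j in y_range])
    # ...of which only a random rho-fraction is evaluated for entropy gain.
    size = len(candidates)
    keep = np.random.choice(size, size=int(size * rho), replace=False)
    return candidates[keep]

print(sample_search_space(range(0, 10), range(0, 10), rho=0.5))
```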
Performance is harder to measure when using random forests for density estimation (as opposed to regression or classification), since this is an unsupervised setting. Here, the [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) (JSD) is used as a comparison measure whenever test data from a known distribution is available.
<p align="center"><img src="result_plots/density_comp.png"/></p>
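For reference, a hedged sketch of the divergence between two discretized, normalized grids (`compute_JSD` in `df_help.py` evaluates the same two KL terms, but without the conventional 1/2 factor, which is fine for relative comparisons):

```python
import numpy as np

def jsd(P, Q):
    # Jensen-Shannon divergence between two non-negative grids,
    # normalized here so that each sums to 1.
    P, Q = P / P.sum(), Q / Q.sum()
    M = 0.5 * (P + Q)
    kl = lambda A, B: np.sum(A[A > 0] * np.log(A[A > 0] / B[A > 0]))
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

P = np.random.rand(50, 50)  # stand-ins for the true and estimated grids
Q = np.random.rand(50, 50)
print(jsd(P, Q))
```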
### Leaf prediction using KDE

One of the main problems of [Kernel Density Estimation](https://web.as.uky.edu/statistics/users/pbreheny/621/F10/notes/10-28.pdf) is the choice of bandwidth. Many of the approaches for finding it rely on assumptions about the underlying distribution and perform poorly on clustered, real-world data (although there are methods that effectively incorporate an [adaptive bandwidth](https://indico.cern.ch/event/548789/contributions/2258640/attachments/1327011/1992522/kde.pdf)).

The module can work with any implementation of the _Node_ class. In the first examples above, the _NodeGauss_ class is used, fitting a Gaussian distribution at each leaf. Below are the results of using _NodeKDE_, where the compactness measure is still based on the Gaussian differential entropy, but the leaf prediction is the result of the KDE method. Because the splits are chosen to optimize the fit of a Gaussian, many of the multivariate bandwidth problems of KDE are avoided, and Silverman's rule for bandwidth selection can be used with good results:
<p align="center"><img src="result_plots/density_comp_kde.png"/></p>
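A minimal sketch of that rule-of-thumb bandwidth and the resulting leaf estimate, mirroring `h_rot` and `NodeKDE.leaf_output` in `node.py` (the leaf sample and helper names below are made up):

```python
import numpy as np

def silverman_H(S):
    # Rule-of-thumb diagonal bandwidth matrix: n^(-2/(d+4)) * per-axis variance.
    n, d = S.shape
    return np.diag(n ** (-2.0 / (d + 4)) * np.var(S, axis=0) + 1e-8)

def kde_at(x, S, H):
    # Gaussian-kernel density estimate at point x (2D case).
    H_inv = np.linalg.inv(H)
    u = x - S
    quad = np.sum(u @ H_inv * u, axis=1)  # u^T H^-1 u for each sample
    return np.sqrt(np.linalg.det(H_inv)) * np.exp(-0.5 * quad).sum() / (2 * np.pi * len(S))

S = np.random.multivariate_normal([0, 0], np.eye(2), 300)  # toy leaf sample
print(kde_at(np.zeros(2), S, silverman_H(S)))
```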
Although it produces a much better overall JSD, it's worth noting that the top-right 'bump' overfits the noise more. This is expected: the underlying distribution of our test data is a mixture of Gaussians, and if a leaf completely encompasses a bump (as can be seen in the forest representation below), then fitting a Gaussian will outperform any non-parametric technique.
<p align="center"><img src="result_plots/combined_trees.png"/></p>
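That intuition is easy to check in isolation. A toy comparison of the two leaf models on a single Gaussian bump, scored by held-out average log-likelihood (requires SciPy; all data here is synthetic, and the parametric fit typically wins):

```python
import numpy as np
from scipy.stats import multivariate_normal  # extra dependency, for the sketch only

rng = np.random.default_rng(0)
mu, cov = [0, 0], [[2.0, 0.5], [0.5, 1.0]]
train = rng.multivariate_normal(mu, cov, 400)
test = rng.multivariate_normal(mu, cov, 400)

# Gaussian leaf: fit mean and covariance once.
gauss = multivariate_normal(train.mean(axis=0), np.cov(train, rowvar=False))

# KDE leaf: Silverman-style diagonal bandwidth, Gaussian kernel.
n, d = train.shape
H = np.diag(n ** (-2.0 / (d + 4)) * np.var(train, axis=0))
kern = multivariate_normal(np.zeros(d), H)
kde_ll = np.log(kern.pdf(test[:, None, :] - train[None, :, :]).mean(axis=1))

print('gauss leaf avg log-lik:', gauss.logpdf(test).mean())
print('kde leaf avg log-lik  :', kde_ll.mean())
```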
#### To do

* Try other entropy-gain functions / compactness measures.
* Use an online L-curve method for entropy threshold optimization.
* Profile and address other bottlenecks.
* Refactor to reuse the framework for classification and regression.
* Fit unknown distributions / evaluate performance.
* Use EMD as a comparison metric.

--------------------------------------------------------------------------------
/density_forest.py:
--------------------------------------------------------------------------------


import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os

from df_help import *

from grid import Grid
from tree import Tree
from node import NodeGauss, NodeKDE
import argparse

from pylab import *


argparser = argparse.ArgumentParser(
    description='randomforest-density: density estimation using random forests.')


argparser.add_argument(
    '-l',
    '--leaf',
    help='Choose what leaf estimator to use'
    ' (Gaussian [\'gauss\'] or KDE [\'kde\'])',
    default='gauss')

argparser.add_argument(
    '-d',
    '--data',
    help='Path to data (.npy file, shape (sample_size, 2)).',
    default='')

argparser.add_argument(
    '-g',
    '--granularity',
    help='Number of divisions for the grid',
    type=int,  # parse as int so the Grid arithmetic works
    default=100)


args = argparser.parse_args()
assert args.leaf in ['kde', 'gauss'], "Pass valid leaf estimator"


LEAF_DICT = {
    'kde': NodeKDE,
    'gauss': NodeGauss
}

LEAF_TYPE = args.leaf
MODE = 'est' if args.data else 'demo'
DATA_PATH = args.data
DIVS = args.granularity

def gauss_entropy_func(S):
    """
    Gaussian differential entropy (ignoring constant terms since
    we're interested in the delta).
    """
    return np.log(np.linalg.det(np.cov(S, rowvar=False)) + 1e-8)


class DensityForest:
    """
    Class for the density forest: computes the entropy threshold for the stop
    condition, then builds and trains a forest of trees with randomness rho.
70 | """ 71 | 72 | def __init__(self, data, grid_obj, f_size, rho=1.): 73 | 74 | self.data = data 75 | self.f_size = f_size 76 | 77 | self.node_class = NodeGauss 78 | self.entropy_func = gauss_entropy_func 79 | 80 | self.grid_obj = grid_obj 81 | 82 | self.grid = self.grid_obj.axis 83 | 84 | self.rho = rho 85 | 86 | 87 | def train(self): 88 | 89 | self.opt_entropy = self.tune_entropy_threshold(plot_debug=True) 90 | self.forest = self.build_forest() 91 | 92 | 93 | def estimate(self): 94 | 95 | self.dist = self.compute_density() 96 | return self.dist 97 | 98 | 99 | def compute_density(self): 100 | 101 | 102 | 103 | dist = [] 104 | for j, y in enumerate(self.grid[1]): 105 | dist.append([]) 106 | for i, x in enumerate(self.grid[0]): 107 | dist[j].append(self.forest_output(np.array([x, y]))) 108 | return dist 109 | 110 | 111 | def plot_density(self, fname='density_estimation.png'): 112 | 113 | X = self.grid[0] 114 | Y = self.grid[1] 115 | Z = self.dist 116 | 117 | fig = plt.figure(figsize=(12, 12)) 118 | ax = fig.add_subplot(111) 119 | 120 | 121 | vmin=np.min(Z) 122 | vmax=np.max(Z) 123 | var = plt.pcolormesh(np.array(X),np.array(Y),np.array(Z), cmap=cm.Blues, vmin=vmin, vmax=vmax) 124 | plt.colorbar(var, ticks=np.arange(vmin, vmax, (vmax-vmin)/8)) 125 | 126 | ax = plt.gca() 127 | 128 | gris = 200.0 129 | ax.set_facecolor((gris/255, gris/255, gris/255)) 130 | 131 | #ax.scatter(*zip(*self.data), alpha=.5, c='k', s=10., lw=0) 132 | 133 | plt.xlim(np.min(X), np.max(X)) 134 | plt.ylim(np.min(Y), np.max(Y)) 135 | plt.grid() 136 | 137 | ax.set_title('rho = %s, |T| = %d, max_entropy = %.2f'%(self.rho, self.f_size, self.opt_entropy)) 138 | 139 | fig.savefig(fname, format='png') 140 | plt.close() 141 | 142 | 143 | 144 | def forest_output(self, x): 145 | 146 | result = [] 147 | for i, t in self.forest.items(): 148 | result.append( t.output(x) ) 149 | 150 | return np.mean(result) 151 | 152 | 153 | 154 | def build_forest(self): 155 | 156 | forest = {} 157 | 158 | for t in range(self.f_size): 159 | forest[t] = Tree(self, rho=self.rho) 160 | forest[t].tree_leaf_plots(fname='tree_opt%s.png'%t) 161 | 162 | 163 | path = os.getcwd() + '/plots/' 164 | mkdir_p(path) 165 | 166 | 167 | fig = plt.figure(figsize=(10,10)) 168 | ax = fig.add_subplot(111) 169 | 170 | if MODE == 'demo': 171 | color = ['lightcoral', 'dodgerblue', 'mediumseagreen', 'darkorange'] 172 | 173 | for t in range(self.f_size): 174 | print(len(forest[t].leaf_nodes)) 175 | for c, n in enumerate(forest[t].leaf_nodes): 176 | 177 | [[i1, i2], [j1, j2]] = n.quad 178 | x1, x2 = self.grid[0][i1], self.grid[0][i2] 179 | y1, y2 = self.grid[1][j1], self.grid[1][j2] 180 | 181 | ax.fill_between([x1,x2], y1, y2, alpha=.15, color=color[c]) 182 | 183 | pd.DataFrame(self.data, columns=['x', 'y']).plot(ax=ax, x='x', y='y', kind='scatter', lw=0, alpha=.6, s=20, c='k') 184 | plt.savefig(path + 'combined.png', format='png') 185 | plt.close() 186 | 187 | 188 | return forest 189 | 190 | 191 | # Implement Online L-curve optimization like EWMA to get rid of input depth 192 | def tune_entropy_threshold(self, n=5, depth=6, plot_debug=False): 193 | """ 194 | Compute mean optimal entropy based on L-curve elbow method. 
195 | """ 196 | 197 | e_arr = [] 198 | for i in range(n): 199 | 200 | var = Tree(self, rho=.5, depth=depth) 201 | e_arr += [pair + [i] for pair in var.entropy_gain_evol] 202 | 203 | var.domain_splits_plots(subpath='%s/'%i) 204 | 205 | entropy_evol = pd.DataFrame(e_arr, columns=['depth', 'entropy', 'tree']) 206 | entropy_evol = entropy_evol.groupby(['tree', 'depth'])[['entropy']].mean().reset_index().pivot(columns='tree', index='depth', values='entropy').fillna(0) 207 | entropy_elbow_cand = entropy_evol.apply(lambda x: opt_L_curve(np.array(x.index), np.array(x))) 208 | 209 | avg_opt_entropy = entropy_elbow_cand.mean() 210 | if plot_debug: 211 | 212 | fig = plt.figure(figsize=(10,10)) 213 | ax = fig.add_subplot(111) 214 | entropy_evol.plot(ax=ax, kind='line', alpha=.6, lw=3., title='Avg. Opt. Entropy = %.2f'%avg_opt_entropy) 215 | plt.savefig('evol.png', format='png') 216 | plt.close() 217 | 218 | return avg_opt_entropy 219 | 220 | 221 | 222 | 223 | def _run_rf(data, grid_obj): 224 | 225 | print('Starting...') 226 | df_obj = DensityForest(data, grid_obj=grid_obj, f_size=5, rho=.5) 227 | df_obj.node_class = LEAF_DICT[LEAF_TYPE] 228 | 229 | print('Training...') 230 | df_obj.train() 231 | print('Estimating...') 232 | pdf = df_obj.estimate() 233 | print('Plotting...') 234 | df_obj.plot_density(fname='density_estimation_kde.png') 235 | 236 | return df_obj, pdf 237 | 238 | 239 | def run_demo(): 240 | 241 | params = { 242 | 'mu' : [[0, 0], [0, 20], [50, 40], [10, -20], [20,5]], 243 | 'cov': [[[1, 0], [0, 150]], [[90, 0], [0, 5]], [[4, 1], [1, 4]], [[40, 5], [5, 40]], [[90, 15], [15, 16]]], 244 | 'n':[700, 20, 200, 400, 300]} 245 | gauss_data_obj = TestDataGauss(params=params, fname='data_new.npy', replace=False) 246 | gauss_data_obj.check_plot() 247 | 248 | df_obj, _ = _run_rf(gauss_data_obj.data, gauss_data_obj.grid_obj) 249 | 250 | print('Comparing...') 251 | 252 | comp = CompareDistributions(original=gauss_data_obj, estimate=df_obj) 253 | comp.vizualize_both('density_comp_kde.png', show_data=False) 254 | 255 | 256 | def run(): 257 | 258 | data_obj = TestDataAny(fname=DATA_PATH, partitions=DIVS) 259 | 260 | density_estimation, pdf = _run_rf(data_obj.data, data_obj.grid_obj) 261 | 262 | pdf_path = os.path.dirname(DATA_PATH) 263 | np.save(pdf_path, pdf) 264 | 265 | print('\tDensity estimation function stored at: %s'%pdf_path) 266 | 267 | 268 | if __name__ == "__main__": 269 | 270 | if MODE == 'demo': 271 | 272 | print('---------------------------') 273 | print('DEMO') 274 | print('---------------------------') 275 | run_demo() 276 | else: 277 | print('---------------------------') 278 | print('Density Estimation') 279 | print('---------------------------') 280 | run() 281 | 282 | 283 | 284 | ''' 285 | 286 | params = { 287 | 'mu' : [[0, 0], [0, 55], [20, 15], [45, 20]], 288 | 'cov': [[[2, 0], [0, 80]], [[90, 0], [0, 5]], [[4, 0], [0, 4]], [[40, 0], [0, 40]]], 289 | 'n':[100, 100, 100, 100]} 290 | 291 | 292 | var = TestDataGauss(params=params, fname='data.npy') 293 | ''' 294 | ''' 295 | foo = DensityForest(var.data, grid_obj=var.grid_obj, f_size=5, rho=.5) 296 | foo.train() 297 | foo.estimate() 298 | foo.plot_density() 299 | 300 | tri = CompareDistributions(original=var, estimate=foo) 301 | tri.vizualize_both('density_comp.png') 302 | ''' 303 | 304 | #print(foo.forest[0].output([0, 40])) 305 | 306 | 307 | 308 | 309 | 310 | 311 | -------------------------------------------------------------------------------- /df_help.py: -------------------------------------------------------------------------------- 

import os, errno
import abc

import numpy as np
import matplotlib.pyplot as plt
from pylab import *

from grid import Grid

def mkdir_p(path):
    """
    Create directory given path.
    """
    try:
        os.makedirs(path)
    except OSError as exc:  # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise

def integrate_2d(deltas, func):
    """
    2D numerical integral approximation (trapezoidal rule on a regular grid).
    """

    suma = 0
    for i in range(len(func)-1):
        for j in range(len(func[0])-1):
            suma += func[i][j] + func[i+1][j] + func[i][j+1] + func[i+1][j+1]

    step = 1.
    for d in deltas:
        step *= d

    return 0.25*suma*step


def cartesian(arrays, out=None):
    """
    Compute the cartesian product of a set of vectors.
    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = int(n / arrays[0].size)
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m,1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]
    return out


def opt_L_curve(xs, ys):
    """
    Find the elbow of an L-curve: the point furthest from the straight line
    joining the curve's endpoints. Returns that point's y-value.
    """

    x0, y0 = xs[0], ys[0]
    x1, y1 = xs[-1], ys[-1]
    ra = float(y0 - y1) / (x0 - x1)
    rb = y1 - ra*x1
    result = []
    for xp, yp in zip(xs, ys):
        da = -1./ra
        db = yp - da * xp
        x_star = float(db-rb)/(ra-da)
        y_star = ra*x_star + rb
        result.append( [np.sqrt((xp-x_star)**2 + (yp-y_star)**2), xp, yp] )

    return max(result, key=lambda x: x[0])[2]


class TestData(abc.ABC):
    """
    Abstract base class for test data and reference distributions.
    """

    def __init__(self):
        self.dist = []

    @abc.abstractmethod
    def generate_data(self, fname=''):
        pass

    @abc.abstractmethod
    def check_norm(self):
        pass

    @abc.abstractmethod
    def compute_distribution(self):
        pass

    @abc.abstractmethod
    def evaluate(self, x):
        pass


class TestDataGauss(TestData):

    def __init__(self, params, fname, replace=False, partitions=100):

        self.replace = replace
        self.mu = params['mu']
        self.cov = params['cov']
        self.n = params['n']
        self.Z = [N/np.sum(self.n) for N in self.n]  # mixture weights

        self.data = self.generate_data(fname=fname)
        self.grid_obj = Grid(self.data, partitions)
        self.grid = self.grid_obj.axis

        self.dist = self.compute_distribution()


    def check_norm(self):
        dist_vals = []

        deltas = []

        for v in self.grid:
            deltas.append( v[1]-v[0] )

        for i, x in enumerate(self.grid[0]):
            dist_vals.append([])
            for j, y in enumerate(self.grid[1]):
                dist_vals[i].append(self.evaluate(np.array([x, y])))

        integral = integrate_2d(deltas=deltas, func=dist_vals)

        return integral


    def compute_distribution(self):
        dist = []
        for j, y in enumerate(self.grid[1]):
            dist.append([])
            for i, x in enumerate(self.grid[0]):
                dist[j].append(self.evaluate(np.array([x, y])))
        return dist

    def evaluate(self, x):
        # Evaluate the Gaussian-mixture density at x.
        suma = 0
        for i, args in enumerate(zip(self.mu, self.cov)):
            mu, cov = args
            gauss_arg = np.inner(np.transpose((x-mu)), np.inner(np.linalg.inv(cov), (x-mu)))
            suma += self.Z[i]*(np.exp(-.5*gauss_arg))/(2*np.pi*np.sqrt(np.linalg.det(cov)))
        return suma


    def generate_data(self, fname='data.npy'):

        if os.path.isfile(fname) and not self.replace:
            return np.load(fname)
        else:
            g = np.random.multivariate_normal
            data = []

            for mu, cov, n in zip(self.mu, self.cov, self.n):
                data += list(g(mu, cov, n))

            data = np.array(data)

            np.save(fname, data)

            return data


    def check_plot(self):

        X = self.grid[0]
        Y = self.grid[1]
        Z = self.dist

        fig = plt.figure(figsize=(12, 12))
        ax = fig.add_subplot(111)

        vmin = np.min(Z)
        vmax = np.max(Z)
        var = plt.pcolormesh(np.array(X), np.array(Y), np.array(Z), cmap=cm.Greens, vmin=vmin, vmax=vmax)
        plt.colorbar(var, ticks=np.arange(vmin, vmax, (vmax-vmin)/8))
        ax = plt.gca()
        gris = 200.0
        ax.set_facecolor((gris/255, gris/255, gris/255))

        ax.scatter(*zip(*self.data), alpha=.5, c='k', s=10., lw=0)

        plt.xlim(np.min(X), np.max(X))
        plt.ylim(np.min(Y), np.max(Y))
        plt.grid()
        fig.savefig('true_dist_check.png', format='png')
        plt.close()


class TestDataAny(TestData):

    def __init__(self, fname, partitions=100):

        self.data = self.generate_data(fname=fname)

        self.grid_obj = Grid(self.data, partitions)
        self.grid = self.grid_obj.axis

        self.dist = self.compute_distribution()

        self.check_plot()

    def generate_data(self, fname='data.npy'):

        if os.path.isfile(fname):
            return np.load(fname)
        else:
            raise ValueError('Enter valid path to source data.')

    def check_plot(self):

        X = self.grid[0]
        Y = self.grid[1]

        fig = plt.figure(figsize=(12, 12))
        ax = fig.add_subplot(111)

        ax.scatter(*zip(*self.data), alpha=.5, c='k', s=10., lw=0)

        plt.xlim(np.min(X), np.max(X))
        plt.ylim(np.min(Y), np.max(Y))
        plt.grid()
        fig.savefig('init_data.png', format='png')
        plt.close()


    def check_norm(self):
        pass

    def compute_distribution(self):
        pass

    def evaluate(self, x):
        pass


class CompareDistributions:

    def __init__(self, original, estimate):
        self.P = np.array(original.dist)
        self.Q = np.array(estimate.dist)
        self.grid = original.grid
        self.data = original.data


    def compute_JSD(self):
        """
        Unnormalized Jensen-Shannon divergence between the two grids
        (omits the conventional 1/2 factor; fine for relative comparison).
        """
        if self.P.shape == self.Q.shape:
            suma = 0

            for i in range(len(self.P)):
                for j in range(len(self.P[0])):
                    if self.P[i][j] > 0:
                        suma += self.P[i][j]*np.log(2.*self.P[i][j]/(self.P[i][j] + self.Q[i][j]))

            for i in range(len(self.Q)):
                for j in range(len(self.Q[0])):
                    if self.Q[i][j] > 0:
                        suma += self.Q[i][j]*np.log(2.*self.Q[i][j]/(self.P[i][j] + self.Q[i][j]))

            return suma


    def vizualize_both(self, fname='density_comp.png', show_data=False):
        X = self.grid[0]
        Y = self.grid[1]
        Z1 = self.P
        Z2 = self.Q

        fig = plt.figure(figsize=(12, 12))

        true_dist_params = (211, Z1, 'True Distribution', cm.Greens)
        rf_dist_params = (212, Z2, 'Density Forest Estimate; JSD = %.3f'%(self.compute_JSD()), cm.Blues)

        for sp, Z, title, cmap in [true_dist_params, rf_dist_params]:
            ax = fig.add_subplot(sp)
            vmin = np.min(Z1)
            vmax = np.max(Z1)
            var = plt.pcolormesh(np.array(X), np.array(Y), np.array(Z), cmap=cmap, vmin=vmin, vmax=vmax)
            plt.colorbar(var, ticks=np.arange(vmin, vmax, (vmax-vmin)/8))
            ax = plt.gca()
            gris = 200.0
            ax.set_facecolor((gris/255, gris/255, gris/255))

            if show_data:
                ax.scatter(*zip(*self.data), alpha=.5, c='k', s=10., lw=0)

            ax.set_title(title)
            plt.xlim(np.min(X), np.max(X))
            plt.ylim(np.min(Y), np.max(Y))
            plt.grid()

        fig.savefig(fname, format='png')
        plt.close()

--------------------------------------------------------------------------------
/grid.py:
--------------------------------------------------------------------------------

import numpy as np

class Grid(object):
    """
    Class which defines the boundaries and granularity of the finite grid.
    """

    def __init__(self, data, div):
        self.data = data
        self.perc_padding = .2

        self.partitions, self.axis = self.init_grid(div)
        #self.grid, self.step, self.displ = self.init_grid()
        #self.split_map = self.create_split_map()


    def init_grid(self, div):

        maxi, mini = np.max(self.data, axis=0), np.min(self.data, axis=0)
        step = (maxi - mini)/div

        mini = mini - step * div * self.perc_padding
        maxi = maxi + step * div * self.perc_padding

        grid_range = int(div*(1+2*self.perc_padding))
        axis = [[mini[0] + i * step[0] for i in range(grid_range)], [mini[1] + j * step[1] for j in range(grid_range)]]

        return grid_range, axis

--------------------------------------------------------------------------------
/node.py:
--------------------------------------------------------------------------------

import numpy as np
from df_help import *


class Node(object):
    """
    Base class for the nodes in a decision tree.
    """

    def __init__(self, data, quad, depth):

        self.go_right = None
        self.quad = quad
        self.depth = depth
        self.s_l = len(data)

        self.left = None
        self.right = None


    def add_split(self, value, axis):
        return lambda x: x[axis] > value

    '''
    def leaf_output(self, x):
        """
        Evaluate the density estimation of that leaf on x.
        """
        gauss_arg = np.inner(np.transpose((x-self.mu)), np.inner(self.inv_cov, (x-self.mu)))
        return (np.exp(-.5*gauss_arg))/(2*np.pi*self.sqrt_cov)
    '''
    def check_norm(self, grid_axis):
        """
        Verify leaf integrates to ~ 1.
39 | """ 40 | 41 | dist_vals = [] 42 | 43 | grid_axis_local = grid_axis.copy() 44 | grid_axis_local[0] = grid_axis_local[0][self.quad[0][0]:self.quad[0][1]+1] 45 | grid_axis_local[1] = grid_axis_local[1][self.quad[1][0]:self.quad[1][1]+1] 46 | 47 | deltas = [] 48 | 49 | for v in grid_axis_local: 50 | deltas.append( v[1]-v[0] ) 51 | 52 | for i, x in enumerate(grid_axis_local[0]): 53 | dist_vals.append([]) 54 | for j, y in enumerate(grid_axis_local[1]): 55 | dist_vals[i].append(self.leaf_output(np.array([x, y]))) 56 | 57 | integral = integrate_2d(deltas=deltas, func=dist_vals) 58 | 59 | if not (integral > 0.95 and integral < 1.05): 60 | print('Node of depth %s, norm = %s'%(self.depth, integral)) 61 | 62 | return integral 63 | 64 | 65 | 66 | 67 | 68 | class NodeGauss(Node): 69 | """ 70 | Class for each of the nodes in a decision tree. 71 | """ 72 | 73 | def __init__(self, data, quad, depth, leaf=False): 74 | 75 | super(NodeGauss, self).__init__(data=data, quad=quad, depth=depth) 76 | 77 | self.leaf = leaf 78 | if leaf: 79 | self.cov = np.cov(data, rowvar=False) 80 | # Check cov positive semidef 81 | # np.all(np.linalg.eigvals(np.cov(data, rowvar=False)) > 0) 82 | if np.all(np.linalg.eigvals(np.cov(data, rowvar=False)) > 0): 83 | self.sqrt_cov = np.sqrt(np.linalg.det(self.cov)) 84 | self.inv_cov = np.linalg.inv(self.cov) 85 | self.mu = np.mean(data, axis=0) 86 | else: 87 | self.sqrt_cov = 1 88 | self.inv_cov = np.zeros(self.cov.shape) 89 | self.mu = np.mean(data, axis=0) 90 | 91 | 92 | def leaf_output(self, x): 93 | """ 94 | Evaluate the density estimation of that leaf on x. 95 | """ 96 | gauss_arg = np.inner(np.transpose((x-self.mu)), np.inner(self.inv_cov, (x-self.mu))) 97 | return (np.exp(-.5*gauss_arg))/(2*np.pi*self.sqrt_cov) 98 | 99 | 100 | 101 | def h_rot(x, d): 102 | return math.pow(len(x), -(2.0)/(d+4))*np.var(x, axis=0) + 1e-8 103 | 104 | 105 | class NodeKDE(Node): 106 | """ 107 | Class for each of the nodes in a decision tree. 
108 | """ 109 | 110 | def __init__(self, data, quad, depth, leaf=False): 111 | 112 | super(NodeKDE, self).__init__(data=data, quad=quad, depth=depth) 113 | self.leaf = leaf 114 | 115 | if leaf: 116 | 117 | self.data = data 118 | h = h_rot(data, len(data[0])) 119 | 120 | self.H = [[h[0], 0],[0, h[1]]] 121 | ''' 122 | print('-------------------') 123 | print(data) 124 | print(data.shape) 125 | print(self.H) 126 | ''' 127 | 128 | self.H_inv = np.linalg.inv(self.H) 129 | self.H_inv_sqrt_det = np.sqrt(np.linalg.det(self.H_inv)) 130 | 131 | 132 | 133 | def k_gauss(self, u): 134 | argum = np.sum(u*np.transpose(np.inner(self.H_inv, u)), axis=1) 135 | return np.exp(-.5*argum) 136 | 137 | 138 | def leaf_output(self, x): 139 | 140 | result = self.H_inv_sqrt_det * np.sum(self.k_gauss(x - self.data)) * 1./(2*np.pi*len(self.data)) 141 | return result 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | if __name__ == "__main__": 153 | 154 | H = np.array([[2,0],[0,2]]) 155 | x = np.array([[1,0],[0,1],[1,1],[0,-1],[-1,0]]) 156 | 157 | a = np.inner(np.transpose(x), np.inner(H, x)) 158 | 159 | print(np.inner(H, x).shape) 160 | #np.inner(np.transpose(u), np.inner(self.H_inv, u)) 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | -------------------------------------------------------------------------------- /result_plots/combined_trees.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/combined_trees.png -------------------------------------------------------------------------------- /result_plots/density_comp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/density_comp.png -------------------------------------------------------------------------------- /result_plots/density_comp_kde.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/density_comp_kde.png -------------------------------------------------------------------------------- /result_plots/density_estimation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/density_estimation.png -------------------------------------------------------------------------------- /result_plots/evol.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/evol.png -------------------------------------------------------------------------------- /result_plots/lcurve.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/lcurve.gif -------------------------------------------------------------------------------- /tree.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import pandas as pd 4 | import matplotlib.pyplot as 
import os

from df_help import *


class Tree:
    """
    Class for training and building decision trees.
    """

    def __init__(self, forest_obj, rho, depth=None):

        self.forest_obj = forest_obj
        self.rho = rho

        self.node_class = forest_obj.node_class

        self.s_0 = len(forest_obj.data)

        self.leaf_nodes = []
        self.entropy_gain_evol = []

        self.explore_depth = depth if depth else 0

        self.root_node = self.build_tree()

        self.Zt = None

        self.tree_nodes_depth = self.extract_levels(self.root_node)
        self.tree_nodes_domain = self.extract_domain_splits(self.root_node)

        if not depth:

            self.Zt = self.norm_tree()


    def check_norm(self):

        dist_vals = []

        deltas = []

        for v in self.forest_obj.grid:
            deltas.append( v[1]-v[0] )

        for i, x in enumerate(self.forest_obj.grid[0]):
            dist_vals.append([])
            for j, y in enumerate(self.forest_obj.grid[1]):
                dist_vals[i].append(self.output(np.array([x, y])))

        integral = integrate_2d(deltas=deltas, func=dist_vals)

        return integral


    def norm_tree(self):
        # Tree partition function: sum of leaf weights times leaf integrals.
        Zt = 0

        for l in self.leaf_nodes:

            pi_l = l.s_l / self.s_0
            integral = l.check_norm(self.forest_obj.grid)
            Zt += pi_l * integral

        return Zt


    def output(self, x):
        # Route x to its leaf, then return the weighted, normalized leaf density.
        current_node = self.root_node

        while not current_node.leaf:
            if current_node.go_right(x):
                current_node = current_node.right
            else:
                current_node = current_node.left

        pi_l = current_node.s_l / self.s_0
        return (pi_l/self.Zt)*current_node.leaf_output(x)


    def _compute_det_lamb(self, S):

        if S.shape[0] > 2:
            return self.forest_obj.entropy_func(S)
        return 1e5


    def entropy_gain(self, parent_entropy, S, ind, axis):
        """
        Compute the entropy gain for a given data set, split index and axis
        of application.
        """

        S_right = S[S[:,axis]>=self.forest_obj.grid[axis][ind]]
        S_left = S[S[:,axis]<self.forest_obj.grid[axis][ind]]

        left_entropy = self._compute_det_lamb(S_left)
        right_entropy = self._compute_det_lamb(S_right)

        # Information gain: parent entropy minus the size-weighted child entropies.
        gain = parent_entropy - (len(S_left)*left_entropy + len(S_right)*right_entropy)/len(S)

        return gain, len(S_left), len(S_right)


    def build_tree(self):
        # Start from the full grid domain and split recursively.
        n = len(self.forest_obj.grid[0]) - 1
        return self.split_node(quad=[[0, n], [0, n]], depth=0)


    def _get_local_data(self, quad):
        # Restrict the data to the branch domain delimited by quad.
        right = self.forest_obj.data[:,0] >= self.forest_obj.grid[0][quad[0][0]]
        left = self.forest_obj.data[:,0] < self.forest_obj.grid[0][quad[0][1]]
        top = self.forest_obj.data[:,1] >= self.forest_obj.grid[1][quad[1][0]]
        bottom = self.forest_obj.data[:,1] < self.forest_obj.grid[1][quad[1][1]]

        return self.forest_obj.data[(right)&(left)&(top)&(bottom)]

    def _get_search_space(self, quad):
        # d axis ranges inside branch domain
        x_edge = range(quad[0][0], quad[0][1]+1)
        y_edge = range(quad[1][0], quad[1][1]+1)

        # Apply randomness rho factor to limit parameter space search
        edge = np.array([(z, 0) for z in x_edge] + [(z, 1) for z in y_edge])
        size = len(edge)
        return edge[np.random.choice(size, size=int(size*self.rho), replace=False)]


    def _find_opt_cut(self, ind_array, local_data):
        max_entropy = 0
        opt_ind = -1
        opt_axis = -1

        parent_entropy = self._compute_det_lamb(local_data)

        for ind, axis in ind_array:

            entropy, left_size, right_size = self.entropy_gain(parent_entropy, local_data, ind, axis)

            if entropy > max_entropy and left_size > 2 and right_size > 2:
                max_entropy = entropy
                opt_ind, opt_axis = (ind, axis)

        return max_entropy, opt_ind, opt_axis


    def _get_new_quad(self, old_quad, axis, opt_ind):
| """ 162 | quad: Return 2*d - indexes that delimit branch domain. 163 | Splits branch domain based on optimal index and axis of application. 164 | """ 165 | opt_quad_left = old_quad.copy() 166 | opt_quad_right = old_quad.copy() 167 | 168 | opt_quad_left[axis] = [old_quad[axis][0], opt_ind] 169 | opt_quad_right[axis] = [opt_ind, old_quad[axis][1]] 170 | 171 | return opt_quad_left, opt_quad_right 172 | 173 | 174 | def split_node(self, quad, depth): 175 | """ 176 | Recursively split nodes until stop condition is reached 177 | """ 178 | 179 | # Restrict data to in branch domain 180 | local_data = self._get_local_data(quad) 181 | 182 | # Restrict search space for optimal cut 183 | ind_array = self._get_search_space(quad) 184 | 185 | # Find split with maxiumum entropy gain 186 | 187 | max_entropy, opt_ind, opt_axis = self._find_opt_cut(ind_array, local_data) 188 | 189 | tune_threshold_cond = depth == self.explore_depth 190 | stop_condition = tune_threshold_cond if self.explore_depth else (self.forest_obj.opt_entropy > max_entropy) 191 | 192 | # Stop Condition 193 | if stop_condition or opt_ind == -1: 194 | leaf_node = self.node_class(data=local_data, quad=quad, depth=depth, leaf=True) 195 | 196 | self.leaf_nodes.append( leaf_node ) 197 | return leaf_node 198 | 199 | 200 | self.entropy_gain_evol.append( [depth, max_entropy] ) 201 | 202 | # Split node's quad 203 | node = self.node_class(data=local_data, quad=quad, depth=depth) 204 | node.go_right = node.add_split(self.forest_obj.grid[opt_axis][opt_ind], opt_axis) 205 | 206 | 207 | opt_quad_left, opt_quad_right = self._get_new_quad(quad, opt_axis, opt_ind) 208 | 209 | node.left = self.split_node(quad=opt_quad_left, depth=depth+1) 210 | node.right = self.split_node(quad=opt_quad_right, depth=depth+1) 211 | 212 | return node 213 | 214 | 215 | def extract_levels(self, node): 216 | 217 | if node.left: 218 | 219 | levels_dic_left = self.extract_levels(node.left) 220 | levels_dic_right = self.extract_levels(node.right) 221 | 222 | for k, v in levels_dic_right.items(): 223 | if k in levels_dic_left: 224 | levels_dic_left[k] += v 225 | else: 226 | levels_dic_left[k] = v 227 | 228 | levels_dic_left[node.depth] = [node] 229 | 230 | return levels_dic_left 231 | 232 | else: 233 | return {node.depth : [node]} 234 | 235 | 236 | def extract_domain_splits(self, node): 237 | 238 | dic = {} 239 | count = 0 240 | dic[count] = [node] 241 | 242 | while not all([n.left is None for n in dic[count]]): 243 | 244 | nodes = dic[count] 245 | count += 1 246 | dic[count] = [] 247 | 248 | for k, n in enumerate(nodes): 249 | if n.left: 250 | 251 | dic[count].append( n.left ) 252 | dic[count].append( n.right ) 253 | else: 254 | dic[count].append( n ) 255 | return dic 256 | 257 | 258 | 259 | def domain_splits_plots(self, subpath=''): 260 | path = os.getcwd() + '/evol/' + subpath 261 | mkdir_p(path) 262 | 263 | 264 | evol = pd.DataFrame(self.entropy_gain_evol).groupby(0)[1].mean() 265 | evol = np.array(list(zip(evol.index, evol))) 266 | 267 | 268 | for d in np.arange(len(self.tree_nodes_domain)): 269 | 270 | nodes = self.tree_nodes_domain[d] 271 | fig = plt.figure(figsize=(10,10)) 272 | 273 | ax0 = fig.add_subplot(211) 274 | 275 | ax0.plot(*zip(*evol[:d]), alpha=.8, color='k', ls='-', lw=2.) 276 | ax0.set_title('Entropy gain vs. 
            plt.xlim(np.min(evol[:,0]), np.max(evol[:,0]))
            plt.ylim(np.min(evol[:,1]), np.max(evol[:,1]))

            ax = fig.add_subplot(212)

            for n in nodes:

                #n.check_norm(self.grid.axis)
                [[i1, i2], [j1, j2]] = n.quad
                x1, x2 = self.forest_obj.grid[0][i1], self.forest_obj.grid[0][i2]
                y1, y2 = self.forest_obj.grid[1][j1], self.forest_obj.grid[1][j2]
                ax.fill_between([x1,x2], y1, y2, alpha=.7)

            pd.DataFrame(self.forest_obj.data, columns=['x', 'y']).plot(ax=ax, x='x', y='y', kind='scatter', lw=0, alpha=.6, s=20, c='k')
            plt.savefig(path + 'branches_depth%s.png'%d, format='png')
            plt.close()


    def tree_leaf_plots(self, fname='data.png'):

        path = os.getcwd() + '/plots/'
        mkdir_p(path)

        fig = plt.figure(figsize=(10,10))
        ax = fig.add_subplot(111)

        for n in self.leaf_nodes:

            #n.check_norm(self.grid.axis)

            [[i1, i2], [j1, j2]] = n.quad
            x1, x2 = self.forest_obj.grid[0][i1], self.forest_obj.grid[0][i2]
            y1, y2 = self.forest_obj.grid[1][j1], self.forest_obj.grid[1][j2]

            ax.fill_between([x1,x2], y1, y2, alpha=.7)

        pd.DataFrame(self.forest_obj.data, columns=['x', 'y']).plot(ax=ax, x='x', y='y', kind='scatter', lw=0, alpha=.6, s=20, c='k')
        plt.savefig(path + fname, format='png')
        plt.close()

--------------------------------------------------------------------------------