├── .gitignore
├── LICENSE
├── README.md
├── density_forest.py
├── df_help.py
├── grid.py
├── node.py
├── result_plots
│   ├── combined_trees.png
│   ├── density_comp.png
│   ├── density_comp_kde.png
│   ├── density_estimation.png
│   ├── evol.png
│   └── lcurve.gif
└── tree.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.DS_Store
__pycache__/
old/
*.pyc

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017 ksanjeevan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Density Estimation Forests in Python
## Kernel Density Estimation in Random Forests

### Usage

Running `--help`:
```bash
usage: density_forest.py [-h] [-l LEAF] [-d DATA] [-g GRANULARITY]

randomforest-density: density estimation using random forests.

optional arguments:
  -h, --help            show this help message and exit
  -l LEAF, --leaf LEAF  Choose what leaf estimator to use (Gaussian ['gauss']
                        or KDE ['kde'])
  -d DATA, --data DATA  Path to data (.npy file, shape (sample_size, 2)).
  -g GRANULARITY, --granularity GRANULARITY
                        Number of divisions for the grid
```

Run the demo (it will produce all the plots seen below):

```bash
python3 density_forest.py -l kde
```

Run on your own data:

```bash
python3 density_forest.py -d data_test.npy -l gauss
```
---
### Introduction

_In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. The unobservable density function is thought of as the density according to which a large population is distributed; the data are usually thought of as a random sample from that population._

_Random forests are an ensemble learning method that operate by constructing a multitude of decision trees and combining their results for predictions.
Random decision forests correct for decision trees' habit of overfitting to their training set._

In this project, a random-forest method for density estimation is implemented in Python. What follows is a presentation of the main steps, results, tests, and comparisons.

### Random Forest Implementation

In this implementation, axis-aligned split functions (known as [stumps](https://en.wikipedia.org/wiki/Decision_stump)) are used to build binary trees by optimizing the [entropy gain](https://en.wikipedia.org/wiki/Differential_entropy) at each node. The key parameters to select for this method are the tree depth / entropy-gain threshold, the forest size, and the amount of randomness.

The optimal depth of a tree is case dependent. For that reason, we first train a small set of trees at a fixed depth (the _tune\_entropy\_threshold_ method, parameters _n_ and _depth_). Unlike forest size, where an increase never yields worse results, a lax stop condition leads to [overfitting](https://en.wikipedia.org/wiki/Overfitting). The entropy gain is strictly decreasing with depth, as can be seen in the animation below:
<p align="center"><img src="result_plots/lcurve.gif"/></p>
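To make the split search concrete, here is a minimal sketch of scoring one candidate stump. It mirrors `gauss_entropy_func` (`density_forest.py`) and the gain computation in `tree.py`; the toy `data` sample and the helper names are made up for illustration:

```python
import numpy as np

def gauss_entropy(S):
    # Gaussian differential entropy up to constant terms: log det(cov).
    return np.log(np.linalg.det(np.cov(S, rowvar=False)) + 1e-8)

def split_gain(S, value, axis):
    # Entropy gain of the axis-aligned stump "x[axis] > value".
    left, right = S[S[:, axis] <= value], S[S[:, axis] > value]
    if len(left) < 3 or len(right) < 3:
        return -np.inf  # too few points on a side to fit a Gaussian
    children = (len(left) * gauss_entropy(left) + len(right) * gauss_entropy(right)) / len(S)
    return gauss_entropy(S) - children

data = np.random.multivariate_normal([0, 0], np.eye(2), 200)  # toy sample
print(split_gain(data, value=0.0, axis=0))
```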
Optimizing the entropy-gain threshold is an [ill-posed regularization problem](https://en.wikipedia.org/wiki/Tikhonov_regularization). It is handled in this implementation by finding the elbow point of the 'maximum depth' curve (the point furthest from the line connecting the function's extremes), averaged over the _n_ trees, as we can see here:
<p align="center"><img src="result_plots/evol.png"/></p>
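The elbow search itself is the `opt_L_curve` helper in `df_help.py`. A small usage sketch, assuming the repo root is on `PYTHONPATH` (the gain values below are invented for illustration):

```python
import numpy as np
from df_help import opt_L_curve  # assumes the repo root is importable

depths = np.arange(1, 8)
gains = np.array([12.0, 7.5, 4.8, 3.9, 3.5, 3.3, 3.2])  # made-up, decreasing

# Returns the gain value at the point furthest from the chord joining the
# curve's endpoints -- used as the entropy-gain stop threshold.
threshold = opt_L_curve(depths, gains)
print(threshold)
```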
This step is expensive, since the depth is fixed with no a priori indication of where the optimal threshold is, and the number of leaves that need to be fitted grows exponentially. A better approach would be to implement an online L-curve method (such as the ones discussed [here](https://www1.icsi.berkeley.edu/~barath/papers/kneedle-simplex11.pdf)) as a first pass to avoid initial over-splitting (pending).

From _[Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/decisionForests_MSR_TR_2011_114.pdf)_:

_A key aspect of decision forests is the fact that its component trees are all randomly different from one another. This leads to de-correlation between the individual tree predictions and, in turn, to improved generalization. Forest randomness also helps achieve high robustness with respect to noisy data. Randomness is injected into the trees during the training phase. Two of the most popular ways of doing so are:_

* _[Random training data set sampling (e.g. bagging)](https://en.wikipedia.org/wiki/Bootstrap_aggregating)_
* _[Randomized node optimization](https://en.wikipedia.org/wiki/Random_subspace_method)_

_These two techniques are not mutually exclusive and could be used together._

The method is tested by sampling from a mixture of Gaussians. Randomness is introduced by randomizing the node optimization with the parameter _rho_: the fraction of the parameter search space available at each node split is proportional to it. With _rho_ = 50% and 5 trees, we see the first results below:
<p align="center"><img src="result_plots/density_estimation.png"/></p>
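A minimal sketch of that _rho_ randomization, mirroring `Tree._get_search_space` in `tree.py` (candidate splits are (grid index, axis) pairs, of which only a random _rho_-fraction gets scored; the function name here is hypothetical):

```python
import numpy as np

def sample_search_space(x_range, y_range, rho):
    # All candidate (grid index, axis) pairs inside the node's domain...
    candidates = np.array([(i, 0) for i in x_range] + [(j, 1) for j in y_range])
    # ...of which only a random rho-fraction is evaluated for entropy gain.
    size = len(candidates)
    keep = np.random.choice(size, size=int(size * rho), replace=False)
    return candidates[keep]

print(sample_search_space(range(0, 10), range(0, 10), rho=0.5))
```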
Performance is harder to measure when using random forests for density estimation (as opposed to regression or classification), since this is an unsupervised setting. Here, the [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) (JSD) is used as a comparison measure whenever test data from a known distribution is available.
<p align="center"><img src="result_plots/density_comp.png"/></p>
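For reference, a hedged sketch of the divergence between two discretized, normalized grids (`compute_JSD` in `df_help.py` evaluates the same two KL terms, but without the conventional 1/2 factor, which is fine for relative comparisons):

```python
import numpy as np

def jsd(P, Q):
    # Jensen-Shannon divergence between two non-negative grids,
    # normalized here so that each sums to 1.
    P, Q = P / P.sum(), Q / Q.sum()
    M = 0.5 * (P + Q)
    kl = lambda A, B: np.sum(A[A > 0] * np.log(A[A > 0] / B[A > 0]))
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

P = np.random.rand(50, 50)  # stand-ins for the true and estimated grids
Q = np.random.rand(50, 50)
print(jsd(P, Q))
```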
### Leaf prediction using KDE

One of the main problems of [Kernel Density Estimation](https://web.as.uky.edu/statistics/users/pbreheny/621/F10/notes/10-28.pdf) is the choice of bandwidth. Many of the approaches for finding it rely on assumptions about the underlying distribution and perform poorly on clustered, real-world data (although there are methods that effectively incorporate an [adaptive bandwidth](https://indico.cern.ch/event/548789/contributions/2258640/attachments/1327011/1992522/kde.pdf)).

The module can work with any implementation of the _Node_ class. In the first examples above, the _NodeGauss_ class is used, fitting a Gaussian distribution at each leaf. Below are the results of using _NodeKDE_, where the compactness measure is still based on the Gaussian differential entropy, but the leaf prediction is the result of the KDE method. Because the splits are chosen to optimize the fit of a Gaussian, many of the multivariate bandwidth problems of KDE are avoided, and Silverman's rule for bandwidth selection can be used with good results:
<p align="center"><img src="result_plots/density_comp_kde.png"/></p>
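A minimal sketch of that rule-of-thumb bandwidth and the resulting leaf estimate, mirroring `h_rot` and `NodeKDE.leaf_output` in `node.py` (the leaf sample and helper names below are made up):

```python
import numpy as np

def silverman_H(S):
    # Rule-of-thumb diagonal bandwidth matrix: n^(-2/(d+4)) * per-axis variance.
    n, d = S.shape
    return np.diag(n ** (-2.0 / (d + 4)) * np.var(S, axis=0) + 1e-8)

def kde_at(x, S, H):
    # Gaussian-kernel density estimate at point x (2D case).
    H_inv = np.linalg.inv(H)
    u = x - S
    quad = np.sum(u @ H_inv * u, axis=1)  # u^T H^-1 u for each sample
    return np.sqrt(np.linalg.det(H_inv)) * np.exp(-0.5 * quad).sum() / (2 * np.pi * len(S))

S = np.random.multivariate_normal([0, 0], np.eye(2), 300)  # toy leaf sample
print(kde_at(np.zeros(2), S, silverman_H(S)))
```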
Although it produces a much better overall JSD, it's worth noting that the top-right 'bump' overfits the noise more. This is expected: the underlying distribution of our test data is a mixture of Gaussians, and if a leaf completely encompasses a bump (as can be seen in the forest representation below), then fitting a Gaussian will outperform any non-parametric technique.
<p align="center"><img src="result_plots/combined_trees.png"/></p>
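That intuition is easy to check in isolation. A toy comparison of the two leaf models on a single Gaussian bump, scored by held-out average log-likelihood (requires SciPy; all data here is synthetic, and the parametric fit typically wins):

```python
import numpy as np
from scipy.stats import multivariate_normal  # extra dependency, for the sketch only

rng = np.random.default_rng(0)
mu, cov = [0, 0], [[2.0, 0.5], [0.5, 1.0]]
train = rng.multivariate_normal(mu, cov, 400)
test = rng.multivariate_normal(mu, cov, 400)

# Gaussian leaf: fit mean and covariance once.
gauss = multivariate_normal(train.mean(axis=0), np.cov(train, rowvar=False))

# KDE leaf: Silverman-style diagonal bandwidth, Gaussian kernel.
n, d = train.shape
H = np.diag(n ** (-2.0 / (d + 4)) * np.var(train, axis=0))
kern = multivariate_normal(np.zeros(d), H)
kde_ll = np.log(kern.pdf(test[:, None, :] - train[None, :, :]).mean(axis=1))

print('gauss leaf avg log-lik:', gauss.logpdf(test).mean())
print('kde leaf avg log-lik  :', kde_ll.mean())
```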
#### To do

* Try other entropy-gain functions / compactness measures.
* Use an online L-curve method for entropy threshold optimization.
* Profile and address other bottlenecks.
* Refactor to reuse the framework for classification and regression.
* Fit unknown distributions / evaluate performance.
* Use EMD as a comparison metric.

--------------------------------------------------------------------------------
/density_forest.py:
--------------------------------------------------------------------------------


import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os

from df_help import *

from grid import Grid
from tree import Tree
from node import NodeGauss, NodeKDE
import argparse

from pylab import *


argparser = argparse.ArgumentParser(
    description='randomforest-density: density estimation using random forests.')


argparser.add_argument(
    '-l',
    '--leaf',
    help='Choose what leaf estimator to use'
    ' (Gaussian [\'gauss\'] or KDE [\'kde\'])',
    default='gauss')

argparser.add_argument(
    '-d',
    '--data',
    help='Path to data (.npy file, shape (sample_size, 2)).',
    default='')

argparser.add_argument(
    '-g',
    '--granularity',
    help='Number of divisions for the grid',
    type=int,  # parse as int so the Grid arithmetic works
    default=100)


args = argparser.parse_args()
assert args.leaf in ['kde', 'gauss'], "Pass valid leaf estimator"


LEAF_DICT = {
    'kde': NodeKDE,
    'gauss': NodeGauss
}

LEAF_TYPE = args.leaf
MODE = 'est' if args.data else 'demo'
DATA_PATH = args.data
DIVS = args.granularity

def gauss_entropy_func(S):
    """
    Gaussian differential entropy (ignoring constant terms since
    we're interested in the delta).
    """
    return np.log(np.linalg.det(np.cov(S, rowvar=False)) + 1e-8)


class DensityForest:
    """
    Class for the density forest: computes the entropy threshold for the stop
    condition, then builds and trains a forest of trees with randomness rho.
70 | """ 71 | 72 | def __init__(self, data, grid_obj, f_size, rho=1.): 73 | 74 | self.data = data 75 | self.f_size = f_size 76 | 77 | self.node_class = NodeGauss 78 | self.entropy_func = gauss_entropy_func 79 | 80 | self.grid_obj = grid_obj 81 | 82 | self.grid = self.grid_obj.axis 83 | 84 | self.rho = rho 85 | 86 | 87 | def train(self): 88 | 89 | self.opt_entropy = self.tune_entropy_threshold(plot_debug=True) 90 | self.forest = self.build_forest() 91 | 92 | 93 | def estimate(self): 94 | 95 | self.dist = self.compute_density() 96 | return self.dist 97 | 98 | 99 | def compute_density(self): 100 | 101 | 102 | 103 | dist = [] 104 | for j, y in enumerate(self.grid[1]): 105 | dist.append([]) 106 | for i, x in enumerate(self.grid[0]): 107 | dist[j].append(self.forest_output(np.array([x, y]))) 108 | return dist 109 | 110 | 111 | def plot_density(self, fname='density_estimation.png'): 112 | 113 | X = self.grid[0] 114 | Y = self.grid[1] 115 | Z = self.dist 116 | 117 | fig = plt.figure(figsize=(12, 12)) 118 | ax = fig.add_subplot(111) 119 | 120 | 121 | vmin=np.min(Z) 122 | vmax=np.max(Z) 123 | var = plt.pcolormesh(np.array(X),np.array(Y),np.array(Z), cmap=cm.Blues, vmin=vmin, vmax=vmax) 124 | plt.colorbar(var, ticks=np.arange(vmin, vmax, (vmax-vmin)/8)) 125 | 126 | ax = plt.gca() 127 | 128 | gris = 200.0 129 | ax.set_facecolor((gris/255, gris/255, gris/255)) 130 | 131 | #ax.scatter(*zip(*self.data), alpha=.5, c='k', s=10., lw=0) 132 | 133 | plt.xlim(np.min(X), np.max(X)) 134 | plt.ylim(np.min(Y), np.max(Y)) 135 | plt.grid() 136 | 137 | ax.set_title('rho = %s, |T| = %d, max_entropy = %.2f'%(self.rho, self.f_size, self.opt_entropy)) 138 | 139 | fig.savefig(fname, format='png') 140 | plt.close() 141 | 142 | 143 | 144 | def forest_output(self, x): 145 | 146 | result = [] 147 | for i, t in self.forest.items(): 148 | result.append( t.output(x) ) 149 | 150 | return np.mean(result) 151 | 152 | 153 | 154 | def build_forest(self): 155 | 156 | forest = {} 157 | 158 | for t in range(self.f_size): 159 | forest[t] = Tree(self, rho=self.rho) 160 | forest[t].tree_leaf_plots(fname='tree_opt%s.png'%t) 161 | 162 | 163 | path = os.getcwd() + '/plots/' 164 | mkdir_p(path) 165 | 166 | 167 | fig = plt.figure(figsize=(10,10)) 168 | ax = fig.add_subplot(111) 169 | 170 | if MODE == 'demo': 171 | color = ['lightcoral', 'dodgerblue', 'mediumseagreen', 'darkorange'] 172 | 173 | for t in range(self.f_size): 174 | print(len(forest[t].leaf_nodes)) 175 | for c, n in enumerate(forest[t].leaf_nodes): 176 | 177 | [[i1, i2], [j1, j2]] = n.quad 178 | x1, x2 = self.grid[0][i1], self.grid[0][i2] 179 | y1, y2 = self.grid[1][j1], self.grid[1][j2] 180 | 181 | ax.fill_between([x1,x2], y1, y2, alpha=.15, color=color[c]) 182 | 183 | pd.DataFrame(self.data, columns=['x', 'y']).plot(ax=ax, x='x', y='y', kind='scatter', lw=0, alpha=.6, s=20, c='k') 184 | plt.savefig(path + 'combined.png', format='png') 185 | plt.close() 186 | 187 | 188 | return forest 189 | 190 | 191 | # Implement Online L-curve optimization like EWMA to get rid of input depth 192 | def tune_entropy_threshold(self, n=5, depth=6, plot_debug=False): 193 | """ 194 | Compute mean optimal entropy based on L-curve elbow method. 
195 | """ 196 | 197 | e_arr = [] 198 | for i in range(n): 199 | 200 | var = Tree(self, rho=.5, depth=depth) 201 | e_arr += [pair + [i] for pair in var.entropy_gain_evol] 202 | 203 | var.domain_splits_plots(subpath='%s/'%i) 204 | 205 | entropy_evol = pd.DataFrame(e_arr, columns=['depth', 'entropy', 'tree']) 206 | entropy_evol = entropy_evol.groupby(['tree', 'depth'])[['entropy']].mean().reset_index().pivot(columns='tree', index='depth', values='entropy').fillna(0) 207 | entropy_elbow_cand = entropy_evol.apply(lambda x: opt_L_curve(np.array(x.index), np.array(x))) 208 | 209 | avg_opt_entropy = entropy_elbow_cand.mean() 210 | if plot_debug: 211 | 212 | fig = plt.figure(figsize=(10,10)) 213 | ax = fig.add_subplot(111) 214 | entropy_evol.plot(ax=ax, kind='line', alpha=.6, lw=3., title='Avg. Opt. Entropy = %.2f'%avg_opt_entropy) 215 | plt.savefig('evol.png', format='png') 216 | plt.close() 217 | 218 | return avg_opt_entropy 219 | 220 | 221 | 222 | 223 | def _run_rf(data, grid_obj): 224 | 225 | print('Starting...') 226 | df_obj = DensityForest(data, grid_obj=grid_obj, f_size=5, rho=.5) 227 | df_obj.node_class = LEAF_DICT[LEAF_TYPE] 228 | 229 | print('Training...') 230 | df_obj.train() 231 | print('Estimating...') 232 | pdf = df_obj.estimate() 233 | print('Plotting...') 234 | df_obj.plot_density(fname='density_estimation_kde.png') 235 | 236 | return df_obj, pdf 237 | 238 | 239 | def run_demo(): 240 | 241 | params = { 242 | 'mu' : [[0, 0], [0, 20], [50, 40], [10, -20], [20,5]], 243 | 'cov': [[[1, 0], [0, 150]], [[90, 0], [0, 5]], [[4, 1], [1, 4]], [[40, 5], [5, 40]], [[90, 15], [15, 16]]], 244 | 'n':[700, 20, 200, 400, 300]} 245 | gauss_data_obj = TestDataGauss(params=params, fname='data_new.npy', replace=False) 246 | gauss_data_obj.check_plot() 247 | 248 | df_obj, _ = _run_rf(gauss_data_obj.data, gauss_data_obj.grid_obj) 249 | 250 | print('Comparing...') 251 | 252 | comp = CompareDistributions(original=gauss_data_obj, estimate=df_obj) 253 | comp.vizualize_both('density_comp_kde.png', show_data=False) 254 | 255 | 256 | def run(): 257 | 258 | data_obj = TestDataAny(fname=DATA_PATH, partitions=DIVS) 259 | 260 | density_estimation, pdf = _run_rf(data_obj.data, data_obj.grid_obj) 261 | 262 | pdf_path = os.path.dirname(DATA_PATH) 263 | np.save(pdf_path, pdf) 264 | 265 | print('\tDensity estimation function stored at: %s'%pdf_path) 266 | 267 | 268 | if __name__ == "__main__": 269 | 270 | if MODE == 'demo': 271 | 272 | print('---------------------------') 273 | print('DEMO') 274 | print('---------------------------') 275 | run_demo() 276 | else: 277 | print('---------------------------') 278 | print('Density Estimation') 279 | print('---------------------------') 280 | run() 281 | 282 | 283 | 284 | ''' 285 | 286 | params = { 287 | 'mu' : [[0, 0], [0, 55], [20, 15], [45, 20]], 288 | 'cov': [[[2, 0], [0, 80]], [[90, 0], [0, 5]], [[4, 0], [0, 4]], [[40, 0], [0, 40]]], 289 | 'n':[100, 100, 100, 100]} 290 | 291 | 292 | var = TestDataGauss(params=params, fname='data.npy') 293 | ''' 294 | ''' 295 | foo = DensityForest(var.data, grid_obj=var.grid_obj, f_size=5, rho=.5) 296 | foo.train() 297 | foo.estimate() 298 | foo.plot_density() 299 | 300 | tri = CompareDistributions(original=var, estimate=foo) 301 | tri.vizualize_both('density_comp.png') 302 | ''' 303 | 304 | #print(foo.forest[0].output([0, 40])) 305 | 306 | 307 | 308 | 309 | 310 | 311 | -------------------------------------------------------------------------------- /df_help.py: -------------------------------------------------------------------------------- 

import os, errno
import abc

import numpy as np
import matplotlib.pyplot as plt
from pylab import *

from grid import Grid

def mkdir_p(path):
    """
    Create directory given path.
    """
    try:
        os.makedirs(path)
    except OSError as exc:  # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise

def integrate_2d(deltas, func):
    """
    2D numerical integral approximation (trapezoidal rule on a regular grid).
    """

    suma = 0
    for i in range(len(func)-1):
        for j in range(len(func[0])-1):
            suma += func[i][j] + func[i+1][j] + func[i][j+1] + func[i+1][j+1]

    step = 1.
    for d in deltas:
        step *= d

    return 0.25*suma*step


def cartesian(arrays, out=None):
    """
    Compute the cartesian product of a set of vectors.
    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = int(n / arrays[0].size)
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m,1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]
    return out


def opt_L_curve(xs, ys):
    """
    Find the elbow of an L-curve: the point furthest from the straight line
    joining the curve's endpoints. Returns that point's y-value.
    """

    x0, y0 = xs[0], ys[0]
    x1, y1 = xs[-1], ys[-1]
    ra = float(y0 - y1) / (x0 - x1)
    rb = y1 - ra*x1
    result = []
    for xp, yp in zip(xs, ys):
        da = -1./ra
        db = yp - da * xp
        x_star = float(db-rb)/(ra-da)
        y_star = ra*x_star + rb
        result.append( [np.sqrt((xp-x_star)**2 + (yp-y_star)**2), xp, yp] )

    return max(result, key=lambda x: x[0])[2]


class TestData(abc.ABC):
    """
    Abstract base class for test data and reference distributions.
    """

    def __init__(self):
        self.dist = []

    @abc.abstractmethod
    def generate_data(self, fname=''):
        pass

    @abc.abstractmethod
    def check_norm(self):
        pass

    @abc.abstractmethod
    def compute_distribution(self):
        pass

    @abc.abstractmethod
    def evaluate(self, x):
        pass


class TestDataGauss(TestData):

    def __init__(self, params, fname, replace=False, partitions=100):

        self.replace = replace
        self.mu = params['mu']
        self.cov = params['cov']
        self.n = params['n']
        self.Z = [N/np.sum(self.n) for N in self.n]  # mixture weights

        self.data = self.generate_data(fname=fname)
        self.grid_obj = Grid(self.data, partitions)
        self.grid = self.grid_obj.axis

        self.dist = self.compute_distribution()


    def check_norm(self):
        dist_vals = []

        deltas = []

        for v in self.grid:
            deltas.append( v[1]-v[0] )

        for i, x in enumerate(self.grid[0]):
            dist_vals.append([])
            for j, y in enumerate(self.grid[1]):
                dist_vals[i].append(self.evaluate(np.array([x, y])))

        integral = integrate_2d(deltas=deltas, func=dist_vals)

        return integral


    def compute_distribution(self):
        dist = []
        for j, y in enumerate(self.grid[1]):
            dist.append([])
            for i, x in enumerate(self.grid[0]):
                dist[j].append(self.evaluate(np.array([x, y])))
        return dist

    def evaluate(self, x):
        # Evaluate the Gaussian-mixture density at x.
        suma = 0
        for i, args in enumerate(zip(self.mu, self.cov)):
            mu, cov = args
            gauss_arg = np.inner(np.transpose((x-mu)), np.inner(np.linalg.inv(cov), (x-mu)))
            suma += self.Z[i]*(np.exp(-.5*gauss_arg))/(2*np.pi*np.sqrt(np.linalg.det(cov)))
        return suma


    def generate_data(self, fname='data.npy'):

        if os.path.isfile(fname) and not self.replace:
            return np.load(fname)
        else:
            g = np.random.multivariate_normal
            data = []

            for mu, cov, n in zip(self.mu, self.cov, self.n):
                data += list(g(mu, cov, n))

            data = np.array(data)

            np.save(fname, data)

            return data


    def check_plot(self):

        X = self.grid[0]
        Y = self.grid[1]
        Z = self.dist

        fig = plt.figure(figsize=(12, 12))
        ax = fig.add_subplot(111)

        vmin = np.min(Z)
        vmax = np.max(Z)
        var = plt.pcolormesh(np.array(X), np.array(Y), np.array(Z), cmap=cm.Greens, vmin=vmin, vmax=vmax)
        plt.colorbar(var, ticks=np.arange(vmin, vmax, (vmax-vmin)/8))
        ax = plt.gca()
        gris = 200.0
        ax.set_facecolor((gris/255, gris/255, gris/255))

        ax.scatter(*zip(*self.data), alpha=.5, c='k', s=10., lw=0)

        plt.xlim(np.min(X), np.max(X))
        plt.ylim(np.min(Y), np.max(Y))
        plt.grid()
        fig.savefig('true_dist_check.png', format='png')
        plt.close()


class TestDataAny(TestData):

    def __init__(self, fname, partitions=100):

        self.data = self.generate_data(fname=fname)

        self.grid_obj = Grid(self.data, partitions)
        self.grid = self.grid_obj.axis

        self.dist = self.compute_distribution()

        self.check_plot()

    def generate_data(self, fname='data.npy'):

        if os.path.isfile(fname):
            return np.load(fname)
        else:
            raise ValueError('Enter valid path to source data.')

    def check_plot(self):

        X = self.grid[0]
        Y = self.grid[1]

        fig = plt.figure(figsize=(12, 12))
        ax = fig.add_subplot(111)

        ax.scatter(*zip(*self.data), alpha=.5, c='k', s=10., lw=0)

        plt.xlim(np.min(X), np.max(X))
        plt.ylim(np.min(Y), np.max(Y))
        plt.grid()
        fig.savefig('init_data.png', format='png')
        plt.close()


    def check_norm(self):
        pass

    def compute_distribution(self):
        pass

    def evaluate(self, x):
        pass


class CompareDistributions:

    def __init__(self, original, estimate):
        self.P = np.array(original.dist)
        self.Q = np.array(estimate.dist)
        self.grid = original.grid
        self.data = original.data


    def compute_JSD(self):
        """
        Unnormalized Jensen-Shannon divergence between the two grids
        (omits the conventional 1/2 factor; fine for relative comparison).
        """
        if self.P.shape == self.Q.shape:
            suma = 0

            for i in range(len(self.P)):
                for j in range(len(self.P[0])):
                    if self.P[i][j] > 0:
                        suma += self.P[i][j]*np.log(2.*self.P[i][j]/(self.P[i][j] + self.Q[i][j]))

            for i in range(len(self.Q)):
                for j in range(len(self.Q[0])):
                    if self.Q[i][j] > 0:
                        suma += self.Q[i][j]*np.log(2.*self.Q[i][j]/(self.P[i][j] + self.Q[i][j]))

            return suma


    def vizualize_both(self, fname='density_comp.png', show_data=False):
        X = self.grid[0]
        Y = self.grid[1]
        Z1 = self.P
        Z2 = self.Q

        fig = plt.figure(figsize=(12, 12))

        true_dist_params = (211, Z1, 'True Distribution', cm.Greens)
        rf_dist_params = (212, Z2, 'Density Forest Estimate; JSD = %.3f'%(self.compute_JSD()), cm.Blues)

        for sp, Z, title, cmap in [true_dist_params, rf_dist_params]:
            ax = fig.add_subplot(sp)
            vmin = np.min(Z1)
            vmax = np.max(Z1)
            var = plt.pcolormesh(np.array(X), np.array(Y), np.array(Z), cmap=cmap, vmin=vmin, vmax=vmax)
            plt.colorbar(var, ticks=np.arange(vmin, vmax, (vmax-vmin)/8))
            ax = plt.gca()
            gris = 200.0
            ax.set_facecolor((gris/255, gris/255, gris/255))

            if show_data:
                ax.scatter(*zip(*self.data), alpha=.5, c='k', s=10., lw=0)

            ax.set_title(title)
            plt.xlim(np.min(X), np.max(X))
            plt.ylim(np.min(Y), np.max(Y))
            plt.grid()

        fig.savefig(fname, format='png')
        plt.close()

--------------------------------------------------------------------------------
/grid.py:
--------------------------------------------------------------------------------

import numpy as np

class Grid(object):
    """
    Class which defines the boundaries and granularity of the finite grid.
    """

    def __init__(self, data, div):
        self.data = data
        self.perc_padding = .2

        self.partitions, self.axis = self.init_grid(div)
        #self.grid, self.step, self.displ = self.init_grid()
        #self.split_map = self.create_split_map()


    def init_grid(self, div):

        maxi, mini = np.max(self.data, axis=0), np.min(self.data, axis=0)
        step = (maxi - mini)/div

        mini = mini - step * div * self.perc_padding
        maxi = maxi + step * div * self.perc_padding

        grid_range = int(div*(1+2*self.perc_padding))
        axis = [[mini[0] + i * step[0] for i in range(grid_range)], [mini[1] + j * step[1] for j in range(grid_range)]]

        return grid_range, axis

--------------------------------------------------------------------------------
/node.py:
--------------------------------------------------------------------------------

import numpy as np
from df_help import *


class Node(object):
    """
    Base class for the nodes in a decision tree.
    """

    def __init__(self, data, quad, depth):

        self.go_right = None
        self.quad = quad
        self.depth = depth
        self.s_l = len(data)

        self.left = None
        self.right = None


    def add_split(self, value, axis):
        return lambda x: x[axis] > value

    '''
    def leaf_output(self, x):
        """
        Evaluate the density estimation of that leaf on x.
        """
        gauss_arg = np.inner(np.transpose((x-self.mu)), np.inner(self.inv_cov, (x-self.mu)))
        return (np.exp(-.5*gauss_arg))/(2*np.pi*self.sqrt_cov)
    '''
    def check_norm(self, grid_axis):
        """
        Verify leaf integrates to ~ 1.
39 | """ 40 | 41 | dist_vals = [] 42 | 43 | grid_axis_local = grid_axis.copy() 44 | grid_axis_local[0] = grid_axis_local[0][self.quad[0][0]:self.quad[0][1]+1] 45 | grid_axis_local[1] = grid_axis_local[1][self.quad[1][0]:self.quad[1][1]+1] 46 | 47 | deltas = [] 48 | 49 | for v in grid_axis_local: 50 | deltas.append( v[1]-v[0] ) 51 | 52 | for i, x in enumerate(grid_axis_local[0]): 53 | dist_vals.append([]) 54 | for j, y in enumerate(grid_axis_local[1]): 55 | dist_vals[i].append(self.leaf_output(np.array([x, y]))) 56 | 57 | integral = integrate_2d(deltas=deltas, func=dist_vals) 58 | 59 | if not (integral > 0.95 and integral < 1.05): 60 | print('Node of depth %s, norm = %s'%(self.depth, integral)) 61 | 62 | return integral 63 | 64 | 65 | 66 | 67 | 68 | class NodeGauss(Node): 69 | """ 70 | Class for each of the nodes in a decision tree. 71 | """ 72 | 73 | def __init__(self, data, quad, depth, leaf=False): 74 | 75 | super(NodeGauss, self).__init__(data=data, quad=quad, depth=depth) 76 | 77 | self.leaf = leaf 78 | if leaf: 79 | self.cov = np.cov(data, rowvar=False) 80 | # Check cov positive semidef 81 | # np.all(np.linalg.eigvals(np.cov(data, rowvar=False)) > 0) 82 | if np.all(np.linalg.eigvals(np.cov(data, rowvar=False)) > 0): 83 | self.sqrt_cov = np.sqrt(np.linalg.det(self.cov)) 84 | self.inv_cov = np.linalg.inv(self.cov) 85 | self.mu = np.mean(data, axis=0) 86 | else: 87 | self.sqrt_cov = 1 88 | self.inv_cov = np.zeros(self.cov.shape) 89 | self.mu = np.mean(data, axis=0) 90 | 91 | 92 | def leaf_output(self, x): 93 | """ 94 | Evaluate the density estimation of that leaf on x. 95 | """ 96 | gauss_arg = np.inner(np.transpose((x-self.mu)), np.inner(self.inv_cov, (x-self.mu))) 97 | return (np.exp(-.5*gauss_arg))/(2*np.pi*self.sqrt_cov) 98 | 99 | 100 | 101 | def h_rot(x, d): 102 | return math.pow(len(x), -(2.0)/(d+4))*np.var(x, axis=0) + 1e-8 103 | 104 | 105 | class NodeKDE(Node): 106 | """ 107 | Class for each of the nodes in a decision tree. 
108 | """ 109 | 110 | def __init__(self, data, quad, depth, leaf=False): 111 | 112 | super(NodeKDE, self).__init__(data=data, quad=quad, depth=depth) 113 | self.leaf = leaf 114 | 115 | if leaf: 116 | 117 | self.data = data 118 | h = h_rot(data, len(data[0])) 119 | 120 | self.H = [[h[0], 0],[0, h[1]]] 121 | ''' 122 | print('-------------------') 123 | print(data) 124 | print(data.shape) 125 | print(self.H) 126 | ''' 127 | 128 | self.H_inv = np.linalg.inv(self.H) 129 | self.H_inv_sqrt_det = np.sqrt(np.linalg.det(self.H_inv)) 130 | 131 | 132 | 133 | def k_gauss(self, u): 134 | argum = np.sum(u*np.transpose(np.inner(self.H_inv, u)), axis=1) 135 | return np.exp(-.5*argum) 136 | 137 | 138 | def leaf_output(self, x): 139 | 140 | result = self.H_inv_sqrt_det * np.sum(self.k_gauss(x - self.data)) * 1./(2*np.pi*len(self.data)) 141 | return result 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | if __name__ == "__main__": 153 | 154 | H = np.array([[2,0],[0,2]]) 155 | x = np.array([[1,0],[0,1],[1,1],[0,-1],[-1,0]]) 156 | 157 | a = np.inner(np.transpose(x), np.inner(H, x)) 158 | 159 | print(np.inner(H, x).shape) 160 | #np.inner(np.transpose(u), np.inner(self.H_inv, u)) 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | -------------------------------------------------------------------------------- /result_plots/combined_trees.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/combined_trees.png -------------------------------------------------------------------------------- /result_plots/density_comp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/density_comp.png -------------------------------------------------------------------------------- /result_plots/density_comp_kde.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/density_comp_kde.png -------------------------------------------------------------------------------- /result_plots/density_estimation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/density_estimation.png -------------------------------------------------------------------------------- /result_plots/evol.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/evol.png -------------------------------------------------------------------------------- /result_plots/lcurve.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ksanjeevan/randomforest-density-python/445db31bdf2adc0343e4fe463917dd4aaee223f8/result_plots/lcurve.gif -------------------------------------------------------------------------------- /tree.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import pandas as pd 4 | import matplotlib.pyplot as 
import os

from df_help import *


class Tree:
    """
    Class for training and building decision trees.
    """

    def __init__(self, forest_obj, rho, depth=None):

        self.forest_obj = forest_obj
        self.rho = rho

        self.node_class = forest_obj.node_class

        self.s_0 = len(forest_obj.data)

        self.leaf_nodes = []
        self.entropy_gain_evol = []

        self.explore_depth = depth if depth else 0

        self.root_node = self.build_tree()

        self.Zt = None

        self.tree_nodes_depth = self.extract_levels(self.root_node)
        self.tree_nodes_domain = self.extract_domain_splits(self.root_node)

        if not depth:

            self.Zt = self.norm_tree()


    def check_norm(self):

        dist_vals = []

        deltas = []

        for v in self.forest_obj.grid:
            deltas.append( v[1]-v[0] )

        for i, x in enumerate(self.forest_obj.grid[0]):
            dist_vals.append([])
            for j, y in enumerate(self.forest_obj.grid[1]):
                dist_vals[i].append(self.output(np.array([x, y])))

        integral = integrate_2d(deltas=deltas, func=dist_vals)

        return integral


    def norm_tree(self):
        # Tree partition function: sum of leaf weights times leaf integrals.
        Zt = 0

        for l in self.leaf_nodes:

            pi_l = l.s_l / self.s_0
            integral = l.check_norm(self.forest_obj.grid)
            Zt += pi_l * integral

        return Zt


    def output(self, x):
        # Route x to its leaf, then return the weighted, normalized leaf density.
        current_node = self.root_node

        while not current_node.leaf:
            if current_node.go_right(x):
                current_node = current_node.right
            else:
                current_node = current_node.left

        pi_l = current_node.s_l / self.s_0
        return (pi_l/self.Zt)*current_node.leaf_output(x)


    def _compute_det_lamb(self, S):

        if S.shape[0] > 2:
            return self.forest_obj.entropy_func(S)
        return 1e5


    def entropy_gain(self, parent_entropy, S, ind, axis):
        """
        Compute the entropy gain for a given data set, split index and axis
        of application.
        """

        S_right = S[S[:,axis]>=self.forest_obj.grid[axis][ind]]
        S_left = S[S[:,axis]<self.forest_obj.grid[axis][ind]]

        left_entropy = self._compute_det_lamb(S_left)
        right_entropy = self._compute_det_lamb(S_right)

        # Information gain: parent entropy minus the size-weighted child entropies.
        gain = parent_entropy - (len(S_left)*left_entropy + len(S_right)*right_entropy)/len(S)

        return gain, len(S_left), len(S_right)


    def build_tree(self):
        # Start from the full grid domain and split recursively.
        n = len(self.forest_obj.grid[0]) - 1
        return self.split_node(quad=[[0, n], [0, n]], depth=0)


    def _get_local_data(self, quad):
        # Restrict the data to the branch domain delimited by quad.
        right = self.forest_obj.data[:,0] >= self.forest_obj.grid[0][quad[0][0]]
        left = self.forest_obj.data[:,0] < self.forest_obj.grid[0][quad[0][1]]
        top = self.forest_obj.data[:,1] >= self.forest_obj.grid[1][quad[1][0]]
        bottom = self.forest_obj.data[:,1] < self.forest_obj.grid[1][quad[1][1]]

        return self.forest_obj.data[(right)&(left)&(top)&(bottom)]

    def _get_search_space(self, quad):
        # d axis ranges inside branch domain
        x_edge = range(quad[0][0], quad[0][1]+1)
        y_edge = range(quad[1][0], quad[1][1]+1)

        # Apply randomness rho factor to limit parameter space search
        edge = np.array([(z, 0) for z in x_edge] + [(z, 1) for z in y_edge])
        size = len(edge)
        return edge[np.random.choice(size, size=int(size*self.rho), replace=False)]


    def _find_opt_cut(self, ind_array, local_data):
        max_entropy = 0
        opt_ind = -1
        opt_axis = -1

        parent_entropy = self._compute_det_lamb(local_data)

        for ind, axis in ind_array:

            entropy, left_size, right_size = self.entropy_gain(parent_entropy, local_data, ind, axis)

            if entropy > max_entropy and left_size > 2 and right_size > 2:
                max_entropy = entropy
                opt_ind, opt_axis = (ind, axis)

        return max_entropy, opt_ind, opt_axis


    def _get_new_quad(self, old_quad, axis, opt_ind):
| """ 162 | quad: Return 2*d - indexes that delimit branch domain. 163 | Splits branch domain based on optimal index and axis of application. 164 | """ 165 | opt_quad_left = old_quad.copy() 166 | opt_quad_right = old_quad.copy() 167 | 168 | opt_quad_left[axis] = [old_quad[axis][0], opt_ind] 169 | opt_quad_right[axis] = [opt_ind, old_quad[axis][1]] 170 | 171 | return opt_quad_left, opt_quad_right 172 | 173 | 174 | def split_node(self, quad, depth): 175 | """ 176 | Recursively split nodes until stop condition is reached 177 | """ 178 | 179 | # Restrict data to in branch domain 180 | local_data = self._get_local_data(quad) 181 | 182 | # Restrict search space for optimal cut 183 | ind_array = self._get_search_space(quad) 184 | 185 | # Find split with maxiumum entropy gain 186 | 187 | max_entropy, opt_ind, opt_axis = self._find_opt_cut(ind_array, local_data) 188 | 189 | tune_threshold_cond = depth == self.explore_depth 190 | stop_condition = tune_threshold_cond if self.explore_depth else (self.forest_obj.opt_entropy > max_entropy) 191 | 192 | # Stop Condition 193 | if stop_condition or opt_ind == -1: 194 | leaf_node = self.node_class(data=local_data, quad=quad, depth=depth, leaf=True) 195 | 196 | self.leaf_nodes.append( leaf_node ) 197 | return leaf_node 198 | 199 | 200 | self.entropy_gain_evol.append( [depth, max_entropy] ) 201 | 202 | # Split node's quad 203 | node = self.node_class(data=local_data, quad=quad, depth=depth) 204 | node.go_right = node.add_split(self.forest_obj.grid[opt_axis][opt_ind], opt_axis) 205 | 206 | 207 | opt_quad_left, opt_quad_right = self._get_new_quad(quad, opt_axis, opt_ind) 208 | 209 | node.left = self.split_node(quad=opt_quad_left, depth=depth+1) 210 | node.right = self.split_node(quad=opt_quad_right, depth=depth+1) 211 | 212 | return node 213 | 214 | 215 | def extract_levels(self, node): 216 | 217 | if node.left: 218 | 219 | levels_dic_left = self.extract_levels(node.left) 220 | levels_dic_right = self.extract_levels(node.right) 221 | 222 | for k, v in levels_dic_right.items(): 223 | if k in levels_dic_left: 224 | levels_dic_left[k] += v 225 | else: 226 | levels_dic_left[k] = v 227 | 228 | levels_dic_left[node.depth] = [node] 229 | 230 | return levels_dic_left 231 | 232 | else: 233 | return {node.depth : [node]} 234 | 235 | 236 | def extract_domain_splits(self, node): 237 | 238 | dic = {} 239 | count = 0 240 | dic[count] = [node] 241 | 242 | while not all([n.left is None for n in dic[count]]): 243 | 244 | nodes = dic[count] 245 | count += 1 246 | dic[count] = [] 247 | 248 | for k, n in enumerate(nodes): 249 | if n.left: 250 | 251 | dic[count].append( n.left ) 252 | dic[count].append( n.right ) 253 | else: 254 | dic[count].append( n ) 255 | return dic 256 | 257 | 258 | 259 | def domain_splits_plots(self, subpath=''): 260 | path = os.getcwd() + '/evol/' + subpath 261 | mkdir_p(path) 262 | 263 | 264 | evol = pd.DataFrame(self.entropy_gain_evol).groupby(0)[1].mean() 265 | evol = np.array(list(zip(evol.index, evol))) 266 | 267 | 268 | for d in np.arange(len(self.tree_nodes_domain)): 269 | 270 | nodes = self.tree_nodes_domain[d] 271 | fig = plt.figure(figsize=(10,10)) 272 | 273 | ax0 = fig.add_subplot(211) 274 | 275 | ax0.plot(*zip(*evol[:d]), alpha=.8, color='k', ls='-', lw=2.) 276 | ax0.set_title('Entropy gain vs. 
            plt.xlim(np.min(evol[:,0]), np.max(evol[:,0]))
            plt.ylim(np.min(evol[:,1]), np.max(evol[:,1]))

            ax = fig.add_subplot(212)

            for n in nodes:

                #n.check_norm(self.grid.axis)
                [[i1, i2], [j1, j2]] = n.quad
                x1, x2 = self.forest_obj.grid[0][i1], self.forest_obj.grid[0][i2]
                y1, y2 = self.forest_obj.grid[1][j1], self.forest_obj.grid[1][j2]
                ax.fill_between([x1,x2], y1, y2, alpha=.7)

            pd.DataFrame(self.forest_obj.data, columns=['x', 'y']).plot(ax=ax, x='x', y='y', kind='scatter', lw=0, alpha=.6, s=20, c='k')
            plt.savefig(path + 'branches_depth%s.png'%d, format='png')
            plt.close()


    def tree_leaf_plots(self, fname='data.png'):

        path = os.getcwd() + '/plots/'
        mkdir_p(path)

        fig = plt.figure(figsize=(10,10))
        ax = fig.add_subplot(111)

        for n in self.leaf_nodes:

            #n.check_norm(self.grid.axis)

            [[i1, i2], [j1, j2]] = n.quad
            x1, x2 = self.forest_obj.grid[0][i1], self.forest_obj.grid[0][i2]
            y1, y2 = self.forest_obj.grid[1][j1], self.forest_obj.grid[1][j2]

            ax.fill_between([x1,x2], y1, y2, alpha=.7)

        pd.DataFrame(self.forest_obj.data, columns=['x', 'y']).plot(ax=ax, x='x', y='y', kind='scatter', lw=0, alpha=.6, s=20, c='k')
        plt.savefig(path + fname, format='png')
        plt.close()

--------------------------------------------------------------------------------