├── requirements.txt
├── LICENSE
├── README.md
├── gcForest_tuto.ipynb
└── GCForest.py
/requirements.txt:
--------------------------------------------------------------------------------
1 | jupyter>=1.0.0
2 | numpy>=1.12.0
3 | scikit-learn>=0.18.1
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2017 Pierre-Yves Lablanche
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy
4 | of this software and associated documentation files (the "Software"), to deal
5 | in the Software without restriction, including without limitation the rights
6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7 | copies of the Software, and to permit persons to whom the Software is
8 | furnished to do so, subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in all
11 | copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19 | SOFTWARE.
20 |
21 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Deep Forest in Python
2 |
3 | *Status* : not under active development
4 |
5 | ## What's New
6 | * version 0.1.6: corrected `max_features=1` for the completely random forest (correction thanks to [sevenguin](https://github.com/sevenguin)).
7 | * version 0.1.5: remove layer when accuracy gets worse (behavior corrected thanks to [felixwzh](https://github.com/felixwzh)).
8 | * version 0.1.4: faster slicing method.
9 |
10 | ## Presentation
11 | **gcForest** is a deep forest algorithm suggested in Zhou and Feng 2017 ( [https://arxiv.org/abs/1702.08835](https://arxiv.org/abs/1702.08835) ). It uses a multi-grain scanning approach for data slicing and a cascade structure of multiple random forest layers (see the paper for details).
12 |
13 | The present **gcForest** implementation was first developed as a classifier and is designed so that the multi-grain scanning module and the cascade structure can be used separately. During development I paid special attention to writing the code so that future parallelization should be straightforward to implement.
14 |
15 | You can find the official release of the code used in Zhou and Feng 2017 [here](https://github.com/kingfengji/gcforest).
16 |
17 | ## Prerequisites
18 |
19 | The present code has been developed under Python 3.x. You will need the following installed on your computer to make it work:
20 |
21 | * Python 3.x
22 | * Numpy >= 1.12.0
23 | * Scikit-learn >= 0.18.1
24 | * jupyter >= 1.0.0 (only needed to run the tutorial notebook)
25 |
26 | You can install all of them using `pip`:
27 |
28 | ```sh
29 | $ pip3 install -r requirements.txt
30 | ```
31 |
32 | ## Using gcForest
33 |
34 | The syntax follows the scikit-learn style, with a `.fit()` function to train the algorithm and a `.predict()` function to predict the class of new samples. You can find two examples in the jupyter notebook included in the repository.
35 |
36 | ```python
37 | from GCForest import *
38 | gcf = gcForest(**kwargs)
39 | gcf.fit(X_train, y_train)
40 | gcf.predict(X_test)
41 | ```
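
For a concrete end-to-end run, here is a minimal sketch on the iris data set, adapted from the tutorial notebook (the `test_size` split is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from GCForest import gcForest

# load and split the data
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33)

# shape_1X=4 : every iris sample is a 1D sequence of 4 features
# window=2   : Multi-Grain Scanning with a sliding window of size 2
gcf = gcForest(shape_1X=4, window=2, tolerance=0.0)
gcf.fit(X_tr, y_tr)

pred = gcf.predict(X_te)
print('gcForest accuracy : {}'.format(accuracy_score(y_true=y_te, y_pred=pred)))
```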
42 |
43 | ## Saving and Loading Models
44 |
45 | Using `sklearn.externals.joblib` you can easily save your model to disk and load it later. Just proceed as follows.
46 | To save:
47 | ```python
48 | from sklearn.externals import joblib
49 | joblib.dump(gcf, 'name_of_file.sav')
50 | ```
51 | To load:
52 | ```python
53 | gcf = joblib.load('name_of_file.sav')
54 | ```
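
Note that `sklearn.externals.joblib` has since been deprecated and removed in recent scikit-learn releases (0.23 onwards); with a modern environment, use the standalone `joblib` package instead:

```python
# pip install joblib
import joblib

joblib.dump(gcf, 'name_of_file.sav')   # save
gcf = joblib.load('name_of_file.sav')  # load
```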
55 |
56 | ## Notes
57 | I wrote the code from scratch in two days and, even though I have tested it on several cases, I obviously cannot certify that it is 100% bug free.
58 | **Feel free to test it and send me your feedback about any improvement and/or modification!**
59 |
60 | ### Known Issues
61 |
62 | **Memory consumption when slicing data**
63 | The notebook now includes a short naive calculation illustrating the issue.
64 | So far the input data slicing is done in a single step to train the Random Forests for the Multi-Grain Scanning. The problem is that this can require a lot of memory, depending on the size of the data set and the number of slices requested, resulting in memory crashes (at least on my Intel Core 2 Duo).
65 | *The memory consumption when slicing data is more complicated than it seems. A related StackOverflow post can be found [here](https://stackoverflow.com/questions/43822413/numpy-minimum-memory-usage-when-slicing-images).
66 | The main problem is that the sliced array is not contiguous with the original data, forcing a copy to be made in memory.*
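
As a rough sanity check before training, you can estimate the size of the sliced array from the formulas in the notebook. The helper below is purely illustrative (its name and defaults are mine, not part of the library):

```python
def sliced_size_mb(n_samples, shape_1X, window, stride=1, itemsize=8):
    """Naive size estimate (in MB) of the Multi-Grain Scanning sliced data
    for a square window, following the calculation in the tutorial notebook."""
    l, L = shape_1X
    n_slices = ((l - window) // stride + 1) * ((L - window) // stride + 1)
    return n_samples * n_slices * window**2 * itemsize / 1e6

# e.g. scikit-learn digits: 1797 samples of 8x8 pixels, 4x4 window, stride 1
print(sliced_size_mb(1797, (8, 8), window=4))  # ~5.75 MB of float64
```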
67 |
68 | **OOB score error**
69 | During the Random Forests training, the Out-Of-Bag (OOB) technique is used for the prediction probabilities. It was found that this technique can sometimes raise an error when one or several samples are used in the training of all trees.
70 | *A potential solution is to use cross-validation instead of the OOB score, although this slows down the training. Otherwise, simply increasing the number of trees and re-running the training (and crossing fingers) is often enough.*
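
If you run into this error, a sketch of the cross-validation alternative, using scikit-learn's `cross_val_predict` to obtain out-of-fold probabilities instead of the OOB decision function, could look like this (`X_train`/`y_train` stand for your sliced training data; this is not part of the present implementation):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# out-of-fold class probabilities: slower than OOB,
# but every sample is guaranteed to get a prediction
rf = RandomForestClassifier(n_estimators=101, max_features='sqrt')
pred_prob = cross_val_predict(rf, X_train, y_train, cv=3, method='predict_proba')
```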
71 |
72 | ## Built With
73 |
74 | * [PyCharm](https://www.jetbrains.com/pycharm/) community edition
75 | * ``memory_profiler`` library
76 |
77 | ## License
78 | This project is licensed under the MIT License (see `LICENSE` for details).
79 |
80 |
81 |
82 | ## Early Results
83 | (will be updated as new results come out)
84 | 
85 | * Scikit-learn handwritten digits classification:
86 |     * training time ~ 5 min
87 |     * accuracy ~ 98%
88 |
--------------------------------------------------------------------------------
/gcForest_tuto.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
The gcForest algorithm was suggested in Zhou and Feng 2017 ( https://arxiv.org/abs/1702.08835 , refer to this paper for technical details) and I provide here a python3 implementation of this algorithm.
\n",
10 | "I chose to adopt the scikit-learn syntax for ease of use and hereafter I present how it can be used.
*Note*: I recommend reading this section with the original paper next to you to see what I am talking about.
\n", 33 | "The main technical problem in the present gcForest implementation so far is the memory usage when slicing the input data.\n", 34 | "A naive calculation can actually give you an idea of the number and sizes of objects the algorithm will be dealing with.
\n", 35 | "Starting with a dataset of $N$ samples of size $[l,L]$ and with $C$ classes, the initial \"size\" is:
\n",
36 | "$S_{D} = N.l.L$
**Slicing Step**
\n",
38 | "If my window is of size $[w_l,w_L]$ and the chosen stride are $[s_l,s_L]$ then the number of slices per sample is :
\n",
39 | "
\n",
40 | "$n_{slices} = \\left(\\frac{l-w_l}{s_l}+1\\right)\\left(\\frac{L-w_L}{s_L}+1\\right)$
\n",
41 | "Obviously the size of slice is $w_l.w_L$ hence the total size of the sliced data set is :
\n",
42 | "$S_{sliced} = N.w_l.w_L.\\left(\\frac{l-w_l}{s_l}+1\\right)\\left(\\frac{L-w_L}{s_L}+1\\right)$
\n",
43 | "This is when the memory consumption is its peak maximum.
**Class Vector after Multi-Grain Scanning**
\n",
45 | "Now all slices are fed to the random forest to generate *class vectors*.\n",
46 | "The number of class vector per random forest per window per sample is simply equal to the number of slices given to the random forest $n_{cv}(w) = n_{slices}(w)$.\n",
47 | "Hence, if we have $N_{RF}$ random forest per window the size of a class vector is (recall we have $N$ samples and $C$ classes):
\n",
48 | "$S_{cv}(w) = N.n_{cv}(w).N_{RF}.C$
\n",
49 | "And finally the total size of the Multi-Grain Scanning output will be:
\n",
50 | "$S_{mgs} = N.\\sum_{w} N_{RF}.C.n_{cv}(w)$\n",
51 | "
This short calculation is just meant to give you an idea of the data processing during the Multi-Grain Scanning phase. The actual memory consumption depends on the data type used (float, int, double, etc.) and it might be worth looking at it carefully when dealing with large datasets.
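
To make the formulas concrete, here is a small indicative computation for the digits set used later in this notebook (assumed values: 8x8 images, a 4x4 window, stride 1, 10 classes, 2 forests per window):

```python
N, l, L, C = 1797, 8, 8, 10   # samples, image size, classes
w, s, N_RF = 4, 1, 2          # window, stride, forests per window

n_slices = ((l - w)//s + 1) * ((L - w)//s + 1)   # 25 slices per sample
S_sliced = N * w * w * n_slices                  # 718,800 values at the slicing peak
S_cv     = N * n_slices * N_RF * C               # 898,500 values after scanning
```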
" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "The iris data set is actually not a very good example as the gcForest algorithm is better suited for time series and images where informations can be found at different scales in one sample.
\n",
62 | "Nonetheless it is still an easy way to test the method.
First calling and training the algorithm.\n",
85 | "A specificity here is the presence of the 'shape_1X' keyword to specify the shape of a single sample.\n",
86 | "I have added it as pictures fed to the machinery might not be square.
\n",
87 | "Obviously it is not very relevant for the iris data set but still, it has to be defined.
**New in version 0.1.3** : possibility to directly use an int as shape_1X for sequence data.
" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 3, 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "name": "stdout", 98 | "output_type": "stream", 99 | "text": [ 100 | "Slicing Sequence...\n", 101 | "Training MGS Random Forests...\n", 102 | "Adding/Training Layer, n_layer=1\n", 103 | "Layer validation accuracy = 1.0\n", 104 | "Adding/Training Layer, n_layer=2\n", 105 | "Layer validation accuracy = 1.0\n" 106 | ] 107 | } 108 | ], 109 | "source": [ 110 | "gcf = gcForest(shape_1X=4, window=2, tolerance=0.0)\n", 111 | "gcf.fit(X_tr, y_tr)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "Now checking the prediction for the test set:
" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 4, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "Slicing Sequence...\n", 131 | "[0 1 0 2 1 2 1 0 0 1 0 2 2 2 2 2 1 0 1 0 2 2 2 0 0 0 0 2 0 2 0 0 2 0 1 0 0\n", 132 | " 1 1 2 2 1 2 1 0 0 2 1 0 2]\n" 133 | ] 134 | } 135 | ], 136 | "source": [ 137 | "pred_X = gcf.predict(X_te)\n", 138 | "print(pred_X)" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 5, 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "gcForest accuracy : 0.96\n" 151 | ] 152 | } 153 | ], 154 | "source": [ 155 | "# evaluating accuracy\n", 156 | "accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)\n", 157 | "print('gcForest accuracy : {}'.format(accuracy))" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "
A much better example is the digits data set containing images of hand written digits.\n", 166 | "The scikit data set can be viewed as a mini-MNIST for training purpose.
" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 6, 172 | "metadata": { 173 | "collapsed": true 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "# loading the data\n", 178 | "digits = load_digits()\n", 179 | "X = digits.data\n", 180 | "y = digits.target\n", 181 | "X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4)" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "... taining gcForest ... (can take some time...)
" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 7, 194 | "metadata": {}, 195 | "outputs": [ 196 | { 197 | "name": "stdout", 198 | "output_type": "stream", 199 | "text": [ 200 | "Slicing Images...\n", 201 | "Training MGS Random Forests...\n", 202 | "Slicing Images...\n", 203 | "Training MGS Random Forests...\n", 204 | "Adding/Training Layer, n_layer=1\n", 205 | "Layer validation accuracy = 0.9861111111111112\n", 206 | "Adding/Training Layer, n_layer=2\n", 207 | "Layer validation accuracy = 0.9861111111111112\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "gcf = gcForest(shape_1X=[8,8], window=[4,6], tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)\n", 213 | "gcf.fit(X_tr, y_tr)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "... and predicting classes ...
" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 8, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "name": "stdout", 230 | "output_type": "stream", 231 | "text": [ 232 | "Slicing Images...\n", 233 | "Slicing Images...\n", 234 | "[0 1 0 2 4 3 7 1 4 6 9 5 5 2 6 4 7 7 5 2 1 6 8 6 4 3 3 3 0 9 3 7 3 4 5 2 5\n", 235 | " 4 2 1 3 6 9 8 5 8 6 2 2 7 8 6 0 3 0 9 5 0 4 6 7 0 3 8 0 6 9 1 8 3 9 1 1 3\n", 236 | " 7 3 5 6 6 6 9 4 0 8 7 3 0 0 4 2 7 0 4 7 9 7 9 6 4 4 7 0 2 2 5 9 4 5 7 7 7\n", 237 | " 7 0 6 0 1 6 3 9 4 7 3 1 1 3 6 3 0 3 8 8 1 9 7 6 4 7 5 5 3 5 4 8 3 4 4 6 4\n", 238 | " 0 5 9 6 8 5 1 3 1 9 9 5 4 0 4 1 1 7 9 2 3 5 5 6 5 4 6 3 7 5 1 9 6 8 1 7 4\n", 239 | " 9 7 3 0 7 3 6 5 9 7 7 9 7 2 3 8 8 7 7 9 4 5 9 0 7 6 1 4 5 7 6 6 0 4 5 9 3\n", 240 | " 6 4 0 6 4 6 4 9 1 3 7 3 4 8 8 0 7 4 1 5 5 8 2 0 7 7 6 4 5 8 1 2 0 4 5 7 4\n", 241 | " 8 5 4 5 8 3 5 9 6 2 7 6 7 1 3 4 1 0 6 9 4 0 7 6 0 0 9 2 7 4 3 9 3 9 0 4 1\n", 242 | " 0 2 9 0 9 0 6 2 4 0 1 6 9 0 2 0 9 1 1 5 3 4 2 9 6 1 6 1 3 9 1 1 8 6 4 9 3\n", 243 | " 5 5 2 1 4 8 8 2 2 9 0 9 4 4 2 0 6 2 6 1 7 4 6 7 7 1 5 6 9 7 2 6 0 5 8 7 5\n", 244 | " 8 4 2 6 9 2 5 6 5 7 4 3 1 2 7 9 5 3 0 2 6 9 9 1 6 0 9 4 5 7 8 4 8 9 6 5 6\n", 245 | " 6 2 1 4 3 3 5 8 2 0 6 5 9 8 7 4 8 6 1 0 1 8 4 3 9 4 5 5 7 9 8 6 1 7 0 8 2\n", 246 | " 5 2 5 9 8 4 0 8 7 3 9 7 4 5 5 3 7 4 7 6 0 8 0 9 5 6 0 3 1 6 7 6 4 6 4 7 6\n", 247 | " 0 0 6 7 7 2 2 9 8 9 8 8 5 5 4 1 4 6 3 8 8 0 0 1 4 4 7 3 3 4 7 1 3 3 9 2 4\n", 248 | " 0 9 4 9 1 5 1 4 4 5 8 5 7 6 8 8 0 8 1 4 6 0 2 0 1 6 0 3 0 8 1 2 3 2 5 0 4\n", 249 | " 0 2 1 9 2 3 2 6 3 9 9 2 6 8 3 0 8 0 5 8 4 2 2 4 6 5 7 1 2 6 9 1 2 8 8 7 5\n", 250 | " 9 1 5 0 2 2 5 6 7 0 7 3 5 5 2 2 7 1 1 7 0 3 7 4 2 1 3 2 9 1 7 4 6 3 1 3 2\n", 251 | " 7 6 9 2 0 4 6 3 0 7 1 8 4 6 0 7 3 6 3 0 6 3 1 1 5 0 8 9 2 3 0 5 5 3 9 0 9\n", 252 | " 9 8 6 8 7 8 8 3 1 6 1 4 2 6 0 4 1 4 3 2 1 4 8 8 2 3 3 2 3 9 4 5 3 7 4 2 2\n", 253 | " 9 2 8 0 9 3 9 3 2 7 1 6 0 7 0 8]\n" 254 | ] 255 | } 256 | ], 257 | "source": [ 258 | "pred_X = gcf.predict(X_te)\n", 259 | "print(pred_X)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 9, 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "name": "stdout", 269 | "output_type": "stream", 270 | "text": [ 271 | "gcForest accuracy : 0.9847009735744089\n" 272 | ] 273 | } 274 | ], 275 | "source": [ 276 | "# evaluating accuracy\n", 277 | "accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)\n", 278 | "print('gcForest accuracy : {}'.format(accuracy))" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "You probably don't want to re-train your classifier every day especially if you're using it on large data sets.\n", 287 | "Fortunately there is a very easy way to save and load models to disk using ```sklearn.externals.joblib```
\n", 288 | "__Saving model:__
" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 10, 294 | "metadata": {}, 295 | "outputs": [ 296 | { 297 | "data": { 298 | "text/plain": [ 299 | "['gcf_model.sav']" 300 | ] 301 | }, 302 | "execution_count": 10, 303 | "metadata": {}, 304 | "output_type": "execute_result" 305 | } 306 | ], 307 | "source": [ 308 | "from sklearn.externals import joblib\n", 309 | "joblib.dump(gcf, 'gcf_model.sav')" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "__Loading model__:
" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": 11, 322 | "metadata": { 323 | "collapsed": true 324 | }, 325 | "outputs": [], 326 | "source": [ 327 | "gcf = joblib.load('gcf_model.sav')" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "As the Multi-Grain scanning and the cascade forest modules are quite independent it is possible to use them seperately.
\n",
336 | "If a target `y` is given the code automaticcaly use it for training otherwise it recalls the last trained Random Forests to slice the data.
It is now possible to use the mg_scanning output as input for cascade forests using different parameters. Note that the cascade forest module does not directly return predictions but probability predictions from each Random Forest in the last layer of the cascade. Hence the need to first take the mean of the output and then find the max.
" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 14, 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "name": "stdout", 389 | "output_type": "stream", 390 | "text": [ 391 | "Adding/Training Layer, n_layer=1\n", 392 | "Layer validation accuracy = 0.9814814814814815\n", 393 | "Adding/Training Layer, n_layer=2\n", 394 | "Layer validation accuracy = 0.9814814814814815\n" 395 | ] 396 | } 397 | ], 398 | "source": [ 399 | "gcf = gcForest(tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)\n", 400 | "_ = gcf.cascade_forest(X_tr_mgs, y_tr)" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 15, 406 | "metadata": {}, 407 | "outputs": [ 408 | { 409 | "data": { 410 | "text/plain": [ 411 | "0.97635605006954107" 412 | ] 413 | }, 414 | "execution_count": 15, 415 | "metadata": {}, 416 | "output_type": "execute_result" 417 | } 418 | ], 419 | "source": [ 420 | "pred_proba = gcf.cascade_forest(X_te_mgs)\n", 421 | "tmp = np.mean(pred_proba, axis=0)\n", 422 | "preds = np.argmax(tmp, axis=1)\n", 423 | "accuracy_score(y_true=y_te, y_pred=preds)" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 16, 429 | "metadata": {}, 430 | "outputs": [ 431 | { 432 | "name": "stdout", 433 | "output_type": "stream", 434 | "text": [ 435 | "Adding/Training Layer, n_layer=1\n", 436 | "Layer validation accuracy = 0.9675925925925926\n", 437 | "Adding/Training Layer, n_layer=2\n", 438 | "Layer validation accuracy = 0.9722222222222222\n", 439 | "Adding/Training Layer, n_layer=3\n", 440 | "Layer validation accuracy = 0.9722222222222222\n" 441 | ] 442 | } 443 | ], 444 | "source": [ 445 | "gcf = gcForest(tolerance=0.0, min_samples_mgs=20, min_samples_cascade=10)\n", 446 | "_ = gcf.cascade_forest(X_tr_mgs, y_tr)" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": 17, 452 | "metadata": {}, 453 | "outputs": [ 454 | { 455 | "data": { 456 | "text/plain": [ 457 | "0.97635605006954107" 458 | ] 459 | }, 460 | "execution_count": 17, 461 | "metadata": {}, 462 | "output_type": "execute_result" 463 | } 464 | ], 465 | "source": [ 466 | "pred_proba = gcf.cascade_forest(X_te_mgs)\n", 467 | "tmp = np.mean(pred_proba, axis=0)\n", 468 | "preds = np.argmax(tmp, axis=1)\n", 469 | "accuracy_score(y_true=y_te, y_pred=preds)" 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": { 475 | "collapsed": true 476 | }, 477 | "source": [ 478 | "It is also possible to directly use the cascade forest and skip the multi grain scanning step.
" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": 18, 485 | "metadata": {}, 486 | "outputs": [ 487 | { 488 | "name": "stdout", 489 | "output_type": "stream", 490 | "text": [ 491 | "Adding/Training Layer, n_layer=1\n", 492 | "Layer validation accuracy = 0.9583333333333334\n", 493 | "Adding/Training Layer, n_layer=2\n", 494 | "Layer validation accuracy = 0.9583333333333334\n" 495 | ] 496 | } 497 | ], 498 | "source": [ 499 | "gcf = gcForest(tolerance=0.0, min_samples_cascade=20)\n", 500 | "_ = gcf.cascade_forest(X_tr, y_tr)" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": 19, 506 | "metadata": {}, 507 | "outputs": [ 508 | { 509 | "data": { 510 | "text/plain": [ 511 | "0.95827538247566069" 512 | ] 513 | }, 514 | "execution_count": 19, 515 | "metadata": {}, 516 | "output_type": "execute_result" 517 | } 518 | ], 519 | "source": [ 520 | "pred_proba = gcf.cascade_forest(X_te)\n", 521 | "tmp = np.mean(pred_proba, axis=0)\n", 522 | "preds = np.argmax(tmp, axis=1)\n", 523 | "accuracy_score(y_true=y_te, y_pred=preds)" 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": null, 529 | "metadata": { 530 | "collapsed": true 531 | }, 532 | "outputs": [], 533 | "source": [] 534 | } 535 | ], 536 | "metadata": { 537 | "kernelspec": { 538 | "display_name": "Python 3", 539 | "language": "python", 540 | "name": "python3" 541 | }, 542 | "language_info": { 543 | "codemirror_mode": { 544 | "name": "ipython", 545 | "version": 3 546 | }, 547 | "file_extension": ".py", 548 | "mimetype": "text/x-python", 549 | "name": "python", 550 | "nbconvert_exporter": "python", 551 | "pygments_lexer": "ipython3", 552 | "version": "3.5.2" 553 | } 554 | }, 555 | "nbformat": 4, 556 | "nbformat_minor": 2 557 | } 558 | -------------------------------------------------------------------------------- /GCForest.py: -------------------------------------------------------------------------------- 1 | #!usr/bin/env python 2 | """ 3 | Version : 0.1.6 4 | Date : 15th April 2017 5 | 6 | Author : Pierre-Yves Lablanche 7 | Email : plablanche@aims.ac.za 8 | Affiliation : African Institute for Mathematical Sciences - South Africa 9 | Stellenbosch University - South Africa 10 | 11 | License : MIT 12 | 13 | Status : Not Under Active Development 14 | 15 | Description : 16 | Python3 implementation of the gcForest algorithm preesented in Zhou and Feng 2017 17 | (paper can be found here : https://arxiv.org/abs/1702.08835 ). 18 | It uses the typical scikit-learn syntax with a .fit() function for training 19 | and a .predict() function for predictions. 20 | 21 | """ 22 | import itertools 23 | import numpy as np 24 | from sklearn.ensemble import RandomForestClassifier 25 | from sklearn.model_selection import train_test_split 26 | from sklearn.metrics import accuracy_score 27 | 28 | __author__ = "Pierre-Yves Lablanche" 29 | __email__ = "plablanche@aims.ac.za" 30 | __license__ = "MIT" 31 | __version__ = "0.1.6" 32 | #__status__ = "Development" 33 | 34 | 35 | # noinspection PyUnboundLocalVariable 36 | class gcForest(object): 37 | 38 | def __init__(self, shape_1X=None, n_mgsRFtree=30, window=None, stride=1, 39 | cascade_test_size=0.2, n_cascadeRF=2, n_cascadeRFtree=101, cascade_layer=np.inf, 40 | min_samples_mgs=0.1, min_samples_cascade=0.05, tolerance=0.0, n_jobs=1): 41 | """ gcForest Classifier. 42 | 43 | :param shape_1X: int or tuple list or np.array (default=None) 44 | Shape of a single sample element [n_lines, n_cols]. Required when calling mg_scanning! 
45 | For sequence data a single int can be given. 46 | 47 | :param n_mgsRFtree: int (default=30) 48 | Number of trees in a Random Forest during Multi Grain Scanning. 49 | 50 | :param window: int (default=None) 51 | List of window sizes to use during Multi Grain Scanning. 52 | If 'None' no slicing will be done. 53 | 54 | :param stride: int (default=1) 55 | Step used when slicing the data. 56 | 57 | :param cascade_test_size: float or int (default=0.2) 58 | Split fraction or absolute number for cascade training set splitting. 59 | 60 | :param n_cascadeRF: int (default=2) 61 | Number of Random Forests in a cascade layer. 62 | For each pseudo Random Forest a complete Random Forest is created, hence 63 | the total numbe of Random Forests in a layer will be 2*n_cascadeRF. 64 | 65 | :param n_cascadeRFtree: int (default=101) 66 | Number of trees in a single Random Forest in a cascade layer. 67 | 68 | :param min_samples_mgs: float or int (default=0.1) 69 | Minimum number of samples in a node to perform a split 70 | during the training of Multi-Grain Scanning Random Forest. 71 | If int number_of_samples = int. 72 | If float, min_samples represents the fraction of the initial n_samples to consider. 73 | 74 | :param min_samples_cascade: float or int (default=0.1) 75 | Minimum number of samples in a node to perform a split 76 | during the training of Cascade Random Forest. 77 | If int number_of_samples = int. 78 | If float, min_samples represents the fraction of the initial n_samples to consider. 79 | 80 | :param cascade_layer: int (default=np.inf) 81 | mMximum number of cascade layers allowed. 82 | Useful to limit the contruction of the cascade. 83 | 84 | :param tolerance: float (default=0.0) 85 | Accuracy tolerance for the casacade growth. 86 | If the improvement in accuracy is not better than the tolerance the construction is 87 | stopped. 88 | 89 | :param n_jobs: int (default=1) 90 | The number of jobs to run in parallel for any Random Forest fit and predict. 91 | If -1, then the number of jobs is set to the number of cores. 92 | """ 93 | setattr(self, 'shape_1X', shape_1X) 94 | setattr(self, 'n_layer', 0) 95 | setattr(self, '_n_samples', 0) 96 | setattr(self, 'n_cascadeRF', int(n_cascadeRF)) 97 | if isinstance(window, int): 98 | setattr(self, 'window', [window]) 99 | elif isinstance(window, list): 100 | setattr(self, 'window', window) 101 | setattr(self, 'stride', stride) 102 | setattr(self, 'cascade_test_size', cascade_test_size) 103 | setattr(self, 'n_mgsRFtree', int(n_mgsRFtree)) 104 | setattr(self, 'n_cascadeRFtree', int(n_cascadeRFtree)) 105 | setattr(self, 'cascade_layer', cascade_layer) 106 | setattr(self, 'min_samples_mgs', min_samples_mgs) 107 | setattr(self, 'min_samples_cascade', min_samples_cascade) 108 | setattr(self, 'tolerance', tolerance) 109 | setattr(self, 'n_jobs', n_jobs) 110 | 111 | def fit(self, X, y): 112 | """ Training the gcForest on input data X and associated target y. 113 | 114 | :param X: np.array 115 | Array containing the input samples. 116 | Must be of shape [n_samples, data] where data is a 1D array. 117 | 118 | :param y: np.array 119 | 1D array containing the target values. 120 | Must be of shape [n_samples] 121 | """ 122 | if np.shape(X)[0] != len(y): 123 | raise ValueError('Sizes of y and X do not match.') 124 | 125 | mgs_X = self.mg_scanning(X, y) 126 | _ = self.cascade_forest(mgs_X, y) 127 | 128 | def predict_proba(self, X): 129 | """ Predict the class probabilities of unknown samples X. 130 | 131 | :param X: np.array 132 | Array containing the input samples. 
133 | Must be of the same shape [n_samples, data] as the training inputs. 134 | 135 | :return: np.array 136 | 1D array containing the predicted class probabilities for each input sample. 137 | """ 138 | mgs_X = self.mg_scanning(X) 139 | cascade_all_pred_prob = self.cascade_forest(mgs_X) 140 | predict_proba = np.mean(cascade_all_pred_prob, axis=0) 141 | 142 | return predict_proba 143 | 144 | def predict(self, X): 145 | """ Predict the class of unknown samples X. 146 | 147 | :param X: np.array 148 | Array containing the input samples. 149 | Must be of the same shape [n_samples, data] as the training inputs. 150 | 151 | :return: np.array 152 | 1D array containing the predicted class for each input sample. 153 | """ 154 | pred_proba = self.predict_proba(X=X) 155 | predictions = np.argmax(pred_proba, axis=1) 156 | 157 | return predictions 158 | 159 | def mg_scanning(self, X, y=None): 160 | """ Performs a Multi Grain Scanning on input data. 161 | 162 | :param X: np.array 163 | Array containing the input samples. 164 | Must be of shape [n_samples, data] where data is a 1D array. 165 | 166 | :param y: np.array (default=None) 167 | 168 | :return: np.array 169 | Array of shape [n_samples, .. ] containing Multi Grain Scanning sliced data. 170 | """ 171 | setattr(self, '_n_samples', np.shape(X)[0]) 172 | shape_1X = getattr(self, 'shape_1X') 173 | if isinstance(shape_1X, int): 174 | shape_1X = [1,shape_1X] 175 | if not getattr(self, 'window'): 176 | setattr(self, 'window', [shape_1X[1]]) 177 | 178 | mgs_pred_prob = [] 179 | 180 | for wdw_size in getattr(self, 'window'): 181 | wdw_pred_prob = self.window_slicing_pred_prob(X, wdw_size, shape_1X, y=y) 182 | mgs_pred_prob.append(wdw_pred_prob) 183 | 184 | return np.concatenate(mgs_pred_prob, axis=1) 185 | 186 | def window_slicing_pred_prob(self, X, window, shape_1X, y=None): 187 | """ Performs a window slicing of the input data and send them through Random Forests. 188 | If target values 'y' are provided sliced data are then used to train the Random Forests. 189 | 190 | :param X: np.array 191 | Array containing the input samples. 192 | Must be of shape [n_samples, data] where data is a 1D array. 193 | 194 | :param window: int 195 | Size of the window to use for slicing. 196 | 197 | :param shape_1X: list or np.array 198 | Shape of a single sample. 199 | 200 | :param y: np.array (default=None) 201 | Target values. If 'None' no training is done. 202 | 203 | :return: np.array 204 | Array of size [n_samples, ..] containing the Random Forest. 205 | prediction probability for each input sample. 
206 | """ 207 | n_tree = getattr(self, 'n_mgsRFtree') 208 | min_samples = getattr(self, 'min_samples_mgs') 209 | stride = getattr(self, 'stride') 210 | 211 | if shape_1X[0] > 1: 212 | print('Slicing Images...') 213 | sliced_X, sliced_y = self._window_slicing_img(X, window, shape_1X, y=y, stride=stride) 214 | else: 215 | print('Slicing Sequence...') 216 | sliced_X, sliced_y = self._window_slicing_sequence(X, window, shape_1X, y=y, stride=stride) 217 | 218 | if y is not None: 219 | n_jobs = getattr(self, 'n_jobs') 220 | prf = RandomForestClassifier(n_estimators=n_tree, max_features='sqrt', 221 | min_samples_split=min_samples, oob_score=True, n_jobs=n_jobs) 222 | crf = RandomForestClassifier(n_estimators=n_tree, max_features=1, 223 | min_samples_split=min_samples, oob_score=True, n_jobs=n_jobs) 224 | print('Training MGS Random Forests...') 225 | prf.fit(sliced_X, sliced_y) 226 | crf.fit(sliced_X, sliced_y) 227 | setattr(self, '_mgsprf_{}'.format(window), prf) 228 | setattr(self, '_mgscrf_{}'.format(window), crf) 229 | pred_prob_prf = prf.oob_decision_function_ 230 | pred_prob_crf = crf.oob_decision_function_ 231 | 232 | if hasattr(self, '_mgsprf_{}'.format(window)) and y is None: 233 | prf = getattr(self, '_mgsprf_{}'.format(window)) 234 | crf = getattr(self, '_mgscrf_{}'.format(window)) 235 | pred_prob_prf = prf.predict_proba(sliced_X) 236 | pred_prob_crf = crf.predict_proba(sliced_X) 237 | 238 | pred_prob = np.c_[pred_prob_prf, pred_prob_crf] 239 | 240 | return pred_prob.reshape([getattr(self, '_n_samples'), -1]) 241 | 242 | def _window_slicing_img(self, X, window, shape_1X, y=None, stride=1): 243 | """ Slicing procedure for images 244 | 245 | :param X: np.array 246 | Array containing the input samples. 247 | Must be of shape [n_samples, data] where data is a 1D array. 248 | 249 | :param window: int 250 | Size of the window to use for slicing. 251 | 252 | :param shape_1X: list or np.array 253 | Shape of a single sample [n_lines, n_cols]. 254 | 255 | :param y: np.array (default=None) 256 | Target values. 257 | 258 | :param stride: int (default=1) 259 | Step used when slicing the data. 260 | 261 | :return: np.array and np.array 262 | Arrays containing the sliced images and target values (empty if 'y' is None). 263 | """ 264 | if any(s < window for s in shape_1X): 265 | raise ValueError('window must be smaller than both dimensions for an image') 266 | 267 | len_iter_x = np.floor_divide((shape_1X[1] - window), stride) + 1 268 | len_iter_y = np.floor_divide((shape_1X[0] - window), stride) + 1 269 | iterx_array = np.arange(0, stride*len_iter_x, stride) 270 | itery_array = np.arange(0, stride*len_iter_y, stride) 271 | 272 | ref_row = np.arange(0, window) 273 | ref_ind = np.ravel([ref_row + shape_1X[1] * i for i in range(window)]) 274 | inds_to_take = [ref_ind + ix + shape_1X[1] * iy 275 | for ix, iy in itertools.product(iterx_array, itery_array)] 276 | 277 | sliced_imgs = np.take(X, inds_to_take, axis=1).reshape(-1, window**2) 278 | 279 | if y is not None: 280 | sliced_target = np.repeat(y, len_iter_x * len_iter_y) 281 | elif y is None: 282 | sliced_target = None 283 | 284 | return sliced_imgs, sliced_target 285 | 286 | def _window_slicing_sequence(self, X, window, shape_1X, y=None, stride=1): 287 | """ Slicing procedure for sequences (aka shape_1X = [.., 1]). 288 | 289 | :param X: np.array 290 | Array containing the input samples. 291 | Must be of shape [n_samples, data] where data is a 1D array. 292 | 293 | :param window: int 294 | Size of the window to use for slicing. 
295 | 296 | :param shape_1X: list or np.array 297 | Shape of a single sample [n_lines, n_col]. 298 | 299 | :param y: np.array (default=None) 300 | Target values. 301 | 302 | :param stride: int (default=1) 303 | Step used when slicing the data. 304 | 305 | :return: np.array and np.array 306 | Arrays containing the sliced sequences and target values (empty if 'y' is None). 307 | """ 308 | if shape_1X[1] < window: 309 | raise ValueError('window must be smaller than the sequence dimension') 310 | 311 | len_iter = np.floor_divide((shape_1X[1] - window), stride) + 1 312 | iter_array = np.arange(0, stride*len_iter, stride) 313 | 314 | ind_1X = np.arange(np.prod(shape_1X)) 315 | inds_to_take = [ind_1X[i:i+window] for i in iter_array] 316 | sliced_sqce = np.take(X, inds_to_take, axis=1).reshape(-1, window) 317 | 318 | if y is not None: 319 | sliced_target = np.repeat(y, len_iter) 320 | elif y is None: 321 | sliced_target = None 322 | 323 | return sliced_sqce, sliced_target 324 | 325 | def cascade_forest(self, X, y=None): 326 | """ Perform (or train if 'y' is not None) a cascade forest estimator. 327 | 328 | :param X: np.array 329 | Array containing the input samples. 330 | Must be of shape [n_samples, data] where data is a 1D array. 331 | 332 | :param y: np.array (default=None) 333 | Target values. If 'None' perform training. 334 | 335 | :return: np.array 336 | 1D array containing the predicted class for each input sample. 337 | """ 338 | if y is not None: 339 | setattr(self, 'n_layer', 0) 340 | test_size = getattr(self, 'cascade_test_size') 341 | max_layers = getattr(self, 'cascade_layer') 342 | tol = getattr(self, 'tolerance') 343 | 344 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size) 345 | 346 | self.n_layer += 1 347 | prf_crf_pred_ref = self._cascade_layer(X_train, y_train) 348 | accuracy_ref = self._cascade_evaluation(X_test, y_test) 349 | feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref) 350 | 351 | self.n_layer += 1 352 | prf_crf_pred_layer = self._cascade_layer(feat_arr, y_train) 353 | accuracy_layer = self._cascade_evaluation(X_test, y_test) 354 | 355 | while accuracy_layer > (accuracy_ref + tol) and self.n_layer <= max_layers: 356 | accuracy_ref = accuracy_layer 357 | prf_crf_pred_ref = prf_crf_pred_layer 358 | feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref) 359 | self.n_layer += 1 360 | prf_crf_pred_layer = self._cascade_layer(feat_arr, y_train) 361 | accuracy_layer = self._cascade_evaluation(X_test, y_test) 362 | 363 | if accuracy_layer < accuracy_ref : 364 | n_cascadeRF = getattr(self, 'n_cascadeRF') 365 | for irf in range(n_cascadeRF): 366 | delattr(self, '_casprf{}_{}'.format(self.n_layer, irf)) 367 | delattr(self, '_cascrf{}_{}'.format(self.n_layer, irf)) 368 | self.n_layer -= 1 369 | 370 | elif y is None: 371 | at_layer = 1 372 | prf_crf_pred_ref = self._cascade_layer(X, layer=at_layer) 373 | while at_layer < getattr(self, 'n_layer'): 374 | at_layer += 1 375 | feat_arr = self._create_feat_arr(X, prf_crf_pred_ref) 376 | prf_crf_pred_ref = self._cascade_layer(feat_arr, layer=at_layer) 377 | 378 | return prf_crf_pred_ref 379 | 380 | def _cascade_layer(self, X, y=None, layer=0): 381 | """ Cascade layer containing Random Forest estimators. 382 | If y is not None the layer is trained. 383 | 384 | :param X: np.array 385 | Array containing the input samples. 386 | Must be of shape [n_samples, data] where data is a 1D array. 387 | 388 | :param y: np.array (default=None) 389 | Target values. If 'None' perform training. 
390 | 391 | :param layer: int (default=0) 392 | Layer indice. Used to call the previously trained layer. 393 | 394 | :return: list 395 | List containing the prediction probabilities for all samples. 396 | """ 397 | n_tree = getattr(self, 'n_cascadeRFtree') 398 | n_cascadeRF = getattr(self, 'n_cascadeRF') 399 | min_samples = getattr(self, 'min_samples_cascade') 400 | 401 | n_jobs = getattr(self, 'n_jobs') 402 | prf = RandomForestClassifier(n_estimators=n_tree, max_features='sqrt', 403 | min_samples_split=min_samples, oob_score=True, n_jobs=n_jobs) 404 | crf = RandomForestClassifier(n_estimators=n_tree, max_features=1, 405 | min_samples_split=min_samples, oob_score=True, n_jobs=n_jobs) 406 | 407 | prf_crf_pred = [] 408 | if y is not None: 409 | print('Adding/Training Layer, n_layer={}'.format(self.n_layer)) 410 | for irf in range(n_cascadeRF): 411 | prf.fit(X, y) 412 | crf.fit(X, y) 413 | setattr(self, '_casprf{}_{}'.format(self.n_layer, irf), prf) 414 | setattr(self, '_cascrf{}_{}'.format(self.n_layer, irf), crf) 415 | prf_crf_pred.append(prf.oob_decision_function_) 416 | prf_crf_pred.append(crf.oob_decision_function_) 417 | elif y is None: 418 | for irf in range(n_cascadeRF): 419 | prf = getattr(self, '_casprf{}_{}'.format(layer, irf)) 420 | crf = getattr(self, '_cascrf{}_{}'.format(layer, irf)) 421 | prf_crf_pred.append(prf.predict_proba(X)) 422 | prf_crf_pred.append(crf.predict_proba(X)) 423 | 424 | return prf_crf_pred 425 | 426 | def _cascade_evaluation(self, X_test, y_test): 427 | """ Evaluate the accuracy of the cascade using X and y. 428 | 429 | :param X_test: np.array 430 | Array containing the test input samples. 431 | Must be of the same shape as training data. 432 | 433 | :param y_test: np.array 434 | Test target values. 435 | 436 | :return: float 437 | the cascade accuracy. 438 | """ 439 | casc_pred_prob = np.mean(self.cascade_forest(X_test), axis=0) 440 | casc_pred = np.argmax(casc_pred_prob, axis=1) 441 | casc_accuracy = accuracy_score(y_true=y_test, y_pred=casc_pred) 442 | print('Layer validation accuracy = {}'.format(casc_accuracy)) 443 | 444 | return casc_accuracy 445 | 446 | def _create_feat_arr(self, X, prf_crf_pred): 447 | """ Concatenate the original feature vector with the predicition probabilities 448 | of a cascade layer. 449 | 450 | :param X: np.array 451 | Array containing the input samples. 452 | Must be of shape [n_samples, data] where data is a 1D array. 453 | 454 | :param prf_crf_pred: list 455 | Prediction probabilities by a cascade layer for X. 456 | 457 | :return: np.array 458 | Concatenation of X and the predicted probabilities. 459 | To be used for the next layer in a cascade forest. 460 | """ 461 | swap_pred = np.swapaxes(prf_crf_pred, 0, 1) 462 | add_feat = swap_pred.reshape([np.shape(X)[0], -1]) 463 | feat_arr = np.concatenate([add_feat, X], axis=1) 464 | 465 | return feat_arr 466 | --------------------------------------------------------------------------------