├── .gitignore ├── Readme.md ├── run_classification.py ├── run_regression.py ├── universal_params.py └── utilities.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | .DS_store 3 | .DS_Store 4 | __pycache__ 5 | -------------------------------------------------------------------------------- /Readme.md: -------------------------------------------------------------------------------- 1 | # WHAT IT IS 2 | Many believe that 3 | 4 | > most of the work of supervised (non-deep) Machine Learning lies in feature engineering, whereas the model-selection process is just running through all the models with a huge for-loop. 5 | 6 | So one glorious weekend, I decided to type out said loop! 7 | 8 | ![And I thought Machine Learning was hard.](https://user-images.githubusercontent.com/2030548/27264037-f532cb72-542a-11e7-9b37-d776f5a09a08.png) 9 | 10 | # HOW IT WORKS 11 | Runs through all `sklearn` models (both classification and regression) with **all possible hyperparameters**, and ranks them using cross-validation. 12 | 13 | # MODELS 14 | Runs **all the models** available in `sklearn` for supervised learning, listed [here](http://scikit-learn.org/stable/supervised_learning.html). The categories are: 15 | 16 | * Generalized Linear Models 17 | * Kernel Ridge 18 | * Support Vector Machines 19 | * Nearest Neighbors 20 | * Gaussian Processes 21 | * Naive Bayes 22 | * Trees 23 | * Neural Networks 24 | * Ensemble methods 25 | 26 | Note: I skipped GradientTreeBoosting due to sub-par model performance, long run-times and constant convergence issues. I also skipped AdaBoost because it kept raising max_features errors. (Please ping me, or contribute to the repo directly, if you ever get AdaBoost to work.) 27 | 28 | # USAGE 29 | 30 | ### How to run 31 | 1. Feed in `X` (2-D `numpy.array`) and `y` (1-D `numpy.array`). (The code also generates fake data for testing purposes.) 32 | 2. Use `run_classification` or `run_regression` where appropriate. 33 | 34 | 35 | The output looks like this: 36 | 37 | | Model | accuracy | Time/clf (s)| 38 |---------------------------- |:-------------:|:-------------:| 39 |SGDClassifier | 0.967 | 0.001 | 40 |LogisticRegression | 0.940 | 0.001 | 41 |Perceptron | 0.900 | 0.001 | 42 |PassiveAggressiveClassifier | 0.967 | 0.001 | 43 |MLPClassifier | 0.827 | 0.018 | 44 |KMeans | 0.580 | 0.010 | 45 |KNeighborsClassifier | 0.960 | 0.000 | 46 |NearestCentroid | 0.933 | 0.000 | 47 |RadiusNeighborsClassifier | 0.927 | 0.000 | 48 |SVC | 0.960 | 0.000 | 49 |NuSVC | 0.980 | 0.001 | 50 |LinearSVC | 0.940 | 0.005 | 51 |RandomForestClassifier | 0.980 | 0.015 | 52 |DecisionTreeClassifier | 0.960 | 0.000 | 53 |ExtraTreesClassifier | 0.993 | 0.002 | 54 | 55 | *The winner is: ExtraTreesClassifier with score 0.993.* 56 | 57 | ### Knobs 58 | 59 | * Evaluation criteria 60 | 61 | By default, classification uses accuracy and regression uses negative MSE, set by the `scoring` parameter of the `big_loop` function in `utilities.py`. It also accepts any `sklearn` scoring string. 62 | 63 | * Scale 64 | 65 | Because it takes a long time to run through all models and hyperparameters at full-blown scale, there is a "small" and a full version of the hyperparameter grids for almost every model. The "small" ones run much faster by evaluating only the most essential hyperparameters over smaller ranges than the full version. It's controlled by the `small` parameter of all of the `run_all` functions, as in the sketch below.
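For example, a minimal sketch of these two knobs (illustrative only; it assumes `x` and `y` are already loaded as `numpy` arrays):

```python
from run_classification import run_all, tree_models_n_params
from utilities import big_loop

# full grids instead of the "small" ones -- slower but more thorough
best_model, results = run_all(x, y, small=False)

# any sklearn scoring string works, e.g. 'f1_macro' instead of the default 'accuracy'
best_tree, tree_results = big_loop(tree_models_n_params, x, y,
                                   isClassification=True, scoring='f1_macro')
```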
66 | 67 | * Hyperparameters 68 | 69 | You can modify the search space of hyperparameters in `run_regression.py` and `run_classification.py`. 70 | 71 | * Running only a category of models 72 | 73 | Depending on the nature of the problem, certain categories of models work better than others. There are separate functions for each category in `run_regression.py` and `run_classification.py`. 74 | 75 | # TO-DO'S 76 | 77 | Feel free to contribute by hashing out the following: 78 | 79 | * Wrap an ensemble (bagging/boosting) model on top of the best models. 80 | * Multi-target classification (i.e. `y` having multiple columns). 81 | 82 |   83 |   84 |   85 |   86 |   87 | 88 | Oh boy, that was a lot of typing in the past 24 hours! Hopefully it saves you (and myself) some typing in the future. I'm gonna grab some lunch, sip a cold drink and enjoy the California summer heat. :) 89 | Check out more of my pet projects on [planetj.io](planetj.io). 90 | -------------------------------------------------------------------------------- /run_classification.py: -------------------------------------------------------------------------------- 1 | import warnings 2 | warnings.filterwarnings('ignore') 3 | import numpy as np 4 | from sklearn import datasets 5 | from sklearn.linear_model import SGDClassifier, LogisticRegression, \ 6 | Perceptron, PassiveAggressiveClassifier 7 | 8 | from sklearn.preprocessing import StandardScaler 9 | from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier 10 | from sklearn.svm import SVC, LinearSVC, NuSVC 11 | from sklearn.cluster import KMeans 12 | from sklearn.neighbors import KNeighborsClassifier, NearestCentroid, RadiusNeighborsClassifier 13 | from sklearn.gaussian_process import GaussianProcessClassifier 14 | from sklearn.gaussian_process.kernels import RBF, ConstantKernel, DotProduct, Matern, StationaryKernelMixin, WhiteKernel 15 | from sklearn.naive_bayes import GaussianNB 16 | from sklearn.neural_network import MLPClassifier 17 | from sklearn.tree import DecisionTreeClassifier 18 | 19 | from utilities import * 20 | from universal_params import * 21 | 22 | 23 | def gen_classification_data(n=None): 24 | """ 25 | uses the iris data 26 | :return: x, y 27 | """ 28 | 29 | iris = datasets.load_iris() 30 | x = iris.data 31 | y = iris.target 32 | 33 | if n: 34 | half = int(n/2) 35 | x, y = np.concatenate((x[:half], x[-half:]), 0), np.concatenate((y[:half], y[-half:]), 0)  # keep the first and last `half` rows (assign the result; concatenate along axis 0) 36 | 37 | return x, y 38 | 39 | linear_models_n_params = [ 40 | (SGDClassifier, 41 | {'loss': ['hinge', 'log', 'modified_huber', 'squared_hinge'], 42 | 'alpha': [0.0001, 0.001, 0.1], 43 | **penalty_12none 44 | }), 45 | 46 | (LogisticRegression, 47 | {**penalty_12, **max_iter, **tol, **warm_start, **C, 48 | 'solver': ['liblinear'] 49 | }), 50 | 51 | (Perceptron, 52 | {**penalty_all, **alpha, **n_iter, **eta0, **warm_start 53 | }), 54 | 55 | (PassiveAggressiveClassifier, 56 | {**C, **n_iter, **warm_start, 57 | 'loss': ['hinge', 'squared_hinge'], 58 | }) 59 | ] 60 | 61 | linear_models_n_params_small = linear_models_n_params 62 | 63 | svm_models_n_params = [ 64 | (SVC, 65 | {**C, **kernel, **degree, **gamma, **coef0, **shrinking, **tol, **max_iter_inf2}), 66 | 67 | (NuSVC, 68 | {**nu, **kernel, **degree, **gamma, **coef0, **shrinking, **tol 69 | }), 70 | 71 | (LinearSVC, 72 | { **C, **penalty_12, **tol, **max_iter, 73 | 'loss': ['hinge', 'squared_hinge'], 74 | }) 75 | ] 76 | 77 | svm_models_n_params_small = [ 78 | (SVC, 79 | {**kernel, **degree, **shrinking 80 | }), 81 | 82 | (NuSVC, 83 | {**nu_small, **kernel,
**degree, **shrinking 84 | }), 85 | 86 | (LinearSVC, 87 | { **C_small, 88 | 'penalty': ['l2'], 89 | 'loss': ['hinge', 'squared_hinge'], 90 | }) 91 | ] 92 | 93 | neighbor_models_n_params = [ 94 | 95 | (KMeans, 96 | {'algorithm': ['auto', 'full', 'elkan'], 97 | 'init': ['k-means++', 'random']}), 98 | 99 | (KNeighborsClassifier, 100 | {**n_neighbors, **neighbor_algo, **neighbor_leaf_size, **neighbor_metric, 101 | 'weights': ['uniform', 'distance'], 102 | 'p': [1, 2] 103 | }), 104 | 105 | (NearestCentroid, 106 | {**neighbor_metric, 107 | 'shrink_threshold': [1e-3, 1e-2, 0.1, 0.5, 0.9, 2] 108 | }), 109 | 110 | (RadiusNeighborsClassifier, 111 | {**neighbor_radius, **neighbor_algo, **neighbor_leaf_size, **neighbor_metric, 112 | 'weights': ['uniform', 'distance'], 113 | 'p': [1, 2], 114 | 'outlier_label': [-1] 115 | }) 116 | ] 117 | 118 | gaussianprocess_models_n_params = [ 119 | (GaussianProcessClassifier, 120 | {**warm_start, 121 | 'kernel': [RBF(), ConstantKernel(), DotProduct(), WhiteKernel()], 122 | 'max_iter_predict': [500], 123 | 'n_restarts_optimizer': [3], 124 | }) 125 | ] 126 | 127 | bayes_models_n_params = [ 128 | (GaussianNB, {}) 129 | ] 130 | 131 | nn_models_n_params = [ 132 | (MLPClassifier, 133 | { 'hidden_layer_sizes': [(16,), (64,), (100,), (32, 32)], 134 | 'activation': ['identity', 'logistic', 'tanh', 'relu'], 135 | **alpha, **learning_rate, **tol, **warm_start, 136 | 'batch_size': ['auto', 50], 137 | 'max_iter': [1000], 138 | 'early_stopping': [True, False], 139 | 'epsilon': [1e-8, 1e-5] 140 | }) 141 | ] 142 | 143 | nn_models_n_params_small = [ 144 | (MLPClassifier, 145 | { 'hidden_layer_sizes': [(64,), (32, 64)], 146 | 'batch_size': ['auto', 50], 147 | 'activation': ['identity', 'tanh', 'relu'], 148 | 'max_iter': [500], 149 | 'early_stopping': [True], 150 | **learning_rate_small 151 | }) 152 | ] 153 | 154 | tree_models_n_params = [ 155 | 156 | (RandomForestClassifier, 157 | {'criterion': ['gini', 'entropy'], 158 | **max_features, **n_estimators, **max_depth, 159 | **min_samples_split, **min_impurity_split, **warm_start, **min_samples_leaf, 160 | }), 161 | 162 | (DecisionTreeClassifier, 163 | {'criterion': ['gini', 'entropy'], 164 | **max_features, **max_depth, **min_samples_split, **min_impurity_split, **min_samples_leaf 165 | }), 166 | 167 | (ExtraTreesClassifier, 168 | {**n_estimators, **max_features, **max_depth, 169 | **min_samples_split, **min_samples_leaf, **min_impurity_split, **warm_start, 170 | 'criterion': ['gini', 'entropy']}) 171 | ] 172 | 173 | 174 | tree_models_n_params_small = [ 175 | 176 | (RandomForestClassifier, 177 | {**max_features_small, **n_estimators_small, **min_samples_split, **max_depth_small, **min_samples_leaf 178 | }), 179 | 180 | (DecisionTreeClassifier, 181 | {**max_features_small, **max_depth_small, **min_samples_split, **min_samples_leaf 182 | }), 183 | 184 | (ExtraTreesClassifier, 185 | {**n_estimators_small, **max_features_small, **max_depth_small, 186 | **min_samples_split, **min_samples_leaf}) 187 | ] 188 | 189 | 190 | 191 | def run_linear_models(x, y, small = True, normalize_x = True): 192 | return big_loop(linear_models_n_params_small if small else linear_models_n_params, 193 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 194 | 195 | def run_svm_models(x, y, small = True, normalize_x = True): 196 | return big_loop(svm_models_n_params_small if small else svm_models_n_params, 197 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 198 | 199 | def run_neighbor_models(x, 
y, normalize_x = True): 200 | return big_loop(neighbor_models_n_params, 201 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 202 | 203 | def run_gaussian_models(x, y, normalize_x = True): 204 | return big_loop(gaussianprocess_models_n_params, 205 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 206 | 207 | def run_nn_models(x, y, small = True, normalize_x = True): 208 | return big_loop(nn_models_n_params_small if small else nn_models_n_params, 209 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 210 | 211 | def run_tree_models(x, y, small = True, normalize_x = True): 212 | return big_loop(tree_models_n_params_small if small else tree_models_n_params, 213 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 214 | 215 | def run_all(x, y, small = True, normalize_x = True, n_jobs=cpu_count()-1): 216 | 217 | all_params = (linear_models_n_params_small if small else linear_models_n_params) + \ 218 | (nn_models_n_params_small if small else nn_models_n_params) + \ 219 | ([] if small else gaussianprocess_models_n_params) + \ 220 | neighbor_models_n_params + \ 221 | (svm_models_n_params_small if small else svm_models_n_params) + \ 222 | (tree_models_n_params_small if small else tree_models_n_params) 223 | 224 | return big_loop(all_params, 225 | StandardScaler().fit_transform(x) if normalize_x else x, y, 226 | isClassification=True, n_jobs=n_jobs) 227 | 228 | 229 | 230 | if __name__ == '__main__': 231 | 232 | x, y = gen_classification_data() 233 | run_all(x, y, n_jobs=1) 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | -------------------------------------------------------------------------------- /run_regression.py: -------------------------------------------------------------------------------- 1 | import warnings 2 | warnings.filterwarnings('ignore') 3 | from multiprocessing import cpu_count 4 | 5 | # linear models: http://scikit-learn.org/stable/modules/linear_model.html#stochastic-gradient-descent-sgd 6 | from sklearn.linear_model import \ 7 | LinearRegression, Ridge, Lasso, ElasticNet, \ 8 | Lars, LassoLars, \ 9 | OrthogonalMatchingPursuit, \ 10 | BayesianRidge, ARDRegression, \ 11 | SGDRegressor, \ 12 | PassiveAggressiveRegressor, \ 13 | RANSACRegressor, HuberRegressor 14 | 15 | from sklearn.kernel_ridge import KernelRidge 16 | from sklearn.preprocessing import StandardScaler 17 | 18 | # svm models: http://scikit-learn.org/stable/modules/svm.html 19 | from sklearn.svm import SVR, NuSVR, LinearSVR 20 | 21 | # neighbor models: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsRegressor.html#sklearn.neighbors.RadiusNeighborsRegressor 22 | from sklearn.neighbors import RadiusNeighborsRegressor, KNeighborsRegressor 23 | 24 | from sklearn.gaussian_process import GaussianProcessRegressor 25 | from sklearn.gaussian_process.kernels import RBF, ConstantKernel, DotProduct, WhiteKernel 26 | from sklearn.neural_network import MLPRegressor 27 | 28 | from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, RandomForestRegressor 29 | from sklearn.tree import DecisionTreeRegressor 30 | 31 | from utilities import * 32 | from universal_params import * 33 | 34 | 35 | def gen_reg_data(x_mu=10., x_sigma=1., num_samples=100, num_features=3, 36 | y_formula=sum, y_sigma=1.): 37 | """ 38 | generate some fake data for us to work with 39 | :return: x, y 40 | """ 41 | x = 
np.random.normal(x_mu, x_sigma, (num_samples, num_features)) 42 | y = np.apply_along_axis(y_formula, 1, x) + np.random.normal(0, y_sigma, (num_samples,)) 43 | 44 | return x, y 45 | 46 | 47 | linear_models_n_params = [ 48 | (LinearRegression, normalize), 49 | 50 | (Ridge, 51 | {**alpha, **normalize, **tol, 52 | 'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag'] 53 | }), 54 | 55 | (Lasso, 56 | {**alpha, **normalize, **tol, **warm_start 57 | }), 58 | 59 | (ElasticNet, 60 | {**alpha, **normalize, **tol, 61 | 'l1_ratio': [0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9], 62 | }), 63 | 64 | (Lars, 65 | {**normalize, 66 | 'n_nonzero_coefs': [100, 300, 500, np.inf], 67 | }), 68 | 69 | (LassoLars, 70 | {**normalize, **max_iter_inf, **normalize, **alpha 71 | }), 72 | 73 | (OrthogonalMatchingPursuit, 74 | {'n_nonzero_coefs': [100, 300, 500, np.inf, None], 75 | **tol, **normalize 76 | }), 77 | 78 | (BayesianRidge, 79 | { 80 | 'n_iter': [100, 300, 1000], 81 | **tol, **normalize, 82 | 'alpha_1': [1e-6, 1e-4, 1e-2, 0.1, 0], 83 | 'alpha_2': [1e-6, 1e-4, 1e-2, 0.1, 0], 84 | 'lambda_1': [1e-6, 1e-4, 1e-2, 0.1, 0], 85 | 'lambda_2': [1e-6, 1e-4, 1e-2, 0.1, 0], 86 | }), 87 | 88 | # WARNING: ARDRegression takes a long time to run 89 | (ARDRegression, 90 | {'n_iter': [100, 300, 1000], 91 | **tol, **normalize, 92 | 'alpha_1': [1e-6, 1e-4, 1e-2, 0.1, 0], 93 | 'alpha_2': [1e-6, 1e-4, 1e-2, 0.1, 0], 94 | 'lambda_1': [1e-6, 1e-4, 1e-2, 0.1, 0], 95 | 'lambda_2': [1e-6, 1e-4, 1e-2, 0.1, 0], 96 | 'threshold_lambda': [1e2, 1e3, 1e4, 1e6]}), 97 | 98 | (SGDRegressor, 99 | {'loss': ['squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'], 100 | **penalty_12e, **n_iter, **epsilon, **eta0, 101 | 'alpha': [1e-6, 1e-5, 1e-2, 'optimal'], 102 | 'l1_ratio': [0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9], 103 | 'learning_rate': ['constant', 'optimal', 'invscaling'], 104 | 'power_t': [0.1, 0.25, 0.5] 105 | }), 106 | 107 | (PassiveAggressiveRegressor, 108 | {**C, **epsilon, **n_iter, **warm_start, 109 | 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'] 110 | }), 111 | 112 | (RANSACRegressor, 113 | {'min_samples': [0.1, 0.5, 0.9, None], 114 | 'max_trials': n_iter['n_iter'], 115 | 'stop_score': [0.8, 0.9, 1], 116 | 'stop_probability': [0.9, 0.95, 0.99, 1], 117 | 'loss': ['absolute_loss', 'squared_loss'] 118 | }), 119 | 120 | (HuberRegressor, 121 | { 'epsilon': [1.1, 1.35, 1.5, 2], 122 | **max_iter, **alpha, **warm_start, **tol 123 | }), 124 | 125 | (KernelRidge, 126 | {**alpha, **degree, **gamma, **coef0 127 | }) 128 | ] 129 | 130 | linear_models_n_params_small = [ 131 | (LinearRegression, normalize), 132 | 133 | (Ridge, 134 | {**alpha_small, **normalize 135 | }), 136 | 137 | (Lasso, 138 | {**alpha_small, **normalize 139 | }), 140 | 141 | (ElasticNet, 142 | {**alpha, **normalize, 143 | 'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9], 144 | }), 145 | 146 | (Lars, 147 | {**normalize, 148 | 'n_nonzero_coefs': [100, 300, 500, np.inf], 149 | }), 150 | 151 | (LassoLars, 152 | {**normalize, **max_iter_inf, **normalize, **alpha_small 153 | }), 154 | 155 | (OrthogonalMatchingPursuit, 156 | {'n_nonzero_coefs': [100, 300, 500, np.inf, None], 157 | **normalize 158 | }), 159 | 160 | (BayesianRidge, 161 | { 'n_iter': [100, 300, 1000], 162 | 'alpha_1': [1e-6, 1e-3], 163 | 'alpha_2': [1e-6, 1e-3], 164 | 'lambda_1': [1e-6, 1e-3], 165 | 'lambda_2': [1e-6, 1e-3], 166 | **normalize, 167 | }), 168 | 169 | # WARNING: ARDRegression takes a long time to run 170 | (ARDRegression, 171 | {'n_iter': [100, 300], 172 | **normalize, 173 | 'alpha_1': [1e-6, 
1e-3], 174 | 'alpha_2': [1e-6, 1e-3], 175 | 'lambda_1': [1e-6, 1e-3], 176 | 'lambda_2': [1e-6, 1e-3], 177 | }), 178 | 179 | (SGDRegressor, 180 | {'loss': ['squared_loss', 'huber'], 181 | **penalty_12e, **n_iter, 182 | 'alpha': [1e-6, 1e-5, 1e-2, 'optimal'], 183 | 'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9], 184 | }), 185 | 186 | (PassiveAggressiveRegressor, 187 | {**C, **n_iter, 188 | }), 189 | 190 | (RANSACRegressor, 191 | {'min_samples': [0.1, 0.5, 0.9, None], 192 | 'max_trials': n_iter['n_iter'], 193 | 'stop_score': [0.8, 1], 194 | 'loss': ['absolute_loss', 'squared_loss'] 195 | }), 196 | 197 | (HuberRegressor, 198 | { **max_iter, **alpha_small, 199 | }), 200 | 201 | (KernelRidge, 202 | {**alpha_small, **degree, 203 | }) 204 | ] 205 | 206 | svm_models_n_params_small = [ 207 | (SVR, 208 | {**kernel, **degree, **shrinking 209 | }), 210 | 211 | (NuSVR, 212 | {**nu_small, **kernel, **degree, **shrinking, 213 | }), 214 | 215 | (LinearSVR, 216 | {**C_small, **epsilon, 217 | 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'], 218 | 'intercept_scaling': [0.1, 1, 10] 219 | }) 220 | ] 221 | 222 | svm_models_n_params = [ 223 | (SVR, 224 | {**C, **epsilon, **kernel, **degree, **gamma, **coef0, **shrinking, **tol, **max_iter_inf2 225 | }), 226 | 227 | (NuSVR, 228 | {**C, **nu, **kernel, **degree, **gamma, **coef0, **shrinking , **tol, **max_iter_inf2 229 | }), 230 | 231 | (LinearSVR, 232 | {**C, **epsilon, **tol, **max_iter, 233 | 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'], 234 | 'intercept_scaling': [0.1, 0.5, 1, 5, 10] 235 | }) 236 | ] 237 | 238 | neighbor_models_n_params = [ 239 | (RadiusNeighborsRegressor, 240 | {**neighbor_radius, **neighbor_algo, **neighbor_leaf_size, **neighbor_metric, 241 | 'weights': ['uniform', 'distance'], 242 | 'p': [1, 2], 243 | }), 244 | 245 | (KNeighborsRegressor, 246 | {**n_neighbors, **neighbor_algo, **neighbor_leaf_size, **neighbor_metric, 247 | 'p': [1, 2], 248 | 'weights': ['uniform', 'distance'], 249 | }) 250 | ] 251 | 252 | gaussianprocess_models_n_params = [ 253 | (GaussianProcessRegressor, 254 | {'kernel': [RBF(), ConstantKernel(), DotProduct(), WhiteKernel()], 255 | 'n_restarts_optimizer': [3], 256 | 'alpha': [1e-10, 1e-5], 257 | 'normalize_y': [True, False] 258 | }) 259 | ] 260 | 261 | nn_models_n_params = [ 262 | (MLPRegressor, 263 | { 'hidden_layer_sizes': [(16,), (64,), (100,), (32, 64)], 264 | 'activation': ['identity', 'logistic', 'tanh', 'relu'], 265 | **alpha, **learning_rate, **tol, **warm_start, 266 | 'batch_size': ['auto', 50], 267 | 'max_iter': [1000], 268 | 'early_stopping': [True, False], 269 | 'epsilon': [1e-8, 1e-5] 270 | }) 271 | ] 272 | 273 | nn_models_n_params_small = [ 274 | (MLPRegressor, 275 | { 'hidden_layer_sizes': [(64,), (32, 64)], 276 | 'activation': ['identity', 'tanh', 'relu'], 277 | 'max_iter': [500], 278 | 'early_stopping': [True], 279 | **learning_rate_small 280 | }) 281 | ] 282 | 283 | tree_models_n_params = [ 284 | 285 | (DecisionTreeRegressor, 286 | {**max_features, **max_depth, **min_samples_split, **min_samples_leaf, **min_impurity_split, 287 | 'criterion': ['mse', 'mae']}), 288 | 289 | (ExtraTreesRegressor, 290 | {**n_estimators, **max_features, **max_depth, **min_samples_split, 291 | **min_samples_leaf, **min_impurity_split, **warm_start, 292 | 'criterion': ['mse', 'mae']}), 293 | 294 | ] 295 | 296 | tree_models_n_params_small = [ 297 | (DecisionTreeRegressor, 298 | {**max_features_small, **max_depth_small, **min_samples_split, **min_samples_leaf, 299 | 'criterion': ['mse', 'mae']}), 300 | 301 
| (ExtraTreesRegressor, 302 | {**n_estimators_small, **max_features_small, **max_depth_small, **min_samples_split, 303 | **min_samples_leaf, 304 | 'criterion': ['mse', 'mae']}) 305 | ] 306 | 307 | def run_linear_models(x, y, small = True, normalize_x = True): 308 | return big_loop(linear_models_n_params_small if small else linear_models_n_params, 309 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 310 | 311 | def run_svm_models(x, y, small = True, normalize_x = True): 312 | return big_loop(svm_models_n_params_small if small else svm_models_n_params, 313 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 314 | 315 | def run_neighbor_models(x, y, normalize_x = True): 316 | return big_loop(neighbor_models_n_params, 317 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 318 | 319 | def run_gaussian_models(x, y, normalize_x = True): 320 | return big_loop(gaussianprocess_models_n_params, 321 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 322 | 323 | def run_nn_models(x, y, small = True, normalize_x = True): 324 | return big_loop(nn_models_n_params_small if small else nn_models_n_params, 325 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 326 | 327 | def run_tree_models(x, y, small = True, normalize_x = True): 328 | return big_loop(tree_models_n_params_small if small else tree_models_n_params, 329 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 330 | 331 | def run_all(x, y, small = True, normalize_x = True, n_jobs=cpu_count()-1): 332 | 333 | all_params = (linear_models_n_params_small if small else linear_models_n_params) + \ 334 | (nn_models_n_params_small if small else nn_models_n_params) + \ 335 | ([] if small else gaussianprocess_models_n_params) + \ 336 | neighbor_models_n_params + \ 337 | (svm_models_n_params_small if small else svm_models_n_params) + \ 338 | (tree_models_n_params_small if small else tree_models_n_params) 339 | 340 | return big_loop(all_params, 341 | StandardScaler().fit_transform(x) if normalize_x else x, y, 342 | isClassification=False, n_jobs=n_jobs) 343 | 344 | 345 | if __name__ == '__main__': 346 | 347 | x, y = gen_reg_data(10, 3, 100, 3, sum, 0.3) 348 | run_all(x, y, small=True, normalize_x=True) 349 | -------------------------------------------------------------------------------- /universal_params.py: -------------------------------------------------------------------------------- 1 | """ 2 | parameter settings used by multiple classifiers/regressors 3 | """ 4 | 5 | import numpy as np 6 | 7 | penalty_12 = {'penalty': ['l1', 'l2']} 8 | penalty_12none = {'penalty': ['l1', 'l2', None]} 9 | penalty_12e = {'penalty': ['l1', 'l2', 'elasticnet']} 10 | penalty_all = {'penalty': ['l1', 'l2', None, 'elasticnet']} 11 | max_iter = {'max_iter': [100, 300, 1000]} 12 | max_iter_inf = {'max_iter': [100, 300, 500, 1000, np.inf]} 13 | max_iter_inf2 = {'max_iter': [100, 300, 500, 1000, -1]} 14 | tol = {'tol': [1e-4, 1e-3, 1e-2]} 15 | warm_start = {'warm_start': [True, False]} 16 | alpha = {'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 3, 10]} 17 | alpha_small = {'alpha': [1e-5, 1e-3, 0.1, 1]} 18 | n_iter = {'n_iter': [5, 10, 20]} 19 | 20 | eta0 = {'eta0': [1e-4, 1e-3, 1e-2, 0.1]} 21 | C = {'C': [1e-2, 0.1, 1, 5, 10]} 22 | C_small = {'C': [ 0.1, 1, 5]} 23 | epsilon = {'epsilon': [1e-3, 1e-2, 0.1, 0]} 24 | normalize = {'normalize': [True, False]} 25 | kernel = {'kernel': 
['linear', 'poly', 'rbf', 'sigmoid']} 26 | degree = {'degree': [1, 2, 3, 4, 5]} 27 | gamma = {'gamma': list(np.logspace(-9, 3, 6)) + ['auto']} 28 | gamma_small = {'gamma': list(np.logspace(-6, 3, 3)) + ['auto']} 29 | coef0 = {'coef0': [0, 0.1, 0.3, 0.5, 0.7, 1]} 30 | coef0_small = {'coef0': [0, 0.4, 0.7, 1]} 31 | shrinking = {'shrinking': [True, False]} 32 | nu = {'nu': [1e-4, 1e-2, 0.1, 0.3, 0.5, 0.75, 0.9]} 33 | nu_small = {'nu': [1e-2, 0.1, 0.5, 0.9]} 34 | 35 | n_neighbors = {'n_neighbors': [5, 7, 10, 15, 20]} 36 | neighbor_algo = {'algorithm': ['ball_tree', 'kd_tree', 'brute']} 37 | neighbor_leaf_size = {'leaf_size': [1, 2, 5, 10, 20, 30, 50, 100]} 38 | neighbor_metric = {'metric': ['cityblock', 'euclidean', 'l1', 'l2', 'manhattan']} 39 | neighbor_radius = {'radius': [1e-2, 0.1, 1, 5, 10]} 40 | learning_rate = {'learning_rate': ['constant', 'invscaling', 'adaptive']} 41 | learning_rate_small = {'learning_rate': ['invscaling', 'adaptive']} 42 | 43 | n_estimators = {'n_estimators': [2, 3, 5, 10, 25, 50, 100]} 44 | n_estimators_small = {'n_estimators': [2, 10, 25, 100]} 45 | max_features = {'max_features': [3, 5, 10, 25, 50, 'auto', 'log2', None]} 46 | max_features_small = {'max_features': [3, 5, 10, 'auto', 'log2', None]} 47 | max_depth = {'max_depth': [None, 3, 5, 7, 10]} 48 | max_depth_small = {'max_depth': [None, 5, 10]} 49 | min_samples_split = {'min_samples_split': [2, 5, 10, 0.1]} 50 | min_impurity_split = {'min_impurity_split': [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]} 51 | tree_learning_rate = {'learning_rate': [0.8, 1]} 52 | min_samples_leaf = {'min_samples_leaf': [2]} 53 | 54 | 55 | -------------------------------------------------------------------------------- /utilities.py: -------------------------------------------------------------------------------- 1 | from pprint import pprint 2 | import numpy as np 3 | nan = float('nan') 4 | from collections import Counter 5 | from multiprocessing import cpu_count 6 | from time import time 7 | from tabulate import tabulate 8 | 9 | from sklearn.cluster import KMeans 10 | from sklearn.model_selection import StratifiedShuffleSplit as sss, ShuffleSplit as ss, GridSearchCV 11 | from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, \ 12 | ExtraTreesClassifier, ExtraTreesRegressor, AdaBoostClassifier, AdaBoostRegressor 13 | from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor 14 | 15 | 16 | TREE_N_ENSEMBLE_MODELS = [RandomForestClassifier, GradientBoostingClassifier, DecisionTreeClassifier, DecisionTreeRegressor, 17 | ExtraTreesClassifier, ExtraTreesRegressor, AdaBoostClassifier, AdaBoostRegressor] 18 | 19 | def upsample_indices_clf(inds, y): 20 | """ 21 | make all classes have the same number of samples. for classification only. 
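Illustrative example (hypothetical toy inputs): with inds = np.array([0, 1, 2, 3]) and y = np.array([0, 0, 0, 1]), the single class-1 sample at index 3 is repeated until both classes have 3 samples, so the returned indices are [0, 1, 2, 3, 3, 3].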
22 | :type inds: numpy array 23 | :type y: numpy array 24 | :return: a numpy array of indices 25 | """ 26 | 27 | assert len(inds) == len(y) 28 | 29 | countByClass = dict(Counter(y)) 30 | maxCount = max(countByClass.values()) 31 | 32 | extras = [] 33 | 34 | for klass, count in countByClass.items(): 35 | if maxCount == count: continue 36 | 37 | ratio = int(maxCount / count) 38 | cur_inds = inds[y == klass] 39 | 40 | extras.append(np.concatenate( 41 | (np.repeat(cur_inds, ratio - 1), 42 | np.random.choice(cur_inds, maxCount - ratio * count, replace=False) 43 | )) 44 | ) 45 | 46 | print('upsampling class %d, %d times' % (klass, ratio-1)) 47 | 48 | return np.concatenate((inds, *extras)) 49 | 50 | 51 | def cv_clf(x, y, 52 | test_size = 0.2, n_splits = 5, random_state=None, 53 | doesUpsample = True): 54 | """ 55 | an iterator of cross-validation groups with upsampling 56 | :param x: 57 | :param y: 58 | :param test_size: 59 | :param n_splits: 60 | :return: 61 | """ 62 | 63 | sss_obj = sss(n_splits, test_size, random_state=random_state).split(x, y) 64 | 65 | # no upsampling needed: pass the splits through unchanged (a bare `return sss_obj` in this generator would yield nothing) 66 | if not doesUpsample: 67 | yield from sss_obj 68 | return 69 | # with upsampling 70 | for train_inds, valid_inds in sss_obj: 71 | yield (upsample_indices_clf(train_inds, y[train_inds]), valid_inds) 72 | 73 | 74 | def cv_reg(x, test_size = 0.2, n_splits = 5, random_state=None): 75 | return ss(n_splits, test_size, random_state=random_state).split(x) 76 | 77 | def timeit(klass, params, x, y): 78 | """ 79 | time in seconds 80 | """ 81 | 82 | start = time() 83 | clf = klass(**params) 84 | clf.fit(x, y) 85 | 86 | return time() - start 87 | 88 | def big_loop(models_n_params, x, y, isClassification, 89 | test_size = 0.2, n_splits = 5, random_state=None, doesUpsample=True, 90 | scoring=None, 91 | verbose=False, n_jobs = cpu_count()-1): 92 | """ 93 | runs through all model classes with their respective hyperparameters 94 | :param models_n_params: [(model class, hyperparameters), ...] 95 | :param isClassification: whether it's a classification or regression problem 96 | :type isClassification: bool 97 | :param scoring: by default 'accuracy' for classification; 'neg_mean_squared_error' for regression 98 | :return: the best estimator, list of [(estimator, cv score), ...]
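Illustrative usage (a sketch, not a doctest; assumes `x`/`y` are numpy arrays and the model/grid list comes from run_classification.py or run_regression.py): winner, results = big_loop(linear_models_n_params_small, x, y, isClassification=True); the returned winner is the refit best estimator from the winning GridSearchCV, so winner.predict(x) works directly.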
99 | """ 100 | 101 | def cv_(): 102 | return cv_clf(x, y, test_size, n_splits, random_state, doesUpsample) \ 103 | if isClassification \ 104 | else cv_reg(x, test_size, n_splits, random_state) 105 | 106 | res = [] 107 | num_features = x.shape[1] 108 | scoring = scoring or ('accuracy' if isClassification else 'neg_mean_squared_error') 109 | print('Scoring criteria:', scoring) 110 | 111 | for i, (clf_Klass, parameters) in enumerate(models_n_params): 112 | try: 113 | print('-'*15, 'model %d/%d' % (i+1, len(models_n_params)), '-'*15) 114 | print(clf_Klass.__name__) 115 | 116 | if clf_Klass == KMeans: 117 | parameters['n_clusters'] = [len(np.unique(y))] 118 | elif clf_Klass in TREE_N_ENSEMBLE_MODELS: 119 | parameters['max_features'] = [v for v in parameters['max_features'] 120 | if v is None or type(v)==str or v<=num_features] 121 | 122 | clf_search = GridSearchCV(clf_Klass(), parameters, scoring, cv=cv_(), n_jobs=n_jobs) 123 | clf_search.fit(x, y) 124 | 125 | timespent = timeit(clf_Klass, clf_search.best_params_, x, y) 126 | print('best score:', clf_search.best_score_, 'time/clf: %0.3f seconds' % timespent) 127 | print('best params:') 128 | pprint(clf_search.best_params_) 129 | 130 | if verbose: 131 | print('validation scores:', clf_search.cv_results_['mean_test_score']) 132 | print('training scores:', clf_search.cv_results_['mean_train_score']) 133 | 134 | res.append((clf_search.best_estimator_, clf_search.best_score_, timespent)) 135 | 136 | except Exception as e: 137 | print('ERROR OCCURRED') 138 | if verbose: print(e) 139 | res.append((clf_Klass(), -np.inf, np.inf)) 140 | 141 | 142 | print('='*60) 143 | print(tabulate([[m.__class__.__name__, '%.3f'%s, '%.3f'%t] for m, s, t in res], headers=['Model', scoring, 'Time/clf (s)'])) 144 | winner_ind = np.argmax([v[1] for v in res]) 145 | winner = res[winner_ind][0] 146 | print('='*60) 147 | print('The winner is: %s with score %0.3f.' % (winner.__class__.__name__, res[winner_ind][1])) 148 | 149 | return winner, res 150 | 151 | 152 | if __name__ == '__main__': 153 | 154 | y = np.array([0,1,0,0,0,3,1,1,3]) 155 | x = np.zeros(len(y)) 156 | 157 | for t, v in cv_reg(x): 158 | print('---------') 159 | print('training inds:', t) 160 | print('valid inds:', v) 161 | 162 | for t, v in cv_clf(x, y, test_size=3): 163 | print('---------') 164 | print('training inds:', t) 165 | print('valid inds:', v) --------------------------------------------------------------------------------
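Putting the pieces together, here is a minimal end-to-end sketch of the USAGE section above (illustrative only: `my_features.npy` and `my_labels.npy` are hypothetical stand-ins for your own data, and the regression half just reuses the repo's fake-data generator):

```python
import numpy as np
from run_classification import run_all as run_all_clf
from run_regression import run_all as run_all_reg, gen_reg_data

# classification: x is a 2-D feature array, y a 1-D label array
x = np.load('my_features.npy')   # hypothetical files -- substitute your own data
y = np.load('my_labels.npy')
best_clf, clf_results = run_all_clf(x, y, small=True, normalize_x=True)

# regression: quick smoke test on the repo's generated fake data
xr, yr = gen_reg_data(num_samples=200, num_features=5)
best_reg, reg_results = run_all_reg(xr, yr, small=True)
```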