├── .gitignore ├── Readme.md ├── run_classification.py ├── run_regression.py ├── universal_params.py └── utilities.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | .DS_store 3 | .DS_Store 4 | __pycache__ 5 | -------------------------------------------------------------------------------- /Readme.md: -------------------------------------------------------------------------------- 1 | # WHAT IT IS 2 | Many believe that 3 | 4 | > most of the work of supervised (non-deep) Machine Learning lies in feature engineering, whereas the model-selection process is just running through all the models with a huge for-loop. 5 | 6 | So one glorious weekend, I decided to type out said loop! 7 | 8 | ![And I thought Machine Learning was hard.](https://user-images.githubusercontent.com/2030548/27264037-f532cb72-542a-11e7-9b37-d776f5a09a08.png) 9 | 10 | # HOW IT WORKS 11 | Runs through all `sklearn` models (both classification and regression) with **all possible hyperparameters**, and ranks them using cross-validation. 12 | 13 | # MODELS 14 | Runs **all the models** available in `sklearn` for supervised learning, listed [here](http://scikit-learn.org/stable/supervised_learning.html). The categories are: 15 | 16 | * Generalized Linear Models 17 | * Kernel Ridge 18 | * Support Vector Machines 19 | * Nearest Neighbors 20 | * Gaussian Processes 21 | * Naive Bayes 22 | * Trees 23 | * Neural Networks 24 | * Ensemble methods 25 | 26 | Note: I skipped GradientTreeBoosting due to sub-par model performance, long run-times and constant convergence issues. I also skipped AdaBoost because it kept raising max_features errors. (Please ping me, or contribute to the repo directly, if you ever get AdaBoost to work.) 27 | 28 | # USAGE 29 | 30 | ### How to run 31 | 1. Feed in `X` (2-D `numpy.array`) and `y` (1-D `numpy.array`). (The code also generates fake data for testing purposes.) 32 | 2. Use `run_classification` or `run_regression` where appropriate. 33 | 34 | 35 | The output looks like this: 36 | 37 | | Model | accuracy | Time/clf (s)| 38 |---------------------------- |:-------------:|:-------------:| 39 |SGDClassifier | 0.967 | 0.001 | 40 |LogisticRegression | 0.940 | 0.001 | 41 |Perceptron | 0.900 | 0.001 | 42 |PassiveAggressiveClassifier | 0.967 | 0.001 | 43 |MLPClassifier | 0.827 | 0.018 | 44 |KMeans | 0.580 | 0.010 | 45 |KNeighborsClassifier | 0.960 | 0.000 | 46 |NearestCentroid | 0.933 | 0.000 | 47 |RadiusNeighborsClassifier | 0.927 | 0.000 | 48 |SVC | 0.960 | 0.000 | 49 |NuSVC | 0.980 | 0.001 | 50 |LinearSVC | 0.940 | 0.005 | 51 |RandomForestClassifier | 0.980 | 0.015 | 52 |DecisionTreeClassifier | 0.960 | 0.000 | 53 |ExtraTreesClassifier | 0.993 | 0.002 | 54 | 55 | *The winner is: ExtraTreesClassifier with score 0.993.* 56 | 57 | ### Knobs 58 | 59 | * Evaluation criteria 60 | 61 | By default, classification uses accuracy and regression uses negative MSE, set by the `scoring` parameter of the `big_loop` function in `utilities.py`. It also accepts any `sklearn` scoring string. 62 | 63 | * Scale 64 | 65 | Because it takes a long time to run through all models and hyperparameters at full-blown scale, there is a "small" and a full version of the hyperparameter grids for almost every model. The "small" ones run much faster by evaluating only the most essential hyperparameters over smaller ranges than the full version. It's controlled by the `small` parameter of all of the `run_all` functions, as in the sketch below.
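For example, a minimal sketch of these two knobs (illustrative only; it assumes `x` and `y` are already loaded as `numpy` arrays):

```python
from run_classification import run_all, tree_models_n_params
from utilities import big_loop

# full grids instead of the "small" ones -- slower but more thorough
best_model, results = run_all(x, y, small=False)

# any sklearn scoring string works, e.g. 'f1_macro' instead of the default 'accuracy'
best_tree, tree_results = big_loop(tree_models_n_params, x, y,
                                   isClassification=True, scoring='f1_macro')
```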
66 | 67 | * Hyperparameters 68 | 69 | You can modify the search space of hyperparameters in `run_regression.py` and `run_classification.py`. 70 | 71 | * Running only a category of models 72 | 73 | Depending on the nature of the problem, certain categories of models work better than others. There are separate functions for each category in `run_regression.py` and `run_classification.py`. 74 | 75 | # TO-DO'S 76 | 77 | Feel free to contribute by hashing out the following: 78 | 79 | * Wrap an ensemble (bagging/boosting) model on top of the best models. 80 | * Multi-target classification (i.e. `y` having multiple columns). 81 | 82 |   83 |   84 |   85 |   86 |   87 | 88 | Oh boy, that was a lot of typing in the past 24 hours! Hopefully it saves you (and myself) some typing in the future. I'm gonna grab some lunch, sip a cold drink and enjoy the California summer heat. :) 89 | Check out more of my pet projects on [planetj.io](planetj.io). 90 | -------------------------------------------------------------------------------- /run_classification.py: -------------------------------------------------------------------------------- 1 | import warnings 2 | warnings.filterwarnings('ignore') 3 | import numpy as np 4 | from sklearn import datasets 5 | from sklearn.linear_model import SGDClassifier, LogisticRegression, \ 6 | Perceptron, PassiveAggressiveClassifier 7 | 8 | from sklearn.preprocessing import StandardScaler 9 | from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier 10 | from sklearn.svm import SVC, LinearSVC, NuSVC 11 | from sklearn.cluster import KMeans 12 | from sklearn.neighbors import KNeighborsClassifier, NearestCentroid, RadiusNeighborsClassifier 13 | from sklearn.gaussian_process import GaussianProcessClassifier 14 | from sklearn.gaussian_process.kernels import RBF, ConstantKernel, DotProduct, Matern, StationaryKernelMixin, WhiteKernel 15 | from sklearn.naive_bayes import GaussianNB 16 | from sklearn.neural_network import MLPClassifier 17 | from sklearn.tree import DecisionTreeClassifier 18 | 19 | from utilities import * 20 | from universal_params import * 21 | 22 | 23 | def gen_classification_data(n=None): 24 | """ 25 | uses the iris data 26 | :return: x, y 27 | """ 28 | 29 | iris = datasets.load_iris() 30 | x = iris.data 31 | y = iris.target 32 | 33 | if n: 34 | half = int(n/2) 35 | x, y = np.concatenate((x[:half], x[-half:]), 0), np.concatenate((y[:half], y[-half:]), 0)  # keep the first and last `half` rows (assign the result; concatenate along axis 0) 36 | 37 | return x, y 38 | 39 | linear_models_n_params = [ 40 | (SGDClassifier, 41 | {'loss': ['hinge', 'log', 'modified_huber', 'squared_hinge'], 42 | 'alpha': [0.0001, 0.001, 0.1], 43 | **penalty_12none 44 | }), 45 | 46 | (LogisticRegression, 47 | {**penalty_12, **max_iter, **tol, **warm_start, **C, 48 | 'solver': ['liblinear'] 49 | }), 50 | 51 | (Perceptron, 52 | {**penalty_all, **alpha, **n_iter, **eta0, **warm_start 53 | }), 54 | 55 | (PassiveAggressiveClassifier, 56 | {**C, **n_iter, **warm_start, 57 | 'loss': ['hinge', 'squared_hinge'], 58 | }) 59 | ] 60 | 61 | linear_models_n_params_small = linear_models_n_params 62 | 63 | svm_models_n_params = [ 64 | (SVC, 65 | {**C, **kernel, **degree, **gamma, **coef0, **shrinking, **tol, **max_iter_inf2}), 66 | 67 | (NuSVC, 68 | {**nu, **kernel, **degree, **gamma, **coef0, **shrinking, **tol 69 | }), 70 | 71 | (LinearSVC, 72 | { **C, **penalty_12, **tol, **max_iter, 73 | 'loss': ['hinge', 'squared_hinge'], 74 | }) 75 | ] 76 | 77 | svm_models_n_params_small = [ 78 | (SVC, 79 | {**kernel, **degree, **shrinking 80 | }), 81 | 82 | (NuSVC, 83 | {**nu_small, **kernel,
**degree, **shrinking 84 | }), 85 | 86 | (LinearSVC, 87 | { **C_small, 88 | 'penalty': ['l2'], 89 | 'loss': ['hinge', 'squared_hinge'], 90 | }) 91 | ] 92 | 93 | neighbor_models_n_params = [ 94 | 95 | (KMeans, 96 | {'algorithm': ['auto', 'full', 'elkan'], 97 | 'init': ['k-means++', 'random']}), 98 | 99 | (KNeighborsClassifier, 100 | {**n_neighbors, **neighbor_algo, **neighbor_leaf_size, **neighbor_metric, 101 | 'weights': ['uniform', 'distance'], 102 | 'p': [1, 2] 103 | }), 104 | 105 | (NearestCentroid, 106 | {**neighbor_metric, 107 | 'shrink_threshold': [1e-3, 1e-2, 0.1, 0.5, 0.9, 2] 108 | }), 109 | 110 | (RadiusNeighborsClassifier, 111 | {**neighbor_radius, **neighbor_algo, **neighbor_leaf_size, **neighbor_metric, 112 | 'weights': ['uniform', 'distance'], 113 | 'p': [1, 2], 114 | 'outlier_label': [-1] 115 | }) 116 | ] 117 | 118 | gaussianprocess_models_n_params = [ 119 | (GaussianProcessClassifier, 120 | {**warm_start, 121 | 'kernel': [RBF(), ConstantKernel(), DotProduct(), WhiteKernel()], 122 | 'max_iter_predict': [500], 123 | 'n_restarts_optimizer': [3], 124 | }) 125 | ] 126 | 127 | bayes_models_n_params = [ 128 | (GaussianNB, {}) 129 | ] 130 | 131 | nn_models_n_params = [ 132 | (MLPClassifier, 133 | { 'hidden_layer_sizes': [(16,), (64,), (100,), (32, 32)], 134 | 'activation': ['identity', 'logistic', 'tanh', 'relu'], 135 | **alpha, **learning_rate, **tol, **warm_start, 136 | 'batch_size': ['auto', 50], 137 | 'max_iter': [1000], 138 | 'early_stopping': [True, False], 139 | 'epsilon': [1e-8, 1e-5] 140 | }) 141 | ] 142 | 143 | nn_models_n_params_small = [ 144 | (MLPClassifier, 145 | { 'hidden_layer_sizes': [(64,), (32, 64)], 146 | 'batch_size': ['auto', 50], 147 | 'activation': ['identity', 'tanh', 'relu'], 148 | 'max_iter': [500], 149 | 'early_stopping': [True], 150 | **learning_rate_small 151 | }) 152 | ] 153 | 154 | tree_models_n_params = [ 155 | 156 | (RandomForestClassifier, 157 | {'criterion': ['gini', 'entropy'], 158 | **max_features, **n_estimators, **max_depth, 159 | **min_samples_split, **min_impurity_split, **warm_start, **min_samples_leaf, 160 | }), 161 | 162 | (DecisionTreeClassifier, 163 | {'criterion': ['gini', 'entropy'], 164 | **max_features, **max_depth, **min_samples_split, **min_impurity_split, **min_samples_leaf 165 | }), 166 | 167 | (ExtraTreesClassifier, 168 | {**n_estimators, **max_features, **max_depth, 169 | **min_samples_split, **min_samples_leaf, **min_impurity_split, **warm_start, 170 | 'criterion': ['gini', 'entropy']}) 171 | ] 172 | 173 | 174 | tree_models_n_params_small = [ 175 | 176 | (RandomForestClassifier, 177 | {**max_features_small, **n_estimators_small, **min_samples_split, **max_depth_small, **min_samples_leaf 178 | }), 179 | 180 | (DecisionTreeClassifier, 181 | {**max_features_small, **max_depth_small, **min_samples_split, **min_samples_leaf 182 | }), 183 | 184 | (ExtraTreesClassifier, 185 | {**n_estimators_small, **max_features_small, **max_depth_small, 186 | **min_samples_split, **min_samples_leaf}) 187 | ] 188 | 189 | 190 | 191 | def run_linear_models(x, y, small = True, normalize_x = True): 192 | return big_loop(linear_models_n_params_small if small else linear_models_n_params, 193 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 194 | 195 | def run_svm_models(x, y, small = True, normalize_x = True): 196 | return big_loop(svm_models_n_params_small if small else svm_models_n_params, 197 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 198 | 199 | def run_neighbor_models(x, 
y, normalize_x = True): 200 | return big_loop(neighbor_models_n_params, 201 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 202 | 203 | def run_gaussian_models(x, y, normalize_x = True): 204 | return big_loop(gaussianprocess_models_n_params, 205 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 206 | 207 | def run_nn_models(x, y, small = True, normalize_x = True): 208 | return big_loop(nn_models_n_params_small if small else nn_models_n_params, 209 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 210 | 211 | def run_tree_models(x, y, small = True, normalize_x = True): 212 | return big_loop(tree_models_n_params_small if small else tree_models_n_params, 213 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=True) 214 | 215 | def run_all(x, y, small = True, normalize_x = True, n_jobs=cpu_count()-1): 216 | 217 | all_params = (linear_models_n_params_small if small else linear_models_n_params) + \ 218 | (nn_models_n_params_small if small else nn_models_n_params) + \ 219 | ([] if small else gaussianprocess_models_n_params) + \ 220 | neighbor_models_n_params + \ 221 | (svm_models_n_params_small if small else svm_models_n_params) + \ 222 | (tree_models_n_params_small if small else tree_models_n_params) 223 | 224 | return big_loop(all_params, 225 | StandardScaler().fit_transform(x) if normalize_x else x, y, 226 | isClassification=True, n_jobs=n_jobs) 227 | 228 | 229 | 230 | if __name__ == '__main__': 231 | 232 | x, y = gen_classification_data() 233 | run_all(x, y, n_jobs=1) 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | -------------------------------------------------------------------------------- /run_regression.py: -------------------------------------------------------------------------------- 1 | import warnings 2 | warnings.filterwarnings('ignore') 3 | from multiprocessing import cpu_count 4 | 5 | # linear models: http://scikit-learn.org/stable/modules/linear_model.html#stochastic-gradient-descent-sgd 6 | from sklearn.linear_model import \ 7 | LinearRegression, Ridge, Lasso, ElasticNet, \ 8 | Lars, LassoLars, \ 9 | OrthogonalMatchingPursuit, \ 10 | BayesianRidge, ARDRegression, \ 11 | SGDRegressor, \ 12 | PassiveAggressiveRegressor, \ 13 | RANSACRegressor, HuberRegressor 14 | 15 | from sklearn.kernel_ridge import KernelRidge 16 | from sklearn.preprocessing import StandardScaler 17 | 18 | # svm models: http://scikit-learn.org/stable/modules/svm.html 19 | from sklearn.svm import SVR, NuSVR, LinearSVR 20 | 21 | # neighbor models: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsRegressor.html#sklearn.neighbors.RadiusNeighborsRegressor 22 | from sklearn.neighbors import RadiusNeighborsRegressor, KNeighborsRegressor 23 | 24 | from sklearn.gaussian_process import GaussianProcessRegressor 25 | from sklearn.gaussian_process.kernels import RBF, ConstantKernel, DotProduct, WhiteKernel 26 | from sklearn.neural_network import MLPRegressor 27 | 28 | from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, RandomForestRegressor 29 | from sklearn.tree import DecisionTreeRegressor 30 | 31 | from utilities import * 32 | from universal_params import * 33 | 34 | 35 | def gen_reg_data(x_mu=10., x_sigma=1., num_samples=100, num_features=3, 36 | y_formula=sum, y_sigma=1.): 37 | """ 38 | generate some fake data for us to work with 39 | :return: x, y 40 | """ 41 | x = 
np.random.normal(x_mu, x_sigma, (num_samples, num_features)) 42 | y = np.apply_along_axis(y_formula, 1, x) + np.random.normal(0, y_sigma, (num_samples,)) 43 | 44 | return x, y 45 | 46 | 47 | linear_models_n_params = [ 48 | (LinearRegression, normalize), 49 | 50 | (Ridge, 51 | {**alpha, **normalize, **tol, 52 | 'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag'] 53 | }), 54 | 55 | (Lasso, 56 | {**alpha, **normalize, **tol, **warm_start 57 | }), 58 | 59 | (ElasticNet, 60 | {**alpha, **normalize, **tol, 61 | 'l1_ratio': [0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9], 62 | }), 63 | 64 | (Lars, 65 | {**normalize, 66 | 'n_nonzero_coefs': [100, 300, 500, np.inf], 67 | }), 68 | 69 | (LassoLars, 70 | {**normalize, **max_iter_inf, **normalize, **alpha 71 | }), 72 | 73 | (OrthogonalMatchingPursuit, 74 | {'n_nonzero_coefs': [100, 300, 500, np.inf, None], 75 | **tol, **normalize 76 | }), 77 | 78 | (BayesianRidge, 79 | { 80 | 'n_iter': [100, 300, 1000], 81 | **tol, **normalize, 82 | 'alpha_1': [1e-6, 1e-4, 1e-2, 0.1, 0], 83 | 'alpha_2': [1e-6, 1e-4, 1e-2, 0.1, 0], 84 | 'lambda_1': [1e-6, 1e-4, 1e-2, 0.1, 0], 85 | 'lambda_2': [1e-6, 1e-4, 1e-2, 0.1, 0], 86 | }), 87 | 88 | # WARNING: ARDRegression takes a long time to run 89 | (ARDRegression, 90 | {'n_iter': [100, 300, 1000], 91 | **tol, **normalize, 92 | 'alpha_1': [1e-6, 1e-4, 1e-2, 0.1, 0], 93 | 'alpha_2': [1e-6, 1e-4, 1e-2, 0.1, 0], 94 | 'lambda_1': [1e-6, 1e-4, 1e-2, 0.1, 0], 95 | 'lambda_2': [1e-6, 1e-4, 1e-2, 0.1, 0], 96 | 'threshold_lambda': [1e2, 1e3, 1e4, 1e6]}), 97 | 98 | (SGDRegressor, 99 | {'loss': ['squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'], 100 | **penalty_12e, **n_iter, **epsilon, **eta0, 101 | 'alpha': [1e-6, 1e-5, 1e-2, 'optimal'], 102 | 'l1_ratio': [0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9], 103 | 'learning_rate': ['constant', 'optimal', 'invscaling'], 104 | 'power_t': [0.1, 0.25, 0.5] 105 | }), 106 | 107 | (PassiveAggressiveRegressor, 108 | {**C, **epsilon, **n_iter, **warm_start, 109 | 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'] 110 | }), 111 | 112 | (RANSACRegressor, 113 | {'min_samples': [0.1, 0.5, 0.9, None], 114 | 'max_trials': n_iter['n_iter'], 115 | 'stop_score': [0.8, 0.9, 1], 116 | 'stop_probability': [0.9, 0.95, 0.99, 1], 117 | 'loss': ['absolute_loss', 'squared_loss'] 118 | }), 119 | 120 | (HuberRegressor, 121 | { 'epsilon': [1.1, 1.35, 1.5, 2], 122 | **max_iter, **alpha, **warm_start, **tol 123 | }), 124 | 125 | (KernelRidge, 126 | {**alpha, **degree, **gamma, **coef0 127 | }) 128 | ] 129 | 130 | linear_models_n_params_small = [ 131 | (LinearRegression, normalize), 132 | 133 | (Ridge, 134 | {**alpha_small, **normalize 135 | }), 136 | 137 | (Lasso, 138 | {**alpha_small, **normalize 139 | }), 140 | 141 | (ElasticNet, 142 | {**alpha, **normalize, 143 | 'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9], 144 | }), 145 | 146 | (Lars, 147 | {**normalize, 148 | 'n_nonzero_coefs': [100, 300, 500, np.inf], 149 | }), 150 | 151 | (LassoLars, 152 | {**normalize, **max_iter_inf, **normalize, **alpha_small 153 | }), 154 | 155 | (OrthogonalMatchingPursuit, 156 | {'n_nonzero_coefs': [100, 300, 500, np.inf, None], 157 | **normalize 158 | }), 159 | 160 | (BayesianRidge, 161 | { 'n_iter': [100, 300, 1000], 162 | 'alpha_1': [1e-6, 1e-3], 163 | 'alpha_2': [1e-6, 1e-3], 164 | 'lambda_1': [1e-6, 1e-3], 165 | 'lambda_2': [1e-6, 1e-3], 166 | **normalize, 167 | }), 168 | 169 | # WARNING: ARDRegression takes a long time to run 170 | (ARDRegression, 171 | {'n_iter': [100, 300], 172 | **normalize, 173 | 'alpha_1': [1e-6, 
1e-3], 174 | 'alpha_2': [1e-6, 1e-3], 175 | 'lambda_1': [1e-6, 1e-3], 176 | 'lambda_2': [1e-6, 1e-3], 177 | }), 178 | 179 | (SGDRegressor, 180 | {'loss': ['squared_loss', 'huber'], 181 | **penalty_12e, **n_iter, 182 | 'alpha': [1e-6, 1e-5, 1e-2, 'optimal'], 183 | 'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9], 184 | }), 185 | 186 | (PassiveAggressiveRegressor, 187 | {**C, **n_iter, 188 | }), 189 | 190 | (RANSACRegressor, 191 | {'min_samples': [0.1, 0.5, 0.9, None], 192 | 'max_trials': n_iter['n_iter'], 193 | 'stop_score': [0.8, 1], 194 | 'loss': ['absolute_loss', 'squared_loss'] 195 | }), 196 | 197 | (HuberRegressor, 198 | { **max_iter, **alpha_small, 199 | }), 200 | 201 | (KernelRidge, 202 | {**alpha_small, **degree, 203 | }) 204 | ] 205 | 206 | svm_models_n_params_small = [ 207 | (SVR, 208 | {**kernel, **degree, **shrinking 209 | }), 210 | 211 | (NuSVR, 212 | {**nu_small, **kernel, **degree, **shrinking, 213 | }), 214 | 215 | (LinearSVR, 216 | {**C_small, **epsilon, 217 | 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'], 218 | 'intercept_scaling': [0.1, 1, 10] 219 | }) 220 | ] 221 | 222 | svm_models_n_params = [ 223 | (SVR, 224 | {**C, **epsilon, **kernel, **degree, **gamma, **coef0, **shrinking, **tol, **max_iter_inf2 225 | }), 226 | 227 | (NuSVR, 228 | {**C, **nu, **kernel, **degree, **gamma, **coef0, **shrinking , **tol, **max_iter_inf2 229 | }), 230 | 231 | (LinearSVR, 232 | {**C, **epsilon, **tol, **max_iter, 233 | 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'], 234 | 'intercept_scaling': [0.1, 0.5, 1, 5, 10] 235 | }) 236 | ] 237 | 238 | neighbor_models_n_params = [ 239 | (RadiusNeighborsRegressor, 240 | {**neighbor_radius, **neighbor_algo, **neighbor_leaf_size, **neighbor_metric, 241 | 'weights': ['uniform', 'distance'], 242 | 'p': [1, 2], 243 | }), 244 | 245 | (KNeighborsRegressor, 246 | {**n_neighbors, **neighbor_algo, **neighbor_leaf_size, **neighbor_metric, 247 | 'p': [1, 2], 248 | 'weights': ['uniform', 'distance'], 249 | }) 250 | ] 251 | 252 | gaussianprocess_models_n_params = [ 253 | (GaussianProcessRegressor, 254 | {'kernel': [RBF(), ConstantKernel(), DotProduct(), WhiteKernel()], 255 | 'n_restarts_optimizer': [3], 256 | 'alpha': [1e-10, 1e-5], 257 | 'normalize_y': [True, False] 258 | }) 259 | ] 260 | 261 | nn_models_n_params = [ 262 | (MLPRegressor, 263 | { 'hidden_layer_sizes': [(16,), (64,), (100,), (32, 64)], 264 | 'activation': ['identity', 'logistic', 'tanh', 'relu'], 265 | **alpha, **learning_rate, **tol, **warm_start, 266 | 'batch_size': ['auto', 50], 267 | 'max_iter': [1000], 268 | 'early_stopping': [True, False], 269 | 'epsilon': [1e-8, 1e-5] 270 | }) 271 | ] 272 | 273 | nn_models_n_params_small = [ 274 | (MLPRegressor, 275 | { 'hidden_layer_sizes': [(64,), (32, 64)], 276 | 'activation': ['identity', 'tanh', 'relu'], 277 | 'max_iter': [500], 278 | 'early_stopping': [True], 279 | **learning_rate_small 280 | }) 281 | ] 282 | 283 | tree_models_n_params = [ 284 | 285 | (DecisionTreeRegressor, 286 | {**max_features, **max_depth, **min_samples_split, **min_samples_leaf, **min_impurity_split, 287 | 'criterion': ['mse', 'mae']}), 288 | 289 | (ExtraTreesRegressor, 290 | {**n_estimators, **max_features, **max_depth, **min_samples_split, 291 | **min_samples_leaf, **min_impurity_split, **warm_start, 292 | 'criterion': ['mse', 'mae']}), 293 | 294 | ] 295 | 296 | tree_models_n_params_small = [ 297 | (DecisionTreeRegressor, 298 | {**max_features_small, **max_depth_small, **min_samples_split, **min_samples_leaf, 299 | 'criterion': ['mse', 'mae']}), 300 | 301 
| (ExtraTreesRegressor, 302 | {**n_estimators_small, **max_features_small, **max_depth_small, **min_samples_split, 303 | **min_samples_leaf, 304 | 'criterion': ['mse', 'mae']}) 305 | ] 306 | 307 | def run_linear_models(x, y, small = True, normalize_x = True): 308 | return big_loop(linear_models_n_params_small if small else linear_models_n_params, 309 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 310 | 311 | def run_svm_models(x, y, small = True, normalize_x = True): 312 | return big_loop(svm_models_n_params_small if small else svm_models_n_params, 313 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 314 | 315 | def run_neighbor_models(x, y, normalize_x = True): 316 | return big_loop(neighbor_models_n_params, 317 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 318 | 319 | def run_gaussian_models(x, y, normalize_x = True): 320 | return big_loop(gaussianprocess_models_n_params, 321 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 322 | 323 | def run_nn_models(x, y, small = True, normalize_x = True): 324 | return big_loop(nn_models_n_params_small if small else nn_models_n_params, 325 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 326 | 327 | def run_tree_models(x, y, small = True, normalize_x = True): 328 | return big_loop(tree_models_n_params_small if small else tree_models_n_params, 329 | StandardScaler().fit_transform(x) if normalize_x else x, y, isClassification=False) 330 | 331 | def run_all(x, y, small = True, normalize_x = True, n_jobs=cpu_count()-1): 332 | 333 | all_params = (linear_models_n_params_small if small else linear_models_n_params) + \ 334 | (nn_models_n_params_small if small else nn_models_n_params) + \ 335 | ([] if small else gaussianprocess_models_n_params) + \ 336 | neighbor_models_n_params + \ 337 | (svm_models_n_params_small if small else svm_models_n_params) + \ 338 | (tree_models_n_params_small if small else tree_models_n_params) 339 | 340 | return big_loop(all_params, 341 | StandardScaler().fit_transform(x) if normalize_x else x, y, 342 | isClassification=False, n_jobs=n_jobs) 343 | 344 | 345 | if __name__ == '__main__': 346 | 347 | x, y = gen_reg_data(10, 3, 100, 3, sum, 0.3) 348 | run_all(x, y, small=True, normalize_x=True) 349 | -------------------------------------------------------------------------------- /universal_params.py: -------------------------------------------------------------------------------- 1 | """ 2 | parameter settings used by multiple classifiers/regressors 3 | """ 4 | 5 | import numpy as np 6 | 7 | penalty_12 = {'penalty': ['l1', 'l2']} 8 | penalty_12none = {'penalty': ['l1', 'l2', None]} 9 | penalty_12e = {'penalty': ['l1', 'l2', 'elasticnet']} 10 | penalty_all = {'penalty': ['l1', 'l2', None, 'elasticnet']} 11 | max_iter = {'max_iter': [100, 300, 1000]} 12 | max_iter_inf = {'max_iter': [100, 300, 500, 1000, np.inf]} 13 | max_iter_inf2 = {'max_iter': [100, 300, 500, 1000, -1]} 14 | tol = {'tol': [1e-4, 1e-3, 1e-2]} 15 | warm_start = {'warm_start': [True, False]} 16 | alpha = {'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 3, 10]} 17 | alpha_small = {'alpha': [1e-5, 1e-3, 0.1, 1]} 18 | n_iter = {'n_iter': [5, 10, 20]} 19 | 20 | eta0 = {'eta0': [1e-4, 1e-3, 1e-2, 0.1]} 21 | C = {'C': [1e-2, 0.1, 1, 5, 10]} 22 | C_small = {'C': [ 0.1, 1, 5]} 23 | epsilon = {'epsilon': [1e-3, 1e-2, 0.1, 0]} 24 | normalize = {'normalize': [True, False]} 25 | kernel = {'kernel': 
['linear', 'poly', 'rbf', 'sigmoid']} 26 | degree = {'degree': [1, 2, 3, 4, 5]} 27 | gamma = {'gamma': list(np.logspace(-9, 3, 6)) + ['auto']} 28 | gamma_small = {'gamma': list(np.logspace(-6, 3, 3)) + ['auto']} 29 | coef0 = {'coef0': [0, 0.1, 0.3, 0.5, 0.7, 1]} 30 | coef0_small = {'coef0': [0, 0.4, 0.7, 1]} 31 | shrinking = {'shrinking': [True, False]} 32 | nu = {'nu': [1e-4, 1e-2, 0.1, 0.3, 0.5, 0.75, 0.9]} 33 | nu_small = {'nu': [1e-2, 0.1, 0.5, 0.9]} 34 | 35 | n_neighbors = {'n_neighbors': [5, 7, 10, 15, 20]} 36 | neighbor_algo = {'algorithm': ['ball_tree', 'kd_tree', 'brute']} 37 | neighbor_leaf_size = {'leaf_size': [1, 2, 5, 10, 20, 30, 50, 100]} 38 | neighbor_metric = {'metric': ['cityblock', 'euclidean', 'l1', 'l2', 'manhattan']} 39 | neighbor_radius = {'radius': [1e-2, 0.1, 1, 5, 10]} 40 | learning_rate = {'learning_rate': ['constant', 'invscaling', 'adaptive']} 41 | learning_rate_small = {'learning_rate': ['invscaling', 'adaptive']} 42 | 43 | n_estimators = {'n_estimators': [2, 3, 5, 10, 25, 50, 100]} 44 | n_estimators_small = {'n_estimators': [2, 10, 25, 100]} 45 | max_features = {'max_features': [3, 5, 10, 25, 50, 'auto', 'log2', None]} 46 | max_features_small = {'max_features': [3, 5, 10, 'auto', 'log2', None]} 47 | max_depth = {'max_depth': [None, 3, 5, 7, 10]} 48 | max_depth_small = {'max_depth': [None, 5, 10]} 49 | min_samples_split = {'min_samples_split': [2, 5, 10, 0.1]} 50 | min_impurity_split = {'min_impurity_split': [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]} 51 | tree_learning_rate = {'learning_rate': [0.8, 1]} 52 | min_samples_leaf = {'min_samples_leaf': [2]} 53 | 54 | 55 | -------------------------------------------------------------------------------- /utilities.py: -------------------------------------------------------------------------------- 1 | from pprint import pprint 2 | import numpy as np 3 | nan = float('nan') 4 | from collections import Counter 5 | from multiprocessing import cpu_count 6 | from time import time 7 | from tabulate import tabulate 8 | 9 | from sklearn.cluster import KMeans 10 | from sklearn.model_selection import StratifiedShuffleSplit as sss, ShuffleSplit as ss, GridSearchCV 11 | from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, \ 12 | ExtraTreesClassifier, ExtraTreesRegressor, AdaBoostClassifier, AdaBoostRegressor 13 | from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor 14 | 15 | 16 | TREE_N_ENSEMBLE_MODELS = [RandomForestClassifier, GradientBoostingClassifier, DecisionTreeClassifier, DecisionTreeRegressor, 17 | ExtraTreesClassifier, ExtraTreesRegressor, AdaBoostClassifier, AdaBoostRegressor] 18 | 19 | def upsample_indices_clf(inds, y): 20 | """ 21 | make all classes have the same number of samples. for classification only. 
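Illustrative example (hypothetical toy inputs): with inds = np.array([0, 1, 2, 3]) and y = np.array([0, 0, 0, 1]), the single class-1 sample at index 3 is repeated until both classes have 3 samples, so the returned indices are [0, 1, 2, 3, 3, 3].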
22 | :type inds: numpy array 23 | :type y: numpy array 24 | :return: a numpy array of indices 25 | """ 26 | 27 | assert len(inds) == len(y) 28 | 29 | countByClass = dict(Counter(y)) 30 | maxCount = max(countByClass.values()) 31 | 32 | extras = [] 33 | 34 | for klass, count in countByClass.items(): 35 | if maxCount == count: continue 36 | 37 | ratio = int(maxCount / count) 38 | cur_inds = inds[y == klass] 39 | 40 | extras.append(np.concatenate( 41 | (np.repeat(cur_inds, ratio - 1), 42 | np.random.choice(cur_inds, maxCount - ratio * count, replace=False) 43 | )) 44 | ) 45 | 46 | print('upsampling class %d, %d times' % (klass, ratio-1)) 47 | 48 | return np.concatenate((inds, *extras)) 49 | 50 | 51 | def cv_clf(x, y, 52 | test_size = 0.2, n_splits = 5, random_state=None, 53 | doesUpsample = True): 54 | """ 55 | an iterator of cross-validation groups with upsampling 56 | :param x: 57 | :param y: 58 | :param test_size: 59 | :param n_splits: 60 | :return: 61 | """ 62 | 63 | sss_obj = sss(n_splits, test_size, random_state=random_state).split(x, y) 64 | 65 | # no upsampling needed: pass the splits through unchanged (a bare `return sss_obj` in this generator would yield nothing) 66 | if not doesUpsample: 67 | yield from sss_obj 68 | return 69 | # with upsampling 70 | for train_inds, valid_inds in sss_obj: 71 | yield (upsample_indices_clf(train_inds, y[train_inds]), valid_inds) 72 | 73 | 74 | def cv_reg(x, test_size = 0.2, n_splits = 5, random_state=None): 75 | return ss(n_splits, test_size, random_state=random_state).split(x) 76 | 77 | def timeit(klass, params, x, y): 78 | """ 79 | time in seconds 80 | """ 81 | 82 | start = time() 83 | clf = klass(**params) 84 | clf.fit(x, y) 85 | 86 | return time() - start 87 | 88 | def big_loop(models_n_params, x, y, isClassification, 89 | test_size = 0.2, n_splits = 5, random_state=None, doesUpsample=True, 90 | scoring=None, 91 | verbose=False, n_jobs = cpu_count()-1): 92 | """ 93 | runs through all model classes with their respective hyperparameters 94 | :param models_n_params: [(model class, hyperparameters), ...] 95 | :param isClassification: whether it's a classification or regression problem 96 | :type isClassification: bool 97 | :param scoring: by default 'accuracy' for classification; 'neg_mean_squared_error' for regression 98 | :return: the best estimator, list of [(estimator, cv score), ...]
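Illustrative usage (a sketch, not a doctest; assumes `x`/`y` are numpy arrays and the model/grid list comes from run_classification.py or run_regression.py): winner, results = big_loop(linear_models_n_params_small, x, y, isClassification=True); the returned winner is the refit best estimator from the winning GridSearchCV, so winner.predict(x) works directly.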
99 | """ 100 | 101 | def cv_(): 102 | return cv_clf(x, y, test_size, n_splits, random_state, doesUpsample) \ 103 | if isClassification \ 104 | else cv_reg(x, test_size, n_splits, random_state) 105 | 106 | res = [] 107 | num_features = x.shape[1] 108 | scoring = scoring or ('accuracy' if isClassification else 'neg_mean_squared_error') 109 | print('Scoring criteria:', scoring) 110 | 111 | for i, (clf_Klass, parameters) in enumerate(models_n_params): 112 | try: 113 | print('-'*15, 'model %d/%d' % (i+1, len(models_n_params)), '-'*15) 114 | print(clf_Klass.__name__) 115 | 116 | if clf_Klass == KMeans: 117 | parameters['n_clusters'] = [len(np.unique(y))] 118 | elif clf_Klass in TREE_N_ENSEMBLE_MODELS: 119 | parameters['max_features'] = [v for v in parameters['max_features'] 120 | if v is None or type(v)==str or v<=num_features] 121 | 122 | clf_search = GridSearchCV(clf_Klass(), parameters, scoring, cv=cv_(), n_jobs=n_jobs) 123 | clf_search.fit(x, y) 124 | 125 | timespent = timeit(clf_Klass, clf_search.best_params_, x, y) 126 | print('best score:', clf_search.best_score_, 'time/clf: %0.3f seconds' % timespent) 127 | print('best params:') 128 | pprint(clf_search.best_params_) 129 | 130 | if verbose: 131 | print('validation scores:', clf_search.cv_results_['mean_test_score']) 132 | print('training scores:', clf_search.cv_results_['mean_train_score']) 133 | 134 | res.append((clf_search.best_estimator_, clf_search.best_score_, timespent)) 135 | 136 | except Exception as e: 137 | print('ERROR OCCURRED') 138 | if verbose: print(e) 139 | res.append((clf_Klass(), -np.inf, np.inf)) 140 | 141 | 142 | print('='*60) 143 | print(tabulate([[m.__class__.__name__, '%.3f'%s, '%.3f'%t] for m, s, t in res], headers=['Model', scoring, 'Time/clf (s)'])) 144 | winner_ind = np.argmax([v[1] for v in res]) 145 | winner = res[winner_ind][0] 146 | print('='*60) 147 | print('The winner is: %s with score %0.3f.' % (winner.__class__.__name__, res[winner_ind][1])) 148 | 149 | return winner, res 150 | 151 | 152 | if __name__ == '__main__': 153 | 154 | y = np.array([0,1,0,0,0,3,1,1,3]) 155 | x = np.zeros(len(y)) 156 | 157 | for t, v in cv_reg(x): 158 | print('---------') 159 | print('training inds:', t) 160 | print('valid inds:', v) 161 | 162 | for t, v in cv_clf(x, y, test_size=3): 163 | print('---------') 164 | print('training inds:', t) 165 | print('valid inds:', v) --------------------------------------------------------------------------------
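Putting the pieces together, here is a minimal end-to-end sketch of the USAGE section above (illustrative only: `my_features.npy` and `my_labels.npy` are hypothetical stand-ins for your own data, and the regression half just reuses the repo's fake-data generator):

```python
import numpy as np
from run_classification import run_all as run_all_clf
from run_regression import run_all as run_all_reg, gen_reg_data

# classification: x is a 2-D feature array, y a 1-D label array
x = np.load('my_features.npy')   # hypothetical files -- substitute your own data
y = np.load('my_labels.npy')
best_clf, clf_results = run_all_clf(x, y, small=True, normalize_x=True)

# regression: quick smoke test on the repo's generated fake data
xr, yr = gen_reg_data(num_samples=200, num_features=5)
best_reg, reg_results = run_all_reg(xr, yr, small=True)
```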