`_.
80 |
81 |
95 |
96 |
97 |
98 | Installation
99 | ============
100 |
101 | You can install LCE from `PyPI `_ with ``pip``::
102 |
103 | pip install lcensemble
104 |
105 | Or ``conda``::
106 |
107 | conda install -c conda-forge lcensemble
108 |
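To quickly verify the installation, both estimators can be imported (an illustrative check, not part of the original instructions):

.. code-block:: python

    # A successful install exposes the two LCE estimators
    from lce import LCEClassifier, LCERegressor
    print(LCEClassifier(), LCERegressor())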
109 |
110 | Code Examples
111 | =============
112 |
113 | The following examples illustrate the use of LCE on public datasets for classification and regression tasks.
114 | They also demonstrate the compatibility of LCE with scikit-learn pipelines and model selection tools through the use of ``cross_val_score``.
115 | An example of LCE on a dataset containing missing values is also shown.
116 |
117 | Classification
118 | --------------
119 |
120 | - **Example 1: LCE on Iris Dataset**
121 |
122 | .. code-block:: python
123 |
124 | from lce import LCEClassifier
125 | from sklearn.datasets import load_iris
126 | from sklearn.metrics import accuracy_score
127 | from sklearn.model_selection import train_test_split
128 |
129 |
130 | # Load data and generate a train/test split
131 | data = load_iris()
132 | X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
133 |
134 | # Train LCEClassifier with default parameters
135 | clf = LCEClassifier(n_jobs=-1, random_state=0)
136 | clf.fit(X_train, y_train)
137 |
138 | # Make prediction and compute accuracy score
139 | y_pred = clf.predict(X_test)
140 | accuracy = accuracy_score(y_test, y_pred)
141 | print("Accuracy: {:.1f}%".format(accuracy*100))
142 |
143 | .. code-block::
144 |
145 | Accuracy: 97.4%
146 |
147 |
148 | - **Example 2: LCE with scikit-learn cross-validation score**
149 | This example demonstrates the compatibility of LCE with scikit-learn pipelines and model selection tools through the use of ``cross_val_score``.
150 |
151 | .. code-block:: python
152 |
153 | from lce import LCEClassifier
154 | from sklearn.datasets import load_iris
155 | from sklearn.model_selection import cross_val_score, train_test_split
156 |
157 | # Load data
158 | data = load_iris()
159 | X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
160 |
161 | # Set LCEClassifier with default parameters
162 | clf = LCEClassifier(n_jobs=-1, random_state=0)
163 |
164 | # Compute cross-validation scores
165 | cv_scores = cross_val_score(clf, X_train, y_train, cv=3)
166 | cv_scores = [round(elem*100, 1) for elem in cv_scores.tolist()]
167 | print("Cross-validation scores on train set: ", cv_scores)
168 |
169 | .. code-block::
170 |
171 | Cross-validation scores on train set: [94.7, 100.0, 94.6]
172 |
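Because LCE follows the scikit-learn estimator API, it can also be embedded in a scikit-learn ``Pipeline``. The following sketch (an illustrative addition, not one of the original examples) chains a ``StandardScaler`` with ``LCEClassifier`` and evaluates the pipeline with ``cross_val_score``:

.. code-block:: python

    from lce import LCEClassifier
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Build a pipeline with LCE as the final estimator
    X, y = load_iris(return_X_y=True)
    pipe = make_pipeline(StandardScaler(), LCEClassifier(n_jobs=-1, random_state=0))

    # The pipeline behaves like any scikit-learn estimator
    scores = cross_val_score(pipe, X, y, cv=3)
    print("Cross-validation scores: ", [round(s * 100, 1) for s in scores.tolist()])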
173 |
174 | Regression
175 | ----------
176 |
177 | - **Example 3: LCE on Diabetes Dataset**
178 |
179 | .. code-block:: python
180 |
181 | from lce import LCERegressor
182 | from sklearn.datasets import load_diabetes
183 | from sklearn.metrics import mean_squared_error
184 | from sklearn.model_selection import train_test_split
185 |
186 |
187 | # Load data and generate a train/test split
188 | data = load_diabetes()
189 | X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
190 |
191 | # Train LCERegressor with default parameters
192 | reg = LCERegressor(n_jobs=-1, random_state=0)
193 | reg.fit(X_train, y_train)
194 |
195 | # Make prediction
196 | y_pred = reg.predict(X_test)
197 | mse = mean_squared_error(y_test, y_pred)
198 | print("The mean squared error (MSE) on test set: {:.0f}".format(mse))
199 |
200 | .. code-block::
201 |
202 | The mean squared error (MSE) on test set: 3761
203 |
204 |
205 | - **Example 4: LCE with missing values**
206 | This example illustrates the robustness of LCE to missing values. The Diabetes training set is modified so that each variable contains 20% missing values.
207 |
208 | .. code-block:: python
209 |
210 | import numpy as np
211 | from lce import LCERegressor
212 | from sklearn.datasets import load_diabetes
213 | from sklearn.metrics import mean_squared_error
214 | from sklearn.model_selection import train_test_split
215 |
216 |
217 | # Load data and generate a train/test split
218 | data = load_diabetes()
219 | X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
220 |
221 | # Inject 20% of missing values per variable into the train set
222 | np.random.seed(0)
223 | m = 0.2
224 | for j in range(0, X_train.shape[1]):
225 | sub = np.random.choice(X_train.shape[0], int(X_train.shape[0]*m))
226 | X_train[sub, j] = np.nan
227 |
228 | # Train LCERegressor with default parameters
229 | reg = LCERegressor(n_jobs=-1, random_state=0)
230 | reg.fit(X_train, y_train)
231 |
232 | # Make prediction
233 | y_pred = reg.predict(X_test)
234 | mse = mean_squared_error(y_test, y_pred)
235 | print("The mean squared error (MSE) on test set: {:.0f}".format(mse))
236 |
237 | .. code-block::
238 |
239 | The mean squared error (MSE) on test set: 3895
240 |
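The examples above use the default hyperparameters. Since LCE exposes its parameters through the standard scikit-learn interface, they can also be tuned with model selection tools such as ``GridSearchCV``. The sketch below is illustrative only and assumes that ``LCERegressor`` exposes the same ``n_estimators`` and ``max_depth`` parameters as ``LCEClassifier``:

.. code-block:: python

    from lce import LCERegressor
    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import GridSearchCV

    # Tune the ensemble size and the depth of the LCE trees with a small grid
    X, y = load_diabetes(return_X_y=True)
    param_grid = {"n_estimators": [5, 10], "max_depth": [1, 2]}
    search = GridSearchCV(LCERegressor(n_jobs=-1, random_state=0), param_grid, cv=3)
    search.fit(X, y)
    print("Best parameters: ", search.best_params_)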
241 |
242 | Python Source Files
243 | -------------------
244 |
245 |
250 | .. only:: html
251 |
252 | .. figure:: _images/logo_lce.svg
253 | :alt: LCEClassifier on Iris dataset
254 |
255 | :ref:`sphx_glr_auto_examples_lceclassifier_iris.py`
256 |
261 | .. toctree::
262 | :hidden:
263 |
264 | /auto_examples/lceclassifier_iris
265 |
266 |
267 |
272 | .. only:: html
273 |
274 | .. figure:: _images/logo_lce.svg
275 | :alt: LCEClassifier on Iris dataset with scikit-learn cross-validation score
276 |
277 | :ref:`sphx_glr_auto_examples_lceclassifier_iris_cv.py`
278 |
283 | .. toctree::
284 | :hidden:
285 |
286 | /auto_examples/lceclassifier_iris_cv
287 |
288 |
289 |
294 | .. only:: html
295 |
296 | .. figure:: _images/logo_lce.svg
297 | :alt: LCERegressor on Diabetes dataset
298 |
299 | :ref:`sphx_glr_auto_examples_lceregressor_diabetes.py`
300 |
306 | .. toctree::
307 | :hidden:
308 |
309 | /auto_examples/lceregressor_diabetes
310 |
311 |
316 | .. only:: html
317 |
318 | .. figure:: _images/logo_lce.svg
319 | :alt: LCERegressor on Diabetes dataset with missing values
320 |
321 | :ref:`sphx_glr_auto_examples_lceregressor_missing_diabetes.py`
322 |
328 | .. toctree::
329 | :hidden:
330 |
331 | /auto_examples/lceregressor_missing_diabetes
332 |
333 |
334 |
341 | .. only:: html
342 |
343 | .. container:: sphx-glr-footer
344 | :class: sphx-glr-footer-gallery
345 |
346 |
347 | .. container:: sphx-glr-download sphx-glr-download-python
348 |
349 | :download:`Download all examples in Python source code: auto_examples_python.zip `
350 |
351 |
352 |
353 | .. container:: sphx-glr-download sphx-glr-download-jupyter
354 |
355 | :download:`Download all examples in Jupyter notebooks: auto_examples_jupyter.zip `
356 |
357 |
358 | .. only:: html
359 |
360 | .. rst-class:: sphx-glr-signature
361 |
362 | `Gallery generated by Sphinx-Gallery `_
363 |
--------------------------------------------------------------------------------
/lce/_lightgbm.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
3 | from sklearn.metrics import check_scoring
4 | import lightgbm as lgbm
5 |
6 |
7 | def lgbm_opt_classifier(
8 | X,
9 | y,
10 | n_iter=10,
11 | metric="accuracy",
12 | n_estimators=(10, 50, 100),
13 | max_depth=(3, 6, 9),
14 | num_leaves=(20, 50, 100, 500),
15 | learning_rate=(0.01, 0.1, 0.3, 0.5),
16 | boosting_type=("gbdt",),
17 | min_child_weight=(1, 5, 15, 100),
18 | subsample=(1.0,),
19 | subsample_for_bin=(200000,),
20 | colsample_bytree=(1.0,),
21 | reg_alpha=(0,),
22 | reg_lambda=(0.1, 1.0, 5.0),
23 | n_jobs=None,
24 | random_state=None,
25 | ):
26 | """
27 | Get a LightGBM model with the best hyperparameter configuration.
28 |
29 | Parameters
30 | ----------
31 | X : array-like of shape (n_samples, n_features)
32 | The training input samples.
33 |
34 | y : array-like of shape (n_samples,)
35 | The class labels.
36 |
37 | n_iter: int, default=10
38 | Number of iterations to set the hyperparameters of the base classifier (LightGBM)
39 | in Hyperopt.
40 |
41 | metric: string, default="accuracy"
42 | The score of the base classifier (LightGBM) optimized by Hyperopt. Supported metrics
43 | are the ones from `scikit-learn `_.
44 |
45 | n_estimators : tuple, default=(10, 50, 100)
46 | The number of LightGBM estimators. The number of estimators of
47 | LightGBM corresponds to the number of boosting rounds. The tuple provided is
48 | the search space used for the hyperparameter optimization (Hyperopt).
49 |
50 | max_depth : tuple, default=(3, 6, 9)
51 | Maximum tree depth for LightGBM base learners. The tuple provided is the search
52 | space used for the hyperparameter optimization (Hyperopt).
53 |
54 | num_leaves : tuple, default=(20, 50, 100, 500)
55 | Maximum tree leaves. The tuple provided is the search
56 | space used for the hyperparameter optimization (Hyperopt).
57 |
58 | learning_rate : tuple, default=(0.01, 0.1, 0.3, 0.5)
59 | `learning_rate` of LightGBM. The tuple provided is the search space used for the
60 | hyperparameter optimization (Hyperopt).
61 |
62 | boosting_type : ("dart", "gbdt", "rf"), default=("gbdt",)
63 | The type of boosting to use: "dart" Dropouts meet Multiple Additive
64 | Regression Trees; "gbdt" traditional Gradient Boosting Decision Tree; "rf" Random Forest.
65 | The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).
66 |
67 | min_child_weight : tuple, default=(1, 5, 15, 100)
68 | `min_child_weight` of LightGBM. `min_child_weight` defines the
69 | minimum sum of instance weight (hessian) needed in a child. If the tree
70 | partition step results in a leaf node with the sum of instance weight
71 | less than `min_child_weight`, then the building process will give up further
72 | partitioning. The larger `min_child_weight` is, the more conservative LightGBM
73 | algorithm will be. The tuple provided is the search space used for the hyperparameter
74 | optimization (Hyperopt).
75 |
76 | subsample : tuple, default=(1.0,)
77 | LightGBM subsample ratio of the training instances. Setting it to 0.5 means
78 | that LightGBM would randomly sample half of the training data prior to
79 | growing trees, and this will prevent overfitting. Subsampling will occur
80 | once in every boosting iteration. The tuple provided is the search space used for
81 | the hyperparameter optimization (Hyperopt).
82 |
83 | subsample_for_bin : tuple, default=(200000,)
84 | Number of samples for constructing bins. The tuple provided is the
85 | search space used for the hyperparameter optimization (Hyperopt).
86 |
87 | colsample_bytree : tuple, default=(1.0,)
88 | LightGBM subsample ratio of columns when constructing each tree.
89 | Subsampling occurs once for every tree constructed. The tuple provided is the search
90 | space used for the hyperparameter optimization (Hyperopt).
91 |
92 | reg_alpha : tuple, default=(0,)
93 | `reg_alpha` of LightGBM. `reg_alpha` corresponds to the L1 regularization
94 | term on the weights. Increasing this value will make LightGBM model more
95 | conservative. The tuple provided is the search space used for the hyperparameter
96 | optimization (Hyperopt).
97 |
98 | reg_lambda : tuple, default=(0.1, 1.0, 5.0)
99 | `reg_lambda` of LightGBM. `reg_lambda` corresponds to the L2 regularization
100 | term on the weights. Increasing this value will make LightGBM model more
101 | conservative. The tuple provided is the search space used for the hyperparameter
102 | optimization (Hyperopt).
103 |
104 | n_jobs : int, default=None
105 | The number of jobs to run in parallel.
106 | ``n_jobs=None`` means 1. ``n_jobs=-1`` means using all processors.
107 |
108 | random_state : int, RandomState instance or None, default=None
109 | Controls the randomness of the base learner LightGBM and
110 | the Hyperopt algorithm.
111 |
112 | Returns
113 | -------
114 | model: object
115 | LightGBM model with the best configuration and fitted on the input data.
116 | """
117 | # Parameters
118 | classes, y = np.unique(y, return_inverse=True)
119 | n_classes = classes.size
120 |
121 | if n_classes == 2:
122 | objective = "binary"
123 | num_class = 1
124 | else:
125 | objective = "multiclass"
126 | num_class = n_classes
127 |
128 | space = {
129 | "n_estimators": hp.choice("n_estimators", n_estimators),
130 | "max_depth": hp.choice("max_depth", max_depth),
131 | "num_leaves": hp.choice("num_leaves", num_leaves),
132 | "learning_rate": hp.choice("learning_rate", learning_rate),
133 | "boosting_type": hp.choice("boosting_type", boosting_type),
134 | "min_child_weight": hp.choice("min_child_weight", min_child_weight),
135 | "subsample": hp.choice("subsample", subsample),
136 | "subsample_for_bin": hp.choice("subsample_for_bin", subsample_for_bin),
137 | "colsample_bytree": hp.choice("colsample_bytree", colsample_bytree),
138 | "reg_alpha": hp.choice("reg_alpha", reg_alpha),
139 | "reg_lambda": hp.choice("reg_lambda", reg_lambda),
140 | "objective": objective,
141 | "num_class": num_class,
142 | "n_jobs": n_jobs,
143 | "random_state": random_state,
144 | }
145 |
146 | # Get best configuration
147 | def p_model(params):
148 | clf = lgbm.LGBMClassifier(**params, verbose=-1)
149 | clf.fit(X, y)
150 | scorer = check_scoring(clf, scoring=metric)
151 | return scorer(clf, X, y)
152 |
153 | global best
154 | best = -np.inf
155 |
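    # Hyperopt's fmin minimizes the returned loss, so the objective below
    # reports the negative of the best score observed so far.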
156 | def f(params):
157 | global best
158 | perf = p_model(params)
159 | if perf > best:
160 | best = perf
161 | return {"loss": -best, "status": STATUS_OK}
162 |
163 | rstate = np.random.default_rng(random_state)
164 | best_config = fmin(
165 | fn=f,
166 | space=space,
167 | algo=tpe.suggest,
168 | max_evals=n_iter,
169 | trials=Trials(),
170 | rstate=rstate,
171 | verbose=0,
172 | )
173 |
174 | # Fit best model
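    # hp.choice returns the index of the chosen option, so each entry of
    # best_config is mapped back to its value in the original search-space
    # tuples before the final model is fitted.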
175 | final_params = {
176 | "n_estimators": n_estimators[best_config["n_estimators"]],
177 | "max_depth": max_depth[best_config["max_depth"]],
178 | "num_leaves": num_leaves[best_config["num_leaves"]],
179 | "learning_rate": learning_rate[best_config["learning_rate"]],
180 | "boosting_type": boosting_type[best_config["boosting_type"]],
181 | "min_child_weight": min_child_weight[best_config["min_child_weight"]],
182 | "subsample": subsample[best_config["subsample"]],
183 | "subsample_for_bin": subsample_for_bin[best_config["subsample_for_bin"]],
184 | "colsample_bytree": colsample_bytree[best_config["colsample_bytree"]],
185 | "reg_alpha": reg_alpha[best_config["reg_alpha"]],
186 | "reg_lambda": reg_lambda[best_config["reg_lambda"]],
187 | "objective": objective,
188 | "num_class": num_class,
189 | "n_jobs": n_jobs,
190 | "random_state": random_state,
191 | }
192 | clf = lgbm.LGBMClassifier(**final_params, verbose=-1)
193 | return clf.fit(X, y)
194 |
195 |
196 | def lgbm_opt_regressor(
197 | X,
198 | y,
199 | n_iter=10,
200 | metric="neg_mean_squared_error",
201 | n_estimators=(10, 50, 100),
202 | max_depth=(3, 6, 9),
203 | num_leaves=(20, 50, 100, 500),
204 | learning_rate=(0.01, 0.1, 0.3, 0.5),
205 | boosting_type=("gbdt",),
206 | min_child_weight=(1, 5, 15, 100),
207 | subsample=(1.0,),
208 | subsample_for_bin=(200000,),
209 | colsample_bytree=(1.0,),
210 | reg_alpha=(0,),
211 | reg_lambda=(0.1, 1.0, 5.0),
212 | n_jobs=None,
213 | random_state=None,
214 | ):
215 | """
216 | Get a LightGBM model with the best hyperparameter configuration.
217 |
218 | Parameters
219 | ----------
220 | X : array-like of shape (n_samples, n_features)
221 | The training input samples.
222 |
223 | y : array-like of shape (n_samples,)
224 | The target values (real numbers).
225 |
226 | n_iter: int, default=10
227 | Number of iterations to set the hyperparameters of the base regressor (LightGBM)
228 | in Hyperopt.
229 |
230 | metric: string, default="neg_mean_squared_error"
231 | The score of the base regressor (LightGBM) optimized by Hyperopt. Supported metrics
232 | are the ones from `scikit-learn `_.
233 |
234 | n_estimators : tuple, default=(10, 50, 100)
235 | The number of LightGBM estimators. The number of estimators of
236 | LightGBM corresponds to the number of boosting rounds. The tuple provided is
237 | the search space used for the hyperparameter optimization (Hyperopt).
238 |
239 | max_depth : tuple, default=(3, 6, 9)
240 | Maximum tree depth for LightGBM base learners. The tuple provided is the search
241 | space used for the hyperparameter optimization (Hyperopt).
242 |
243 | num_leaves : tuple, default=(20, 50, 100, 500)
244 | Maximum tree leaves. The tuple provided is the search
245 | space used for the hyperparameter optimization (Hyperopt).
246 |
247 | learning_rate : tuple, default=(0.01, 0.1, 0.3, 0.5)
248 | `learning_rate` of LightGBM. The tuple provided is the search space used for the
249 | hyperparameter optimization (Hyperopt).
250 |
251 | boosting_type : ("dart", "gbdt", "rf"), default=("gbdt",)
252 | The type of boosting to use: "dart" Dropouts meet Multiple Additive
253 | Regression Trees; "gbdt" traditional Gradient Boosting Decision Tree; "rf" Random Forest.
254 | The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).
255 |
256 | min_child_weight : tuple, default=(1, 5, 15, 100)
257 | `min_child_weight` of LightGBM. `min_child_weight` defines the
258 | minimum sum of instance weight (hessian) needed in a child. If the tree
259 | partition step results in a leaf node with the sum of instance weight
260 | less than `min_child_weight`, then the building process will give up further
261 | partitioning. The larger `min_child_weight` is, the more conservative LightGBM
262 | algorithm will be. The tuple provided is the search space used for the hyperparameter
263 | optimization (Hyperopt).
264 |
265 | subsample : tuple, default=(1.0,)
266 | LightGBM subsample ratio of the training instances. Setting it to 0.5 means
267 | that LightGBM would randomly sample half of the training data prior to
268 | growing trees, and this will prevent overfitting. Subsampling will occur
269 | once in every boosting iteration. The tuple provided is the search space used for
270 | the hyperparameter optimization (Hyperopt).
271 |
272 | subsample_for_bin : tuple, default=(200000,)
273 | Number of samples for constructing bins. The tuple provided is the
274 | search space used for the hyperparameter optimization (Hyperopt).
275 |
276 | colsample_bytree : tuple, default=(1.0,)
277 | LightGBM subsample ratio of columns when constructing each tree.
278 | Subsampling occurs once for every tree constructed. The tuple provided is the search
279 | space used for the hyperparameter optimization (Hyperopt).
280 |
281 | reg_alpha : tuple, default=(0,)
282 | `reg_alpha` of LightGBM. `reg_alpha` corresponds to the L1 regularization
283 | term on the weights. Increasing this value will make LightGBM model more
284 | conservative. The tuple provided is the search space used for the hyperparameter
285 | optimization (Hyperopt).
286 |
287 | reg_lambda : tuple, default=(0.1, 1.0, 5.0)
288 | `reg_lambda` of LightGBM. `reg_lambda` corresponds to the L2 regularization
289 | term on the weights. Increasing this value will make LightGBM model more
290 | conservative. The tuple provided is the search space used for the hyperparameter
291 | optimization (Hyperopt).
292 |
293 | n_jobs : int, default=None
294 | The number of jobs to run in parallel.
295 | ``n_jobs=None`` means 1. ``n_jobs=-1`` means using all processors.
296 |
297 | random_state : int, RandomState instance or None, default=None
298 | Controls the randomness of the base learner LightGBM and
299 | the Hyperopt algorithm.
300 |
301 | Returns
302 | -------
303 | model: object
304 | LightGBM model with the best configuration and fitted on the input data.
305 | """
306 | space = {
307 | "n_estimators": hp.choice("n_estimators", n_estimators),
308 | "max_depth": hp.choice("max_depth", max_depth),
309 | "num_leaves": hp.choice("num_leaves", num_leaves),
310 | "learning_rate": hp.choice("learning_rate", learning_rate),
311 | "boosting_type": hp.choice("boosting_type", boosting_type),
312 | "min_child_weight": hp.choice("min_child_weight", min_child_weight),
313 | "subsample": hp.choice("subsample", subsample),
314 | "subsample_for_bin": hp.choice("subsample_for_bin", subsample_for_bin),
315 | "colsample_bytree": hp.choice("colsample_bytree", colsample_bytree),
316 | "reg_alpha": hp.choice("reg_alpha", reg_alpha),
317 | "reg_lambda": hp.choice("reg_lambda", reg_lambda),
318 | "objective": "regression",
319 | "n_jobs": n_jobs,
320 | "random_state": random_state,
321 | }
322 |
323 | # Get best configuration
324 | def p_model(params):
325 | reg = lgbm.LGBMRegressor(**params, verbose=-1)
326 | reg.fit(X, y)
327 | scorer = check_scoring(reg, scoring=metric)
328 | return scorer(reg, X, y)
329 |
330 | global best
331 | best = -np.inf
332 |
333 | def f(params):
334 | global best
335 | perf = p_model(params)
336 | if perf > best:
337 | best = perf
338 | return {"loss": -best, "status": STATUS_OK}
339 |
340 | rstate = np.random.default_rng(random_state)
341 | best_config = fmin(
342 | fn=f,
343 | space=space,
344 | algo=tpe.suggest,
345 | max_evals=n_iter,
346 | trials=Trials(),
347 | rstate=rstate,
348 | verbose=0,
349 | )
350 |
351 | # Fit best model
352 | final_params = {
353 | "n_estimators": n_estimators[best_config["n_estimators"]],
354 | "max_depth": max_depth[best_config["max_depth"]],
355 | "num_leaves": num_leaves[best_config["num_leaves"]],
356 | "learning_rate": learning_rate[best_config["learning_rate"]],
357 | "boosting_type": boosting_type[best_config["boosting_type"]],
358 | "min_child_weight": min_child_weight[best_config["min_child_weight"]],
359 | "subsample": subsample[best_config["subsample"]],
360 | "subsample_for_bin": subsample_for_bin[best_config["subsample_for_bin"]],
361 | "colsample_bytree": colsample_bytree[best_config["colsample_bytree"]],
362 | "reg_alpha": reg_alpha[best_config["reg_alpha"]],
363 | "reg_lambda": reg_lambda[best_config["reg_lambda"]],
364 | "objective": "regression",
365 | "n_jobs": n_jobs,
366 | "random_state": random_state,
367 | }
368 | reg = lgbm.LGBMRegressor(**final_params, verbose=-1)
369 | return reg.fit(X, y)
370 |
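
if __name__ == "__main__":
    # Illustrative usage sketch (not part of the original module): tune and fit
    # a LightGBM classifier on a small public dataset with the helper above.
    from sklearn.datasets import load_iris

    X_demo, y_demo = load_iris(return_X_y=True)
    demo_model = lgbm_opt_classifier(X_demo, y_demo, n_iter=3, random_state=0)
    print(demo_model.predict(X_demo[:5]))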
--------------------------------------------------------------------------------
/lce/_xgboost.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
3 | from sklearn.metrics import check_scoring
4 | from sklearn.preprocessing import OneHotEncoder
5 | import xgboost as xgb
6 |
7 |
8 | def xgb_opt_classifier(
9 | X,
10 | y,
11 | n_iter=10,
12 | metric="accuracy",
13 | n_estimators=(10, 50, 100),
14 | max_depth=(3, 6, 9),
15 | learning_rate=(0.01, 0.1, 0.3, 0.5),
16 | booster=("gbtree",),
17 | gamma=(0, 1, 10),
18 | min_child_weight=(1, 5, 15, 100),
19 | subsample=(1.0,),
20 | colsample_bytree=(1.0,),
21 | colsample_bylevel=(1.0,),
22 | colsample_bynode=(1.0,),
23 | reg_alpha=(0,),
24 | reg_lambda=(0.1, 1.0, 5.0),
25 | n_jobs=None,
26 | random_state=None,
27 | ):
28 | """
29 | Get an XGBoost model with the best hyperparameter configuration.
30 |
31 | Parameters
32 | ----------
33 | X : array-like of shape (n_samples, n_features)
34 | The training input samples.
35 |
36 | y : array-like of shape (n_samples,)
37 | The class labels.
38 |
39 | n_iter: int, default=10
40 | Number of iterations to set the hyperparameters of the base classifier (XGBoost)
41 | in Hyperopt.
42 |
43 | metric: string, default="accuracy"
44 | The score of the base classifier (XGBoost) optimized by Hyperopt. Supported metrics
45 | are the ones from `scikit-learn `_.
46 |
47 | n_estimators : tuple, default=(10, 50, 100)
48 | The number of XGBoost estimators. The number of estimators of
49 | XGBoost corresponds to the number of boosting rounds. The tuple provided is
50 | the search space used for the hyperparameter optimization (Hyperopt).
51 |
52 | max_depth : tuple, default=(3, 6, 9)
53 | Maximum tree depth for XGBoost base learners. The tuple provided is the search
54 | space used for the hyperparameter optimization (Hyperopt).
55 |
56 | learning_rate : tuple, default=(0.01, 0.1, 0.3, 0.5)
57 | `learning_rate` of XGBoost. The learning rate corresponds to the
58 | step size shrinkage used in update to prevent overfitting. After each
59 | boosting step, the learning rate shrinks the feature weights to make the boosting
60 | process more conservative. The tuple provided is the search space used for the
61 | hyperparameter optimization (Hyperopt).
62 |
63 | booster : ("dart", "gblinear", "gbtree"), default=("gbtree",)
64 | The type of booster to use. "gbtree" and "dart" use tree based models
65 | while "gblinear" uses linear functions. The tuple provided is the search space used
66 | for the hyperparameter optimization (Hyperopt).
67 |
68 | gamma : tuple, default=(0, 1, 10)
69 | `gamma` of XGBoost. `gamma` corresponds to the minimum loss reduction
70 | required to make a further partition on a leaf node of the tree.
71 | The larger `gamma` is, the more conservative XGBoost algorithm will be.
72 | The tuple provided is the search space used for the hyperparameter optimization
73 | (Hyperopt).
74 |
75 | min_child_weight : tuple, default=(1, 5, 15, 100)
76 | `min_child_weight` of XGBoost. `min_child_weight` defines the
77 | minimum sum of instance weight (hessian) needed in a child. If the tree
78 | partition step results in a leaf node with the sum of instance weight
79 | less than `min_child_weight`, then the building process will give up further
80 | partitioning. The larger `min_child_weight` is, the more conservative XGBoost
81 | algorithm will be. The tuple provided is the search space used for the hyperparameter
82 | optimization (Hyperopt).
83 |
84 | subsample : tuple, default=(1.0,)
85 | XGBoost subsample ratio of the training instances. Setting it to 0.5 means
86 | that XGBoost would randomly sample half of the training data prior to
87 | growing trees, and this will prevent overfitting. Subsampling will occur
88 | once in every boosting iteration. The tuple provided is the search space used for
89 | the hyperparameter optimization (Hyperopt).
90 |
91 | colsample_bytree : tuple, default=(1.0,)
92 | XGBoost subsample ratio of columns when constructing each tree.
93 | Subsampling occurs once for every tree constructed. The tuple provided is the search
94 | space used for the hyperparameter optimization (Hyperopt).
95 |
96 | colsample_bylevel : tuple, default=(1.0,)
97 | XGBoost subsample ratio of columns for each level. Subsampling occurs
98 | once for every new depth level reached in a tree. Columns are subsampled
99 | from the set of columns chosen for the current tree. The tuple provided is the search
100 | space used for the hyperparameter optimization (Hyperopt).
101 |
102 | colsample_bynode : tuple, default=(1.0,)
103 | XGBoost subsample ratio of columns for each node (split). Subsampling
104 | occurs once every time a new split is evaluated. Columns are subsampled
105 | from the set of columns chosen for the current level. The tuple provided is the search
106 | space used for the hyperparameter optimization (Hyperopt).
107 |
108 | reg_alpha : tuple, default=(0,)
109 | `reg_alpha` of XGBoost. `reg_alpha` corresponds to the L1 regularization
110 | term on the weights. Increasing this value will make XGBoost model more
111 | conservative. The tuple provided is the search space used for the hyperparameter
112 | optimization (Hyperopt).
113 |
114 | reg_lambda : tuple, default=(0.1, 1.0, 5.0)
115 | `reg_lambda` of XGBoost. `reg_lambda` corresponds to the L2 regularization
116 | term on the weights. Increasing this value will make XGBoost model more
117 | conservative. The tuple provided is the search space used for the hyperparameter
118 | optimization (Hyperopt).
119 |
120 | n_jobs : int, default=None
121 | The number of jobs to run in parallel.
122 | ``n_jobs=None`` means 1. ``n_jobs=-1`` means using all processors.
123 |
124 | random_state : int, RandomState instance or None, default=None
125 | Controls the randomness of the base learner XGBoost and
126 | the Hyperopt algorithm.
127 |
128 | Returns
129 | -------
130 | model: object
131 | XGBoost model with the best configuration and fitted on the input data.
132 | """
133 | # Parameters
134 | classes, y = np.unique(y, return_inverse=True)
135 | n_classes = classes.size
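    # XGBoost is configured with the multi-class softprob objective for every
    # problem; for a binary task num_class is simply set to 2.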
136 |
137 | space = {
138 | "n_estimators": hp.choice("n_estimators", n_estimators),
139 | "max_depth": hp.choice("max_depth", max_depth),
140 | "learning_rate": hp.choice("learning_rate", learning_rate),
141 | "booster": hp.choice("booster", booster),
142 | "gamma": hp.choice("gamma", gamma),
143 | "min_child_weight": hp.choice("min_child_weight", min_child_weight),
144 | "subsample": hp.choice("subsample", subsample),
145 | "colsample_bytree": hp.choice("colsample_bytree", colsample_bytree),
146 | "colsample_bylevel": hp.choice("colsample_bylevel", colsample_bylevel),
147 | "colsample_bynode": hp.choice("colsample_bynode", colsample_bynode),
148 | "reg_alpha": hp.choice("reg_alpha", reg_alpha),
149 | "reg_lambda": hp.choice("reg_lambda", reg_lambda),
150 | "objective": "multi:softprob",
151 | "num_class": n_classes,
152 | "n_jobs": n_jobs,
153 | "random_state": random_state,
154 | }
155 |
156 | # Get best configuration
157 | def p_model(params):
158 | clf = xgb.XGBClassifier(**params, use_label_encoder=False, verbosity=0)
159 | clf.fit(X, y)
160 | if n_classes == 2:
161 | onehot_encoder = OneHotEncoder(sparse=False)
162 | y_score = onehot_encoder.fit_transform(y.reshape(len(y), 1))
163 | else:
164 | y_score = y
165 | scorer = check_scoring(clf, scoring=metric)
166 | return scorer(clf, X, y_score)
167 |
168 | global best
169 | best = -np.inf
170 |
171 | def f(params):
172 | global best
173 | perf = p_model(params)
174 | if perf > best:
175 | best = perf
176 | return {"loss": -best, "status": STATUS_OK}
177 |
178 | rstate = np.random.default_rng(random_state)
179 | best_config = fmin(
180 | fn=f,
181 | space=space,
182 | algo=tpe.suggest,
183 | max_evals=n_iter,
184 | trials=Trials(),
185 | rstate=rstate,
186 | verbose=0,
187 | )
188 |
189 | # Fit best model
190 | final_params = {
191 | "n_estimators": n_estimators[best_config["n_estimators"]],
192 | "max_depth": max_depth[best_config["max_depth"]],
193 | "learning_rate": learning_rate[best_config["learning_rate"]],
194 | "booster": booster[best_config["booster"]],
195 | "gamma": gamma[best_config["gamma"]],
196 | "min_child_weight": min_child_weight[best_config["min_child_weight"]],
197 | "subsample": subsample[best_config["subsample"]],
198 | "colsample_bytree": colsample_bytree[best_config["colsample_bytree"]],
199 | "colsample_bylevel": colsample_bylevel[best_config["colsample_bylevel"]],
200 | "colsample_bynode": colsample_bynode[best_config["colsample_bynode"]],
201 | "reg_alpha": reg_alpha[best_config["reg_alpha"]],
202 | "reg_lambda": reg_lambda[best_config["reg_lambda"]],
203 | "objective": "multi:softprob",
204 | "num_class": n_classes,
205 | "n_jobs": n_jobs,
206 | "random_state": random_state,
207 | }
208 | clf = xgb.XGBClassifier(**final_params, use_label_encoder=False, verbosity=0)
209 | return clf.fit(X, y)
210 |
211 |
212 | def xgb_opt_regressor(
213 | X,
214 | y,
215 | n_iter=10,
216 | metric="neg_mean_squared_error",
217 | n_estimators=(10, 50, 100),
218 | max_depth=(3, 6, 9),
219 | learning_rate=(0.01, 0.1, 0.3, 0.5),
220 | booster=("gbtree",),
221 | gamma=(0, 1, 10),
222 | min_child_weight=(1, 5, 15, 100),
223 | subsample=(1.0,),
224 | colsample_bytree=(1.0,),
225 | colsample_bylevel=(1.0,),
226 | colsample_bynode=(1.0,),
227 | reg_alpha=(0,),
228 | reg_lambda=(0.1, 1.0, 5.0),
229 | n_jobs=None,
230 | random_state=None,
231 | ):
232 | """
233 | Get an XGBoost model with the best hyperparameter configuration.
234 |
235 | Parameters
236 | ----------
237 | X : array-like of shape (n_samples, n_features)
238 | The training input samples.
239 |
240 | y : array-like of shape (n_samples,)
241 | The target values (real numbers).
242 |
243 | n_iter: int, default=10
244 | Number of iterations to set the hyperparameters of the base regressor (XGBoost)
245 | in Hyperopt.
246 |
247 | metric: string, default="neg_mean_squared_error"
248 | The score of the base regressor (XGBoost) optimized by Hyperopt. Supported metrics
249 | are the ones from `scikit-learn `_.
250 |
251 | n_estimators : tuple, default=(10, 50, 100)
252 | The number of XGBoost estimators. The number of estimators of
253 | XGBoost corresponds to the number of boosting rounds. The tuple provided is
254 | the search space used for the hyperparameter optimization (Hyperopt).
255 |
256 | max_depth : tuple, default=(3, 6, 9)
257 | Maximum tree depth for XGBoost base learners. The tuple provided is the search
258 | space used for the hyperparameter optimization (Hyperopt).
259 |
260 | learning_rate : tuple, default=(0.01, 0.1, 0.3, 0.5)
261 | `learning_rate` of XGBoost. The learning rate corresponds to the
262 | step size shrinkage used in update to prevent overfitting. After each
263 | boosting step, the learning rate shrinks the feature weights to make the boosting
264 | process more conservative. The tuple provided is the search space used for the
265 | hyperparameter optimization (Hyperopt).
266 |
267 | booster : ("dart", "gblinear", "gbtree"), default=("gbtree",)
268 | The type of booster to use. "gbtree" and "dart" use tree based models
269 | while "gblinear" uses linear functions. The tuple provided is the search space used
270 | for the hyperparameter optimization (Hyperopt).
271 |
272 | gamma : tuple, default=(0, 1, 10)
273 | `gamma` of XGBoost. `gamma` corresponds to the minimum loss reduction
274 | required to make a further partition on a leaf node of the tree.
275 | The larger `gamma` is, the more conservative XGBoost algorithm will be.
276 | The tuple provided is the search space used for the hyperparameter optimization
277 | (Hyperopt).
278 |
279 | min_child_weight : tuple, default=(1, 5, 15, 100)
280 | `min_child_weight` of XGBoost. `min_child_weight` defines the
281 | minimum sum of instance weight (hessian) needed in a child. If the tree
282 | partition step results in a leaf node with the sum of instance weight
283 | less than `min_child_weight`, then the building process will give up further
284 | partitioning. The larger `min_child_weight` is, the more conservative XGBoost
285 | algorithm will be. The tuple provided is the search space used for the hyperparameter
286 | optimization (Hyperopt).
287 |
288 | subsample : tuple, default=(1.0,)
289 | XGBoost subsample ratio of the training instances. Setting it to 0.5 means
290 | that XGBoost would randomly sample half of the training data prior to
291 | growing trees, and this will prevent overfitting. Subsampling will occur
292 | once in every boosting iteration. The tuple provided is the search space used for
293 | the hyperparameter optimization (Hyperopt).
294 |
295 | colsample_bytree : tuple, default=(1.0,)
296 | XGBoost subsample ratio of columns when constructing each tree.
297 | Subsampling occurs once for every tree constructed. The tuple provided is the search
298 | space used for the hyperparameter optimization (Hyperopt).
299 |
300 | colsample_bylevel : tuple, default=(1.0,)
301 | XGBoost subsample ratio of columns for each level. Subsampling occurs
302 | once for every new depth level reached in a tree. Columns are subsampled
303 | from the set of columns chosen for the current tree. The tuple provided is the search
304 | space used for the hyperparameter optimization (Hyperopt).
305 |
306 | colsample_bynode : tuple, default=(1.0,)
307 | XGBoost subsample ratio of columns for each node (split). Subsampling
308 | occurs once every time a new split is evaluated. Columns are subsampled
309 | from the set of columns chosen for the current level. The tuple provided is the search
310 | space used for the hyperparameter optimization (Hyperopt).
311 |
312 | reg_alpha : tuple, default=(0,)
313 | `reg_alpha` of XGBoost. `reg_alpha` corresponds to the L1 regularization
314 | term on the weights. Increasing this value will make XGBoost model more
315 | conservative. The tuple provided is the search space used for the hyperparameter
316 | optimization (Hyperopt).
317 |
318 | reg_lambda : tuple, default=(0.1, 1.0, 5.0)
319 | `reg_lambda` of XGBoost. `reg_lambda` corresponds to the L2 regularization
320 | term on the weights. Increasing this value will make XGBoost model more
321 | conservative. The tuple provided is the search space used for the hyperparameter
322 | optimization (Hyperopt).
323 |
324 | n_jobs : int, default=None
325 | The number of jobs to run in parallel.
326 | ``n_jobs=None`` means 1. ``n_jobs=-1`` means using all processors.
327 |
328 | random_state : int, RandomState instance or None, default=None
329 | Controls the randomness of the base learner XGBoost and
330 | the Hyperopt algorithm.
331 |
332 | Returns
333 | -------
334 | model: object
335 | XGBoost model with the best configuration and fitted on the input data.
336 | """
337 | space = {
338 | "n_estimators": hp.choice("n_estimators", n_estimators),
339 | "max_depth": hp.choice("max_depth", max_depth),
340 | "learning_rate": hp.choice("learning_rate", learning_rate),
341 | "booster": hp.choice("booster", booster),
342 | "gamma": hp.choice("gamma", gamma),
343 | "min_child_weight": hp.choice("min_child_weight", min_child_weight),
344 | "subsample": hp.choice("subsample", subsample),
345 | "colsample_bytree": hp.choice("colsample_bytree", colsample_bytree),
346 | "colsample_bylevel": hp.choice("colsample_bylevel", colsample_bylevel),
347 | "colsample_bynode": hp.choice("colsample_bynode", colsample_bynode),
348 | "reg_alpha": hp.choice("reg_alpha", reg_alpha),
349 | "reg_lambda": hp.choice("reg_lambda", reg_lambda),
350 | "objective": "reg:squarederror",
351 | "n_jobs": n_jobs,
352 | "random_state": random_state,
353 | }
354 |
355 | # Get best configuration
356 | def p_model(params):
357 | reg = xgb.XGBRegressor(**params, verbosity=0)
358 | reg.fit(X, y)
359 | scorer = check_scoring(reg, scoring=metric)
360 | return scorer(reg, X, y)
361 |
362 | global best
363 | best = -np.inf
364 |
365 | def f(params):
366 | global best
367 | perf = p_model(params)
368 | if perf > best:
369 | best = perf
370 | return {"loss": -best, "status": STATUS_OK}
371 |
372 | rstate = np.random.default_rng(random_state)
373 | best_config = fmin(
374 | fn=f,
375 | space=space,
376 | algo=tpe.suggest,
377 | max_evals=n_iter,
378 | trials=Trials(),
379 | rstate=rstate,
380 | verbose=0,
381 | )
382 |
383 | # Fit best model
384 | final_params = {
385 | "n_estimators": n_estimators[best_config["n_estimators"]],
386 | "max_depth": max_depth[best_config["max_depth"]],
387 | "learning_rate": learning_rate[best_config["learning_rate"]],
388 | "booster": booster[best_config["booster"]],
389 | "gamma": gamma[best_config["gamma"]],
390 | "min_child_weight": min_child_weight[best_config["min_child_weight"]],
391 | "subsample": subsample[best_config["subsample"]],
392 | "colsample_bytree": colsample_bytree[best_config["colsample_bytree"]],
393 | "colsample_bylevel": colsample_bylevel[best_config["colsample_bylevel"]],
394 | "colsample_bynode": colsample_bynode[best_config["colsample_bynode"]],
395 | "reg_alpha": reg_alpha[best_config["reg_alpha"]],
396 | "reg_lambda": reg_lambda[best_config["reg_lambda"]],
397 | "objective": "reg:squarederror",
398 | "n_jobs": n_jobs,
399 | "random_state": random_state,
400 | }
401 | reg = xgb.XGBRegressor(**final_params, verbosity=0)
402 | return reg.fit(X, y)
403 |
--------------------------------------------------------------------------------
/lce/_lce.py:
--------------------------------------------------------------------------------
1 | import math
2 | import numbers
3 | import numpy as np
4 | from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin
5 | from sklearn.ensemble import BaggingClassifier, BaggingRegressor
6 | from sklearn.preprocessing import LabelEncoder
7 | from sklearn.utils.multiclass import check_classification_targets
8 | from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
9 |
10 | from ._lcetree import LCETreeClassifier, LCETreeRegressor
11 |
12 |
13 | class LCEClassifier(ClassifierMixin, BaseEstimator):
14 | """
15 | A **Local Cascade Ensemble (LCE) classifier**. LCEClassifier is **compatible with scikit-learn**;
16 | it passes the `check_estimator `_.
17 | Therefore, it can interact with scikit-learn pipelines and model selection tools.
18 |
19 |
20 | Parameters
21 | ----------
22 | n_estimators : int, default=10
23 | The number of trees in the ensemble.
24 |
25 | bootstrap : bool, default=True
26 | Whether bootstrap samples are used when building trees. If False, the
27 | whole dataset is used to build each tree.
28 |
29 | criterion : {"gini", "entropy"}, default="gini"
30 | The function to measure the quality of a split. Supported criteria are
31 | "gini" for the Gini impurity and "entropy" for the information gain.
32 |
33 | splitter : {"best", "random"}, default="best"
34 | The strategy used to choose the split at each node. Supported strategies
35 | are "best" to choose the best split and "random" to choose the best random
36 | split.
37 |
38 | max_depth : int, default=2
39 | The maximum depth of a tree.
40 |
41 | max_features : int, float or {"auto", "sqrt", "log2"}, default=None
42 | The number of features to consider when looking for the best split:
43 |
44 | - If int, then consider `max_features` features at each split.
45 | - If float, then `max_features` is a fraction and
46 | `round(max_features * n_features)` features are considered at each
47 | split.
48 | - If "auto", then `max_features=sqrt(n_features)`.
49 | - If "sqrt", then `max_features=sqrt(n_features)` (same as "auto").
50 | - If "log2", then `max_features=log2(n_features)`.
51 | - If None, then `max_features=n_features`.
52 |
53 | Note: the search for a split does not stop until at least one
54 | valid partition of the node samples is found, even if it requires to
55 | effectively inspect more than ``max_features`` features.
56 |
57 | max_samples : int or float, default=1.0
58 | The number of samples to draw from X to train each base estimator
59 | (with replacement by default, see ``bootstrap`` for more details).
60 |
61 | - If int, then draw `max_samples` samples.
62 | - If float, then draw `max_samples * X.shape[0]` samples. Thus, `max_samples` should be in the interval `(0.0, 1.0]`.
63 |
64 | min_samples_leaf : int or float, default=1
65 | The minimum number of samples required to be at a leaf node.
66 | A split point at any depth will only be considered if it leaves at
67 | least ``min_samples_leaf`` training samples in each of the left and
68 | right branches.
69 |
70 | - If int, then consider `min_samples_leaf` as the minimum number.
71 | - If float, then `min_samples_leaf` is a fraction and
72 | `ceil(min_samples_leaf * n_samples)` are the minimum
73 | number of samples for each node.
74 |
75 | n_iter: int, default=10
76 | Number of iterations to set the hyperparameters of each node base
77 | classifier in Hyperopt.
78 |
79 | metric: string, default="accuracy"
80 | The score of the base classifier optimized by Hyperopt. Supported metrics
81 | are the ones from `scikit-learn `_.
82 |
83 | base_learner : {"catboost", "lightgbm", "xgboost"}, default="xgboost"
84 | The base classifier trained in each node of a tree.
85 |
86 | base_n_estimators : tuple, default=(10, 50, 100)
87 | The number of estimators of the base learner. The tuple provided is
88 | the search space used for the hyperparameter optimization (Hyperopt).
89 |
90 | base_max_depth : tuple, default=(3, 6, 9)
91 | Maximum tree depth for base learners. The tuple provided is the search
92 | space used for the hyperparameter optimization (Hyperopt).
93 |
94 | base_num_leaves : tuple, default=(20, 50, 100, 500)
95 | Maximum tree leaves (applicable to LightGBM only). The tuple provided is the search
96 | space used for the hyperparameter optimization (Hyperopt).
97 |
98 | base_learning_rate : tuple, default=(0.01, 0.1, 0.3, 0.5)
99 | `learning_rate` of the base learner. The tuple provided is the search space used for the
100 | hyperparameter optimization (Hyperopt).
101 |
102 | base_booster : ("dart", "gblinear", "gbtree"), default=("gbtree",)
103 | The type of booster to use (applicable to XGBoost only). "gbtree" and "dart" use tree based models
104 | while "gblinear" uses linear functions. The tuple provided is the search space used
105 | for the hyperparameter optimization (Hyperopt).
106 |
107 | base_boosting_type : ("dart", "gbdt", "rf"), default=("gbdt",)
108 | The type of boosting to use (applicable to LightGBM only): "dart" Dropouts meet Multiple Additive
109 | Regression Trees; "gbdt" traditional Gradient Boosting Decision Tree; "rf" Random Forest.
110 | The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).
111 |
112 | base_gamma : tuple, default=(0, 1, 10)
113 | `gamma` of XGBoost. `gamma` corresponds to the minimum loss reduction
114 | required to make a further partition on a leaf node of the tree.
115 | The larger `gamma` is, the more conservative XGBoost algorithm will be.
116 | The tuple provided is the search space used for the hyperparameter optimization
117 | (Hyperopt).
118 |
119 | base_min_child_weight : tuple, default=(1, 5, 15, 100)
120 | `min_child_weight` of base learner (applicable to LightGBM and XGBoost only). `min_child_weight` defines the
121 | minimum sum of instance weight (hessian) needed in a child. If the tree
122 | partition step results in a leaf node with the sum of instance weight
123 | less than `min_child_weight`, then the building process will give up further
124 | partitioning. The larger `min_child_weight` is, the more conservative the base learner
125 | algorithm will be. The tuple provided is the search space used for the hyperparameter
126 | optimization (Hyperopt).
127 |
128 | base_subsample : tuple, default=(1.0,)
129 | Base learner subsample ratio of the training instances (applicable to LightGBM and XGBoost only).
130 | Setting it to 0.5 means that the base learner would randomly sample half of the training data prior to
131 | growing trees, and this will prevent overfitting. Subsampling will occur
132 | once in every boosting iteration. The tuple provided is the search space used for
133 | the hyperparameter optimization (Hyperopt).
134 |
135 | base_subsample_for_bin : tuple, default=(200000,)
136 | Number of samples for constructing bins (applicable to LightGBM only). The tuple provided is the
137 | search space used for the hyperparameter optimization (Hyperopt).
138 |
139 | base_colsample_bytree : tuple, default=(1.0,)
140 | Base learner subsample ratio of columns when constructing each tree (applicable to LightGBM and XGBoost only).
141 | Subsampling occurs once for every tree constructed. The tuple provided is the search
142 | space used for the hyperparameter optimization (Hyperopt).
143 |
144 | base_colsample_bylevel : tuple, default=(1.0,)
145 | Subsample ratio of columns for each level (applicable to CatBoost and XGBoost only). Subsampling occurs
146 | once for every new depth level reached in a tree. Columns are subsampled
147 | from the set of columns chosen for the current tree. The tuple provided is the search
148 | space used for the hyperparameter optimization (Hyperopt).
149 |
150 | base_colsample_bynode : tuple, default=(1.0,)
151 | Subsample ratio of columns for each node split (applicable to XGBoost only). Subsampling
152 | occurs once every time a new split is evaluated. Columns are subsampled
153 | from the set of columns chosen for the current level. The tuple provided is the search
154 | space used for the hyperparameter optimization (Hyperopt).
155 |
156 | base_reg_alpha : tuple, default=(0,)
157 | `reg_alpha` of the base learner (applicable to LightGBM and XGBoost only).
158 | `reg_alpha` corresponds to the L1 regularization term on the weights.
159 | Increasing this value will make the base learner more conservative.
160 | The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).
161 |
162 | base_reg_lambda : tuple, default=(0.1, 1.0, 5.0)
163 | `reg_lambda` of the base learner. `reg_lambda` corresponds to the L2 regularization term
164 | on the weights. Increasing this value will make the base learner more
165 | conservative. The tuple provided is the search space used for the hyperparameter
166 | optimization (Hyperopt).
167 |
168 | n_jobs : int, default=None
169 | The number of jobs to run in parallel.
170 | ``n_jobs=None`` means 1. ``n_jobs=-1`` means using all processors.
171 |
172 | random_state : int, RandomState instance or None, default=None
173 | Controls the randomness of the bootstrapping of the samples used
174 | when building trees (if ``bootstrap=True``), the sampling of the
175 | features to consider when looking for the best split at each node
176 | (if ``max_features < n_features``), the base classifier and
177 | the Hyperopt algorithm.
178 |
179 | verbose : int, default=0
180 | Controls the verbosity when fitting.
181 |
182 | Attributes
183 | ----------
184 | base_estimator_ : LCETreeClassifier
185 | The child estimator template used to create the collection of fitted
186 | sub-estimators.
187 |
188 | estimators_ : list of LCETreeClassifier
189 | The collection of fitted sub-estimators.
190 |
191 | classes_ : ndarray of shape (n_classes,) or a list of such arrays
192 | The class labels.
193 |
194 | n_classes_ : int
195 | The number of classes.
196 |
197 | n_features_in_ : int
198 | The number of features when ``fit`` is performed.
199 |
200 | encoder_ : LabelEncoder
201 | The encoder used to map target labels to values between 0 and n_classes-1.
202 |
203 | Notes
204 | -----
205 | The default values for the parameters controlling the size of the trees
206 | (e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and
207 | unpruned trees which can potentially be very large on some data sets. To
208 | reduce memory consumption, the complexity and size of the trees should be
209 | controlled by setting those parameter values.
210 |
211 | The features are always randomly permuted at each split. Therefore,
212 | the best found split may vary, even with the same training data,
213 | ``max_features=n_features`` and ``bootstrap=False``, if the improvement
214 | of the criterion is identical for several splits enumerated during the
215 | search of the best split. To obtain a deterministic behaviour during
216 | fitting, ``random_state`` has to be fixed.
217 |
218 | References
219 | ----------
220 | .. [1] Fauvel, K., E. Fromont, V. Masson, P. Faverdin and A. Termier. "XEM: An Explainable-by-Design Ensemble Method for Multivariate Time Series Classification", Data Mining and Knowledge Discovery, 36(3):917-957, 2022. https://hal.inria.fr/hal-03599214/document
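
    Examples
    --------
    An illustrative check (added here, not part of the original docstring): since
    LCEClassifier follows the scikit-learn estimator API, it can be validated with
    scikit-learn's ``check_estimator``.

    >>> from sklearn.utils.estimator_checks import check_estimator
    >>> from lce import LCEClassifier
    >>> check_estimator(LCEClassifier())  # doctest: +SKIP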
221 | """
222 |
223 | def __init__(
224 | self,
225 | n_estimators=10,
226 | bootstrap=True,
227 | criterion="gini",
228 | splitter="best",
229 | max_depth=2,
230 | max_features=None,
231 | max_samples=1.0,
232 | min_samples_leaf=1,
233 | n_iter=10,
234 | metric="accuracy",
235 | base_learner="xgboost",
236 | base_n_estimators=(10, 50, 100),
237 | base_max_depth=(3, 6, 9),
238 | base_num_leaves=(20, 50, 100, 500),
239 | base_learning_rate=(0.01, 0.1, 0.3, 0.5),
240 | base_booster=("gbtree",),
241 | base_boosting_type=("gbdt",),
242 | base_gamma=(0, 1, 10),
243 | base_min_child_weight=(1, 5, 15, 100),
244 | base_subsample=(1.0,),
245 | base_subsample_for_bin=(200000,),
246 | base_colsample_bytree=(1.0,),
247 | base_colsample_bylevel=(1.0,),
248 | base_colsample_bynode=(1.0,),
249 | base_reg_alpha=(0,),
250 | base_reg_lambda=(0.1, 1.0, 5.0),
251 | n_jobs=None,
252 | random_state=None,
253 | verbose=0,
254 | ):
255 | self.n_estimators = n_estimators
256 | self.bootstrap = bootstrap
257 | self.criterion = criterion
258 | self.splitter = splitter
259 | self.max_depth = max_depth
260 | self.max_features = max_features
261 | self.max_samples = max_samples
262 | self.min_samples_leaf = min_samples_leaf
263 | self.n_iter = n_iter
264 | self.metric = metric
265 | self.base_learner = base_learner
266 | self.base_n_estimators = base_n_estimators
267 | self.base_max_depth = base_max_depth
268 | self.base_num_leaves = base_num_leaves
269 | self.base_learning_rate = base_learning_rate
270 | self.base_booster = base_booster
271 | self.base_boosting_type = base_boosting_type
272 | self.base_gamma = base_gamma
273 | self.base_min_child_weight = base_min_child_weight
274 | self.base_subsample = base_subsample
275 | self.base_subsample_for_bin = base_subsample_for_bin
276 | self.base_colsample_bytree = base_colsample_bytree
277 | self.base_colsample_bylevel = base_colsample_bylevel
278 | self.base_colsample_bynode = base_colsample_bynode
279 | self.base_reg_alpha = base_reg_alpha
280 | self.base_reg_lambda = base_reg_lambda
281 | self.n_jobs = n_jobs
282 | self.random_state = random_state
283 | self.verbose = verbose
284 |
285 | def _generate_estimator(self):
286 | """Generate an estimator."""
287 | est = LCETreeClassifier()
288 | est.n_classes_in = self.n_classes_
289 | est.criterion = self.criterion
290 | est.splitter = self.splitter
291 | est.max_depth = self.max_depth
292 | est.max_features = self.max_features
293 | est.min_samples_leaf = self.min_samples_leaf
294 | est.n_iter = self.n_iter
295 | est.metric = self.metric
296 | est.base_learner = self.base_learner
297 | est.base_n_estimators = self.base_n_estimators
298 | est.base_max_depth = self.base_max_depth
299 | est.base_num_leaves = self.base_num_leaves
300 | est.base_learning_rate = self.base_learning_rate
301 | est.base_booster = self.base_booster
302 | est.base_boosting_type = self.base_boosting_type
303 | est.base_gamma = self.base_gamma
304 | est.base_min_child_weight = self.base_min_child_weight
305 | est.base_subsample = self.base_subsample
306 | est.base_subsample_for_bin = self.base_subsample_for_bin
307 | est.base_colsample_bytree = self.base_colsample_bytree
308 | est.base_colsample_bylevel = self.base_colsample_bylevel
309 | est.base_colsample_bynode = self.base_colsample_bynode
310 | est.base_reg_alpha = self.base_reg_alpha
311 | est.base_reg_lambda = self.base_reg_lambda
312 | est.n_jobs = self.n_jobs
313 | est.random_state = self.random_state
314 | est.verbose = self.verbose
315 | return est
316 |
317 | def _more_tags(self):
318 | """Update scikit-learn estimator tags."""
319 | return {"allow_nan": True, "requires_y": True}
320 |
321 | def _validate_extra_parameters(self, X):
322 | """Validate parameters not already validated by methods employed."""
323 | # Validate max_depth
324 | if isinstance(self.max_depth, numbers.Integral):
325 | if not (0 <= self.max_depth):
326 | raise ValueError(
327 | "max_depth must be greater than or equal to 0, "
328 | "got {0}.".format(self.max_depth)
329 | )
330 | else:
331 | raise ValueError("max_depth must be int")
332 |
333 | # Validate min_samples_leaf
334 | if isinstance(self.min_samples_leaf, numbers.Integral):
335 | if not 1 <= self.min_samples_leaf:
336 | raise ValueError(
337 | "min_samples_leaf must be at least 1 "
338 | "or in (0, 0.5], got %s" % self.min_samples_leaf
339 | )
340 | elif isinstance(self.min_samples_leaf, float):
341 | if not 0.0 < self.min_samples_leaf <= 0.5:
342 | raise ValueError(
343 | "min_samples_leaf must be at least 1 "
344 | "or in (0, 0.5], got %s" % self.min_samples_leaf
345 | )
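# A float in (0, 0.5] is interpreted as a fraction of the training set,
# e.g. min_samples_leaf=0.1 on 150 samples gives ceil(0.1 * 150) = 15.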
346 | self.min_samples_leaf = int(math.ceil(self.min_samples_leaf * X.shape[0]))
347 | else:
348 | raise ValueError("min_samples_leaf must be int or float")
349 |
350 | # Validate n_iter
351 | if isinstance(self.n_iter, numbers.Integral):
352 | if self.n_iter <= 0:
353 | raise ValueError(
354 | "n_iter must be greater than 0, " "got {0}.".format(self.n_iter)
355 | )
356 | else:
357 | raise ValueError("n_iter must be int")
358 |
359 | # Validate verbose
360 | if isinstance(self.verbose, numbers.Integral):
361 | if self.verbose < 0:
362 | raise ValueError(
363 | "verbose must be greater than or equal to 0, "
364 | "got {0}.".format(self.verbose)
365 | )
366 | else:
367 | raise ValueError("verbose must be int")
368 |
369 | def fit(self, X, y):
370 | """
371 | Build a forest of LCE trees from the training set (X, y).
372 |
373 | Parameters
374 | ----------
375 | X : array-like of shape (n_samples, n_features)
376 | The training input samples.
377 |
378 | y : array-like of shape (n_samples,)
379 | The class labels.
380 |
381 | Returns
382 | -------
383 | self : object
384 | """
385 | X, y = check_X_y(X, y, force_all_finite="allow-nan")
386 | check_classification_targets(y)
387 | self._validate_extra_parameters(X)
388 | self.n_features_in_ = X.shape[1]
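# Lightweight fitted-state flags: predict and predict_proba only check
# that X_ and y_ exist (via check_is_fitted); the training data itself
# is not stored on the estimator.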
389 | self.X_ = True
390 | self.y_ = True
391 | self.classes_, y = np.unique(y, return_inverse=True)
392 | self.n_classes_ = self.classes_.size
393 | self.encoder_ = LabelEncoder()
394 | self.encoder_.fit(self.classes_)
395 | self.base_estimator_ = self._generate_estimator()
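# Bagging of LCE trees: each LCETreeClassifier fits a gradient-boosting
# base learner in every node, and the bagger averages the per-tree
# class probabilities.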
396 | self.estimators_ = BaggingClassifier(
397 | base_estimator=self.base_estimator_,
398 | n_estimators=self.n_estimators,
399 | bootstrap=self.bootstrap,
400 | max_samples=self.max_samples,
401 | n_jobs=self.n_jobs,
402 | random_state=self.random_state,
403 | )
404 | self.estimators_.fit(X, y)
405 | return self
406 |
407 | def predict(self, X):
408 | """
409 | Predict class for X.
410 | The predicted class of an input sample is computed as the class with
411 | the highest mean predicted probability.
412 |
413 | Parameters
414 | ----------
415 | X : array-like of shape (n_samples, n_features)
416 | The input samples.
417 |
418 | Returns
419 | -------
420 | y : ndarray of shape (n_samples,)
421 | The predicted classes.
422 | """
423 | check_is_fitted(self, ["X_", "y_"])
424 | X = check_array(X, force_all_finite="allow-nan")
425 | predictions = self.estimators_.predict(X)
426 | return self.encoder_.inverse_transform(predictions)
427 |
428 | def predict_proba(self, X):
429 | """
430 | Predict class probabilities for X.
431 | The predicted class probabilities of an input sample are computed as
432 | the mean predicted class probabilities of the base estimators in the
433 | ensemble.
434 |
435 | Parameters
436 | ----------
437 | X : array-like of shape (n_samples, n_features)
438 | The input samples.
439 |
440 | Returns
441 | -------
442 | p : ndarray of shape (n_samples, n_classes)
443 | The class probabilities of the input samples. The order of the
444 | classes corresponds to that in the attribute ``classes_``.
445 | """
446 | check_is_fitted(self, ["X_", "y_"])
447 | X = check_array(X, force_all_finite="allow-nan")
448 | return self.estimators_.predict_proba(X)
449 |
450 | def set_params(self, **params):
451 | """
452 | Set the parameters of the estimator.
453 |
454 | Parameters
455 | ----------
456 | **params : dict
457 | Estimator parameters.
458 |
459 | Returns
460 | -------
461 | self : object
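
Examples
--------
A minimal sketch. Note that, in this implementation, only keys matching
an existing attribute are applied; unknown keys are silently ignored
rather than raising an error.

>>> clf = LCEClassifier().set_params(n_estimators=5, max_depth=1)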
462 | """
463 | if not params:
464 | return self
465 |
466 | for key, value in params.items():
467 | if hasattr(self, key):
468 | setattr(self, key, value)
469 |
470 | return self
471 |
472 |
473 | class LCERegressor(RegressorMixin, BaseEstimator):
474 | """
475 | A **Local Cascade Ensemble (LCE) regressor**. LCERegressor is **compatible with scikit-learn**;
476 | it passes the `check_estimator `_.
477 | Therefore, it can interact with scikit-learn pipelines and model selection tools.
478 |
479 |
480 | Parameters
481 | ----------
482 | n_estimators : int, default=10
483 | The number of trees in the ensemble.
484 |
485 | bootstrap : bool, default=True
486 | Whether bootstrap samples are used when building trees. If False, the
487 | whole dataset is used to build each tree.
488 |
489 | criterion : {"squared_error", "friedman_mse", "absolute_error", "poisson"}, default="squared_error"
490 | The function to measure the quality of a split. Supported criteria are "squared_error" for
491 | the mean squared error, which is equal to variance reduction as feature selection
492 | criterion and minimizes the L2 loss using the mean of each terminal node,
493 | "friedman_mse", which uses mean squared error with Friedman's improvement score
494 | for potential splits, "absolute_error" for the mean absolute error, which
495 | minimizes the L1 loss using the median of each terminal node, and "poisson"
496 | which uses reduction in Poisson deviance to find splits.
497 |
498 | splitter : {"best", "random"}, default="best"
499 | The strategy used to choose the split at each node. Supported strategies
500 | are "best" to choose the best split and "random" to choose the best random
501 | split.
502 |
503 | max_depth : int, default=2
504 | The maximum depth of a tree.
505 |
506 | max_features : int, float or {"auto", "sqrt", "log2"}, default=None
507 | The number of features to consider when looking for the best split:
508 |
509 | - If int, then consider `max_features` features at each split.
510 | - If float, then `max_features` is a fraction and
511 | `round(max_features * n_features)` features are considered at each
512 | split.
513 | - If "auto", then `max_features=sqrt(n_features)`.
514 | - If "sqrt", then `max_features=sqrt(n_features)` (same as "auto").
515 | - If "log2", then `max_features=log2(n_features)`.
516 | - If None, then `max_features=n_features`.
517 |
518 | Note: the search for a split does not stop until at least one
519 | valid partition of the node samples is found, even if it requires to
520 | effectively inspect more than ``max_features`` features.
521 |
522 | max_samples : int or float, default=1.0
523 | The number of samples to draw from X to train each base estimator
524 | (with replacement by default, see ``bootstrap`` for more details).
525 |
526 | - If int, then draw `max_samples` samples.
527 | - If float, then draw `max_samples * X.shape[0]` samples. Thus, `max_samples` should be in the interval `(0.0, 1.0]`.
528 |
529 | min_samples_leaf : int or float, default=1
530 | The minimum number of samples required to be at a leaf node.
531 | A split point at any depth will only be considered if it leaves at
532 | least ``min_samples_leaf`` training samples in each of the left and
533 | right branches.
534 |
535 | - If int, then consider `min_samples_leaf` as the minimum number.
536 | - If float, then `min_samples_leaf` is a fraction and
537 | `ceil(min_samples_leaf * n_samples)` are the minimum
538 | number of samples for each node.
539 |
540 | n_iter : int, default=10
541 | Number of iterations to set the hyperparameters of each node base
542 | regressor in Hyperopt.
543 |
544 | metric : str, default="neg_mean_squared_error"
545 | The score of the base regressor optimized by Hyperopt. Supported metrics
546 | are the ones from `scikit-learn `_.
547 |
548 | base_learner : {"catboost", "lightgbm", "xgboost"}, default="xgboost"
549 | The base regressor trained in each node of a tree.
550 |
551 | base_n_estimators : tuple, default=(10, 50, 100)
552 | The number of estimators of the base learner. The tuple provided is
553 | the search space used for the hyperparameter optimization (Hyperopt).
554 |
555 | base_max_depth : tuple, default=(3, 6, 9)
556 | Maximum tree depth for base learners. The tuple provided is the search
557 | space used for the hyperparameter optimization (Hyperopt).
558 |
559 | base_num_leaves : tuple, default=(20, 50, 100, 500)
560 | Maximum tree leaves (applicable to LightGBM only). The tuple provided is the search
561 | space used for the hyperparameter optimization (Hyperopt).
562 |
563 | base_learning_rate : tuple, default=(0.01, 0.1, 0.3, 0.5)
564 | `learning_rate` of the base learner. The tuple provided is the search space used for the
565 | hyperparameter optimization (Hyperopt).
566 |
567 | base_booster : ("dart", "gblinear", "gbtree"), default=("gbtree",)
568 | The type of booster to use (applicable to XGBoost only). "gbtree" and "dart" use tree based models
569 | while "gblinear" uses linear functions. The tuple provided is the search space used
570 | for the hyperparameter optimization (Hyperopt).
571 |
572 | base_boosting_type : ("dart", "gbdt", "rf"), default=("gbdt",)
573 | The type of boosting to use (applicable to LightGBM only): "dart" Dropouts meet Multiple Additive
574 | Regression Trees; "gbdt" traditional Gradient Boosting Decision Tree; "rf" Random Forest.
575 | The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).
576 |
577 | base_gamma : tuple, default=(0, 1, 10)
578 | `gamma` of XGBoost. `gamma` corresponds to the minimum loss reduction
579 | required to make a further partition on a leaf node of the tree.
580 | The larger `gamma` is, the more conservative the XGBoost algorithm will be.
581 | The tuple provided is the search space used for the hyperparameter optimization
582 | (Hyperopt).
583 |
584 | base_min_child_weight : tuple, default=(1, 5, 15, 100)
585 | `min_child_weight` of base learner (applicable to LightGBM and XGBoost only). `min_child_weight` defines the
586 | minimum sum of instance weight (hessian) needed in a child. If the tree
587 | partition step results in a leaf node with the sum of instance weight
588 | less than `min_child_weight`, then the building process will give up further
589 | partitioning. The larger `min_child_weight` is, the more conservative the base learner
590 | algorithm will be. The tuple provided is the search space used for the hyperparameter
591 | optimization (Hyperopt).
592 |
593 | base_subsample : tuple, default=(1.0,)
594 | Base learner subsample ratio of the training instances (applicable to LightGBM and XGBoost only).
595 | Setting it to 0.5 means that the base learner would randomly sample half of the training data prior to
596 | growing trees, which can help prevent overfitting. Subsampling will occur
597 | once in every boosting iteration. The tuple provided is the search space used for
598 | the hyperparameter optimization (Hyperopt).
599 |
600 | base_subsample_for_bin : tuple, default=(200000,)
601 | Number of samples for constructing bins (applicable to LightGBM only). The tuple provided is the
602 | search space used for the hyperparameter optimization (Hyperopt).
603 |
604 | base_colsample_bytree : tuple, default=(1.0,)
605 | Base learner subsample ratio of columns when constructing each tree (applicable to LightGBM and XGBoost only).
606 | Subsampling occurs once for every tree constructed. The tuple provided is the search
607 | space used for the hyperparameter optimization (Hyperopt).
608 |
609 | base_colsample_bylevel : tuple, default=(1.0,)
610 | Subsample ratio of columns for each level (applicable to CatBoost and XGBoost only). Subsampling occurs
611 | once for every new depth level reached in a tree. Columns are subsampled
612 | from the set of columns chosen for the current tree. The tuple provided is the search
613 | space used for the hyperparameter optimization (Hyperopt).
614 |
615 | base_colsample_bynode : tuple, default=(1.0,)
616 | Subsample ratio of columns for each node split (applicable to XGBoost only). Subsampling
617 | occurs once every time a new split is evaluated. Columns are subsampled
618 | from the set of columns chosen for the current level. The tuple provided is the search
619 | space used for the hyperparameter optimization (Hyperopt).
620 |
621 | base_reg_alpha : tuple, default=(0,)
622 | `reg_alpha` of the base learner (applicable to LightGBM and XGBoost only).
623 | `reg_alpha` corresponds to the L1 regularization term on the weights.
624 | Increasing this value will make the base learner more conservative.
625 | The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).
626 |
627 | base_reg_lambda : tuple, default=(0.1, 1.0, 5.0)
628 | `reg_lambda` of the base learner. `reg_lambda` corresponds to the L2 regularization term
629 | on the weights. Increasing this value will make the base learner more
630 | conservative. The tuple provided is the search space used for the hyperparameter
631 | optimization (Hyperopt).
632 |
633 | n_jobs : int, default=None
634 | The number of jobs to run in parallel.
635 | ``n_jobs=None`` means 1. ``n_jobs=-1`` means using all processors.
636 |
637 | random_state : int, RandomState instance or None, default=None
638 | Controls the randomness of the bootstrapping of the samples used
639 | when building trees (if ``bootstrap=True``), the sampling of the
640 | features to consider when looking for the best split at each node
641 | (if ``max_features < n_features``), the base regressor and
642 | the Hyperopt algorithm.
643 |
644 | verbose : int, default=0
645 | Controls the verbosity when fitting.
646 |
647 | Attributes
648 | ----------
649 | base_estimator_ : LCETreeRegressor
650 | The child estimator template used to create the collection of fitted
651 | sub-estimators.
652 |
653 | estimators_ : BaggingRegressor
654 | The fitted bagging ensemble of LCE trees (LCETreeRegressor).
655 |
656 | n_features_in_ : int
657 | The number of features when ``fit`` is performed.
658 |
659 | Notes
660 | -----
661 | The parameters controlling the size of the trees (e.g. ``max_depth``,
662 | ``min_samples_leaf``) determine model complexity: increasing ``max_depth``
663 | in particular can produce very large trees on some data sets. To
664 | reduce memory consumption, the complexity and size of the trees should be
665 | controlled by setting those parameter values.
666 |
667 | The features are always randomly permuted at each split. Therefore,
668 | the best found split may vary, even with the same training data,
669 | ``max_features=n_features`` and ``bootstrap=False``, if the improvement
670 | of the criterion is identical for several splits enumerated during the
671 | search of the best split. To obtain a deterministic behaviour during
672 | fitting, ``random_state`` has to be fixed.
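
Examples
--------
A minimal, illustrative sketch on synthetic data, assuming ``LCERegressor``
is importable from ``lce`` as in the classification examples.
``make_regression`` is used purely as a stand-in dataset, and the custom
``base_n_estimators`` tuple only illustrates a Hyperopt search space;
expected outputs are omitted since exact scores vary.

>>> from lce import LCERegressor
>>> from sklearn.datasets import make_regression
>>> from sklearn.model_selection import train_test_split
>>> X, y = make_regression(n_samples=100, n_features=4, random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> reg = LCERegressor(n_estimators=5, base_n_estimators=(10, 50),
...                    n_jobs=-1, random_state=0)
>>> _ = reg.fit(X_train, y_train)
>>> y_pred = reg.predict(X_test)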
673 | """
674 |
675 | def __init__(
676 | self,
677 | n_estimators=10,
678 | bootstrap=True,
679 | criterion="squared_error",
680 | splitter="best",
681 | max_depth=2,
682 | max_features=None,
683 | max_samples=1.0,
684 | min_samples_leaf=1,
685 | metric="neg_mean_squared_error",
686 | n_iter=10,
687 | base_learner="xgboost",
688 | base_n_estimators=(10, 50, 100),
689 | base_max_depth=(3, 6, 9),
690 | base_num_leaves=(20, 50, 100, 500),
691 | base_learning_rate=(0.01, 0.1, 0.3, 0.5),
692 | base_booster=("gbtree",),
693 | base_boosting_type=("gbdt",),
694 | base_gamma=(0, 1, 10),
695 | base_min_child_weight=(1, 5, 15, 100),
696 | base_subsample=(1.0,),
697 | base_subsample_for_bin=(200000,),
698 | base_colsample_bytree=(1.0,),
699 | base_colsample_bylevel=(1.0,),
700 | base_colsample_bynode=(1.0,),
701 | base_reg_alpha=(0,),
702 | base_reg_lambda=(0.1, 1.0, 5.0),
703 | n_jobs=None,
704 | random_state=None,
705 | verbose=0,
706 | ):
707 | self.n_estimators = n_estimators
708 | self.bootstrap = bootstrap
709 | self.criterion = criterion
710 | self.splitter = splitter
711 | self.max_depth = max_depth
712 | self.max_features = max_features
713 | self.max_samples = max_samples
714 | self.min_samples_leaf = min_samples_leaf
715 | self.n_iter = n_iter
716 | self.metric = metric
717 | self.base_learner = base_learner
718 | self.base_n_estimators = base_n_estimators
719 | self.base_max_depth = base_max_depth
720 | self.base_num_leaves = base_num_leaves
721 | self.base_learning_rate = base_learning_rate
722 | self.base_booster = base_booster
723 | self.base_boosting_type = base_boosting_type
724 | self.base_gamma = base_gamma
725 | self.base_min_child_weight = base_min_child_weight
726 | self.base_subsample = base_subsample
727 | self.base_subsample_for_bin = base_subsample_for_bin
728 | self.base_colsample_bytree = base_colsample_bytree
729 | self.base_colsample_bylevel = base_colsample_bylevel
730 | self.base_colsample_bynode = base_colsample_bynode
731 | self.base_reg_alpha = base_reg_alpha
732 | self.base_reg_lambda = base_reg_lambda
733 | self.n_jobs = n_jobs
734 | self.random_state = random_state
735 | self.verbose = verbose
736 |
737 | def _generate_estimator(self):
738 | """Generate an estimator."""
739 | est = LCETreeRegressor()
740 | est.criterion = self.criterion
741 | est.splitter = self.splitter
742 | est.max_depth = self.max_depth
743 | est.max_features = self.max_features
744 | est.min_samples_leaf = self.min_samples_leaf
745 | est.n_iter = self.n_iter
746 | est.metric = self.metric
747 | est.base_learner = self.base_learner
748 | est.base_n_estimators = self.base_n_estimators
749 | est.base_max_depth = self.base_max_depth
750 | est.base_num_leaves = self.base_num_leaves
751 | est.base_learning_rate = self.base_learning_rate
752 | est.base_booster = self.base_booster
753 | est.base_boosting_type = self.base_boosting_type
754 | est.base_gamma = self.base_gamma
755 | est.base_min_child_weight = self.base_min_child_weight
756 | est.base_subsample = self.base_subsample
757 | est.base_subsample_for_bin = self.base_subsample_for_bin
758 | est.base_colsample_bytree = self.base_colsample_bytree
759 | est.base_colsample_bylevel = self.base_colsample_bylevel
760 | est.base_colsample_bynode = self.base_colsample_bynode
761 | est.base_reg_alpha = self.base_reg_alpha
762 | est.base_reg_lambda = self.base_reg_lambda
763 | est.n_jobs = self.n_jobs
764 | est.random_state = self.random_state
765 | est.verbose = self.verbose
766 | return est
767 |
768 | def _more_tags(self):
769 | """Update scikit-learn estimator tags."""
770 | return {"allow_nan": True, "requires_y": True}
771 |
772 | def _validate_extra_parameters(self, X):
773 | """Validate parameters not already validated by methods employed."""
774 | # Validate max_depth
775 | if isinstance(self.max_depth, numbers.Integral):
776 | if not (0 <= self.max_depth):
777 | raise ValueError(
778 | "max_depth must be greater than or equal to 0, "
779 | "got {0}.".format(self.max_depth)
780 | )
781 | else:
782 | raise ValueError("max_depth must be int")
783 |
784 | # Validate min_samples_leaf
785 | if isinstance(self.min_samples_leaf, numbers.Integral):
786 | if not 1 <= self.min_samples_leaf:
787 | raise ValueError(
788 | "min_samples_leaf must be at least 1 "
789 | "or in (0, 0.5], got %s" % self.min_samples_leaf
790 | )
791 | elif isinstance(self.min_samples_leaf, float):
792 | if not 0.0 < self.min_samples_leaf <= 0.5:
793 | raise ValueError(
794 | "min_samples_leaf must be at least 1 "
795 | "or in (0, 0.5], got %s" % self.min_samples_leaf
796 | )
797 | self.min_samples_leaf = int(math.ceil(self.min_samples_leaf * X.shape[0]))
798 | else:
799 | raise ValueError("min_samples_leaf must be int or float")
800 |
801 | # Validate n_iter
802 | if isinstance(self.n_iter, numbers.Integral):
803 | if self.n_iter <= 0:
804 | raise ValueError(
805 | "n_iter must be greater than 0, " "got {0}.".format(self.n_iter)
806 | )
807 | else:
808 | raise ValueError("n_iter must be int")
809 |
810 | # Validate verbose
811 | if isinstance(self.verbose, numbers.Integral):
812 | if self.verbose < 0:
813 | raise ValueError(
814 | "verbose must be greater than or equal to 0, "
815 | "got {0}.".format(self.verbose)
816 | )
817 | else:
818 | raise ValueError("verbose must be int")
819 |
820 | def fit(self, X, y):
821 | """
822 | Build a forest of LCE trees from the training set (X, y).
823 |
824 | Parameters
825 | ----------
826 | X : array-like of shape (n_samples, n_features)
827 | The training input samples.
828 |
829 | y : array-like of shape (n_samples,)
830 | The target values (real numbers).
831 |
832 | Returns
833 | -------
834 | self : object
835 | """
836 | X, y = check_X_y(X, y, y_numeric=True, force_all_finite="allow-nan")
837 | self._validate_extra_parameters(X)
838 | self.n_features_in_ = X.shape[1]
839 | self.X_ = True
840 | self.y_ = True
841 | self.base_estimator_ = self._generate_estimator()
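# Bagging of LCE trees; BaggingRegressor averages the per-tree predictions.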
842 | self.estimators_ = BaggingRegressor(
843 | base_estimator=self.base_estimator_,
844 | n_estimators=self.n_estimators,
845 | bootstrap=self.bootstrap,
846 | max_samples=self.max_samples,
847 | n_jobs=self.n_jobs,
848 | random_state=self.random_state,
849 | )
850 | self.estimators_.fit(X, y)
851 | return self
852 |
853 | def predict(self, X):
854 | """
855 | Predict regression target for X.
856 | The predicted regression target of an input sample is computed as the
857 | mean predicted regression targets of the trees in the forest.
858 |
859 | Parameters
860 | ----------
861 | X : array-like of shape (n_samples, n_features)
862 | The input samples.
863 |
864 | Returns
865 | -------
866 | y : ndarray of shape (n_samples,)
867 | The predicted values.
868 | """
869 | check_is_fitted(self, ["X_", "y_"])
870 | X = check_array(X, force_all_finite="allow-nan")
871 | return self.estimators_.predict(X)
872 |
873 | def set_params(self, **params):
874 | """
875 | Set the parameters of the estimator.
876 |
877 | Parameters
878 | ----------
879 | **params : dict
880 | Estimator parameters.
881 |
882 | Returns
883 | -------
884 | self : object
885 | """
886 | if not params:
887 | return self
888 |
889 | for key, value in params.items():
890 | if hasattr(self, key):
891 | setattr(self, key, value)
892 |
893 | return self
894 |
--------------------------------------------------------------------------------