├── LICENSE.md
├── README.md
└── XGBRegressor.ipynb

/LICENSE.md:
--------------------------------------------------------------------------------
BSD 3-Clause License

Copyright (c) 2017, Albert Lam
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# XGBRegressor

## Overview
A simple approach to regression problems using Python 2.7, scikit-learn, and XGBoost. The bulk of the code is adapted from [Complete Guide to Parameter Tuning in XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)

[XGBRegressor](https://github.com/albertkklam/XGBRegressor/blob/master/XGBRegressor.ipynb) is a general-purpose notebook for model training using XGBoost. It contains:

* Functions to preprocess a data file into the train and test set dataframes required by XGBoost
* Functions to convert categorical variables into dummies or dense vectors, and to convert string values into Python-compatible strings
* Additional functionality that sends notification updates to a Slack channel of your choice, so that you know when your model has finished training
* An implementation of sequential hyperparameter grid search via the scikit-learn API
* An implementation of early stopping via the Learning API

## Installing XGBoost for Python
Follow the instructions [here](https://github.com/dmlc/xgboost/tree/master/python-package)
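To verify the installation, you can import the package and print its version:

```python
# Confirm the installation: this should print the installed XGBoost version
import xgboost as xgb
print(xgb.__version__)
```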
## Resources
Here are some additional resources if you are looking to explore XGBoost and its various APIs more extensively:

1. [Introduction to Boosted Trees and the XGBoost algorithm](http://xgboost.readthedocs.io/en/latest/model.html)
2. [The Python API documentation for XGBoost](http://xgboost.readthedocs.io/en/latest/python/python_api.html)
3. [Complete Guide to Parameter Tuning in XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
4. [scikit-learn's Gradient Boosting Classifier documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
5. [scikit-learn's GridSearchCV documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
6. [Tong He's XGBoost presentation](https://www.slideshare.net/ShangxuanZhang/xgboost)
--------------------------------------------------------------------------------
/XGBRegressor.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Import libraries\n",
    "\n",
    "We will make extensive use of `pandas`, `XGBoost` and its `scikit-learn` API throughout this demo. `pickle` will be used to save and load model files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import pickle\n",
    "import xgboost as xgb\n",
    "import sklearn.model_selection\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.metrics import confusion_matrix, mean_squared_error\n",
    "import matplotlib\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Slack channel notifications\n",
    "\n",
    "Import `SlackClient` and create a basic function that will post a Slack notification in `channel` when the code has finished running"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from slackclient import SlackClient\n",
    "def slack_message(message, channel):\n",
    "    token = 'your_token'\n",
    "    sc = SlackClient(token)\n",
    "    sc.api_call('chat.postMessage', channel=channel, \n",
    "                text=message, username='My Sweet Bot',\n",
    "                icon_emoji=':upside_down_face:')"
   ]
  },
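  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "An example call is below, commented out since it requires a valid API token; the channel name is a hypothetical placeholder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Post a test message to confirm the token and channel are set up correctly\n",
    "# slack_message('Notebook is up and running!', '#model-training')"
   ]
  },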
113 | "## Preprocess data frames\n", 114 | "\n", 115 | "Parse through `replaceValues` in order to standardise character strings" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "collapsed": true, 123 | "deletable": true, 124 | "editable": true 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "def replaceValues(df):\n", 129 | " df.replace(r'[\\s]','_', inplace = True, regex = True)\n", 130 | " df.replace(r'[\\.]','', inplace = True, regex = True)\n", 131 | " df.replace(r'__','_', inplace = True, regex = True)\n", 132 | "\n", 133 | "replaceValues(data)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": { 139 | "deletable": true, 140 | "editable": true 141 | }, 142 | "source": [ 143 | "## Combine train and test set\n", 144 | "\n", 145 | "Combine `train` and `test` data sets before parsing through one-hot encoder or dense vector encoding. This is especially important for one-hot encoding because we want to maintain the same set of columns across both train and test sets. These can be inconsistent if a particular level of a categorical variable is present in one data set but not the other\n", 146 | "\n", 147 | "* `cat_cols` are categorical columns that will be used in model training\n", 148 | "* `index_cols` are the index columns of the dataframe, which will not be used in model training\n", 149 | "* `pred_cols` are the response variable columns\n", 150 | "* `num_cols` are the numeric columns that will be used in model training" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": { 157 | "collapsed": false, 158 | "deletable": true, 159 | "editable": true 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "cat_cols = ['ATTRIBUTE_1','ATTRIBUTE_2','ATTRIBUTE_3']\n", 164 | "index_cols = ['FACTOR_1','FACTOR_2','FACTOR_3']\n", 165 | "pred_cols = ['RESPONSE']\n", 166 | "\n", 167 | "num_cols = [x for x in list(data.columns.values) if x not in cat_cols if x not in fac_cols if x not in pred_cols]" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": { 173 | "deletable": true, 174 | "editable": true 175 | }, 176 | "source": [ 177 | "## To one-hot encode the categorical variables, run the next cell. 
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "cat_cols = ['ATTRIBUTE_1','ATTRIBUTE_2','ATTRIBUTE_3']\n",
    "index_cols = ['FACTOR_1','FACTOR_2','FACTOR_3']\n",
    "pred_cols = ['RESPONSE']\n",
    "\n",
    "num_cols = [x for x in list(data.columns.values) if x not in cat_cols if x not in index_cols if x not in pred_cols]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## To one-hot encode the categorical variables, run the first cell below. To encode categorical variables as a dense vector, run the second cell instead"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# `combined` refers to the row-wise concatenation of the train and test sets described above\n",
    "# def categoricalCols(indf, cat_var_list):\n",
    "#     for cv in cat_var_list:\n",
    "#         if [i for i, x in enumerate(cat_var_list) if cv == x][0] == 0:\n",
    "#             dummy_df = pd.get_dummies(indf[cv], prefix = cv)\n",
    "#         else:\n",
    "#             dummy_df = pd.concat([dummy_df, pd.get_dummies(indf[cv], prefix = cv)], axis = 1)\n",
    "#     return dummy_df\n",
    "\n",
    "# combined_cat = categoricalCols(combined[cat_cols], cat_cols)\n",
    "# combined_cat.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "data_cat = pd.DataFrame(data[cat_cols])\n",
    "for feature in cat_cols:                      # Loop through all categorical columns\n",
    "    if data_cat[feature].dtype == 'object':   # Only apply to columns with categorical strings\n",
    "        data_cat[feature] = pd.Categorical(data[feature]).codes   # Replace strings with integer codes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Prepare final dataframe before resplitting into train and test sets\n",
    "\n",
    "Importantly, we want to ensure that `train_final` and `test_final` contain the same rows of data as `train` and `test`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "data_num = data[num_cols]\n",
    "data_final = pd.concat([data_cat, data_num], axis=1)\n",
    "data_final['DATE'] = data['DATE']\n",
    "data_final['RESPONSE'] = data['RESPONSE']\n",
    "print data_final.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# 'DATE_SPLIT' is a placeholder for your split date. Use a strict inequality\n",
    "# for the test set so rows on the split date do not appear in both sets\n",
    "train_final = data_final[data_final['DATE'] <= 'DATE_SPLIT']\n",
    "test_final = data_final[data_final['DATE'] > 'DATE_SPLIT']\n",
    "\n",
    "print(train_final.shape)\n",
    "print(test_final.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "train = data[data['DATE'] <= 'DATE_SPLIT']\n",
    "test = data[data['DATE'] > 'DATE_SPLIT']\n",
    "\n",
    "print(train.shape)\n",
    "print(test.shape)"
   ]
  },
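  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "An optional sanity check (not part of the original flow) that the encoded splits line up row-for-row with the raw splits, as required above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# The encoded splits should contain exactly the same rows as the raw splits\n",
    "assert train_final.shape[0] == train.shape[0]\n",
    "assert test_final.shape[0] == test.shape[0]\n",
    "assert (train_final.index == train.index).all()\n",
    "assert (test_final.index == test.index).all()"
   ]
  },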
"x_train = train_final.drop(['RESPONSE','DATE'], axis=1)\n", 304 | "x_test = test_final.drop(['RESPONSE','DATE'], axis=1)\n", 305 | "\n", 306 | "print x_train.columns.values" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": { 312 | "deletable": true, 313 | "editable": true 314 | }, 315 | "source": [ 316 | "## Begin parameter tuning for XGBoost\n", 317 | "\n", 318 | "First, we tune the `max_depth` and `min_child_weight` parameters on a wide range of values. Later, we will refine these two choices with a smaller grid. Note that if you are running this in a Jupyter notebook, you can see the training process in your bash window. We will use the `parameters` dict to store the latest parameter values, and the `scores` vector to store the MSE values" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": { 325 | "collapsed": false, 326 | "deletable": true, 327 | "editable": true 328 | }, 329 | "outputs": [], 330 | "source": [ 331 | "objective = \"reg:linear\"\n", 332 | "seed = 100\n", 333 | "n_estimators = 100\n", 334 | "learning_rate = 0.1\n", 335 | "gamma = 0.1\n", 336 | "subsample = 0.8\n", 337 | "colsample_bytree = 0.8\n", 338 | "reg_alpha = 1\n", 339 | "reg_lambda = 1\n", 340 | "silent = False\n", 341 | "\n", 342 | "parameters = {}\n", 343 | "parameters['objective'] = objective\n", 344 | "parameters['seed'] = seed\n", 345 | "parameters['n_estimators'] = n_estimators\n", 346 | "parameters['learning_rate'] = learning_rate\n", 347 | "parameters['gamma'] = gamma\n", 348 | "parameters['colsample_bytree'] = colsample_bytree\n", 349 | "parameters['reg_alpha'] = reg_alpha\n", 350 | "parameters['reg_lambda'] = reg_lambda\n", 351 | "parameters['silent'] = silent\n", 352 | "\n", 353 | "scores = []\n", 354 | "\n", 355 | "cv_params = {'max_depth': [2,4,6,8],\n", 356 | " 'min_child_weight': [1,3,5,7]\n", 357 | " }\n", 358 | "\n", 359 | "gbm = GridSearchCV(xgb.XGBRegressor(\n", 360 | " objective = objective,\n", 361 | " seed = seed,\n", 362 | " n_estimators = n_estimators,\n", 363 | " learning_rate = learning_rate,\n", 364 | " gamma = gamma,\n", 365 | " subsample = subsample,\n", 366 | " colsample_bytree = colsample_bytree,\n", 367 | " reg_alpha = reg_alpha,\n", 368 | " reg_lambda = reg_lambda,\n", 369 | " silent = silent\n", 370 | "\n", 371 | " ),\n", 372 | " \n", 373 | " param_grid = cv_params,\n", 374 | " iid = False,\n", 375 | " scoring = \"neg_mean_squared_error\",\n", 376 | " cv = 5,\n", 377 | " verbose = True\n", 378 | ")\n", 379 | "\n", 380 | "gbm.fit(x_train,y_train)\n", 381 | "print gbm.cv_results_\n", 382 | "print \"Best parameters %s\" %gbm.best_params_\n", 383 | "print \"Best score %s\" %gbm.best_score_\n", 384 | "slack_message(\"max_depth and min_child_weight parameters tuned! 
moving on to refinement\", 'channel')" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": { 390 | "deletable": true, 391 | "editable": true 392 | }, 393 | "source": [ 394 | "## Refine with a smaller grid of values based on best values from the big grid above" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "metadata": { 401 | "collapsed": false, 402 | "deletable": true, 403 | "editable": true 404 | }, 405 | "outputs": [], 406 | "source": [ 407 | "max_depth = gbm.best_params_['max_depth']\n", 408 | "min_child_weight = gbm.best_params_['min_child_weight']\n", 409 | "parameters['max_depth'] = max_depth\n", 410 | "parameters['min_child_weight'] = min_child_weight\n", 411 | "scores.append(gbm.best_score_)\n", 412 | "\n", 413 | "cv_params = {'max_depth': [max_depth-1, max_depth, max_depth+1], \n", 414 | " 'min_child_weight': [min_child_weight-1, min_child_weight-0.5, min_child_weight, min_child_weight+0.5, min_child_weight+1]\n", 415 | " }\n", 416 | "\n", 417 | "gbm = GridSearchCV(xgb.XGBRegressor(\n", 418 | " objective = objective,\n", 419 | " seed = seed,\n", 420 | " n_estimators = n_estimators,\n", 421 | " learning_rate = learning_rate,\n", 422 | " gamma = gamma,\n", 423 | " subsample = subsample,\n", 424 | " colsample_bytree = colsample_bytree,\n", 425 | " reg_alpha = reg_alpha,\n", 426 | " reg_lambda = reg_lambda,\n", 427 | " silent = silent\n", 428 | "\n", 429 | " ),\n", 430 | " \n", 431 | " param_grid = cv_params,\n", 432 | " iid = False,\n", 433 | " scoring = \"neg_mean_squared_error\",\n", 434 | " cv = 5,\n", 435 | " verbose = True\n", 436 | ")\n", 437 | "\n", 438 | "gbm.fit(x_train,y_train)\n", 439 | "print gbm.cv_results_\n", 440 | "print \"Best parameters %s\" %gbm.best_params_\n", 441 | "print \"Best score %s\" %gbm.best_score_\n", 442 | "slack_message(\"max_depth and min_child_weight parameters refined! 
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Set max_depth and min_child_weight before tuning gamma parameter\n",
    "\n",
    "Set the `max_depth` and `min_child_weight` values based on the above before tuning the `gamma` parameter in a similar fashion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "max_depth = gbm.best_params_['max_depth']\n",
    "min_child_weight = gbm.best_params_['min_child_weight']\n",
    "parameters['max_depth'] = max_depth\n",
    "parameters['min_child_weight'] = min_child_weight\n",
    "scores.append(gbm.best_score_)\n",
    "\n",
    "cv_params = {'gamma': [i/10.0 for i in range(1,10,2)]}\n",
    "\n",
    "gbm = GridSearchCV(xgb.XGBRegressor(\n",
    "                       objective = objective,\n",
    "                       seed = seed,\n",
    "                       n_estimators = n_estimators,\n",
    "                       max_depth = max_depth,\n",
    "                       min_child_weight = min_child_weight,\n",
    "                       learning_rate = learning_rate,\n",
    "                       subsample = subsample,\n",
    "                       colsample_bytree = colsample_bytree,\n",
    "                       reg_alpha = reg_alpha,\n",
    "                       reg_lambda = reg_lambda,\n",
    "                       silent = silent\n",
    "                   ),\n",
    "                   param_grid = cv_params,\n",
    "                   iid = False,\n",
    "                   scoring = \"neg_mean_squared_error\",\n",
    "                   cv = 5,\n",
    "                   verbose = True\n",
    ")\n",
    "\n",
    "gbm.fit(x_train,y_train)\n",
    "print gbm.cv_results_\n",
    "print \"Best parameters %s\" %gbm.best_params_\n",
    "print \"Best score %s\" %gbm.best_score_\n",
    "slack_message(\"gamma tuned! moving on to tuning subsample and colsample_bytree parameters\", 'channel')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Set the `gamma` parameter and tune the `subsample` and `colsample_bytree` parameters next\n",
    "\n",
    "We will look at 10% intervals from 60% to 100% for both `subsample` and `colsample_bytree`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "gamma = gbm.best_params_['gamma']\n",
    "parameters['gamma'] = gamma\n",
    "scores.append(gbm.best_score_)\n",
    "\n",
    "cv_params = {'subsample': [i/10.0 for i in range(6,11)],\n",
    "             'colsample_bytree': [i/10.0 for i in range(6,11)]\n",
    "            }\n",
    "\n",
    "gbm = GridSearchCV(xgb.XGBRegressor(\n",
    "                       objective = objective,\n",
    "                       seed = seed,\n",
    "                       n_estimators = n_estimators,\n",
    "                       max_depth = max_depth,\n",
    "                       min_child_weight = min_child_weight,\n",
    "                       learning_rate = learning_rate,\n",
    "                       gamma = gamma,\n",
    "                       reg_alpha = reg_alpha,\n",
    "                       reg_lambda = reg_lambda,\n",
    "                       silent = silent\n",
    "                   ),\n",
    "                   param_grid = cv_params,\n",
    "                   iid = False,\n",
    "                   scoring = \"neg_mean_squared_error\",\n",
    "                   cv = 5,\n",
    "                   verbose = True\n",
    ")\n",
    "\n",
    "gbm.fit(x_train,y_train)\n",
    "print gbm.cv_results_\n",
    "print \"Best parameters %s\" %gbm.best_params_\n",
    "print \"Best score %s\" %gbm.best_score_\n",
    "slack_message(\"subsample and colsample_bytree parameters tuned! moving on to refinement\", 'channel')"
   ]
  },
moving on to refinement\", 'channel')" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": { 565 | "deletable": true, 566 | "editable": true 567 | }, 568 | "source": [ 569 | "## Retune with a smaller grid of values based on best values from the big grid above\n", 570 | "\n", 571 | "Look at 5% intervals in some range around the best values found previously" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": { 578 | "collapsed": false, 579 | "deletable": true, 580 | "editable": true 581 | }, 582 | "outputs": [], 583 | "source": [ 584 | "subsample = gbm.best_params_['subsample']\n", 585 | "colsample_bytree = gbm.best_params_['colsample_bytree']\n", 586 | "parameters['subsample'] = subsample\n", 587 | "parameters['colsample_bytree'] = colsample_bytree\n", 588 | "scores.append(gbm.best_score_)\n", 589 | "\n", 590 | "cv_params = {'subsample': [i/100.0 for i in range(int((subsample-0.1)*100.0), min(int((subsample+0.1)*100),105) , 5)],\n", 591 | " 'colsample_bytree': [i/100.0 for i in range(int((colsample_bytree-0.1)*100.0), min(int((subsample+0.1)*100),105), 5)]\n", 592 | " }\n", 593 | "\n", 594 | "gbm = GridSearchCV(xgb.XGBRegressor(\n", 595 | " objective = objective,\n", 596 | " seed = seed,\n", 597 | " n_estimators = n_estimators,\n", 598 | " max_depth = max_depth,\n", 599 | " min_child_weight = min_child_weight,\n", 600 | " learning_rate = learning_rate,\n", 601 | " gamma = gamma,\n", 602 | " reg_alpha = reg_alpha,\n", 603 | " reg_lambda = reg_lambda,\n", 604 | " silent = silent\n", 605 | "\n", 606 | " ),\n", 607 | " \n", 608 | " param_grid = cv_params,\n", 609 | " iid = False,\n", 610 | " scoring = \"neg_mean_squared_error\",\n", 611 | " cv = 5,\n", 612 | " verbose = True\n", 613 | ")\n", 614 | "\n", 615 | "gbm.fit(x_train,y_train)\n", 616 | "print gbm.cv_results_\n", 617 | "print \"Best parameters %s\" %gbm.best_params_\n", 618 | "print \"Best score %s\" %gbm.best_score_\n", 619 | "slack_message(\"subsample and colsample_bytree parameters refined! 
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Set the `colsample_bytree` and `subsample` parameters before tuning the `reg_alpha` and `reg_lambda` parameters\n",
    "\n",
    "`reg_alpha` controls L1 regularisation and `reg_lambda` controls L2 regularisation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "colsample_bytree = gbm.best_params_['colsample_bytree']\n",
    "subsample = gbm.best_params_['subsample']\n",
    "parameters['colsample_bytree'] = colsample_bytree\n",
    "parameters['subsample'] = subsample\n",
    "scores.append(gbm.best_score_)\n",
    "\n",
    "cv_params = {'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],\n",
    "             'reg_lambda': [1e-5, 1e-2, 0.1, 1, 100]\n",
    "            }\n",
    "\n",
    "gbm = GridSearchCV(xgb.XGBRegressor(\n",
    "                       objective = objective,\n",
    "                       seed = seed,\n",
    "                       n_estimators = n_estimators,\n",
    "                       max_depth = max_depth,\n",
    "                       min_child_weight = min_child_weight,\n",
    "                       learning_rate = learning_rate,\n",
    "                       gamma = gamma,\n",
    "                       colsample_bytree = colsample_bytree,\n",
    "                       subsample = subsample,\n",
    "                       silent = silent\n",
    "                   ),\n",
    "                   param_grid = cv_params,\n",
    "                   iid = False,\n",
    "                   scoring = \"neg_mean_squared_error\",\n",
    "                   cv = 5,\n",
    "                   verbose = True\n",
    ")\n",
    "\n",
    "gbm.fit(x_train,y_train)\n",
    "print gbm.cv_results_\n",
    "print \"Best parameters %s\" %gbm.best_params_\n",
    "print \"Best score %s\" %gbm.best_score_\n",
    "slack_message(\"alpha and lambda parameters tuned! moving on to refinement\", 'channel')"
   ]
  },
moving on to refinement\", 'channel')" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": { 685 | "deletable": true, 686 | "editable": true 687 | }, 688 | "source": [ 689 | "## Refine parameters on a smaller grid\n", 690 | "\n", 691 | "Look at a smaller grid around the best values found previously " 692 | ] 693 | }, 694 | { 695 | "cell_type": "code", 696 | "execution_count": null, 697 | "metadata": { 698 | "collapsed": false, 699 | "deletable": true, 700 | "editable": true 701 | }, 702 | "outputs": [], 703 | "source": [ 704 | "reg_alpha = gbm.best_params_['reg_alpha']\n", 705 | "reg_lambda = gbm.best_params_['reg_lambda']\n", 706 | "parameters['reg_alpha'] = reg_alpha\n", 707 | "parameters['reg_lambda'] = reg_lambda\n", 708 | "scores.append(gbm.best_score_)\n", 709 | "\n", 710 | "cv_params = {'reg_lambda': [reg_alpha*0.2, reg_alpha*0.5, reg_alpha, reg_alpha*2, reg_alpha*5], \n", 711 | " 'reg_alpha': [reg_lambda*0.2, reg_lambda*0.5, reg_lambda, reg_lambda*2, reg_lambda*5]\n", 712 | " }\n", 713 | "\n", 714 | "gbm = GridSearchCV(xgb.XGBRegressor(\n", 715 | " objective = objective,\n", 716 | " seed = seed,\n", 717 | " n_estimators = n_estimators,\n", 718 | " max_depth = max_depth,\n", 719 | " min_child_weight = min_child_weight,\n", 720 | " learning_rate = learning_rate,\n", 721 | " gamma = gamma,\n", 722 | " colsample_bytree = colsample_bytree,\n", 723 | " subsample = subsample,\n", 724 | " silent = silent\n", 725 | "\n", 726 | " ),\n", 727 | " \n", 728 | " param_grid = cv_params,\n", 729 | " iid = False,\n", 730 | " scoring = \"neg_mean_squared_error\",\n", 731 | " cv = 5,\n", 732 | " verbose = True\n", 733 | ")\n", 734 | "\n", 735 | "gbm.fit(x_train,y_train)\n", 736 | "print gbm.cv_results_\n", 737 | "print \"Best parameters %s\" %gbm.best_params_\n", 738 | "print \"Best score %s\" %gbm.best_score_\n", 739 | "slack_message(\"alpha and lambda parameters refined! finalising model by reducing learning rate and increasing trees\", 'channel')" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": { 745 | "deletable": true, 746 | "editable": true 747 | }, 748 | "source": [ 749 | "## Set regularisation parameters before increasing the number of trees and reducing the learning rate\n", 750 | "\n", 751 | "The idea here is to find a better fit that actually converges based on the optimal parameters values we have found so far" 752 | ] 753 | }, 754 | { 755 | "cell_type": "code", 756 | "execution_count": null, 757 | "metadata": { 758 | "collapsed": true, 759 | "deletable": true, 760 | "editable": true 761 | }, 762 | "outputs": [], 763 | "source": [ 764 | "reg_alpha = gbm.best_params_['reg_alpha']\n", 765 | "reg_lambda = gbm.best_params_['reg_lambda']\n", 766 | "parameters['reg_alpha'] = reg_alpha\n", 767 | "parameters['reg_lambda'] = reg_lambda\n", 768 | "scores.append(gbm.best_score_)" 769 | ] 770 | }, 771 | { 772 | "cell_type": "markdown", 773 | "metadata": { 774 | "deletable": true, 775 | "editable": true 776 | }, 777 | "source": [ 778 | "## Print final parameters used and scores obtained\n", 779 | "\n", 780 | "Importantly, ensure scores are increasing with each iteration. 
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "print parameters\n",
    "print scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Optional: refit via the scikit-learn API with more trees and a lower learning rate\n",
    "# n_estimators = 3000\n",
    "# learning_rate = 0.05\n",
    "\n",
    "# parameters['n_estimators'] = n_estimators\n",
    "# parameters['learning_rate'] = learning_rate\n",
    "\n",
    "# xgbFinal = xgb.XGBRegressor(\n",
    "#     objective = objective,\n",
    "#     seed = seed,\n",
    "#     n_estimators = n_estimators,\n",
    "#     max_depth = max_depth,\n",
    "#     min_child_weight = min_child_weight,\n",
    "#     learning_rate = learning_rate,\n",
    "#     gamma = gamma,\n",
    "#     subsample = subsample,\n",
    "#     colsample_bytree = colsample_bytree,\n",
    "#     reg_alpha = reg_alpha,\n",
    "#     reg_lambda = reg_lambda,\n",
    "#     silent = False\n",
    "# )\n",
    "\n",
    "# xgbFinal.fit(x_train, y_train, eval_set = [(x_train, y_train), (x_test, y_test)], eval_metric = 'rmse', verbose = True)\n",
    "# slack_message(\"Training complete!\", 'channel')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Create XGBoost's DMatrix\n",
    "\n",
    "We will use this for finding the best tree via cross validation, and in the final XGBoost model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "trainDMat = xgb.DMatrix(data = x_train, label = y_train)\n",
    "testDMat = xgb.DMatrix(data = x_test, label = y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Find best tree\n",
    "\n",
    "Lower the `learning_rate` and set a large `num_boost_round` hyperparameter to ensure convergence. If convergence is slow, retry with a slightly higher learning rate (e.g. `0.075` instead of `0.05`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "learning_rate = 0.05\n",
    "parameters['eta'] = learning_rate   # the Learning API reads 'eta'; sklearn-style keys such as 'n_estimators' are not used by it\n",
    "\n",
    "num_boost_round = 3000\n",
    "early_stopping_rounds = 20\n",
    "\n",
    "xgbCV = xgb.cv(\n",
    "    params = parameters,\n",
    "    dtrain = trainDMat,\n",
    "    num_boost_round = num_boost_round,\n",
    "    nfold = 5,\n",
    "    metrics = {'rmse'},\n",
    "    early_stopping_rounds = early_stopping_rounds,\n",
    "    verbose_eval = True,\n",
    "    seed = seed\n",
    ")\n",
    "\n",
    "slack_message(\"Training complete! Producing final booster object\", 'channel')"
   ]
  },
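  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Optionally, inspect the cross-validation results. `xgb.cv` returns a dataframe with columns such as `test-rmse-mean`, and with early stopping the dataframe is truncated at the best round, so the last row shows the best cross-validated RMSE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Inspect the CV results; len(xgbCV) is the best number of boosting rounds\n",
    "print xgbCV.tail(1)\n",
    "print \"Best number of boosting rounds: %d\" % len(xgbCV)"
   ]
  },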
Producing final booster object\", 'channel')" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "metadata": { 903 | "deletable": true, 904 | "editable": true 905 | }, 906 | "source": [ 907 | "## Finalise XGBoost model\n", 908 | "\n", 909 | "Produce the final booster object using the best tree from our cross validation" 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": null, 915 | "metadata": { 916 | "collapsed": true, 917 | "deletable": true, 918 | "editable": true 919 | }, 920 | "outputs": [], 921 | "source": [ 922 | "num_boost_round = len(xgbCV)\n", 923 | "parameters['eval_metric'] = 'rmse'\n", 924 | "\n", 925 | "xgbFinal = xgb.train(\n", 926 | " params = parameters, \n", 927 | " dtrain = trainDMat, \n", 928 | " num_boost_round = num_boost_round,\n", 929 | " evals = [(trainDMat, 'train'), \n", 930 | " (testDMat, 'eval')]\n", 931 | ")\n", 932 | "\n", 933 | "slack_message(\"Booster object created!\", 'channel')" 934 | ] 935 | }, 936 | { 937 | "cell_type": "markdown", 938 | "metadata": { 939 | "deletable": true, 940 | "editable": true 941 | }, 942 | "source": [ 943 | "## Feature importance plot\n", 944 | "\n", 945 | "Plot the feature importance plot to check whether this is making sense before checking optimal parameters and loss function progression" 946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": null, 951 | "metadata": { 952 | "collapsed": false, 953 | "deletable": true, 954 | "editable": true 955 | }, 956 | "outputs": [], 957 | "source": [ 958 | "xgb.plot_importance(xgbFinal)" 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "metadata": { 964 | "deletable": true, 965 | "editable": true 966 | }, 967 | "source": [ 968 | "## Produce predictions for train and test sets before measuring accuracy\n", 969 | "\n", 970 | "Calculate predictions for both train and test sets, and then calculate MSE and RMSE for both datasets" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "metadata": { 977 | "collapsed": false, 978 | "deletable": true, 979 | "editable": true 980 | }, 981 | "outputs": [], 982 | "source": [ 983 | "xgbFinal_train_preds = xgbFinal.predict(x_train)\n", 984 | "xgbFinal_test_preds = xgbFinal.predict(x_test)" 985 | ] 986 | }, 987 | { 988 | "cell_type": "code", 989 | "execution_count": null, 990 | "metadata": { 991 | "collapsed": false, 992 | "deletable": true, 993 | "editable": true 994 | }, 995 | "outputs": [], 996 | "source": [ 997 | "print(xgbFinal_train_preds.shape)\n", 998 | "print(xgbFinal_test_preds.shape)" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "code", 1003 | "execution_count": null, 1004 | "metadata": { 1005 | "collapsed": false, 1006 | "deletable": true, 1007 | "editable": true 1008 | }, 1009 | "outputs": [], 1010 | "source": [ 1011 | "print \"\\nModel Report\"\n", 1012 | "print \"MSE Train : %f\" % mean_squared_error(y_train, xgbFinal_train_preds)\n", 1013 | "print \"MSE Test: %f\" % mean_squared_error(y_test, xgbFinal_test_preds)\n", 1014 | "print \"RMSE Train: %f\" % mean_squared_error(y_train, xgbFinal_train_preds)**0.5\n", 1015 | "print \"RMSE Test: %f\" % mean_squared_error(y_test, xgbFinal_test_preds)**0.5" 1016 | ] 1017 | }, 1018 | { 1019 | "cell_type": "markdown", 1020 | "metadata": { 1021 | "deletable": true, 1022 | "editable": true 1023 | }, 1024 | "source": [ 1025 | "## Save xgb model file and write .csv files to working directory\n", 1026 | "\n", 1027 | "Save xgb model file for future reference. 
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "pickle.dump(xgbFinal, open(\"xgbFinal.pickle.dat\", \"wb\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# The loaded object is a Booster, so predict on the DMatrix objects as above\n",
    "# xgbLoaded = pickle.load(open(\"xgbFinal.pickle.dat\", \"rb\"))\n",
    "# xgbLoaded_train_preds = xgbLoaded.predict(trainDMat)\n",
    "# xgbLoaded_test_preds = xgbLoaded.predict(testDMat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# print \"\\nModel Report\"\n",
    "# print \"MSE Train: %f\" % mean_squared_error(y_train, xgbLoaded_train_preds)\n",
    "# print \"MSE Test: %f\" % mean_squared_error(y_test, xgbLoaded_test_preds)\n",
    "# print \"RMSE Train: %f\" % (mean_squared_error(y_train, xgbLoaded_train_preds) ** 0.5)\n",
    "# print \"RMSE Test: %f\" % (mean_squared_error(y_test, xgbLoaded_test_preds) ** 0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "train_preds = pd.DataFrame(xgbFinal_train_preds)\n",
    "test_preds = pd.DataFrame(xgbFinal_test_preds)\n",
    "train_preds.columns = ['RESPONSE']\n",
    "test_preds.columns = ['RESPONSE']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "train.to_csv('XGBoost Train.csv', sep=',')\n",
    "train_preds.to_csv('XGBoost Train Preds.csv', sep=',')\n",
    "test.to_csv('XGBoost Test.csv', sep=',')\n",
    "test_preds.to_csv('XGBoost Test Preds.csv', sep=',')\n",
    "slack_message(\"Files saved!\", 'channel')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
--------------------------------------------------------------------------------