├── LICENSE.md
├── README.md
└── XGBRegressor.ipynb

/LICENSE.md:
--------------------------------------------------------------------------------
BSD 3-Clause License

Copyright (c) 2017, Albert Lam
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# XGBRegressor

## Overview
A simple approach to regression problems using Python 2.7, scikit-learn, and XGBoost. The bulk of the code is adapted from [Complete Guide to Parameter Tuning in XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)

[XGBRegressor](https://github.com/albertkklam/XGBRegressor/blob/master/XGBRegressor.ipynb) is a general-purpose notebook for model training using XGBoost. It contains:

* Functions to preprocess a data file into the train and test set dataframes required by XGBoost
* Functions to convert categorical variables into dummies or dense vectors, and to convert string values into Python-compatible strings
* Additional functionality that sends notification updates to a Slack channel of your choice, so that you know when your model has finished training
* An implementation of sequential hyperparameter grid search via the scikit-learn API
* An implementation of early stopping via the Learning API

## Installing XGBoost for Python
Follow the instructions [here](https://github.com/dmlc/xgboost/tree/master/python-package)
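To verify the installation, you can import the package and print its version:

```python
# Confirm the installation: this should print the installed XGBoost version
import xgboost as xgb
print(xgb.__version__)
```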
## Resources
Here are some additional resources if you are looking to explore XGBoost and its various APIs more extensively:

1. [Introduction to Boosted Trees and the XGBoost algorithm](http://xgboost.readthedocs.io/en/latest/model.html)
2. [The Python API documentation for XGBoost](http://xgboost.readthedocs.io/en/latest/python/python_api.html)
3. [Complete Guide to Parameter Tuning in XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
4. [scikit-learn's Gradient Boosting Classifier documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
5. [scikit-learn's GridSearchCV documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
6. [Tong He's XGBoost presentation](https://www.slideshare.net/ShangxuanZhang/xgboost)
--------------------------------------------------------------------------------
/XGBRegressor.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Import libraries\n",
    "\n",
    "We will make extensive use of `pandas`, `XGBoost` and its `scikit-learn` API throughout this demo. `pickle` will be used to save and load model files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import pickle\n",
    "import xgboost as xgb\n",
    "import sklearn.model_selection\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.metrics import confusion_matrix, mean_squared_error\n",
    "import matplotlib\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Slack channel notifications\n",
    "\n",
    "Import `SlackClient` and create a basic function that will post a Slack notification in `channel` when the code has finished running"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from slackclient import SlackClient\n",
    "def slack_message(message, channel):\n",
    "    token = 'your_token'\n",
    "    sc = SlackClient(token)\n",
    "    sc.api_call('chat.postMessage', channel=channel, \n",
    "                text=message, username='My Sweet Bot',\n",
    "                icon_emoji=':upside_down_face:')"
   ]
  },
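  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "An example call is below, commented out since it requires a valid API token; the channel name is a hypothetical placeholder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Post a test message to confirm the token and channel are set up correctly\n",
    "# slack_message('Notebook is up and running!', '#model-training')"
   ]
  },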
113 | "## Preprocess data frames\n", 114 | "\n", 115 | "Parse through `replaceValues` in order to standardise character strings" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": { 122 | "collapsed": true, 123 | "deletable": true, 124 | "editable": true 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "def replaceValues(df):\n", 129 | " df.replace(r'[\\s]','_', inplace = True, regex = True)\n", 130 | " df.replace(r'[\\.]','', inplace = True, regex = True)\n", 131 | " df.replace(r'__','_', inplace = True, regex = True)\n", 132 | "\n", 133 | "replaceValues(data)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": { 139 | "deletable": true, 140 | "editable": true 141 | }, 142 | "source": [ 143 | "## Combine train and test set\n", 144 | "\n", 145 | "Combine `train` and `test` data sets before parsing through one-hot encoder or dense vector encoding. This is especially important for one-hot encoding because we want to maintain the same set of columns across both train and test sets. These can be inconsistent if a particular level of a categorical variable is present in one data set but not the other\n", 146 | "\n", 147 | "* `cat_cols` are categorical columns that will be used in model training\n", 148 | "* `index_cols` are the index columns of the dataframe, which will not be used in model training\n", 149 | "* `pred_cols` are the response variable columns\n", 150 | "* `num_cols` are the numeric columns that will be used in model training" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": { 157 | "collapsed": false, 158 | "deletable": true, 159 | "editable": true 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "cat_cols = ['ATTRIBUTE_1','ATTRIBUTE_2','ATTRIBUTE_3']\n", 164 | "index_cols = ['FACTOR_1','FACTOR_2','FACTOR_3']\n", 165 | "pred_cols = ['RESPONSE']\n", 166 | "\n", 167 | "num_cols = [x for x in list(data.columns.values) if x not in cat_cols if x not in fac_cols if x not in pred_cols]" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": { 173 | "deletable": true, 174 | "editable": true 175 | }, 176 | "source": [ 177 | "## To one-hot encode the categorical variables, run the next cell. 
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "cat_cols = ['ATTRIBUTE_1','ATTRIBUTE_2','ATTRIBUTE_3']\n",
    "index_cols = ['FACTOR_1','FACTOR_2','FACTOR_3']\n",
    "pred_cols = ['RESPONSE']\n",
    "\n",
    "num_cols = [x for x in list(data.columns.values) if x not in cat_cols if x not in index_cols if x not in pred_cols]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## To one-hot encode the categorical variables, run the first cell below. To encode categorical variables as a dense vector, run the second cell instead"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# `combined` refers to the row-wise concatenation of the train and test sets described above\n",
    "# def categoricalCols(indf, cat_var_list):\n",
    "#     for cv in cat_var_list:\n",
    "#         if [i for i, x in enumerate(cat_var_list) if cv == x][0] == 0:\n",
    "#             dummy_df = pd.get_dummies(indf[cv], prefix = cv)\n",
    "#         else:\n",
    "#             dummy_df = pd.concat([dummy_df, pd.get_dummies(indf[cv], prefix = cv)], axis = 1)\n",
    "#     return dummy_df\n",
    "\n",
    "# combined_cat = categoricalCols(combined[cat_cols], cat_cols)\n",
    "# combined_cat.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "data_cat = pd.DataFrame(data[cat_cols])\n",
    "for feature in cat_cols:                      # Loop through all categorical columns\n",
    "    if data_cat[feature].dtype == 'object':   # Only apply to columns with categorical strings\n",
    "        data_cat[feature] = pd.Categorical(data[feature]).codes   # Replace strings with integer codes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Prepare final dataframe before resplitting into train and test sets\n",
    "\n",
    "Importantly, we want to ensure that `train_final` and `test_final` contain the same rows of data as `train` and `test`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "data_num = data[num_cols]\n",
    "data_final = pd.concat([data_cat, data_num], axis=1)\n",
    "data_final['DATE'] = data['DATE']\n",
    "data_final['RESPONSE'] = data['RESPONSE']\n",
    "print data_final.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# 'DATE_SPLIT' is a placeholder for your split date. Use a strict inequality\n",
    "# for the test set so rows on the split date do not appear in both sets\n",
    "train_final = data_final[data_final['DATE'] <= 'DATE_SPLIT']\n",
    "test_final = data_final[data_final['DATE'] > 'DATE_SPLIT']\n",
    "\n",
    "print(train_final.shape)\n",
    "print(test_final.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "train = data[data['DATE'] <= 'DATE_SPLIT']\n",
    "test = data[data['DATE'] > 'DATE_SPLIT']\n",
    "\n",
    "print(train.shape)\n",
    "print(test.shape)"
   ]
  },
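  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "An optional sanity check (not part of the original flow) that the encoded splits line up row-for-row with the raw splits, as required above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# The encoded splits should contain exactly the same rows as the raw splits\n",
    "assert train_final.shape[0] == train.shape[0]\n",
    "assert test_final.shape[0] == test.shape[0]\n",
    "assert (train_final.index == train.index).all()\n",
    "assert (test_final.index == test.index).all()"
   ]
  },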
"x_train = train_final.drop(['RESPONSE','DATE'], axis=1)\n", 304 | "x_test = test_final.drop(['RESPONSE','DATE'], axis=1)\n", 305 | "\n", 306 | "print x_train.columns.values" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": { 312 | "deletable": true, 313 | "editable": true 314 | }, 315 | "source": [ 316 | "## Begin parameter tuning for XGBoost\n", 317 | "\n", 318 | "First, we tune the `max_depth` and `min_child_weight` parameters on a wide range of values. Later, we will refine these two choices with a smaller grid. Note that if you are running this in a Jupyter notebook, you can see the training process in your bash window. We will use the `parameters` dict to store the latest parameter values, and the `scores` vector to store the MSE values" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": { 325 | "collapsed": false, 326 | "deletable": true, 327 | "editable": true 328 | }, 329 | "outputs": [], 330 | "source": [ 331 | "objective = \"reg:linear\"\n", 332 | "seed = 100\n", 333 | "n_estimators = 100\n", 334 | "learning_rate = 0.1\n", 335 | "gamma = 0.1\n", 336 | "subsample = 0.8\n", 337 | "colsample_bytree = 0.8\n", 338 | "reg_alpha = 1\n", 339 | "reg_lambda = 1\n", 340 | "silent = False\n", 341 | "\n", 342 | "parameters = {}\n", 343 | "parameters['objective'] = objective\n", 344 | "parameters['seed'] = seed\n", 345 | "parameters['n_estimators'] = n_estimators\n", 346 | "parameters['learning_rate'] = learning_rate\n", 347 | "parameters['gamma'] = gamma\n", 348 | "parameters['colsample_bytree'] = colsample_bytree\n", 349 | "parameters['reg_alpha'] = reg_alpha\n", 350 | "parameters['reg_lambda'] = reg_lambda\n", 351 | "parameters['silent'] = silent\n", 352 | "\n", 353 | "scores = []\n", 354 | "\n", 355 | "cv_params = {'max_depth': [2,4,6,8],\n", 356 | " 'min_child_weight': [1,3,5,7]\n", 357 | " }\n", 358 | "\n", 359 | "gbm = GridSearchCV(xgb.XGBRegressor(\n", 360 | " objective = objective,\n", 361 | " seed = seed,\n", 362 | " n_estimators = n_estimators,\n", 363 | " learning_rate = learning_rate,\n", 364 | " gamma = gamma,\n", 365 | " subsample = subsample,\n", 366 | " colsample_bytree = colsample_bytree,\n", 367 | " reg_alpha = reg_alpha,\n", 368 | " reg_lambda = reg_lambda,\n", 369 | " silent = silent\n", 370 | "\n", 371 | " ),\n", 372 | " \n", 373 | " param_grid = cv_params,\n", 374 | " iid = False,\n", 375 | " scoring = \"neg_mean_squared_error\",\n", 376 | " cv = 5,\n", 377 | " verbose = True\n", 378 | ")\n", 379 | "\n", 380 | "gbm.fit(x_train,y_train)\n", 381 | "print gbm.cv_results_\n", 382 | "print \"Best parameters %s\" %gbm.best_params_\n", 383 | "print \"Best score %s\" %gbm.best_score_\n", 384 | "slack_message(\"max_depth and min_child_weight parameters tuned! 
moving on to refinement\", 'channel')" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": { 390 | "deletable": true, 391 | "editable": true 392 | }, 393 | "source": [ 394 | "## Refine with a smaller grid of values based on best values from the big grid above" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "metadata": { 401 | "collapsed": false, 402 | "deletable": true, 403 | "editable": true 404 | }, 405 | "outputs": [], 406 | "source": [ 407 | "max_depth = gbm.best_params_['max_depth']\n", 408 | "min_child_weight = gbm.best_params_['min_child_weight']\n", 409 | "parameters['max_depth'] = max_depth\n", 410 | "parameters['min_child_weight'] = min_child_weight\n", 411 | "scores.append(gbm.best_score_)\n", 412 | "\n", 413 | "cv_params = {'max_depth': [max_depth-1, max_depth, max_depth+1], \n", 414 | " 'min_child_weight': [min_child_weight-1, min_child_weight-0.5, min_child_weight, min_child_weight+0.5, min_child_weight+1]\n", 415 | " }\n", 416 | "\n", 417 | "gbm = GridSearchCV(xgb.XGBRegressor(\n", 418 | " objective = objective,\n", 419 | " seed = seed,\n", 420 | " n_estimators = n_estimators,\n", 421 | " learning_rate = learning_rate,\n", 422 | " gamma = gamma,\n", 423 | " subsample = subsample,\n", 424 | " colsample_bytree = colsample_bytree,\n", 425 | " reg_alpha = reg_alpha,\n", 426 | " reg_lambda = reg_lambda,\n", 427 | " silent = silent\n", 428 | "\n", 429 | " ),\n", 430 | " \n", 431 | " param_grid = cv_params,\n", 432 | " iid = False,\n", 433 | " scoring = \"neg_mean_squared_error\",\n", 434 | " cv = 5,\n", 435 | " verbose = True\n", 436 | ")\n", 437 | "\n", 438 | "gbm.fit(x_train,y_train)\n", 439 | "print gbm.cv_results_\n", 440 | "print \"Best parameters %s\" %gbm.best_params_\n", 441 | "print \"Best score %s\" %gbm.best_score_\n", 442 | "slack_message(\"max_depth and min_child_weight parameters refined! 
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Set max_depth and min_child_weight before tuning gamma parameter\n",
    "\n",
    "Set the `max_depth` and `min_child_weight` values based on the above before tuning the `gamma` parameter in a similar fashion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "max_depth = gbm.best_params_['max_depth']\n",
    "min_child_weight = gbm.best_params_['min_child_weight']\n",
    "parameters['max_depth'] = max_depth\n",
    "parameters['min_child_weight'] = min_child_weight\n",
    "scores.append(gbm.best_score_)\n",
    "\n",
    "cv_params = {'gamma': [i/10.0 for i in range(1,10,2)]}\n",
    "\n",
    "gbm = GridSearchCV(xgb.XGBRegressor(\n",
    "                       objective = objective,\n",
    "                       seed = seed,\n",
    "                       n_estimators = n_estimators,\n",
    "                       max_depth = max_depth,\n",
    "                       min_child_weight = min_child_weight,\n",
    "                       learning_rate = learning_rate,\n",
    "                       subsample = subsample,\n",
    "                       colsample_bytree = colsample_bytree,\n",
    "                       reg_alpha = reg_alpha,\n",
    "                       reg_lambda = reg_lambda,\n",
    "                       silent = silent\n",
    "                   ),\n",
    "                   param_grid = cv_params,\n",
    "                   iid = False,\n",
    "                   scoring = \"neg_mean_squared_error\",\n",
    "                   cv = 5,\n",
    "                   verbose = True\n",
    ")\n",
    "\n",
    "gbm.fit(x_train,y_train)\n",
    "print gbm.cv_results_\n",
    "print \"Best parameters %s\" %gbm.best_params_\n",
    "print \"Best score %s\" %gbm.best_score_\n",
    "slack_message(\"gamma tuned! moving on to tuning subsample and colsample_bytree parameters\", 'channel')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Set the `gamma` parameter and tune the `subsample` and `colsample_bytree` parameters next\n",
    "\n",
    "We will look at 10% intervals from 60% to 100% for both `subsample` and `colsample_bytree`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "gamma = gbm.best_params_['gamma']\n",
    "parameters['gamma'] = gamma\n",
    "scores.append(gbm.best_score_)\n",
    "\n",
    "cv_params = {'subsample': [i/10.0 for i in range(6,11)],\n",
    "             'colsample_bytree': [i/10.0 for i in range(6,11)]\n",
    "            }\n",
    "\n",
    "gbm = GridSearchCV(xgb.XGBRegressor(\n",
    "                       objective = objective,\n",
    "                       seed = seed,\n",
    "                       n_estimators = n_estimators,\n",
    "                       max_depth = max_depth,\n",
    "                       min_child_weight = min_child_weight,\n",
    "                       learning_rate = learning_rate,\n",
    "                       gamma = gamma,\n",
    "                       reg_alpha = reg_alpha,\n",
    "                       reg_lambda = reg_lambda,\n",
    "                       silent = silent\n",
    "                   ),\n",
    "                   param_grid = cv_params,\n",
    "                   iid = False,\n",
    "                   scoring = \"neg_mean_squared_error\",\n",
    "                   cv = 5,\n",
    "                   verbose = True\n",
    ")\n",
    "\n",
    "gbm.fit(x_train,y_train)\n",
    "print gbm.cv_results_\n",
    "print \"Best parameters %s\" %gbm.best_params_\n",
    "print \"Best score %s\" %gbm.best_score_\n",
    "slack_message(\"subsample and colsample_bytree parameters tuned! moving on to refinement\", 'channel')"
   ]
  },
moving on to refinement\", 'channel')" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": { 565 | "deletable": true, 566 | "editable": true 567 | }, 568 | "source": [ 569 | "## Retune with a smaller grid of values based on best values from the big grid above\n", 570 | "\n", 571 | "Look at 5% intervals in some range around the best values found previously" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": { 578 | "collapsed": false, 579 | "deletable": true, 580 | "editable": true 581 | }, 582 | "outputs": [], 583 | "source": [ 584 | "subsample = gbm.best_params_['subsample']\n", 585 | "colsample_bytree = gbm.best_params_['colsample_bytree']\n", 586 | "parameters['subsample'] = subsample\n", 587 | "parameters['colsample_bytree'] = colsample_bytree\n", 588 | "scores.append(gbm.best_score_)\n", 589 | "\n", 590 | "cv_params = {'subsample': [i/100.0 for i in range(int((subsample-0.1)*100.0), min(int((subsample+0.1)*100),105) , 5)],\n", 591 | " 'colsample_bytree': [i/100.0 for i in range(int((colsample_bytree-0.1)*100.0), min(int((subsample+0.1)*100),105), 5)]\n", 592 | " }\n", 593 | "\n", 594 | "gbm = GridSearchCV(xgb.XGBRegressor(\n", 595 | " objective = objective,\n", 596 | " seed = seed,\n", 597 | " n_estimators = n_estimators,\n", 598 | " max_depth = max_depth,\n", 599 | " min_child_weight = min_child_weight,\n", 600 | " learning_rate = learning_rate,\n", 601 | " gamma = gamma,\n", 602 | " reg_alpha = reg_alpha,\n", 603 | " reg_lambda = reg_lambda,\n", 604 | " silent = silent\n", 605 | "\n", 606 | " ),\n", 607 | " \n", 608 | " param_grid = cv_params,\n", 609 | " iid = False,\n", 610 | " scoring = \"neg_mean_squared_error\",\n", 611 | " cv = 5,\n", 612 | " verbose = True\n", 613 | ")\n", 614 | "\n", 615 | "gbm.fit(x_train,y_train)\n", 616 | "print gbm.cv_results_\n", 617 | "print \"Best parameters %s\" %gbm.best_params_\n", 618 | "print \"Best score %s\" %gbm.best_score_\n", 619 | "slack_message(\"subsample and colsample_bytree parameters refined! 
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Set the `colsample_bytree` and `subsample` parameters before tuning the `reg_alpha` and `reg_lambda` parameters\n",
    "\n",
    "`reg_alpha` controls L1 regularisation and `reg_lambda` controls L2 regularisation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "colsample_bytree = gbm.best_params_['colsample_bytree']\n",
    "subsample = gbm.best_params_['subsample']\n",
    "parameters['colsample_bytree'] = colsample_bytree\n",
    "parameters['subsample'] = subsample\n",
    "scores.append(gbm.best_score_)\n",
    "\n",
    "cv_params = {'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],\n",
    "             'reg_lambda': [1e-5, 1e-2, 0.1, 1, 100]\n",
    "            }\n",
    "\n",
    "gbm = GridSearchCV(xgb.XGBRegressor(\n",
    "                       objective = objective,\n",
    "                       seed = seed,\n",
    "                       n_estimators = n_estimators,\n",
    "                       max_depth = max_depth,\n",
    "                       min_child_weight = min_child_weight,\n",
    "                       learning_rate = learning_rate,\n",
    "                       gamma = gamma,\n",
    "                       colsample_bytree = colsample_bytree,\n",
    "                       subsample = subsample,\n",
    "                       silent = silent\n",
    "                   ),\n",
    "                   param_grid = cv_params,\n",
    "                   iid = False,\n",
    "                   scoring = \"neg_mean_squared_error\",\n",
    "                   cv = 5,\n",
    "                   verbose = True\n",
    ")\n",
    "\n",
    "gbm.fit(x_train,y_train)\n",
    "print gbm.cv_results_\n",
    "print \"Best parameters %s\" %gbm.best_params_\n",
    "print \"Best score %s\" %gbm.best_score_\n",
    "slack_message(\"alpha and lambda parameters tuned! moving on to refinement\", 'channel')"
   ]
  },
moving on to refinement\", 'channel')" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": { 685 | "deletable": true, 686 | "editable": true 687 | }, 688 | "source": [ 689 | "## Refine parameters on a smaller grid\n", 690 | "\n", 691 | "Look at a smaller grid around the best values found previously " 692 | ] 693 | }, 694 | { 695 | "cell_type": "code", 696 | "execution_count": null, 697 | "metadata": { 698 | "collapsed": false, 699 | "deletable": true, 700 | "editable": true 701 | }, 702 | "outputs": [], 703 | "source": [ 704 | "reg_alpha = gbm.best_params_['reg_alpha']\n", 705 | "reg_lambda = gbm.best_params_['reg_lambda']\n", 706 | "parameters['reg_alpha'] = reg_alpha\n", 707 | "parameters['reg_lambda'] = reg_lambda\n", 708 | "scores.append(gbm.best_score_)\n", 709 | "\n", 710 | "cv_params = {'reg_lambda': [reg_alpha*0.2, reg_alpha*0.5, reg_alpha, reg_alpha*2, reg_alpha*5], \n", 711 | " 'reg_alpha': [reg_lambda*0.2, reg_lambda*0.5, reg_lambda, reg_lambda*2, reg_lambda*5]\n", 712 | " }\n", 713 | "\n", 714 | "gbm = GridSearchCV(xgb.XGBRegressor(\n", 715 | " objective = objective,\n", 716 | " seed = seed,\n", 717 | " n_estimators = n_estimators,\n", 718 | " max_depth = max_depth,\n", 719 | " min_child_weight = min_child_weight,\n", 720 | " learning_rate = learning_rate,\n", 721 | " gamma = gamma,\n", 722 | " colsample_bytree = colsample_bytree,\n", 723 | " subsample = subsample,\n", 724 | " silent = silent\n", 725 | "\n", 726 | " ),\n", 727 | " \n", 728 | " param_grid = cv_params,\n", 729 | " iid = False,\n", 730 | " scoring = \"neg_mean_squared_error\",\n", 731 | " cv = 5,\n", 732 | " verbose = True\n", 733 | ")\n", 734 | "\n", 735 | "gbm.fit(x_train,y_train)\n", 736 | "print gbm.cv_results_\n", 737 | "print \"Best parameters %s\" %gbm.best_params_\n", 738 | "print \"Best score %s\" %gbm.best_score_\n", 739 | "slack_message(\"alpha and lambda parameters refined! finalising model by reducing learning rate and increasing trees\", 'channel')" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": { 745 | "deletable": true, 746 | "editable": true 747 | }, 748 | "source": [ 749 | "## Set regularisation parameters before increasing the number of trees and reducing the learning rate\n", 750 | "\n", 751 | "The idea here is to find a better fit that actually converges based on the optimal parameters values we have found so far" 752 | ] 753 | }, 754 | { 755 | "cell_type": "code", 756 | "execution_count": null, 757 | "metadata": { 758 | "collapsed": true, 759 | "deletable": true, 760 | "editable": true 761 | }, 762 | "outputs": [], 763 | "source": [ 764 | "reg_alpha = gbm.best_params_['reg_alpha']\n", 765 | "reg_lambda = gbm.best_params_['reg_lambda']\n", 766 | "parameters['reg_alpha'] = reg_alpha\n", 767 | "parameters['reg_lambda'] = reg_lambda\n", 768 | "scores.append(gbm.best_score_)" 769 | ] 770 | }, 771 | { 772 | "cell_type": "markdown", 773 | "metadata": { 774 | "deletable": true, 775 | "editable": true 776 | }, 777 | "source": [ 778 | "## Print final parameters used and scores obtained\n", 779 | "\n", 780 | "Importantly, ensure scores are increasing with each iteration. 
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "print parameters\n",
    "print scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Optional: refit via the scikit-learn API with more trees and a lower learning rate\n",
    "# n_estimators = 3000\n",
    "# learning_rate = 0.05\n",
    "\n",
    "# parameters['n_estimators'] = n_estimators\n",
    "# parameters['learning_rate'] = learning_rate\n",
    "\n",
    "# xgbFinal = xgb.XGBRegressor(\n",
    "#     objective = objective,\n",
    "#     seed = seed,\n",
    "#     n_estimators = n_estimators,\n",
    "#     max_depth = max_depth,\n",
    "#     min_child_weight = min_child_weight,\n",
    "#     learning_rate = learning_rate,\n",
    "#     gamma = gamma,\n",
    "#     subsample = subsample,\n",
    "#     colsample_bytree = colsample_bytree,\n",
    "#     reg_alpha = reg_alpha,\n",
    "#     reg_lambda = reg_lambda,\n",
    "#     silent = False\n",
    "# )\n",
    "\n",
    "# xgbFinal.fit(x_train, y_train, eval_set = [(x_train, y_train), (x_test, y_test)], eval_metric = 'rmse', verbose = True)\n",
    "# slack_message(\"Training complete!\", 'channel')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Create XGBoost's DMatrix\n",
    "\n",
    "We will use this for finding the best tree via cross validation, and in the final XGBoost model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "trainDMat = xgb.DMatrix(data = x_train, label = y_train)\n",
    "testDMat = xgb.DMatrix(data = x_test, label = y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Find best tree\n",
    "\n",
    "Lower the `learning_rate` and set a large `num_boost_round` hyperparameter to ensure convergence. If convergence is slow, retry with a slightly higher learning rate (e.g. `0.075` instead of `0.05`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "learning_rate = 0.05\n",
    "parameters['eta'] = learning_rate   # the Learning API reads 'eta'; sklearn-style keys such as 'n_estimators' are not used by it\n",
    "\n",
    "num_boost_round = 3000\n",
    "early_stopping_rounds = 20\n",
    "\n",
    "xgbCV = xgb.cv(\n",
    "    params = parameters,\n",
    "    dtrain = trainDMat,\n",
    "    num_boost_round = num_boost_round,\n",
    "    nfold = 5,\n",
    "    metrics = {'rmse'},\n",
    "    early_stopping_rounds = early_stopping_rounds,\n",
    "    verbose_eval = True,\n",
    "    seed = seed\n",
    ")\n",
    "\n",
    "slack_message(\"Training complete! Producing final booster object\", 'channel')"
   ]
  },
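  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Optionally, inspect the cross-validation results. `xgb.cv` returns a dataframe with columns such as `test-rmse-mean`, and with early stopping the dataframe is truncated at the best round, so the last row shows the best cross-validated RMSE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Inspect the CV results; len(xgbCV) is the best number of boosting rounds\n",
    "print xgbCV.tail(1)\n",
    "print \"Best number of boosting rounds: %d\" % len(xgbCV)"
   ]
  },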
Producing final booster object\", 'channel')" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "metadata": { 903 | "deletable": true, 904 | "editable": true 905 | }, 906 | "source": [ 907 | "## Finalise XGBoost model\n", 908 | "\n", 909 | "Produce the final booster object using the best tree from our cross validation" 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": null, 915 | "metadata": { 916 | "collapsed": true, 917 | "deletable": true, 918 | "editable": true 919 | }, 920 | "outputs": [], 921 | "source": [ 922 | "num_boost_round = len(xgbCV)\n", 923 | "parameters['eval_metric'] = 'rmse'\n", 924 | "\n", 925 | "xgbFinal = xgb.train(\n", 926 | " params = parameters, \n", 927 | " dtrain = trainDMat, \n", 928 | " num_boost_round = num_boost_round,\n", 929 | " evals = [(trainDMat, 'train'), \n", 930 | " (testDMat, 'eval')]\n", 931 | ")\n", 932 | "\n", 933 | "slack_message(\"Booster object created!\", 'channel')" 934 | ] 935 | }, 936 | { 937 | "cell_type": "markdown", 938 | "metadata": { 939 | "deletable": true, 940 | "editable": true 941 | }, 942 | "source": [ 943 | "## Feature importance plot\n", 944 | "\n", 945 | "Plot the feature importance plot to check whether this is making sense before checking optimal parameters and loss function progression" 946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": null, 951 | "metadata": { 952 | "collapsed": false, 953 | "deletable": true, 954 | "editable": true 955 | }, 956 | "outputs": [], 957 | "source": [ 958 | "xgb.plot_importance(xgbFinal)" 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "metadata": { 964 | "deletable": true, 965 | "editable": true 966 | }, 967 | "source": [ 968 | "## Produce predictions for train and test sets before measuring accuracy\n", 969 | "\n", 970 | "Calculate predictions for both train and test sets, and then calculate MSE and RMSE for both datasets" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "metadata": { 977 | "collapsed": false, 978 | "deletable": true, 979 | "editable": true 980 | }, 981 | "outputs": [], 982 | "source": [ 983 | "xgbFinal_train_preds = xgbFinal.predict(x_train)\n", 984 | "xgbFinal_test_preds = xgbFinal.predict(x_test)" 985 | ] 986 | }, 987 | { 988 | "cell_type": "code", 989 | "execution_count": null, 990 | "metadata": { 991 | "collapsed": false, 992 | "deletable": true, 993 | "editable": true 994 | }, 995 | "outputs": [], 996 | "source": [ 997 | "print(xgbFinal_train_preds.shape)\n", 998 | "print(xgbFinal_test_preds.shape)" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "code", 1003 | "execution_count": null, 1004 | "metadata": { 1005 | "collapsed": false, 1006 | "deletable": true, 1007 | "editable": true 1008 | }, 1009 | "outputs": [], 1010 | "source": [ 1011 | "print \"\\nModel Report\"\n", 1012 | "print \"MSE Train : %f\" % mean_squared_error(y_train, xgbFinal_train_preds)\n", 1013 | "print \"MSE Test: %f\" % mean_squared_error(y_test, xgbFinal_test_preds)\n", 1014 | "print \"RMSE Train: %f\" % mean_squared_error(y_train, xgbFinal_train_preds)**0.5\n", 1015 | "print \"RMSE Test: %f\" % mean_squared_error(y_test, xgbFinal_test_preds)**0.5" 1016 | ] 1017 | }, 1018 | { 1019 | "cell_type": "markdown", 1020 | "metadata": { 1021 | "deletable": true, 1022 | "editable": true 1023 | }, 1024 | "source": [ 1025 | "## Save xgb model file and write .csv files to working directory\n", 1026 | "\n", 1027 | "Save xgb model file for future reference. 
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "pickle.dump(xgbFinal, open(\"xgbFinal.pickle.dat\", \"wb\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# The loaded object is a Booster, so predict on the DMatrix objects as above\n",
    "# xgbLoaded = pickle.load(open(\"xgbFinal.pickle.dat\", \"rb\"))\n",
    "# xgbLoaded_train_preds = xgbLoaded.predict(trainDMat)\n",
    "# xgbLoaded_test_preds = xgbLoaded.predict(testDMat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# print \"\\nModel Report\"\n",
    "# print \"MSE Train: %f\" % mean_squared_error(y_train, xgbLoaded_train_preds)\n",
    "# print \"MSE Test: %f\" % mean_squared_error(y_test, xgbLoaded_test_preds)\n",
    "# print \"RMSE Train: %f\" % (mean_squared_error(y_train, xgbLoaded_train_preds) ** 0.5)\n",
    "# print \"RMSE Test: %f\" % (mean_squared_error(y_test, xgbLoaded_test_preds) ** 0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "train_preds = pd.DataFrame(xgbFinal_train_preds)\n",
    "test_preds = pd.DataFrame(xgbFinal_test_preds)\n",
    "train_preds.columns = ['RESPONSE']\n",
    "test_preds.columns = ['RESPONSE']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "train.to_csv('XGBoost Train.csv', sep=',')\n",
    "train_preds.to_csv('XGBoost Train Preds.csv', sep=',')\n",
    "test.to_csv('XGBoost Test.csv', sep=',')\n",
    "test_preds.to_csv('XGBoost Test Preds.csv', sep=',')\n",
    "slack_message(\"Files saved!\", 'channel')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
--------------------------------------------------------------------------------