├── README.md └── xgbcode.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Effective XGBoost Code Repository 2 | 3 | Welcome to the official GitHub repository for "Effective XGBoost". Here, you'll find all the code examples included in the book, neatly organized by chapter. This repository serves as a practical resource for readers and allows for active collaboration through GitHub. 4 | 5 | 6 | 7 | ## Table of Contents 8 | 9 | 1. [About the Book](#about-the-book) 10 | 2. [How to Use This Repository](#how-to-use-this-repository) 11 | 3. [Purchase the Book](#purchase-the-book) 12 | 4. [Filing Bugs](#filing-bugs) 13 | 5. [Contributing](#contributing) 14 | 15 | ## About the Book 16 | 17 | "Effective XGBoost" is an in-depth, comprehensive guide to creating classification models, designed to help readers from a wide range of backgrounds, from beginners to seasoned professionals. It provides clear explanations, real-world examples, practical exercises, and much more. 18 | 19 | This book is the culmination of years of experience and knowledge shared by the author, Matt Harrison, a data science and Python consultant and corporate trainer. 20 | 21 | ## How to Use This Repository 22 | 23 | All of the book's code examples live in the [`xgbcode.ipynb`](xgbcode.ipynb) notebook; its section headings mirror the book's chapters, so you can jump straight to the chapter you are reading. The environment note at the end of this README lists the libraries the notebook relies on. 24 | 25 | ## Purchase the Book 26 | 27 | If you have not already done so, you can purchase "Effective XGBoost" from the following vendors: 28 | 29 | - Digital (PDF/EPUB/Kindle): [Effective XGBoost Digital](https://store.metasnake.com/xgboost) 30 | - Physical version: [Amazon](https://amzn.to/441i9lm) 31 | 32 | If you find the content of this repository helpful, imagine how much more you could learn from the complete book! Your purchase not only supports the work of the author but also contributes to the continuous improvement of this code repository. 33 | 34 | ## Filing Bugs 35 | 36 | We strive for perfection, but nobody's perfect. If you encounter any issues or errors in the book or in the code samples, please don't hesitate to file a bug in the [Issues](https://github.com/mattharrison/effective_xgboost_book/issues) section of this repository. When filing an issue, please include as much detail as possible, such as the chapter and page number, a description of the issue, and, if relevant, a screenshot or code snippet. 37 | 38 | ## Contributing 39 | 40 | We welcome and appreciate contributions from our readers. If you've noticed an error or a way to improve the code, feel free to create a pull request. For significant changes, please open an issue first to discuss the proposed changes. 41 | 42 | --- 43 | 44 | Happy coding, and enjoy the book!
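## Environment

`xgbcode.ipynb` pulls in a fairly broad stack. The list below is inferred from the notebook's own import statements; versions are not pinned here, so treat it as a starting point rather than the exact environment the book was built with:

``` python
# Rough dependency smoke test for xgbcode.ipynb. The package list is inferred
# from the notebook's imports; install anything that fails to import.
import dtreeviz
import feature_engine
import hyperopt
import matplotlib
import numpy as np
import pandas as pd
import scikitplot
import seaborn
import sklearn
import xgboost as xgb
import yellowbrick

# Report versions where the package exposes one.
for mod in (np, pd, sklearn, xgb, matplotlib, seaborn, yellowbrick):
    print(mod.__name__, getattr(mod, '__version__', 'unknown'))
```

Note that the later chapters also do `import xg_helpers as xhelp`. That module is not part of the repository listing above; it appears to collect the helper code defined in the earlier sections (`extract_zip`, `get_rawX_y`, the `kag_pl` pipeline, `my_dot_export`), so you will likely need to save those definitions into your own `xg_helpers.py` before running the later cells.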
45 | 46 | -------------------------------------------------------------------------------- /xgbcode.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a7edfa05-ede7-4f82-add8-41eb5c91bc41", 6 | "metadata": {}, 7 | "source": [ 8 | "## Datasets\n", 9 | "\n", 10 | "### Cleanup\n", 11 | "\n", 12 | "``` python\n", 13 | "import pandas as pd\n", 14 | "\n", 15 | "import urllib.request\n", 16 | "import zipfile\n", 17 | "\n", 18 | "\n", 19 | "url = 'https://github.com/mattharrison/datasets/raw/master/data/'\\\n", 20 | " 'kaggle-survey-2018.zip'\n", 21 | "fname = 'kaggle-survey-2018.zip'\n", 22 | "member_name = 'multipleChoiceResponses.csv'\n", 23 | "\n", 24 | "\n", 25 | "def extract_zip(src, dst, member_name):\n", 26 | " \"\"\"Extract a member file from a zip file and read it into a pandas \n", 27 | " DataFrame.\n", 28 | " \n", 29 | " Parameters:\n", 30 | " src (str): URL of the zip file to be downloaded and extracted.\n", 31 | " dst (str): Local file path where the zip file will be written.\n", 32 | " member_name (str): Name of the member file inside the zip file \n", 33 | " to be read into a DataFrame.\n", 34 | " \n", 35 | " Returns:\n", 36 | " pandas.DataFrame: DataFrame containing the contents of the \n", 37 | " member file.\n", 38 | " \"\"\" \n", 39 | " url = src\n", 40 | " fname = dst\n", 41 | " fin = urllib.request.urlopen(url)\n", 42 | " data = fin.read()\n", 43 | " with open(dst, mode='wb') as fout:\n", 44 | " fout.write(data)\n", 45 | " with zipfile.ZipFile(dst) as z:\n", 46 | " kag = pd.read_csv(z.open(member_name))\n", 47 | " kag_questions = kag.iloc[0]\n", 48 | " raw = kag.iloc[1:]\n", 49 | " return raw\n", 50 | "\n", 51 | "raw = extract_zip(url, fname, member_name) \n", 52 | "```\n", 53 | "\n", 54 | "### Cleanup Pipeline\n", 55 | "\n", 56 | "``` python\n", 57 | "def tweak_kag(df_: pd.DataFrame) -> pd.DataFrame:\n", 58 | " \"\"\"\n", 59 | " Tweak the Kaggle survey data and return a new DataFrame.\n", 60 | "\n", 61 | " This function takes a Pandas DataFrame containing Kaggle \n", 62 | " survey data as input and returns a new DataFrame. 
The \n", 63 | " modifications include extracting and transforming certain \n", 64 | " columns, renaming columns, and selecting a subset of columns.\n", 65 | "\n", 66 | " Parameters\n", 67 | " ----------\n", 68 | " df_ : pd.DataFrame\n", 69 | " The input DataFrame containing Kaggle survey data.\n", 70 | "\n", 71 | " Returns\n", 72 | " -------\n", 73 | " pd.DataFrame\n", 74 | " The new DataFrame with the modified and selected columns.\n", 75 | " \"\"\" \n", 76 | " return (df_\n", 77 | " .assign(age=df_.Q2.str.slice(0,2).astype(int),\n", 78 | " education=df_.Q4.replace({'Master’s degree': 18,\n", 79 | " 'Bachelor’s degree': 16,\n", 80 | " 'Doctoral degree': 20,\n", 81 | "'Some college/university study without earning a bachelor’s degree': 13,\n", 82 | " 'Professional degree': 19,\n", 83 | " 'I prefer not to answer': None,\n", 84 | " 'No formal education past high school': 12}),\n", 85 | " major=(df_.Q5\n", 86 | " .pipe(topn, n=3)\n", 87 | " .replace({\n", 88 | " 'Computer science (software engineering, etc.)': 'cs',\n", 89 | " 'Engineering (non-computer focused)': 'eng',\n", 90 | " 'Mathematics or statistics': 'stat'})\n", 91 | " ),\n", 92 | " years_exp=(df_.Q8.str.replace('+','', regex=False)\n", 93 | " .str.split('-', expand=True)\n", 94 | " .iloc[:,0]\n", 95 | " .astype(float)),\n", 96 | " compensation=(df_.Q9.str.replace('+','', regex=False)\n", 97 | " .str.replace(',','', regex=False)\n", 98 | " .str.replace('500000', '500', regex=False)\n", 99 | " .str.replace('I do not wish to disclose my approximate yearly compensation',\n", 100 | " '0', regex=False)\n", 101 | " .str.split('-', expand=True)\n", 102 | " .iloc[:,0]\n", 103 | " .fillna(0)\n", 104 | " .astype(int)\n", 105 | " .mul(1_000)\n", 106 | " ),\n", 107 | " python=df_.Q16_Part_1.fillna(0).replace('Python', 1),\n", 108 | " r=df_.Q16_Part_2.fillna(0).replace('R', 1),\n", 109 | " sql=df_.Q16_Part_3.fillna(0).replace('SQL', 1)\n", 110 | " )#assign\n", 111 | " .rename(columns=lambda col:col.replace(' ', '_'))\n", 112 | " .loc[:, 'Q1,Q3,age,education,major,years_exp,compensation,'\n", 113 | " 'python,r,sql'.split(',')] \n", 114 | " )\n", 115 | "\n", 116 | " \n", 117 | "def topn(ser, n=5, default='other'):\n", 118 | " \"\"\"\n", 119 | " Replace all values in a Pandas Series that are not among \n", 120 | " the top `n` most frequent values with a default value.\n", 121 | "\n", 122 | " This function takes a Pandas Series and returns a new \n", 123 | " Series with the values replaced as described above. The \n", 124 | " top `n` most frequent values are determined using the \n", 125 | " `value_counts` method of the input Series.\n", 126 | "\n", 127 | " Parameters\n", 128 | " ----------\n", 129 | " ser : pd.Series\n", 130 | " The input Series.\n", 131 | " n : int, optional\n", 132 | " The number of most frequent values to keep. The \n", 133 | " default value is 5.\n", 134 | " default : str, optional\n", 135 | " The default value to use for values that are not among \n", 136 | " the top `n` most frequent values. 
The default value is \n", 137 | " 'other'.\n", 138 | "\n", 139 | " Returns\n", 140 | " -------\n", 141 | " pd.Series\n", 142 | " The modified Series with the values replaced.\n", 143 | " \"\"\" \n", 144 | " counts = ser.value_counts()\n", 145 | " return ser.where(ser.isin(counts.index[:n]), default)\n", 146 | "```\n", 147 | "\n", 148 | "``` python\n", 149 | "from feature_engine import encoding, imputation\n", 150 | "from sklearn import base, pipeline\n", 151 | "\n", 152 | "\n", 153 | "class TweakKagTransformer(base.BaseEstimator,\n", 154 | " base.TransformerMixin):\n", 155 | " \"\"\"\n", 156 | " A transformer for tweaking Kaggle survey data.\n", 157 | "\n", 158 | " This transformer takes a Pandas DataFrame containing \n", 159 | " Kaggle survey data as input and returns a new version of \n", 160 | " the DataFrame. The modifications include extracting and \n", 161 | " transforming certain columns, renaming columns, and \n", 162 | " selecting a subset of columns.\n", 163 | "\n", 164 | " Parameters\n", 165 | " ----------\n", 166 | " ycol : str, optional\n", 167 | " The name of the column to be used as the target variable. \n", 168 | " If not specified, the target variable will not be set.\n", 169 | "\n", 170 | " Attributes\n", 171 | " ----------\n", 172 | " ycol : str\n", 173 | " The name of the column to be used as the target variable.\n", 174 | " \"\"\"\n", 175 | " \n", 176 | " def __init__(self, ycol=None):\n", 177 | " self.ycol = ycol\n", 178 | " \n", 179 | " def transform(self, X):\n", 180 | " return tweak_kag(X)\n", 181 | " \n", 182 | " def fit(self, X, y=None):\n", 183 | " return self\n", 184 | "```\n", 185 | "\n", 186 | "``` python\n", 187 | "def get_rawX_y(df, y_col):\n", 188 | " raw = (df\n", 189 | " .query('Q3.isin([\"United States of America\", \"China\", \"India\"]) '\n", 190 | " 'and Q6.isin([\"Data Scientist\", \"Software Engineer\"])')\n", 191 | " )\n", 192 | " return raw.drop(columns=[y_col]), raw[y_col]\n", 193 | "\n", 194 | "\n", 195 | "## Create a pipeline\n", 196 | "kag_pl = pipeline.Pipeline(\n", 197 | " [('tweak', TweakKagTransformer()),\n", 198 | " ('cat', encoding.OneHotEncoder(top_categories=5, drop_last=True, \n", 199 | " variables=['Q1', 'Q3', 'major'])),\n", 200 | " ('num_impute', imputation.MeanMedianImputer(imputation_method='median',\n", 201 | " variables=['education', 'years_exp']))]\n", 202 | ")\n", 203 | "```\n", 204 | "\n", 205 | "``` pycon\n", 206 | ">>> from sklearn import model_selection\n", 207 | ">>> kag_X, kag_y = get_rawX_y(raw, 'Q6')\n", 208 | " \n", 209 | ">>> kag_X_train, kag_X_test, kag_y_train, kag_y_test = \\\n", 210 | "... model_selection.train_test_split(\n", 211 | "... kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y)\n", 212 | "\n", 213 | ">>> X_train = kag_pl.fit_transform(kag_X_train, kag_y_train)\n", 214 | ">>> X_test = kag_pl.transform(kag_X_test)\n", 215 | ">>> print(X_train)\n", 216 | " age education years_exp ... major_other major_eng major_stat\n", 217 | "587 25 18.0 4.0 ... 1 0 0\n", 218 | "3065 22 16.0 1.0 ... 0 0 0\n", 219 | "8435 22 18.0 1.0 ... 1 0 0\n", 220 | "3110 40 20.0 3.0 ... 1 0 0\n", 221 | "16372 45 12.0 5.0 ... 1 0 0\n", 222 | "... ... ... ... ... ... ... ...\n", 223 | "16608 25 16.0 2.0 ... 0 0 0\n", 224 | "7325 18 16.0 1.0 ... 0 0 0\n", 225 | "21810 18 16.0 2.0 ... 0 0 0\n", 226 | "4917 25 18.0 1.0 ... 0 0 1\n", 227 | "639 25 18.0 1.0 ... 
0 0 0\n", 228 | "\n", 229 | "[2110 rows x 18 columns]\n", 230 | "```\n", 231 | "\n", 232 | "``` pycon\n", 233 | ">>> kag_y_train\n", 234 | "587 Software Engineer\n", 235 | "3065 Data Scientist\n", 236 | "8435 Data Scientist\n", 237 | "3110 Data Scientist\n", 238 | "16372 Software Engineer\n", 239 | " ... \n", 240 | "16608 Software Engineer\n", 241 | "7325 Software Engineer\n", 242 | "21810 Data Scientist\n", 243 | "4917 Data Scientist\n", 244 | "639 Data Scientist\n", 245 | "Name: Q6, Length: 2110, dtype: object\n", 246 | "```\n", 247 | "\n", 248 | "## Exploratory Data Analysis\n", 249 | "\n", 250 | "### Correlations\n", 251 | "\n", 252 | "``` python\n", 253 | "(X_train\n", 254 | " .assign(data_scientist = kag_y_train == 'Data Scientist')\n", 255 | " .corr(method='spearman')\n", 256 | " .style\n", 257 | " .background_gradient(cmap='RdBu', vmax=1, vmin=-1)\n", 258 | " .set_sticky(axis='index')\n", 259 | ")\n", 260 | "```\n", 261 | "\n", 262 | "### Bar Plot\n", 263 | "\n", 264 | "``` python\n", 265 | "import matplotlib.pyplot as plt\n", 266 | "\n", 267 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 268 | "(X_train\n", 269 | " .assign(data_scientist = kag_y_train)\n", 270 | " .groupby('r')\n", 271 | " .data_scientist\n", 272 | " .value_counts()\n", 273 | " .unstack()\n", 274 | " .plot.bar(ax=ax)\n", 275 | ")\n", 276 | "```\n", 277 | "\n", 278 | "``` python\n", 279 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 280 | "(pd.crosstab(index=X_train['major_cs'], \n", 281 | " columns=kag_y)\n", 282 | " .plot.bar(ax=ax)\n", 283 | ")\n", 284 | "```\n", 285 | "\n", 286 | "``` python\n", 287 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 288 | "(X_train\n", 289 | " .plot.scatter(x='years_exp', y='compensation', alpha=.3, ax=ax, c='purple')\n", 290 | ")\n", 291 | "```\n", 292 | "\n", 293 | "``` python\n", 294 | "import seaborn.objects as so\n", 295 | "fig = plt.figure(figsize=(8, 4))\n", 296 | "(so\n", 297 | " .Plot(X_train.assign(title=kag_y_train), x='years_exp', y='compensation', color='title')\n", 298 | " .add(so.Dots(alpha=.3, pointsize=2), so.Jitter(x=.5, y=10_000))\n", 299 | " .add(so.Line(), so.PolyFit())\n", 300 | " .on(fig) # not required unless saving to image\n", 301 | " .plot() # ditto\n", 302 | ")\n", 303 | "```\n", 304 | "\n", 305 | "``` python\n", 306 | "fig = plt.figure(figsize=(8, 4))\n", 307 | "(so\n", 308 | " .Plot(X_train\n", 309 | " #.query('compensation < 200_000 and years_exp < 16')\n", 310 | " .assign(\n", 311 | " title=kag_y_train,\n", 312 | " country=(X_train\n", 313 | " .loc[:, 'Q3_United States of America': 'Q3_China']\n", 314 | " .idxmax(axis='columns')\n", 315 | " )\n", 316 | " ), x='years_exp', y='compensation', color='title')\n", 317 | " .facet('country')\n", 318 | " .add(so.Dots(alpha=.01, pointsize=2, color='grey' ), so.Jitter(x=.5, y=10_000), col=None)\n", 319 | " .add(so.Dots(alpha=.5, pointsize=1.5), so.Jitter(x=.5, y=10_000))\n", 320 | " .add(so.Line(pointsize=1), so.PolyFit(order=2))\n", 321 | " .scale(x=so.Continuous().tick(at=[0,1,2,3,4,5]))\n", 322 | " .limit(y=(-10_000, 200_000), x=(-1, 6)) # zoom in with this not .query (above)\n", 323 | " .on(fig) # not required unless saving to image\n", 324 | " .plot() # ditto\n", 325 | ")\n", 326 | "```\n", 327 | "\n", 328 | "## Tree Creation\n", 329 | "\n", 330 | "### The Gini Coefficient\n", 331 | "\n", 332 | "``` python\n", 333 | "import numpy as np\n", 334 | "import numpy.random as rn\n", 335 | "\n", 336 | "pos_center = 12\n", 337 | "pos_count = 100\n", 338 | "neg_center = 7\n", 339 | "neg_count = 1000\n", 340 | "rs = 
rn.RandomState(rn.MT19937(rn.SeedSequence(42)))\n", 341 | "gini = pd.DataFrame({'value':\n", 342 | " np.append((pos_center) + rs.randn(pos_count),\n", 343 | " (neg_center) + rs.randn(neg_count)),\n", 344 | " 'label':\n", 345 | " ['pos']* pos_count + ['neg'] * neg_count})\n", 346 | "\n", 347 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 348 | "_ = (gini\n", 349 | " .groupby('label')\n", 350 | " [['value']]\n", 351 | " .plot.hist(bins=30, alpha=.5, ax=ax, edgecolor='black')\n", 352 | ")\n", 353 | "ax.legend(['Negative', 'Positive'])\n", 354 | "```\n", 355 | "\n", 356 | "``` python\n", 357 | "def calc_gini(df, val_col, label_col, pos_val, split_point,\n", 358 | " debug=False):\n", 359 | " \"\"\"\n", 360 | " This function calculates the Gini impurity of a dataset. Gini impurity \n", 361 | " is a measure of the probability of a random sample being classified \n", 362 | " incorrectly when a feature is used to split the data. The lower the \n", 363 | " impurity, the better the split.\n", 364 | "\n", 365 | " Parameters:\n", 366 | " df (pd.DataFrame): The dataframe containing the data\n", 367 | " val_col (str): The column name of the feature used to split the data\n", 368 | " label_col (str): The column name of the target variable\n", 369 | " pos_val (str or int): The value of the target variable that represents \n", 370 | " the positive class\n", 371 | " split_point (float): The threshold used to split the data.\n", 372 | " debug (bool): optional, when set to True, prints the calculated Gini\n", 373 | " impurities and the final weighted average\n", 374 | "\n", 375 | " Returns:\n", 376 | " float: The weighted average of Gini impurity for the positive and \n", 377 | " negative subsets.\n", 378 | " \"\"\" \n", 379 | " ge_split = df[val_col] >= split_point\n", 380 | " eq_pos = df[label_col] == pos_val\n", 381 | " tp = df[ge_split & eq_pos].shape[0]\n", 382 | " fp = df[ge_split & ~eq_pos].shape[0]\n", 383 | " tn = df[~ge_split & ~eq_pos].shape[0]\n", 384 | " fn = df[~ge_split & eq_pos].shape[0]\n", 385 | " pos_size = tp+fp\n", 386 | " neg_size = tn+fn\n", 387 | " total_size = len(df)\n", 388 | " if pos_size == 0:\n", 389 | " gini_pos = 0\n", 390 | " else:\n", 391 | " gini_pos = 1 - (tp/pos_size)**2 - (fp/pos_size)**2\n", 392 | " if neg_size == 0:\n", 393 | " gini_neg = 0\n", 394 | " else:\n", 395 | " gini_neg = 1 - (tn/neg_size)**2 - (fn/neg_size)**2\n", 396 | " weighted_avg = gini_pos * (pos_size/total_size) + \\\n", 397 | " gini_neg * (neg_size/total_size)\n", 398 | " if debug:\n", 399 | " print(f'{gini_pos=:.3} {gini_neg=:.3} {weighted_avg=:.3}')\n", 400 | " return weighted_avg\n", 401 | "```\n", 402 | "\n", 403 | "``` pycon\n", 404 | ">>> calc_gini(gini, val_col='value', label_col='label', pos_val='pos',\n", 405 | "... 
split_point=9.24, debug=True)\n", 406 | "gini_pos=0.217 gini_neg=0.00202 weighted_avg=0.0241\n", 407 | "0.024117224644432264\n", 408 | "```\n", 409 | "\n", 410 | "``` python\n", 411 | "values = np.arange(5, 15, .1)\n", 412 | "ginis = []\n", 413 | "for v in values:\n", 414 | " ginis.append(calc_gini(gini, val_col='value', label_col='label',\n", 415 | " pos_val='pos', split_point=v))\n", 416 | "fig, ax = plt.subplots(figsize=(8, 4)) \n", 417 | "ax.plot(values, ginis)\n", 418 | "ax.set_title('Gini Coefficient')\n", 419 | "ax.set_ylabel('Gini Coefficient')\n", 420 | "ax.set_xlabel('Split Point')\n", 421 | "```\n", 422 | "\n", 423 | "``` pycon\n", 424 | ">>> pd.Series(ginis, index=values).loc[9.5:10.5]\n", 425 | "9.6 0.013703\n", 426 | "9.7 0.010470\n", 427 | "9.8 0.007193\n", 428 | "9.9 0.005429\n", 429 | "10.0 0.007238\n", 430 | "10.1 0.005438\n", 431 | "10.2 0.005438\n", 432 | "10.3 0.007244\n", 433 | "10.4 0.009046\n", 434 | "10.5 0.009046\n", 435 | "dtype: float64\n", 436 | "```\n", 437 | "\n", 438 | "``` pycon\n", 439 | ">>> print(pd.DataFrame({'gini':ginis, 'split':values})\n", 440 | "... .query('gini <= gini.min()')\n", 441 | "... )\n", 442 | " gini split\n", 443 | "49 0.005429 9.9\n", 444 | "```\n", 445 | "\n", 446 | "### Coefficients in Trees\n", 447 | "\n", 448 | "``` python\n", 449 | "from sklearn import tree\n", 450 | "stump = tree.DecisionTreeClassifier(max_depth=1)\n", 451 | "stump.fit(gini[['value']], gini.label)\n", 452 | "```\n", 453 | "\n", 454 | "``` python\n", 455 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 456 | "tree.plot_tree(stump, feature_names=['value'],\n", 457 | " filled=True, \n", 458 | " class_names=stump.classes_,\n", 459 | " ax=ax)\n", 460 | "```\n", 461 | "\n", 462 | "``` pycon\n", 463 | ">>> gini_pos = 0.039\n", 464 | ">>> gini_neg = 0.002\n", 465 | ">>> pos_size = 101\n", 466 | ">>> neg_size = 999\n", 467 | ">>> total_size = pos_size + neg_size\n", 468 | ">>> weighted_avg = gini_pos * (pos_size/total_size) + \\\n", 469 | "... 
gini_neg * (neg_size/total_size)\n", 470 | ">>> print(weighted_avg)\n", 471 | "0.005397272727272727\n", 472 | "```\n", 473 | "\n", 474 | "``` python\n", 475 | "import xgboost as xgb\n", 476 | "xg_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1) \n", 477 | "xg_stump.fit(gini[['value']], (gini.label== 'pos'))\n", 478 | "```\n", 479 | "\n", 480 | "``` python\n", 481 | "xgb.plot_tree(xg_stump, num_trees=0)\n", 482 | "```\n", 483 | "\n", 484 | "``` python\n", 485 | "import subprocess\n", 486 | "def my_dot_export(xg, num_trees, filename, title='', direction='TB'):\n", 487 | " \"\"\"Exports a specified number of trees from an XGBoost model as a graph \n", 488 | " visualization in dot and png formats.\n", 489 | "\n", 490 | " Args:\n", 491 | " xg: An XGBoost model.\n", 492 | " num_trees: The number of tree to export.\n", 493 | " filename: The name of the file to save the exported visualization.\n", 494 | " title: The title to display on the graph visualization (optional).\n", 495 | " direction: The direction to lay out the graph, either 'TB' (top to \n", 496 | " bottom) or 'LR' (left to right) (optional).\n", 497 | " \"\"\"\n", 498 | " res = xgb.to_graphviz(xg, num_trees=num_trees)\n", 499 | " content = f''' node [fontname = \"Roboto Condensed\"];\n", 500 | " edge [fontname = \"Roboto Thin\"];\n", 501 | " label = \"{title}\"\n", 502 | " fontname = \"Roboto Condensed\"\n", 503 | " '''\n", 504 | " out = res.source.replace('graph [ rankdir=TB ]', \n", 505 | " f'graph [ rankdir={direction} ];\\n {content}')\n", 506 | " # dot -Gdpi=300 -Tpng -ocourseflow.png courseflow.dot \n", 507 | " dot_filename = filename\n", 508 | " with open(dot_filename, 'w') as fout:\n", 509 | " fout.write(out)\n", 510 | " png_filename = dot_filename.replace('.dot', '.png')\n", 511 | " subprocess.run(f'dot -Gdpi=300 -Tpng -o{png_filename} {dot_filename}'.split())\n", 512 | "```\n", 513 | "\n", 514 | "``` python\n", 515 | "my_dot_export(xg_stump, num_trees=0, filename='img/stump_xg.dot', title='A demo stump') \n", 516 | "```\n", 517 | "\n", 518 | "### Another Visualization Tool\n", 519 | "\n", 520 | "``` python\n", 521 | "import dtreeviz\n", 522 | "viz = dtreeviz.model(xg_stump, X_train=gini[['value']], \n", 523 | " y_train=gini.label=='pos',\n", 524 | " target_name='positive',\n", 525 | " feature_names=['value'], class_names=['negative', 'positive'],\n", 526 | " tree_index=0)\n", 527 | "viz.view()\n", 528 | "```\n", 529 | "\n", 530 | "## Stumps on Real Data\n", 531 | "\n", 532 | "### Scikit-learn stump on real data\n", 533 | "\n", 534 | "``` python\n", 535 | "stump_dt = tree.DecisionTreeClassifier(max_depth=1)\n", 536 | "X_train = kag_pl.fit_transform(kag_X_train)\n", 537 | "stump_dt.fit(X_train, kag_y_train)\n", 538 | "```\n", 539 | "\n", 540 | "``` python\n", 541 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 542 | "features = list(c for c in X_train.columns)\n", 543 | "tree.plot_tree(stump_dt, feature_names=features, \n", 544 | " filled=True, \n", 545 | " class_names=stump_dt.classes_,\n", 546 | " ax=ax)\n", 547 | "```\n", 548 | "\n", 549 | "``` pycon\n", 550 | ">>> X_test = kag_pl.transform(kag_X_test)\n", 551 | ">>> stump_dt.score(X_test, kag_y_test)\n", 552 | "0.6243093922651933\n", 553 | "```\n", 554 | "\n", 555 | "``` pycon\n", 556 | ">>> from sklearn import dummy\n", 557 | ">>> dummy_model = dummy.DummyClassifier()\n", 558 | ">>> dummy_model.fit(X_train, kag_y_train)\n", 559 | ">>> dummy_model.score(X_test, kag_y_test)\n", 560 | "0.5458563535911602\n", 561 | "```\n", 562 | "\n", 563 | "### Decision Stump with 
XGBoost\n", 564 | "\n", 565 | "``` python\n", 566 | "import xgboost as xgb\n", 567 | "kag_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1)\n", 568 | "kag_stump.fit(X_train, kag_y_train)\n", 569 | "```\n", 570 | "\n", 571 | "``` pycon\n", 572 | ">>> print(kag_y_train)\n", 573 | "587 Software Engineer\n", 574 | "3065 Data Scientist\n", 575 | "8435 Data Scientist\n", 576 | "3110 Data Scientist\n", 577 | "16372 Software Engineer\n", 578 | " ... \n", 579 | "16608 Software Engineer\n", 580 | "7325 Software Engineer\n", 581 | "21810 Data Scientist\n", 582 | "4917 Data Scientist\n", 583 | "639 Data Scientist\n", 584 | "Name: Q6, Length: 2110, dtype: object\n", 585 | "```\n", 586 | "\n", 587 | "``` pycon\n", 588 | ">>> print(kag_y_train == 'Software Engineer')\n", 589 | "587 True\n", 590 | "3065 False\n", 591 | "8435 False\n", 592 | "3110 False\n", 593 | "16372 True\n", 594 | " ... \n", 595 | "16608 True\n", 596 | "7325 True\n", 597 | "21810 False\n", 598 | "4917 False\n", 599 | "639 False\n", 600 | "Name: Q6, Length: 2110, dtype: bool\n", 601 | "```\n", 602 | "\n", 603 | "``` pycon\n", 604 | ">>> from sklearn import preprocessing\n", 605 | ">>> label_encoder = preprocessing.LabelEncoder()\n", 606 | ">>> y_train = label_encoder.fit_transform(kag_y_train)\n", 607 | ">>> y_test = label_encoder.transform(kag_y_test)\n", 608 | ">>> y_test[:5]\n", 609 | "array([1, 0, 0, 1, 1])\n", 610 | "```\n", 611 | "\n", 612 | "``` pycon\n", 613 | ">>> label_encoder.classes_\n", 614 | "array(['Data Scientist', 'Software Engineer'], dtype=object)\n", 615 | "```\n", 616 | "\n", 617 | "``` pycon\n", 618 | ">>> label_encoder.inverse_transform([0, 1])\n", 619 | "array(['Data Scientist', 'Software Engineer'], dtype=object)\n", 620 | "```\n", 621 | "\n", 622 | "``` pycon\n", 623 | ">>> kag_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1)\n", 624 | ">>> kag_stump.fit(X_train, y_train)\n", 625 | ">>> kag_stump.score(X_test, y_test)\n", 626 | "0.6243093922651933\n", 627 | "```\n", 628 | "\n", 629 | "``` python\n", 630 | "my_dot_export(kag_stump, num_trees=0, filename='img/stump_xg_kag.dot', \n", 631 | " title='XGBoost Stump') \n", 632 | "```\n", 633 | "\n", 634 | "### Values in the XGBoost Tree\n", 635 | "\n", 636 | "``` pycon\n", 637 | ">>> kag_stump.classes_\n", 638 | "array([0, 1])\n", 639 | "```\n", 640 | "\n", 641 | "``` python\n", 642 | "import numpy as np\n", 643 | "def inv_logit(p: float) -> float:\n", 644 | " \"\"\"\n", 645 | " Compute the inverse logit function of a given value.\n", 646 | "\n", 647 | " The inverse logit function is defined as:\n", 648 | " f(p) = exp(p) / (1 + exp(p))\n", 649 | "\n", 650 | " Parameters\n", 651 | " ----------\n", 652 | " p : float\n", 653 | " The input value to the inverse logit function.\n", 654 | "\n", 655 | " Returns\n", 656 | " -------\n", 657 | " float\n", 658 | " The output of the inverse logit function.\n", 659 | " \"\"\"\n", 660 | " return np.exp(p) / (1 + np.exp(p))\n", 661 | "```\n", 662 | "\n", 663 | "``` pycon\n", 664 | ">>> inv_logit(.0717741922)\n", 665 | "0.5179358489487103\n", 666 | "```\n", 667 | "\n", 668 | "``` pycon\n", 669 | ">>> inv_logit(-.3592)\n", 670 | "0.41115323716754393\n", 671 | "```\n", 672 | "\n", 673 | "``` python\n", 674 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 675 | "vals = np.linspace(-7, 7)\n", 676 | "ax.plot(vals, inv_logit(vals))\n", 677 | "ax.annotate('Crossover point', (0,.5), (-5,.8), arrowprops={'color':'k'}) \n", 678 | "ax.annotate('Predict Positive', (5,.6), (1,.6), va='center', arrowprops={'color':'k'}) \n", 679 | 
"ax.annotate('Predict Negative', (-5,.4), (-3,.4), va='center', arrowprops={'color':'k'}) \n", 680 | "```\n", 681 | "\n", 682 | "### Summary\n", 683 | "\n", 684 | "### Exercises\n", 685 | "\n", 686 | "## Model Complexity & Hyperparameters\n", 687 | "\n", 688 | "### Underfit\n", 689 | "\n", 690 | "``` pycon\n", 691 | ">>> underfit = tree.DecisionTreeClassifier(max_depth=1)\n", 692 | ">>> X_train = kag_pl.fit_transform(kag_X_train)\n", 693 | ">>> underfit.fit(X_train, kag_y_train)\n", 694 | ">>> underfit.score(X_test, kag_y_test)\n", 695 | "0.6243093922651933\n", 696 | "```\n", 697 | "\n", 698 | "### Growing a Tree\n", 699 | "\n", 700 | "### Overfitting\n", 701 | "\n", 702 | "### Overfitting with Decision Trees\n", 703 | "\n", 704 | "``` pycon\n", 705 | ">>> hi_variance = tree.DecisionTreeClassifier(max_depth=None)\n", 706 | ">>> X_train = kag_pl.fit_transform(kag_X_train)\n", 707 | ">>> hi_variance.fit(X_train, kag_y_train)\n", 708 | ">>> hi_variance.score(X_test, kag_y_test)\n", 709 | "0.6629834254143646\n", 710 | "```\n", 711 | "\n", 712 | "``` python\n", 713 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 714 | "features = list(c for c in X_train.columns)\n", 715 | "tree.plot_tree(hi_variance, feature_names=features, filled=True)\n", 716 | "```\n", 717 | "\n", 718 | "``` python\n", 719 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 720 | "features = list(c for c in X_train.columns)\n", 721 | "tree.plot_tree(hi_variance, feature_names=features, filled=True, \n", 722 | " class_names=hi_variance.classes_,\n", 723 | " max_depth=2, fontsize=6)\n", 724 | "```\n", 725 | "\n", 726 | "### Summary\n", 727 | "\n", 728 | "### Exercises\n", 729 | "\n", 730 | "## Tree Hyperparameters\n", 731 | "\n", 732 | "### Decision Tree Hyperparameters\n", 733 | "\n", 734 | "``` pycon\n", 735 | ">>> stump.get_params()\n", 736 | "{'ccp_alpha': 0.0,\n", 737 | " 'class_weight': None,\n", 738 | " 'criterion': 'gini',\n", 739 | " 'max_depth': 1,\n", 740 | " 'max_features': None,\n", 741 | " 'max_leaf_nodes': None,\n", 742 | " 'min_impurity_decrease': 0.0,\n", 743 | " 'min_samples_leaf': 1,\n", 744 | " 'min_samples_split': 2,\n", 745 | " 'min_weight_fraction_leaf': 0.0,\n", 746 | " 'random_state': None,\n", 747 | " 'splitter': 'best'}\n", 748 | "```\n", 749 | "\n", 750 | "### Tracking changes with Validation Curves\n", 751 | "\n", 752 | "``` python\n", 753 | "accuracies = []\n", 754 | "for depth in range(1, 15):\n", 755 | " between = tree.DecisionTreeClassifier(max_depth=depth)\n", 756 | " between.fit(X_train, kag_y_train)\n", 757 | " accuracies.append(between.score(X_test, kag_y_test))\n", 758 | "fig, ax = plt.subplots(figsize=(10,4)) \n", 759 | "(pd.Series(accuracies, name='Accuracy', index=range(1, len(accuracies)+1))\n", 760 | " .plot(ax=ax, title='Accuracy at a given Tree Depth'))\n", 761 | "ax.set_ylabel('Accuracy')\n", 762 | "ax.set_xlabel('max_depth')\n", 763 | "```\n", 764 | "\n", 765 | "``` pycon\n", 766 | ">>> between = tree.DecisionTreeClassifier(max_depth=7)\n", 767 | ">>> between.fit(X_train, kag_y_train)\n", 768 | ">>> between.score(X_test, kag_y_test)\n", 769 | "0.7359116022099448\n", 770 | "```\n", 771 | "\n", 772 | "### Leveraging Yellowbrick\n", 773 | "\n", 774 | "``` python\n", 775 | "from yellowbrick.model_selection import validation_curve\n", 776 | "fig, ax = plt.subplots(figsize=(10,4)) \n", 777 | "viz = validation_curve(tree.DecisionTreeClassifier(),\n", 778 | " X=pd.concat([X_train, X_test]),\n", 779 | " y=pd.concat([kag_y_train, kag_y_test]), \n", 780 | " param_name='max_depth', 
param_range=range(1,14),\n", 781 | " scoring='accuracy', cv=5, ax=ax, n_jobs=6) \n", 782 | "```\n", 783 | "\n", 784 | "### Grid Search\n", 785 | "\n", 786 | "``` python\n", 787 | "from sklearn.model_selection import GridSearchCV\n", 788 | "params = {\n", 789 | " 'max_depth': [3, 5, 7, 8],\n", 790 | " 'min_samples_leaf': [1, 3, 4, 5, 6],\n", 791 | " 'min_samples_split': [2, 3, 4, 5, 6],\n", 792 | "}\n", 793 | "grid_search = GridSearchCV(estimator=tree.DecisionTreeClassifier(), \n", 794 | " param_grid=params, cv=4, n_jobs=-1, \n", 795 | " verbose=1, scoring=\"accuracy\")\n", 796 | "grid_search.fit(pd.concat([X_train, X_test]),\n", 797 | " pd.concat([kag_y_train, kag_y_test]))\n", 798 | "```\n", 799 | "\n", 800 | "``` pycon\n", 801 | ">>> grid_search.best_params_\n", 802 | "{'max_depth': 7, 'min_samples_leaf': 5, 'min_samples_split': 6}\n", 803 | "```\n", 804 | "\n", 805 | "``` pycon\n", 806 | ">>> between2 = tree.DecisionTreeClassifier(**grid_search.best_params_)\n", 807 | ">>> between2.fit(X_train, kag_y_train)\n", 808 | ">>> between2.score(X_test, kag_y_test)\n", 809 | "0.7259668508287292\n", 810 | "```\n", 811 | "\n", 812 | "``` python\n", 813 | "# why is the score different than between_tree?\n", 814 | "(pd.DataFrame(grid_search.cv_results_)\n", 815 | " .sort_values(by='rank_test_score')\n", 816 | " .style\n", 817 | " .background_gradient(axis='rows')\n", 818 | ")\n", 819 | "```\n", 820 | "\n", 821 | "``` pycon\n", 822 | ">>> results = model_selection.cross_val_score(\n", 823 | "... tree.DecisionTreeClassifier(max_depth=7),\n", 824 | "... X=pd.concat([X_train, X_test], axis='index'),\n", 825 | "... y=pd.concat([kag_y_train, kag_y_test], axis='index'),\n", 826 | "... cv=4\n", 827 | "... )\n", 828 | "\n", 829 | ">>> results\n", 830 | "array([0.69628647, 0.73607427, 0.70291777, 0.7184595 ])\n", 831 | "```\n", 832 | "\n", 833 | "``` pycon\n", 834 | ">>> results.mean()\n", 835 | "0.7134345024851962\n", 836 | "```\n", 837 | "\n", 838 | "``` pycon\n", 839 | ">>> results = model_selection.cross_val_score(\n", 840 | "... tree.DecisionTreeClassifier(max_depth=7, min_samples_leaf=5,\n", 841 | "... min_samples_split=2),\n", 842 | "... X=pd.concat([X_train, X_test], axis='index'),\n", 843 | "... y=pd.concat([kag_y_train, kag_y_test], axis='index'),\n", 844 | "... cv=4\n", 845 | "... 
)\n", 846 | "\n", 847 | ">>> results\n", 848 | "array([0.70822281, 0.73740053, 0.70689655, 0.71580345])\n", 849 | "```\n", 850 | "\n", 851 | "``` pycon\n", 852 | ">>> results.mean()\n", 853 | "0.7170808366886126\n", 854 | "```\n", 855 | "\n", 856 | "### Summary\n", 857 | "\n", 858 | "### Exercises\n", 859 | "\n", 860 | "## Random Forest\n", 861 | "\n", 862 | "### Ensembles with Bagging\n", 863 | "\n", 864 | "### Scikit-learn Random Forest\n", 865 | "\n", 866 | "``` pycon\n", 867 | ">>> from sklearn import ensemble\n", 868 | ">>> rf = ensemble.RandomForestClassifier(random_state=42)\n", 869 | ">>> rf.fit(X_train, kag_y_train)\n", 870 | ">>> rf.score(X_test, kag_y_test)\n", 871 | "0.7237569060773481\n", 872 | "```\n", 873 | "\n", 874 | "``` pycon\n", 875 | ">>> rf.get_params()\n", 876 | "{'bootstrap': True,\n", 877 | " 'ccp_alpha': 0.0,\n", 878 | " 'class_weight': None,\n", 879 | " 'criterion': 'gini',\n", 880 | " 'max_depth': None,\n", 881 | " 'max_features': 'sqrt',\n", 882 | " 'max_leaf_nodes': None,\n", 883 | " 'max_samples': None,\n", 884 | " 'min_impurity_decrease': 0.0,\n", 885 | " 'min_samples_leaf': 1,\n", 886 | " 'min_samples_split': 2,\n", 887 | " 'min_weight_fraction_leaf': 0.0,\n", 888 | " 'n_estimators': 100,\n", 889 | " 'n_jobs': None,\n", 890 | " 'oob_score': False,\n", 891 | " 'random_state': 42,\n", 892 | " 'verbose': 0,\n", 893 | " 'warm_start': False}\n", 894 | "```\n", 895 | "\n", 896 | "``` pycon\n", 897 | ">>> len(rf.estimators_)\n", 898 | "100\n", 899 | "```\n", 900 | "\n", 901 | "``` pycon\n", 902 | ">>> print(rf.estimators_[0])\n", 903 | "DecisionTreeClassifier(max_features='sqrt', random_state=1608637542)\n", 904 | "```\n", 905 | "\n", 906 | "``` python\n", 907 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 908 | "features = list(c for c in X_train.columns)\n", 909 | "tree.plot_tree(rf.estimators_[0], feature_names=features, \n", 910 | " filled=True, class_names=rf.classes_, ax=ax,\n", 911 | " max_depth=2, fontsize=6)\n", 912 | "```\n", 913 | "\n", 914 | "### XGBoost Random Forest\n", 915 | "\n", 916 | "``` pycon\n", 917 | ">>> import xgboost as xgb\n", 918 | ">>> rf_xg = xgb.XGBRFClassifier(random_state=42)\n", 919 | ">>> rf_xg.fit(X_train, y_train) \n", 920 | ">>> rf_xg.score(X_test, y_test)\n", 921 | "0.7447513812154696\n", 922 | "```\n", 923 | "\n", 924 | "``` pycon\n", 925 | ">>> rf_xg.get_params()\n", 926 | "{'colsample_bynode': 0.8,\n", 927 | " 'learning_rate': 1.0,\n", 928 | " 'reg_lambda': 1e-05,\n", 929 | " 'subsample': 0.8,\n", 930 | " 'objective': 'binary:logistic',\n", 931 | " 'use_label_encoder': None,\n", 932 | " 'base_score': 0.5,\n", 933 | " 'booster': 'gbtree',\n", 934 | " 'callbacks': None,\n", 935 | " 'colsample_bylevel': 1,\n", 936 | " 'colsample_bytree': 1,\n", 937 | " 'early_stopping_rounds': None,\n", 938 | " 'enable_categorical': False,\n", 939 | " 'eval_metric': None,\n", 940 | " 'feature_types': None,\n", 941 | " 'gamma': 0,\n", 942 | " 'gpu_id': -1,\n", 943 | " 'grow_policy': 'depthwise',\n", 944 | " 'importance_type': None,\n", 945 | " 'interaction_constraints': '',\n", 946 | " 'max_bin': 256,\n", 947 | " 'max_cat_threshold': 64,\n", 948 | " 'max_cat_to_onehot': 4,\n", 949 | " 'max_delta_step': 0,\n", 950 | " 'max_depth': 6,\n", 951 | " 'max_leaves': 0,\n", 952 | " 'min_child_weight': 1,\n", 953 | " 'missing': nan,\n", 954 | " 'monotone_constraints': '()',\n", 955 | " 'n_estimators': 100,\n", 956 | " 'n_jobs': 0,\n", 957 | " 'num_parallel_tree': 100,\n", 958 | " 'predictor': 'auto',\n", 959 | " 'random_state': 42,\n", 960 | " 
'reg_alpha': 0,\n", 961 | " 'sampling_method': 'uniform',\n", 962 | " 'scale_pos_weight': 1,\n", 963 | " 'tree_method': 'exact',\n", 964 | " 'validate_parameters': 1,\n", 965 | " 'verbosity': None}\n", 966 | "```\n", 967 | "\n", 968 | "``` python\n", 969 | "fig, ax = plt.subplots(figsize=(6,12), dpi=600)\n", 970 | "xgb.plot_tree(rf_xg, num_trees=0, ax=ax, size='1,1')\n", 971 | "```\n", 972 | "\n", 973 | "``` python\n", 974 | "my_dot_export(rf_xg, num_trees=0, filename='img/rf_xg_kag.dot', \n", 975 | " title='First Random Forest Tree', direction='LR') \n", 976 | "```\n", 977 | "\n", 978 | "``` python\n", 979 | "viz = dtreeviz.model(rf_xg, X_train=X_train,\n", 980 | " y_train=y_train,\n", 981 | " target_name='Job', feature_names=list(X_train.columns), \n", 982 | " class_names=['DS', 'SE'], tree_index=0)\n", 983 | "viz.view(depth_range_to_display=[0,2])\n", 984 | "```\n", 985 | "\n", 986 | "### Random Forest Hyperparameters\n", 987 | "\n", 988 | "### Training the Number of Trees in the Forest\n", 989 | "\n", 990 | "``` python\n", 991 | "from yellowbrick.model_selection import validation_curve\n", 992 | "fig, ax = plt.subplots(figsize=(10,4)) \n", 993 | "viz = validation_curve(xgb.XGBClassifier(random_state=42),\n", 994 | " X=pd.concat([X_train, X_test], axis='index'),\n", 995 | " y=np.concatenate([y_train, y_test]),\n", 996 | " param_name='n_estimators', param_range=range(1, 100, 2),\n", 997 | " scoring='accuracy', cv=3, \n", 998 | " ax=ax) \n", 999 | "```\n", 1000 | "\n", 1001 | "``` pycon\n", 1002 | ">>> rf_xg29 = xgb.XGBRFClassifier(random_state=42, n_estimators=29)\n", 1003 | ">>> rf_xg29.fit(X_train, y_train) \n", 1004 | ">>> rf_xg29.score(X_test, y_test)\n", 1005 | "0.7480662983425415\n", 1006 | "```\n", 1007 | "\n", 1008 | "### Summary\n", 1009 | "\n", 1010 | "### Exercises\n", 1011 | "\n", 1012 | "## XGBoost\n", 1013 | "\n", 1014 | "### Jargon\n", 1015 | "\n", 1016 | "### Benefits of Boosting\n", 1017 | "\n", 1018 | "### A Big Downside\n", 1019 | "\n", 1020 | "### Creating an XGBoost Model\n", 1021 | "\n", 1022 | "``` python\n", 1023 | "%matplotlib inline\n", 1024 | "\n", 1025 | "import dtreeviz\n", 1026 | "from feature_engine import encoding, imputation\n", 1027 | "import matplotlib.pyplot as plt\n", 1028 | "import numpy as np\n", 1029 | "import pandas as pd\n", 1030 | "from sklearn import base, compose, datasets, ensemble, \\\n", 1031 | " metrics, model_selection, pipeline, preprocessing, tree\n", 1032 | "import scikitplot\n", 1033 | "import xgboost as xgb\n", 1034 | "import yellowbrick.model_selection as ms\n", 1035 | "from yellowbrick import classifier\n", 1036 | "\n", 1037 | "import urllib\n", 1038 | "import zipfile\n", 1039 | "\n", 1040 | "import xg_helpers as xhelp\n", 1041 | "```\n", 1042 | "\n", 1043 | "``` python\n", 1044 | "url = 'https://github.com/mattharrison/datasets/raw/master/data/'\\\n", 1045 | " 'kaggle-survey-2018.zip'\n", 1046 | "fname = 'kaggle-survey-2018.zip'\n", 1047 | "member_name = 'multipleChoiceResponses.csv'\n", 1048 | "\n", 1049 | "raw = xhelp.extract_zip(url, fname, member_name)\n", 1050 | "```\n", 1051 | "\n", 1052 | "``` python\n", 1053 | "## Create raw X and raw y\n", 1054 | "kag_X, kag_y = xhelp.get_rawX_y(raw, 'Q6')\n", 1055 | " \n", 1056 | "## Split data \n", 1057 | "kag_X_train, kag_X_test, kag_y_train, kag_y_test = \\\n", 1058 | " model_selection.train_test_split(\n", 1059 | " kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y) \n", 1060 | "\n", 1061 | "## Transform X with pipeline\n", 1062 | "X_train = 
xhelp.kag_pl.fit_transform(kag_X_train)\n", 1063 | "X_test = xhelp.kag_pl.transform(kag_X_test)\n", 1064 | "\n", 1065 | "## Transform y with label encoder\n", 1066 | "label_encoder = preprocessing.LabelEncoder()\n", 1067 | "label_encoder.fit(kag_y_train)\n", 1068 | "y_train = label_encoder.transform(kag_y_train)\n", 1069 | "y_test = label_encoder.transform(kag_y_test)\n", 1070 | "\n", 1071 | "# Combined Data for cross validation/etc\n", 1072 | "X = pd.concat([X_train, X_test], axis='index')\n", 1073 | "y = pd.Series([*y_train, *y_test], index=X.index)\n", 1074 | "```\n", 1075 | "\n", 1076 | "### A Boosted Model\n", 1077 | "\n", 1078 | "``` pycon\n", 1079 | ">>> xg_oob = xgb.XGBClassifier()\n", 1080 | ">>> xg_oob.fit(X_train, y_train)\n", 1081 | ">>> xg_oob.score(X_test, y_test)\n", 1082 | "0.7458563535911602\n", 1083 | "```\n", 1084 | "\n", 1085 | "``` pycon\n", 1086 | ">>> # Let's try w/ depth of 2 and 2 trees\n", 1087 | ">>> xg2 = xgb.XGBClassifier(max_depth=2, n_estimators=2)\n", 1088 | ">>> xg2.fit(X_train, y_train)\n", 1089 | ">>> xg2.score(X_test, y_test)\n", 1090 | "0.6685082872928176\n", 1091 | "```\n", 1092 | "\n", 1093 | "``` python\n", 1094 | "import dtreeviz\n", 1095 | "\n", 1096 | "viz = dtreeviz.model(xg2, X_train=X, y_train=y, target_name='Job',\n", 1097 | " feature_names=list(X_train.columns), \n", 1098 | " class_names=['DS', 'SE'], tree_index=0)\n", 1099 | "viz.view(depth_range_to_display=[0,2])\n", 1100 | "```\n", 1101 | "\n", 1102 | "### Understanding the Output of the Trees\n", 1103 | "\n", 1104 | "``` python\n", 1105 | "xhelp.my_dot_export(xg2, num_trees=0, filename='img/xgb_md2.dot', \n", 1106 | " title='First Tree') \n", 1107 | "```\n", 1108 | "\n", 1109 | "``` pycon\n", 1110 | ">>> # Predicts 1 - Software engineer\n", 1111 | ">>> se7894 = pd.DataFrame({'age': {7894: 22}, \n", 1112 | "... 'education': {7894: 16.0},\n", 1113 | "... 'years_exp': {7894: 1.0},\n", 1114 | "... 'compensation': {7894: 0},\n", 1115 | "... 'python': {7894: 1},\n", 1116 | "... 'r': {7894: 0},\n", 1117 | "... 'sql': {7894: 0},\n", 1118 | "... 'Q1_Male': {7894: 1}, \n", 1119 | "... 'Q1_Female': {7894: 0},\n", 1120 | "... 'Q1_Prefer not to say': {7894: 0},\n", 1121 | "... 'Q1_Prefer to self-describe': {7894: 0},\n", 1122 | "... 'Q3_United States of America': {7894: 0},\n", 1123 | "... 'Q3_India': {7894: 1},\n", 1124 | "... 'Q3_China': {7894: 0},\n", 1125 | "... 'major_cs': {7894: 0},\n", 1126 | "... 'major_other': {7894: 0},\n", 1127 | "... 'major_eng': {7894: 0},\n", 1128 | "... 
'major_stat': {7894: 0}})\n", 1129 | ">>> xg2.predict_proba(se7894)\n", 1130 | "array([[0.4986236, 0.5013764]], dtype=float32)\n", 1131 | "```\n", 1132 | "\n", 1133 | "``` pycon\n", 1134 | ">>> # Predicts 1 - Software engineer\n", 1135 | ">>> xg2.predict(pd.DataFrame(se7894))\n", 1136 | "array([1])\n", 1137 | "```\n", 1138 | "\n", 1139 | "``` python\n", 1140 | "xhelp.my_dot_export(xg2, num_trees=1, filename='img/xgb_md2_tree1.dot', title='Second Tree') \n", 1141 | "```\n", 1142 | "\n", 1143 | "``` python\n", 1144 | "def inv_logit(p):\n", 1145 | " return np.exp(p) / (1 + np.exp(p))\n", 1146 | "```\n", 1147 | "\n", 1148 | "``` pycon\n", 1149 | ">>> inv_logit(-0.08476+0.0902701)\n", 1150 | "0.5013775215147345\n", 1151 | "```\n", 1152 | "\n", 1153 | "### Summary\n", 1154 | "\n", 1155 | "### Exercises\n", 1156 | "\n", 1157 | "## Early Stopping\n", 1158 | "\n", 1159 | "### Early Stopping Rounds\n", 1160 | "\n", 1161 | "``` pycon\n", 1162 | ">>> # Defaults\n", 1163 | ">>> xg = xgb.XGBClassifier()\n", 1164 | ">>> xg.fit(X_train, y_train)\n", 1165 | ">>> xg.score(X_test, y_test)\n", 1166 | "0.7458563535911602\n", 1167 | "```\n", 1168 | "\n", 1169 | "``` pycon\n", 1170 | ">>> xg = xgb.XGBClassifier(early_stopping_rounds=20)\n", 1171 | ">>> xg.fit(X_train, y_train,\n", 1172 | "... eval_set=[(X_train, y_train),\n", 1173 | "... (X_test, y_test)\n", 1174 | "... ]\n", 1175 | "... )\n", 1176 | ">>> xg.score(X_test, y_test)\n", 1177 | "[0] validation_0-logloss:0.61534 validation_1-logloss:0.61775\n", 1178 | "[1] validation_0-logloss:0.57046 validation_1-logloss:0.57623\n", 1179 | "[2] validation_0-logloss:0.54011 validation_1-logloss:0.55333\n", 1180 | "[3] validation_0-logloss:0.51965 validation_1-logloss:0.53711\n", 1181 | "[4] validation_0-logloss:0.50419 validation_1-logloss:0.52511\n", 1182 | "[5] validation_0-logloss:0.49176 validation_1-logloss:0.51741\n", 1183 | "[6] validation_0-logloss:0.48159 validation_1-logloss:0.51277\n", 1184 | "[7] validation_0-logloss:0.47221 validation_1-logloss:0.51040\n", 1185 | "[8] validation_0-logloss:0.46221 validation_1-logloss:0.50713\n", 1186 | "[9] validation_0-logloss:0.45700 validation_1-logloss:0.50583\n", 1187 | "[10] validation_0-logloss:0.45062 validation_1-logloss:0.50430\n", 1188 | "[11] validation_0-logloss:0.44533 validation_1-logloss:0.50338\n", 1189 | "[12] validation_0-logloss:0.43736 validation_1-logloss:0.50033\n", 1190 | "[13] validation_0-logloss:0.43399 validation_1-logloss:0.50034\n", 1191 | "[14] validation_0-logloss:0.43004 validation_1-logloss:0.50192\n", 1192 | "[15] validation_0-logloss:0.42550 validation_1-logloss:0.50268\n", 1193 | "[16] validation_0-logloss:0.42169 validation_1-logloss:0.50196\n", 1194 | "[17] validation_0-logloss:0.41854 validation_1-logloss:0.50223\n", 1195 | "[18] validation_0-logloss:0.41485 validation_1-logloss:0.50360\n", 1196 | "[19] validation_0-logloss:0.41228 validation_1-logloss:0.50527\n", 1197 | "[20] validation_0-logloss:0.40872 validation_1-logloss:0.50839\n", 1198 | "[21] validation_0-logloss:0.40490 validation_1-logloss:0.50623\n", 1199 | "[22] validation_0-logloss:0.40280 validation_1-logloss:0.50806\n", 1200 | "[23] validation_0-logloss:0.39942 validation_1-logloss:0.51007\n", 1201 | "[24] validation_0-logloss:0.39807 validation_1-logloss:0.50987\n", 1202 | "[25] validation_0-logloss:0.39473 validation_1-logloss:0.51189\n", 1203 | "[26] validation_0-logloss:0.39389 validation_1-logloss:0.51170\n", 1204 | "[27] validation_0-logloss:0.39040 validation_1-logloss:0.51218\n", 1205 | "[28] 
validation_0-logloss:0.38837 validation_1-logloss:0.51135\n", 1206 | "[29] validation_0-logloss:0.38569 validation_1-logloss:0.51202\n", 1207 | "[30] validation_0-logloss:0.37945 validation_1-logloss:0.51352\n", 1208 | "[31] validation_0-logloss:0.37840 validation_1-logloss:0.51545\n", 1209 | "0.7558011049723757\n", 1210 | "```\n", 1211 | "\n", 1212 | "``` pycon\n", 1213 | ">>> xg.best_ntree_limit\n", 1214 | "13\n", 1215 | "```\n", 1216 | "\n", 1217 | "### Plotting Tree Performance\n", 1218 | "\n", 1219 | "``` pycon\n", 1220 | ">>> # validation_0 is for training data\n", 1221 | ">>> # validation_1 is for testing data\n", 1222 | ">>> results = xg.evals_result()\n", 1223 | ">>> results\n", 1224 | "{'validation_0': OrderedDict([('logloss',\n", 1225 | " [0.6153406503923696,\n", 1226 | " 0.5704566627034644,\n", 1227 | " 0.5401074953836288,\n", 1228 | " 0.519646179894983,\n", 1229 | " 0.5041859194071372,\n", 1230 | " 0.49175883369140716,\n", 1231 | " 0.4815858465553177,\n", 1232 | " 0.4722135672319274,\n", 1233 | " 0.46221246084118905,\n", 1234 | " 0.4570046103131291,\n", 1235 | " 0.45062119092139025,\n", 1236 | " 0.44533101600634545,\n", 1237 | " 0.4373589513231934,\n", 1238 | " 0.4339914069003403,\n", 1239 | " 0.4300442738158372,\n", 1240 | " 0.42550266018419824,\n", 1241 | " 0.42168949383456633,\n", 1242 | " 0.41853931894949614,\n", 1243 | " 0.41485192559138645,\n", 1244 | " 0.4122836278413833,\n", 1245 | " 0.4087179538231096,\n", 1246 | " 0.404898268053467,\n", 1247 | " 0.4027963532207719,\n", 1248 | " 0.39941699938733854,\n", 1249 | " 0.3980718078477953,\n", 1250 | " 0.39473153180519993,\n", 1251 | " 0.39388538948800944,\n", 1252 | " 0.39039599470886893,\n", 1253 | " 0.38837148147752126,\n", 1254 | " 0.38569152626668,\n", 1255 | " 0.3794510693344513,\n", 1256 | " 0.37840359436957194,\n", 1257 | " 0.37538466192241227])]),\n", 1258 | " 'validation_1': OrderedDict([('logloss',\n", 1259 | " [0.6177459120091813,\n", 1260 | " 0.5762297115602546,\n", 1261 | " 0.5533292921537852,\n", 1262 | " 0.5371078260695736,\n", 1263 | " 0.5251118483299708,\n", 1264 | " 0.5174100387491574,\n", 1265 | " 0.5127666981510036,\n", 1266 | " 0.5103968678752362,\n", 1267 | " 0.5071349115538004,\n", 1268 | " 0.5058257413585542,\n", 1269 | " 0.5043005662687247,\n", 1270 | " 0.5033770955193438,\n", 1271 | " 0.5003349146419797,\n", 1272 | " 0.5003436393562437,\n", 1273 | " 0.5019165392779843,\n", 1274 | " 0.502677517614806,\n", 1275 | " 0.501961292550791,\n", 1276 | " 0.5022262006329157,\n", 1277 | " 0.5035970173261607,\n", 1278 | " 0.5052709663297096,\n", 1279 | " 0.508388655664636,\n", 1280 | " 0.5062287504923689,\n", 1281 | " 0.5080608455824424,\n", 1282 | " 0.5100736726054829,\n", 1283 | " 0.5098673969229365,\n", 1284 | " 0.5118910041889845,\n", 1285 | " 0.5117007332982608,\n", 1286 | " 0.5121825202836434,\n", 1287 | " 0.5113475993625531,\n", 1288 | " 0.5120185821281118,\n", 1289 | " 0.5135189292720874,\n", 1290 | " 0.5154504034915188,\n", 1291 | " 0.5158137131755071])])}\n", 1292 | "```\n", 1293 | "\n", 1294 | "``` python\n", 1295 | "# Testing score is best at 13 trees\n", 1296 | "results = xg.evals_result()\n", 1297 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1298 | "ax = (pd.DataFrame({'training': results['validation_0']['logloss'],\n", 1299 | " 'testing': results['validation_1']['logloss']})\n", 1300 | " .assign(ntrees=lambda adf: range(1, len(adf)+1)) \n", 1301 | " .set_index('ntrees')\n", 1302 | " .plot(figsize=(5,4), ax=ax, \n", 1303 | " title='eval_results with early_stopping')\n", 1304 | ")\n", 1305 | 
"ax.annotate('Best number \\nof trees (13)', xy=(13, .498),\n", 1306 | " xytext=(20,.42), arrowprops={'color':'k'})\n", 1307 | "ax.set_xlabel('ntrees')\n", 1308 | "```\n", 1309 | "\n", 1310 | "``` python\n", 1311 | "# Using value from early stopping gives same result\n", 1312 | ">>> xg13 = xgb.XGBClassifier(n_estimators=13)\n", 1313 | ">>> xg13.fit(X_train, y_train,\n", 1314 | "... eval_set=[(X_train, y_train),\n", 1315 | "... (X_test, y_test)]\n", 1316 | "... )\n", 1317 | ">>> xg13.score(X_test, y_test)\n", 1318 | "```\n", 1319 | "\n", 1320 | "``` pycon\n", 1321 | ">>> xg.score(X_test, y_test)\n", 1322 | "0.7558011049723757\n", 1323 | "```\n", 1324 | "\n", 1325 | "``` pycon\n", 1326 | ">>> # No early stopping, uses all estimators\n", 1327 | ">>> xg_no_es = xgb.XGBClassifier()\n", 1328 | ">>> xg_no_es.fit(X_train, y_train)\n", 1329 | ">>> xg_no_es.score(X_test, y_test)\n", 1330 | "0.7458563535911602\n", 1331 | "```\n", 1332 | "\n", 1333 | "### Different `eval_metrics`\n", 1334 | "\n", 1335 | "``` pycon\n", 1336 | ">>> xg_err = xgb.XGBClassifier(early_stopping_rounds=20, \n", 1337 | "... eval_metric='error')\n", 1338 | ">>> xg_err.fit(X_train, y_train,\n", 1339 | "... eval_set=[(X_train, y_train),\n", 1340 | "... (X_test, y_test)\n", 1341 | "... ]\n", 1342 | "... )\n", 1343 | ">>> xg_err.score(X_test, y_test)\n", 1344 | "[0] validation_0-error:0.24739 validation_1-error:0.27072\n", 1345 | "[1] validation_0-error:0.24218 validation_1-error:0.26188\n", 1346 | "[2] validation_0-error:0.23839 validation_1-error:0.24751\n", 1347 | "[3] validation_0-error:0.23697 validation_1-error:0.25193\n", 1348 | "[4] validation_0-error:0.23081 validation_1-error:0.24530\n", 1349 | "[5] validation_0-error:0.22607 validation_1-error:0.24420\n", 1350 | "[6] validation_0-error:0.22180 validation_1-error:0.24862\n", 1351 | "[7] validation_0-error:0.21801 validation_1-error:0.24862\n", 1352 | "[8] validation_0-error:0.21280 validation_1-error:0.25304\n", 1353 | "[9] validation_0-error:0.21043 validation_1-error:0.25304\n", 1354 | "[10] validation_0-error:0.20806 validation_1-error:0.24641\n", 1355 | "[11] validation_0-error:0.20284 validation_1-error:0.25193\n", 1356 | "[12] validation_0-error:0.20047 validation_1-error:0.24420\n", 1357 | "[13] validation_0-error:0.19668 validation_1-error:0.24420\n", 1358 | "[14] validation_0-error:0.19384 validation_1-error:0.24530\n", 1359 | "[15] validation_0-error:0.18815 validation_1-error:0.24199\n", 1360 | "[16] validation_0-error:0.18531 validation_1-error:0.24199\n", 1361 | "[17] validation_0-error:0.18389 validation_1-error:0.23867\n", 1362 | "[18] validation_0-error:0.18531 validation_1-error:0.23757\n", 1363 | "[19] validation_0-error:0.18815 validation_1-error:0.23867\n", 1364 | "[20] validation_0-error:0.18246 validation_1-error:0.24199\n", 1365 | "[21] validation_0-error:0.17915 validation_1-error:0.24862\n", 1366 | "[22] validation_0-error:0.17867 validation_1-error:0.24751\n", 1367 | "[23] validation_0-error:0.17630 validation_1-error:0.24199\n", 1368 | "[24] validation_0-error:0.17488 validation_1-error:0.24309\n", 1369 | "[25] validation_0-error:0.17251 validation_1-error:0.24530\n", 1370 | "[26] validation_0-error:0.17204 validation_1-error:0.24309\n", 1371 | "[27] validation_0-error:0.16825 validation_1-error:0.24199\n", 1372 | "[28] validation_0-error:0.16730 validation_1-error:0.24088\n", 1373 | "[29] validation_0-error:0.16019 validation_1-error:0.24199\n", 1374 | "[30] validation_0-error:0.15782 validation_1-error:0.24972\n", 1375 | "[31] 
validation_0-error:0.15972 validation_1-error:0.24862\n", 1376 | "[32] validation_0-error:0.15924 validation_1-error:0.24641\n", 1377 | "[33] validation_0-error:0.15403 validation_1-error:0.25635\n", 1378 | "[34] validation_0-error:0.15261 validation_1-error:0.25525\n", 1379 | "[35] validation_0-error:0.15213 validation_1-error:0.25525\n", 1380 | "[36] validation_0-error:0.15166 validation_1-error:0.25525\n", 1381 | "[37] validation_0-error:0.14550 validation_1-error:0.25525\n", 1382 | "[38] validation_0-error:0.14597 validation_1-error:0.25083\n", 1383 | "0.7624309392265194\n", 1384 | "```\n", 1385 | "\n", 1386 | "``` pycon\n", 1387 | ">>> xg_err.best_ntree_limit\n", 1388 | "19\n", 1389 | "```\n", 1390 | "\n", 1391 | "### Summary\n", 1392 | "\n", 1393 | "### Exercises\n", 1394 | "\n", 1395 | "## XGBoost Hyperparameters\n", 1396 | "\n", 1397 | "### Hyperparameters\n", 1398 | "\n", 1399 | "### Examining Hyperparameters\n", 1400 | "\n", 1401 | "``` pycon\n", 1402 | ">>> xg = xgb.XGBClassifier() # set the hyperparamters in here\n", 1403 | ">>> xg.fit(X_train, y_train)\n", 1404 | ">>> xg.get_params()\n", 1405 | "{'objective': 'binary:logistic',\n", 1406 | " 'use_label_encoder': None,\n", 1407 | " 'base_score': 0.5,\n", 1408 | " 'booster': 'gbtree',\n", 1409 | " 'callbacks': None,\n", 1410 | " 'colsample_bylevel': 1,\n", 1411 | " 'colsample_bynode': 1,\n", 1412 | " 'colsample_bytree': 1,\n", 1413 | " 'early_stopping_rounds': None,\n", 1414 | " 'enable_categorical': False,\n", 1415 | " 'eval_metric': None,\n", 1416 | " 'feature_types': None,\n", 1417 | " 'gamma': 0,\n", 1418 | " 'gpu_id': -1,\n", 1419 | " 'grow_policy': 'depthwise',\n", 1420 | " 'importance_type': None,\n", 1421 | " 'interaction_constraints': '',\n", 1422 | " 'learning_rate': 0.300000012,\n", 1423 | " 'max_bin': 256,\n", 1424 | " 'max_cat_threshold': 64,\n", 1425 | " 'max_cat_to_onehot': 4,\n", 1426 | " 'max_delta_step': 0,\n", 1427 | " 'max_depth': 6,\n", 1428 | " 'max_leaves': 0,\n", 1429 | " 'min_child_weight': 1,\n", 1430 | " 'missing': nan,\n", 1431 | " 'monotone_constraints': '()',\n", 1432 | " 'n_estimators': 100,\n", 1433 | " 'n_jobs': 0,\n", 1434 | " 'num_parallel_tree': 1,\n", 1435 | " 'predictor': 'auto',\n", 1436 | " 'random_state': 0,\n", 1437 | " 'reg_alpha': 0,\n", 1438 | " 'reg_lambda': 1,\n", 1439 | " 'sampling_method': 'uniform',\n", 1440 | " 'scale_pos_weight': 1,\n", 1441 | " 'subsample': 1,\n", 1442 | " 'tree_method': 'exact',\n", 1443 | " 'validate_parameters': 1,\n", 1444 | " 'verbosity': None}\n", 1445 | "```\n", 1446 | "\n", 1447 | "### Tuning Hyperparameters\n", 1448 | "\n", 1449 | "``` python\n", 1450 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1451 | "ms.validation_curve(xgb.XGBClassifier(), X_train, y_train, param_name='gamma', \n", 1452 | " param_range=[0, .5, 1,5,10, 20, 30], n_jobs=-1, ax=ax)\n", 1453 | "```\n", 1454 | "\n", 1455 | "### Intuitive Understanding of Learning Rate\n", 1456 | "\n", 1457 | "``` python\n", 1458 | "# check impact of learning weight on scores\n", 1459 | "xg_lr1 = xgb.XGBClassifier(learning_rate=1, max_depth=2)\n", 1460 | "xg_lr1.fit(X_train, y_train)\n", 1461 | "```\n", 1462 | "\n", 1463 | "``` python\n", 1464 | "my_dot_export(xg_lr1, num_trees=0, filename='img/xg_depth2_tree0.dot', \n", 1465 | " title='Learning Rate set to 1') \n", 1466 | "```\n", 1467 | "\n", 1468 | "``` python\n", 1469 | "# check impact of learning weight on scores\n", 1470 | "xg_lr001 = xgb.XGBClassifier(learning_rate=.001, max_depth=2)\n", 1471 | "xg_lr001.fit(X_train, y_train)\n", 1472 | "```\n", 
1473 | "\n", 1474 | "``` python\n", 1475 | "my_dot_export(xg_lr001, num_trees=0, filename='img/xg_depth2_tree0_lr001.dot',\n", 1476 | " title='Learning Rate set to .001') \n", 1477 | "```\n", 1478 | "\n", 1479 | "### Grid Search\n", 1480 | "\n", 1481 | "``` python\n", 1482 | "from sklearn import model_selection\n", 1483 | "params = {'reg_lambda': [0], # No effect\n", 1484 | " 'learning_rate': [.1, .3], # makes each boost more conservative \n", 1485 | " 'subsample': [.7, 1],\n", 1486 | " 'max_depth': [2, 3],\n", 1487 | " 'random_state': [42],\n", 1488 | " 'n_jobs': [-1],\n", 1489 | " 'n_estimators': [200]}\n", 1490 | "\n", 1491 | "xgb2 = xgb.XGBClassifier(early_stopping_rounds=5) \n", 1492 | "cv = (model_selection.GridSearchCV(xgb2, params, cv=3, n_jobs=-1)\n", 1493 | " .fit(X_train, y_train,\n", 1494 | " eval_set=[(X_test, y_test)],\n", 1495 | " verbose=50\n", 1496 | " )\n", 1497 | ")\n", 1498 | "```\n", 1499 | "\n", 1500 | "``` pycon\n", 1501 | ">>> cv.best_params_\n", 1502 | "{'learning_rate': 0.3,\n", 1503 | " 'max_depth': 2,\n", 1504 | " 'n_estimators': 200,\n", 1505 | " 'n_jobs': -1,\n", 1506 | " 'random_state': 42,\n", 1507 | " 'reg_lambda': 0,\n", 1508 | " 'subsample': 1}\n", 1509 | "```\n", 1510 | "\n", 1511 | "``` python\n", 1512 | "params = {'learning_rate': 0.3,\n", 1513 | " 'max_depth': 2,\n", 1514 | " 'n_estimators': 200,\n", 1515 | " 'n_jobs': -1,\n", 1516 | " 'random_state': 42,\n", 1517 | " 'reg_lambda': 0,\n", 1518 | " 'subsample': 1\n", 1519 | "}\n", 1520 | "xgb_grid = xgb.XGBClassifier(**params, early_stopping_rounds=50)\n", 1521 | "xgb_grid.fit(X_train, y_train, eval_set=[(X_train, y_train),\n", 1522 | " (X_test, y_test)],\n", 1523 | " verbose=10\n", 1524 | ")\n", 1525 | "```\n", 1526 | "\n", 1527 | "``` python\n", 1528 | "# vs default\n", 1529 | "xgb_def = xgb.XGBClassifier(early_stopping_rounds=50)\n", 1530 | "xgb_def.fit(X_train, y_train, eval_set=[(X_train, y_train),\n", 1531 | " (X_test, y_test)],\n", 1532 | " verbose=10\n", 1533 | ")\n", 1534 | "```\n", 1535 | "\n", 1536 | "``` pycon\n", 1537 | ">>> xgb_def.score(X_test, y_test), xgb_grid.score(X_test, y_test)\n", 1538 | "(0.7558011049723757, 0.7524861878453039)\n", 1539 | "```\n", 1540 | "\n", 1541 | "``` pycon\n", 1542 | ">>> results_default = model_selection.cross_val_score(\n", 1543 | "... xgb.XGBClassifier(),\n", 1544 | "... X=X, y=y,\n", 1545 | "... cv=4\n", 1546 | "... )\n", 1547 | "```\n", 1548 | "\n", 1549 | "``` pycon\n", 1550 | ">>> results_default\n", 1551 | "array([0.71352785, 0.72413793, 0.69496021, 0.74501992])\n", 1552 | "```\n", 1553 | "\n", 1554 | "``` pycon\n", 1555 | ">>> results_default.mean()\n", 1556 | "0.7194114787534214\n", 1557 | "```\n", 1558 | "\n", 1559 | "``` pycon\n", 1560 | ">>> results_grid = model_selection.cross_val_score(\n", 1561 | "... xgb.XGBClassifier(**params),\n", 1562 | "... X=X, y=y,\n", 1563 | "... cv=4\n", 1564 | "... 
)\n", 1565 | "```\n", 1566 | "\n", 1567 | "``` pycon\n", 1568 | ">>> results_grid\n", 1569 | "array([0.74137931, 0.74137931, 0.74801061, 0.73572377])\n", 1570 | "```\n", 1571 | "\n", 1572 | "``` pycon\n", 1573 | ">>> results_grid.mean()\n", 1574 | "0.7416232505873941\n", 1575 | "```\n", 1576 | "\n", 1577 | "### Summary\n", 1578 | "\n", 1579 | "### Exercises\n", 1580 | "\n", 1581 | "## Hyperopt\n", 1582 | "\n", 1583 | "### Bayesian Optimization\n", 1584 | "\n", 1585 | "### Exhaustive Tuning with Hyperopt\n", 1586 | "\n", 1587 | "``` python\n", 1588 | "from hyperopt import fmin, tpe, hp, STATUS_OK, Trials\n", 1589 | "from sklearn.metrics import accuracy_score, roc_auc_score \n", 1590 | "\n", 1591 | "from typing import Any, Dict, Union\n", 1592 | "\n", 1593 | "def hyperparameter_tuning(space: Dict[str, Union[float, int]], \n", 1594 | " X_train: pd.DataFrame, y_train: pd.Series, \n", 1595 | " X_test: pd.DataFrame, y_test: pd.Series, \n", 1596 | " early_stopping_rounds: int=50,\n", 1597 | " metric:callable=accuracy_score) -> Dict[str, Any]:\n", 1598 | " \"\"\"\n", 1599 | " Perform hyperparameter tuning for an XGBoost classifier.\n", 1600 | "\n", 1601 | " This function takes a dictionary of hyperparameters, training \n", 1602 | " and test data, and an optional value for early stopping rounds, \n", 1603 | " and returns a dictionary with the loss and model resulting from \n", 1604 | " the tuning process. The model is trained using the training \n", 1605 | " data and evaluated on the test data. The loss is computed as \n", 1606 | " the negative of the accuracy score.\n", 1607 | "\n", 1608 | " Parameters\n", 1609 | " ----------\n", 1610 | " space : Dict[str, Union[float, int]]\n", 1611 | " A dictionary of hyperparameters for the XGBoost classifier.\n", 1612 | " X_train : pd.DataFrame\n", 1613 | " The training data.\n", 1614 | " y_train : pd.Series\n", 1615 | " The training target.\n", 1616 | " X_test : pd.DataFrame\n", 1617 | " The test data.\n", 1618 | " y_test : pd.Series\n", 1619 | " The test target.\n", 1620 | " early_stopping_rounds : int, optional\n", 1621 | " The number of early stopping rounds to use. The default value \n", 1622 | " is 50.\n", 1623 | " metric : callable\n", 1624 | " Metric to maximize. Default is accuracy\n", 1625 | "\n", 1626 | " Returns\n", 1627 | " -------\n", 1628 | " Dict[str, Any]\n", 1629 | " A dictionary with the loss and model resulting from the \n", 1630 | " tuning process. 
The loss is a float, and the model is an \n", 1631 | " XGBoost classifier.\n", 1632 | " \"\"\"\n", 1633 | " int_vals = ['max_depth', 'reg_alpha']\n", 1634 | " space = {k: (int(val) if k in int_vals else val)\n", 1635 | " for k,val in space.items()}\n", 1636 | " space['early_stopping_rounds'] = early_stopping_rounds\n", 1637 | " model = xgb.XGBClassifier(**space)\n", 1638 | " evaluation = [(X_train, y_train),\n", 1639 | " (X_test, y_test)]\n", 1640 | " model.fit(X_train, y_train,\n", 1641 | " eval_set=evaluation, \n", 1642 | " verbose=False) \n", 1643 | " \n", 1644 | " pred = model.predict(X_test)\n", 1645 | " score = metric(y_test, pred)\n", 1646 | " return {'loss': -score, 'status': STATUS_OK, 'model': model}\n", 1647 | "```\n", 1648 | "\n", 1649 | "``` python\n", 1650 | "options = {'max_depth': hp.quniform('max_depth', 1, 8, 1), # tree\n", 1651 | " 'min_child_weight': hp.loguniform('min_child_weight', -2, 3),\n", 1652 | " 'subsample': hp.uniform('subsample', 0.5, 1), # stochastic\n", 1653 | " 'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),\n", 1654 | " 'reg_alpha': hp.uniform('reg_alpha', 0, 10),\n", 1655 | " 'reg_lambda': hp.uniform('reg_lambda', 1, 10),\n", 1656 | " 'gamma': hp.loguniform('gamma', -10, 10), # regularization\n", 1657 | " 'learning_rate': hp.loguniform('learning_rate', -7, 0), # boosting\n", 1658 | " 'random_state': 42\n", 1659 | "}\n", 1660 | "\n", 1661 | "trials = Trials()\n", 1662 | "best = fmin(fn=lambda space: hyperparameter_tuning(space, X_train, y_train, \n", 1663 | " X_test, y_test), \n", 1664 | " space=options, \n", 1665 | " algo=tpe.suggest, \n", 1666 | " max_evals=2_000, \n", 1667 | " trials=trials,\n", 1668 | " #timeout=60*5 # 5 minutes\n", 1669 | ")\n", 1670 | "```\n", 1671 | "\n", 1672 | "``` python\n", 1673 | "# 2 hours of training (paste best in here)\n", 1674 | "long_params = {'colsample_bytree': 0.6874845219014455, \n", 1675 | " 'gamma': 0.06936323554883501, \n", 1676 | " 'learning_rate': 0.21439214284976907, \n", 1677 | " 'max_depth': 6, \n", 1678 | " 'min_child_weight': 0.6678357091609912, \n", 1679 | " 'reg_alpha': 3.2979862933185546, \n", 1680 | " 'reg_lambda': 7.850943400390477, \n", 1681 | " 'subsample': 0.999767483950891}\n", 1682 | "```\n", 1683 | "\n", 1684 | "``` python\n", 1685 | "xg_ex = xgb.XGBClassifier(**long_params, early_stopping_rounds=50,\n", 1686 | " n_estimators=500)\n", 1687 | "xg_ex.fit(X_train, y_train,\n", 1688 | " eval_set=[(X_train, y_train),\n", 1689 | " (X_test, y_test)\n", 1690 | " ],\n", 1691 | " verbose=100\n", 1692 | " )\n", 1693 | "```\n", 1694 | "\n", 1695 | "``` pycon\n", 1696 | ">>> xg_ex.score(X_test, y_test)\n", 1697 | "0.7580110497237569\n", 1698 | "```\n", 1699 | "\n", 1700 | "### Defining Parameter Distributions\n", 1701 | "\n", 1702 | "``` pycon\n", 1703 | ">>> from hyperopt import hp, pyll\n", 1704 | ">>> pyll.stochastic.sample(hp.choice('value', ['a', 'b', 'c']))\n", 1705 | "'a'\n", 1706 | "```\n", 1707 | "\n", 1708 | "``` pycon\n", 1709 | ">>> pyll.stochastic.sample(hp.pchoice('value', [(.05, 'a'), (.9, 'b'), \n", 1710 | "... 
(.05, 'c')]))\n", 1711 | "'c'\n", 1712 | "```\n", 1713 | "\n", 1714 | "``` pycon\n", 1715 | ">>> from hyperopt import hp, pyll\n", 1716 | "\n", 1717 | ">>> pyll.stochastic.sample(hp.uniform('value', 0, 1))\n", 1718 | "0.7875384438202859\n", 1719 | "```\n", 1720 | "\n", 1721 | "``` python\n", 1722 | "uniform_vals = [pyll.stochastic.sample(hp.uniform('value', 0, 1)) \n", 1723 | " for _ in range(10_000)]\n", 1724 | "```\n", 1725 | "\n", 1726 | "``` python\n", 1727 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1728 | "ax.hist(uniform_vals)\n", 1729 | "```\n", 1730 | "\n", 1731 | "``` python\n", 1732 | "loguniform_vals = [pyll.stochastic.sample(hp.loguniform('value', -5, 5)) \n", 1733 | " for _ in range(10_000)]\n", 1734 | "```\n", 1735 | "\n", 1736 | "``` python\n", 1737 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1738 | "ax.hist(loguniform_vals)\n", 1739 | "```\n", 1740 | "\n", 1741 | "``` python\n", 1742 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1743 | "(pd.Series(np.arange(-5, 5, step=.1))\n", 1744 | " .rename('x')\n", 1745 | " .to_frame()\n", 1746 | " .assign(y=lambda adf:np.exp(adf.x))\n", 1747 | " .plot(x='x', y='y', ax=ax)\n", 1748 | ")\n", 1749 | "```\n", 1750 | "\n", 1751 | "``` pycon\n", 1752 | ">>> from hyperopt import hp, pyll\n", 1753 | ">>> from math import log\n", 1754 | ">>> pyll.stochastic.sample(hp.loguniform('value', log(.1), log(10)))\n", 1755 | "3.0090767867889174\n", 1756 | "```\n", 1757 | "\n", 1758 | "``` python\n", 1759 | "quniform_vals = [pyll.stochastic.sample(hp.quniform('value', -5, 5, q=2)) \n", 1760 | " for _ in range(10_000)]\n", 1761 | "```\n", 1762 | "\n", 1763 | "``` pycon\n", 1764 | ">>> pd.Series(quniform_vals).value_counts()\n", 1765 | "-0.0 2042\n", 1766 | "-2.0 2021\n", 1767 | " 2.0 2001\n", 1768 | " 4.0 2000\n", 1769 | "-4.0 1936\n", 1770 | "dtype: int64\n", 1771 | "```\n", 1772 | "\n", 1773 | "### Exploring the Trials\n", 1774 | "\n", 1775 | "``` python\n", 1776 | "from typing import Any, Dict, Sequence\n", 1777 | "def trial2df(trial: Sequence[Dict[str, Any]]) -> pd.DataFrame:\n", 1778 | " \"\"\"\n", 1779 | " Convert a Trial object (sequence of trial dictionaries)\n", 1780 | " to a Pandas DataFrame.\n", 1781 | "\n", 1782 | " Parameters\n", 1783 | " ----------\n", 1784 | " trial : List[Dict[str, Any]]\n", 1785 | " A list of trial dictionaries.\n", 1786 | "\n", 1787 | " Returns\n", 1788 | " -------\n", 1789 | " pd.DataFrame\n", 1790 | " A DataFrame with columns for the loss, trial id, and\n", 1791 | " values from each trial dictionary.\n", 1792 | " \"\"\"\n", 1793 | " vals = []\n", 1794 | " for t in trial:\n", 1795 | " result = t['result']\n", 1796 | " misc = t['misc']\n", 1797 | " val = {k:(v[0] if isinstance(v, list) else v) \n", 1798 | " for k,v in misc['vals'].items()\n", 1799 | " }\n", 1800 | " val['loss'] = result['loss']\n", 1801 | " val['tid'] = t['tid']\n", 1802 | " vals.append(val)\n", 1803 | " return pd.DataFrame(vals)\n", 1804 | "```\n", 1805 | "\n", 1806 | "``` pycon\n", 1807 | ">>> hyper2hr = trial2df(trials)\n", 1808 | "```\n", 1809 | "\n", 1810 | "``` pycon\n", 1811 | ">>> hyper2hr\n", 1812 | " colsample_bytree gamma learning_rate ... subsample loss \\\n", 1813 | "0 0.854670 2.753933 0.042056 ... 0.913247 -0.744751 \n", 1814 | "1 0.512653 0.153628 0.611973 ... 0.550048 -0.746961 \n", 1815 | "2 0.552569 1.010561 0.002412 ... 0.508593 -0.735912 \n", 1816 | "3 0.604020 682.836185 0.005037 ... 0.536935 -0.545856 \n", 1817 | "4 0.785281 0.004130 0.015200 ... 0.691211 -0.739227 \n", 1818 | "... ... ... ... ... ... ... 
\n", 1819 | "1995 0.717890 0.000543 0.141629 ... 0.893414 -0.765746 \n", 1820 | "1996 0.725305 0.000248 0.172854 ... 0.919415 -0.765746 \n", 1821 | "1997 0.698025 0.028484 0.162207 ... 0.952204 -0.770166 \n", 1822 | "1998 0.688053 0.068223 0.099814 ... 0.939489 -0.762431 \n", 1823 | "1999 0.666225 0.125253 0.203441 ... 0.980354 -0.767956 \n", 1824 | "\n", 1825 | " tid \n", 1826 | "0 0 \n", 1827 | "1 1 \n", 1828 | "2 2 \n", 1829 | "3 3 \n", 1830 | "4 4 \n", 1831 | "... ... \n", 1832 | "1995 1995 \n", 1833 | "1996 1996 \n", 1834 | "1997 1997 \n", 1835 | "1998 1998 \n", 1836 | "1999 1999 \n", 1837 | "\n", 1838 | "[2000 rows x 10 columns]\n", 1839 | "```\n", 1840 | "\n", 1841 | "``` python\n", 1842 | "import seaborn as sns\n", 1843 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1844 | "sns.heatmap(hyper2hr.corr(method='spearman'),\n", 1845 | " cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax\n", 1846 | ")\n", 1847 | "```\n", 1848 | "\n", 1849 | "``` python\n", 1850 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1851 | "(hyper2hr\n", 1852 | " .plot.scatter(x='tid', y='loss', alpha=.1, color='purple', ax=ax)\n", 1853 | ")\n", 1854 | "```\n", 1855 | "\n", 1856 | "``` python\n", 1857 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1858 | "(hyper2hr\n", 1859 | " .plot.scatter(x='max_depth', y='loss', alpha=1, color='purple', ax=ax)\n", 1860 | ")\n", 1861 | "```\n", 1862 | "\n", 1863 | "``` python\n", 1864 | "import numpy as np\n", 1865 | "\n", 1866 | "def jitter(df: pd.DataFrame, col: str, amount: float=1) -> pd.Series:\n", 1867 | " \"\"\"\n", 1868 | " Add random noise to the values in a Pandas DataFrame column.\n", 1869 | "\n", 1870 | " This function adds random noise to the values in a specified \n", 1871 | " column of a Pandas DataFrame. The noise is uniform random \n", 1872 | " noise with a range of `amount` centered around zero. The \n", 1873 | " function returns a Pandas Series with the jittered values.\n", 1874 | "\n", 1875 | " Parameters\n", 1876 | " ----------\n", 1877 | " df : pd.DataFrame\n", 1878 | " The input DataFrame.\n", 1879 | " col : str\n", 1880 | " The name of the column to jitter.\n", 1881 | " amount : float, optional\n", 1882 | " The range of the noise to add. 
The default value is 1.\n", 1883 | "\n", 1884 | " Returns\n", 1885 | " -------\n", 1886 | " pd.Series\n", 1887 | " A Pandas Series with the jittered values.\n", 1888 | " \"\"\"\n", 1889 | " vals = np.random.uniform(low=-amount/2, high=amount/2,\n", 1890 | " size=df.shape[0])\n", 1891 | " return df[col] + vals\n", 1892 | "```\n", 1893 | "\n", 1894 | "``` python\n", 1895 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1896 | "(hyper2hr\n", 1897 | " .assign(max_depth=lambda df:jitter(df, 'max_depth', amount=.8))\n", 1898 | " .plot.scatter(x='max_depth', y='loss', alpha=.1, color='purple', ax=ax)\n", 1899 | ")\n", 1900 | "```\n", 1901 | "\n", 1902 | "``` python\n", 1903 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1904 | "(hyper2hr\n", 1905 | " .assign(max_depth=lambda df:jitter(df, 'max_depth', amount=.8))\n", 1906 | " .plot.scatter(x='max_depth', y='loss', alpha=.5, \n", 1907 | " color='tid', cmap='viridis', ax=ax)\n", 1908 | ")\n", 1909 | "```\n", 1910 | "\n", 1911 | "``` python\n", 1912 | "import seaborn as sns\n", 1913 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1914 | "sns.violinplot(x='max_depth', y='loss', data=hyper2hr, kind='violin', ax=ax)\n", 1915 | "```\n", 1916 | "\n", 1917 | "``` python\n", 1918 | "\n", 1919 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1920 | "(hyper2hr\n", 1921 | " .plot.scatter(x='reg_alpha', y='colsample_bytree', alpha=.8,\n", 1922 | " color='tid', cmap='viridis', ax=ax)\n", 1923 | ")\n", 1924 | "\n", 1925 | "ax.annotate('Min Loss (-0.77)', xy=(4.56, 0.692),\n", 1926 | " xytext=(.7, .84), arrowprops={'color':'k'})\n", 1927 | "```\n", 1928 | "\n", 1929 | "``` python\n", 1930 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1931 | "(hyper2hr\n", 1932 | " .plot.scatter(x='gamma', y='loss', alpha=.1, color='purple', ax=ax)\n", 1933 | ")\n", 1934 | "```\n", 1935 | "\n", 1936 | "``` python\n", 1937 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1938 | "(hyper2hr\n", 1939 | " .plot.scatter(x='gamma', y='loss', alpha=.5, color='tid', ax=ax, \n", 1940 | " logx=True, cmap='viridis')\n", 1941 | ")\n", 1942 | "\n", 1943 | "ax.annotate('Min Loss (-0.77)', xy=(0.000581, -0.777),\n", 1944 | " xytext=(1, -.6), arrowprops={'color':'k'})\n", 1945 | "```\n", 1946 | "\n", 1947 | "### EDA with Plotly\n", 1948 | "\n", 1949 | "``` python\n", 1950 | "import plotly.graph_objects as go\n", 1951 | "\n", 1952 | "def plot_3d_mesh(df: pd.DataFrame, x_col: str, y_col: str, \n", 1953 | " z_col: str) -> go.Figure:\n", 1954 | " \"\"\"\n", 1955 | " Create a 3D mesh plot using Plotly.\n", 1956 | "\n", 1957 | " This function creates a 3D mesh plot using Plotly, with \n", 1958 | " the `x_col`, `y_col`, and `z_col` columns of the `df` \n", 1959 | " DataFrame as the x, y, and z values, respectively. The \n", 1960 | " plot has a title and axis labels that match the column \n", 1961 | " names, and the intensity of the mesh is proportional \n", 1962 | " to the values in the `z_col` column. 
The function returns \n", 1963 | " a Plotly Figure object that can be displayed or saved as \n", 1964 | " desired.\n", 1965 | "\n", 1966 | " Parameters\n", 1967 | " ----------\n", 1968 | " df : pd.DataFrame\n", 1969 | " The DataFrame containing the data to plot.\n", 1970 | " x_col : str\n", 1971 | " The name of the column to use as the x values.\n", 1972 | " y_col : str\n", 1973 | " The name of the column to use as the y values.\n", 1974 | " z_col : str\n", 1975 | " The name of the column to use as the z values.\n", 1976 | "\n", 1977 | " Returns\n", 1978 | " -------\n", 1979 | " go.Figure\n", 1980 | " A Plotly Figure object with the 3D mesh plot.\n", 1981 | " \"\"\"\n", 1982 | " fig = go.Figure(data=[go.Mesh3d(x=df[x_col], y=df[y_col], z=df[z_col],\n", 1983 | " intensity=df[z_col]/ df[z_col].min(),\n", 1984 | " hovertemplate=f\"{z_col}: %{{z}}
<br>{x_col}: %{{x}}<br>
{y_col}: \"\n", 1985 | " \"%{{y}}\")],\n", 1986 | " )\n", 1987 | "\n", 1988 | " fig.update_layout( \n", 1989 | " title=dict(text=f'{y_col} vs {x_col}'),\n", 1990 | " scene = dict(\n", 1991 | " xaxis_title=x_col,\n", 1992 | " yaxis_title=y_col,\n", 1993 | " zaxis_title=z_col),\n", 1994 | " width=700,\n", 1995 | " margin=dict(r=20, b=10, l=10, t=50)\n", 1996 | " )\n", 1997 | " return fig\n", 1998 | "```\n", 1999 | "\n", 2000 | "``` python\n", 2001 | "fig = plot_3d_mesh(hyper2hr.query('gamma < .2'),\n", 2002 | " 'reg_lambda', 'gamma', 'loss')\n", 2003 | " \n", 2004 | "fig\n", 2005 | "```\n", 2006 | "\n", 2007 | "``` python\n", 2008 | "import plotly.express as px\n", 2009 | "import plotly.graph_objects as go\n", 2010 | "\n", 2011 | "def plot_3d_scatter(df: pd.DataFrame, x_col: str, y_col: str, \n", 2012 | " z_col: str, color_col: str, \n", 2013 | " opacity: float=1) -> go.Figure:\n", 2014 | " \"\"\"\n", 2015 | " Create a 3D scatter plot using Plotly Express.\n", 2016 | "\n", 2017 | " This function creates a 3D scatter plot using Plotly Express, \n", 2018 | " with the `x_col`, `y_col`, and `z_col` columns of the `df` \n", 2019 | " DataFrame as the x, y, and z values, respectively. The points \n", 2020 | " in the plot are colored according to the values in the \n", 2021 | " `color_col` column, using a continuous color scale. The \n", 2022 | " function returns a Plotly Express scatter_3d object that \n", 2023 | " can be displayed or saved as desired.\n", 2024 | "\n", 2025 | " Parameters\n", 2026 | " ----------\n", 2027 | " df : pd.DataFrame\n", 2028 | " The DataFrame containing the data to plot.\n", 2029 | " x_col : str\n", 2030 | " The name of the column to use as the x values.\n", 2031 | " y_col : str\n", 2032 | " The name of the column to use as the y values.\n", 2033 | " z_col : str\n", 2034 | " The name of the column to use as the z values.\n", 2035 | " color_col : str\n", 2036 | " The name of the column to use for coloring.\n", 2037 | " opacity : float\n", 2038 | " The opacity (alpha) of the points.\n", 2039 | "\n", 2040 | " Returns\n", 2041 | " -------\n", 2042 | " go.Figure\n", 2043 | " A Plotly Figure object with the 3D mesh plot.\n", 2044 | " \"\"\"\n", 2045 | " fig = px.scatter_3d(data_frame=df, x=x_col,\n", 2046 | " y=y_col, z=z_col, color=color_col,\n", 2047 | " color_continuous_scale=px.colors.sequential.Viridis_r,\n", 2048 | " opacity=opacity)\n", 2049 | " return fig\n", 2050 | "```\n", 2051 | "\n", 2052 | "``` python\n", 2053 | "plot_3d_scatter(hyper2hr.query('gamma < .2'), \n", 2054 | " 'reg_lambda', 'gamma', 'tid', color_col='loss')\n", 2055 | "```\n", 2056 | "\n", 2057 | "### Conclusion\n", 2058 | "\n", 2059 | "### Exercises\n", 2060 | "\n", 2061 | "## Step-wise Tuning with Hyperopt\n", 2062 | "\n", 2063 | "### Groups of Hyperparameters\n", 2064 | "\n", 2065 | "``` python\n", 2066 | "from hyperopt import fmin, tpe, hp, Trials\n", 2067 | "params = {'random_state': 42}\n", 2068 | "\n", 2069 | "rounds = [{'max_depth': hp.quniform('max_depth', 1, 8, 1), # tree\n", 2070 | " 'min_child_weight': hp.loguniform('min_child_weight', -2, 3)},\n", 2071 | " {'subsample': hp.uniform('subsample', 0.5, 1), # stochastic\n", 2072 | " 'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)},\n", 2073 | " {'reg_alpha': hp.uniform('reg_alpha', 0, 10),\n", 2074 | " 'reg_lambda': hp.uniform('reg_lambda', 1, 10),},\n", 2075 | " {'gamma': hp.loguniform('gamma', -10, 10)}, # regularization\n", 2076 | " {'learning_rate': hp.loguniform('learning_rate', -7, 0)} # boosting\n", 2077 | "]\n", 2078 | "\n", 
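"# Step-wise tuning: each pass through the loop below merges one group of\n",
"# related hyperparameters into `params`, so every round searches only the new\n",
"# group while reusing the best values already found for the earlier groups.\n",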
2079 | "all_trials = []\n", 2080 | "for round in rounds:\n", 2081 | " params = {**params, **round}\n", 2082 | " trials = Trials()\n", 2083 | " best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(space, X_train, \n", 2084 | " y_train, X_test, y_test), \n", 2085 | " space=params, \n", 2086 | " algo=tpe.suggest, \n", 2087 | " max_evals=20, \n", 2088 | " trials=trials,\n", 2089 | " )\n", 2090 | " params = {**params, **best}\n", 2091 | " all_trials.append(trials)\n", 2092 | "```\n", 2093 | "\n", 2094 | "### Visualization Hyperparameter Scores\n", 2095 | "\n", 2096 | "``` python\n", 2097 | "xhelp.plot_3d_mesh(xhelp.trial2df(all_trials[2]),\n", 2098 | " 'reg_alpha', 'reg_lambda', 'loss') \n", 2099 | "```\n", 2100 | "\n", 2101 | "### Training an Optimized Model\n", 2102 | "\n", 2103 | "``` python\n", 2104 | "step_params = {'random_state': 42,\n", 2105 | " 'max_depth': 5,\n", 2106 | " 'min_child_weight': 0.6411044640540848,\n", 2107 | " 'subsample': 0.9492383155577023,\n", 2108 | " 'colsample_bytree': 0.6235721099295888,\n", 2109 | " 'gamma': 0.00011273797329538491,\n", 2110 | " 'learning_rate': 0.24399020050740935}\n", 2111 | "```\n", 2112 | "\n", 2113 | "``` python\n", 2114 | "xg_step = xgb.XGBClassifier(**step_params, early_stopping_rounds=50,\n", 2115 | " n_estimators=500)\n", 2116 | "xg_step.fit(X_train, y_train,\n", 2117 | " eval_set=[(X_train, y_train),\n", 2118 | " (X_test, y_test)\n", 2119 | " ],\n", 2120 | " verbose=100\n", 2121 | " )\n", 2122 | "```\n", 2123 | "\n", 2124 | "``` pycon\n", 2125 | ">>> xg_step.score(X_test, y_test)\n", 2126 | "0.7613259668508288\n", 2127 | "```\n", 2128 | "\n", 2129 | "``` pycon\n", 2130 | ">>> xg_def = xgb.XGBClassifier()\n", 2131 | ">>> xg_def.fit(X_train, y_train)\n", 2132 | ">>> xg_def.score(X_test, y_test)\n", 2133 | "0.7458563535911602\n", 2134 | "```\n", 2135 | "\n", 2136 | "### Summary\n", 2137 | "\n", 2138 | "### Exercises\n", 2139 | "\n", 2140 | "## Do you have enough data?\n", 2141 | "\n", 2142 | "### Learning Curves\n", 2143 | "\n", 2144 | "``` python\n", 2145 | "params = {'learning_rate': 0.3,\n", 2146 | " 'max_depth': 2,\n", 2147 | " 'n_estimators': 200,\n", 2148 | " 'n_jobs': -1,\n", 2149 | " 'random_state': 42,\n", 2150 | " 'reg_lambda': 0,\n", 2151 | " 'subsample': 1}\n", 2152 | "```\n", 2153 | "\n", 2154 | "``` python\n", 2155 | "import yellowbrick.model_selection as ms\n", 2156 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2157 | "viz = ms.learning_curve(xgb.XGBClassifier(**params),\n", 2158 | " X, y, ax=ax\n", 2159 | ")\n", 2160 | "ax.set_ylim(0.6, 1)\n", 2161 | "```\n", 2162 | "\n", 2163 | "### Learning Curves for Decision Trees\n", 2164 | "\n", 2165 | "``` python\n", 2166 | "# tuned tree\n", 2167 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2168 | "viz = ms.learning_curve(tree.DecisionTreeClassifier(max_depth=7),\n", 2169 | " X, y, ax=ax)\n", 2170 | "viz.ax.set_ylim(0.6, 1)\n", 2171 | "```\n", 2172 | "\n", 2173 | "### Underfit Learning Curves\n", 2174 | "\n", 2175 | "``` python\n", 2176 | "# underfit\n", 2177 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2178 | "viz = ms.learning_curve(tree.DecisionTreeClassifier(max_depth=1),\n", 2179 | " X, y, ax=ax\n", 2180 | ")\n", 2181 | "ax.set_ylim(0.6, 1)\n", 2182 | "```\n", 2183 | "\n", 2184 | "### Overfit Learning Curves\n", 2185 | "\n", 2186 | "``` python\n", 2187 | "# overfit\n", 2188 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2189 | "viz = ms.learning_curve(tree.DecisionTreeClassifier(),\n", 2190 | " X, y, ax=ax\n", 2191 | ")\n", 2192 | "ax.set_ylim(0.6, 1)\n", 2193 | "```\n", 
2194 | "\n", 2195 | "### Summary\n", 2196 | "\n", 2197 | "### Exercises\n", 2198 | "\n", 2199 | "## Model Evaluation\n", 2200 | "\n", 2201 | "### Accuracy\n", 2202 | "\n", 2203 | "``` python\n", 2204 | "xgb_def = xgb.XGBClassifier()\n", 2205 | "xgb_def.fit(X_train, y_train)\n", 2206 | "```\n", 2207 | "\n", 2208 | "``` pycon\n", 2209 | ">>> xgb_def.score(X_test, y_test)\n", 2210 | "0.7458563535911602\n", 2211 | "```\n", 2212 | "\n", 2213 | "``` pycon\n", 2214 | ">>> from sklearn import metrics\n", 2215 | ">>> metrics.accuracy_score(y_test, xgb_def.predict(X_test))\n", 2216 | "0.7458563535911602\n", 2217 | "```\n", 2218 | "\n", 2219 | "### Confusion Matrix\n", 2220 | "\n", 2221 | "``` python\n", 2222 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2223 | "classifier.confusion_matrix(xgb_def, X_train, y_train,\n", 2224 | " X_test, y_test,\n", 2225 | " classes=['DS', 'SE'], ax=ax\n", 2226 | " )\n", 2227 | "```\n", 2228 | "\n", 2229 | "``` pycon\n", 2230 | ">>> from sklearn import metrics\n", 2231 | ">>> cm = metrics.confusion_matrix(y_test, xgb_def.predict(X_test))\n", 2232 | ">>> cm\n", 2233 | "array([[372, 122],\n", 2234 | " [108, 303]])\n", 2235 | "```\n", 2236 | "\n", 2237 | "``` python\n", 2238 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2239 | "disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, \n", 2240 | " display_labels=['DS', 'SE'])\n", 2241 | "disp.plot(ax=ax, cmap='Blues')\n", 2242 | "```\n", 2243 | "\n", 2244 | "``` python\n", 2245 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2246 | "cm = metrics.confusion_matrix(y_test, xgb_def.predict(X_test), \n", 2247 | " normalize='true')\n", 2248 | "disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, \n", 2249 | " display_labels=['DS', 'SE']) \n", 2250 | "disp.plot(ax=ax, cmap='Blues')\n", 2251 | "```\n", 2252 | "\n", 2253 | "### Precision and Recall\n", 2254 | "\n", 2255 | "``` pycon\n", 2256 | ">>> metrics.precision_score(y_test, xgb_def.predict(X_test))\n", 2257 | "0.7129411764705882\n", 2258 | "```\n", 2259 | "\n", 2260 | "``` pycon\n", 2261 | ">>> metrics.recall_score(y_test, xgb_def.predict(X_test))\n", 2262 | "0.7372262773722628\n", 2263 | "```\n", 2264 | "\n", 2265 | "``` python\n", 2266 | "from yellowbrick import classifier\n", 2267 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2268 | "classifier.precision_recall_curve(xgb_def, X_train, y_train,\n", 2269 | " X_test, y_test, micro=False, macro=False, ax=ax, per_class=True)\n", 2270 | "ax.set_ylim((0,1.05))\n", 2271 | "```\n", 2272 | "\n", 2273 | "### F1 Score\n", 2274 | "\n", 2275 | "``` pycon\n", 2276 | ">>> metrics.f1_score(y_test, xgb_def.predict(X_test))\n", 2277 | "0.7248803827751197\n", 2278 | "```\n", 2279 | "\n", 2280 | "``` pycon\n", 2281 | ">>> print(metrics.classification_report(y_test, \n", 2282 | "... 
y_pred=xgb_def.predict(X_test), target_names=['DS', 'SE']))\n", 2283 | " precision recall f1-score support\n", 2284 | "\n", 2285 | " DS 0.78 0.75 0.76 494\n", 2286 | " SE 0.71 0.74 0.72 411\n", 2287 | "\n", 2288 | " accuracy 0.75 905\n", 2289 | " macro avg 0.74 0.75 0.74 905\n", 2290 | "weighted avg 0.75 0.75 0.75 905\n", 2291 | "```\n", 2292 | "\n", 2293 | "``` python\n", 2294 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2295 | "classifier.classification_report(xgb_def, X_train, y_train,\n", 2296 | " X_test, y_test, classes=['DS', 'SE'],\n", 2297 | " micro=False, macro=False, ax=ax)\n", 2298 | "```\n", 2299 | "\n", 2300 | "### ROC Curve\n", 2301 | "\n", 2302 | "``` python\n", 2303 | "fig, ax = plt.subplots(figsize=(8,8))\n", 2304 | "metrics.RocCurveDisplay.from_estimator(xgb_def,\n", 2305 | " X_test, y_test,ax=ax, label='default')\n", 2306 | "metrics.RocCurveDisplay.from_estimator(xg_step,\n", 2307 | " X_test, y_test,ax=ax)\n", 2308 | "```\n", 2309 | "\n", 2310 | "``` python\n", 2311 | "fig, axes = plt.subplots(figsize=(8, 4), ncols=2)\n", 2312 | "metrics.RocCurveDisplay.from_estimator(xgb_def,\n", 2313 | " X_train, y_train,ax=axes[0], label='detault train')\n", 2314 | "metrics.RocCurveDisplay.from_estimator(xgb_def,\n", 2315 | " X_test, y_test,ax=axes[0])\n", 2316 | "axes[0].set(title='ROC plots for default model')\n", 2317 | "\n", 2318 | "metrics.RocCurveDisplay.from_estimator(xg_step,\n", 2319 | " X_train, y_train,ax=axes[1], label='step train')\n", 2320 | "metrics.RocCurveDisplay.from_estimator(xg_step,\n", 2321 | " X_test, y_test,ax=axes[1])\n", 2322 | "axes[1].set(title='ROC plots for stepwise model')\n", 2323 | "```\n", 2324 | "\n", 2325 | "### Threshold Metrics\n", 2326 | "\n", 2327 | "``` python\n", 2328 | "class ThresholdXGBClassifier(xgb.XGBClassifier):\n", 2329 | " def __init__(self, threshold=0.5, **kwargs):\n", 2330 | " super().__init__(**kwargs)\n", 2331 | " self.threshold = threshold\n", 2332 | "\n", 2333 | " def predict(self, X, *args, **kwargs):\n", 2334 | " \"\"\"Predict with `threshold` applied to predicted class probabilities.\n", 2335 | " \"\"\"\n", 2336 | " proba = self.predict_proba(X, *args, **kwargs)\n", 2337 | " return (proba[:, 1] > self.threshold).astype(int)\n", 2338 | "```\n", 2339 | "\n", 2340 | "``` pycon\n", 2341 | ">>> xgb_def = xgb.XGBClassifier()\n", 2342 | ">>> xgb_def.fit(X_train, y_train)\n", 2343 | ">>> xgb_def.predict_proba(X_test.iloc[[0]])\n", 2344 | "array([[0.14253652, 0.8574635 ]], dtype=float32)\n", 2345 | "```\n", 2346 | "\n", 2347 | "``` pycon\n", 2348 | ">>> xgb_def.predict(X_test.iloc[[0]])\n", 2349 | "array([1])\n", 2350 | "```\n", 2351 | "\n", 2352 | "``` pycon\n", 2353 | ">>> xgb90 = ThresholdXGBClassifier(threshold=.9, verbosity=0)\n", 2354 | ">>> xgb90.fit(X_train, y_train)\n", 2355 | ">>> xgb90.predict(X_test.iloc[[0]])\n", 2356 | "array([0])\n", 2357 | "```\n", 2358 | "\n", 2359 | "``` python\n", 2360 | "def get_tpr_fpr(probs, y_truth):\n", 2361 | " \"\"\"\n", 2362 | " Calculates true positive rate (TPR) and false positive rate\n", 2363 | " (FPR) given predicted probabilities and ground truth labels.\n", 2364 | "\n", 2365 | " Parameters:\n", 2366 | " probs (np.array): predicted probabilities of positive class\n", 2367 | " y_truth (np.array): ground truth labels\n", 2368 | "\n", 2369 | " Returns:\n", 2370 | " tuple: (tpr, fpr)\n", 2371 | " \"\"\"\n", 2372 | " tp = (probs == 1) & (y_truth == 1)\n", 2373 | " tn = (probs < 1) & (y_truth == 0)\n", 2374 | " fp = (probs == 1) & (y_truth == 0)\n", 2375 | " fn = (probs < 1) & (y_truth == 
1)\n", 2376 | " tpr = tp.sum() / (tp.sum() + fn.sum())\n", 2377 | " fpr = fp.sum() / (fp.sum() + tn.sum())\n", 2378 | " return tpr, fpr\n", 2379 | "\n", 2380 | "\n", 2381 | "vals = []\n", 2382 | "for thresh in np.arange(0, 1, step=.05):\n", 2383 | " probs = xg_step.predict_proba(X_test)[:, 1]\n", 2384 | " tpr, fpr = get_tpr_fpr(probs > thresh, y_test)\n", 2385 | " val = [thresh, tpr, fpr]\n", 2386 | " for metric in [metrics.accuracy_score, metrics.precision_score,\n", 2387 | " metrics.recall_score, metrics.f1_score, \n", 2388 | " metrics.roc_auc_score]:\n", 2389 | " val.append(metric(y_test, probs > thresh))\n", 2390 | " vals.append(val)\n", 2391 | "```\n", 2392 | "\n", 2393 | "``` python\n", 2394 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2395 | "(pd.DataFrame(vals, columns=['thresh', 'tpr/rec', 'fpr', 'acc', \n", 2396 | " 'prec', 'rec', 'f1', 'auc'])\n", 2397 | " .drop(columns='rec')\n", 2398 | " .set_index('thresh')\n", 2399 | " .plot(ax=ax, title='Threshold Metrics')\n", 2400 | ")\n", 2401 | "```\n", 2402 | "\n", 2403 | "### Cumulative Gains Curve\n", 2404 | "\n", 2405 | "``` python\n", 2406 | "import scikitplot\n", 2407 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2408 | "y_probs = xgb_def.predict_proba(X_test)\n", 2409 | "scikitplot.metrics.plot_cumulative_gain(y_test, y_probs, ax=ax)\n", 2410 | "ax.plot([0, (y_test == 1).mean(), 1], [0, 1, 1], label='Optimal Class 1')\n", 2411 | "ax.set_ylim(0, 1.05)\n", 2412 | "ax.annotate('Reach 60% of\\nClass 1\\nby contacting top 35%', xy=(.35, .6),\n", 2413 | " xytext=(.55,.25), arrowprops={'color':'k'})\n", 2414 | "ax.legend()\n", 2415 | "```\n", 2416 | "\n", 2417 | "### Lift Curves\n", 2418 | "\n", 2419 | "``` python\n", 2420 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2421 | "y_probs = xgb_def.predict_proba(X_test)\n", 2422 | "scikitplot.metrics.plot_lift_curve(y_test, y_probs, ax=ax)\n", 2423 | "mean = (y_test == 1).mean()\n", 2424 | "ax.plot([0, mean, 1], [1/mean, 1/mean, 1], label='Optimal Class 1')\n", 2425 | "ax.legend()\n", 2426 | "```\n", 2427 | "\n", 2428 | "### Summary\n", 2429 | "\n", 2430 | "### Exercises\n", 2431 | "\n", 2432 | "## Training For Different Metrics\n", 2433 | "\n", 2434 | "### Metric overview\n", 2435 | "\n", 2436 | "### Training with Validation Curves\n", 2437 | "\n", 2438 | "``` python\n", 2439 | "from yellowbrick import model_selection as ms\n", 2440 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2441 | "ms.validation_curve(xgb.XGBClassifier(), X_train, y_train,\n", 2442 | " scoring='accuracy', param_name='learning_rate', \n", 2443 | " param_range=[0.001, .01, .05, .1, .2, .5, .9, 1], ax=ax\n", 2444 | ")\n", 2445 | "ax.set_xlabel('Accuracy')\n", 2446 | "```\n", 2447 | "\n", 2448 | "``` python\n", 2449 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2450 | "ms.validation_curve(xgb.XGBClassifier(), X_train, y_train,\n", 2451 | " scoring='roc_auc', param_name='learning_rate',\n", 2452 | " param_range=[0.001, .01, .05, .1, .2, .5, .9, 1], ax=ax\n", 2453 | " )\n", 2454 | "ax.set_xlabel('roc_auc')\n", 2455 | "```\n", 2456 | "\n", 2457 | "### Step-wise Recall Tuning\n", 2458 | "\n", 2459 | "``` python\n", 2460 | "from sklearn.metrics import roc_auc_score\n", 2461 | "from hyperopt import hp, Trials, fmin, tpe\n", 2462 | "params = {'random_state': 42}\n", 2463 | "\n", 2464 | "rounds = [{'max_depth': hp.quniform('max_depth', 1, 9, 1), # tree\n", 2465 | " 'min_child_weight': hp.loguniform('min_child_weight', -2, 3)},\n", 2466 | " {'subsample': hp.uniform('subsample', 0.5, 1), # stochastic\n", 2467 | " 'colsample_bytree': 
hp.uniform('colsample_bytree', 0.5, 1)},\n", 2468 | " {'gamma': hp.loguniform('gamma', -10, 10)}, # regularization\n", 2469 | " {'learning_rate': hp.loguniform('learning_rate', -7, 0)} # boosting\n", 2470 | "]\n", 2471 | "\n", 2472 | "for round in rounds:\n", 2473 | " params = {**params, **round}\n", 2474 | " trials = Trials()\n", 2475 | " best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(\n", 2476 | " space, X_train, y_train, X_test, y_test, metric=roc_auc_score),\n", 2477 | " space=params, \n", 2478 | " algo=tpe.suggest, \n", 2479 | " max_evals=40, \n", 2480 | " trials=trials,\n", 2481 | " )\n", 2482 | " params = {**params, **best}\n", 2483 | "```\n", 2484 | "\n", 2485 | "``` pycon\n", 2486 | ">>> xgb_def = xgb.XGBClassifier()\n", 2487 | ">>> xgb_def.fit(X_train, y_train)\n", 2488 | ">>> metrics.roc_auc_score(y_test, xgb_def.predict(X_test))\n", 2489 | "0.7451313573096131\n", 2490 | "```\n", 2491 | "\n", 2492 | "``` pycon\n", 2493 | ">>> # the values from above training\n", 2494 | ">>> params = {'random_state': 42,\n", 2495 | "... 'max_depth': 4,\n", 2496 | "... 'min_child_weight': 4.808561584650579,\n", 2497 | "... 'subsample': 0.9265505972233746,\n", 2498 | "... 'colsample_bytree': 0.9870944989347749,\n", 2499 | "... 'gamma': 0.1383762861356536,\n", 2500 | "... 'learning_rate': 0.13664139307301595}\n", 2501 | "```\n", 2502 | "\n", 2503 | "``` pycon\n", 2504 | ">>> xgb_tuned = xgb.XGBClassifier(**params, early_stopping_rounds=50,\n", 2505 | "... n_estimators=500)\n", 2506 | ">>> xgb_tuned.fit(X_train, y_train, eval_set=[(X_train, y_train), \n", 2507 | "... (X_test, y_test)], verbose=100)\n", 2508 | "[0] validation_0-logloss:0.66207 validation_1-logloss:0.66289\n", 2509 | "[100] validation_0-logloss:0.44945 validation_1-logloss:0.49416\n", 2510 | "[150] validation_0-logloss:0.43196 validation_1-logloss:0.49833\n", 2511 | "XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,\n", 2512 | " colsample_bylevel=1, colsample_bynode=1,\n", 2513 | " colsample_bytree=0.9870944989347749, early_stopping_rounds=50,\n", 2514 | " enable_categorical=False, eval_metric=None, feature_types=None,\n", 2515 | " gamma=0.1383762861356536, gpu_id=-1, grow_policy='depthwise',\n", 2516 | " importance_type=None, interaction_constraints='',\n", 2517 | " learning_rate=0.13664139307301595, max_bin=256,\n", 2518 | " max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0,\n", 2519 | " max_depth=4, max_leaves=0, min_child_weight=4.808561584650579,\n", 2520 | " missing=nan, monotone_constraints='()', n_estimators=500,\n", 2521 | " n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=42, ...)\n", 2522 | "```\n", 2523 | "\n", 2524 | "``` pycon\n", 2525 | ">>> metrics.roc_auc_score(y_test, xgb_tuned.predict(X_test))\n", 2526 | "0.7629510328319394\n", 2527 | "```\n", 2528 | "\n", 2529 | "### Summary\n", 2530 | "\n", 2531 | "### Exercises\n", 2532 | "\n", 2533 | "## Model Interpretation\n", 2534 | "\n", 2535 | "### Logistic Regression Interpretation\n", 2536 | "\n", 2537 | "``` pycon\n", 2538 | ">>> from sklearn import linear_model, preprocessing\n", 2539 | ">>> std = preprocessing.StandardScaler()\n", 2540 | ">>> lr = linear_model.LogisticRegression(penalty=None)\n", 2541 | ">>> lr.fit(std.fit_transform(X_train), y_train)\n", 2542 | ">>> lr.score(std.transform(X_test), y_test)\n", 2543 | "0.7337016574585635\n", 2544 | "```\n", 2545 | "\n", 2546 | "``` pycon\n", 2547 | ">>> lr.coef_\n", 2548 | "array([[-1.56018160e-01, -4.01817103e-01, 6.01542610e-01,\n", 2549 | " -1.45213121e-01, 
-8.13849902e-02, -6.03727624e-01,\n", 2550 | " 3.11683777e-02, 3.16120596e-02, -3.14510213e-02,\n", 2551 | " -4.59272439e-04, -8.21683100e-03, -5.27737710e-02,\n", 2552 | " -4.48524110e-03, 1.01853988e-01, 3.49376790e-01,\n", 2553 | " -1.79149729e-01, 2.41389081e-02, -3.37424750e-01]])\n", 2554 | "```\n", 2555 | "\n", 2556 | "``` python\n", 2557 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2558 | "(pd.Series(lr.coef_[0], index=X_train.columns)\n", 2559 | " .sort_values()\n", 2560 | " .plot.barh(ax=ax)\n", 2561 | ")\n", 2562 | "```\n", 2563 | "\n", 2564 | "### Decision Tree Interpretation\n", 2565 | "\n", 2566 | "``` pycon\n", 2567 | ">>> tree7 = tree.DecisionTreeClassifier(max_depth=7)\n", 2568 | ">>> tree7.fit(X_train, y_train)\n", 2569 | ">>> tree7.score(X_test, y_test)\n", 2570 | "0.7337016574585635\n", 2571 | "```\n", 2572 | "\n", 2573 | "``` python\n", 2574 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2575 | "(pd.Series(tree7.feature_importances_, index=X_train.columns)\n", 2576 | " .sort_values()\n", 2577 | " .plot.barh(ax=ax)\n", 2578 | ")\n", 2579 | "```\n", 2580 | "\n", 2581 | "``` python\n", 2582 | "import dtreeviz\n", 2583 | "dt3 = tree.DecisionTreeClassifier(max_depth=3)\n", 2584 | "dt3.fit(X_train, y_train)\n", 2585 | "\n", 2586 | "viz = dtreeviz.model(dt3, X_train=X_train, y_train=y_train, \n", 2587 | " feature_names=list(X_train.columns), target_name='Job',\n", 2588 | " class_names=['DS', 'SE'])\n", 2589 | "viz.view()\n", 2590 | "```\n", 2591 | "\n", 2592 | "### XGBoost Feature Importance\n", 2593 | "\n", 2594 | "``` python\n", 2595 | "xgb_def = xgb.XGBClassifier()\n", 2596 | "xgb_def.fit(X_train, y_train)\n", 2597 | "```\n", 2598 | "\n", 2599 | "``` python\n", 2600 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2601 | "(pd.Series(xgb_def.feature_importances_, index=X_train.columns)\n", 2602 | " .sort_values()\n", 2603 | " .plot.barh(ax=ax)\n", 2604 | ")\n", 2605 | "```\n", 2606 | "\n", 2607 | "``` python\n", 2608 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2609 | "xgb.plot_importance(xgb_def, importance_type='cover', ax=ax)\n", 2610 | "```\n", 2611 | "\n", 2612 | "### Surrogate Models\n", 2613 | "\n", 2614 | "``` python\n", 2615 | "from sklearn import tree\n", 2616 | "\n", 2617 | "sur_reg_sk = tree.DecisionTreeRegressor(max_depth=4)\n", 2618 | "sur_reg_sk.fit(X_train, xgb_def.predict_proba(X_train)[:,-1])\n", 2619 | "```\n", 2620 | "\n", 2621 | "``` python\n", 2622 | "```\n", 2623 | "\n", 2624 | "### Summary\n", 2625 | "\n", 2626 | "### Exercises\n", 2627 | "\n", 2628 | "## xgbfir (Feature Interactions Reshaped)\n", 2629 | "\n", 2630 | "### Feature Interactions\n", 2631 | "\n", 2632 | "### xgbfir\n", 2633 | "\n", 2634 | "``` python\n", 2635 | "import xgbfir\n", 2636 | "xgbfir.saveXgbFI(xgb_def, feature_names=X_train.columns, OutputXlsxFile='fir.xlsx')\n", 2637 | "```\n", 2638 | "\n", 2639 | "``` pycon\n", 2640 | ">>> fir = pd.read_excel('fir.xlsx')\n", 2641 | ">>> print(fir\n", 2642 | "... .sort_values(by='Average Rank')\n", 2643 | "... .head()\n", 2644 | "... .round(1)\n", 2645 | "... )\n", 2646 | " Interaction Gain FScore ... Average Rank Average Tree Index \\\n", 2647 | "2 r 517.8 84 ... 3.3 44.6 \n", 2648 | "0 years_exp 597.0 627 ... 4.5 45.1 \n", 2649 | "5 education 296.0 254 ... 4.5 45.2 \n", 2650 | "1 compensation 518.5 702 ... 4.8 47.5 \n", 2651 | "4 major_cs 327.1 96 ... 
5.5 48.9 \n", 2652 | "\n", 2653 | " Average Tree Depth \n", 2654 | "2 2.6 \n", 2655 | "0 3.7 \n", 2656 | "5 3.3 \n", 2657 | "1 3.7 \n", 2658 | "4 3.6 \n", 2659 | "\n", 2660 | "[5 rows x 16 columns]\n", 2661 | "```\n", 2662 | "\n", 2663 | "``` pycon\n", 2664 | ">>> print(pd.read_excel('fir.xlsx', sheet_name='Interaction Depth 1').iloc[:20]\n", 2665 | "... .sort_values(by='Average Rank') \n", 2666 | "... .head(10) \n", 2667 | "... .round(1) \n", 2668 | "... ) \n", 2669 | " Interaction Gain FScore wFScore Average wFScore \\\n", 2670 | "1 education|years_exp 523.8 106 14.8 0.1 \n", 2671 | "0 major_cs|r 1210.8 15 5.4 0.4 \n", 2672 | "6 compensation|education 207.2 103 18.8 0.2 \n", 2673 | "11 age|education 133.2 80 27.2 0.3 \n", 2674 | "3 major_cs|years_exp 441.3 36 4.8 0.1 \n", 2675 | "5 age|years_exp 316.3 216 43.9 0.2 \n", 2676 | "4 age|compensation 344.7 219 38.8 0.2 \n", 2677 | "15 major_stat|years_exp 97.7 32 6.7 0.2 \n", 2678 | "14 education|r 116.5 14 4.6 0.3 \n", 2679 | "18 age|age 90.5 66 24.7 0.4 \n", 2680 | "\n", 2681 | " Average Gain Expected Gain Gain Rank FScore Rank wFScore Rank \\\n", 2682 | "1 4.9 77.9 2 5 8 \n", 2683 | "0 80.7 607.6 1 45 20 \n", 2684 | "6 2.0 34.0 7 6 7 \n", 2685 | "11 1.7 25.6 12 8 4 \n", 2686 | "3 12.3 108.2 4 20 25 \n", 2687 | "5 1.5 44.0 6 3 1 \n", 2688 | "4 1.6 30.6 5 2 2 \n", 2689 | "15 3.1 20.4 16 25 15 \n", 2690 | "14 8.3 72.3 15 52 27 \n", 2691 | "18 1.4 16.6 19 11 6 \n", 2692 | "\n", 2693 | " Avg wFScore Rank Avg Gain Rank Expected Gain Rank Average Rank \\\n", 2694 | "1 43 8 3 11.5 \n", 2695 | "0 8 1 1 12.7 \n", 2696 | "6 32 25 9 14.3 \n", 2697 | "11 12 40 13 14.8 \n", 2698 | "3 46 3 2 16.7 \n", 2699 | "5 26 57 7 16.7 \n", 2700 | "4 34 48 11 17.0 \n", 2701 | "15 24 14 14 18.0 \n", 2702 | "14 13 5 4 19.3 \n", 2703 | "18 7 62 16 20.2 \n", 2704 | "\n", 2705 | " Average Tree Index Average Tree Depth \n", 2706 | "1 38.0 3.5 \n", 2707 | "0 12.3 1.6 \n", 2708 | "6 50.6 3.7 \n", 2709 | "11 38.8 3.6 \n", 2710 | "3 29.2 3.2 \n", 2711 | "5 45.6 3.9 \n", 2712 | "4 48.9 3.9 \n", 2713 | "15 25.5 3.1 \n", 2714 | "14 40.4 2.4 \n", 2715 | "18 48.0 3.6 \n", 2716 | "```\n", 2717 | "\n", 2718 | "``` python\n", 2719 | "(X_train\n", 2720 | " .assign(software_eng=y_train)\n", 2721 | " .corr(method='spearman')\n", 2722 | " .loc[:, ['education', 'years_exp', 'major_cs', 'r', 'compensation', 'age']]\n", 2723 | " .style\n", 2724 | " .background_gradient(cmap='RdBu', vmin=-1, vmax=1)\n", 2725 | " .format('{:.2f}')\n", 2726 | ")\n", 2727 | "```\n", 2728 | "\n", 2729 | "``` python\n", 2730 | "import seaborn as sns\n", 2731 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2732 | "sns.heatmap(X_train \n", 2733 | " .assign(software_eng=y_train)\n", 2734 | " .corr(method='spearman')\n", 2735 | " .loc[:, ['age','education', 'years_exp', 'compensation', 'r', \n", 2736 | " 'major_cs', 'software_eng']],\n", 2737 | " cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax\n", 2738 | ")\n", 2739 | "```\n", 2740 | "\n", 2741 | "``` python\n", 2742 | "import seaborn.objects as so\n", 2743 | "fig = plt.figure(figsize=(8, 4))\n", 2744 | "(so\n", 2745 | " .Plot(X_train.assign(software_eng=y_train), x='years_exp', y='education', \n", 2746 | " color='software_eng')\n", 2747 | " .add(so.Dots(alpha=.9, pointsize=2), so.Jitter(x=.7, y=1))\n", 2748 | " .add(so.Line(), so.PolyFit())\n", 2749 | " .scale(color='viridis')\n", 2750 | " .on(fig) # not required unless saving to image\n", 2751 | " .plot() # ditto\n", 2752 | ")\n", 2753 | "```\n", 2754 | "\n", 2755 | "``` pycon\n", 2756 | ">>> 
print(X_train\n", 2757 | "... .assign(software_eng=y_train)\n", 2758 | "... .groupby(['software_eng', 'r', 'major_cs'])\n", 2759 | "... .age\n", 2760 | "... .count()\n", 2761 | "... .unstack()\n", 2762 | "... .unstack()\n", 2763 | "... )\n", 2764 | "major_cs 0 1 \n", 2765 | "r 0 1 0 1\n", 2766 | "software_eng \n", 2767 | "0 410 390 243 110\n", 2768 | "1 308 53 523 73\n", 2769 | "```\n", 2770 | "\n", 2771 | "``` pycon\n", 2772 | ">>> both = (X_train\n", 2773 | "... .assign(software_eng=y_train)\n", 2774 | "... )\n", 2775 | ">>> print(pd.crosstab(index=both.software_eng, columns=[both.major_cs, both.r]))\n", 2776 | "major_cs 0 1 \n", 2777 | "r 0 1 0 1\n", 2778 | "software_eng \n", 2779 | "0 410 390 243 110\n", 2780 | "1 308 53 523 73\n", 2781 | "```\n", 2782 | "\n", 2783 | "``` python\n", 2784 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2785 | "grey = '#999999'\n", 2786 | "blue = '#16a2c6'\n", 2787 | "font = 'Roboto'\n", 2788 | "\n", 2789 | "data = (X_train\n", 2790 | " .assign(software_eng=y_train)\n", 2791 | " .groupby(['software_eng', 'r', 'major_cs'])\n", 2792 | " .age\n", 2793 | " .count()\n", 2794 | " .unstack()\n", 2795 | " .unstack())\n", 2796 | "\n", 2797 | "(data\n", 2798 | " .pipe(lambda adf: adf.iloc[:,-2:].plot(color=[grey,blue], linewidth=4, ax=ax, \n", 2799 | " legend=None) and adf)\n", 2800 | " .plot(color=[grey, blue, grey, blue], ax=ax, legend=None)\n", 2801 | ")\n", 2802 | "\n", 2803 | "ax.set_xticks([0, 1], ['Data Scientist', 'Software Engineer'], font=font, size=12, \n", 2804 | " weight=600)\n", 2805 | "ax.set_yticks([])\n", 2806 | "ax.set_xlabel('')\n", 2807 | "ax.text(x=0, y=.93, s=\"Count Data Scientist or Software Engineer by R/CS\", \n", 2808 | " transform=fig.transFigure, ha='left', font=font, fontsize=10, weight=1000)\n", 2809 | "ax.text(x=0, y=.83, s=\"(Studied CS) Thick lines\\n(R) Blue\", transform=fig.transFigure, \n", 2810 | " ha='left', font=font, fontsize=10, weight=300)\n", 2811 | "for side in 'left,top,right,bottom'.split(','):\n", 2812 | " ax.spines[side].set_visible(False) \n", 2813 | "# labels\n", 2814 | "for left,txt in zip(data.iloc[0], ['Other/No R', 'Other/R', 'CS/No R', 'CS/R']):\n", 2815 | " ax.text(x=-.02, y=left, s=f'{txt} ({left})', ha='right', va='center', \n", 2816 | " font=font, weight=300)\n", 2817 | "for right,txt in zip(data.iloc[1], ['Other/No R', 'Other/R', 'CS/No R', 'CS/R']):\n", 2818 | " ax.text(x=1.02, y=right, s=f'{txt} ({right})', ha='left', va='center', \n", 2819 | " font=font, weight=300)\n", 2820 | "```\n", 2821 | "\n", 2822 | "### Deeper Interactions\n", 2823 | "\n", 2824 | "``` pycon\n", 2825 | ">>> print(pd.read_excel('fir.xlsx', sheet_name='Interaction Depth 2').iloc[:20]\n", 2826 | "... .sort_values(by='Average Rank') \n", 2827 | "... .head(5) \n", 2828 | "... ) \n", 2829 | " Interaction Gain FScore ... Average Rank \\\n", 2830 | "0 major_cs|r|years_exp 1842.711375 17 ... 12.000000 \n", 2831 | "7 age|education|years_exp 267.537987 53 ... 15.666667 \n", 2832 | "13 age|compensation|education 154.313245 55 ... 15.833333 \n", 2833 | "2 compensation|education|years_exp 431.541357 91 ... 17.166667 \n", 2834 | "14 education|r|years_exp 145.534591 17 ... 
19.000000 \n", 2835 | "\n", 2836 | " Average Tree Index Average Tree Depth \n", 2837 | "0 2.588235 2.117647 \n", 2838 | "7 31.452830 3.981132 \n", 2839 | "13 47.381818 3.800000 \n", 2840 | "2 47.175824 4.010989 \n", 2841 | "14 34.352941 2.588235 \n", 2842 | "\n", 2843 | "[5 rows x 16 columns]\n", 2844 | "```\n", 2845 | "\n", 2846 | "### Specifying Feature Interactions\n", 2847 | "\n", 2848 | "``` python\n", 2849 | "constraints = [['education', 'years_exp'], ['major_cs', 'r'],\n", 2850 | " ['compensation', 'education'], ['age', 'education'],\n", 2851 | " ['major_cs', 'years_exp'], ['age', 'years_exp'],\n", 2852 | " ['age', 'compensation'], ['major_stat', 'years_exp'],\n", 2853 | "]\n", 2854 | "```\n", 2855 | "\n", 2856 | "``` python\n", 2857 | "def flatten(seq):\n", 2858 | " res = []\n", 2859 | " for sub in seq:\n", 2860 | " res.extend(sub)\n", 2861 | " return res\n", 2862 | "\n", 2863 | "\n", 2864 | "small_cols = sorted(set(flatten(constraints)))\n", 2865 | "```\n", 2866 | "\n", 2867 | "``` pycon\n", 2868 | ">>> print(small_cols)\n", 2869 | "['age', 'compensation', 'education', 'major_cs', 'major_stat', 'r', 'years_exp']\n", 2870 | "```\n", 2871 | "\n", 2872 | "``` pycon\n", 2873 | ">>> xg_constraints = xgb.XGBClassifier(interaction_constraints=constraints)\n", 2874 | ">>> xg_constraints.fit(X_train.loc[:, small_cols], y_train)\n", 2875 | ">>> xg_constraints.score(X_test.loc[:, small_cols], y_test)\n", 2876 | "\n", 2877 | "0.7259668508287292\n", 2878 | "```\n", 2879 | "\n", 2880 | "``` python\n", 2881 | "my_dot_export(xg_constraints, num_trees=0, filename='img/constrains0_xg.dot', \n", 2882 | " title='First Constrained Tree') \n", 2883 | "```\n", 2884 | "\n", 2885 | "### Summary\n", 2886 | "\n", 2887 | "### Exercises\n", 2888 | "\n", 2889 | "## Exploring SHAP\n", 2890 | "\n", 2891 | "### SHAP\n", 2892 | "\n", 2893 | "``` python\n", 2894 | "step_params = {'random_state': 42,\n", 2895 | " 'max_depth': 5,\n", 2896 | " 'min_child_weight': 0.6411044640540848,\n", 2897 | " 'subsample': 0.9492383155577023,\n", 2898 | " 'colsample_bytree': 0.6235721099295888,\n", 2899 | " 'gamma': 0.00011273797329538491,\n", 2900 | " 'learning_rate': 0.24399020050740935}\n", 2901 | "xg_step = xgb.XGBClassifier(**step_params, early_stopping_rounds=50,\n", 2902 | " n_estimators=500)\n", 2903 | "xg_step.fit(X_train, y_train,\n", 2904 | " eval_set=[(X_train, y_train),\n", 2905 | " (X_test, y_test)\n", 2906 | " ]\n", 2907 | " )\n", 2908 | "```\n", 2909 | "\n", 2910 | "``` python\n", 2911 | "import shap\n", 2912 | "shap.initjs()\n", 2913 | "\n", 2914 | "shap_ex = shap.TreeExplainer(xg_step)\n", 2915 | "vals = shap_ex(X_test)\n", 2916 | "```\n", 2917 | "\n", 2918 | "``` pycon\n", 2919 | ">>> shap_df = pd.DataFrame(vals.values, columns=X_test.columns)\n", 2920 | ">>> print(shap_df)\n", 2921 | " age education years_exp compensation python r \\\n", 2922 | "0 0.426614 0.390184 -0.246353 0.145825 -0.034680 0.379261 \n", 2923 | "1 0.011164 -0.131144 -0.292135 -0.014521 0.016003 -1.043464 \n", 2924 | "2 -0.218063 -0.140705 -0.411293 0.048281 0.424516 0.487451 \n", 2925 | "3 -0.015227 -0.299068 -0.426323 -0.205840 -0.125867 0.320594 \n", 2926 | "4 -0.468785 -0.200953 -0.230639 0.064272 0.021362 0.355619 \n", 2927 | ".. ... ... ... ... ... ... 
\n", 2928 | "900 0.268237 -0.112710 0.330096 -0.209942 0.012074 -1.144335 \n", 2929 | "901 0.154642 0.572190 -0.227121 0.448253 -0.057847 0.290381 \n", 2930 | "902 0.079129 -0.095771 1.136799 0.150705 0.133260 0.484103 \n", 2931 | "903 -0.206584 0.430074 -0.385100 -0.078808 -0.083052 -0.992487 \n", 2932 | "904 0.007351 0.589351 1.485712 0.056398 -0.047231 0.373149 \n", 2933 | "\n", 2934 | " sql Q1_Male Q1_Female Q1_Prefer not to say \\\n", 2935 | "0 -0.019017 0.004868 0.000877 0.002111 \n", 2936 | "1 0.020524 0.039019 0.047712 0.001010 \n", 2937 | "2 -0.098703 -0.004710 0.063545 0.000258 \n", 2938 | "3 -0.062712 0.019110 0.012257 0.002184 \n", 2939 | "4 -0.083344 -0.017202 0.002754 0.001432 \n", 2940 | ".. ... ... ... ... \n", 2941 | "900 -0.065815 0.028274 0.032291 0.001012 \n", 2942 | "901 -0.069114 0.006243 0.007443 0.002198 \n", 2943 | "902 -0.120819 0.012034 0.057516 0.000266 \n", 2944 | "903 -0.088811 0.080561 0.028648 0.000876 \n", 2945 | "904 -0.105290 0.029283 0.074762 0.001406 \n", 2946 | "\n", 2947 | " Q1_Prefer to self-describe Q3_United States of America Q3_India \\\n", 2948 | "0 0.0 0.033738 -0.117918 \n", 2949 | "1 0.0 0.068171 0.086444 \n", 2950 | "2 0.0 0.005533 -0.105534 \n", 2951 | "3 0.0 -0.000044 0.042814 \n", 2952 | "4 0.0 0.035772 -0.073206 \n", 2953 | ".. ... ... ... \n", 2954 | "900 0.0 -0.086408 0.136677 \n", 2955 | "901 0.0 -0.074364 0.115520 \n", 2956 | "902 0.0 0.103810 -0.097848 \n", 2957 | "903 0.0 0.045213 0.066553 \n", 2958 | "904 0.0 -0.031587 0.117050 \n", 2959 | "\n", 2960 | " Q3_China major_cs major_other major_eng major_stat \n", 2961 | "0 -0.018271 0.369876 0.014006 -0.013465 0.104177 \n", 2962 | "1 -0.026271 -0.428484 -0.064157 -0.026041 0.069931 \n", 2963 | "2 -0.010548 -0.333695 0.016919 -0.026932 -0.591922 \n", 2964 | "3 -0.024099 0.486864 0.038438 -0.013727 0.047564 \n", 2965 | "4 -0.022188 0.324419 0.012664 -0.019550 0.093926 \n", 2966 | ".. ... ... ... ... ... \n", 2967 | "900 0.310404 -0.407444 -0.013195 -0.026412 -0.484734 \n", 2968 | "901 -0.008244 0.602087 0.039680 -0.012820 0.083934 \n", 2969 | "902 0.003234 -0.313785 -0.080046 -0.066032 0.101975 \n", 2970 | "903 -0.031448 -0.524141 -0.048108 -0.007185 0.093196 \n", 2971 | "904 0.008734 -0.505613 -0.159411 -0.067388 0.126560 \n", 2972 | "\n", 2973 | "[905 rows x 18 columns]\n", 2974 | " age education years_exp compensation python r \\\n", 2975 | "0 0.426614 0.390184 -0.246353 0.145825 -0.034680 0.379261 \n", 2976 | "1 0.011164 -0.131144 -0.292135 -0.014521 0.016003 -1.043464 \n", 2977 | "2 -0.218063 -0.140705 -0.411293 0.048281 0.424516 0.487451 \n", 2978 | "3 -0.015227 -0.299068 -0.426323 -0.205840 -0.125867 0.320594 \n", 2979 | "4 -0.468785 -0.200953 -0.230639 0.064272 0.021362 0.355619 \n", 2980 | ".. ... ... ... ... ... ... \n", 2981 | "900 0.268237 -0.112710 0.330096 -0.209942 0.012074 -1.144335 \n", 2982 | "901 0.154642 0.572190 -0.227121 0.448253 -0.057847 0.290381 \n", 2983 | "902 0.079129 -0.095771 1.136799 0.150705 0.133260 0.484103 \n", 2984 | "903 -0.206584 0.430074 -0.385100 -0.078808 -0.083052 -0.992487 \n", 2985 | "904 0.007351 0.589351 1.485712 0.056398 -0.047231 0.373149 \n", 2986 | "\n", 2987 | " sql Q1_Male Q1_Female Q1_Prefer not to say \\\n", 2988 | "0 -0.019017 0.004868 0.000877 0.002111 \n", 2989 | "1 0.020524 0.039019 0.047712 0.001010 \n", 2990 | "2 -0.098703 -0.004710 0.063545 0.000258 \n", 2991 | "3 -0.062712 0.019110 0.012257 0.002184 \n", 2992 | "4 -0.083344 -0.017202 0.002754 0.001432 \n", 2993 | ".. ... ... ... ... 
\n", 2994 | "900 -0.065815 0.028274 0.032291 0.001012 \n", 2995 | "901 -0.069114 0.006243 0.007443 0.002198 \n", 2996 | "902 -0.120819 0.012034 0.057516 0.000266 \n", 2997 | "903 -0.088811 0.080561 0.028648 0.000876 \n", 2998 | "904 -0.105290 0.029283 0.074762 0.001406 \n", 2999 | "\n", 3000 | " Q1_Prefer to self-describe Q3_United States of America Q3_India \\\n", 3001 | "0 0.0 0.033738 -0.117918 \n", 3002 | "1 0.0 0.068171 0.086444 \n", 3003 | "2 0.0 0.005533 -0.105534 \n", 3004 | "3 0.0 -0.000044 0.042814 \n", 3005 | "4 0.0 0.035772 -0.073206 \n", 3006 | ".. ... ... ... \n", 3007 | "900 0.0 -0.086408 0.136677 \n", 3008 | "901 0.0 -0.074364 0.115520 \n", 3009 | "902 0.0 0.103810 -0.097848 \n", 3010 | "903 0.0 0.045213 0.066553 \n", 3011 | "904 0.0 -0.031587 0.117050 \n", 3012 | "\n", 3013 | " Q3_China major_cs major_other major_eng major_stat \n", 3014 | "0 -0.018271 0.369876 0.014006 -0.013465 0.104177 \n", 3015 | "1 -0.026271 -0.428484 -0.064157 -0.026041 0.069931 \n", 3016 | "2 -0.010548 -0.333695 0.016919 -0.026932 -0.591922 \n", 3017 | "3 -0.024099 0.486864 0.038438 -0.013727 0.047564 \n", 3018 | "4 -0.022188 0.324419 0.012664 -0.019550 0.093926 \n", 3019 | ".. ... ... ... ... ... \n", 3020 | "900 0.310404 -0.407444 -0.013195 -0.026412 -0.484734 \n", 3021 | "901 -0.008244 0.602087 0.039680 -0.012820 0.083934 \n", 3022 | "902 0.003234 -0.313785 -0.080046 -0.066032 0.101975 \n", 3023 | "903 -0.031448 -0.524141 -0.048108 -0.007185 0.093196 \n", 3024 | "904 0.008734 -0.505613 -0.159411 -0.067388 0.126560 \n", 3025 | "\n", 3026 | "[905 rows x 18 columns]\n", 3027 | "```\n", 3028 | "\n", 3029 | "``` pycon\n", 3030 | ">>> print(pd.concat([shap_df.sum(axis='columns')\n", 3031 | "... .rename('pred') + vals.base_values,\n", 3032 | "... pd.Series(y_test, name='true')], axis='columns')\n", 3033 | "... .assign(prob=lambda adf: (np.exp(adf.pred) / \n", 3034 | "... (1 + np.exp(adf.pred))))\n", 3035 | "... ) \n", 3036 | " pred true prob\n", 3037 | "0 1.204692 1 0.769358\n", 3038 | "1 -2.493559 0 0.076311\n", 3039 | "2 -2.205473 0 0.099260\n", 3040 | "3 -0.843847 1 0.300725\n", 3041 | "4 -0.168726 1 0.457918\n", 3042 | ".. ... ... ...\n", 3043 | "900 -1.698727 0 0.154632\n", 3044 | "901 1.957872 0 0.876302\n", 3045 | "902 0.786588 0 0.687098\n", 3046 | "903 -2.299702 0 0.091148\n", 3047 | "904 1.497035 1 0.817132\n", 3048 | "\n", 3049 | "[905 rows x 3 columns]\n", 3050 | "```\n", 3051 | "\n", 3052 | "### Examining a Single Prediction\n", 3053 | "\n", 3054 | "``` pycon\n", 3055 | ">>> X_test.iloc[0]\n", 3056 | "age 22.0\n", 3057 | "education 16.0\n", 3058 | "years_exp 1.0\n", 3059 | "compensation 0.0\n", 3060 | "python 1.0\n", 3061 | "r 0.0\n", 3062 | "sql 0.0\n", 3063 | "Q1_Male 1.0\n", 3064 | "Q1_Female 0.0\n", 3065 | "Q1_Prefer not to say 0.0\n", 3066 | "Q1_Prefer to self-describe 0.0\n", 3067 | "Q3_United States of America 0.0\n", 3068 | "Q3_India 1.0\n", 3069 | "Q3_China 0.0\n", 3070 | "major_cs 1.0\n", 3071 | "major_other 0.0\n", 3072 | "major_eng 0.0\n", 3073 | "major_stat 0.0\n", 3074 | "Name: 7894, dtype: float64\n", 3075 | "```\n", 3076 | "\n", 3077 | "``` pycon\n", 3078 | ">>> # predicts software engineer... 
why?\n", 3079 | ">>> xg_step.predict(X_test.iloc[[0]]) \n", 3080 | "array([1])\n", 3081 | "```\n", 3082 | "\n", 3083 | "``` pycon\n", 3084 | ">>> # ground truth\n", 3085 | ">>> y_test[0]\n", 3086 | "1\n", 3087 | "```\n", 3088 | "\n", 3089 | "``` pycon\n", 3090 | ">>> # Since this is below zero, the default is Data Scientist\n", 3091 | ">>> shap_ex.expected_value\n", 3092 | "-0.2166416\n", 3093 | "```\n", 3094 | "\n", 3095 | "``` pycon\n", 3096 | ">>> # > 0 therefore ... Software Engineer\n", 3097 | ">>> shap_ex.expected_value + vals.values[0].sum()\n", 3098 | "1.2046916\n", 3099 | "```\n", 3100 | "\n", 3101 | "### Waterfall Plots\n", 3102 | "\n", 3103 | "``` python\n", 3104 | "fig = plt.figure(figsize=(8, 4))\n", 3105 | "shap.plots.waterfall(vals[0], show=False)\n", 3106 | "```\n", 3107 | "\n", 3108 | "``` python\n", 3109 | "def plot_histograms(df, columns, row=None, title='', color='shap'):\n", 3110 | " \"\"\"\n", 3111 | " Parameters\n", 3112 | " ----------\n", 3113 | " df : pandas.DataFrame\n", 3114 | " The DataFrame to plot histograms for.\n", 3115 | " columns : list of str\n", 3116 | " The names of the columns to plot histograms for.\n", 3117 | " row : pandas.Series, optional\n", 3118 | " A row of data to plot a vertical line for.\n", 3119 | " title : str, optional\n", 3120 | " The title to use for the figure.\n", 3121 | " color : str, optional\n", 3122 | " 'shap' - color positive values red. Negative blue\n", 3123 | " 'mean' - above mean red. Below blue.\n", 3124 | " None - black\n", 3125 | "\n", 3126 | " Returns\n", 3127 | " -------\n", 3128 | " matplotlib.figure.Figure\n", 3129 | " The figure object containing the histogram plots. \n", 3130 | " \"\"\"\n", 3131 | " red = '#ff0051'\n", 3132 | " blue = '#008bfb'\n", 3133 | "\n", 3134 | " fig, ax = plt.subplots(figsize=(8, 4))\n", 3135 | " hist = (df\n", 3136 | " [columns]\n", 3137 | " .hist(ax=ax, color='#bbb')\n", 3138 | " )\n", 3139 | " fig = hist[0][0].get_figure()\n", 3140 | " if row is not None:\n", 3141 | " name2ax = {ax.get_title():ax for ax in fig.axes}\n", 3142 | " pos, neg = red, blue\n", 3143 | " if color is None:\n", 3144 | " pos, neg = 'black', 'black'\n", 3145 | " for column in columns:\n", 3146 | " if color == 'mean':\n", 3147 | " mid = df[column].mean()\n", 3148 | " else:\n", 3149 | " mid = 0\n", 3150 | " if row[column] > mid:\n", 3151 | " c = pos\n", 3152 | " else:\n", 3153 | " c = neg\n", 3154 | " name2ax[column].axvline(row[column], c=c)\n", 3155 | " fig.tight_layout()\n", 3156 | " fig.suptitle(title)\n", 3157 | " return fig \n", 3158 | "```\n", 3159 | "\n", 3160 | "``` python\n", 3161 | "features = ['education', 'r', 'major_cs', 'age', 'years_exp', \n", 3162 | " 'compensation']\n", 3163 | "fig = plot_histograms(shap_df, features, shap_df.iloc[0], \n", 3164 | " title='SHAP values for row 0')\n", 3165 | "```\n", 3166 | "\n", 3167 | "``` python\n", 3168 | "fig = plot_histograms(X_test, features, X_test.iloc[0], \n", 3169 | " title='Values for row 0', color='mean')\n", 3170 | "```\n", 3171 | "\n", 3172 | "``` python\n", 3173 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3174 | "(pd.Series(vals.values[0], index=X_test.columns)\n", 3175 | " .sort_values(key=np.abs)\n", 3176 | " .plot.barh(ax=ax)\n", 3177 | ")\n", 3178 | "```\n", 3179 | "\n", 3180 | "### A Force Plot\n", 3181 | "\n", 3182 | "``` python\n", 3183 | "# use matplotlib if having js issues\n", 3184 | "# blue - DS\n", 3185 | "# red - Software Engineer\n", 3186 | "# to save need both matplotlib=True, show=False\n", 3187 | "res = 
shap.plots.force(base_value=vals.base_values, \n", 3188 | " shap_values=vals.values[0,:], features=X_test.iloc[0], \n", 3189 | " matplotlib=True, show=False\n", 3190 | ")\n", 3191 | "res.savefig('img/shap_forceplot0.png', dpi=600, bbox_inches='tight')\n", 3192 | "```\n", 3193 | "\n", 3194 | "### Force Plot with Multiple Predictions\n", 3195 | "\n", 3196 | "``` python\n", 3197 | "# First n values\n", 3198 | "n = 100\n", 3199 | "# blue - DS\n", 3200 | "# red - Software Engineer\n", 3201 | "shap.plots.force(base_value=vals.base_values, \n", 3202 | " shap_values=vals.values[:n,:], features=X_test.iloc[:n], \n", 3203 | " )\n", 3204 | "```\n", 3205 | "\n", 3206 | "### Understanding Features with Dependence Plots\n", 3207 | "\n", 3208 | "``` python\n", 3209 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3210 | "shap.plots.scatter(vals[:, 'education'], ax=ax, color=vals, \n", 3211 | " x_jitter=0, hist=False)\n", 3212 | "```\n", 3213 | "\n", 3214 | "### Jittering a Dependence Plot\n", 3215 | "\n", 3216 | "``` python\n", 3217 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3218 | "shap.plots.scatter(vals[:, 'education'], ax=ax, color=vals[:, 'years_exp'], x_jitter=1,\n", 3219 | " alpha=.5)\n", 3220 | "```\n", 3221 | "\n", 3222 | "``` python\n", 3223 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3224 | "shap.plots.scatter(vals[:, 'major_cs'], ax=ax, color=vals[:, 'r'], alpha=.5)\n", 3225 | "```\n", 3226 | "\n", 3227 | "### Heatmaps and Correlations\n", 3228 | "\n", 3229 | "``` python\n", 3230 | "import seaborn as sns\n", 3231 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3232 | "sns.heatmap(X_test \n", 3233 | " .assign(software_eng=y_test)\n", 3234 | " .corr(method='spearman')\n", 3235 | " .loc[:, ['age', 'education', 'years_exp',\n", 3236 | " 'compensation', 'r', 'major_cs', \n", 3237 | " 'software_eng']],\n", 3238 | " cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax\n", 3239 | ")\n", 3240 | "```\n", 3241 | "\n", 3242 | "``` python\n", 3243 | "import seaborn as sns\n", 3244 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3245 | "sns.heatmap(shap_df \n", 3246 | " .assign(software_eng=y_test)\n", 3247 | " .corr(method='spearman')\n", 3248 | " .loc[:, ['age', 'education', 'years_exp', 'compensation', 'r', 'major_cs',\n", 3249 | " 'software_eng']],\n", 3250 | " cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax\n", 3251 | ")\n", 3252 | "```\n", 3253 | "\n", 3254 | "### Beeswarm Plots of Global Behavior\n", 3255 | "\n", 3256 | "``` python\n", 3257 | "fig = plt.figure(figsize=(8, 4))\n", 3258 | "shap.plots.beeswarm(vals)\n", 3259 | "```\n", 3260 | "\n", 3261 | "``` python\n", 3262 | "from matplotlib import cm\n", 3263 | "fig = plt.figure(figsize=(8, 4))\n", 3264 | "shap.plots.beeswarm(vals, max_display=len(X_test.columns), color=cm.autumn_r)\n", 3265 | "```\n", 3266 | "\n", 3267 | "### SHAP with No Interaction\n", 3268 | "\n", 3269 | "``` python\n", 3270 | "no_int_params = {'random_state': 42,\n", 3271 | " 'max_depth': 1\n", 3272 | "}\n", 3273 | "xg_no_int = xgb.XGBClassifier(**no_int_params, early_stopping_rounds=50,\n", 3274 | " n_estimators=500)\n", 3275 | "xg_no_int.fit(X_train, y_train,\n", 3276 | " eval_set=[(X_train, y_train),\n", 3277 | " (X_test, y_test)\n", 3278 | " ]\n", 3279 | ")\n", 3280 | "```\n", 3281 | "\n", 3282 | "``` pycon\n", 3283 | ">>> xg_no_int.score(X_test, y_test)\n", 3284 | "0.7370165745856354\n", 3285 | "```\n", 3286 | "\n", 3287 | "``` python\n", 3288 | "shap_ind = shap.TreeExplainer(xg_no_int)\n", 3289 | "shap_ind_vals = shap_ind(X_test)\n", 3290 | "```\n", 3291 | 
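"\n",
"A quick sanity check (a sketch, not from the book): because `xg_no_int` uses `max_depth=1`,\n",
"every tree is a stump that splits on a single feature, so the model is purely additive and a\n",
"feature's SHAP value should depend only on that feature's own value. Grouping the SHAP values\n",
"for `years_exp` by the raw `years_exp` values should therefore show a single SHAP value per\n",
"feature value (the `check` name below is ours, not the book's).\n",
"\n",
"``` python\n",
"# Hedged sketch: confirm the no-interaction model is additive by checking that\n",
"# rows sharing a value of years_exp also share its SHAP value.\n",
"import pandas as pd\n",
"\n",
"col = 'years_exp'\n",
"check = (pd.DataFrame({col: X_test[col].to_numpy(),\n",
"                       'shap': shap_ind_vals[:, col].values})\n",
"         .groupby(col)\n",
"         .shap\n",
"         .nunique())\n",
"print(check)  # expect 1 unique SHAP value per years_exp value\n",
"```\n",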
"\n", 3292 | "``` python\n", 3293 | "from matplotlib import cm\n", 3294 | "fig = plt.figure(figsize=(8, 4))\n", 3295 | "shap.plots.beeswarm(shap_ind_vals, max_display=len(X_test.columns))\n", 3296 | "```\n", 3297 | "\n", 3298 | "``` python\n", 3299 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3300 | "shap.plots.scatter(vals[:, 'years_exp'], ax=ax, \n", 3301 | " color=vals[:, 'age'], alpha=.5,\n", 3302 | " x_jitter=1)\n", 3303 | "```\n", 3304 | "\n", 3305 | "``` python\n", 3306 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3307 | "shap.plots.scatter(shap_ind_vals[:, 'years_exp'], ax=ax,\n", 3308 | " color=shap_ind_vals[:, 'age'], alpha=.5,\n", 3309 | " x_jitter=1)\n", 3310 | "```\n", 3311 | "\n", 3312 | "### Summary\n", 3313 | "\n", 3314 | "### Exercises\n", 3315 | "\n", 3316 | "## Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration\n", 3317 | "\n", 3318 | "### ICE Plots\n", 3319 | "\n", 3320 | "``` python\n", 3321 | "xgb_def = xgb.XGBClassifier(random_state=42)\n", 3322 | "xgb_def.fit(X_train, y_train)\n", 3323 | "xgb_def.score(X_test, y_test)\n", 3324 | "```\n", 3325 | "\n", 3326 | "``` python\n", 3327 | "from sklearn.inspection import PartialDependenceDisplay\n", 3328 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3329 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],\n", 3330 | " kind='individual', ax=axes)\n", 3331 | "```\n", 3332 | "\n", 3333 | "``` python\n", 3334 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3335 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],\n", 3336 | " centered=True,\n", 3337 | " kind='individual', ax=axes)\n", 3338 | "```\n", 3339 | "\n", 3340 | "``` python\n", 3341 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3342 | "ax_h0 = axes[0].twinx()\n", 3343 | "ax_h0.hist(X_train.r, zorder=0)\n", 3344 | "\n", 3345 | "ax_h1 = axes[1].twinx()\n", 3346 | "ax_h1.hist(X_train.education, zorder=0)\n", 3347 | "\n", 3348 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],\n", 3349 | " centered=True,\n", 3350 | " ice_lines_kw={'zorder':10},\n", 3351 | " kind='individual', ax=axes)\n", 3352 | "fig.tight_layout()\n", 3353 | "```\n", 3354 | "\n", 3355 | "``` python\n", 3356 | "def quantile_ice(clf, X, col, center=True, q=10, color='k', alpha=.5, legend=True,\n", 3357 | " add_hist=False, title='', val_limit=10, ax=None):\n", 3358 | " \"\"\"\n", 3359 | " Generate an ICE plot for a binary classifier's predicted probabilities split \n", 3360 | " by quantiles.\n", 3361 | "\n", 3362 | " Parameters:\n", 3363 | " ----------\n", 3364 | " clf : binary classifier\n", 3365 | " A binary classifier with a `predict_proba` method.\n", 3366 | " X : DataFrame\n", 3367 | " Feature matrix to predict on with shape (n_samples, n_features).\n", 3368 | " col : str\n", 3369 | " Name of column in `X` to plot against the quantiles of predicted probabilities.\n", 3370 | " center : bool, default=True\n", 3371 | " Whether to center the plot on 0.5.\n", 3372 | " q : int, default=10\n", 3373 | " Number of quantiles to split the predicted probabilities into.\n", 3374 | " color : str or array-like, default='k'\n", 3375 | " Color(s) of the lines in the plot.\n", 3376 | " alpha : float, default=0.5\n", 3377 | " Opacity of the lines in the plot.\n", 3378 | " legend : bool, default=True\n", 3379 | " Whether to show the plot legend.\n", 3380 | " add_hist : bool, default=False\n", 3381 | " Whether to add a histogram of the `col` variable to 
the plot.\n", 3382 | " title : str, default=''\n", 3383 | " Title of the plot.\n", 3384 | " val_limit : num, default=10\n", 3385 | " Maximum number of values to test for col.\n", 3386 | " ax : Matplotlib Axis, deafault=None\n", 3387 | " Axis to plot on.\n", 3388 | "\n", 3389 | " Returns:\n", 3390 | " -------\n", 3391 | " results : DataFrame\n", 3392 | " A DataFrame with the same columns as `X`, as well as a `prob` column with \n", 3393 | " the predicted probabilities of `clf` for each row in `X`, and a `group` \n", 3394 | " column indicating which quantile group the row belongs to.\n", 3395 | " \"\"\" \n", 3396 | " probs = clf.predict_proba(X)\n", 3397 | " df = (X\n", 3398 | " .assign(probs=probs[:,-1],\n", 3399 | " p_bin=lambda df_:pd.qcut(df_.probs, q=q, \n", 3400 | " labels=[f'q{n}' for n in range(1,q+1)])\n", 3401 | " )\n", 3402 | " )\n", 3403 | " groups = df.groupby('p_bin')\n", 3404 | "\n", 3405 | " vals = X.loc[:,col].unique()\n", 3406 | " if len(vals) > val_limit:\n", 3407 | " vals = np.linspace(min(vals), max(vals), num=val_limit)\n", 3408 | " res = []\n", 3409 | " for name,g in groups:\n", 3410 | " for val in vals:\n", 3411 | " this_X = g.loc[:,X.columns].assign(**{col:val})\n", 3412 | " q_prob = clf.predict_proba(this_X)[:,-1]\n", 3413 | " res.append(this_X.assign(prob=q_prob, group=name))\n", 3414 | " results = pd.concat(res, axis='index') \n", 3415 | " if ax is None:\n", 3416 | " fig, ax = plt.subplots(figsize=(8,4))\n", 3417 | " if add_hist:\n", 3418 | " back_ax = ax.twinx()\n", 3419 | " back_ax.hist(X[col], density=True, alpha=.2) \n", 3420 | " for name, g in results.groupby('group'):\n", 3421 | " g.groupby(col).prob.mean().plot(ax=ax, label=name, color=color, alpha=alpha)\n", 3422 | " if legend:\n", 3423 | " ax.legend()\n", 3424 | " if title:\n", 3425 | " ax.set_title(title)\n", 3426 | " return results\n", 3427 | "```\n", 3428 | "\n", 3429 | "``` python\n", 3430 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3431 | "quantile_ice(xgb_def, X_train, 'education', q=10, legend=False, add_hist=True, ax=ax,\n", 3432 | " title='ICE plot for Age')\n", 3433 | "```\n", 3434 | "\n", 3435 | "### ICE Plots with SHAP\n", 3436 | "\n", 3437 | "``` python\n", 3438 | "import shap\n", 3439 | "\n", 3440 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3441 | " \n", 3442 | "shap.plots.partial_dependence_plot(ind='education', \n", 3443 | " model=lambda rows: xgb_def.predict_proba(rows)[:,-1],\n", 3444 | " data=X_train.iloc[0:1000], ice=True, \n", 3445 | " npoints=(X_train.education.nunique()),\n", 3446 | " pd_linewidth=0, show=False, ax=ax)\n", 3447 | "ax.set_title('ICE plot (from SHAP)')\n", 3448 | "```\n", 3449 | "\n", 3450 | "### Partial Dependence Plots\n", 3451 | "\n", 3452 | "``` python\n", 3453 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3454 | "\n", 3455 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],\n", 3456 | " kind='average', ax=axes)\n", 3457 | "fig.tight_layout()\n", 3458 | "```\n", 3459 | "\n", 3460 | "``` python\n", 3461 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3462 | "\n", 3463 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],\n", 3464 | " centered=True, kind='both',\n", 3465 | " ax=axes)\n", 3466 | "fig.tight_layout()\n", 3467 | "```\n", 3468 | "\n", 3469 | "``` python\n", 3470 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3471 | "\n", 3472 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['years_exp', 'Q1_Male'],\n", 3473 | " centered=True, 
kind='both',\n", 3474 | " ax=axes)\n", 3475 | "fig.tight_layout()\n", 3476 | "```\n", 3477 | "\n", 3478 | "### PDP with SHAP\n", 3479 | "\n", 3480 | "``` python\n", 3481 | "import shap\n", 3482 | "\n", 3483 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3484 | "\n", 3485 | "col = 'years_exp' \n", 3486 | "shap.plots.partial_dependence_plot(ind=col,\n", 3487 | " model=lambda rows: xgb_def.predict_proba(rows)[:,-1],\n", 3488 | " data=X_train.iloc[0:1000], ice=False, \n", 3489 | " npoints=(X_train[col].nunique()),\n", 3490 | " pd_linewidth=2, show=False, ax=ax)\n", 3491 | "ax.set_title('PDP plot (from SHAP)')\n", 3492 | "```\n", 3493 | "\n", 3494 | "``` python\n", 3495 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3496 | "\n", 3497 | "col = 'years_exp' \n", 3498 | "shap.plots.partial_dependence_plot(ind=col, \n", 3499 | " model=lambda rows: xgb_def.predict_proba(rows)[:,-1],\n", 3500 | " data=X_train.iloc[0:1000], ice=True, \n", 3501 | " npoints=(X_train[col].nunique()),\n", 3502 | " model_expected_value=True,\n", 3503 | " feature_expected_value=True,\n", 3504 | " pd_linewidth=2, show=False, ax=ax)\n", 3505 | "ax.set_title('PDP plot (from SHAP) with ICE Plots')\n", 3506 | "```\n", 3507 | "\n", 3508 | "### Monotonic Constraints\n", 3509 | "\n", 3510 | "``` python\n", 3511 | "\n", 3512 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3513 | "\n", 3514 | "(X_test\n", 3515 | " .assign(target=y_test)\n", 3516 | " .corr(method='spearman')\n", 3517 | " .iloc[:-1]\n", 3518 | " .loc[:,'target']\n", 3519 | " .sort_values(key=np.abs)\n", 3520 | " .plot.barh(title='Spearman Correlation with Target', ax=ax)\n", 3521 | ")\n", 3522 | "```\n", 3523 | "\n", 3524 | "``` pycon\n", 3525 | ">>> print(X_train\n", 3526 | "... .assign(target=y_train)\n", 3527 | "... .groupby('education')\n", 3528 | "... .mean()\n", 3529 | "... .loc[:, ['age', 'years_exp', 'target']]\n", 3530 | "... )\n", 3531 | "\n", 3532 | " age years_exp target\n", 3533 | "education \n", 3534 | "12.0 30.428571 2.857143 0.714286\n", 3535 | "13.0 30.369565 6.760870 0.652174\n", 3536 | "16.0 25.720867 2.849593 0.605691\n", 3537 | "18.0 28.913628 3.225528 0.393474\n", 3538 | "19.0 27.642857 4.166667 0.571429\n", 3539 | "20.0 35.310638 4.834043 0.174468\n", 3540 | "```\n", 3541 | "\n", 3542 | "``` pycon\n", 3543 | ">>> X_train.education.value_counts()\n", 3544 | "18.0 1042\n", 3545 | "16.0 738\n", 3546 | "20.0 235\n", 3547 | "13.0 46\n", 3548 | "19.0 42\n", 3549 | "12.0 7\n", 3550 | "Name: education, dtype: int64\n", 3551 | "```\n", 3552 | "\n", 3553 | "``` pycon\n", 3554 | ">>> print(raw\n", 3555 | "... .query('Q3.isin([\"United States of America\", \"China\", \"India\"]) '\n", 3556 | "... 'and Q6.isin([\"Data Scientist\", \"Software Engineer\"])') \n", 3557 | "... .query('Q4 == \"Professional degree\"')\n", 3558 | "... .pipe(lambda df_:pd.crosstab(index=df_.Q5, columns=df_.Q6))\n", 3559 | "... )\n", 3560 | " \n", 3561 | "Q6 Data Scientist \\\n", 3562 | "Q5 \n", 3563 | "A business discipline (accounting, economics, f... 0 \n", 3564 | "Computer science (software engineering, etc.) 12 \n", 3565 | "Engineering (non-computer focused) 6 \n", 3566 | "Humanities (history, literature, philosophy, etc.) 2 \n", 3567 | "I never declared a major 0 \n", 3568 | "Mathematics or statistics 2 \n", 3569 | "Other 2 \n", 3570 | "Physics or astronomy 2 \n", 3571 | "\n", 3572 | "Q6 Software Engineer \n", 3573 | "Q5 \n", 3574 | "A business discipline (accounting, economics, f... 1 \n", 3575 | "Computer science (software engineering, etc.) 
19 \n", 3576 | "Engineering (non-computer focused) 10 \n", 3577 | "Humanities (history, literature, philosophy, etc.) 0 \n", 3578 | "I never declared a major 1 \n", 3579 | "Mathematics or statistics 1 \n", 3580 | "Other 1 \n", 3581 | "Physics or astronomy 1 \n", 3582 | "```\n", 3583 | "\n", 3584 | "``` python\n", 3585 | "xgb_const = xgb.XGBClassifier(random_state=42,\n", 3586 | " monotone_constraints={'years_exp':1, 'education':-1})\n", 3587 | "xgb_const.fit(X_train, y_train)\n", 3588 | "xgb_const.score(X_test, y_test)\n", 3589 | "```\n", 3590 | "\n", 3591 | "``` python\n", 3592 | "small_cols = ['age', 'education', 'years_exp', 'compensation', 'python', 'r', 'sql',\n", 3593 | " #'Q1_Male', 'Q1_Female', 'Q1_Prefer not to say',\n", 3594 | " #'Q1_Prefer to self-describe', \n", 3595 | " 'Q3_United States of America', 'Q3_India',\n", 3596 | " 'Q3_China', 'major_cs', 'major_other', 'major_eng', 'major_stat']\n", 3597 | "xgb_const2 = xgb.XGBClassifier(random_state=42,\n", 3598 | " monotone_constraints={'years_exp':1, 'education':-1})\n", 3599 | "xgb_const2.fit(X_train[small_cols], y_train)\n", 3600 | "```\n", 3601 | "\n", 3602 | "``` pycon\n", 3603 | ">>> xgb_const2.score(X_test[small_cols], y_test)\n", 3604 | "0.7569060773480663\n", 3605 | "```\n", 3606 | "\n", 3607 | "``` python\n", 3608 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3609 | "(pd.Series(xgb_def.feature_importances_, index=X_train.columns)\n", 3610 | " .sort_values()\n", 3611 | " .plot.barh(ax=ax)\n", 3612 | ")\n", 3613 | "```\n", 3614 | "\n", 3615 | "``` python\n", 3616 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3617 | "(pd.Series(xgb_const2.feature_importances_, index=small_cols)\n", 3618 | " .sort_values()\n", 3619 | " .plot.barh(ax=ax)\n", 3620 | ")\n", 3621 | "```\n", 3622 | "\n", 3623 | "### Calibrating a Model\n", 3624 | "\n", 3625 | "``` python\n", 3626 | "from sklearn.calibration import CalibratedClassifierCV\n", 3627 | "\n", 3628 | "xgb_cal = CalibratedClassifierCV(xgb_def, method='sigmoid', cv='prefit')\n", 3629 | "xgb_cal.fit(X_test, y_test)\n", 3630 | "\n", 3631 | "xgb_cal_iso = CalibratedClassifierCV(xgb_def, method='isotonic', cv='prefit')\n", 3632 | "xgb_cal_iso.fit(X_test, y_test)\n", 3633 | "```\n", 3634 | "\n", 3635 | "### Calibration Curves\n", 3636 | "\n", 3637 | "``` python\n", 3638 | "from sklearn.calibration import CalibrationDisplay\n", 3639 | "from matplotlib.gridspec import GridSpec\n", 3640 | "fig = plt.figure(figsize=(8,6))\n", 3641 | "gs = GridSpec(4, 3)\n", 3642 | "axes = fig.add_subplot(gs[:2, :3])\n", 3643 | "display = CalibrationDisplay.from_estimator(xgb_def, X_test, y_test, \n", 3644 | " n_bins=10, ax=axes)\n", 3645 | "disp_cal = CalibrationDisplay.from_estimator(xgb_cal, X_test, y_test, \n", 3646 | " n_bins=10,ax=axes, name='sigmoid')\n", 3647 | "disp_cal_iso = CalibrationDisplay.from_estimator(xgb_cal_iso, X_test, y_test, \n", 3648 | " n_bins=10, ax=axes, name='isotonic')\n", 3649 | "row = 2\n", 3650 | "col = 0\n", 3651 | "ax = fig.add_subplot(gs[row, col])\n", 3652 | "ax.hist(display.y_prob, range=(0,1), bins=20)\n", 3653 | "ax.set(title='Default', xlabel='Predicted Prob')\n", 3654 | "ax2 = fig.add_subplot(gs[row, 1])\n", 3655 | "ax2.hist(disp_cal.y_prob, range=(0,1), bins=20)\n", 3656 | "ax2.set(title='Sigmoid', xlabel='Predicted Prob')\n", 3657 | "ax3 = fig.add_subplot(gs[row, 2])\n", 3658 | "ax3.hist(disp_cal_iso.y_prob, range=(0,1), bins=20)\n", 3659 | "ax3.set(title='Isotonic', xlabel='Predicted Prob')\n", 3660 | "fig.tight_layout()\n", 3661 | "```\n", 3662 | "\n", 3663 | "``` pycon\n", 
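">>> # accuracy of the sigmoid-calibrated model (isotonic and the uncalibrated default follow)\n",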
3664 | ">>> xgb_cal.score(X_test, y_test)\n", 3665 | "0.7480662983425415\n", 3666 | "```\n", 3667 | "\n", 3668 | "``` pycon\n", 3669 | ">>> xgb_cal_iso.score(X_test, y_test)\n", 3670 | "0.7491712707182321\n", 3671 | "```\n", 3672 | "\n", 3673 | "``` pycon\n", 3674 | ">>> xgb_def.score(X_test, y_test)\n", 3675 | "0.7458563535911602\n", 3676 | "```\n", 3677 | "\n", 3678 | "### Summary\n", 3679 | "\n", 3680 | "### Exercises\n", 3681 | "\n", 3682 | "## Serving Models with MLFlow\n", 3683 | "\n", 3684 | "### Installation and Setup\n", 3685 | "\n", 3686 | "``` python\n", 3687 | "%matplotlib inline\n", 3688 | "\n", 3689 | "from feature_engine import encoding, imputation\n", 3690 | "from hyperopt import fmin, tpe, hp, STATUS_OK, Trials\n", 3691 | "import matplotlib.pyplot as plt\n", 3692 | "import mlflow\n", 3693 | "import numpy as np\n", 3694 | "import pandas as pd\n", 3695 | "from sklearn import base, metrics, model_selection, \\\n", 3696 | " pipeline, preprocessing\n", 3697 | "from sklearn.metrics import accuracy_score, roc_auc_score \n", 3698 | "import xgboost as xgb\n", 3699 | "\n", 3700 | "\n", 3701 | "import urllib\n", 3702 | "import zipfile\n", 3703 | "```\n", 3704 | "\n", 3705 | "``` python\n", 3706 | "import pandas as pd\n", 3707 | "from sklearn import model_selection, preprocessing\n", 3708 | "import xg_helpers as xhelp\n", 3709 | "\n", 3710 | "\n", 3711 | "url = 'https://github.com/mattharrison/datasets/raw/master/data/'\\\n", 3712 | " 'kaggle-survey-2018.zip'\n", 3713 | "fname = 'kaggle-survey-2018.zip'\n", 3714 | "member_name = 'multipleChoiceResponses.csv'\n", 3715 | "\n", 3716 | "raw = xhelp.extract_zip(url, fname, member_name)\n", 3717 | "## Create raw X and raw y\n", 3718 | "kag_X, kag_y = xhelp.get_rawX_y(raw, 'Q6')\n", 3719 | " \n", 3720 | "## Split data \n", 3721 | "kag_X_train, kag_X_test, kag_y_train, kag_y_test = \\\n", 3722 | " model_selection.train_test_split(\n", 3723 | " kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y) \n", 3724 | "\n", 3725 | "## Transform X with pipeline\n", 3726 | "X_train = xhelp.kag_pl.fit_transform(kag_X_train)\n", 3727 | "X_test = xhelp.kag_pl.transform(kag_X_test)\n", 3728 | "\n", 3729 | "## Transform y with label encoder\n", 3730 | "label_encoder = preprocessing.LabelEncoder()\n", 3731 | "label_encoder.fit(kag_y_train)\n", 3732 | "y_train = label_encoder.transform(kag_y_train)\n", 3733 | "y_test = label_encoder.transform(kag_y_test)\n", 3734 | "\n", 3735 | "# Combined Data for cross validation/etc\n", 3736 | "X = pd.concat([X_train, X_test], axis='index')\n", 3737 | "y = pd.Series([*y_train, *y_test], index=X.index)\n", 3738 | "```\n", 3739 | "\n", 3740 | "``` python\n", 3741 | "from hyperopt import fmin, tpe, hp, STATUS_OK, Trials\n", 3742 | "import mlflow\n", 3743 | "from sklearn import metrics\n", 3744 | "import xgboost as xgb\n", 3745 | "\n", 3746 | "ex_id = mlflow.create_experiment(name='ex3', artifact_location='ex2path')\n", 3747 | "mlflow.set_experiment(experiment_name='ex3')\n", 3748 | "with mlflow.start_run():\n", 3749 | " params = {'random_state': 42}\n", 3750 | " rounds = [{'max_depth': hp.quniform('max_depth', 1, 12, 1), # tree\n", 3751 | " 'min_child_weight': hp.loguniform('min_child_weight', -2, 3)},\n", 3752 | " {'subsample': hp.uniform('subsample', 0.5, 1), # stochastic\n", 3753 | " 'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)},\n", 3754 | " {'gamma': hp.loguniform('gamma', -10, 10)}, # regularization\n", 3755 | " {'learning_rate': hp.loguniform('learning_rate', -7, 0)} # boosting\n", 3756 | " ]\n", 
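"\n",
"    # Tune in stages: each round below merges one group of related search\n",
"    # spaces (tree shape, then sampling, then regularization, then learning\n",
"    # rate) into params, runs hyperopt on it, and folds the best values\n",
"    # found back into params before the next round starts.\n",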
3757 | "\n", 3758 | " for round in rounds:\n", 3759 | " params = {**params, **round}\n", 3760 | " trials = Trials()\n", 3761 | " best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(\n", 3762 | " space, X_train, y_train, X_test, y_test), \n", 3763 | " space=params, \n", 3764 | " algo=tpe.suggest, \n", 3765 | " max_evals=10, \n", 3766 | " trials=trials,\n", 3767 | " timeout=60*5 # 5 minutes\n", 3768 | " )\n", 3769 | " params = {**params, **best}\n", 3770 | " for param, val in params.items():\n", 3771 | " mlflow.log_param(param, val)\n", 3772 | " params['max_depth'] = int(params['max_depth'])\n", 3773 | " xg = xgb.XGBClassifier(eval_metric='logloss', early_stopping_rounds=50, **params)\n", 3774 | " xg.fit(X_train, y_train,\n", 3775 | " eval_set=[(X_train, y_train),\n", 3776 | " (X_test, y_test)\n", 3777 | " ]\n", 3778 | " ) \n", 3779 | " for metric in [metrics.accuracy_score, metrics.precision_score, metrics.recall_score, \n", 3780 | " metrics.f1_score]:\n", 3781 | " mlflow.log_metric(metric.__name__, metric(y_test, xg.predict(X_test)))\n", 3782 | "\n", 3783 | " model_info = mlflow.xgboost.log_model(xg, artifact_path='model')\n", 3784 | " \n", 3785 | "```\n", 3786 | "\n", 3787 | "``` pycon\n", 3788 | ">>> ex_id\n", 3789 | "'172212630951564101'\n", 3790 | "```\n", 3791 | "\n", 3792 | "``` pycon\n", 3793 | ">>> model_info.run_id\n", 3794 | "'263b3e793f584251a4e4cd1a2d494110'\n", 3795 | "```\n", 3796 | "\n", 3797 | "### Inspecting Model Artifacts\n", 3798 | "\n", 3799 | "### Running A Model From Code\n", 3800 | "\n", 3801 | "``` python\n", 3802 | "import mlflow\n", 3803 | "logged_model = 'runs:/ecc05fedb5c942598741816a1c6d76e2/model'\n", 3804 | "\n", 3805 | "# Load model as a PyFuncModel.\n", 3806 | "loaded_model = mlflow.pyfunc.load_model(logged_model)\n", 3807 | "```\n", 3808 | "\n", 3809 | "``` pycon\n", 3810 | ">>> loaded_model.predict(X_test.iloc[[0]])\n", 3811 | "array([1])\n", 3812 | "```\n", 3813 | "\n", 3814 | "### Serving Predictions\n", 3815 | "\n", 3816 | "### Querying from the Command Line\n", 3817 | "\n", 3818 | "``` pycon\n", 3819 | ">>> X_test.head(2).to_json(orient='split', index=False)\n", 3820 | "'{\"columns\":[\"age\",\"education\",\"years_exp\",\"compensation\",\n", 3821 | "\"python\",\"r\",\"sql\",\"Q1_Male\",\"Q1_Female\",\"Q1_Prefer not to say\",\n", 3822 | "\"Q1_Prefer to self-describe\",\"Q3_United States of America\",\n", 3823 | "\"Q3_India\",\"Q3_China\",\"major_cs\",\"major_other\",\"major_eng\",\n", 3824 | "\"major_stat\"],\"data\":[[22,16.0,1.0,0,1,0,0,1,0,0,0,0,1,0,1,0,\n", 3825 | "0,0],[25,18.0,1.0,70000,1,1,0,1,0,0,0,1,0,0,0,1,0,0]]}'\n", 3826 | "```\n", 3827 | "\n", 3828 | "``` pycon\n", 3829 | ">>> import json\n", 3830 | ">>> json.loads(X_test.head(2).to_json(orient='split', index=False))\n", 3831 | "{'columns': ['age',\n", 3832 | " 'education',\n", 3833 | " 'years_exp',\n", 3834 | " 'compensation',\n", 3835 | " 'python',\n", 3836 | " 'r',\n", 3837 | " 'sql',\n", 3838 | " 'Q1_Male',\n", 3839 | " 'Q1_Female',\n", 3840 | " 'Q1_Prefer not to say',\n", 3841 | " 'Q1_Prefer to self-describe',\n", 3842 | " 'Q3_United States of America',\n", 3843 | " 'Q3_India',\n", 3844 | " 'Q3_China',\n", 3845 | " 'major_cs',\n", 3846 | " 'major_other',\n", 3847 | " 'major_eng',\n", 3848 | " 'major_stat'],\n", 3849 | " 'data': [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],\n", 3850 | " [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}\n", 3851 | "```\n", 3852 | "\n", 3853 | "``` pycon\n", 3854 | ">>> {'dataframe_split': 
json.loads(X_test.head(2).to_json(orient='split', \n", 3855 | "... index=False))}\n", 3856 | "{'dataframe_split': {'columns': ['age',\n", 3857 | " 'education',\n", 3858 | " 'years_exp',\n", 3859 | " 'compensation',\n", 3860 | " 'python',\n", 3861 | " 'r',\n", 3862 | " 'sql',\n", 3863 | " 'Q1_Male',\n", 3864 | " 'Q1_Female',\n", 3865 | " 'Q1_Prefer not to say',\n", 3866 | " 'Q1_Prefer to self-describe',\n", 3867 | " 'Q3_United States of America',\n", 3868 | " 'Q3_India',\n", 3869 | " 'Q3_China',\n", 3870 | " 'major_cs',\n", 3871 | " 'major_other',\n", 3872 | " 'major_eng',\n", 3873 | " 'major_stat'],\n", 3874 | " 'data': [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],\n", 3875 | " [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}\n", 3876 | "```\n", 3877 | "\n", 3878 | "``` python\n", 3879 | "def create_post_data(df):\n", 3880 | " dictionary = json.loads(df\n", 3881 | " .to_json(orient='split', index=False))\n", 3882 | " return json.dumps({'dataframe_split': dictionary})\n", 3883 | "```\n", 3884 | "\n", 3885 | "``` pycon\n", 3886 | ">>> post_data = create_post_data(X_test.head(2))\n", 3887 | ">>> print(post_data)\n", 3888 | "{\"dataframe_split\": {\"columns\": [\"age\", \"education\", \"years_exp\", \"compensation\", \n", 3889 | " \"python\", \"r\", \"sql\", \"Q1_Male\", \"Q1_Female\", \"Q1_Prefer not to say\", \n", 3890 | " \"Q1_Prefer to self-describe\", \"Q3_United States of America\", \"Q3_India\", \"Q3_China\", \n", 3891 | " \"major_cs\", \"major_other\", \"major_eng\", \"major_stat\"], \n", 3892 | " \"data\": [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], \n", 3893 | " [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}\n", 3894 | "```\n", 3895 | "\n", 3896 | "``` python\n", 3897 | "!curl http://127.0.0.1:1234/invocations -X POST -H \\\n", 3898 | " \"Content-Type:application/json\" --data $post_data \n", 3899 | "```\n", 3900 | "\n", 3901 | "``` pycon\n", 3902 | ">>> quoted = f\"'{post_data}'\"\n", 3903 | ">>> quoted\n", 3904 | "'\\'{\"dataframe_split\": {\"columns\": [\"age\", \"education\", \n", 3905 | "\"years_exp\", \"compensation\", \"python\", \"r\", \"sql\", \"Q1_Male\", \n", 3906 | "\"Q1_Female\", \"Q1_Prefer not to say\", \"Q1_Prefer to self-describe\",\n", 3907 | "\"Q3_United States of America\", \"Q3_India\", \"Q3_China\", \"major_cs\",\n", 3908 | "\"major_other\", \"major_eng\", \"major_stat\"], \"data\": [[22, 16.0, 1.0,\n", 3909 | "0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0, \n", 3910 | "70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}\\''\n", 3911 | "```\n", 3912 | "\n", 3913 | "``` python\n", 3914 | "def create_post_data(df, quote=True):\n", 3915 | " dictionary = {'dataframe_split': json.loads(df\n", 3916 | " .to_json(orient='split', index=False))}\n", 3917 | " if quote:\n", 3918 | " return f\"'{dictionary}'\"\n", 3919 | " else:\n", 3920 | " return dictionary\n", 3921 | "\n", 3922 | "quoted = create_post_data(X_test.head(2))\n", 3923 | "```\n", 3924 | "\n", 3925 | "``` python\n", 3926 | "!curl http://127.0.0.1:1234/invocations -x post -h \\\n", 3927 | " \"content-type:application/json\" --data $quoted \n", 3928 | "```\n", 3929 | "\n", 3930 | "### Querying with the Requests Library\n", 3931 | "\n", 3932 | "``` pycon\n", 3933 | ">>> import requests as req\n", 3934 | ">>> import json\n", 3935 | "\n", 3936 | ">>> r = req.post('http://127.0.0.1:1234/invocations', \n", 3937 | "... 
json=create_post_data(X_test.head(2), quote=False))\n", 3938 | ">>> print(r.text)\n", 3939 | "{\"predictions\": [1, 0]}\n", 3940 | "```\n", 3941 | "\n", 3942 | "### Building with Docker\n", 3943 | "\n", 3944 | "### Conclusion\n", 3945 | "\n", 3946 | "### Exercises" 3947 | ] 3948 | } 3949 | ], 3950 | "metadata": { 3951 | "kernelspec": { 3952 | "display_name": "Python 3 (ipykernel)", 3953 | "language": "python", 3954 | "name": "python3" 3955 | }, 3956 | "language_info": { 3957 | "codemirror_mode": { 3958 | "name": "ipython", 3959 | "version": 3 3960 | }, 3961 | "file_extension": ".py", 3962 | "mimetype": "text/x-python", 3963 | "name": "python", 3964 | "nbconvert_exporter": "python", 3965 | "pygments_lexer": "ipython3", 3966 | "version": "3.10.11" 3967 | } 3968 | }, 3969 | "nbformat": 4, 3970 | "nbformat_minor": 5 3971 | } 3972 | --------------------------------------------------------------------------------