├── README.md └── xgbcode.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Effective XGBoost Code Repository 2 | 3 | Welcome to the official GitHub repository for "Effective XGBoost". Here, you'll find all the code examples included in the book, neatly organized by chapter. This repository serves as a practical resource for readers and allows for active collaboration through GitHub. 4 | 5 | 6 | 7 | ## Table of Contents 8 | 9 | 1. [About the Book](#about-the-book) 10 | 2. [How to Use This Repository](#how-to-use-this-repository) 11 | 3. [Purchase the Book](#purchase-the-book) 12 | 4. [Filing Bugs](#filing-bugs) 13 | 5. [Contributing](#contributing) 14 | 15 | ## About the Book 16 | 17 | "Effective XGBoost" is an in-depth, comprehensive guide to creating classification models, designed to help readers from a wide range of backgrounds, from beginners to seasoned professionals. It provides clear explanations, real-world examples, practical exercises, and much more. 18 | 19 | This book is the culmination of years of experience and knowledge shared by the author, Matt Harrison, a data science and Python consultant and corporate trainer. 20 | 21 | ## How to Use This Repository 22 | 23 | All of the book's code examples live in the [`xgbcode.ipynb`](xgbcode.ipynb) notebook; its section headings mirror the book's chapters, so you can jump straight to the chapter you are reading. The environment note at the end of this README lists the libraries the notebook relies on. 24 | 25 | ## Purchase the Book 26 | 27 | If you have not already done so, you can purchase "Effective XGBoost" from the following vendors: 28 | 29 | - Digital (PDF/EPUB/Kindle): [Effective XGBoost Digital](https://store.metasnake.com/xgboost) 30 | - Physical version: [Amazon](https://amzn.to/441i9lm) 31 | 32 | If you find the content of this repository helpful, imagine how much more you could learn from the complete book! Your purchase not only supports the work of the author but also contributes to the continuous improvement of this code repository. 33 | 34 | ## Filing Bugs 35 | 36 | We strive for perfection, but nobody's perfect. If you encounter any issues or errors in the book or in the code samples, please don't hesitate to file a bug in the [Issues](https://github.com/mattharrison/effective_xgboost_book/issues) section of this repository. When filing an issue, please include as much detail as possible, such as the chapter and page number, a description of the issue, and, if relevant, a screenshot or code snippet. 37 | 38 | ## Contributing 39 | 40 | We welcome and appreciate contributions from our readers. If you've noticed an error or a way to improve the code, feel free to create a pull request. For significant changes, please open an issue first to discuss the proposed changes. 41 | 42 | --- 43 | 44 | Happy coding, and enjoy the book!
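## Environment

`xgbcode.ipynb` pulls in a fairly broad stack. The list below is inferred from the notebook's own import statements; versions are not pinned here, so treat it as a starting point rather than the exact environment the book was built with:

``` python
# Rough dependency smoke test for xgbcode.ipynb. The package list is inferred
# from the notebook's imports; install anything that fails to import.
import dtreeviz
import feature_engine
import hyperopt
import matplotlib
import numpy as np
import pandas as pd
import scikitplot
import seaborn
import sklearn
import xgboost as xgb
import yellowbrick

# Report versions where the package exposes one.
for mod in (np, pd, sklearn, xgb, matplotlib, seaborn, yellowbrick):
    print(mod.__name__, getattr(mod, '__version__', 'unknown'))
```

Note that the later chapters also do `import xg_helpers as xhelp`. That module is not part of the repository listing above; it appears to collect the helper code defined in the earlier sections (`extract_zip`, `get_rawX_y`, the `kag_pl` pipeline, `my_dot_export`), so you will likely need to save those definitions into your own `xg_helpers.py` before running the later cells.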
45 | 46 | -------------------------------------------------------------------------------- /xgbcode.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a7edfa05-ede7-4f82-add8-41eb5c91bc41", 6 | "metadata": {}, 7 | "source": [ 8 | "## Datasets\n", 9 | "\n", 10 | "### Cleanup\n", 11 | "\n", 12 | "``` python\n", 13 | "import pandas as pd\n", 14 | "\n", 15 | "import urllib.request\n", 16 | "import zipfile\n", 17 | "\n", 18 | "\n", 19 | "url = 'https://github.com/mattharrison/datasets/raw/master/data/'\\\n", 20 | " 'kaggle-survey-2018.zip'\n", 21 | "fname = 'kaggle-survey-2018.zip'\n", 22 | "member_name = 'multipleChoiceResponses.csv'\n", 23 | "\n", 24 | "\n", 25 | "def extract_zip(src, dst, member_name):\n", 26 | " \"\"\"Extract a member file from a zip file and read it into a pandas \n", 27 | " DataFrame.\n", 28 | " \n", 29 | " Parameters:\n", 30 | " src (str): URL of the zip file to be downloaded and extracted.\n", 31 | " dst (str): Local file path where the zip file will be written.\n", 32 | " member_name (str): Name of the member file inside the zip file \n", 33 | " to be read into a DataFrame.\n", 34 | " \n", 35 | " Returns:\n", 36 | " pandas.DataFrame: DataFrame containing the contents of the \n", 37 | " member file.\n", 38 | " \"\"\" \n", 39 | " url = src\n", 40 | " fname = dst\n", 41 | " fin = urllib.request.urlopen(url)\n", 42 | " data = fin.read()\n", 43 | " with open(dst, mode='wb') as fout:\n", 44 | " fout.write(data)\n", 45 | " with zipfile.ZipFile(dst) as z:\n", 46 | " kag = pd.read_csv(z.open(member_name))\n", 47 | " kag_questions = kag.iloc[0]\n", 48 | " raw = kag.iloc[1:]\n", 49 | " return raw\n", 50 | "\n", 51 | "raw = extract_zip(url, fname, member_name) \n", 52 | "```\n", 53 | "\n", 54 | "### Cleanup Pipeline\n", 55 | "\n", 56 | "``` python\n", 57 | "def tweak_kag(df_: pd.DataFrame) -> pd.DataFrame:\n", 58 | " \"\"\"\n", 59 | " Tweak the Kaggle survey data and return a new DataFrame.\n", 60 | "\n", 61 | " This function takes a Pandas DataFrame containing Kaggle \n", 62 | " survey data as input and returns a new DataFrame. 
The \n", 63 | " modifications include extracting and transforming certain \n", 64 | " columns, renaming columns, and selecting a subset of columns.\n", 65 | "\n", 66 | " Parameters\n", 67 | " ----------\n", 68 | " df_ : pd.DataFrame\n", 69 | " The input DataFrame containing Kaggle survey data.\n", 70 | "\n", 71 | " Returns\n", 72 | " -------\n", 73 | " pd.DataFrame\n", 74 | " The new DataFrame with the modified and selected columns.\n", 75 | " \"\"\" \n", 76 | " return (df_\n", 77 | " .assign(age=df_.Q2.str.slice(0,2).astype(int),\n", 78 | " education=df_.Q4.replace({'Master’s degree': 18,\n", 79 | " 'Bachelor’s degree': 16,\n", 80 | " 'Doctoral degree': 20,\n", 81 | "'Some college/university study without earning a bachelor’s degree': 13,\n", 82 | " 'Professional degree': 19,\n", 83 | " 'I prefer not to answer': None,\n", 84 | " 'No formal education past high school': 12}),\n", 85 | " major=(df_.Q5\n", 86 | " .pipe(topn, n=3)\n", 87 | " .replace({\n", 88 | " 'Computer science (software engineering, etc.)': 'cs',\n", 89 | " 'Engineering (non-computer focused)': 'eng',\n", 90 | " 'Mathematics or statistics': 'stat'})\n", 91 | " ),\n", 92 | " years_exp=(df_.Q8.str.replace('+','', regex=False)\n", 93 | " .str.split('-', expand=True)\n", 94 | " .iloc[:,0]\n", 95 | " .astype(float)),\n", 96 | " compensation=(df_.Q9.str.replace('+','', regex=False)\n", 97 | " .str.replace(',','', regex=False)\n", 98 | " .str.replace('500000', '500', regex=False)\n", 99 | " .str.replace('I do not wish to disclose my approximate yearly compensation',\n", 100 | " '0', regex=False)\n", 101 | " .str.split('-', expand=True)\n", 102 | " .iloc[:,0]\n", 103 | " .fillna(0)\n", 104 | " .astype(int)\n", 105 | " .mul(1_000)\n", 106 | " ),\n", 107 | " python=df_.Q16_Part_1.fillna(0).replace('Python', 1),\n", 108 | " r=df_.Q16_Part_2.fillna(0).replace('R', 1),\n", 109 | " sql=df_.Q16_Part_3.fillna(0).replace('SQL', 1)\n", 110 | " )#assign\n", 111 | " .rename(columns=lambda col:col.replace(' ', '_'))\n", 112 | " .loc[:, 'Q1,Q3,age,education,major,years_exp,compensation,'\n", 113 | " 'python,r,sql'.split(',')] \n", 114 | " )\n", 115 | "\n", 116 | " \n", 117 | "def topn(ser, n=5, default='other'):\n", 118 | " \"\"\"\n", 119 | " Replace all values in a Pandas Series that are not among \n", 120 | " the top `n` most frequent values with a default value.\n", 121 | "\n", 122 | " This function takes a Pandas Series and returns a new \n", 123 | " Series with the values replaced as described above. The \n", 124 | " top `n` most frequent values are determined using the \n", 125 | " `value_counts` method of the input Series.\n", 126 | "\n", 127 | " Parameters\n", 128 | " ----------\n", 129 | " ser : pd.Series\n", 130 | " The input Series.\n", 131 | " n : int, optional\n", 132 | " The number of most frequent values to keep. The \n", 133 | " default value is 5.\n", 134 | " default : str, optional\n", 135 | " The default value to use for values that are not among \n", 136 | " the top `n` most frequent values. 
The default value is \n", 137 | " 'other'.\n", 138 | "\n", 139 | " Returns\n", 140 | " -------\n", 141 | " pd.Series\n", 142 | " The modified Series with the values replaced.\n", 143 | " \"\"\" \n", 144 | " counts = ser.value_counts()\n", 145 | " return ser.where(ser.isin(counts.index[:n]), default)\n", 146 | "```\n", 147 | "\n", 148 | "``` python\n", 149 | "from feature_engine import encoding, imputation\n", 150 | "from sklearn import base, pipeline\n", 151 | "\n", 152 | "\n", 153 | "class TweakKagTransformer(base.BaseEstimator,\n", 154 | " base.TransformerMixin):\n", 155 | " \"\"\"\n", 156 | " A transformer for tweaking Kaggle survey data.\n", 157 | "\n", 158 | " This transformer takes a Pandas DataFrame containing \n", 159 | " Kaggle survey data as input and returns a new version of \n", 160 | " the DataFrame. The modifications include extracting and \n", 161 | " transforming certain columns, renaming columns, and \n", 162 | " selecting a subset of columns.\n", 163 | "\n", 164 | " Parameters\n", 165 | " ----------\n", 166 | " ycol : str, optional\n", 167 | " The name of the column to be used as the target variable. \n", 168 | " If not specified, the target variable will not be set.\n", 169 | "\n", 170 | " Attributes\n", 171 | " ----------\n", 172 | " ycol : str\n", 173 | " The name of the column to be used as the target variable.\n", 174 | " \"\"\"\n", 175 | " \n", 176 | " def __init__(self, ycol=None):\n", 177 | " self.ycol = ycol\n", 178 | " \n", 179 | " def transform(self, X):\n", 180 | " return tweak_kag(X)\n", 181 | " \n", 182 | " def fit(self, X, y=None):\n", 183 | " return self\n", 184 | "```\n", 185 | "\n", 186 | "``` python\n", 187 | "def get_rawX_y(df, y_col):\n", 188 | " raw = (df\n", 189 | " .query('Q3.isin([\"United States of America\", \"China\", \"India\"]) '\n", 190 | " 'and Q6.isin([\"Data Scientist\", \"Software Engineer\"])')\n", 191 | " )\n", 192 | " return raw.drop(columns=[y_col]), raw[y_col]\n", 193 | "\n", 194 | "\n", 195 | "## Create a pipeline\n", 196 | "kag_pl = pipeline.Pipeline(\n", 197 | " [('tweak', TweakKagTransformer()),\n", 198 | " ('cat', encoding.OneHotEncoder(top_categories=5, drop_last=True, \n", 199 | " variables=['Q1', 'Q3', 'major'])),\n", 200 | " ('num_impute', imputation.MeanMedianImputer(imputation_method='median',\n", 201 | " variables=['education', 'years_exp']))]\n", 202 | ")\n", 203 | "```\n", 204 | "\n", 205 | "``` pycon\n", 206 | ">>> from sklearn import model_selection\n", 207 | ">>> kag_X, kag_y = get_rawX_y(raw, 'Q6')\n", 208 | " \n", 209 | ">>> kag_X_train, kag_X_test, kag_y_train, kag_y_test = \\\n", 210 | "... model_selection.train_test_split(\n", 211 | "... kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y)\n", 212 | "\n", 213 | ">>> X_train = kag_pl.fit_transform(kag_X_train, kag_y_train)\n", 214 | ">>> X_test = kag_pl.transform(kag_X_test)\n", 215 | ">>> print(X_train)\n", 216 | " age education years_exp ... major_other major_eng major_stat\n", 217 | "587 25 18.0 4.0 ... 1 0 0\n", 218 | "3065 22 16.0 1.0 ... 0 0 0\n", 219 | "8435 22 18.0 1.0 ... 1 0 0\n", 220 | "3110 40 20.0 3.0 ... 1 0 0\n", 221 | "16372 45 12.0 5.0 ... 1 0 0\n", 222 | "... ... ... ... ... ... ... ...\n", 223 | "16608 25 16.0 2.0 ... 0 0 0\n", 224 | "7325 18 16.0 1.0 ... 0 0 0\n", 225 | "21810 18 16.0 2.0 ... 0 0 0\n", 226 | "4917 25 18.0 1.0 ... 0 0 1\n", 227 | "639 25 18.0 1.0 ... 
0 0 0\n", 228 | "\n", 229 | "[2110 rows x 18 columns]\n", 230 | "```\n", 231 | "\n", 232 | "``` pycon\n", 233 | ">>> kag_y_train\n", 234 | "587 Software Engineer\n", 235 | "3065 Data Scientist\n", 236 | "8435 Data Scientist\n", 237 | "3110 Data Scientist\n", 238 | "16372 Software Engineer\n", 239 | " ... \n", 240 | "16608 Software Engineer\n", 241 | "7325 Software Engineer\n", 242 | "21810 Data Scientist\n", 243 | "4917 Data Scientist\n", 244 | "639 Data Scientist\n", 245 | "Name: Q6, Length: 2110, dtype: object\n", 246 | "```\n", 247 | "\n", 248 | "## Exploratory Data Analysis\n", 249 | "\n", 250 | "### Correlations\n", 251 | "\n", 252 | "``` python\n", 253 | "(X_train\n", 254 | " .assign(data_scientist = kag_y_train == 'Data Scientist')\n", 255 | " .corr(method='spearman')\n", 256 | " .style\n", 257 | " .background_gradient(cmap='RdBu', vmax=1, vmin=-1)\n", 258 | " .set_sticky(axis='index')\n", 259 | ")\n", 260 | "```\n", 261 | "\n", 262 | "### Bar Plot\n", 263 | "\n", 264 | "``` python\n", 265 | "import matplotlib.pyplot as plt\n", 266 | "\n", 267 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 268 | "(X_train\n", 269 | " .assign(data_scientist = kag_y_train)\n", 270 | " .groupby('r')\n", 271 | " .data_scientist\n", 272 | " .value_counts()\n", 273 | " .unstack()\n", 274 | " .plot.bar(ax=ax)\n", 275 | ")\n", 276 | "```\n", 277 | "\n", 278 | "``` python\n", 279 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 280 | "(pd.crosstab(index=X_train['major_cs'], \n", 281 | " columns=kag_y)\n", 282 | " .plot.bar(ax=ax)\n", 283 | ")\n", 284 | "```\n", 285 | "\n", 286 | "``` python\n", 287 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 288 | "(X_train\n", 289 | " .plot.scatter(x='years_exp', y='compensation', alpha=.3, ax=ax, c='purple')\n", 290 | ")\n", 291 | "```\n", 292 | "\n", 293 | "``` python\n", 294 | "import seaborn.objects as so\n", 295 | "fig = plt.figure(figsize=(8, 4))\n", 296 | "(so\n", 297 | " .Plot(X_train.assign(title=kag_y_train), x='years_exp', y='compensation', color='title')\n", 298 | " .add(so.Dots(alpha=.3, pointsize=2), so.Jitter(x=.5, y=10_000))\n", 299 | " .add(so.Line(), so.PolyFit())\n", 300 | " .on(fig) # not required unless saving to image\n", 301 | " .plot() # ditto\n", 302 | ")\n", 303 | "```\n", 304 | "\n", 305 | "``` python\n", 306 | "fig = plt.figure(figsize=(8, 4))\n", 307 | "(so\n", 308 | " .Plot(X_train\n", 309 | " #.query('compensation < 200_000 and years_exp < 16')\n", 310 | " .assign(\n", 311 | " title=kag_y_train,\n", 312 | " country=(X_train\n", 313 | " .loc[:, 'Q3_United States of America': 'Q3_China']\n", 314 | " .idxmax(axis='columns')\n", 315 | " )\n", 316 | " ), x='years_exp', y='compensation', color='title')\n", 317 | " .facet('country')\n", 318 | " .add(so.Dots(alpha=.01, pointsize=2, color='grey' ), so.Jitter(x=.5, y=10_000), col=None)\n", 319 | " .add(so.Dots(alpha=.5, pointsize=1.5), so.Jitter(x=.5, y=10_000))\n", 320 | " .add(so.Line(pointsize=1), so.PolyFit(order=2))\n", 321 | " .scale(x=so.Continuous().tick(at=[0,1,2,3,4,5]))\n", 322 | " .limit(y=(-10_000, 200_000), x=(-1, 6)) # zoom in with this not .query (above)\n", 323 | " .on(fig) # not required unless saving to image\n", 324 | " .plot() # ditto\n", 325 | ")\n", 326 | "```\n", 327 | "\n", 328 | "## Tree Creation\n", 329 | "\n", 330 | "### The Gini Coefficient\n", 331 | "\n", 332 | "``` python\n", 333 | "import numpy as np\n", 334 | "import numpy.random as rn\n", 335 | "\n", 336 | "pos_center = 12\n", 337 | "pos_count = 100\n", 338 | "neg_center = 7\n", 339 | "neg_count = 1000\n", 340 | "rs = 
rn.RandomState(rn.MT19937(rn.SeedSequence(42)))\n", 341 | "gini = pd.DataFrame({'value':\n", 342 | " np.append((pos_center) + rs.randn(pos_count),\n", 343 | " (neg_center) + rs.randn(neg_count)),\n", 344 | " 'label':\n", 345 | " ['pos']* pos_count + ['neg'] * neg_count})\n", 346 | "\n", 347 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 348 | "_ = (gini\n", 349 | " .groupby('label')\n", 350 | " [['value']]\n", 351 | " .plot.hist(bins=30, alpha=.5, ax=ax, edgecolor='black')\n", 352 | ")\n", 353 | "ax.legend(['Negative', 'Positive'])\n", 354 | "```\n", 355 | "\n", 356 | "``` python\n", 357 | "def calc_gini(df, val_col, label_col, pos_val, split_point,\n", 358 | " debug=False):\n", 359 | " \"\"\"\n", 360 | " This function calculates the Gini impurity of a dataset. Gini impurity \n", 361 | " is a measure of the probability of a random sample being classified \n", 362 | " incorrectly when a feature is used to split the data. The lower the \n", 363 | " impurity, the better the split.\n", 364 | "\n", 365 | " Parameters:\n", 366 | " df (pd.DataFrame): The dataframe containing the data\n", 367 | " val_col (str): The column name of the feature used to split the data\n", 368 | " label_col (str): The column name of the target variable\n", 369 | " pos_val (str or int): The value of the target variable that represents \n", 370 | " the positive class\n", 371 | " split_point (float): The threshold used to split the data.\n", 372 | " debug (bool): optional, when set to True, prints the calculated Gini\n", 373 | " impurities and the final weighted average\n", 374 | "\n", 375 | " Returns:\n", 376 | " float: The weighted average of Gini impurity for the positive and \n", 377 | " negative subsets.\n", 378 | " \"\"\" \n", 379 | " ge_split = df[val_col] >= split_point\n", 380 | " eq_pos = df[label_col] == pos_val\n", 381 | " tp = df[ge_split & eq_pos].shape[0]\n", 382 | " fp = df[ge_split & ~eq_pos].shape[0]\n", 383 | " tn = df[~ge_split & ~eq_pos].shape[0]\n", 384 | " fn = df[~ge_split & eq_pos].shape[0]\n", 385 | " pos_size = tp+fp\n", 386 | " neg_size = tn+fn\n", 387 | " total_size = len(df)\n", 388 | " if pos_size == 0:\n", 389 | " gini_pos = 0\n", 390 | " else:\n", 391 | " gini_pos = 1 - (tp/pos_size)**2 - (fp/pos_size)**2\n", 392 | " if neg_size == 0:\n", 393 | " gini_neg = 0\n", 394 | " else:\n", 395 | " gini_neg = 1 - (tn/neg_size)**2 - (fn/neg_size)**2\n", 396 | " weighted_avg = gini_pos * (pos_size/total_size) + \\\n", 397 | " gini_neg * (neg_size/total_size)\n", 398 | " if debug:\n", 399 | " print(f'{gini_pos=:.3} {gini_neg=:.3} {weighted_avg=:.3}')\n", 400 | " return weighted_avg\n", 401 | "```\n", 402 | "\n", 403 | "``` pycon\n", 404 | ">>> calc_gini(gini, val_col='value', label_col='label', pos_val='pos',\n", 405 | "... 
split_point=9.24, debug=True)\n", 406 | "gini_pos=0.217 gini_neg=0.00202 weighted_avg=0.0241\n", 407 | "0.024117224644432264\n", 408 | "```\n", 409 | "\n", 410 | "``` python\n", 411 | "values = np.arange(5, 15, .1)\n", 412 | "ginis = []\n", 413 | "for v in values:\n", 414 | " ginis.append(calc_gini(gini, val_col='value', label_col='label',\n", 415 | " pos_val='pos', split_point=v))\n", 416 | "fig, ax = plt.subplots(figsize=(8, 4)) \n", 417 | "ax.plot(values, ginis)\n", 418 | "ax.set_title('Gini Coefficient')\n", 419 | "ax.set_ylabel('Gini Coefficient')\n", 420 | "ax.set_xlabel('Split Point')\n", 421 | "```\n", 422 | "\n", 423 | "``` pycon\n", 424 | ">>> pd.Series(ginis, index=values).loc[9.5:10.5]\n", 425 | "9.6 0.013703\n", 426 | "9.7 0.010470\n", 427 | "9.8 0.007193\n", 428 | "9.9 0.005429\n", 429 | "10.0 0.007238\n", 430 | "10.1 0.005438\n", 431 | "10.2 0.005438\n", 432 | "10.3 0.007244\n", 433 | "10.4 0.009046\n", 434 | "10.5 0.009046\n", 435 | "dtype: float64\n", 436 | "```\n", 437 | "\n", 438 | "``` pycon\n", 439 | ">>> print(pd.DataFrame({'gini':ginis, 'split':values})\n", 440 | "... .query('gini <= gini.min()')\n", 441 | "... )\n", 442 | " gini split\n", 443 | "49 0.005429 9.9\n", 444 | "```\n", 445 | "\n", 446 | "### Coefficients in Trees\n", 447 | "\n", 448 | "``` python\n", 449 | "from sklearn import tree\n", 450 | "stump = tree.DecisionTreeClassifier(max_depth=1)\n", 451 | "stump.fit(gini[['value']], gini.label)\n", 452 | "```\n", 453 | "\n", 454 | "``` python\n", 455 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 456 | "tree.plot_tree(stump, feature_names=['value'],\n", 457 | " filled=True, \n", 458 | " class_names=stump.classes_,\n", 459 | " ax=ax)\n", 460 | "```\n", 461 | "\n", 462 | "``` pycon\n", 463 | ">>> gini_pos = 0.039\n", 464 | ">>> gini_neg = 0.002\n", 465 | ">>> pos_size = 101\n", 466 | ">>> neg_size = 999\n", 467 | ">>> total_size = pos_size + neg_size\n", 468 | ">>> weighted_avg = gini_pos * (pos_size/total_size) + \\\n", 469 | "... 
gini_neg * (neg_size/total_size)\n", 470 | ">>> print(weighted_avg)\n", 471 | "0.005397272727272727\n", 472 | "```\n", 473 | "\n", 474 | "``` python\n", 475 | "import xgboost as xgb\n", 476 | "xg_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1) \n", 477 | "xg_stump.fit(gini[['value']], (gini.label== 'pos'))\n", 478 | "```\n", 479 | "\n", 480 | "``` python\n", 481 | "xgb.plot_tree(xg_stump, num_trees=0)\n", 482 | "```\n", 483 | "\n", 484 | "``` python\n", 485 | "import subprocess\n", 486 | "def my_dot_export(xg, num_trees, filename, title='', direction='TB'):\n", 487 | " \"\"\"Exports a specified number of trees from an XGBoost model as a graph \n", 488 | " visualization in dot and png formats.\n", 489 | "\n", 490 | " Args:\n", 491 | " xg: An XGBoost model.\n", 492 | " num_trees: The number of tree to export.\n", 493 | " filename: The name of the file to save the exported visualization.\n", 494 | " title: The title to display on the graph visualization (optional).\n", 495 | " direction: The direction to lay out the graph, either 'TB' (top to \n", 496 | " bottom) or 'LR' (left to right) (optional).\n", 497 | " \"\"\"\n", 498 | " res = xgb.to_graphviz(xg, num_trees=num_trees)\n", 499 | " content = f''' node [fontname = \"Roboto Condensed\"];\n", 500 | " edge [fontname = \"Roboto Thin\"];\n", 501 | " label = \"{title}\"\n", 502 | " fontname = \"Roboto Condensed\"\n", 503 | " '''\n", 504 | " out = res.source.replace('graph [ rankdir=TB ]', \n", 505 | " f'graph [ rankdir={direction} ];\\n {content}')\n", 506 | " # dot -Gdpi=300 -Tpng -ocourseflow.png courseflow.dot \n", 507 | " dot_filename = filename\n", 508 | " with open(dot_filename, 'w') as fout:\n", 509 | " fout.write(out)\n", 510 | " png_filename = dot_filename.replace('.dot', '.png')\n", 511 | " subprocess.run(f'dot -Gdpi=300 -Tpng -o{png_filename} {dot_filename}'.split())\n", 512 | "```\n", 513 | "\n", 514 | "``` python\n", 515 | "my_dot_export(xg_stump, num_trees=0, filename='img/stump_xg.dot', title='A demo stump') \n", 516 | "```\n", 517 | "\n", 518 | "### Another Visualization Tool\n", 519 | "\n", 520 | "``` python\n", 521 | "import dtreeviz\n", 522 | "viz = dtreeviz.model(xg_stump, X_train=gini[['value']], \n", 523 | " y_train=gini.label=='pos',\n", 524 | " target_name='positive',\n", 525 | " feature_names=['value'], class_names=['negative', 'positive'],\n", 526 | " tree_index=0)\n", 527 | "viz.view()\n", 528 | "```\n", 529 | "\n", 530 | "## Stumps on Real Data\n", 531 | "\n", 532 | "### Scikit-learn stump on real data\n", 533 | "\n", 534 | "``` python\n", 535 | "stump_dt = tree.DecisionTreeClassifier(max_depth=1)\n", 536 | "X_train = kag_pl.fit_transform(kag_X_train)\n", 537 | "stump_dt.fit(X_train, kag_y_train)\n", 538 | "```\n", 539 | "\n", 540 | "``` python\n", 541 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 542 | "features = list(c for c in X_train.columns)\n", 543 | "tree.plot_tree(stump_dt, feature_names=features, \n", 544 | " filled=True, \n", 545 | " class_names=stump_dt.classes_,\n", 546 | " ax=ax)\n", 547 | "```\n", 548 | "\n", 549 | "``` pycon\n", 550 | ">>> X_test = kag_pl.transform(kag_X_test)\n", 551 | ">>> stump_dt.score(X_test, kag_y_test)\n", 552 | "0.6243093922651933\n", 553 | "```\n", 554 | "\n", 555 | "``` pycon\n", 556 | ">>> from sklearn import dummy\n", 557 | ">>> dummy_model = dummy.DummyClassifier()\n", 558 | ">>> dummy_model.fit(X_train, kag_y_train)\n", 559 | ">>> dummy_model.score(X_test, kag_y_test)\n", 560 | "0.5458563535911602\n", 561 | "```\n", 562 | "\n", 563 | "### Decision Stump with 
XGBoost\n", 564 | "\n", 565 | "``` python\n", 566 | "import xgboost as xgb\n", 567 | "kag_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1)\n", 568 | "kag_stump.fit(X_train, kag_y_train)\n", 569 | "```\n", 570 | "\n", 571 | "``` pycon\n", 572 | ">>> print(kag_y_train)\n", 573 | "587 Software Engineer\n", 574 | "3065 Data Scientist\n", 575 | "8435 Data Scientist\n", 576 | "3110 Data Scientist\n", 577 | "16372 Software Engineer\n", 578 | " ... \n", 579 | "16608 Software Engineer\n", 580 | "7325 Software Engineer\n", 581 | "21810 Data Scientist\n", 582 | "4917 Data Scientist\n", 583 | "639 Data Scientist\n", 584 | "Name: Q6, Length: 2110, dtype: object\n", 585 | "```\n", 586 | "\n", 587 | "``` pycon\n", 588 | ">>> print(kag_y_train == 'Software Engineer')\n", 589 | "587 True\n", 590 | "3065 False\n", 591 | "8435 False\n", 592 | "3110 False\n", 593 | "16372 True\n", 594 | " ... \n", 595 | "16608 True\n", 596 | "7325 True\n", 597 | "21810 False\n", 598 | "4917 False\n", 599 | "639 False\n", 600 | "Name: Q6, Length: 2110, dtype: bool\n", 601 | "```\n", 602 | "\n", 603 | "``` pycon\n", 604 | ">>> from sklearn import preprocessing\n", 605 | ">>> label_encoder = preprocessing.LabelEncoder()\n", 606 | ">>> y_train = label_encoder.fit_transform(kag_y_train)\n", 607 | ">>> y_test = label_encoder.transform(kag_y_test)\n", 608 | ">>> y_test[:5]\n", 609 | "array([1, 0, 0, 1, 1])\n", 610 | "```\n", 611 | "\n", 612 | "``` pycon\n", 613 | ">>> label_encoder.classes_\n", 614 | "array(['Data Scientist', 'Software Engineer'], dtype=object)\n", 615 | "```\n", 616 | "\n", 617 | "``` pycon\n", 618 | ">>> label_encoder.inverse_transform([0, 1])\n", 619 | "array(['Data Scientist', 'Software Engineer'], dtype=object)\n", 620 | "```\n", 621 | "\n", 622 | "``` pycon\n", 623 | ">>> kag_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1)\n", 624 | ">>> kag_stump.fit(X_train, y_train)\n", 625 | ">>> kag_stump.score(X_test, y_test)\n", 626 | "0.6243093922651933\n", 627 | "```\n", 628 | "\n", 629 | "``` python\n", 630 | "my_dot_export(kag_stump, num_trees=0, filename='img/stump_xg_kag.dot', \n", 631 | " title='XGBoost Stump') \n", 632 | "```\n", 633 | "\n", 634 | "### Values in the XGBoost Tree\n", 635 | "\n", 636 | "``` pycon\n", 637 | ">>> kag_stump.classes_\n", 638 | "array([0, 1])\n", 639 | "```\n", 640 | "\n", 641 | "``` python\n", 642 | "import numpy as np\n", 643 | "def inv_logit(p: float) -> float:\n", 644 | " \"\"\"\n", 645 | " Compute the inverse logit function of a given value.\n", 646 | "\n", 647 | " The inverse logit function is defined as:\n", 648 | " f(p) = exp(p) / (1 + exp(p))\n", 649 | "\n", 650 | " Parameters\n", 651 | " ----------\n", 652 | " p : float\n", 653 | " The input value to the inverse logit function.\n", 654 | "\n", 655 | " Returns\n", 656 | " -------\n", 657 | " float\n", 658 | " The output of the inverse logit function.\n", 659 | " \"\"\"\n", 660 | " return np.exp(p) / (1 + np.exp(p))\n", 661 | "```\n", 662 | "\n", 663 | "``` pycon\n", 664 | ">>> inv_logit(.0717741922)\n", 665 | "0.5179358489487103\n", 666 | "```\n", 667 | "\n", 668 | "``` pycon\n", 669 | ">>> inv_logit(-.3592)\n", 670 | "0.41115323716754393\n", 671 | "```\n", 672 | "\n", 673 | "``` python\n", 674 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 675 | "vals = np.linspace(-7, 7)\n", 676 | "ax.plot(vals, inv_logit(vals))\n", 677 | "ax.annotate('Crossover point', (0,.5), (-5,.8), arrowprops={'color':'k'}) \n", 678 | "ax.annotate('Predict Positive', (5,.6), (1,.6), va='center', arrowprops={'color':'k'}) \n", 679 | 
"ax.annotate('Predict Negative', (-5,.4), (-3,.4), va='center', arrowprops={'color':'k'}) \n", 680 | "```\n", 681 | "\n", 682 | "### Summary\n", 683 | "\n", 684 | "### Exercises\n", 685 | "\n", 686 | "## Model Complexity & Hyperparameters\n", 687 | "\n", 688 | "### Underfit\n", 689 | "\n", 690 | "``` pycon\n", 691 | ">>> underfit = tree.DecisionTreeClassifier(max_depth=1)\n", 692 | ">>> X_train = kag_pl.fit_transform(kag_X_train)\n", 693 | ">>> underfit.fit(X_train, kag_y_train)\n", 694 | ">>> underfit.score(X_test, kag_y_test)\n", 695 | "0.6243093922651933\n", 696 | "```\n", 697 | "\n", 698 | "### Growing a Tree\n", 699 | "\n", 700 | "### Overfitting\n", 701 | "\n", 702 | "### Overfitting with Decision Trees\n", 703 | "\n", 704 | "``` pycon\n", 705 | ">>> hi_variance = tree.DecisionTreeClassifier(max_depth=None)\n", 706 | ">>> X_train = kag_pl.fit_transform(kag_X_train)\n", 707 | ">>> hi_variance.fit(X_train, kag_y_train)\n", 708 | ">>> hi_variance.score(X_test, kag_y_test)\n", 709 | "0.6629834254143646\n", 710 | "```\n", 711 | "\n", 712 | "``` python\n", 713 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 714 | "features = list(c for c in X_train.columns)\n", 715 | "tree.plot_tree(hi_variance, feature_names=features, filled=True)\n", 716 | "```\n", 717 | "\n", 718 | "``` python\n", 719 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 720 | "features = list(c for c in X_train.columns)\n", 721 | "tree.plot_tree(hi_variance, feature_names=features, filled=True, \n", 722 | " class_names=hi_variance.classes_,\n", 723 | " max_depth=2, fontsize=6)\n", 724 | "```\n", 725 | "\n", 726 | "### Summary\n", 727 | "\n", 728 | "### Exercises\n", 729 | "\n", 730 | "## Tree Hyperparameters\n", 731 | "\n", 732 | "### Decision Tree Hyperparameters\n", 733 | "\n", 734 | "``` pycon\n", 735 | ">>> stump.get_params()\n", 736 | "{'ccp_alpha': 0.0,\n", 737 | " 'class_weight': None,\n", 738 | " 'criterion': 'gini',\n", 739 | " 'max_depth': 1,\n", 740 | " 'max_features': None,\n", 741 | " 'max_leaf_nodes': None,\n", 742 | " 'min_impurity_decrease': 0.0,\n", 743 | " 'min_samples_leaf': 1,\n", 744 | " 'min_samples_split': 2,\n", 745 | " 'min_weight_fraction_leaf': 0.0,\n", 746 | " 'random_state': None,\n", 747 | " 'splitter': 'best'}\n", 748 | "```\n", 749 | "\n", 750 | "### Tracking changes with Validation Curves\n", 751 | "\n", 752 | "``` python\n", 753 | "accuracies = []\n", 754 | "for depth in range(1, 15):\n", 755 | " between = tree.DecisionTreeClassifier(max_depth=depth)\n", 756 | " between.fit(X_train, kag_y_train)\n", 757 | " accuracies.append(between.score(X_test, kag_y_test))\n", 758 | "fig, ax = plt.subplots(figsize=(10,4)) \n", 759 | "(pd.Series(accuracies, name='Accuracy', index=range(1, len(accuracies)+1))\n", 760 | " .plot(ax=ax, title='Accuracy at a given Tree Depth'))\n", 761 | "ax.set_ylabel('Accuracy')\n", 762 | "ax.set_xlabel('max_depth')\n", 763 | "```\n", 764 | "\n", 765 | "``` pycon\n", 766 | ">>> between = tree.DecisionTreeClassifier(max_depth=7)\n", 767 | ">>> between.fit(X_train, kag_y_train)\n", 768 | ">>> between.score(X_test, kag_y_test)\n", 769 | "0.7359116022099448\n", 770 | "```\n", 771 | "\n", 772 | "### Leveraging Yellowbrick\n", 773 | "\n", 774 | "``` python\n", 775 | "from yellowbrick.model_selection import validation_curve\n", 776 | "fig, ax = plt.subplots(figsize=(10,4)) \n", 777 | "viz = validation_curve(tree.DecisionTreeClassifier(),\n", 778 | " X=pd.concat([X_train, X_test]),\n", 779 | " y=pd.concat([kag_y_train, kag_y_test]), \n", 780 | " param_name='max_depth', 
param_range=range(1,14),\n", 781 | " scoring='accuracy', cv=5, ax=ax, n_jobs=6) \n", 782 | "```\n", 783 | "\n", 784 | "### Grid Search\n", 785 | "\n", 786 | "``` python\n", 787 | "from sklearn.model_selection import GridSearchCV\n", 788 | "params = {\n", 789 | " 'max_depth': [3, 5, 7, 8],\n", 790 | " 'min_samples_leaf': [1, 3, 4, 5, 6],\n", 791 | " 'min_samples_split': [2, 3, 4, 5, 6],\n", 792 | "}\n", 793 | "grid_search = GridSearchCV(estimator=tree.DecisionTreeClassifier(), \n", 794 | " param_grid=params, cv=4, n_jobs=-1, \n", 795 | " verbose=1, scoring=\"accuracy\")\n", 796 | "grid_search.fit(pd.concat([X_train, X_test]),\n", 797 | " pd.concat([kag_y_train, kag_y_test]))\n", 798 | "```\n", 799 | "\n", 800 | "``` pycon\n", 801 | ">>> grid_search.best_params_\n", 802 | "{'max_depth': 7, 'min_samples_leaf': 5, 'min_samples_split': 6}\n", 803 | "```\n", 804 | "\n", 805 | "``` pycon\n", 806 | ">>> between2 = tree.DecisionTreeClassifier(**grid_search.best_params_)\n", 807 | ">>> between2.fit(X_train, kag_y_train)\n", 808 | ">>> between2.score(X_test, kag_y_test)\n", 809 | "0.7259668508287292\n", 810 | "```\n", 811 | "\n", 812 | "``` python\n", 813 | "# why is the score different than between_tree?\n", 814 | "(pd.DataFrame(grid_search.cv_results_)\n", 815 | " .sort_values(by='rank_test_score')\n", 816 | " .style\n", 817 | " .background_gradient(axis='rows')\n", 818 | ")\n", 819 | "```\n", 820 | "\n", 821 | "``` pycon\n", 822 | ">>> results = model_selection.cross_val_score(\n", 823 | "... tree.DecisionTreeClassifier(max_depth=7),\n", 824 | "... X=pd.concat([X_train, X_test], axis='index'),\n", 825 | "... y=pd.concat([kag_y_train, kag_y_test], axis='index'),\n", 826 | "... cv=4\n", 827 | "... )\n", 828 | "\n", 829 | ">>> results\n", 830 | "array([0.69628647, 0.73607427, 0.70291777, 0.7184595 ])\n", 831 | "```\n", 832 | "\n", 833 | "``` pycon\n", 834 | ">>> results.mean()\n", 835 | "0.7134345024851962\n", 836 | "```\n", 837 | "\n", 838 | "``` pycon\n", 839 | ">>> results = model_selection.cross_val_score(\n", 840 | "... tree.DecisionTreeClassifier(max_depth=7, min_samples_leaf=5,\n", 841 | "... min_samples_split=2),\n", 842 | "... X=pd.concat([X_train, X_test], axis='index'),\n", 843 | "... y=pd.concat([kag_y_train, kag_y_test], axis='index'),\n", 844 | "... cv=4\n", 845 | "... 
)\n", 846 | "\n", 847 | ">>> results\n", 848 | "array([0.70822281, 0.73740053, 0.70689655, 0.71580345])\n", 849 | "```\n", 850 | "\n", 851 | "``` pycon\n", 852 | ">>> results.mean()\n", 853 | "0.7170808366886126\n", 854 | "```\n", 855 | "\n", 856 | "### Summary\n", 857 | "\n", 858 | "### Exercises\n", 859 | "\n", 860 | "## Random Forest\n", 861 | "\n", 862 | "### Ensembles with Bagging\n", 863 | "\n", 864 | "### Scikit-learn Random Forest\n", 865 | "\n", 866 | "``` pycon\n", 867 | ">>> from sklearn import ensemble\n", 868 | ">>> rf = ensemble.RandomForestClassifier(random_state=42)\n", 869 | ">>> rf.fit(X_train, kag_y_train)\n", 870 | ">>> rf.score(X_test, kag_y_test)\n", 871 | "0.7237569060773481\n", 872 | "```\n", 873 | "\n", 874 | "``` pycon\n", 875 | ">>> rf.get_params()\n", 876 | "{'bootstrap': True,\n", 877 | " 'ccp_alpha': 0.0,\n", 878 | " 'class_weight': None,\n", 879 | " 'criterion': 'gini',\n", 880 | " 'max_depth': None,\n", 881 | " 'max_features': 'sqrt',\n", 882 | " 'max_leaf_nodes': None,\n", 883 | " 'max_samples': None,\n", 884 | " 'min_impurity_decrease': 0.0,\n", 885 | " 'min_samples_leaf': 1,\n", 886 | " 'min_samples_split': 2,\n", 887 | " 'min_weight_fraction_leaf': 0.0,\n", 888 | " 'n_estimators': 100,\n", 889 | " 'n_jobs': None,\n", 890 | " 'oob_score': False,\n", 891 | " 'random_state': 42,\n", 892 | " 'verbose': 0,\n", 893 | " 'warm_start': False}\n", 894 | "```\n", 895 | "\n", 896 | "``` pycon\n", 897 | ">>> len(rf.estimators_)\n", 898 | "100\n", 899 | "```\n", 900 | "\n", 901 | "``` pycon\n", 902 | ">>> print(rf.estimators_[0])\n", 903 | "DecisionTreeClassifier(max_features='sqrt', random_state=1608637542)\n", 904 | "```\n", 905 | "\n", 906 | "``` python\n", 907 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 908 | "features = list(c for c in X_train.columns)\n", 909 | "tree.plot_tree(rf.estimators_[0], feature_names=features, \n", 910 | " filled=True, class_names=rf.classes_, ax=ax,\n", 911 | " max_depth=2, fontsize=6)\n", 912 | "```\n", 913 | "\n", 914 | "### XGBoost Random Forest\n", 915 | "\n", 916 | "``` pycon\n", 917 | ">>> import xgboost as xgb\n", 918 | ">>> rf_xg = xgb.XGBRFClassifier(random_state=42)\n", 919 | ">>> rf_xg.fit(X_train, y_train) \n", 920 | ">>> rf_xg.score(X_test, y_test)\n", 921 | "0.7447513812154696\n", 922 | "```\n", 923 | "\n", 924 | "``` pycon\n", 925 | ">>> rf_xg.get_params()\n", 926 | "{'colsample_bynode': 0.8,\n", 927 | " 'learning_rate': 1.0,\n", 928 | " 'reg_lambda': 1e-05,\n", 929 | " 'subsample': 0.8,\n", 930 | " 'objective': 'binary:logistic',\n", 931 | " 'use_label_encoder': None,\n", 932 | " 'base_score': 0.5,\n", 933 | " 'booster': 'gbtree',\n", 934 | " 'callbacks': None,\n", 935 | " 'colsample_bylevel': 1,\n", 936 | " 'colsample_bytree': 1,\n", 937 | " 'early_stopping_rounds': None,\n", 938 | " 'enable_categorical': False,\n", 939 | " 'eval_metric': None,\n", 940 | " 'feature_types': None,\n", 941 | " 'gamma': 0,\n", 942 | " 'gpu_id': -1,\n", 943 | " 'grow_policy': 'depthwise',\n", 944 | " 'importance_type': None,\n", 945 | " 'interaction_constraints': '',\n", 946 | " 'max_bin': 256,\n", 947 | " 'max_cat_threshold': 64,\n", 948 | " 'max_cat_to_onehot': 4,\n", 949 | " 'max_delta_step': 0,\n", 950 | " 'max_depth': 6,\n", 951 | " 'max_leaves': 0,\n", 952 | " 'min_child_weight': 1,\n", 953 | " 'missing': nan,\n", 954 | " 'monotone_constraints': '()',\n", 955 | " 'n_estimators': 100,\n", 956 | " 'n_jobs': 0,\n", 957 | " 'num_parallel_tree': 100,\n", 958 | " 'predictor': 'auto',\n", 959 | " 'random_state': 42,\n", 960 | " 
'reg_alpha': 0,\n", 961 | " 'sampling_method': 'uniform',\n", 962 | " 'scale_pos_weight': 1,\n", 963 | " 'tree_method': 'exact',\n", 964 | " 'validate_parameters': 1,\n", 965 | " 'verbosity': None}\n", 966 | "```\n", 967 | "\n", 968 | "``` python\n", 969 | "fig, ax = plt.subplots(figsize=(6,12), dpi=600)\n", 970 | "xgb.plot_tree(rf_xg, num_trees=0, ax=ax, size='1,1')\n", 971 | "```\n", 972 | "\n", 973 | "``` python\n", 974 | "my_dot_export(rf_xg, num_trees=0, filename='img/rf_xg_kag.dot', \n", 975 | " title='First Random Forest Tree', direction='LR') \n", 976 | "```\n", 977 | "\n", 978 | "``` python\n", 979 | "viz = dtreeviz.model(rf_xg, X_train=X_train,\n", 980 | " y_train=y_train,\n", 981 | " target_name='Job', feature_names=list(X_train.columns), \n", 982 | " class_names=['DS', 'SE'], tree_index=0)\n", 983 | "viz.view(depth_range_to_display=[0,2])\n", 984 | "```\n", 985 | "\n", 986 | "### Random Forest Hyperparameters\n", 987 | "\n", 988 | "### Training the Number of Trees in the Forest\n", 989 | "\n", 990 | "``` python\n", 991 | "from yellowbrick.model_selection import validation_curve\n", 992 | "fig, ax = plt.subplots(figsize=(10,4)) \n", 993 | "viz = validation_curve(xgb.XGBClassifier(random_state=42),\n", 994 | " X=pd.concat([X_train, X_test], axis='index'),\n", 995 | " y=np.concatenate([y_train, y_test]),\n", 996 | " param_name='n_estimators', param_range=range(1, 100, 2),\n", 997 | " scoring='accuracy', cv=3, \n", 998 | " ax=ax) \n", 999 | "```\n", 1000 | "\n", 1001 | "``` pycon\n", 1002 | ">>> rf_xg29 = xgb.XGBRFClassifier(random_state=42, n_estimators=29)\n", 1003 | ">>> rf_xg29.fit(X_train, y_train) \n", 1004 | ">>> rf_xg29.score(X_test, y_test)\n", 1005 | "0.7480662983425415\n", 1006 | "```\n", 1007 | "\n", 1008 | "### Summary\n", 1009 | "\n", 1010 | "### Exercises\n", 1011 | "\n", 1012 | "## XGBoost\n", 1013 | "\n", 1014 | "### Jargon\n", 1015 | "\n", 1016 | "### Benefits of Boosting\n", 1017 | "\n", 1018 | "### A Big Downside\n", 1019 | "\n", 1020 | "### Creating an XGBoost Model\n", 1021 | "\n", 1022 | "``` python\n", 1023 | "%matplotlib inline\n", 1024 | "\n", 1025 | "import dtreeviz\n", 1026 | "from feature_engine import encoding, imputation\n", 1027 | "import matplotlib.pyplot as plt\n", 1028 | "import numpy as np\n", 1029 | "import pandas as pd\n", 1030 | "from sklearn import base, compose, datasets, ensemble, \\\n", 1031 | " metrics, model_selection, pipeline, preprocessing, tree\n", 1032 | "import scikitplot\n", 1033 | "import xgboost as xgb\n", 1034 | "import yellowbrick.model_selection as ms\n", 1035 | "from yellowbrick import classifier\n", 1036 | "\n", 1037 | "import urllib\n", 1038 | "import zipfile\n", 1039 | "\n", 1040 | "import xg_helpers as xhelp\n", 1041 | "```\n", 1042 | "\n", 1043 | "``` python\n", 1044 | "url = 'https://github.com/mattharrison/datasets/raw/master/data/'\\\n", 1045 | " 'kaggle-survey-2018.zip'\n", 1046 | "fname = 'kaggle-survey-2018.zip'\n", 1047 | "member_name = 'multipleChoiceResponses.csv'\n", 1048 | "\n", 1049 | "raw = xhelp.extract_zip(url, fname, member_name)\n", 1050 | "```\n", 1051 | "\n", 1052 | "``` python\n", 1053 | "## Create raw X and raw y\n", 1054 | "kag_X, kag_y = xhelp.get_rawX_y(raw, 'Q6')\n", 1055 | " \n", 1056 | "## Split data \n", 1057 | "kag_X_train, kag_X_test, kag_y_train, kag_y_test = \\\n", 1058 | " model_selection.train_test_split(\n", 1059 | " kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y) \n", 1060 | "\n", 1061 | "## Transform X with pipeline\n", 1062 | "X_train = 
xhelp.kag_pl.fit_transform(kag_X_train)\n", 1063 | "X_test = xhelp.kag_pl.transform(kag_X_test)\n", 1064 | "\n", 1065 | "## Transform y with label encoder\n", 1066 | "label_encoder = preprocessing.LabelEncoder()\n", 1067 | "label_encoder.fit(kag_y_train)\n", 1068 | "y_train = label_encoder.transform(kag_y_train)\n", 1069 | "y_test = label_encoder.transform(kag_y_test)\n", 1070 | "\n", 1071 | "# Combined Data for cross validation/etc\n", 1072 | "X = pd.concat([X_train, X_test], axis='index')\n", 1073 | "y = pd.Series([*y_train, *y_test], index=X.index)\n", 1074 | "```\n", 1075 | "\n", 1076 | "### A Boosted Model\n", 1077 | "\n", 1078 | "``` pycon\n", 1079 | ">>> xg_oob = xgb.XGBClassifier()\n", 1080 | ">>> xg_oob.fit(X_train, y_train)\n", 1081 | ">>> xg_oob.score(X_test, y_test)\n", 1082 | "0.7458563535911602\n", 1083 | "```\n", 1084 | "\n", 1085 | "``` pycon\n", 1086 | ">>> # Let's try w/ depth of 2 and 2 trees\n", 1087 | ">>> xg2 = xgb.XGBClassifier(max_depth=2, n_estimators=2)\n", 1088 | ">>> xg2.fit(X_train, y_train)\n", 1089 | ">>> xg2.score(X_test, y_test)\n", 1090 | "0.6685082872928176\n", 1091 | "```\n", 1092 | "\n", 1093 | "``` python\n", 1094 | "import dtreeviz\n", 1095 | "\n", 1096 | "viz = dtreeviz.model(xg2, X_train=X, y_train=y, target_name='Job',\n", 1097 | " feature_names=list(X_train.columns), \n", 1098 | " class_names=['DS', 'SE'], tree_index=0)\n", 1099 | "viz.view(depth_range_to_display=[0,2])\n", 1100 | "```\n", 1101 | "\n", 1102 | "### Understanding the Output of the Trees\n", 1103 | "\n", 1104 | "``` python\n", 1105 | "xhelp.my_dot_export(xg2, num_trees=0, filename='img/xgb_md2.dot', \n", 1106 | " title='First Tree') \n", 1107 | "```\n", 1108 | "\n", 1109 | "``` pycon\n", 1110 | ">>> # Predicts 1 - Software engineer\n", 1111 | ">>> se7894 = pd.DataFrame({'age': {7894: 22}, \n", 1112 | "... 'education': {7894: 16.0},\n", 1113 | "... 'years_exp': {7894: 1.0},\n", 1114 | "... 'compensation': {7894: 0},\n", 1115 | "... 'python': {7894: 1},\n", 1116 | "... 'r': {7894: 0},\n", 1117 | "... 'sql': {7894: 0},\n", 1118 | "... 'Q1_Male': {7894: 1}, \n", 1119 | "... 'Q1_Female': {7894: 0},\n", 1120 | "... 'Q1_Prefer not to say': {7894: 0},\n", 1121 | "... 'Q1_Prefer to self-describe': {7894: 0},\n", 1122 | "... 'Q3_United States of America': {7894: 0},\n", 1123 | "... 'Q3_India': {7894: 1},\n", 1124 | "... 'Q3_China': {7894: 0},\n", 1125 | "... 'major_cs': {7894: 0},\n", 1126 | "... 'major_other': {7894: 0},\n", 1127 | "... 'major_eng': {7894: 0},\n", 1128 | "... 
'major_stat': {7894: 0}})\n", 1129 | ">>> xg2.predict_proba(se7894)\n", 1130 | "array([[0.4986236, 0.5013764]], dtype=float32)\n", 1131 | "```\n", 1132 | "\n", 1133 | "``` pycon\n", 1134 | ">>> # Predicts 1 - Software engineer\n", 1135 | ">>> xg2.predict(pd.DataFrame(se7894))\n", 1136 | "array([1])\n", 1137 | "```\n", 1138 | "\n", 1139 | "``` python\n", 1140 | "xhelp.my_dot_export(xg2, num_trees=1, filename='img/xgb_md2_tree1.dot', title='Second Tree') \n", 1141 | "```\n", 1142 | "\n", 1143 | "``` python\n", 1144 | "def inv_logit(p):\n", 1145 | " return np.exp(p) / (1 + np.exp(p))\n", 1146 | "```\n", 1147 | "\n", 1148 | "``` pycon\n", 1149 | ">>> inv_logit(-0.08476+0.0902701)\n", 1150 | "0.5013775215147345\n", 1151 | "```\n", 1152 | "\n", 1153 | "### Summary\n", 1154 | "\n", 1155 | "### Exercises\n", 1156 | "\n", 1157 | "## Early Stopping\n", 1158 | "\n", 1159 | "### Early Stopping Rounds\n", 1160 | "\n", 1161 | "``` pycon\n", 1162 | ">>> # Defaults\n", 1163 | ">>> xg = xgb.XGBClassifier()\n", 1164 | ">>> xg.fit(X_train, y_train)\n", 1165 | ">>> xg.score(X_test, y_test)\n", 1166 | "0.7458563535911602\n", 1167 | "```\n", 1168 | "\n", 1169 | "``` pycon\n", 1170 | ">>> xg = xgb.XGBClassifier(early_stopping_rounds=20)\n", 1171 | ">>> xg.fit(X_train, y_train,\n", 1172 | "... eval_set=[(X_train, y_train),\n", 1173 | "... (X_test, y_test)\n", 1174 | "... ]\n", 1175 | "... )\n", 1176 | ">>> xg.score(X_test, y_test)\n", 1177 | "[0] validation_0-logloss:0.61534 validation_1-logloss:0.61775\n", 1178 | "[1] validation_0-logloss:0.57046 validation_1-logloss:0.57623\n", 1179 | "[2] validation_0-logloss:0.54011 validation_1-logloss:0.55333\n", 1180 | "[3] validation_0-logloss:0.51965 validation_1-logloss:0.53711\n", 1181 | "[4] validation_0-logloss:0.50419 validation_1-logloss:0.52511\n", 1182 | "[5] validation_0-logloss:0.49176 validation_1-logloss:0.51741\n", 1183 | "[6] validation_0-logloss:0.48159 validation_1-logloss:0.51277\n", 1184 | "[7] validation_0-logloss:0.47221 validation_1-logloss:0.51040\n", 1185 | "[8] validation_0-logloss:0.46221 validation_1-logloss:0.50713\n", 1186 | "[9] validation_0-logloss:0.45700 validation_1-logloss:0.50583\n", 1187 | "[10] validation_0-logloss:0.45062 validation_1-logloss:0.50430\n", 1188 | "[11] validation_0-logloss:0.44533 validation_1-logloss:0.50338\n", 1189 | "[12] validation_0-logloss:0.43736 validation_1-logloss:0.50033\n", 1190 | "[13] validation_0-logloss:0.43399 validation_1-logloss:0.50034\n", 1191 | "[14] validation_0-logloss:0.43004 validation_1-logloss:0.50192\n", 1192 | "[15] validation_0-logloss:0.42550 validation_1-logloss:0.50268\n", 1193 | "[16] validation_0-logloss:0.42169 validation_1-logloss:0.50196\n", 1194 | "[17] validation_0-logloss:0.41854 validation_1-logloss:0.50223\n", 1195 | "[18] validation_0-logloss:0.41485 validation_1-logloss:0.50360\n", 1196 | "[19] validation_0-logloss:0.41228 validation_1-logloss:0.50527\n", 1197 | "[20] validation_0-logloss:0.40872 validation_1-logloss:0.50839\n", 1198 | "[21] validation_0-logloss:0.40490 validation_1-logloss:0.50623\n", 1199 | "[22] validation_0-logloss:0.40280 validation_1-logloss:0.50806\n", 1200 | "[23] validation_0-logloss:0.39942 validation_1-logloss:0.51007\n", 1201 | "[24] validation_0-logloss:0.39807 validation_1-logloss:0.50987\n", 1202 | "[25] validation_0-logloss:0.39473 validation_1-logloss:0.51189\n", 1203 | "[26] validation_0-logloss:0.39389 validation_1-logloss:0.51170\n", 1204 | "[27] validation_0-logloss:0.39040 validation_1-logloss:0.51218\n", 1205 | "[28] 
validation_0-logloss:0.38837 validation_1-logloss:0.51135\n", 1206 | "[29] validation_0-logloss:0.38569 validation_1-logloss:0.51202\n", 1207 | "[30] validation_0-logloss:0.37945 validation_1-logloss:0.51352\n", 1208 | "[31] validation_0-logloss:0.37840 validation_1-logloss:0.51545\n", 1209 | "0.7558011049723757\n", 1210 | "```\n", 1211 | "\n", 1212 | "``` pycon\n", 1213 | ">>> xg.best_ntree_limit\n", 1214 | "13\n", 1215 | "```\n", 1216 | "\n", 1217 | "### Plotting Tree Performance\n", 1218 | "\n", 1219 | "``` pycon\n", 1220 | ">>> # validation_0 is for training data\n", 1221 | ">>> # validation_1 is for testing data\n", 1222 | ">>> results = xg.evals_result()\n", 1223 | ">>> results\n", 1224 | "{'validation_0': OrderedDict([('logloss',\n", 1225 | " [0.6153406503923696,\n", 1226 | " 0.5704566627034644,\n", 1227 | " 0.5401074953836288,\n", 1228 | " 0.519646179894983,\n", 1229 | " 0.5041859194071372,\n", 1230 | " 0.49175883369140716,\n", 1231 | " 0.4815858465553177,\n", 1232 | " 0.4722135672319274,\n", 1233 | " 0.46221246084118905,\n", 1234 | " 0.4570046103131291,\n", 1235 | " 0.45062119092139025,\n", 1236 | " 0.44533101600634545,\n", 1237 | " 0.4373589513231934,\n", 1238 | " 0.4339914069003403,\n", 1239 | " 0.4300442738158372,\n", 1240 | " 0.42550266018419824,\n", 1241 | " 0.42168949383456633,\n", 1242 | " 0.41853931894949614,\n", 1243 | " 0.41485192559138645,\n", 1244 | " 0.4122836278413833,\n", 1245 | " 0.4087179538231096,\n", 1246 | " 0.404898268053467,\n", 1247 | " 0.4027963532207719,\n", 1248 | " 0.39941699938733854,\n", 1249 | " 0.3980718078477953,\n", 1250 | " 0.39473153180519993,\n", 1251 | " 0.39388538948800944,\n", 1252 | " 0.39039599470886893,\n", 1253 | " 0.38837148147752126,\n", 1254 | " 0.38569152626668,\n", 1255 | " 0.3794510693344513,\n", 1256 | " 0.37840359436957194,\n", 1257 | " 0.37538466192241227])]),\n", 1258 | " 'validation_1': OrderedDict([('logloss',\n", 1259 | " [0.6177459120091813,\n", 1260 | " 0.5762297115602546,\n", 1261 | " 0.5533292921537852,\n", 1262 | " 0.5371078260695736,\n", 1263 | " 0.5251118483299708,\n", 1264 | " 0.5174100387491574,\n", 1265 | " 0.5127666981510036,\n", 1266 | " 0.5103968678752362,\n", 1267 | " 0.5071349115538004,\n", 1268 | " 0.5058257413585542,\n", 1269 | " 0.5043005662687247,\n", 1270 | " 0.5033770955193438,\n", 1271 | " 0.5003349146419797,\n", 1272 | " 0.5003436393562437,\n", 1273 | " 0.5019165392779843,\n", 1274 | " 0.502677517614806,\n", 1275 | " 0.501961292550791,\n", 1276 | " 0.5022262006329157,\n", 1277 | " 0.5035970173261607,\n", 1278 | " 0.5052709663297096,\n", 1279 | " 0.508388655664636,\n", 1280 | " 0.5062287504923689,\n", 1281 | " 0.5080608455824424,\n", 1282 | " 0.5100736726054829,\n", 1283 | " 0.5098673969229365,\n", 1284 | " 0.5118910041889845,\n", 1285 | " 0.5117007332982608,\n", 1286 | " 0.5121825202836434,\n", 1287 | " 0.5113475993625531,\n", 1288 | " 0.5120185821281118,\n", 1289 | " 0.5135189292720874,\n", 1290 | " 0.5154504034915188,\n", 1291 | " 0.5158137131755071])])}\n", 1292 | "```\n", 1293 | "\n", 1294 | "``` python\n", 1295 | "# Testing score is best at 13 trees\n", 1296 | "results = xg.evals_result()\n", 1297 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1298 | "ax = (pd.DataFrame({'training': results['validation_0']['logloss'],\n", 1299 | " 'testing': results['validation_1']['logloss']})\n", 1300 | " .assign(ntrees=lambda adf: range(1, len(adf)+1)) \n", 1301 | " .set_index('ntrees')\n", 1302 | " .plot(figsize=(5,4), ax=ax, \n", 1303 | " title='eval_results with early_stopping')\n", 1304 | ")\n", 1305 | 
"ax.annotate('Best number \\nof trees (13)', xy=(13, .498),\n", 1306 | " xytext=(20,.42), arrowprops={'color':'k'})\n", 1307 | "ax.set_xlabel('ntrees')\n", 1308 | "```\n", 1309 | "\n", 1310 | "``` python\n", 1311 | "# Using value from early stopping gives same result\n", 1312 | ">>> xg13 = xgb.XGBClassifier(n_estimators=13)\n", 1313 | ">>> xg13.fit(X_train, y_train,\n", 1314 | "... eval_set=[(X_train, y_train),\n", 1315 | "... (X_test, y_test)]\n", 1316 | "... )\n", 1317 | ">>> xg13.score(X_test, y_test)\n", 1318 | "```\n", 1319 | "\n", 1320 | "``` pycon\n", 1321 | ">>> xg.score(X_test, y_test)\n", 1322 | "0.7558011049723757\n", 1323 | "```\n", 1324 | "\n", 1325 | "``` pycon\n", 1326 | ">>> # No early stopping, uses all estimators\n", 1327 | ">>> xg_no_es = xgb.XGBClassifier()\n", 1328 | ">>> xg_no_es.fit(X_train, y_train)\n", 1329 | ">>> xg_no_es.score(X_test, y_test)\n", 1330 | "0.7458563535911602\n", 1331 | "```\n", 1332 | "\n", 1333 | "### Different `eval_metrics`\n", 1334 | "\n", 1335 | "``` pycon\n", 1336 | ">>> xg_err = xgb.XGBClassifier(early_stopping_rounds=20, \n", 1337 | "... eval_metric='error')\n", 1338 | ">>> xg_err.fit(X_train, y_train,\n", 1339 | "... eval_set=[(X_train, y_train),\n", 1340 | "... (X_test, y_test)\n", 1341 | "... ]\n", 1342 | "... )\n", 1343 | ">>> xg_err.score(X_test, y_test)\n", 1344 | "[0] validation_0-error:0.24739 validation_1-error:0.27072\n", 1345 | "[1] validation_0-error:0.24218 validation_1-error:0.26188\n", 1346 | "[2] validation_0-error:0.23839 validation_1-error:0.24751\n", 1347 | "[3] validation_0-error:0.23697 validation_1-error:0.25193\n", 1348 | "[4] validation_0-error:0.23081 validation_1-error:0.24530\n", 1349 | "[5] validation_0-error:0.22607 validation_1-error:0.24420\n", 1350 | "[6] validation_0-error:0.22180 validation_1-error:0.24862\n", 1351 | "[7] validation_0-error:0.21801 validation_1-error:0.24862\n", 1352 | "[8] validation_0-error:0.21280 validation_1-error:0.25304\n", 1353 | "[9] validation_0-error:0.21043 validation_1-error:0.25304\n", 1354 | "[10] validation_0-error:0.20806 validation_1-error:0.24641\n", 1355 | "[11] validation_0-error:0.20284 validation_1-error:0.25193\n", 1356 | "[12] validation_0-error:0.20047 validation_1-error:0.24420\n", 1357 | "[13] validation_0-error:0.19668 validation_1-error:0.24420\n", 1358 | "[14] validation_0-error:0.19384 validation_1-error:0.24530\n", 1359 | "[15] validation_0-error:0.18815 validation_1-error:0.24199\n", 1360 | "[16] validation_0-error:0.18531 validation_1-error:0.24199\n", 1361 | "[17] validation_0-error:0.18389 validation_1-error:0.23867\n", 1362 | "[18] validation_0-error:0.18531 validation_1-error:0.23757\n", 1363 | "[19] validation_0-error:0.18815 validation_1-error:0.23867\n", 1364 | "[20] validation_0-error:0.18246 validation_1-error:0.24199\n", 1365 | "[21] validation_0-error:0.17915 validation_1-error:0.24862\n", 1366 | "[22] validation_0-error:0.17867 validation_1-error:0.24751\n", 1367 | "[23] validation_0-error:0.17630 validation_1-error:0.24199\n", 1368 | "[24] validation_0-error:0.17488 validation_1-error:0.24309\n", 1369 | "[25] validation_0-error:0.17251 validation_1-error:0.24530\n", 1370 | "[26] validation_0-error:0.17204 validation_1-error:0.24309\n", 1371 | "[27] validation_0-error:0.16825 validation_1-error:0.24199\n", 1372 | "[28] validation_0-error:0.16730 validation_1-error:0.24088\n", 1373 | "[29] validation_0-error:0.16019 validation_1-error:0.24199\n", 1374 | "[30] validation_0-error:0.15782 validation_1-error:0.24972\n", 1375 | "[31] 
validation_0-error:0.15972 validation_1-error:0.24862\n", 1376 | "[32] validation_0-error:0.15924 validation_1-error:0.24641\n", 1377 | "[33] validation_0-error:0.15403 validation_1-error:0.25635\n", 1378 | "[34] validation_0-error:0.15261 validation_1-error:0.25525\n", 1379 | "[35] validation_0-error:0.15213 validation_1-error:0.25525\n", 1380 | "[36] validation_0-error:0.15166 validation_1-error:0.25525\n", 1381 | "[37] validation_0-error:0.14550 validation_1-error:0.25525\n", 1382 | "[38] validation_0-error:0.14597 validation_1-error:0.25083\n", 1383 | "0.7624309392265194\n", 1384 | "```\n", 1385 | "\n", 1386 | "``` pycon\n", 1387 | ">>> xg_err.best_ntree_limit\n", 1388 | "19\n", 1389 | "```\n", 1390 | "\n", 1391 | "### Summary\n", 1392 | "\n", 1393 | "### Exercises\n", 1394 | "\n", 1395 | "## XGBoost Hyperparameters\n", 1396 | "\n", 1397 | "### Hyperparameters\n", 1398 | "\n", 1399 | "### Examining Hyperparameters\n", 1400 | "\n", 1401 | "``` pycon\n", 1402 | ">>> xg = xgb.XGBClassifier() # set the hyperparamters in here\n", 1403 | ">>> xg.fit(X_train, y_train)\n", 1404 | ">>> xg.get_params()\n", 1405 | "{'objective': 'binary:logistic',\n", 1406 | " 'use_label_encoder': None,\n", 1407 | " 'base_score': 0.5,\n", 1408 | " 'booster': 'gbtree',\n", 1409 | " 'callbacks': None,\n", 1410 | " 'colsample_bylevel': 1,\n", 1411 | " 'colsample_bynode': 1,\n", 1412 | " 'colsample_bytree': 1,\n", 1413 | " 'early_stopping_rounds': None,\n", 1414 | " 'enable_categorical': False,\n", 1415 | " 'eval_metric': None,\n", 1416 | " 'feature_types': None,\n", 1417 | " 'gamma': 0,\n", 1418 | " 'gpu_id': -1,\n", 1419 | " 'grow_policy': 'depthwise',\n", 1420 | " 'importance_type': None,\n", 1421 | " 'interaction_constraints': '',\n", 1422 | " 'learning_rate': 0.300000012,\n", 1423 | " 'max_bin': 256,\n", 1424 | " 'max_cat_threshold': 64,\n", 1425 | " 'max_cat_to_onehot': 4,\n", 1426 | " 'max_delta_step': 0,\n", 1427 | " 'max_depth': 6,\n", 1428 | " 'max_leaves': 0,\n", 1429 | " 'min_child_weight': 1,\n", 1430 | " 'missing': nan,\n", 1431 | " 'monotone_constraints': '()',\n", 1432 | " 'n_estimators': 100,\n", 1433 | " 'n_jobs': 0,\n", 1434 | " 'num_parallel_tree': 1,\n", 1435 | " 'predictor': 'auto',\n", 1436 | " 'random_state': 0,\n", 1437 | " 'reg_alpha': 0,\n", 1438 | " 'reg_lambda': 1,\n", 1439 | " 'sampling_method': 'uniform',\n", 1440 | " 'scale_pos_weight': 1,\n", 1441 | " 'subsample': 1,\n", 1442 | " 'tree_method': 'exact',\n", 1443 | " 'validate_parameters': 1,\n", 1444 | " 'verbosity': None}\n", 1445 | "```\n", 1446 | "\n", 1447 | "### Tuning Hyperparameters\n", 1448 | "\n", 1449 | "``` python\n", 1450 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1451 | "ms.validation_curve(xgb.XGBClassifier(), X_train, y_train, param_name='gamma', \n", 1452 | " param_range=[0, .5, 1,5,10, 20, 30], n_jobs=-1, ax=ax)\n", 1453 | "```\n", 1454 | "\n", 1455 | "### Intuitive Understanding of Learning Rate\n", 1456 | "\n", 1457 | "``` python\n", 1458 | "# check impact of learning weight on scores\n", 1459 | "xg_lr1 = xgb.XGBClassifier(learning_rate=1, max_depth=2)\n", 1460 | "xg_lr1.fit(X_train, y_train)\n", 1461 | "```\n", 1462 | "\n", 1463 | "``` python\n", 1464 | "my_dot_export(xg_lr1, num_trees=0, filename='img/xg_depth2_tree0.dot', \n", 1465 | " title='Learning Rate set to 1') \n", 1466 | "```\n", 1467 | "\n", 1468 | "``` python\n", 1469 | "# check impact of learning weight on scores\n", 1470 | "xg_lr001 = xgb.XGBClassifier(learning_rate=.001, max_depth=2)\n", 1471 | "xg_lr001.fit(X_train, y_train)\n", 1472 | "```\n", 
1473 | "\n", 1474 | "``` python\n", 1475 | "my_dot_export(xg_lr001, num_trees=0, filename='img/xg_depth2_tree0_lr001.dot',\n", 1476 | " title='Learning Rate set to .001') \n", 1477 | "```\n", 1478 | "\n", 1479 | "### Grid Search\n", 1480 | "\n", 1481 | "``` python\n", 1482 | "from sklearn import model_selection\n", 1483 | "params = {'reg_lambda': [0], # No effect\n", 1484 | " 'learning_rate': [.1, .3], # makes each boost more conservative \n", 1485 | " 'subsample': [.7, 1],\n", 1486 | " 'max_depth': [2, 3],\n", 1487 | " 'random_state': [42],\n", 1488 | " 'n_jobs': [-1],\n", 1489 | " 'n_estimators': [200]}\n", 1490 | "\n", 1491 | "xgb2 = xgb.XGBClassifier(early_stopping_rounds=5) \n", 1492 | "cv = (model_selection.GridSearchCV(xgb2, params, cv=3, n_jobs=-1)\n", 1493 | " .fit(X_train, y_train,\n", 1494 | " eval_set=[(X_test, y_test)],\n", 1495 | " verbose=50\n", 1496 | " )\n", 1497 | ")\n", 1498 | "```\n", 1499 | "\n", 1500 | "``` pycon\n", 1501 | ">>> cv.best_params_\n", 1502 | "{'learning_rate': 0.3,\n", 1503 | " 'max_depth': 2,\n", 1504 | " 'n_estimators': 200,\n", 1505 | " 'n_jobs': -1,\n", 1506 | " 'random_state': 42,\n", 1507 | " 'reg_lambda': 0,\n", 1508 | " 'subsample': 1}\n", 1509 | "```\n", 1510 | "\n", 1511 | "``` python\n", 1512 | "params = {'learning_rate': 0.3,\n", 1513 | " 'max_depth': 2,\n", 1514 | " 'n_estimators': 200,\n", 1515 | " 'n_jobs': -1,\n", 1516 | " 'random_state': 42,\n", 1517 | " 'reg_lambda': 0,\n", 1518 | " 'subsample': 1\n", 1519 | "}\n", 1520 | "xgb_grid = xgb.XGBClassifier(**params, early_stopping_rounds=50)\n", 1521 | "xgb_grid.fit(X_train, y_train, eval_set=[(X_train, y_train),\n", 1522 | " (X_test, y_test)],\n", 1523 | " verbose=10\n", 1524 | ")\n", 1525 | "```\n", 1526 | "\n", 1527 | "``` python\n", 1528 | "# vs default\n", 1529 | "xgb_def = xgb.XGBClassifier(early_stopping_rounds=50)\n", 1530 | "xgb_def.fit(X_train, y_train, eval_set=[(X_train, y_train),\n", 1531 | " (X_test, y_test)],\n", 1532 | " verbose=10\n", 1533 | ")\n", 1534 | "```\n", 1535 | "\n", 1536 | "``` pycon\n", 1537 | ">>> xgb_def.score(X_test, y_test), xgb_grid.score(X_test, y_test)\n", 1538 | "(0.7558011049723757, 0.7524861878453039)\n", 1539 | "```\n", 1540 | "\n", 1541 | "``` pycon\n", 1542 | ">>> results_default = model_selection.cross_val_score(\n", 1543 | "... xgb.XGBClassifier(),\n", 1544 | "... X=X, y=y,\n", 1545 | "... cv=4\n", 1546 | "... )\n", 1547 | "```\n", 1548 | "\n", 1549 | "``` pycon\n", 1550 | ">>> results_default\n", 1551 | "array([0.71352785, 0.72413793, 0.69496021, 0.74501992])\n", 1552 | "```\n", 1553 | "\n", 1554 | "``` pycon\n", 1555 | ">>> results_default.mean()\n", 1556 | "0.7194114787534214\n", 1557 | "```\n", 1558 | "\n", 1559 | "``` pycon\n", 1560 | ">>> results_grid = model_selection.cross_val_score(\n", 1561 | "... xgb.XGBClassifier(**params),\n", 1562 | "... X=X, y=y,\n", 1563 | "... cv=4\n", 1564 | "... 
)\n", 1565 | "```\n", 1566 | "\n", 1567 | "``` pycon\n", 1568 | ">>> results_grid\n", 1569 | "array([0.74137931, 0.74137931, 0.74801061, 0.73572377])\n", 1570 | "```\n", 1571 | "\n", 1572 | "``` pycon\n", 1573 | ">>> results_grid.mean()\n", 1574 | "0.7416232505873941\n", 1575 | "```\n", 1576 | "\n", 1577 | "### Summary\n", 1578 | "\n", 1579 | "### Exercises\n", 1580 | "\n", 1581 | "## Hyperopt\n", 1582 | "\n", 1583 | "### Bayesian Optimization\n", 1584 | "\n", 1585 | "### Exhaustive Tuning with Hyperopt\n", 1586 | "\n", 1587 | "``` python\n", 1588 | "from hyperopt import fmin, tpe, hp, STATUS_OK, Trials\n", 1589 | "from sklearn.metrics import accuracy_score, roc_auc_score \n", 1590 | "\n", 1591 | "from typing import Any, Dict, Union\n", 1592 | "\n", 1593 | "def hyperparameter_tuning(space: Dict[str, Union[float, int]], \n", 1594 | " X_train: pd.DataFrame, y_train: pd.Series, \n", 1595 | " X_test: pd.DataFrame, y_test: pd.Series, \n", 1596 | " early_stopping_rounds: int=50,\n", 1597 | " metric:callable=accuracy_score) -> Dict[str, Any]:\n", 1598 | " \"\"\"\n", 1599 | " Perform hyperparameter tuning for an XGBoost classifier.\n", 1600 | "\n", 1601 | " This function takes a dictionary of hyperparameters, training \n", 1602 | " and test data, and an optional value for early stopping rounds, \n", 1603 | " and returns a dictionary with the loss and model resulting from \n", 1604 | " the tuning process. The model is trained using the training \n", 1605 | " data and evaluated on the test data. The loss is computed as \n", 1606 | " the negative of the accuracy score.\n", 1607 | "\n", 1608 | " Parameters\n", 1609 | " ----------\n", 1610 | " space : Dict[str, Union[float, int]]\n", 1611 | " A dictionary of hyperparameters for the XGBoost classifier.\n", 1612 | " X_train : pd.DataFrame\n", 1613 | " The training data.\n", 1614 | " y_train : pd.Series\n", 1615 | " The training target.\n", 1616 | " X_test : pd.DataFrame\n", 1617 | " The test data.\n", 1618 | " y_test : pd.Series\n", 1619 | " The test target.\n", 1620 | " early_stopping_rounds : int, optional\n", 1621 | " The number of early stopping rounds to use. The default value \n", 1622 | " is 50.\n", 1623 | " metric : callable\n", 1624 | " Metric to maximize. Default is accuracy\n", 1625 | "\n", 1626 | " Returns\n", 1627 | " -------\n", 1628 | " Dict[str, Any]\n", 1629 | " A dictionary with the loss and model resulting from the \n", 1630 | " tuning process. 
The loss is a float, and the model is an \n", 1631 | " XGBoost classifier.\n", 1632 | " \"\"\"\n", 1633 | " int_vals = ['max_depth', 'reg_alpha']\n", 1634 | " space = {k: (int(val) if k in int_vals else val)\n", 1635 | " for k,val in space.items()}\n", 1636 | " space['early_stopping_rounds'] = early_stopping_rounds\n", 1637 | " model = xgb.XGBClassifier(**space)\n", 1638 | " evaluation = [(X_train, y_train),\n", 1639 | " (X_test, y_test)]\n", 1640 | " model.fit(X_train, y_train,\n", 1641 | " eval_set=evaluation, \n", 1642 | " verbose=False) \n", 1643 | " \n", 1644 | " pred = model.predict(X_test)\n", 1645 | " score = metric(y_test, pred)\n", 1646 | " return {'loss': -score, 'status': STATUS_OK, 'model': model}\n", 1647 | "```\n", 1648 | "\n", 1649 | "``` python\n", 1650 | "options = {'max_depth': hp.quniform('max_depth', 1, 8, 1), # tree\n", 1651 | " 'min_child_weight': hp.loguniform('min_child_weight', -2, 3),\n", 1652 | " 'subsample': hp.uniform('subsample', 0.5, 1), # stochastic\n", 1653 | " 'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),\n", 1654 | " 'reg_alpha': hp.uniform('reg_alpha', 0, 10),\n", 1655 | " 'reg_lambda': hp.uniform('reg_lambda', 1, 10),\n", 1656 | " 'gamma': hp.loguniform('gamma', -10, 10), # regularization\n", 1657 | " 'learning_rate': hp.loguniform('learning_rate', -7, 0), # boosting\n", 1658 | " 'random_state': 42\n", 1659 | "}\n", 1660 | "\n", 1661 | "trials = Trials()\n", 1662 | "best = fmin(fn=lambda space: hyperparameter_tuning(space, X_train, y_train, \n", 1663 | " X_test, y_test), \n", 1664 | " space=options, \n", 1665 | " algo=tpe.suggest, \n", 1666 | " max_evals=2_000, \n", 1667 | " trials=trials,\n", 1668 | " #timeout=60*5 # 5 minutes\n", 1669 | ")\n", 1670 | "```\n", 1671 | "\n", 1672 | "``` python\n", 1673 | "# 2 hours of training (paste best in here)\n", 1674 | "long_params = {'colsample_bytree': 0.6874845219014455, \n", 1675 | " 'gamma': 0.06936323554883501, \n", 1676 | " 'learning_rate': 0.21439214284976907, \n", 1677 | " 'max_depth': 6, \n", 1678 | " 'min_child_weight': 0.6678357091609912, \n", 1679 | " 'reg_alpha': 3.2979862933185546, \n", 1680 | " 'reg_lambda': 7.850943400390477, \n", 1681 | " 'subsample': 0.999767483950891}\n", 1682 | "```\n", 1683 | "\n", 1684 | "``` python\n", 1685 | "xg_ex = xgb.XGBClassifier(**long_params, early_stopping_rounds=50,\n", 1686 | " n_estimators=500)\n", 1687 | "xg_ex.fit(X_train, y_train,\n", 1688 | " eval_set=[(X_train, y_train),\n", 1689 | " (X_test, y_test)\n", 1690 | " ],\n", 1691 | " verbose=100\n", 1692 | " )\n", 1693 | "```\n", 1694 | "\n", 1695 | "``` pycon\n", 1696 | ">>> xg_ex.score(X_test, y_test)\n", 1697 | "0.7580110497237569\n", 1698 | "```\n", 1699 | "\n", 1700 | "### Defining Parameter Distributions\n", 1701 | "\n", 1702 | "``` pycon\n", 1703 | ">>> from hyperopt import hp, pyll\n", 1704 | ">>> pyll.stochastic.sample(hp.choice('value', ['a', 'b', 'c']))\n", 1705 | "'a'\n", 1706 | "```\n", 1707 | "\n", 1708 | "``` pycon\n", 1709 | ">>> pyll.stochastic.sample(hp.pchoice('value', [(.05, 'a'), (.9, 'b'), \n", 1710 | "... 
(.05, 'c')]))\n", 1711 | "'c'\n", 1712 | "```\n", 1713 | "\n", 1714 | "``` pycon\n", 1715 | ">>> from hyperopt import hp, pyll\n", 1716 | "\n", 1717 | ">>> pyll.stochastic.sample(hp.uniform('value', 0, 1))\n", 1718 | "0.7875384438202859\n", 1719 | "```\n", 1720 | "\n", 1721 | "``` python\n", 1722 | "uniform_vals = [pyll.stochastic.sample(hp.uniform('value', 0, 1)) \n", 1723 | " for _ in range(10_000)]\n", 1724 | "```\n", 1725 | "\n", 1726 | "``` python\n", 1727 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1728 | "ax.hist(uniform_vals)\n", 1729 | "```\n", 1730 | "\n", 1731 | "``` python\n", 1732 | "loguniform_vals = [pyll.stochastic.sample(hp.loguniform('value', -5, 5)) \n", 1733 | " for _ in range(10_000)]\n", 1734 | "```\n", 1735 | "\n", 1736 | "``` python\n", 1737 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1738 | "ax.hist(loguniform_vals)\n", 1739 | "```\n", 1740 | "\n", 1741 | "``` python\n", 1742 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1743 | "(pd.Series(np.arange(-5, 5, step=.1))\n", 1744 | " .rename('x')\n", 1745 | " .to_frame()\n", 1746 | " .assign(y=lambda adf:np.exp(adf.x))\n", 1747 | " .plot(x='x', y='y', ax=ax)\n", 1748 | ")\n", 1749 | "```\n", 1750 | "\n", 1751 | "``` pycon\n", 1752 | ">>> from hyperopt import hp, pyll\n", 1753 | ">>> from math import log\n", 1754 | ">>> pyll.stochastic.sample(hp.loguniform('value', log(.1), log(10)))\n", 1755 | "3.0090767867889174\n", 1756 | "```\n", 1757 | "\n", 1758 | "``` python\n", 1759 | "quniform_vals = [pyll.stochastic.sample(hp.quniform('value', -5, 5, q=2)) \n", 1760 | " for _ in range(10_000)]\n", 1761 | "```\n", 1762 | "\n", 1763 | "``` pycon\n", 1764 | ">>> pd.Series(quniform_vals).value_counts()\n", 1765 | "-0.0 2042\n", 1766 | "-2.0 2021\n", 1767 | " 2.0 2001\n", 1768 | " 4.0 2000\n", 1769 | "-4.0 1936\n", 1770 | "dtype: int64\n", 1771 | "```\n", 1772 | "\n", 1773 | "### Exploring the Trials\n", 1774 | "\n", 1775 | "``` python\n", 1776 | "from typing import Any, Dict, Sequence\n", 1777 | "def trial2df(trial: Sequence[Dict[str, Any]]) -> pd.DataFrame:\n", 1778 | " \"\"\"\n", 1779 | " Convert a Trial object (sequence of trial dictionaries)\n", 1780 | " to a Pandas DataFrame.\n", 1781 | "\n", 1782 | " Parameters\n", 1783 | " ----------\n", 1784 | " trial : List[Dict[str, Any]]\n", 1785 | " A list of trial dictionaries.\n", 1786 | "\n", 1787 | " Returns\n", 1788 | " -------\n", 1789 | " pd.DataFrame\n", 1790 | " A DataFrame with columns for the loss, trial id, and\n", 1791 | " values from each trial dictionary.\n", 1792 | " \"\"\"\n", 1793 | " vals = []\n", 1794 | " for t in trial:\n", 1795 | " result = t['result']\n", 1796 | " misc = t['misc']\n", 1797 | " val = {k:(v[0] if isinstance(v, list) else v) \n", 1798 | " for k,v in misc['vals'].items()\n", 1799 | " }\n", 1800 | " val['loss'] = result['loss']\n", 1801 | " val['tid'] = t['tid']\n", 1802 | " vals.append(val)\n", 1803 | " return pd.DataFrame(vals)\n", 1804 | "```\n", 1805 | "\n", 1806 | "``` pycon\n", 1807 | ">>> hyper2hr = trial2df(trials)\n", 1808 | "```\n", 1809 | "\n", 1810 | "``` pycon\n", 1811 | ">>> hyper2hr\n", 1812 | " colsample_bytree gamma learning_rate ... subsample loss \\\n", 1813 | "0 0.854670 2.753933 0.042056 ... 0.913247 -0.744751 \n", 1814 | "1 0.512653 0.153628 0.611973 ... 0.550048 -0.746961 \n", 1815 | "2 0.552569 1.010561 0.002412 ... 0.508593 -0.735912 \n", 1816 | "3 0.604020 682.836185 0.005037 ... 0.536935 -0.545856 \n", 1817 | "4 0.785281 0.004130 0.015200 ... 0.691211 -0.739227 \n", 1818 | "... ... ... ... ... ... ... 
\n", 1819 | "1995 0.717890 0.000543 0.141629 ... 0.893414 -0.765746 \n", 1820 | "1996 0.725305 0.000248 0.172854 ... 0.919415 -0.765746 \n", 1821 | "1997 0.698025 0.028484 0.162207 ... 0.952204 -0.770166 \n", 1822 | "1998 0.688053 0.068223 0.099814 ... 0.939489 -0.762431 \n", 1823 | "1999 0.666225 0.125253 0.203441 ... 0.980354 -0.767956 \n", 1824 | "\n", 1825 | " tid \n", 1826 | "0 0 \n", 1827 | "1 1 \n", 1828 | "2 2 \n", 1829 | "3 3 \n", 1830 | "4 4 \n", 1831 | "... ... \n", 1832 | "1995 1995 \n", 1833 | "1996 1996 \n", 1834 | "1997 1997 \n", 1835 | "1998 1998 \n", 1836 | "1999 1999 \n", 1837 | "\n", 1838 | "[2000 rows x 10 columns]\n", 1839 | "```\n", 1840 | "\n", 1841 | "``` python\n", 1842 | "import seaborn as sns\n", 1843 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1844 | "sns.heatmap(hyper2hr.corr(method='spearman'),\n", 1845 | " cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax\n", 1846 | ")\n", 1847 | "```\n", 1848 | "\n", 1849 | "``` python\n", 1850 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1851 | "(hyper2hr\n", 1852 | " .plot.scatter(x='tid', y='loss', alpha=.1, color='purple', ax=ax)\n", 1853 | ")\n", 1854 | "```\n", 1855 | "\n", 1856 | "``` python\n", 1857 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1858 | "(hyper2hr\n", 1859 | " .plot.scatter(x='max_depth', y='loss', alpha=1, color='purple', ax=ax)\n", 1860 | ")\n", 1861 | "```\n", 1862 | "\n", 1863 | "``` python\n", 1864 | "import numpy as np\n", 1865 | "\n", 1866 | "def jitter(df: pd.DataFrame, col: str, amount: float=1) -> pd.Series:\n", 1867 | " \"\"\"\n", 1868 | " Add random noise to the values in a Pandas DataFrame column.\n", 1869 | "\n", 1870 | " This function adds random noise to the values in a specified \n", 1871 | " column of a Pandas DataFrame. The noise is uniform random \n", 1872 | " noise with a range of `amount` centered around zero. The \n", 1873 | " function returns a Pandas Series with the jittered values.\n", 1874 | "\n", 1875 | " Parameters\n", 1876 | " ----------\n", 1877 | " df : pd.DataFrame\n", 1878 | " The input DataFrame.\n", 1879 | " col : str\n", 1880 | " The name of the column to jitter.\n", 1881 | " amount : float, optional\n", 1882 | " The range of the noise to add. 
The default value is 1.\n", 1883 | "\n", 1884 | " Returns\n", 1885 | " -------\n", 1886 | " pd.Series\n", 1887 | " A Pandas Series with the jittered values.\n", 1888 | " \"\"\"\n", 1889 | " vals = np.random.uniform(low=-amount/2, high=amount/2,\n", 1890 | " size=df.shape[0])\n", 1891 | " return df[col] + vals\n", 1892 | "```\n", 1893 | "\n", 1894 | "``` python\n", 1895 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1896 | "(hyper2hr\n", 1897 | " .assign(max_depth=lambda df:jitter(df, 'max_depth', amount=.8))\n", 1898 | " .plot.scatter(x='max_depth', y='loss', alpha=.1, color='purple', ax=ax)\n", 1899 | ")\n", 1900 | "```\n", 1901 | "\n", 1902 | "``` python\n", 1903 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1904 | "(hyper2hr\n", 1905 | " .assign(max_depth=lambda df:jitter(df, 'max_depth', amount=.8))\n", 1906 | " .plot.scatter(x='max_depth', y='loss', alpha=.5, \n", 1907 | " color='tid', cmap='viridis', ax=ax)\n", 1908 | ")\n", 1909 | "```\n", 1910 | "\n", 1911 | "``` python\n", 1912 | "import seaborn as sns\n", 1913 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1914 | "sns.violinplot(x='max_depth', y='loss', data=hyper2hr, kind='violin', ax=ax)\n", 1915 | "```\n", 1916 | "\n", 1917 | "``` python\n", 1918 | "\n", 1919 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1920 | "(hyper2hr\n", 1921 | " .plot.scatter(x='reg_alpha', y='colsample_bytree', alpha=.8,\n", 1922 | " color='tid', cmap='viridis', ax=ax)\n", 1923 | ")\n", 1924 | "\n", 1925 | "ax.annotate('Min Loss (-0.77)', xy=(4.56, 0.692),\n", 1926 | " xytext=(.7, .84), arrowprops={'color':'k'})\n", 1927 | "```\n", 1928 | "\n", 1929 | "``` python\n", 1930 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1931 | "(hyper2hr\n", 1932 | " .plot.scatter(x='gamma', y='loss', alpha=.1, color='purple', ax=ax)\n", 1933 | ")\n", 1934 | "```\n", 1935 | "\n", 1936 | "``` python\n", 1937 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 1938 | "(hyper2hr\n", 1939 | " .plot.scatter(x='gamma', y='loss', alpha=.5, color='tid', ax=ax, \n", 1940 | " logx=True, cmap='viridis')\n", 1941 | ")\n", 1942 | "\n", 1943 | "ax.annotate('Min Loss (-0.77)', xy=(0.000581, -0.777),\n", 1944 | " xytext=(1, -.6), arrowprops={'color':'k'})\n", 1945 | "```\n", 1946 | "\n", 1947 | "### EDA with Plotly\n", 1948 | "\n", 1949 | "``` python\n", 1950 | "import plotly.graph_objects as go\n", 1951 | "\n", 1952 | "def plot_3d_mesh(df: pd.DataFrame, x_col: str, y_col: str, \n", 1953 | " z_col: str) -> go.Figure:\n", 1954 | " \"\"\"\n", 1955 | " Create a 3D mesh plot using Plotly.\n", 1956 | "\n", 1957 | " This function creates a 3D mesh plot using Plotly, with \n", 1958 | " the `x_col`, `y_col`, and `z_col` columns of the `df` \n", 1959 | " DataFrame as the x, y, and z values, respectively. The \n", 1960 | " plot has a title and axis labels that match the column \n", 1961 | " names, and the intensity of the mesh is proportional \n", 1962 | " to the values in the `z_col` column. 
The function returns \n", 1963 | " a Plotly Figure object that can be displayed or saved as \n", 1964 | " desired.\n", 1965 | "\n", 1966 | " Parameters\n", 1967 | " ----------\n", 1968 | " df : pd.DataFrame\n", 1969 | " The DataFrame containing the data to plot.\n", 1970 | " x_col : str\n", 1971 | " The name of the column to use as the x values.\n", 1972 | " y_col : str\n", 1973 | " The name of the column to use as the y values.\n", 1974 | " z_col : str\n", 1975 | " The name of the column to use as the z values.\n", 1976 | "\n", 1977 | " Returns\n", 1978 | " -------\n", 1979 | " go.Figure\n", 1980 | " A Plotly Figure object with the 3D mesh plot.\n", 1981 | " \"\"\"\n", 1982 | " fig = go.Figure(data=[go.Mesh3d(x=df[x_col], y=df[y_col], z=df[z_col],\n", 1983 | " intensity=df[z_col]/ df[z_col].min(),\n", 1984 | " hovertemplate=f\"{z_col}: %{{z}}
<br>{x_col}: %{{x}}<br>
{y_col}: \"\n", 1985 | " \"%{{y}}\")],\n", 1986 | " )\n", 1987 | "\n", 1988 | " fig.update_layout( \n", 1989 | " title=dict(text=f'{y_col} vs {x_col}'),\n", 1990 | " scene = dict(\n", 1991 | " xaxis_title=x_col,\n", 1992 | " yaxis_title=y_col,\n", 1993 | " zaxis_title=z_col),\n", 1994 | " width=700,\n", 1995 | " margin=dict(r=20, b=10, l=10, t=50)\n", 1996 | " )\n", 1997 | " return fig\n", 1998 | "```\n", 1999 | "\n", 2000 | "``` python\n", 2001 | "fig = plot_3d_mesh(hyper2hr.query('gamma < .2'),\n", 2002 | " 'reg_lambda', 'gamma', 'loss')\n", 2003 | " \n", 2004 | "fig\n", 2005 | "```\n", 2006 | "\n", 2007 | "``` python\n", 2008 | "import plotly.express as px\n", 2009 | "import plotly.graph_objects as go\n", 2010 | "\n", 2011 | "def plot_3d_scatter(df: pd.DataFrame, x_col: str, y_col: str, \n", 2012 | " z_col: str, color_col: str, \n", 2013 | " opacity: float=1) -> go.Figure:\n", 2014 | " \"\"\"\n", 2015 | " Create a 3D scatter plot using Plotly Express.\n", 2016 | "\n", 2017 | " This function creates a 3D scatter plot using Plotly Express, \n", 2018 | " with the `x_col`, `y_col`, and `z_col` columns of the `df` \n", 2019 | " DataFrame as the x, y, and z values, respectively. The points \n", 2020 | " in the plot are colored according to the values in the \n", 2021 | " `color_col` column, using a continuous color scale. The \n", 2022 | " function returns a Plotly Express scatter_3d object that \n", 2023 | " can be displayed or saved as desired.\n", 2024 | "\n", 2025 | " Parameters\n", 2026 | " ----------\n", 2027 | " df : pd.DataFrame\n", 2028 | " The DataFrame containing the data to plot.\n", 2029 | " x_col : str\n", 2030 | " The name of the column to use as the x values.\n", 2031 | " y_col : str\n", 2032 | " The name of the column to use as the y values.\n", 2033 | " z_col : str\n", 2034 | " The name of the column to use as the z values.\n", 2035 | " color_col : str\n", 2036 | " The name of the column to use for coloring.\n", 2037 | " opacity : float\n", 2038 | " The opacity (alpha) of the points.\n", 2039 | "\n", 2040 | " Returns\n", 2041 | " -------\n", 2042 | " go.Figure\n", 2043 | " A Plotly Figure object with the 3D mesh plot.\n", 2044 | " \"\"\"\n", 2045 | " fig = px.scatter_3d(data_frame=df, x=x_col,\n", 2046 | " y=y_col, z=z_col, color=color_col,\n", 2047 | " color_continuous_scale=px.colors.sequential.Viridis_r,\n", 2048 | " opacity=opacity)\n", 2049 | " return fig\n", 2050 | "```\n", 2051 | "\n", 2052 | "``` python\n", 2053 | "plot_3d_scatter(hyper2hr.query('gamma < .2'), \n", 2054 | " 'reg_lambda', 'gamma', 'tid', color_col='loss')\n", 2055 | "```\n", 2056 | "\n", 2057 | "### Conclusion\n", 2058 | "\n", 2059 | "### Exercises\n", 2060 | "\n", 2061 | "## Step-wise Tuning with Hyperopt\n", 2062 | "\n", 2063 | "### Groups of Hyperparameters\n", 2064 | "\n", 2065 | "``` python\n", 2066 | "from hyperopt import fmin, tpe, hp, Trials\n", 2067 | "params = {'random_state': 42}\n", 2068 | "\n", 2069 | "rounds = [{'max_depth': hp.quniform('max_depth', 1, 8, 1), # tree\n", 2070 | " 'min_child_weight': hp.loguniform('min_child_weight', -2, 3)},\n", 2071 | " {'subsample': hp.uniform('subsample', 0.5, 1), # stochastic\n", 2072 | " 'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)},\n", 2073 | " {'reg_alpha': hp.uniform('reg_alpha', 0, 10),\n", 2074 | " 'reg_lambda': hp.uniform('reg_lambda', 1, 10),},\n", 2075 | " {'gamma': hp.loguniform('gamma', -10, 10)}, # regularization\n", 2076 | " {'learning_rate': hp.loguniform('learning_rate', -7, 0)} # boosting\n", 2077 | "]\n", 2078 | "\n", 
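"# Step-wise tuning: each pass through the loop below merges one group of\n",
"# related hyperparameters into `params`, so every round searches only the new\n",
"# group while reusing the best values already found for the earlier groups.\n",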
2079 | "all_trials = []\n", 2080 | "for round in rounds:\n", 2081 | " params = {**params, **round}\n", 2082 | " trials = Trials()\n", 2083 | " best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(space, X_train, \n", 2084 | " y_train, X_test, y_test), \n", 2085 | " space=params, \n", 2086 | " algo=tpe.suggest, \n", 2087 | " max_evals=20, \n", 2088 | " trials=trials,\n", 2089 | " )\n", 2090 | " params = {**params, **best}\n", 2091 | " all_trials.append(trials)\n", 2092 | "```\n", 2093 | "\n", 2094 | "### Visualization Hyperparameter Scores\n", 2095 | "\n", 2096 | "``` python\n", 2097 | "xhelp.plot_3d_mesh(xhelp.trial2df(all_trials[2]),\n", 2098 | " 'reg_alpha', 'reg_lambda', 'loss') \n", 2099 | "```\n", 2100 | "\n", 2101 | "### Training an Optimized Model\n", 2102 | "\n", 2103 | "``` python\n", 2104 | "step_params = {'random_state': 42,\n", 2105 | " 'max_depth': 5,\n", 2106 | " 'min_child_weight': 0.6411044640540848,\n", 2107 | " 'subsample': 0.9492383155577023,\n", 2108 | " 'colsample_bytree': 0.6235721099295888,\n", 2109 | " 'gamma': 0.00011273797329538491,\n", 2110 | " 'learning_rate': 0.24399020050740935}\n", 2111 | "```\n", 2112 | "\n", 2113 | "``` python\n", 2114 | "xg_step = xgb.XGBClassifier(**step_params, early_stopping_rounds=50,\n", 2115 | " n_estimators=500)\n", 2116 | "xg_step.fit(X_train, y_train,\n", 2117 | " eval_set=[(X_train, y_train),\n", 2118 | " (X_test, y_test)\n", 2119 | " ],\n", 2120 | " verbose=100\n", 2121 | " )\n", 2122 | "```\n", 2123 | "\n", 2124 | "``` pycon\n", 2125 | ">>> xg_step.score(X_test, y_test)\n", 2126 | "0.7613259668508288\n", 2127 | "```\n", 2128 | "\n", 2129 | "``` pycon\n", 2130 | ">>> xg_def = xgb.XGBClassifier()\n", 2131 | ">>> xg_def.fit(X_train, y_train)\n", 2132 | ">>> xg_def.score(X_test, y_test)\n", 2133 | "0.7458563535911602\n", 2134 | "```\n", 2135 | "\n", 2136 | "### Summary\n", 2137 | "\n", 2138 | "### Exercises\n", 2139 | "\n", 2140 | "## Do you have enough data?\n", 2141 | "\n", 2142 | "### Learning Curves\n", 2143 | "\n", 2144 | "``` python\n", 2145 | "params = {'learning_rate': 0.3,\n", 2146 | " 'max_depth': 2,\n", 2147 | " 'n_estimators': 200,\n", 2148 | " 'n_jobs': -1,\n", 2149 | " 'random_state': 42,\n", 2150 | " 'reg_lambda': 0,\n", 2151 | " 'subsample': 1}\n", 2152 | "```\n", 2153 | "\n", 2154 | "``` python\n", 2155 | "import yellowbrick.model_selection as ms\n", 2156 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2157 | "viz = ms.learning_curve(xgb.XGBClassifier(**params),\n", 2158 | " X, y, ax=ax\n", 2159 | ")\n", 2160 | "ax.set_ylim(0.6, 1)\n", 2161 | "```\n", 2162 | "\n", 2163 | "### Learning Curves for Decision Trees\n", 2164 | "\n", 2165 | "``` python\n", 2166 | "# tuned tree\n", 2167 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2168 | "viz = ms.learning_curve(tree.DecisionTreeClassifier(max_depth=7),\n", 2169 | " X, y, ax=ax)\n", 2170 | "viz.ax.set_ylim(0.6, 1)\n", 2171 | "```\n", 2172 | "\n", 2173 | "### Underfit Learning Curves\n", 2174 | "\n", 2175 | "``` python\n", 2176 | "# underfit\n", 2177 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2178 | "viz = ms.learning_curve(tree.DecisionTreeClassifier(max_depth=1),\n", 2179 | " X, y, ax=ax\n", 2180 | ")\n", 2181 | "ax.set_ylim(0.6, 1)\n", 2182 | "```\n", 2183 | "\n", 2184 | "### Overfit Learning Curves\n", 2185 | "\n", 2186 | "``` python\n", 2187 | "# overfit\n", 2188 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2189 | "viz = ms.learning_curve(tree.DecisionTreeClassifier(),\n", 2190 | " X, y, ax=ax\n", 2191 | ")\n", 2192 | "ax.set_ylim(0.6, 1)\n", 2193 | "```\n", 
2194 | "\n", 2195 | "### Summary\n", 2196 | "\n", 2197 | "### Exercises\n", 2198 | "\n", 2199 | "## Model Evaluation\n", 2200 | "\n", 2201 | "### Accuracy\n", 2202 | "\n", 2203 | "``` python\n", 2204 | "xgb_def = xgb.XGBClassifier()\n", 2205 | "xgb_def.fit(X_train, y_train)\n", 2206 | "```\n", 2207 | "\n", 2208 | "``` pycon\n", 2209 | ">>> xgb_def.score(X_test, y_test)\n", 2210 | "0.7458563535911602\n", 2211 | "```\n", 2212 | "\n", 2213 | "``` pycon\n", 2214 | ">>> from sklearn import metrics\n", 2215 | ">>> metrics.accuracy_score(y_test, xgb_def.predict(X_test))\n", 2216 | "0.7458563535911602\n", 2217 | "```\n", 2218 | "\n", 2219 | "### Confusion Matrix\n", 2220 | "\n", 2221 | "``` python\n", 2222 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2223 | "classifier.confusion_matrix(xgb_def, X_train, y_train,\n", 2224 | " X_test, y_test,\n", 2225 | " classes=['DS', 'SE'], ax=ax\n", 2226 | " )\n", 2227 | "```\n", 2228 | "\n", 2229 | "``` pycon\n", 2230 | ">>> from sklearn import metrics\n", 2231 | ">>> cm = metrics.confusion_matrix(y_test, xgb_def.predict(X_test))\n", 2232 | ">>> cm\n", 2233 | "array([[372, 122],\n", 2234 | " [108, 303]])\n", 2235 | "```\n", 2236 | "\n", 2237 | "``` python\n", 2238 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2239 | "disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, \n", 2240 | " display_labels=['DS', 'SE'])\n", 2241 | "disp.plot(ax=ax, cmap='Blues')\n", 2242 | "```\n", 2243 | "\n", 2244 | "``` python\n", 2245 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2246 | "cm = metrics.confusion_matrix(y_test, xgb_def.predict(X_test), \n", 2247 | " normalize='true')\n", 2248 | "disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, \n", 2249 | " display_labels=['DS', 'SE']) \n", 2250 | "disp.plot(ax=ax, cmap='Blues')\n", 2251 | "```\n", 2252 | "\n", 2253 | "### Precision and Recall\n", 2254 | "\n", 2255 | "``` pycon\n", 2256 | ">>> metrics.precision_score(y_test, xgb_def.predict(X_test))\n", 2257 | "0.7129411764705882\n", 2258 | "```\n", 2259 | "\n", 2260 | "``` pycon\n", 2261 | ">>> metrics.recall_score(y_test, xgb_def.predict(X_test))\n", 2262 | "0.7372262773722628\n", 2263 | "```\n", 2264 | "\n", 2265 | "``` python\n", 2266 | "from yellowbrick import classifier\n", 2267 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2268 | "classifier.precision_recall_curve(xgb_def, X_train, y_train,\n", 2269 | " X_test, y_test, micro=False, macro=False, ax=ax, per_class=True)\n", 2270 | "ax.set_ylim((0,1.05))\n", 2271 | "```\n", 2272 | "\n", 2273 | "### F1 Score\n", 2274 | "\n", 2275 | "``` pycon\n", 2276 | ">>> metrics.f1_score(y_test, xgb_def.predict(X_test))\n", 2277 | "0.7248803827751197\n", 2278 | "```\n", 2279 | "\n", 2280 | "``` pycon\n", 2281 | ">>> print(metrics.classification_report(y_test, \n", 2282 | "... 
y_pred=xgb_def.predict(X_test), target_names=['DS', 'SE']))\n", 2283 | " precision recall f1-score support\n", 2284 | "\n", 2285 | " DS 0.78 0.75 0.76 494\n", 2286 | " SE 0.71 0.74 0.72 411\n", 2287 | "\n", 2288 | " accuracy 0.75 905\n", 2289 | " macro avg 0.74 0.75 0.74 905\n", 2290 | "weighted avg 0.75 0.75 0.75 905\n", 2291 | "```\n", 2292 | "\n", 2293 | "``` python\n", 2294 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2295 | "classifier.classification_report(xgb_def, X_train, y_train,\n", 2296 | " X_test, y_test, classes=['DS', 'SE'],\n", 2297 | " micro=False, macro=False, ax=ax)\n", 2298 | "```\n", 2299 | "\n", 2300 | "### ROC Curve\n", 2301 | "\n", 2302 | "``` python\n", 2303 | "fig, ax = plt.subplots(figsize=(8,8))\n", 2304 | "metrics.RocCurveDisplay.from_estimator(xgb_def,\n", 2305 | " X_test, y_test,ax=ax, label='default')\n", 2306 | "metrics.RocCurveDisplay.from_estimator(xg_step,\n", 2307 | " X_test, y_test,ax=ax)\n", 2308 | "```\n", 2309 | "\n", 2310 | "``` python\n", 2311 | "fig, axes = plt.subplots(figsize=(8, 4), ncols=2)\n", 2312 | "metrics.RocCurveDisplay.from_estimator(xgb_def,\n", 2313 | " X_train, y_train,ax=axes[0], label='detault train')\n", 2314 | "metrics.RocCurveDisplay.from_estimator(xgb_def,\n", 2315 | " X_test, y_test,ax=axes[0])\n", 2316 | "axes[0].set(title='ROC plots for default model')\n", 2317 | "\n", 2318 | "metrics.RocCurveDisplay.from_estimator(xg_step,\n", 2319 | " X_train, y_train,ax=axes[1], label='step train')\n", 2320 | "metrics.RocCurveDisplay.from_estimator(xg_step,\n", 2321 | " X_test, y_test,ax=axes[1])\n", 2322 | "axes[1].set(title='ROC plots for stepwise model')\n", 2323 | "```\n", 2324 | "\n", 2325 | "### Threshold Metrics\n", 2326 | "\n", 2327 | "``` python\n", 2328 | "class ThresholdXGBClassifier(xgb.XGBClassifier):\n", 2329 | " def __init__(self, threshold=0.5, **kwargs):\n", 2330 | " super().__init__(**kwargs)\n", 2331 | " self.threshold = threshold\n", 2332 | "\n", 2333 | " def predict(self, X, *args, **kwargs):\n", 2334 | " \"\"\"Predict with `threshold` applied to predicted class probabilities.\n", 2335 | " \"\"\"\n", 2336 | " proba = self.predict_proba(X, *args, **kwargs)\n", 2337 | " return (proba[:, 1] > self.threshold).astype(int)\n", 2338 | "```\n", 2339 | "\n", 2340 | "``` pycon\n", 2341 | ">>> xgb_def = xgb.XGBClassifier()\n", 2342 | ">>> xgb_def.fit(X_train, y_train)\n", 2343 | ">>> xgb_def.predict_proba(X_test.iloc[[0]])\n", 2344 | "array([[0.14253652, 0.8574635 ]], dtype=float32)\n", 2345 | "```\n", 2346 | "\n", 2347 | "``` pycon\n", 2348 | ">>> xgb_def.predict(X_test.iloc[[0]])\n", 2349 | "array([1])\n", 2350 | "```\n", 2351 | "\n", 2352 | "``` pycon\n", 2353 | ">>> xgb90 = ThresholdXGBClassifier(threshold=.9, verbosity=0)\n", 2354 | ">>> xgb90.fit(X_train, y_train)\n", 2355 | ">>> xgb90.predict(X_test.iloc[[0]])\n", 2356 | "array([0])\n", 2357 | "```\n", 2358 | "\n", 2359 | "``` python\n", 2360 | "def get_tpr_fpr(probs, y_truth):\n", 2361 | " \"\"\"\n", 2362 | " Calculates true positive rate (TPR) and false positive rate\n", 2363 | " (FPR) given predicted probabilities and ground truth labels.\n", 2364 | "\n", 2365 | " Parameters:\n", 2366 | " probs (np.array): predicted probabilities of positive class\n", 2367 | " y_truth (np.array): ground truth labels\n", 2368 | "\n", 2369 | " Returns:\n", 2370 | " tuple: (tpr, fpr)\n", 2371 | " \"\"\"\n", 2372 | " tp = (probs == 1) & (y_truth == 1)\n", 2373 | " tn = (probs < 1) & (y_truth == 0)\n", 2374 | " fp = (probs == 1) & (y_truth == 0)\n", 2375 | " fn = (probs < 1) & (y_truth == 
1)\n", 2376 | " tpr = tp.sum() / (tp.sum() + fn.sum())\n", 2377 | " fpr = fp.sum() / (fp.sum() + tn.sum())\n", 2378 | " return tpr, fpr\n", 2379 | "\n", 2380 | "\n", 2381 | "vals = []\n", 2382 | "for thresh in np.arange(0, 1, step=.05):\n", 2383 | " probs = xg_step.predict_proba(X_test)[:, 1]\n", 2384 | " tpr, fpr = get_tpr_fpr(probs > thresh, y_test)\n", 2385 | " val = [thresh, tpr, fpr]\n", 2386 | " for metric in [metrics.accuracy_score, metrics.precision_score,\n", 2387 | " metrics.recall_score, metrics.f1_score, \n", 2388 | " metrics.roc_auc_score]:\n", 2389 | " val.append(metric(y_test, probs > thresh))\n", 2390 | " vals.append(val)\n", 2391 | "```\n", 2392 | "\n", 2393 | "``` python\n", 2394 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2395 | "(pd.DataFrame(vals, columns=['thresh', 'tpr/rec', 'fpr', 'acc', \n", 2396 | " 'prec', 'rec', 'f1', 'auc'])\n", 2397 | " .drop(columns='rec')\n", 2398 | " .set_index('thresh')\n", 2399 | " .plot(ax=ax, title='Threshold Metrics')\n", 2400 | ")\n", 2401 | "```\n", 2402 | "\n", 2403 | "### Cumulative Gains Curve\n", 2404 | "\n", 2405 | "``` python\n", 2406 | "import scikitplot\n", 2407 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2408 | "y_probs = xgb_def.predict_proba(X_test)\n", 2409 | "scikitplot.metrics.plot_cumulative_gain(y_test, y_probs, ax=ax)\n", 2410 | "ax.plot([0, (y_test == 1).mean(), 1], [0, 1, 1], label='Optimal Class 1')\n", 2411 | "ax.set_ylim(0, 1.05)\n", 2412 | "ax.annotate('Reach 60% of\\nClass 1\\nby contacting top 35%', xy=(.35, .6),\n", 2413 | " xytext=(.55,.25), arrowprops={'color':'k'})\n", 2414 | "ax.legend()\n", 2415 | "```\n", 2416 | "\n", 2417 | "### Lift Curves\n", 2418 | "\n", 2419 | "``` python\n", 2420 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2421 | "y_probs = xgb_def.predict_proba(X_test)\n", 2422 | "scikitplot.metrics.plot_lift_curve(y_test, y_probs, ax=ax)\n", 2423 | "mean = (y_test == 1).mean()\n", 2424 | "ax.plot([0, mean, 1], [1/mean, 1/mean, 1], label='Optimal Class 1')\n", 2425 | "ax.legend()\n", 2426 | "```\n", 2427 | "\n", 2428 | "### Summary\n", 2429 | "\n", 2430 | "### Exercises\n", 2431 | "\n", 2432 | "## Training For Different Metrics\n", 2433 | "\n", 2434 | "### Metric overview\n", 2435 | "\n", 2436 | "### Training with Validation Curves\n", 2437 | "\n", 2438 | "``` python\n", 2439 | "from yellowbrick import model_selection as ms\n", 2440 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2441 | "ms.validation_curve(xgb.XGBClassifier(), X_train, y_train,\n", 2442 | " scoring='accuracy', param_name='learning_rate', \n", 2443 | " param_range=[0.001, .01, .05, .1, .2, .5, .9, 1], ax=ax\n", 2444 | ")\n", 2445 | "ax.set_xlabel('Accuracy')\n", 2446 | "```\n", 2447 | "\n", 2448 | "``` python\n", 2449 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2450 | "ms.validation_curve(xgb.XGBClassifier(), X_train, y_train,\n", 2451 | " scoring='roc_auc', param_name='learning_rate',\n", 2452 | " param_range=[0.001, .01, .05, .1, .2, .5, .9, 1], ax=ax\n", 2453 | " )\n", 2454 | "ax.set_xlabel('roc_auc')\n", 2455 | "```\n", 2456 | "\n", 2457 | "### Step-wise Recall Tuning\n", 2458 | "\n", 2459 | "``` python\n", 2460 | "from sklearn.metrics import roc_auc_score\n", 2461 | "from hyperopt import hp, Trials, fmin, tpe\n", 2462 | "params = {'random_state': 42}\n", 2463 | "\n", 2464 | "rounds = [{'max_depth': hp.quniform('max_depth', 1, 9, 1), # tree\n", 2465 | " 'min_child_weight': hp.loguniform('min_child_weight', -2, 3)},\n", 2466 | " {'subsample': hp.uniform('subsample', 0.5, 1), # stochastic\n", 2467 | " 'colsample_bytree': 
hp.uniform('colsample_bytree', 0.5, 1)},\n", 2468 | " {'gamma': hp.loguniform('gamma', -10, 10)}, # regularization\n", 2469 | " {'learning_rate': hp.loguniform('learning_rate', -7, 0)} # boosting\n", 2470 | "]\n", 2471 | "\n", 2472 | "for round in rounds:\n", 2473 | " params = {**params, **round}\n", 2474 | " trials = Trials()\n", 2475 | " best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(\n", 2476 | " space, X_train, y_train, X_test, y_test, metric=roc_auc_score),\n", 2477 | " space=params, \n", 2478 | " algo=tpe.suggest, \n", 2479 | " max_evals=40, \n", 2480 | " trials=trials,\n", 2481 | " )\n", 2482 | " params = {**params, **best}\n", 2483 | "```\n", 2484 | "\n", 2485 | "``` pycon\n", 2486 | ">>> xgb_def = xgb.XGBClassifier()\n", 2487 | ">>> xgb_def.fit(X_train, y_train)\n", 2488 | ">>> metrics.roc_auc_score(y_test, xgb_def.predict(X_test))\n", 2489 | "0.7451313573096131\n", 2490 | "```\n", 2491 | "\n", 2492 | "``` pycon\n", 2493 | ">>> # the values from above training\n", 2494 | ">>> params = {'random_state': 42,\n", 2495 | "... 'max_depth': 4,\n", 2496 | "... 'min_child_weight': 4.808561584650579,\n", 2497 | "... 'subsample': 0.9265505972233746,\n", 2498 | "... 'colsample_bytree': 0.9870944989347749,\n", 2499 | "... 'gamma': 0.1383762861356536,\n", 2500 | "... 'learning_rate': 0.13664139307301595}\n", 2501 | "```\n", 2502 | "\n", 2503 | "``` pycon\n", 2504 | ">>> xgb_tuned = xgb.XGBClassifier(**params, early_stopping_rounds=50,\n", 2505 | "... n_estimators=500)\n", 2506 | ">>> xgb_tuned.fit(X_train, y_train, eval_set=[(X_train, y_train), \n", 2507 | "... (X_test, y_test)], verbose=100)\n", 2508 | "[0] validation_0-logloss:0.66207 validation_1-logloss:0.66289\n", 2509 | "[100] validation_0-logloss:0.44945 validation_1-logloss:0.49416\n", 2510 | "[150] validation_0-logloss:0.43196 validation_1-logloss:0.49833\n", 2511 | "XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,\n", 2512 | " colsample_bylevel=1, colsample_bynode=1,\n", 2513 | " colsample_bytree=0.9870944989347749, early_stopping_rounds=50,\n", 2514 | " enable_categorical=False, eval_metric=None, feature_types=None,\n", 2515 | " gamma=0.1383762861356536, gpu_id=-1, grow_policy='depthwise',\n", 2516 | " importance_type=None, interaction_constraints='',\n", 2517 | " learning_rate=0.13664139307301595, max_bin=256,\n", 2518 | " max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0,\n", 2519 | " max_depth=4, max_leaves=0, min_child_weight=4.808561584650579,\n", 2520 | " missing=nan, monotone_constraints='()', n_estimators=500,\n", 2521 | " n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=42, ...)\n", 2522 | "```\n", 2523 | "\n", 2524 | "``` pycon\n", 2525 | ">>> metrics.roc_auc_score(y_test, xgb_tuned.predict(X_test))\n", 2526 | "0.7629510328319394\n", 2527 | "```\n", 2528 | "\n", 2529 | "### Summary\n", 2530 | "\n", 2531 | "### Exercises\n", 2532 | "\n", 2533 | "## Model Interpretation\n", 2534 | "\n", 2535 | "### Logistic Regression Interpretation\n", 2536 | "\n", 2537 | "``` pycon\n", 2538 | ">>> from sklearn import linear_model, preprocessing\n", 2539 | ">>> std = preprocessing.StandardScaler()\n", 2540 | ">>> lr = linear_model.LogisticRegression(penalty=None)\n", 2541 | ">>> lr.fit(std.fit_transform(X_train), y_train)\n", 2542 | ">>> lr.score(std.transform(X_test), y_test)\n", 2543 | "0.7337016574585635\n", 2544 | "```\n", 2545 | "\n", 2546 | "``` pycon\n", 2547 | ">>> lr.coef_\n", 2548 | "array([[-1.56018160e-01, -4.01817103e-01, 6.01542610e-01,\n", 2549 | " -1.45213121e-01, 
-8.13849902e-02, -6.03727624e-01,\n", 2550 | " 3.11683777e-02, 3.16120596e-02, -3.14510213e-02,\n", 2551 | " -4.59272439e-04, -8.21683100e-03, -5.27737710e-02,\n", 2552 | " -4.48524110e-03, 1.01853988e-01, 3.49376790e-01,\n", 2553 | " -1.79149729e-01, 2.41389081e-02, -3.37424750e-01]])\n", 2554 | "```\n", 2555 | "\n", 2556 | "``` python\n", 2557 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2558 | "(pd.Series(lr.coef_[0], index=X_train.columns)\n", 2559 | " .sort_values()\n", 2560 | " .plot.barh(ax=ax)\n", 2561 | ")\n", 2562 | "```\n", 2563 | "\n", 2564 | "### Decision Tree Interpretation\n", 2565 | "\n", 2566 | "``` pycon\n", 2567 | ">>> tree7 = tree.DecisionTreeClassifier(max_depth=7)\n", 2568 | ">>> tree7.fit(X_train, y_train)\n", 2569 | ">>> tree7.score(X_test, y_test)\n", 2570 | "0.7337016574585635\n", 2571 | "```\n", 2572 | "\n", 2573 | "``` python\n", 2574 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2575 | "(pd.Series(tree7.feature_importances_, index=X_train.columns)\n", 2576 | " .sort_values()\n", 2577 | " .plot.barh(ax=ax)\n", 2578 | ")\n", 2579 | "```\n", 2580 | "\n", 2581 | "``` python\n", 2582 | "import dtreeviz\n", 2583 | "dt3 = tree.DecisionTreeClassifier(max_depth=3)\n", 2584 | "dt3.fit(X_train, y_train)\n", 2585 | "\n", 2586 | "viz = dtreeviz.model(dt3, X_train=X_train, y_train=y_train, \n", 2587 | " feature_names=list(X_train.columns), target_name='Job',\n", 2588 | " class_names=['DS', 'SE'])\n", 2589 | "viz.view()\n", 2590 | "```\n", 2591 | "\n", 2592 | "### XGBoost Feature Importance\n", 2593 | "\n", 2594 | "``` python\n", 2595 | "xgb_def = xgb.XGBClassifier()\n", 2596 | "xgb_def.fit(X_train, y_train)\n", 2597 | "```\n", 2598 | "\n", 2599 | "``` python\n", 2600 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2601 | "(pd.Series(xgb_def.feature_importances_, index=X_train.columns)\n", 2602 | " .sort_values()\n", 2603 | " .plot.barh(ax=ax)\n", 2604 | ")\n", 2605 | "```\n", 2606 | "\n", 2607 | "``` python\n", 2608 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2609 | "xgb.plot_importance(xgb_def, importance_type='cover', ax=ax)\n", 2610 | "```\n", 2611 | "\n", 2612 | "### Surrogate Models\n", 2613 | "\n", 2614 | "``` python\n", 2615 | "from sklearn import tree\n", 2616 | "\n", 2617 | "sur_reg_sk = tree.DecisionTreeRegressor(max_depth=4)\n", 2618 | "sur_reg_sk.fit(X_train, xgb_def.predict_proba(X_train)[:,-1])\n", 2619 | "```\n", 2620 | "\n", 2621 | "``` python\n", 2622 | "```\n", 2623 | "\n", 2624 | "### Summary\n", 2625 | "\n", 2626 | "### Exercises\n", 2627 | "\n", 2628 | "## xgbfir (Feature Interactions Reshaped)\n", 2629 | "\n", 2630 | "### Feature Interactions\n", 2631 | "\n", 2632 | "### xgbfir\n", 2633 | "\n", 2634 | "``` python\n", 2635 | "import xgbfir\n", 2636 | "xgbfir.saveXgbFI(xgb_def, feature_names=X_train.columns, OutputXlsxFile='fir.xlsx')\n", 2637 | "```\n", 2638 | "\n", 2639 | "``` pycon\n", 2640 | ">>> fir = pd.read_excel('fir.xlsx')\n", 2641 | ">>> print(fir\n", 2642 | "... .sort_values(by='Average Rank')\n", 2643 | "... .head()\n", 2644 | "... .round(1)\n", 2645 | "... )\n", 2646 | " Interaction Gain FScore ... Average Rank Average Tree Index \\\n", 2647 | "2 r 517.8 84 ... 3.3 44.6 \n", 2648 | "0 years_exp 597.0 627 ... 4.5 45.1 \n", 2649 | "5 education 296.0 254 ... 4.5 45.2 \n", 2650 | "1 compensation 518.5 702 ... 4.8 47.5 \n", 2651 | "4 major_cs 327.1 96 ... 
5.5 48.9 \n", 2652 | "\n", 2653 | " Average Tree Depth \n", 2654 | "2 2.6 \n", 2655 | "0 3.7 \n", 2656 | "5 3.3 \n", 2657 | "1 3.7 \n", 2658 | "4 3.6 \n", 2659 | "\n", 2660 | "[5 rows x 16 columns]\n", 2661 | "```\n", 2662 | "\n", 2663 | "``` pycon\n", 2664 | ">>> print(pd.read_excel('fir.xlsx', sheet_name='Interaction Depth 1').iloc[:20]\n", 2665 | "... .sort_values(by='Average Rank') \n", 2666 | "... .head(10) \n", 2667 | "... .round(1) \n", 2668 | "... ) \n", 2669 | " Interaction Gain FScore wFScore Average wFScore \\\n", 2670 | "1 education|years_exp 523.8 106 14.8 0.1 \n", 2671 | "0 major_cs|r 1210.8 15 5.4 0.4 \n", 2672 | "6 compensation|education 207.2 103 18.8 0.2 \n", 2673 | "11 age|education 133.2 80 27.2 0.3 \n", 2674 | "3 major_cs|years_exp 441.3 36 4.8 0.1 \n", 2675 | "5 age|years_exp 316.3 216 43.9 0.2 \n", 2676 | "4 age|compensation 344.7 219 38.8 0.2 \n", 2677 | "15 major_stat|years_exp 97.7 32 6.7 0.2 \n", 2678 | "14 education|r 116.5 14 4.6 0.3 \n", 2679 | "18 age|age 90.5 66 24.7 0.4 \n", 2680 | "\n", 2681 | " Average Gain Expected Gain Gain Rank FScore Rank wFScore Rank \\\n", 2682 | "1 4.9 77.9 2 5 8 \n", 2683 | "0 80.7 607.6 1 45 20 \n", 2684 | "6 2.0 34.0 7 6 7 \n", 2685 | "11 1.7 25.6 12 8 4 \n", 2686 | "3 12.3 108.2 4 20 25 \n", 2687 | "5 1.5 44.0 6 3 1 \n", 2688 | "4 1.6 30.6 5 2 2 \n", 2689 | "15 3.1 20.4 16 25 15 \n", 2690 | "14 8.3 72.3 15 52 27 \n", 2691 | "18 1.4 16.6 19 11 6 \n", 2692 | "\n", 2693 | " Avg wFScore Rank Avg Gain Rank Expected Gain Rank Average Rank \\\n", 2694 | "1 43 8 3 11.5 \n", 2695 | "0 8 1 1 12.7 \n", 2696 | "6 32 25 9 14.3 \n", 2697 | "11 12 40 13 14.8 \n", 2698 | "3 46 3 2 16.7 \n", 2699 | "5 26 57 7 16.7 \n", 2700 | "4 34 48 11 17.0 \n", 2701 | "15 24 14 14 18.0 \n", 2702 | "14 13 5 4 19.3 \n", 2703 | "18 7 62 16 20.2 \n", 2704 | "\n", 2705 | " Average Tree Index Average Tree Depth \n", 2706 | "1 38.0 3.5 \n", 2707 | "0 12.3 1.6 \n", 2708 | "6 50.6 3.7 \n", 2709 | "11 38.8 3.6 \n", 2710 | "3 29.2 3.2 \n", 2711 | "5 45.6 3.9 \n", 2712 | "4 48.9 3.9 \n", 2713 | "15 25.5 3.1 \n", 2714 | "14 40.4 2.4 \n", 2715 | "18 48.0 3.6 \n", 2716 | "```\n", 2717 | "\n", 2718 | "``` python\n", 2719 | "(X_train\n", 2720 | " .assign(software_eng=y_train)\n", 2721 | " .corr(method='spearman')\n", 2722 | " .loc[:, ['education', 'years_exp', 'major_cs', 'r', 'compensation', 'age']]\n", 2723 | " .style\n", 2724 | " .background_gradient(cmap='RdBu', vmin=-1, vmax=1)\n", 2725 | " .format('{:.2f}')\n", 2726 | ")\n", 2727 | "```\n", 2728 | "\n", 2729 | "``` python\n", 2730 | "import seaborn as sns\n", 2731 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2732 | "sns.heatmap(X_train \n", 2733 | " .assign(software_eng=y_train)\n", 2734 | " .corr(method='spearman')\n", 2735 | " .loc[:, ['age','education', 'years_exp', 'compensation', 'r', \n", 2736 | " 'major_cs', 'software_eng']],\n", 2737 | " cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax\n", 2738 | ")\n", 2739 | "```\n", 2740 | "\n", 2741 | "``` python\n", 2742 | "import seaborn.objects as so\n", 2743 | "fig = plt.figure(figsize=(8, 4))\n", 2744 | "(so\n", 2745 | " .Plot(X_train.assign(software_eng=y_train), x='years_exp', y='education', \n", 2746 | " color='software_eng')\n", 2747 | " .add(so.Dots(alpha=.9, pointsize=2), so.Jitter(x=.7, y=1))\n", 2748 | " .add(so.Line(), so.PolyFit())\n", 2749 | " .scale(color='viridis')\n", 2750 | " .on(fig) # not required unless saving to image\n", 2751 | " .plot() # ditto\n", 2752 | ")\n", 2753 | "```\n", 2754 | "\n", 2755 | "``` pycon\n", 2756 | ">>> 
print(X_train\n", 2757 | "... .assign(software_eng=y_train)\n", 2758 | "... .groupby(['software_eng', 'r', 'major_cs'])\n", 2759 | "... .age\n", 2760 | "... .count()\n", 2761 | "... .unstack()\n", 2762 | "... .unstack()\n", 2763 | "... )\n", 2764 | "major_cs 0 1 \n", 2765 | "r 0 1 0 1\n", 2766 | "software_eng \n", 2767 | "0 410 390 243 110\n", 2768 | "1 308 53 523 73\n", 2769 | "```\n", 2770 | "\n", 2771 | "``` pycon\n", 2772 | ">>> both = (X_train\n", 2773 | "... .assign(software_eng=y_train)\n", 2774 | "... )\n", 2775 | ">>> print(pd.crosstab(index=both.software_eng, columns=[both.major_cs, both.r]))\n", 2776 | "major_cs 0 1 \n", 2777 | "r 0 1 0 1\n", 2778 | "software_eng \n", 2779 | "0 410 390 243 110\n", 2780 | "1 308 53 523 73\n", 2781 | "```\n", 2782 | "\n", 2783 | "``` python\n", 2784 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 2785 | "grey = '#999999'\n", 2786 | "blue = '#16a2c6'\n", 2787 | "font = 'Roboto'\n", 2788 | "\n", 2789 | "data = (X_train\n", 2790 | " .assign(software_eng=y_train)\n", 2791 | " .groupby(['software_eng', 'r', 'major_cs'])\n", 2792 | " .age\n", 2793 | " .count()\n", 2794 | " .unstack()\n", 2795 | " .unstack())\n", 2796 | "\n", 2797 | "(data\n", 2798 | " .pipe(lambda adf: adf.iloc[:,-2:].plot(color=[grey,blue], linewidth=4, ax=ax, \n", 2799 | " legend=None) and adf)\n", 2800 | " .plot(color=[grey, blue, grey, blue], ax=ax, legend=None)\n", 2801 | ")\n", 2802 | "\n", 2803 | "ax.set_xticks([0, 1], ['Data Scientist', 'Software Engineer'], font=font, size=12, \n", 2804 | " weight=600)\n", 2805 | "ax.set_yticks([])\n", 2806 | "ax.set_xlabel('')\n", 2807 | "ax.text(x=0, y=.93, s=\"Count Data Scientist or Software Engineer by R/CS\", \n", 2808 | " transform=fig.transFigure, ha='left', font=font, fontsize=10, weight=1000)\n", 2809 | "ax.text(x=0, y=.83, s=\"(Studied CS) Thick lines\\n(R) Blue\", transform=fig.transFigure, \n", 2810 | " ha='left', font=font, fontsize=10, weight=300)\n", 2811 | "for side in 'left,top,right,bottom'.split(','):\n", 2812 | " ax.spines[side].set_visible(False) \n", 2813 | "# labels\n", 2814 | "for left,txt in zip(data.iloc[0], ['Other/No R', 'Other/R', 'CS/No R', 'CS/R']):\n", 2815 | " ax.text(x=-.02, y=left, s=f'{txt} ({left})', ha='right', va='center', \n", 2816 | " font=font, weight=300)\n", 2817 | "for right,txt in zip(data.iloc[1], ['Other/No R', 'Other/R', 'CS/No R', 'CS/R']):\n", 2818 | " ax.text(x=1.02, y=right, s=f'{txt} ({right})', ha='left', va='center', \n", 2819 | " font=font, weight=300)\n", 2820 | "```\n", 2821 | "\n", 2822 | "### Deeper Interactions\n", 2823 | "\n", 2824 | "``` pycon\n", 2825 | ">>> print(pd.read_excel('fir.xlsx', sheet_name='Interaction Depth 2').iloc[:20]\n", 2826 | "... .sort_values(by='Average Rank') \n", 2827 | "... .head(5) \n", 2828 | "... ) \n", 2829 | " Interaction Gain FScore ... Average Rank \\\n", 2830 | "0 major_cs|r|years_exp 1842.711375 17 ... 12.000000 \n", 2831 | "7 age|education|years_exp 267.537987 53 ... 15.666667 \n", 2832 | "13 age|compensation|education 154.313245 55 ... 15.833333 \n", 2833 | "2 compensation|education|years_exp 431.541357 91 ... 17.166667 \n", 2834 | "14 education|r|years_exp 145.534591 17 ... 
19.000000 \n", 2835 | "\n", 2836 | " Average Tree Index Average Tree Depth \n", 2837 | "0 2.588235 2.117647 \n", 2838 | "7 31.452830 3.981132 \n", 2839 | "13 47.381818 3.800000 \n", 2840 | "2 47.175824 4.010989 \n", 2841 | "14 34.352941 2.588235 \n", 2842 | "\n", 2843 | "[5 rows x 16 columns]\n", 2844 | "```\n", 2845 | "\n", 2846 | "### Specifying Feature Interactions\n", 2847 | "\n", 2848 | "``` python\n", 2849 | "constraints = [['education', 'years_exp'], ['major_cs', 'r'],\n", 2850 | " ['compensation', 'education'], ['age', 'education'],\n", 2851 | " ['major_cs', 'years_exp'], ['age', 'years_exp'],\n", 2852 | " ['age', 'compensation'], ['major_stat', 'years_exp'],\n", 2853 | "]\n", 2854 | "```\n", 2855 | "\n", 2856 | "``` python\n", 2857 | "def flatten(seq):\n", 2858 | " res = []\n", 2859 | " for sub in seq:\n", 2860 | " res.extend(sub)\n", 2861 | " return res\n", 2862 | "\n", 2863 | "\n", 2864 | "small_cols = sorted(set(flatten(constraints)))\n", 2865 | "```\n", 2866 | "\n", 2867 | "``` pycon\n", 2868 | ">>> print(small_cols)\n", 2869 | "['age', 'compensation', 'education', 'major_cs', 'major_stat', 'r', 'years_exp']\n", 2870 | "```\n", 2871 | "\n", 2872 | "``` pycon\n", 2873 | ">>> xg_constraints = xgb.XGBClassifier(interaction_constraints=constraints)\n", 2874 | ">>> xg_constraints.fit(X_train.loc[:, small_cols], y_train)\n", 2875 | ">>> xg_constraints.score(X_test.loc[:, small_cols], y_test)\n", 2876 | "\n", 2877 | "0.7259668508287292\n", 2878 | "```\n", 2879 | "\n", 2880 | "``` python\n", 2881 | "my_dot_export(xg_constraints, num_trees=0, filename='img/constrains0_xg.dot', \n", 2882 | " title='First Constrained Tree') \n", 2883 | "```\n", 2884 | "\n", 2885 | "### Summary\n", 2886 | "\n", 2887 | "### Exercises\n", 2888 | "\n", 2889 | "## Exploring SHAP\n", 2890 | "\n", 2891 | "### SHAP\n", 2892 | "\n", 2893 | "``` python\n", 2894 | "step_params = {'random_state': 42,\n", 2895 | " 'max_depth': 5,\n", 2896 | " 'min_child_weight': 0.6411044640540848,\n", 2897 | " 'subsample': 0.9492383155577023,\n", 2898 | " 'colsample_bytree': 0.6235721099295888,\n", 2899 | " 'gamma': 0.00011273797329538491,\n", 2900 | " 'learning_rate': 0.24399020050740935}\n", 2901 | "xg_step = xgb.XGBClassifier(**step_params, early_stopping_rounds=50,\n", 2902 | " n_estimators=500)\n", 2903 | "xg_step.fit(X_train, y_train,\n", 2904 | " eval_set=[(X_train, y_train),\n", 2905 | " (X_test, y_test)\n", 2906 | " ]\n", 2907 | " )\n", 2908 | "```\n", 2909 | "\n", 2910 | "``` python\n", 2911 | "import shap\n", 2912 | "shap.initjs()\n", 2913 | "\n", 2914 | "shap_ex = shap.TreeExplainer(xg_step)\n", 2915 | "vals = shap_ex(X_test)\n", 2916 | "```\n", 2917 | "\n", 2918 | "``` pycon\n", 2919 | ">>> shap_df = pd.DataFrame(vals.values, columns=X_test.columns)\n", 2920 | ">>> print(shap_df)\n", 2921 | " age education years_exp compensation python r \\\n", 2922 | "0 0.426614 0.390184 -0.246353 0.145825 -0.034680 0.379261 \n", 2923 | "1 0.011164 -0.131144 -0.292135 -0.014521 0.016003 -1.043464 \n", 2924 | "2 -0.218063 -0.140705 -0.411293 0.048281 0.424516 0.487451 \n", 2925 | "3 -0.015227 -0.299068 -0.426323 -0.205840 -0.125867 0.320594 \n", 2926 | "4 -0.468785 -0.200953 -0.230639 0.064272 0.021362 0.355619 \n", 2927 | ".. ... ... ... ... ... ... 
\n", 2928 | "900 0.268237 -0.112710 0.330096 -0.209942 0.012074 -1.144335 \n", 2929 | "901 0.154642 0.572190 -0.227121 0.448253 -0.057847 0.290381 \n", 2930 | "902 0.079129 -0.095771 1.136799 0.150705 0.133260 0.484103 \n", 2931 | "903 -0.206584 0.430074 -0.385100 -0.078808 -0.083052 -0.992487 \n", 2932 | "904 0.007351 0.589351 1.485712 0.056398 -0.047231 0.373149 \n", 2933 | "\n", 2934 | " sql Q1_Male Q1_Female Q1_Prefer not to say \\\n", 2935 | "0 -0.019017 0.004868 0.000877 0.002111 \n", 2936 | "1 0.020524 0.039019 0.047712 0.001010 \n", 2937 | "2 -0.098703 -0.004710 0.063545 0.000258 \n", 2938 | "3 -0.062712 0.019110 0.012257 0.002184 \n", 2939 | "4 -0.083344 -0.017202 0.002754 0.001432 \n", 2940 | ".. ... ... ... ... \n", 2941 | "900 -0.065815 0.028274 0.032291 0.001012 \n", 2942 | "901 -0.069114 0.006243 0.007443 0.002198 \n", 2943 | "902 -0.120819 0.012034 0.057516 0.000266 \n", 2944 | "903 -0.088811 0.080561 0.028648 0.000876 \n", 2945 | "904 -0.105290 0.029283 0.074762 0.001406 \n", 2946 | "\n", 2947 | " Q1_Prefer to self-describe Q3_United States of America Q3_India \\\n", 2948 | "0 0.0 0.033738 -0.117918 \n", 2949 | "1 0.0 0.068171 0.086444 \n", 2950 | "2 0.0 0.005533 -0.105534 \n", 2951 | "3 0.0 -0.000044 0.042814 \n", 2952 | "4 0.0 0.035772 -0.073206 \n", 2953 | ".. ... ... ... \n", 2954 | "900 0.0 -0.086408 0.136677 \n", 2955 | "901 0.0 -0.074364 0.115520 \n", 2956 | "902 0.0 0.103810 -0.097848 \n", 2957 | "903 0.0 0.045213 0.066553 \n", 2958 | "904 0.0 -0.031587 0.117050 \n", 2959 | "\n", 2960 | " Q3_China major_cs major_other major_eng major_stat \n", 2961 | "0 -0.018271 0.369876 0.014006 -0.013465 0.104177 \n", 2962 | "1 -0.026271 -0.428484 -0.064157 -0.026041 0.069931 \n", 2963 | "2 -0.010548 -0.333695 0.016919 -0.026932 -0.591922 \n", 2964 | "3 -0.024099 0.486864 0.038438 -0.013727 0.047564 \n", 2965 | "4 -0.022188 0.324419 0.012664 -0.019550 0.093926 \n", 2966 | ".. ... ... ... ... ... \n", 2967 | "900 0.310404 -0.407444 -0.013195 -0.026412 -0.484734 \n", 2968 | "901 -0.008244 0.602087 0.039680 -0.012820 0.083934 \n", 2969 | "902 0.003234 -0.313785 -0.080046 -0.066032 0.101975 \n", 2970 | "903 -0.031448 -0.524141 -0.048108 -0.007185 0.093196 \n", 2971 | "904 0.008734 -0.505613 -0.159411 -0.067388 0.126560 \n", 2972 | "\n", 2973 | "[905 rows x 18 columns]\n", 2974 | " age education years_exp compensation python r \\\n", 2975 | "0 0.426614 0.390184 -0.246353 0.145825 -0.034680 0.379261 \n", 2976 | "1 0.011164 -0.131144 -0.292135 -0.014521 0.016003 -1.043464 \n", 2977 | "2 -0.218063 -0.140705 -0.411293 0.048281 0.424516 0.487451 \n", 2978 | "3 -0.015227 -0.299068 -0.426323 -0.205840 -0.125867 0.320594 \n", 2979 | "4 -0.468785 -0.200953 -0.230639 0.064272 0.021362 0.355619 \n", 2980 | ".. ... ... ... ... ... ... \n", 2981 | "900 0.268237 -0.112710 0.330096 -0.209942 0.012074 -1.144335 \n", 2982 | "901 0.154642 0.572190 -0.227121 0.448253 -0.057847 0.290381 \n", 2983 | "902 0.079129 -0.095771 1.136799 0.150705 0.133260 0.484103 \n", 2984 | "903 -0.206584 0.430074 -0.385100 -0.078808 -0.083052 -0.992487 \n", 2985 | "904 0.007351 0.589351 1.485712 0.056398 -0.047231 0.373149 \n", 2986 | "\n", 2987 | " sql Q1_Male Q1_Female Q1_Prefer not to say \\\n", 2988 | "0 -0.019017 0.004868 0.000877 0.002111 \n", 2989 | "1 0.020524 0.039019 0.047712 0.001010 \n", 2990 | "2 -0.098703 -0.004710 0.063545 0.000258 \n", 2991 | "3 -0.062712 0.019110 0.012257 0.002184 \n", 2992 | "4 -0.083344 -0.017202 0.002754 0.001432 \n", 2993 | ".. ... ... ... ... 
\n", 2994 | "900 -0.065815 0.028274 0.032291 0.001012 \n", 2995 | "901 -0.069114 0.006243 0.007443 0.002198 \n", 2996 | "902 -0.120819 0.012034 0.057516 0.000266 \n", 2997 | "903 -0.088811 0.080561 0.028648 0.000876 \n", 2998 | "904 -0.105290 0.029283 0.074762 0.001406 \n", 2999 | "\n", 3000 | " Q1_Prefer to self-describe Q3_United States of America Q3_India \\\n", 3001 | "0 0.0 0.033738 -0.117918 \n", 3002 | "1 0.0 0.068171 0.086444 \n", 3003 | "2 0.0 0.005533 -0.105534 \n", 3004 | "3 0.0 -0.000044 0.042814 \n", 3005 | "4 0.0 0.035772 -0.073206 \n", 3006 | ".. ... ... ... \n", 3007 | "900 0.0 -0.086408 0.136677 \n", 3008 | "901 0.0 -0.074364 0.115520 \n", 3009 | "902 0.0 0.103810 -0.097848 \n", 3010 | "903 0.0 0.045213 0.066553 \n", 3011 | "904 0.0 -0.031587 0.117050 \n", 3012 | "\n", 3013 | " Q3_China major_cs major_other major_eng major_stat \n", 3014 | "0 -0.018271 0.369876 0.014006 -0.013465 0.104177 \n", 3015 | "1 -0.026271 -0.428484 -0.064157 -0.026041 0.069931 \n", 3016 | "2 -0.010548 -0.333695 0.016919 -0.026932 -0.591922 \n", 3017 | "3 -0.024099 0.486864 0.038438 -0.013727 0.047564 \n", 3018 | "4 -0.022188 0.324419 0.012664 -0.019550 0.093926 \n", 3019 | ".. ... ... ... ... ... \n", 3020 | "900 0.310404 -0.407444 -0.013195 -0.026412 -0.484734 \n", 3021 | "901 -0.008244 0.602087 0.039680 -0.012820 0.083934 \n", 3022 | "902 0.003234 -0.313785 -0.080046 -0.066032 0.101975 \n", 3023 | "903 -0.031448 -0.524141 -0.048108 -0.007185 0.093196 \n", 3024 | "904 0.008734 -0.505613 -0.159411 -0.067388 0.126560 \n", 3025 | "\n", 3026 | "[905 rows x 18 columns]\n", 3027 | "```\n", 3028 | "\n", 3029 | "``` pycon\n", 3030 | ">>> print(pd.concat([shap_df.sum(axis='columns')\n", 3031 | "... .rename('pred') + vals.base_values,\n", 3032 | "... pd.Series(y_test, name='true')], axis='columns')\n", 3033 | "... .assign(prob=lambda adf: (np.exp(adf.pred) / \n", 3034 | "... (1 + np.exp(adf.pred))))\n", 3035 | "... ) \n", 3036 | " pred true prob\n", 3037 | "0 1.204692 1 0.769358\n", 3038 | "1 -2.493559 0 0.076311\n", 3039 | "2 -2.205473 0 0.099260\n", 3040 | "3 -0.843847 1 0.300725\n", 3041 | "4 -0.168726 1 0.457918\n", 3042 | ".. ... ... ...\n", 3043 | "900 -1.698727 0 0.154632\n", 3044 | "901 1.957872 0 0.876302\n", 3045 | "902 0.786588 0 0.687098\n", 3046 | "903 -2.299702 0 0.091148\n", 3047 | "904 1.497035 1 0.817132\n", 3048 | "\n", 3049 | "[905 rows x 3 columns]\n", 3050 | "```\n", 3051 | "\n", 3052 | "### Examining a Single Prediction\n", 3053 | "\n", 3054 | "``` pycon\n", 3055 | ">>> X_test.iloc[0]\n", 3056 | "age 22.0\n", 3057 | "education 16.0\n", 3058 | "years_exp 1.0\n", 3059 | "compensation 0.0\n", 3060 | "python 1.0\n", 3061 | "r 0.0\n", 3062 | "sql 0.0\n", 3063 | "Q1_Male 1.0\n", 3064 | "Q1_Female 0.0\n", 3065 | "Q1_Prefer not to say 0.0\n", 3066 | "Q1_Prefer to self-describe 0.0\n", 3067 | "Q3_United States of America 0.0\n", 3068 | "Q3_India 1.0\n", 3069 | "Q3_China 0.0\n", 3070 | "major_cs 1.0\n", 3071 | "major_other 0.0\n", 3072 | "major_eng 0.0\n", 3073 | "major_stat 0.0\n", 3074 | "Name: 7894, dtype: float64\n", 3075 | "```\n", 3076 | "\n", 3077 | "``` pycon\n", 3078 | ">>> # predicts software engineer... 
why?\n", 3079 | ">>> xg_step.predict(X_test.iloc[[0]]) \n", 3080 | "array([1])\n", 3081 | "```\n", 3082 | "\n", 3083 | "``` pycon\n", 3084 | ">>> # ground truth\n", 3085 | ">>> y_test[0]\n", 3086 | "1\n", 3087 | "```\n", 3088 | "\n", 3089 | "``` pycon\n", 3090 | ">>> # Since this is below zero, the default is Data Scientist\n", 3091 | ">>> shap_ex.expected_value\n", 3092 | "-0.2166416\n", 3093 | "```\n", 3094 | "\n", 3095 | "``` pycon\n", 3096 | ">>> # > 0 therefore ... Software Engineer\n", 3097 | ">>> shap_ex.expected_value + vals.values[0].sum()\n", 3098 | "1.2046916\n", 3099 | "```\n", 3100 | "\n", 3101 | "### Waterfall Plots\n", 3102 | "\n", 3103 | "``` python\n", 3104 | "fig = plt.figure(figsize=(8, 4))\n", 3105 | "shap.plots.waterfall(vals[0], show=False)\n", 3106 | "```\n", 3107 | "\n", 3108 | "``` python\n", 3109 | "def plot_histograms(df, columns, row=None, title='', color='shap'):\n", 3110 | " \"\"\"\n", 3111 | " Parameters\n", 3112 | " ----------\n", 3113 | " df : pandas.DataFrame\n", 3114 | " The DataFrame to plot histograms for.\n", 3115 | " columns : list of str\n", 3116 | " The names of the columns to plot histograms for.\n", 3117 | " row : pandas.Series, optional\n", 3118 | " A row of data to plot a vertical line for.\n", 3119 | " title : str, optional\n", 3120 | " The title to use for the figure.\n", 3121 | " color : str, optional\n", 3122 | " 'shap' - color positive values red. Negative blue\n", 3123 | " 'mean' - above mean red. Below blue.\n", 3124 | " None - black\n", 3125 | "\n", 3126 | " Returns\n", 3127 | " -------\n", 3128 | " matplotlib.figure.Figure\n", 3129 | " The figure object containing the histogram plots. \n", 3130 | " \"\"\"\n", 3131 | " red = '#ff0051'\n", 3132 | " blue = '#008bfb'\n", 3133 | "\n", 3134 | " fig, ax = plt.subplots(figsize=(8, 4))\n", 3135 | " hist = (df\n", 3136 | " [columns]\n", 3137 | " .hist(ax=ax, color='#bbb')\n", 3138 | " )\n", 3139 | " fig = hist[0][0].get_figure()\n", 3140 | " if row is not None:\n", 3141 | " name2ax = {ax.get_title():ax for ax in fig.axes}\n", 3142 | " pos, neg = red, blue\n", 3143 | " if color is None:\n", 3144 | " pos, neg = 'black', 'black'\n", 3145 | " for column in columns:\n", 3146 | " if color == 'mean':\n", 3147 | " mid = df[column].mean()\n", 3148 | " else:\n", 3149 | " mid = 0\n", 3150 | " if row[column] > mid:\n", 3151 | " c = pos\n", 3152 | " else:\n", 3153 | " c = neg\n", 3154 | " name2ax[column].axvline(row[column], c=c)\n", 3155 | " fig.tight_layout()\n", 3156 | " fig.suptitle(title)\n", 3157 | " return fig \n", 3158 | "```\n", 3159 | "\n", 3160 | "``` python\n", 3161 | "features = ['education', 'r', 'major_cs', 'age', 'years_exp', \n", 3162 | " 'compensation']\n", 3163 | "fig = plot_histograms(shap_df, features, shap_df.iloc[0], \n", 3164 | " title='SHAP values for row 0')\n", 3165 | "```\n", 3166 | "\n", 3167 | "``` python\n", 3168 | "fig = plot_histograms(X_test, features, X_test.iloc[0], \n", 3169 | " title='Values for row 0', color='mean')\n", 3170 | "```\n", 3171 | "\n", 3172 | "``` python\n", 3173 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3174 | "(pd.Series(vals.values[0], index=X_test.columns)\n", 3175 | " .sort_values(key=np.abs)\n", 3176 | " .plot.barh(ax=ax)\n", 3177 | ")\n", 3178 | "```\n", 3179 | "\n", 3180 | "### A Force Plot\n", 3181 | "\n", 3182 | "``` python\n", 3183 | "# use matplotlib if having js issues\n", 3184 | "# blue - DS\n", 3185 | "# red - Software Engineer\n", 3186 | "# to save need both matplotlib=True, show=False\n", 3187 | "res = 
shap.plots.force(base_value=vals.base_values, \n", 3188 | " shap_values=vals.values[0,:], features=X_test.iloc[0], \n", 3189 | " matplotlib=True, show=False\n", 3190 | ")\n", 3191 | "res.savefig('img/shap_forceplot0.png', dpi=600, bbox_inches='tight')\n", 3192 | "```\n", 3193 | "\n", 3194 | "### Force Plot with Multiple Predictions\n", 3195 | "\n", 3196 | "``` python\n", 3197 | "# First n values\n", 3198 | "n = 100\n", 3199 | "# blue - DS\n", 3200 | "# red - Software Engineer\n", 3201 | "shap.plots.force(base_value=vals.base_values, \n", 3202 | " shap_values=vals.values[:n,:], features=X_test.iloc[:n], \n", 3203 | " )\n", 3204 | "```\n", 3205 | "\n", 3206 | "### Understanding Features with Dependence Plots\n", 3207 | "\n", 3208 | "``` python\n", 3209 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3210 | "shap.plots.scatter(vals[:, 'education'], ax=ax, color=vals, \n", 3211 | " x_jitter=0, hist=False)\n", 3212 | "```\n", 3213 | "\n", 3214 | "### Jittering a Dependence Plot\n", 3215 | "\n", 3216 | "``` python\n", 3217 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3218 | "shap.plots.scatter(vals[:, 'education'], ax=ax, color=vals[:, 'years_exp'], x_jitter=1,\n", 3219 | " alpha=.5)\n", 3220 | "```\n", 3221 | "\n", 3222 | "``` python\n", 3223 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3224 | "shap.plots.scatter(vals[:, 'major_cs'], ax=ax, color=vals[:, 'r'], alpha=.5)\n", 3225 | "```\n", 3226 | "\n", 3227 | "### Heatmaps and Correlations\n", 3228 | "\n", 3229 | "``` python\n", 3230 | "import seaborn as sns\n", 3231 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3232 | "sns.heatmap(X_test \n", 3233 | " .assign(software_eng=y_test)\n", 3234 | " .corr(method='spearman')\n", 3235 | " .loc[:, ['age', 'education', 'years_exp',\n", 3236 | " 'compensation', 'r', 'major_cs', \n", 3237 | " 'software_eng']],\n", 3238 | " cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax\n", 3239 | ")\n", 3240 | "```\n", 3241 | "\n", 3242 | "``` python\n", 3243 | "import seaborn as sns\n", 3244 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3245 | "sns.heatmap(shap_df \n", 3246 | " .assign(software_eng=y_test)\n", 3247 | " .corr(method='spearman')\n", 3248 | " .loc[:, ['age', 'education', 'years_exp', 'compensation', 'r', 'major_cs',\n", 3249 | " 'software_eng']],\n", 3250 | " cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax\n", 3251 | ")\n", 3252 | "```\n", 3253 | "\n", 3254 | "### Beeswarm Plots of Global Behavior\n", 3255 | "\n", 3256 | "``` python\n", 3257 | "fig = plt.figure(figsize=(8, 4))\n", 3258 | "shap.plots.beeswarm(vals)\n", 3259 | "```\n", 3260 | "\n", 3261 | "``` python\n", 3262 | "from matplotlib import cm\n", 3263 | "fig = plt.figure(figsize=(8, 4))\n", 3264 | "shap.plots.beeswarm(vals, max_display=len(X_test.columns), color=cm.autumn_r)\n", 3265 | "```\n", 3266 | "\n", 3267 | "### SHAP with No Interaction\n", 3268 | "\n", 3269 | "``` python\n", 3270 | "no_int_params = {'random_state': 42,\n", 3271 | " 'max_depth': 1\n", 3272 | "}\n", 3273 | "xg_no_int = xgb.XGBClassifier(**no_int_params, early_stopping_rounds=50,\n", 3274 | " n_estimators=500)\n", 3275 | "xg_no_int.fit(X_train, y_train,\n", 3276 | " eval_set=[(X_train, y_train),\n", 3277 | " (X_test, y_test)\n", 3278 | " ]\n", 3279 | ")\n", 3280 | "```\n", 3281 | "\n", 3282 | "``` pycon\n", 3283 | ">>> xg_no_int.score(X_test, y_test)\n", 3284 | "0.7370165745856354\n", 3285 | "```\n", 3286 | "\n", 3287 | "``` python\n", 3288 | "shap_ind = shap.TreeExplainer(xg_no_int)\n", 3289 | "shap_ind_vals = shap_ind(X_test)\n", 3290 | "```\n", 3291 | 
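"\n",
"A quick sanity check (a sketch, not from the book): because `xg_no_int` uses `max_depth=1`,\n",
"every tree is a stump that splits on a single feature, so the model is purely additive and a\n",
"feature's SHAP value should depend only on that feature's own value. Grouping the SHAP values\n",
"for `years_exp` by the raw `years_exp` values should therefore show a single SHAP value per\n",
"feature value (the `check` name below is ours, not the book's).\n",
"\n",
"``` python\n",
"# Hedged sketch: confirm the no-interaction model is additive by checking that\n",
"# rows sharing a value of years_exp also share its SHAP value.\n",
"import pandas as pd\n",
"\n",
"col = 'years_exp'\n",
"check = (pd.DataFrame({col: X_test[col].to_numpy(),\n",
"                       'shap': shap_ind_vals[:, col].values})\n",
"         .groupby(col)\n",
"         .shap\n",
"         .nunique())\n",
"print(check)  # expect 1 unique SHAP value per years_exp value\n",
"```\n",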
"\n", 3292 | "``` python\n", 3293 | "from matplotlib import cm\n", 3294 | "fig = plt.figure(figsize=(8, 4))\n", 3295 | "shap.plots.beeswarm(shap_ind_vals, max_display=len(X_test.columns))\n", 3296 | "```\n", 3297 | "\n", 3298 | "``` python\n", 3299 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3300 | "shap.plots.scatter(vals[:, 'years_exp'], ax=ax, \n", 3301 | " color=vals[:, 'age'], alpha=.5,\n", 3302 | " x_jitter=1)\n", 3303 | "```\n", 3304 | "\n", 3305 | "``` python\n", 3306 | "fig, ax = plt.subplots(figsize=(8, 4))\n", 3307 | "shap.plots.scatter(shap_ind_vals[:, 'years_exp'], ax=ax,\n", 3308 | " color=shap_ind_vals[:, 'age'], alpha=.5,\n", 3309 | " x_jitter=1)\n", 3310 | "```\n", 3311 | "\n", 3312 | "### Summary\n", 3313 | "\n", 3314 | "### Exercises\n", 3315 | "\n", 3316 | "## Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration\n", 3317 | "\n", 3318 | "### ICE Plots\n", 3319 | "\n", 3320 | "``` python\n", 3321 | "xgb_def = xgb.XGBClassifier(random_state=42)\n", 3322 | "xgb_def.fit(X_train, y_train)\n", 3323 | "xgb_def.score(X_test, y_test)\n", 3324 | "```\n", 3325 | "\n", 3326 | "``` python\n", 3327 | "from sklearn.inspection import PartialDependenceDisplay\n", 3328 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3329 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],\n", 3330 | " kind='individual', ax=axes)\n", 3331 | "```\n", 3332 | "\n", 3333 | "``` python\n", 3334 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3335 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],\n", 3336 | " centered=True,\n", 3337 | " kind='individual', ax=axes)\n", 3338 | "```\n", 3339 | "\n", 3340 | "``` python\n", 3341 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3342 | "ax_h0 = axes[0].twinx()\n", 3343 | "ax_h0.hist(X_train.r, zorder=0)\n", 3344 | "\n", 3345 | "ax_h1 = axes[1].twinx()\n", 3346 | "ax_h1.hist(X_train.education, zorder=0)\n", 3347 | "\n", 3348 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],\n", 3349 | " centered=True,\n", 3350 | " ice_lines_kw={'zorder':10},\n", 3351 | " kind='individual', ax=axes)\n", 3352 | "fig.tight_layout()\n", 3353 | "```\n", 3354 | "\n", 3355 | "``` python\n", 3356 | "def quantile_ice(clf, X, col, center=True, q=10, color='k', alpha=.5, legend=True,\n", 3357 | " add_hist=False, title='', val_limit=10, ax=None):\n", 3358 | " \"\"\"\n", 3359 | " Generate an ICE plot for a binary classifier's predicted probabilities split \n", 3360 | " by quantiles.\n", 3361 | "\n", 3362 | " Parameters:\n", 3363 | " ----------\n", 3364 | " clf : binary classifier\n", 3365 | " A binary classifier with a `predict_proba` method.\n", 3366 | " X : DataFrame\n", 3367 | " Feature matrix to predict on with shape (n_samples, n_features).\n", 3368 | " col : str\n", 3369 | " Name of column in `X` to plot against the quantiles of predicted probabilities.\n", 3370 | " center : bool, default=True\n", 3371 | " Whether to center the plot on 0.5.\n", 3372 | " q : int, default=10\n", 3373 | " Number of quantiles to split the predicted probabilities into.\n", 3374 | " color : str or array-like, default='k'\n", 3375 | " Color(s) of the lines in the plot.\n", 3376 | " alpha : float, default=0.5\n", 3377 | " Opacity of the lines in the plot.\n", 3378 | " legend : bool, default=True\n", 3379 | " Whether to show the plot legend.\n", 3380 | " add_hist : bool, default=False\n", 3381 | " Whether to add a histogram of the `col` variable to 
the plot.\n", 3382 | " title : str, default=''\n", 3383 | " Title of the plot.\n", 3384 | " val_limit : num, default=10\n", 3385 | " Maximum number of values to test for col.\n", 3386 | " ax : Matplotlib Axis, deafault=None\n", 3387 | " Axis to plot on.\n", 3388 | "\n", 3389 | " Returns:\n", 3390 | " -------\n", 3391 | " results : DataFrame\n", 3392 | " A DataFrame with the same columns as `X`, as well as a `prob` column with \n", 3393 | " the predicted probabilities of `clf` for each row in `X`, and a `group` \n", 3394 | " column indicating which quantile group the row belongs to.\n", 3395 | " \"\"\" \n", 3396 | " probs = clf.predict_proba(X)\n", 3397 | " df = (X\n", 3398 | " .assign(probs=probs[:,-1],\n", 3399 | " p_bin=lambda df_:pd.qcut(df_.probs, q=q, \n", 3400 | " labels=[f'q{n}' for n in range(1,q+1)])\n", 3401 | " )\n", 3402 | " )\n", 3403 | " groups = df.groupby('p_bin')\n", 3404 | "\n", 3405 | " vals = X.loc[:,col].unique()\n", 3406 | " if len(vals) > val_limit:\n", 3407 | " vals = np.linspace(min(vals), max(vals), num=val_limit)\n", 3408 | " res = []\n", 3409 | " for name,g in groups:\n", 3410 | " for val in vals:\n", 3411 | " this_X = g.loc[:,X.columns].assign(**{col:val})\n", 3412 | " q_prob = clf.predict_proba(this_X)[:,-1]\n", 3413 | " res.append(this_X.assign(prob=q_prob, group=name))\n", 3414 | " results = pd.concat(res, axis='index') \n", 3415 | " if ax is None:\n", 3416 | " fig, ax = plt.subplots(figsize=(8,4))\n", 3417 | " if add_hist:\n", 3418 | " back_ax = ax.twinx()\n", 3419 | " back_ax.hist(X[col], density=True, alpha=.2) \n", 3420 | " for name, g in results.groupby('group'):\n", 3421 | " g.groupby(col).prob.mean().plot(ax=ax, label=name, color=color, alpha=alpha)\n", 3422 | " if legend:\n", 3423 | " ax.legend()\n", 3424 | " if title:\n", 3425 | " ax.set_title(title)\n", 3426 | " return results\n", 3427 | "```\n", 3428 | "\n", 3429 | "``` python\n", 3430 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3431 | "quantile_ice(xgb_def, X_train, 'education', q=10, legend=False, add_hist=True, ax=ax,\n", 3432 | " title='ICE plot for Age')\n", 3433 | "```\n", 3434 | "\n", 3435 | "### ICE Plots with SHAP\n", 3436 | "\n", 3437 | "``` python\n", 3438 | "import shap\n", 3439 | "\n", 3440 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3441 | " \n", 3442 | "shap.plots.partial_dependence_plot(ind='education', \n", 3443 | " model=lambda rows: xgb_def.predict_proba(rows)[:,-1],\n", 3444 | " data=X_train.iloc[0:1000], ice=True, \n", 3445 | " npoints=(X_train.education.nunique()),\n", 3446 | " pd_linewidth=0, show=False, ax=ax)\n", 3447 | "ax.set_title('ICE plot (from SHAP)')\n", 3448 | "```\n", 3449 | "\n", 3450 | "### Partial Dependence Plots\n", 3451 | "\n", 3452 | "``` python\n", 3453 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3454 | "\n", 3455 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],\n", 3456 | " kind='average', ax=axes)\n", 3457 | "fig.tight_layout()\n", 3458 | "```\n", 3459 | "\n", 3460 | "``` python\n", 3461 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3462 | "\n", 3463 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],\n", 3464 | " centered=True, kind='both',\n", 3465 | " ax=axes)\n", 3466 | "fig.tight_layout()\n", 3467 | "```\n", 3468 | "\n", 3469 | "``` python\n", 3470 | "fig, axes = plt.subplots(ncols=2, figsize=(8,4))\n", 3471 | "\n", 3472 | "PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['years_exp', 'Q1_Male'],\n", 3473 | " centered=True, 
kind='both',\n", 3474 | " ax=axes)\n", 3475 | "fig.tight_layout()\n", 3476 | "```\n", 3477 | "\n", 3478 | "### PDP with SHAP\n", 3479 | "\n", 3480 | "``` python\n", 3481 | "import shap\n", 3482 | "\n", 3483 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3484 | "\n", 3485 | "col = 'years_exp' \n", 3486 | "shap.plots.partial_dependence_plot(ind=col,\n", 3487 | " model=lambda rows: xgb_def.predict_proba(rows)[:,-1],\n", 3488 | " data=X_train.iloc[0:1000], ice=False, \n", 3489 | " npoints=(X_train[col].nunique()),\n", 3490 | " pd_linewidth=2, show=False, ax=ax)\n", 3491 | "ax.set_title('PDP plot (from SHAP)')\n", 3492 | "```\n", 3493 | "\n", 3494 | "``` python\n", 3495 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3496 | "\n", 3497 | "col = 'years_exp' \n", 3498 | "shap.plots.partial_dependence_plot(ind=col, \n", 3499 | " model=lambda rows: xgb_def.predict_proba(rows)[:,-1],\n", 3500 | " data=X_train.iloc[0:1000], ice=True, \n", 3501 | " npoints=(X_train[col].nunique()),\n", 3502 | " model_expected_value=True,\n", 3503 | " feature_expected_value=True,\n", 3504 | " pd_linewidth=2, show=False, ax=ax)\n", 3505 | "ax.set_title('PDP plot (from SHAP) with ICE Plots')\n", 3506 | "```\n", 3507 | "\n", 3508 | "### Monotonic Constraints\n", 3509 | "\n", 3510 | "``` python\n", 3511 | "\n", 3512 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3513 | "\n", 3514 | "(X_test\n", 3515 | " .assign(target=y_test)\n", 3516 | " .corr(method='spearman')\n", 3517 | " .iloc[:-1]\n", 3518 | " .loc[:,'target']\n", 3519 | " .sort_values(key=np.abs)\n", 3520 | " .plot.barh(title='Spearman Correlation with Target', ax=ax)\n", 3521 | ")\n", 3522 | "```\n", 3523 | "\n", 3524 | "``` pycon\n", 3525 | ">>> print(X_train\n", 3526 | "... .assign(target=y_train)\n", 3527 | "... .groupby('education')\n", 3528 | "... .mean()\n", 3529 | "... .loc[:, ['age', 'years_exp', 'target']]\n", 3530 | "... )\n", 3531 | "\n", 3532 | " age years_exp target\n", 3533 | "education \n", 3534 | "12.0 30.428571 2.857143 0.714286\n", 3535 | "13.0 30.369565 6.760870 0.652174\n", 3536 | "16.0 25.720867 2.849593 0.605691\n", 3537 | "18.0 28.913628 3.225528 0.393474\n", 3538 | "19.0 27.642857 4.166667 0.571429\n", 3539 | "20.0 35.310638 4.834043 0.174468\n", 3540 | "```\n", 3541 | "\n", 3542 | "``` pycon\n", 3543 | ">>> X_train.education.value_counts()\n", 3544 | "18.0 1042\n", 3545 | "16.0 738\n", 3546 | "20.0 235\n", 3547 | "13.0 46\n", 3548 | "19.0 42\n", 3549 | "12.0 7\n", 3550 | "Name: education, dtype: int64\n", 3551 | "```\n", 3552 | "\n", 3553 | "``` pycon\n", 3554 | ">>> print(raw\n", 3555 | "... .query('Q3.isin([\"United States of America\", \"China\", \"India\"]) '\n", 3556 | "... 'and Q6.isin([\"Data Scientist\", \"Software Engineer\"])') \n", 3557 | "... .query('Q4 == \"Professional degree\"')\n", 3558 | "... .pipe(lambda df_:pd.crosstab(index=df_.Q5, columns=df_.Q6))\n", 3559 | "... )\n", 3560 | " \n", 3561 | "Q6 Data Scientist \\\n", 3562 | "Q5 \n", 3563 | "A business discipline (accounting, economics, f... 0 \n", 3564 | "Computer science (software engineering, etc.) 12 \n", 3565 | "Engineering (non-computer focused) 6 \n", 3566 | "Humanities (history, literature, philosophy, etc.) 2 \n", 3567 | "I never declared a major 0 \n", 3568 | "Mathematics or statistics 2 \n", 3569 | "Other 2 \n", 3570 | "Physics or astronomy 2 \n", 3571 | "\n", 3572 | "Q6 Software Engineer \n", 3573 | "Q5 \n", 3574 | "A business discipline (accounting, economics, f... 1 \n", 3575 | "Computer science (software engineering, etc.) 
19 \n", 3576 | "Engineering (non-computer focused) 10 \n", 3577 | "Humanities (history, literature, philosophy, etc.) 0 \n", 3578 | "I never declared a major 1 \n", 3579 | "Mathematics or statistics 1 \n", 3580 | "Other 1 \n", 3581 | "Physics or astronomy 1 \n", 3582 | "```\n", 3583 | "\n", 3584 | "``` python\n", 3585 | "xgb_const = xgb.XGBClassifier(random_state=42,\n", 3586 | " monotone_constraints={'years_exp':1, 'education':-1})\n", 3587 | "xgb_const.fit(X_train, y_train)\n", 3588 | "xgb_const.score(X_test, y_test)\n", 3589 | "```\n", 3590 | "\n", 3591 | "``` python\n", 3592 | "small_cols = ['age', 'education', 'years_exp', 'compensation', 'python', 'r', 'sql',\n", 3593 | " #'Q1_Male', 'Q1_Female', 'Q1_Prefer not to say',\n", 3594 | " #'Q1_Prefer to self-describe', \n", 3595 | " 'Q3_United States of America', 'Q3_India',\n", 3596 | " 'Q3_China', 'major_cs', 'major_other', 'major_eng', 'major_stat']\n", 3597 | "xgb_const2 = xgb.XGBClassifier(random_state=42,\n", 3598 | " monotone_constraints={'years_exp':1, 'education':-1})\n", 3599 | "xgb_const2.fit(X_train[small_cols], y_train)\n", 3600 | "```\n", 3601 | "\n", 3602 | "``` pycon\n", 3603 | ">>> xgb_const2.score(X_test[small_cols], y_test)\n", 3604 | "0.7569060773480663\n", 3605 | "```\n", 3606 | "\n", 3607 | "``` python\n", 3608 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3609 | "(pd.Series(xgb_def.feature_importances_, index=X_train.columns)\n", 3610 | " .sort_values()\n", 3611 | " .plot.barh(ax=ax)\n", 3612 | ")\n", 3613 | "```\n", 3614 | "\n", 3615 | "``` python\n", 3616 | "fig, ax = plt.subplots(figsize=(8,4))\n", 3617 | "(pd.Series(xgb_const2.feature_importances_, index=small_cols)\n", 3618 | " .sort_values()\n", 3619 | " .plot.barh(ax=ax)\n", 3620 | ")\n", 3621 | "```\n", 3622 | "\n", 3623 | "### Calibrating a Model\n", 3624 | "\n", 3625 | "``` python\n", 3626 | "from sklearn.calibration import CalibratedClassifierCV\n", 3627 | "\n", 3628 | "xgb_cal = CalibratedClassifierCV(xgb_def, method='sigmoid', cv='prefit')\n", 3629 | "xgb_cal.fit(X_test, y_test)\n", 3630 | "\n", 3631 | "xgb_cal_iso = CalibratedClassifierCV(xgb_def, method='isotonic', cv='prefit')\n", 3632 | "xgb_cal_iso.fit(X_test, y_test)\n", 3633 | "```\n", 3634 | "\n", 3635 | "### Calibration Curves\n", 3636 | "\n", 3637 | "``` python\n", 3638 | "from sklearn.calibration import CalibrationDisplay\n", 3639 | "from matplotlib.gridspec import GridSpec\n", 3640 | "fig = plt.figure(figsize=(8,6))\n", 3641 | "gs = GridSpec(4, 3)\n", 3642 | "axes = fig.add_subplot(gs[:2, :3])\n", 3643 | "display = CalibrationDisplay.from_estimator(xgb_def, X_test, y_test, \n", 3644 | " n_bins=10, ax=axes)\n", 3645 | "disp_cal = CalibrationDisplay.from_estimator(xgb_cal, X_test, y_test, \n", 3646 | " n_bins=10,ax=axes, name='sigmoid')\n", 3647 | "disp_cal_iso = CalibrationDisplay.from_estimator(xgb_cal_iso, X_test, y_test, \n", 3648 | " n_bins=10, ax=axes, name='isotonic')\n", 3649 | "row = 2\n", 3650 | "col = 0\n", 3651 | "ax = fig.add_subplot(gs[row, col])\n", 3652 | "ax.hist(display.y_prob, range=(0,1), bins=20)\n", 3653 | "ax.set(title='Default', xlabel='Predicted Prob')\n", 3654 | "ax2 = fig.add_subplot(gs[row, 1])\n", 3655 | "ax2.hist(disp_cal.y_prob, range=(0,1), bins=20)\n", 3656 | "ax2.set(title='Sigmoid', xlabel='Predicted Prob')\n", 3657 | "ax3 = fig.add_subplot(gs[row, 2])\n", 3658 | "ax3.hist(disp_cal_iso.y_prob, range=(0,1), bins=20)\n", 3659 | "ax3.set(title='Isotonic', xlabel='Predicted Prob')\n", 3660 | "fig.tight_layout()\n", 3661 | "```\n", 3662 | "\n", 3663 | "``` pycon\n", 
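">>> # accuracy of the sigmoid-calibrated model (isotonic and the uncalibrated default follow)\n",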
3664 | ">>> xgb_cal.score(X_test, y_test)\n", 3665 | "0.7480662983425415\n", 3666 | "```\n", 3667 | "\n", 3668 | "``` pycon\n", 3669 | ">>> xgb_cal_iso.score(X_test, y_test)\n", 3670 | "0.7491712707182321\n", 3671 | "```\n", 3672 | "\n", 3673 | "``` pycon\n", 3674 | ">>> xgb_def.score(X_test, y_test)\n", 3675 | "0.7458563535911602\n", 3676 | "```\n", 3677 | "\n", 3678 | "### Summary\n", 3679 | "\n", 3680 | "### Exercises\n", 3681 | "\n", 3682 | "## Serving Models with MLFlow\n", 3683 | "\n", 3684 | "### Installation and Setup\n", 3685 | "\n", 3686 | "``` python\n", 3687 | "%matplotlib inline\n", 3688 | "\n", 3689 | "from feature_engine import encoding, imputation\n", 3690 | "from hyperopt import fmin, tpe, hp, STATUS_OK, Trials\n", 3691 | "import matplotlib.pyplot as plt\n", 3692 | "import mlflow\n", 3693 | "import numpy as np\n", 3694 | "import pandas as pd\n", 3695 | "from sklearn import base, metrics, model_selection, \\\n", 3696 | " pipeline, preprocessing\n", 3697 | "from sklearn.metrics import accuracy_score, roc_auc_score \n", 3698 | "import xgboost as xgb\n", 3699 | "\n", 3700 | "\n", 3701 | "import urllib\n", 3702 | "import zipfile\n", 3703 | "```\n", 3704 | "\n", 3705 | "``` python\n", 3706 | "import pandas as pd\n", 3707 | "from sklearn import model_selection, preprocessing\n", 3708 | "import xg_helpers as xhelp\n", 3709 | "\n", 3710 | "\n", 3711 | "url = 'https://github.com/mattharrison/datasets/raw/master/data/'\\\n", 3712 | " 'kaggle-survey-2018.zip'\n", 3713 | "fname = 'kaggle-survey-2018.zip'\n", 3714 | "member_name = 'multipleChoiceResponses.csv'\n", 3715 | "\n", 3716 | "raw = xhelp.extract_zip(url, fname, member_name)\n", 3717 | "## Create raw X and raw y\n", 3718 | "kag_X, kag_y = xhelp.get_rawX_y(raw, 'Q6')\n", 3719 | " \n", 3720 | "## Split data \n", 3721 | "kag_X_train, kag_X_test, kag_y_train, kag_y_test = \\\n", 3722 | " model_selection.train_test_split(\n", 3723 | " kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y) \n", 3724 | "\n", 3725 | "## Transform X with pipeline\n", 3726 | "X_train = xhelp.kag_pl.fit_transform(kag_X_train)\n", 3727 | "X_test = xhelp.kag_pl.transform(kag_X_test)\n", 3728 | "\n", 3729 | "## Transform y with label encoder\n", 3730 | "label_encoder = preprocessing.LabelEncoder()\n", 3731 | "label_encoder.fit(kag_y_train)\n", 3732 | "y_train = label_encoder.transform(kag_y_train)\n", 3733 | "y_test = label_encoder.transform(kag_y_test)\n", 3734 | "\n", 3735 | "# Combined Data for cross validation/etc\n", 3736 | "X = pd.concat([X_train, X_test], axis='index')\n", 3737 | "y = pd.Series([*y_train, *y_test], index=X.index)\n", 3738 | "```\n", 3739 | "\n", 3740 | "``` python\n", 3741 | "from hyperopt import fmin, tpe, hp, STATUS_OK, Trials\n", 3742 | "import mlflow\n", 3743 | "from sklearn import metrics\n", 3744 | "import xgboost as xgb\n", 3745 | "\n", 3746 | "ex_id = mlflow.create_experiment(name='ex3', artifact_location='ex2path')\n", 3747 | "mlflow.set_experiment(experiment_name='ex3')\n", 3748 | "with mlflow.start_run():\n", 3749 | " params = {'random_state': 42}\n", 3750 | " rounds = [{'max_depth': hp.quniform('max_depth', 1, 12, 1), # tree\n", 3751 | " 'min_child_weight': hp.loguniform('min_child_weight', -2, 3)},\n", 3752 | " {'subsample': hp.uniform('subsample', 0.5, 1), # stochastic\n", 3753 | " 'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)},\n", 3754 | " {'gamma': hp.loguniform('gamma', -10, 10)}, # regularization\n", 3755 | " {'learning_rate': hp.loguniform('learning_rate', -7, 0)} # boosting\n", 3756 | " ]\n", 
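"\n",
"    # Tune in stages: each round below merges one group of related search\n",
"    # spaces (tree shape, then sampling, then regularization, then learning\n",
"    # rate) into params, runs hyperopt on it, and folds the best values\n",
"    # found back into params before the next round starts.\n",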
3757 | "\n", 3758 | " for round in rounds:\n", 3759 | " params = {**params, **round}\n", 3760 | " trials = Trials()\n", 3761 | " best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(\n", 3762 | " space, X_train, y_train, X_test, y_test), \n", 3763 | " space=params, \n", 3764 | " algo=tpe.suggest, \n", 3765 | " max_evals=10, \n", 3766 | " trials=trials,\n", 3767 | " timeout=60*5 # 5 minutes\n", 3768 | " )\n", 3769 | " params = {**params, **best}\n", 3770 | " for param, val in params.items():\n", 3771 | " mlflow.log_param(param, val)\n", 3772 | " params['max_depth'] = int(params['max_depth'])\n", 3773 | " xg = xgb.XGBClassifier(eval_metric='logloss', early_stopping_rounds=50, **params)\n", 3774 | " xg.fit(X_train, y_train,\n", 3775 | " eval_set=[(X_train, y_train),\n", 3776 | " (X_test, y_test)\n", 3777 | " ]\n", 3778 | " ) \n", 3779 | " for metric in [metrics.accuracy_score, metrics.precision_score, metrics.recall_score, \n", 3780 | " metrics.f1_score]:\n", 3781 | " mlflow.log_metric(metric.__name__, metric(y_test, xg.predict(X_test)))\n", 3782 | "\n", 3783 | " model_info = mlflow.xgboost.log_model(xg, artifact_path='model')\n", 3784 | " \n", 3785 | "```\n", 3786 | "\n", 3787 | "``` pycon\n", 3788 | ">>> ex_id\n", 3789 | "'172212630951564101'\n", 3790 | "```\n", 3791 | "\n", 3792 | "``` pycon\n", 3793 | ">>> model_info.run_id\n", 3794 | "'263b3e793f584251a4e4cd1a2d494110'\n", 3795 | "```\n", 3796 | "\n", 3797 | "### Inspecting Model Artifacts\n", 3798 | "\n", 3799 | "### Running A Model From Code\n", 3800 | "\n", 3801 | "``` python\n", 3802 | "import mlflow\n", 3803 | "logged_model = 'runs:/ecc05fedb5c942598741816a1c6d76e2/model'\n", 3804 | "\n", 3805 | "# Load model as a PyFuncModel.\n", 3806 | "loaded_model = mlflow.pyfunc.load_model(logged_model)\n", 3807 | "```\n", 3808 | "\n", 3809 | "``` pycon\n", 3810 | ">>> loaded_model.predict(X_test.iloc[[0]])\n", 3811 | "array([1])\n", 3812 | "```\n", 3813 | "\n", 3814 | "### Serving Predictions\n", 3815 | "\n", 3816 | "### Querying from the Command Line\n", 3817 | "\n", 3818 | "``` pycon\n", 3819 | ">>> X_test.head(2).to_json(orient='split', index=False)\n", 3820 | "'{\"columns\":[\"age\",\"education\",\"years_exp\",\"compensation\",\n", 3821 | "\"python\",\"r\",\"sql\",\"Q1_Male\",\"Q1_Female\",\"Q1_Prefer not to say\",\n", 3822 | "\"Q1_Prefer to self-describe\",\"Q3_United States of America\",\n", 3823 | "\"Q3_India\",\"Q3_China\",\"major_cs\",\"major_other\",\"major_eng\",\n", 3824 | "\"major_stat\"],\"data\":[[22,16.0,1.0,0,1,0,0,1,0,0,0,0,1,0,1,0,\n", 3825 | "0,0],[25,18.0,1.0,70000,1,1,0,1,0,0,0,1,0,0,0,1,0,0]]}'\n", 3826 | "```\n", 3827 | "\n", 3828 | "``` pycon\n", 3829 | ">>> import json\n", 3830 | ">>> json.loads(X_test.head(2).to_json(orient='split', index=False))\n", 3831 | "{'columns': ['age',\n", 3832 | " 'education',\n", 3833 | " 'years_exp',\n", 3834 | " 'compensation',\n", 3835 | " 'python',\n", 3836 | " 'r',\n", 3837 | " 'sql',\n", 3838 | " 'Q1_Male',\n", 3839 | " 'Q1_Female',\n", 3840 | " 'Q1_Prefer not to say',\n", 3841 | " 'Q1_Prefer to self-describe',\n", 3842 | " 'Q3_United States of America',\n", 3843 | " 'Q3_India',\n", 3844 | " 'Q3_China',\n", 3845 | " 'major_cs',\n", 3846 | " 'major_other',\n", 3847 | " 'major_eng',\n", 3848 | " 'major_stat'],\n", 3849 | " 'data': [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],\n", 3850 | " [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}\n", 3851 | "```\n", 3852 | "\n", 3853 | "``` pycon\n", 3854 | ">>> {'dataframe_split': 
json.loads(X_test.head(2).to_json(orient='split', \n", 3855 | "... index=False))}\n", 3856 | "{'dataframe_split': {'columns': ['age',\n", 3857 | " 'education',\n", 3858 | " 'years_exp',\n", 3859 | " 'compensation',\n", 3860 | " 'python',\n", 3861 | " 'r',\n", 3862 | " 'sql',\n", 3863 | " 'Q1_Male',\n", 3864 | " 'Q1_Female',\n", 3865 | " 'Q1_Prefer not to say',\n", 3866 | " 'Q1_Prefer to self-describe',\n", 3867 | " 'Q3_United States of America',\n", 3868 | " 'Q3_India',\n", 3869 | " 'Q3_China',\n", 3870 | " 'major_cs',\n", 3871 | " 'major_other',\n", 3872 | " 'major_eng',\n", 3873 | " 'major_stat'],\n", 3874 | " 'data': [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],\n", 3875 | " [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}\n", 3876 | "```\n", 3877 | "\n", 3878 | "``` python\n", 3879 | "def create_post_data(df):\n", 3880 | " dictionary = json.loads(df\n", 3881 | " .to_json(orient='split', index=False))\n", 3882 | " return json.dumps({'dataframe_split': dictionary})\n", 3883 | "```\n", 3884 | "\n", 3885 | "``` pycon\n", 3886 | ">>> post_data = create_post_data(X_test.head(2))\n", 3887 | ">>> print(post_data)\n", 3888 | "{\"dataframe_split\": {\"columns\": [\"age\", \"education\", \"years_exp\", \"compensation\", \n", 3889 | " \"python\", \"r\", \"sql\", \"Q1_Male\", \"Q1_Female\", \"Q1_Prefer not to say\", \n", 3890 | " \"Q1_Prefer to self-describe\", \"Q3_United States of America\", \"Q3_India\", \"Q3_China\", \n", 3891 | " \"major_cs\", \"major_other\", \"major_eng\", \"major_stat\"], \n", 3892 | " \"data\": [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], \n", 3893 | " [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}\n", 3894 | "```\n", 3895 | "\n", 3896 | "``` python\n", 3897 | "!curl http://127.0.0.1:1234/invocations -X POST -H \\\n", 3898 | " \"Content-Type:application/json\" --data $post_data \n", 3899 | "```\n", 3900 | "\n", 3901 | "``` pycon\n", 3902 | ">>> quoted = f\"'{post_data}'\"\n", 3903 | ">>> quoted\n", 3904 | "'\\'{\"dataframe_split\": {\"columns\": [\"age\", \"education\", \n", 3905 | "\"years_exp\", \"compensation\", \"python\", \"r\", \"sql\", \"Q1_Male\", \n", 3906 | "\"Q1_Female\", \"Q1_Prefer not to say\", \"Q1_Prefer to self-describe\",\n", 3907 | "\"Q3_United States of America\", \"Q3_India\", \"Q3_China\", \"major_cs\",\n", 3908 | "\"major_other\", \"major_eng\", \"major_stat\"], \"data\": [[22, 16.0, 1.0,\n", 3909 | "0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0, \n", 3910 | "70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}\\''\n", 3911 | "```\n", 3912 | "\n", 3913 | "``` python\n", 3914 | "def create_post_data(df, quote=True):\n", 3915 | " dictionary = {'dataframe_split': json.loads(df\n", 3916 | " .to_json(orient='split', index=False))}\n", 3917 | " if quote:\n", 3918 | " return f\"'{dictionary}'\"\n", 3919 | " else:\n", 3920 | " return dictionary\n", 3921 | "\n", 3922 | "quoted = create_post_data(X_test.head(2))\n", 3923 | "```\n", 3924 | "\n", 3925 | "``` python\n", 3926 | "!curl http://127.0.0.1:1234/invocations -x post -h \\\n", 3927 | " \"content-type:application/json\" --data $quoted \n", 3928 | "```\n", 3929 | "\n", 3930 | "### Querying with the Requests Library\n", 3931 | "\n", 3932 | "``` pycon\n", 3933 | ">>> import requests as req\n", 3934 | ">>> import json\n", 3935 | "\n", 3936 | ">>> r = req.post('http://127.0.0.1:1234/invocations', \n", 3937 | "... 
json=create_post_data(X_test.head(2), quote=False))\n", 3938 | ">>> print(r.text)\n", 3939 | "{\"predictions\": [1, 0]}\n", 3940 | "```\n", 3941 | "\n", 3942 | "### Building with Docker\n", 3943 | "\n", 3944 | "### Conclusion\n", 3945 | "\n", 3946 | "### Exercises" 3947 | ] 3948 | } 3949 | ], 3950 | "metadata": { 3951 | "kernelspec": { 3952 | "display_name": "Python 3 (ipykernel)", 3953 | "language": "python", 3954 | "name": "python3" 3955 | }, 3956 | "language_info": { 3957 | "codemirror_mode": { 3958 | "name": "ipython", 3959 | "version": 3 3960 | }, 3961 | "file_extension": ".py", 3962 | "mimetype": "text/x-python", 3963 | "name": "python", 3964 | "nbconvert_exporter": "python", 3965 | "pygments_lexer": "ipython3", 3966 | "version": "3.10.11" 3967 | } 3968 | }, 3969 | "nbformat": 4, 3970 | "nbformat_minor": 5 3971 | } 3972 | --------------------------------------------------------------------------------