├── IBM-HR-Employee-Attrition.csv ├── LICENSE ├── README.md ├── YandexCatBoost-Demo.ipynb ├── YandexCatBoost-Demo.py └── images ├── output_24_0.png ├── output_30_0.png └── output_69_0.png /LICENSE: -------------------------------------------------------------------------------- 1 | This is free and unencumbered software released into the public domain. 2 | 3 | Anyone is free to copy, modify, publish, use, compile, sell, or 4 | distribute this software, either in source code form or as a compiled 5 | binary, for any purpose, commercial or non-commercial, and by any 6 | means. 7 | 8 | In jurisdictions that recognize copyright laws, the author or authors 9 | of this software dedicate any and all copyright interest in the 10 | software to the public domain. We make this dedication for the benefit 11 | of the public at large and to the detriment of our heirs and 12 | successors. We intend this dedication to be an overt act of 13 | relinquishment in perpetuity of all present and future rights to this 14 | software under copyright law. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 19 | IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR 20 | OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 21 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR 22 | OTHER DEALINGS IN THE SOFTWARE. 23 | 24 | For more information, please refer to 25 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Exploration of Yandex CatBoost in Python ## 3 | 4 | This demo provides a brief introduction to 5 | - performing data exploration and preprocessing 6 | - feature subset selection: low variance filter 7 | - feature subset selection: high correlation filter 8 | - catboost model tuning 9 | - importance of data preprocessing: data normalization 10 | - exploration of catboost's feature importance ranking 11 | 12 | ## Getting started 13 | Open `YandexCatBoost-Demo.ipynb` in a Jupyter notebook environment or on Google Colab. The notebook contains further technical details. 14 | 15 | ## Future Improvements ## 16 | 17 | Results from the feature importance ranking show that the attribute ‘MaritalStatus' contributes minimally to class label prediction and could potentially be a noise attribute. Removing it might increase the model's accuracy, as outlined in the sketch below.
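A minimal sketch of how this could be checked, reusing the `df` and `y` constructed in Parts 3 and 4 of the walkthrough below together with the Part 6/7 classifier settings (illustrative only, not part of the original notebook):

```python
# Sketch: drop MaritalStatus, retrain with the same classifier settings, and compare test accuracy
import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

# assumes df (with ClassLabel) and the numeric y from the walkthrough below already exist
X_reduced = pd.get_dummies(df.drop(['ClassLabel', 'MaritalStatus'], axis=1))
X_reduced = (X_reduced - X_reduced.mean()) / (X_reduced.max() - X_reduced.min())  # same normalisation as Part 7

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, train_size=0.7, random_state=1234)

model = CatBoostClassifier(l2_leaf_reg=5, iterations=1000, fold_len_multiplier=1.1,
                           custom_loss=['Accuracy'], random_seed=100, loss_function='MultiClass')
model.fit(X_train, y_train, verbose=False)
print(model.score(X_test, y_test))  # compare against the ~0.94 accuracy obtained with MaritalStatus included (Part 7)
```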
18 | 19 | 20 | ## Code Walkthrough 21 | 22 | Installing the open source Yandex CatBoost package 23 | 24 | 25 | ```python 26 | pip install catboost 27 | ``` 28 | 29 | Importing the required packages: Numpy, Pandas, Matplotlib, Seaborn, Scikit-learn and CatBoost 30 | 31 | 32 | ```python 33 | import numpy as np 34 | import pandas as pd 35 | import matplotlib.pyplot as plt 36 | # plt.style.use('ggplot') 37 | import seaborn as sns 38 | from catboost import Pool, CatBoostClassifier, cv, CatboostIpythonWidget 39 | from sklearn.preprocessing import MinMaxScaler 40 | from sklearn.feature_selection import VarianceThreshold 41 | ``` 42 | 43 | Loading the [IBM HR Dataset](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/data) into a pandas dataframe 44 | 45 | 46 | ```python 47 | ibm_hr_df = pd.read_csv("IBM-HR-Employee-Attrition.csv") 48 | ``` 49 | 50 | ### Part 1a: Data Exploration - Summary Statistics ### 51 | 52 | Getting the summary statistics of the IBM HR dataset 53 | 54 | 55 | ```python 56 | ibm_hr_df.describe() 57 | ``` 58 | 59 | 60 | 61 | 62 |
63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 |
AgeDailyRateDistanceFromHomeEducationEmployeeCountEmployeeNumberEnvironmentSatisfactionHourlyRateJobInvolvementJobLevel...RelationshipSatisfactionStandardHoursStockOptionLevelTotalWorkingYearsTrainingTimesLastYearWorkLifeBalanceYearsAtCompanyYearsInCurrentRoleYearsSinceLastPromotionYearsWithCurrManager
count1470.0000001470.0000001470.0000001470.0000001470.01470.0000001470.0000001470.0000001470.0000001470.000000...1470.0000001470.01470.0000001470.0000001470.0000001470.0000001470.0000001470.0000001470.0000001470.000000
mean36.923810802.4857149.1925172.9129251.01024.8653062.72176965.8911562.7299322.063946...2.71224580.00.79387811.2795922.7993202.7612247.0081634.2292522.1877554.123129
std9.135373403.5091008.1068641.0241650.0602.0243351.09308220.3294280.7115611.106940...1.0812090.00.8520777.7807821.2892710.7064766.1265253.6231373.2224303.568136
min18.000000102.0000001.0000001.0000001.01.0000001.00000030.0000001.0000001.000000...1.00000080.00.0000000.0000000.0000001.0000000.0000000.0000000.0000000.000000
25%30.000000465.0000002.0000002.0000001.0491.2500002.00000048.0000002.0000001.000000...2.00000080.00.0000006.0000002.0000002.0000003.0000002.0000000.0000002.000000
50%36.000000802.0000007.0000003.0000001.01020.5000003.00000066.0000003.0000002.000000...3.00000080.01.00000010.0000003.0000003.0000005.0000003.0000001.0000003.000000
75%43.0000001157.00000014.0000004.0000001.01555.7500004.00000083.7500003.0000003.000000...4.00000080.01.00000015.0000003.0000003.0000009.0000007.0000003.0000007.000000
max60.0000001499.00000029.0000005.0000001.02068.0000004.000000100.0000004.0000005.000000...4.00000080.03.00000040.0000006.0000004.00000040.00000018.00000015.00000017.000000
285 |

8 rows × 26 columns

286 |
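As a side note, `describe()` only summarises the 26 numeric columns; a small sketch (using the same `ibm_hr_df`) for inspecting the categorical attributes as well:

```python
# Sketch: summary statistics of the non-numeric (object) columns
ibm_hr_df.describe(include=['object'])
```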
287 | 288 | 289 | 290 | Zooming in on the summary statistics of irrelevant attributes __*EmployeeCount*__ and __*StandardHours*__ 291 | 292 | 293 | ```python 294 | irrList = ['EmployeeCount', 'StandardHours'] 295 | ibm_hr_df[irrList].describe() 296 | ``` 297 | 298 | 299 | 300 | 301 |
302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 |
EmployeeCountStandardHours
count1470.01470.0
mean1.080.0
std0.00.0
min1.080.0
25%1.080.0
50%1.080.0
75%1.080.0
max1.080.0
353 |
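Both columns show a standard deviation of 0.0, i.e. they are constant and carry no information for prediction; the same check is automated later by the low variance filter in Part 2b.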
354 | 355 | 356 | 357 | Zooming in on the summary statistics of irrelevant attribute __*Over18*__ 358 | 359 | 360 | ```python 361 | ibm_hr_df["Over18"].value_counts() 362 | ``` 363 | 364 | 365 | 366 | 367 | Y 1470 368 | Name: Over18, dtype: int64 369 | 370 | 371 | 372 | From the summary statistics, one could see that the attributes __*EmployeeCount*__, __*StandardHours*__ and __*Over18*__ hold only a single value for all of the 1470 records
373 | 374 | __*EmployeeCount*__ only holds a single value - 1.0
375 | __*StandardHours*__ only holds a single value - 80.0
376 | __*Over18*__ only holds a single value - 'Y'
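The same observation can be confirmed programmatically; a small sketch (using the `ibm_hr_df` dataframe loaded above):

```python
# Sketch: list the columns that hold a single value across all 1470 records
constant_cols = [col for col in ibm_hr_df.columns if ibm_hr_df[col].nunique() == 1]
print(constant_cols)  # expected: ['EmployeeCount', 'Over18', 'StandardHours']
```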
377 | 378 | These irrelevant attributes are duly dropped from the dataset 379 | 380 | ### Part 1b: Data Exploration - Missing Values and Duplicate Records ### 381 | 382 | Checking for 'NA' and missing values in the dataset. 383 | 384 | 385 | ```python 386 | ibm_hr_df.isnull().sum(axis=0) 387 | ``` 388 | 389 | 390 | 391 | 392 | Age 0 393 | Attrition 0 394 | BusinessTravel 0 395 | DailyRate 0 396 | Department 0 397 | DistanceFromHome 0 398 | Education 0 399 | EducationField 0 400 | EmployeeCount 0 401 | EmployeeNumber 0 402 | EnvironmentSatisfaction 0 403 | Gender 0 404 | HourlyRate 0 405 | JobInvolvement 0 406 | JobLevel 0 407 | JobRole 0 408 | JobSatisfaction 0 409 | MaritalStatus 0 410 | MonthlyIncome 0 411 | MonthlyRate 0 412 | NumCompaniesWorked 0 413 | Over18 0 414 | OverTime 0 415 | PercentSalaryHike 0 416 | PerformanceRating 0 417 | RelationshipSatisfaction 0 418 | StandardHours 0 419 | StockOptionLevel 0 420 | TotalWorkingYears 0 421 | TrainingTimesLastYear 0 422 | WorkLifeBalance 0 423 | YearsAtCompany 0 424 | YearsInCurrentRole 0 425 | YearsSinceLastPromotion 0 426 | YearsWithCurrManager 0 427 | dtype: int64 428 | 429 | 430 | 431 | Well, we got lucky here: there aren't any missing values in this dataset 432 | 433 | Next, let's check for the existence of duplicate records in the dataset 434 | 435 | 436 | ```python 437 | ibm_hr_df.duplicated().sum() 438 | ``` 439 | 440 | 441 | 442 | 443 | 0 444 | 445 | 446 | 447 | There are also no duplicate records in the dataset 448 | 449 | Converting the binary categorical attribute __*OverTime*__ to {1, 0} 450 | 451 | 452 | ```python 453 | ibm_hr_df['OverTime'].replace(to_replace=dict(Yes=1, No=0), inplace=True) 454 | ``` 455 | 456 | ### Part 2a: Data Preprocessing - Removal of Irrelevant Attributes ### 457 | 458 | 459 | ```python 460 | ibm_hr_df = ibm_hr_df.drop(['EmployeeCount', 'StandardHours', 'Over18'], axis=1) 461 | ``` 462 | 463 | ### Part 2b: Data Preprocessing - Feature Subset Selection - Low Variance Filter ### 464 | 465 | Performing variance analysis to aid in feature selection 466 | 467 | 468 | ```python 469 | variance_x = ibm_hr_df.drop('Attrition', axis=1) 470 | variance_one_hot = pd.get_dummies(variance_x) 471 | ``` 472 | 473 | 474 | ```python 475 | #Normalise the dataset.
This is required for getting the variance threshold 476 | scaler = MinMaxScaler() 477 | scaler.fit(variance_one_hot) 478 | MinMaxScaler(copy=True, feature_range=(0, 1)) 479 | scaled_variance_one_hot = scaler.transform(variance_one_hot) 480 | ``` 481 | 482 | 483 | ```python 484 | #Set the threshold value and run VarianceThreshold (.85*(1-.85) = 0.1275, i.e. the variance of a boolean feature that takes the same value in 85% of the samples) 485 | thres = .85* (1 - .85) 486 | sel = VarianceThreshold(threshold=thres) 487 | sel.fit(scaled_variance_one_hot) 488 | variance = sel.variances_ 489 | ``` 490 | 491 | 492 | ```python 493 | #Sort the variance scores in descending order for plotting 494 | indices = np.argsort(variance)[::-1] 495 | feature_list = list(variance_one_hot) 496 | sorted_feature_list = [] 497 | thres_list = [] 498 | for f in range(len(variance_one_hot.columns)): 499 | sorted_feature_list.append(feature_list[indices[f]]) 500 | thres_list.append(thres) 501 | ``` 502 | 503 | 504 | ```python 505 | plt.figure(figsize=(14,6)) 506 | plt.title("Feature Variance: %f" %(thres), fontsize = 14) 507 | plt.bar(range(len(variance_one_hot.columns)), variance[indices], color="c") 508 | plt.xticks(range(len(variance_one_hot.columns)), sorted_feature_list, rotation = 90) 509 | plt.xlim([-0.5, len(variance_one_hot.columns)]) 510 | plt.plot(range(len(variance_one_hot.columns)), thres_list, "k-", color="r") 511 | plt.tight_layout() 512 | plt.show() 513 | ``` 514 | 515 | 516 | ![png](images/output_30_0.png) 517 | 518 | 519 | ### Part 2c: Data Preprocessing - Feature Subset Selection - High Correlation Filter ### 520 | 521 | Performing Pearson correlation analysis between attributes to aid in feature selection 522 | 523 | 524 | ```python 525 | plt.figure(figsize=(16,16)) 526 | sns.heatmap(ibm_hr_df.corr(), annot=True, fmt=".2f") 527 | 528 | plt.show() 529 | ``` 530 | 531 | 532 | ![png](images/output_24_0.png) 533 | 534 | 535 | 536 | ### Part 3: Multi-Class Label Generation ### 537 | 538 | 539 | ```python 540 | rAttrList = ['Department', 'OverTime', 'HourlyRate', 541 | 'StockOptionLevel', 'DistanceFromHome', 542 | 'YearsInCurrentRole', 'Age'] 543 | ``` 544 | 545 | 546 | ```python 547 | #keep only the attributes listed in rAttrList 548 | label_hr_df = ibm_hr_df[rAttrList].copy() # .copy() avoids a pandas SettingWithCopyWarning when CatDistanceFromHome is added below 549 | ``` 550 | 551 | 552 | ```python 553 | #convert continuous attribute DistanceFromHome to categorical 554 | #: 1: near, 2: mid distance, 3: far 555 | maxValues = label_hr_df['DistanceFromHome'].max() 556 | minValues = label_hr_df['DistanceFromHome'].min() 557 | intervals = (maxValues - minValues)/3 558 | bins = [0, (minValues + intervals), (maxValues - intervals), maxValues] 559 | groupName = [1, 2, 3] 560 | label_hr_df['CatDistanceFromHome'] = pd.cut(label_hr_df['DistanceFromHome'], bins, labels = groupName) 561 | ``` 562 | 563 | 564 | ```python 565 | # convert col type from cat to int64 566 | label_hr_df['CatDistanceFromHome'] = pd.to_numeric(label_hr_df['CatDistanceFromHome']) 567 | label_hr_df.drop(['DistanceFromHome'], axis = 1, inplace = True) 568 | ``` 569 | 570 | 571 | ```python 572 | #replace Department with 0 & 1, 0: R&D, and 1: Non-R&D 573 | label_hr_df['Department'].replace(['Research & Development', 'Human Resources', 'Sales'], 574 | [0, 1, 1], inplace = True) 575 | ``` 576 | 577 | 578 | ```python 579 | #normalise data 580 | label_hr_df_norm = (label_hr_df - label_hr_df.min()) / (label_hr_df.max() - label_hr_df.min()) 581 | ``` 582 | 583 | 584 | ```python 585 | #create a data frame for the function value and class labels 586 | value_df = pd.DataFrame(columns = ['ClassValue']) 587 | ``` 588 | 589 | 590 | ```python 591 | #compute the
class value 592 | for row in range (0, ibm_hr_df.shape[0]): 593 | if label_hr_df_norm['Department'][row] == 0: 594 | value = 0.3 * label_hr_df_norm['HourlyRate'][row] - 0.2 * label_hr_df_norm['OverTime'][row] + \ 595 | - 0.2 * label_hr_df_norm['CatDistanceFromHome'][row] + 0.15 * label_hr_df_norm['StockOptionLevel'][row] + \ 596 | 0.1 * label_hr_df_norm['Age'][row] - 0.05 * label_hr_df_norm['YearsInCurrentRole'][row] 597 | 598 | else: 599 | value = 0.2 * label_hr_df_norm['HourlyRate'][row] - 0.3 * label_hr_df_norm['OverTime'][row] + \ 600 | - 0.15 * label_hr_df_norm['CatDistanceFromHome'][row] + 0.2 * label_hr_df_norm['StockOptionLevel'][row] + \ 601 | 0.05 * label_hr_df_norm['Age'][row] - 0.1 * label_hr_df_norm['YearsInCurrentRole'][row] 602 | value_df.loc[row] = value 603 | ``` 604 | 605 | 606 | ```python 607 | # top 500 records by class value are labelled satisfied with their job 608 | v1 = value_df.sort_values('ClassValue', ascending = False).reset_index(drop = True)\ 609 | ['ClassValue'][499] 610 | # next 500 are neutral 611 | v2 = value_df.sort_values('ClassValue', ascending = False).reset_index(drop = True)\ 612 | ['ClassValue'][999] 613 | # rest are unsatisfied 614 | ``` 615 | 616 | 617 | ```python 618 | label_df = pd.DataFrame(columns = ['ClassLabel']) 619 | ``` 620 | 621 | 622 | ```python 623 | #compute the classlabel 624 | for row in range (0, value_df.shape[0]): 625 | if value_df['ClassValue'][row] >= v1: 626 | cat = "Satisfied" 627 | elif value_df['ClassValue'][row] >= v2: 628 | cat = "Neutral" 629 | else: 630 | cat = "Unsatisfied" 631 | label_df.loc[row] = cat 632 | ``` 633 | 634 | 635 | ```python 636 | df = pd.concat([ibm_hr_df, label_df], axis = 1) 637 | ``` 638 | 639 | ### Part 4: Classification with CatBoost ### 640 | 641 | 642 | ```python 643 | df = df[['Age', 'Department', 'DistanceFromHome', 'HourlyRate', 'OverTime', 'StockOptionLevel', 644 | 'MaritalStatus', 'YearsInCurrentRole', 'EmployeeNumber', 'ClassLabel']] 645 | ``` 646 | 647 | Split the dataset into attributes/features __*X*__ and label/class __*y*__ 648 | 649 | 650 | ```python 651 | X = df.drop('ClassLabel', axis=1) 652 | y = df.ClassLabel 653 | ``` 654 | 655 | Replacing the label/class values __*'Satisfied'*__, __*'Neutral'*__ and *__'Unsatisfied'__* with *__2__*, __*1*__ and __*0*__ 656 | 657 | 658 | ```python 659 | y.replace(to_replace=dict(Satisfied=2, Neutral=1, Unsatisfied=0), inplace=True) 660 | ``` 661 | 662 | Applying the __'one hot encoding'__ method 663 | 664 | 665 | ```python 666 | one_hot = pd.get_dummies(X) 667 | ``` 668 | 669 | ```python 670 | categorical_features_indices = np.where(one_hot.dtypes != float)[0] 671 | ``` 672 | 673 | ### Part 5: Model training with CatBoost ### 674 | Now let's split our data into train (70%) and test (30%) sets: 675 | 676 | 677 | ```python 678 | from sklearn.model_selection import train_test_split 679 | 680 | X_train, X_test, y_train, y_test = train_test_split(one_hot, y, train_size=0.7, random_state=1234) 681 | ``` 682 | 683 | 684 | ```python 685 | model = CatBoostClassifier( 686 | custom_loss = ['Accuracy'], 687 | random_seed = 100, 688 | loss_function = 'MultiClass' 689 | ) 690 | ``` 691 | 692 | 693 | ```python 694 | model.fit( 695 | X_train, y_train, 696 | cat_features = categorical_features_indices, 697 | verbose = True, # you can uncomment this for text output 698 | #plot = True 699 | ) 700 | ``` 701 | 702 | 703 | ```python 704 | cm = pd.DataFrame() 705 | cm['Satisfaction'] = y_test 706 | cm['Predict'] = model.predict(X_test) 707 | ``` 708 | 709 | 710 | ```python 711 |
mappingSatisfaction = {0:'Unsatisfied', 1: 'Neutral', 2: 'Satisfied'} 712 | mappingPredict = {0.0:'Unsatisfied', 1.0: 'Neutral', 2.0: 'Satisfied'} 713 | cm = cm.replace({'Satisfaction': mappingSatisfaction, 'Predict': mappingPredict}) 714 | ``` 715 | 716 | 717 | ```python 718 | pd.crosstab(cm['Satisfaction'], cm['Predict'], margins=True) 719 | ``` 720 | 721 | 722 | 723 | 724 |
725 | 726 | 727 | 728 | 729 | 730 | 731 | 732 | 733 | 734 | 735 | 736 | 737 | 738 | 739 | 740 | 741 | 742 | 743 | 744 | 745 | 746 | 747 | 748 | 749 | 750 | 751 | 752 | 753 | 754 | 755 | 756 | 757 | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 | 768 | 769 | 770 | 771 | 772 |
PredictNeutralSatisfiedUnsatisfiedAll
Satisfaction
Neutral14388159
Satisfied201231144
Unsatisfied180120138
All181131129441
773 |
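Reading off the diagonal of the confusion matrix above: (143 + 123 + 120) / 441 ≈ 0.875 of the test records are classified correctly, which matches the accuracy score computed below.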
774 | 775 | 776 | 777 | 778 | ```python 779 | model.score(X_test, y_test) 780 | ``` 781 | 782 | 783 | 784 | 785 | 0.87528344671201819 786 | 787 | 788 | ### Part 6: CatBoost Classifier Tuning ### 789 | 790 | 791 | ```python 792 | model = CatBoostClassifier( 793 | l2_leaf_reg = 5, 794 | iterations = 1000, 795 | fold_len_multiplier = 1.1, 796 | custom_loss = ['Accuracy'], 797 | random_seed = 100, 798 | loss_function = 'MultiClass' 799 | ) 800 | ``` 801 | 802 | 803 | ```python 804 | model.fit( 805 | X_train, y_train, 806 | cat_features = categorical_features_indices, 807 | verbose = True, # you can uncomment this for text output 808 | #plot = True 809 | ) 810 | ``` 811 | 812 | 813 | ```python 814 | cm = pd.DataFrame() 815 | cm['Satisfaction'] = y_test 816 | cm['Predict'] = model.predict(X_test) 817 | ``` 818 | 819 | 820 | ```python 821 | mappingSatisfaction = {0:'Unsatisfied', 1: 'Neutral', 2: 'Satisfied'} 822 | mappingPredict = {0.0:'Unsatisfied', 1.0: 'Neutral', 2.0: 'Satisfied'} 823 | cm = cm.replace({'Satisfaction': mappingSatisfaction, 'Predict': mappingPredict}) 824 | ``` 825 | 826 | 827 | ```python 828 | pd.crosstab(cm['Satisfaction'], cm['Predict'], margins=True) 829 | ``` 830 | 831 | 832 | 833 | 834 |
835 | 836 | 837 | 838 | 839 | 840 | 841 | 842 | 843 | 844 | 845 | 846 | 847 | 848 | 849 | 850 | 851 | 852 | 853 | 854 | 855 | 856 | 857 | 858 | 859 | 860 | 861 | 862 | 863 | 864 | 865 | 866 | 867 | 868 | 869 | 870 | 871 | 872 | 873 | 874 | 875 | 876 | 877 | 878 | 879 | 880 | 881 | 882 |
PredictNeutralSatisfiedUnsatisfiedAll
Satisfaction
Neutral14298159
Satisfied171261144
Unsatisfied120126138
All171135135441
883 |
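Again reading off the diagonal: (142 + 126 + 126) / 441 ≈ 0.894, matching the score below, a modest improvement over the untuned model.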
884 | 885 | 886 | 887 | 888 | ```python 889 | model.score(X_test, y_test) 890 | ``` 891 | 892 | 893 | 894 | 895 | 0.89342403628117917 896 | 897 | 898 | ### Part 7: Data Preprocessing: Attribute Value Normalization ### 899 | 900 | 901 | Normalization of features, after realizing that further tuning no longer improves the model's accuracy 902 | 903 | 904 | ```python 905 | one_hot = (one_hot - one_hot.mean()) / (one_hot.max() - one_hot.min()) 906 | ``` 907 | 908 | 909 | ```python 910 | categorical_features_indices = np.where(one_hot.dtypes != float)[0] 911 | ``` 912 | 913 | 914 | ```python 915 | from sklearn.model_selection import train_test_split 916 | 917 | X_train, X_test, y_train, y_test = train_test_split(one_hot, y, train_size=0.7, random_state=1234) 918 | ``` 919 | 920 | 921 | ```python 922 | model = CatBoostClassifier( 923 | l2_leaf_reg = 5, 924 | iterations = 1000, 925 | fold_len_multiplier = 1.1, 926 | custom_loss = ['Accuracy'], 927 | random_seed = 100, 928 | loss_function = 'MultiClass' 929 | ) 930 | ``` 931 | 932 | 933 | ```python 934 | model.fit( 935 | X_train, y_train, 936 | cat_features = categorical_features_indices, 937 | verbose = True, # you can uncomment this for text output 938 | #plot = True 939 | ) 940 | ``` 941 | 942 | 943 | ```python 944 | feature_score = pd.DataFrame(list(zip(one_hot.dtypes.index, model.get_feature_importance(Pool(one_hot, label=y, cat_features=categorical_features_indices)))), 945 | columns=['Feature','Score']) 946 | ``` 947 | 948 | 949 | ```python 950 | feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last') 951 | ``` 952 | 953 | 954 | ```python 955 | plt.rcParams["figure.figsize"] = (12,7) 956 | ax = feature_score.plot('Feature', 'Score', kind='bar', color='c') 957 | ax.set_title("Catboost Feature Importance Ranking", fontsize = 14) 958 | ax.set_xlabel('') 959 | 960 | rects = ax.patches 961 | 962 | # get feature scores as labels, rounded to 2 decimals 963 | labels = feature_score['Score'].round(2) 964 | 965 | for rect, label in zip(rects, labels): 966 | height = rect.get_height() 967 | ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom') 968 | 969 | plt.show() 970 | ``` 971 | 972 | 973 | ![png](images/output_69_0.png) 974 | 975 | 976 | 977 | ```python 978 | cm = pd.DataFrame() 979 | cm['Satisfaction'] = y_test 980 | cm['Predict'] = model.predict(X_test) 981 | ``` 982 | 983 | 984 | ```python 985 | mappingSatisfaction = {0:'Unsatisfied', 1: 'Neutral', 2: 'Satisfied'} 986 | mappingPredict = {0.0:'Unsatisfied', 1.0: 'Neutral', 2.0: 'Satisfied'} 987 | cm = cm.replace({'Satisfaction': mappingSatisfaction, 'Predict': mappingPredict}) 988 | ``` 989 | 990 | 991 | ```python 992 | pd.crosstab(cm['Satisfaction'], cm['Predict'], margins=True) 993 | ``` 994 | 995 | 996 | 997 | 998 |
999 | 1000 | 1001 | 1002 | 1003 | 1004 | 1005 | 1006 | 1007 | 1008 | 1009 | 1010 | 1011 | 1012 | 1013 | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 | 1024 | 1025 | 1026 | 1027 | 1028 | 1029 | 1030 | 1031 | 1032 | 1033 | 1034 | 1035 | 1036 | 1037 | 1038 | 1039 | 1040 | 1041 | 1042 | 1043 | 1044 | 1045 | 1046 |
PredictNeutralSatisfiedUnsatisfiedAll
Satisfaction
Neutral146112159
Satisfied71370144
Unsatisfied80130138
All161148132441
1047 |
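Diagonal of the final confusion matrix: (146 + 137 + 130) / 441 ≈ 0.937, matching the score below; of the three runs, attribute value normalization gives the largest single gain in accuracy.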
1048 | 1049 | 1050 | 1051 | 1052 | ```python 1053 | model.score(X_test, y_test) 1054 | ``` 1055 | 1056 | 1057 | 1058 | 1059 | 0.93650793650793651 1060 | 1061 | 1062 | 1063 | -------------------------------------------------------------------------------- /YandexCatBoost-Demo.py: -------------------------------------------------------------------------------- 1 | 2 | # coding: utf-8 3 | 4 | # ## CI6227 Assignment ## 5 | 6 | # Installing the open source Yandex CatBoost package 7 | 8 | # In[2]: 9 | 10 | 11 | get_ipython().system(u'pip install catboost') 12 | 13 | 14 | # Importing the required packages: Numpy, Pandas, Matplotlib, Seaborn, Scikit-learn and CatBoost 15 | 16 | # In[3]: 17 | 18 | 19 | import numpy as np 20 | import pandas as pd 21 | import matplotlib.pyplot as plt 22 | # plt.style.use('ggplot') 23 | import seaborn as sns 24 | from catboost import Pool, CatBoostClassifier, cv, CatboostIpythonWidget 25 | from sklearn.preprocessing import MinMaxScaler 26 | from sklearn.feature_selection import VarianceThreshold 27 | 28 | 29 | # Loading the [IBM HR Dataset](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/data) into a pandas dataframe 30 | 31 | # In[4]: 32 | 33 | 34 | ibm_hr_df = pd.read_csv("/home/nbuser/library/IBM-HR-Employee-Attrition.csv") 35 | 36 | 37 | # ### Part 1a: Data Exploration - Summary Statistics ### 38 | # 39 | # Getting the summary statistics of the IBM HR dataset 40 | 41 | # In[5]: 42 | 43 | 44 | ibm_hr_df.describe() 45 | 46 | 47 | # Zooming in on the summary statistics of irrelevant attributes __*EmployeeCount*__ and __*StandardHours*__ 48 | 49 | # In[6]: 50 | 51 | 52 | irrList = ['EmployeeCount', 'StandardHours'] 53 | ibm_hr_df[irrList].describe() 54 | 55 | 56 | # Zooming in on the summary statistics of irrelevant attribute __*Over18*__ 57 | 58 | # In[7]: 59 | 60 | 61 | ibm_hr_df["Over18"].value_counts() 62 | 63 | 64 | # From the summary statistics, one could see that the attributes __*EmployeeCount*__, __*StandardHours*__ and __*Over18*__ hold only a single value for all of the 1470 records
65 | # 66 | # __*EmployeeCount*__ only holds a single value - 1.0
67 | # __*StandardHours*__ only holds a single value - 80.0
68 | # __*Over18*__ only holds a single value - 'Y'
# These irrelevant attributes are duly dropped from the dataset 71 | 72 | # ### Part 1b: Data Exploration - Missing Values and Duplicate Records ### 73 | # 74 | # Checking for 'NA' and missing values in the dataset. 75 | 76 | # In[8]: 77 | 78 | 79 | ibm_hr_df.isnull().sum(axis=0) 80 | 81 | 82 | # Well, we got lucky here: there aren't any missing values in this dataset 83 | # 84 | # Next, let's check for the existence of duplicate records in the dataset 85 | 86 | # In[9]: 87 | 88 | 89 | ibm_hr_df.duplicated().sum() 90 | 91 | 92 | # There are also no duplicate records in the dataset 93 | # 94 | # Converting the binary categorical attribute __*OverTime*__ to {1, 0} 95 | 96 | # In[10]: 97 | 98 | 99 | ibm_hr_df['OverTime'].replace(to_replace=dict(Yes=1, No=0), inplace=True) 100 | 101 | 102 | # ### Part 2a: Data Preprocessing - Removal of Irrelevant Attributes ### 103 | 104 | # In[12]: 105 | 106 | 107 | ibm_hr_df = ibm_hr_df.drop(['EmployeeCount', 'StandardHours', 'Over18'], axis=1) 108 | 109 | 110 | # ### Part 2b: Data Preprocessing - Feature Subset Selection - Low Variance Filter ### 111 | # 112 | # Performing variance analysis 113 | 114 | # Performing Pearson correlation analysis between attributes to aid in dimension reduction 115 | 116 | # In[15]: 117 | 118 | 119 | plt.figure(figsize=(16,16)) 120 | sns.heatmap(ibm_hr_df.corr(), annot=True, fmt=".2f") 121 | 122 | plt.show() 123 | 124 | 125 | # Performing variance analysis to aid in dimension reduction 126 | 127 | # In[16]: 128 | 129 | 130 | variance_x = ibm_hr_df.drop('Attrition', axis=1) 131 | variance_one_hot = pd.get_dummies(variance_x) 132 | 133 | 134 | # In[17]: 135 | 136 | 137 | #Normalise the dataset. This is required for getting the variance threshold 138 | scaler = MinMaxScaler() 139 | scaler.fit(variance_one_hot) 140 | MinMaxScaler(copy=True, feature_range=(0, 1)) 141 | scaled_variance_one_hot = scaler.transform(variance_one_hot) 142 | 143 | 144 | # In[18]: 145 | 146 | 147 | #Set the threshold values and run VarianceThreshold 148 | thres = .85* (1 - .85) 149 | sel = VarianceThreshold(threshold=thres) 150 | sel.fit(scaled_variance_one_hot) 151 | variance = sel.variances_ 152 | 153 | 154 | # In[19]: 155 | 156 | 157 | #Sort the variance scores in descending order for plotting 158 | indices = np.argsort(variance)[::-1] 159 | feature_list = list(variance_one_hot) 160 | sorted_feature_list = [] 161 | thres_list = [] 162 | for f in range(len(variance_one_hot.columns)): 163 | sorted_feature_list.append(feature_list[indices[f]]) 164 | thres_list.append(thres) 165 | 166 | 167 | # In[20]: 168 | 169 | 170 | plt.figure(figsize=(14,6)) 171 | plt.title("Feature Variance: %f" %(thres), fontsize = 14) 172 | plt.bar(range(len(variance_one_hot.columns)), variance[indices], color="c") 173 | plt.xticks(range(len(variance_one_hot.columns)), sorted_feature_list, rotation = 90) 174 | plt.xlim([-0.5, len(variance_one_hot.columns)]) 175 | plt.plot(range(len(variance_one_hot.columns)), thres_list, "k-", color="r") 176 | plt.tight_layout() 177 | plt.show() 178 | 179 | 180 | # Performing Pearson correlation analysis between attributes to aid in dimension reduction 181 | 182 | # ### Part 3 ### 183 | 184 | # In[21]: 185 | 186 | 187 | rAttrList = ['Department', 'OverTime', 'HourlyRate', 188 | 'StockOptionLevel', 'DistanceFromHome', 189 | 'YearsInCurrentRole', 'Age'] 190 | 191 | 192 | # In[22]: 193 | 194 | 195 | #keep only the attributes listed in rAttrList 196 | label_hr_df = ibm_hr_df[rAttrList] 197 | 198 | 199 | # In[23]: 200 | 201 | 202 | #convert continous
attribute DistanceFromHome to Catergorical 203 | #: 1: near, 2: mid distance, 3: far 204 | maxValues = label_hr_df['DistanceFromHome'].max() 205 | minValues = label_hr_df['DistanceFromHome'].min() 206 | intervals = (maxValues - minValues)/3 207 | bins = [0, (minValues + intervals), (maxValues - intervals), maxValues] 208 | groupName = [1, 2, 3] 209 | label_hr_df['CatDistanceFromHome'] = pd.cut(label_hr_df['DistanceFromHome'], bins, labels = groupName) 210 | 211 | 212 | # In[24]: 213 | 214 | 215 | # convert col type from cat to int64 216 | label_hr_df['CatDistanceFromHome'] = pd.to_numeric(label_hr_df['CatDistanceFromHome']) 217 | label_hr_df.drop(['DistanceFromHome'], axis = 1, inplace = True) 218 | 219 | 220 | # In[25]: 221 | 222 | 223 | #replace department into 0 & 1, 0: R&D, and 1: Non-R&D 224 | label_hr_df['Department'].replace(['Research & Development', 'Human Resources', 'Sales'], 225 | [0, 1, 1], inplace = True) 226 | 227 | 228 | # In[26]: 229 | 230 | 231 | #normalise data 232 | label_hr_df_norm = (label_hr_df - label_hr_df.min()) / (label_hr_df.max() - label_hr_df.min()) 233 | 234 | 235 | # In[27]: 236 | 237 | 238 | #create a data frame for the function value and class labels 239 | value_df = pd.DataFrame(columns = ['ClassValue']) 240 | 241 | 242 | # In[28]: 243 | 244 | 245 | #compute the class value 246 | for row in range (0, ibm_hr_df.shape[0]): 247 | if label_hr_df_norm['Department'][row] == 0: 248 | value = 0.3 * label_hr_df_norm['HourlyRate'][row] - 0.2 * label_hr_df_norm['OverTime'][row] + - 0.2 * label_hr_df_norm['CatDistanceFromHome'][row] + 0.15 * label_hr_df_norm['StockOptionLevel'][row] + 0.1 * label_hr_df_norm['Age'][row] - 0.05 * label_hr_df_norm['YearsInCurrentRole'][row] 249 | 250 | else: 251 | value = 0.2 * label_hr_df_norm['HourlyRate'][row] - 0.3 * label_hr_df_norm['OverTime'][row] + - 0.15 * label_hr_df_norm['CatDistanceFromHome'][row] + 0.2 * label_hr_df_norm['StockOptionLevel'][row] + 0.05 * label_hr_df_norm['Age'][row] - 0.1 * label_hr_df_norm['YearsInCurrentRole'][row] 252 | value_df.loc[row] = value 253 | 254 | 255 | # In[29]: 256 | 257 | 258 | # top 500 highest class value is satisfied with their job 259 | v1 = value_df.sort_values('ClassValue', ascending = False).reset_index(drop = True) ['ClassValue'][499] 260 | # next top 500 is neutral 261 | v2 = value_df.sort_values('ClassValue', ascending = False).reset_index(drop = True) ['ClassValue'][999] 262 | # rest is unsatisfied 263 | 264 | 265 | # In[30]: 266 | 267 | 268 | label_df = pd.DataFrame(columns = ['ClassLabel']) 269 | 270 | 271 | # In[31]: 272 | 273 | 274 | #compute the classlabel 275 | for row in range (0, value_df.shape[0]): 276 | if value_df['ClassValue'][row] >= v1: 277 | cat = "Satisfied" 278 | elif value_df['ClassValue'][row] >= v2: 279 | cat = "Neutral" 280 | else: 281 | cat = "Unsatisfied" 282 | label_df.loc[row] = cat 283 | 284 | 285 | # In[32]: 286 | 287 | 288 | df = pd.concat([ibm_hr_df, label_df], axis = 1) 289 | 290 | 291 | # ### Part 3: Classification with CatBoost ### 292 | 293 | # In[26]: 294 | 295 | 296 | #df = pd.read_csv("/home/nbuser/library/HR_dataset_generated_label.csv") 297 | 298 | 299 | # In[33]: 300 | 301 | 302 | df = df[['Age', 'Department', 'DistanceFromHome', 'HourlyRate', 'OverTime', 'StockOptionLevel', 303 | 'MaritalStatus', 'YearsInCurrentRole', 'EmployeeNumber', 'ClassLabel']] 304 | 305 | 306 | # Split dataset into attributes/features __*X*__ and label/class __*y*__ 307 | 308 | # In[34]: 309 | 310 | 311 | X = df.drop('ClassLabel', axis=1) 312 | y = df.ClassLabel 313 | 
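# Sketch (illustrative): sanity-check the generated class balance.
# By construction of v1/v2 above, roughly the top 500 ClassValues map to "Satisfied", the next 500
# to "Neutral" and the remaining ~470 records to "Unsatisfied"; ties at the cut-offs may shift the counts slightly.
print(y.value_counts())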
314 | 315 | # Replacing label/class value from __*'Satisfied'*__, __*'Neutral'*__ and *__'Unsatisfied'__* to *__2__*, __*1*__ and __*0*__ 316 | 317 | # In[35]: 318 | 319 | 320 | y.replace(to_replace=dict(Satisfied=2, Neutral=1, Unsatisfied=0), inplace=True) 321 | 322 | 323 | # Performing __'one hot encoding'__ method 324 | 325 | # In[36]: 326 | 327 | 328 | one_hot = pd.get_dummies(X) 329 | 330 | 331 | # Normalisation of features 332 | 333 | # In[37]: 334 | 335 | 336 | one_hot = (one_hot - one_hot.mean()) / (one_hot.max() - one_hot.min()) 337 | 338 | 339 | # In[38]: 340 | 341 | 342 | categorical_features_indices = np.where(one_hot.dtypes != np.float)[0] 343 | 344 | 345 | # ### Part 3a: Model training with CatBoost ### 346 | # Now lets split our data to train (70%) and test (30%) set: 347 | 348 | # In[39]: 349 | 350 | 351 | from sklearn.model_selection import train_test_split 352 | 353 | X_train, X_test, y_train, y_test = train_test_split(one_hot, y, train_size=0.7, random_state=1234) 354 | 355 | 356 | # In[44]: 357 | 358 | 359 | model = CatBoostClassifier( 360 | custom_loss = ['Accuracy'], 361 | random_seed = 100, 362 | loss_function = 'MultiClass' 363 | ) 364 | 365 | 366 | # In[51]: 367 | 368 | 369 | model.fit( 370 | X_train, y_train, 371 | cat_features = categorical_features_indices, 372 | verbose = True, # you can uncomment this for text output 373 | #plot = True 374 | ) 375 | 376 | 377 | # In[48]: 378 | 379 | 380 | feature_score = pd.DataFrame(list(zip(one_hot.dtypes.index, model.get_feature_importance(Pool(one_hot, label=y, cat_features=categorical_features_indices)))), 381 | columns=['Feature','Score']) 382 | feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last') 383 | 384 | 385 | # In[49]: 386 | 387 | 388 | plt.rcParams["figure.figsize"] = (12,7) 389 | ax = feature_score.plot('Feature', 'Score', kind='bar', color='c') 390 | ax.set_title("Catboost Feature Importance Ranking", fontsize = 14) 391 | ax.set_xlabel('') 392 | 393 | rects = ax.patches 394 | 395 | # get feature score as labels round to 2 decimal 396 | labels = feature_score['Score'].round(2) 397 | 398 | for rect, label in zip(rects, labels): 399 | height = rect.get_height() 400 | ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom') 401 | 402 | plt.show() 403 | 404 | 405 | # In[50]: 406 | 407 | 408 | model.score(X_test, y_test) 409 | 410 | 411 | # ### Part 4: CatBoost Classifier Tuning ### 412 | 413 | # In[40]: 414 | 415 | 416 | model = CatBoostClassifier( 417 | l2_leaf_reg = 3, 418 | iterations = 1000, 419 | fold_len_multiplier = 1.05, 420 | learning_rate = 0.05, 421 | custom_loss = ['Accuracy'], 422 | random_seed = 100, 423 | loss_function = 'MultiClass' 424 | ) 425 | 426 | 427 | # In[41]: 428 | 429 | 430 | model.fit( 431 | X_train, y_train, 432 | cat_features = categorical_features_indices, 433 | verbose = True, # you can uncomment this for text output 434 | #plot = True 435 | ) 436 | 437 | 438 | # In[42]: 439 | 440 | 441 | feature_score = pd.DataFrame(list(zip(one_hot.dtypes.index, model.get_feature_importance(Pool(one_hot, label=y, cat_features=categorical_features_indices)))), 442 | columns=['Feature','Score']) 443 | 444 | 445 | # In[43]: 446 | 447 | 448 | feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last') 449 | 450 | 451 | # In[44]: 452 | 453 | 454 | plt.rcParams["figure.figsize"] = (12,7) 455 | ax = feature_score.plot('Feature', 'Score', 
kind='bar', color='c') 456 | ax.set_title("Catboost Feature Importance Ranking", fontsize = 14) 457 | ax.set_xlabel('') 458 | 459 | rects = ax.patches 460 | 461 | # get feature score as labels round to 2 decimal 462 | labels = feature_score['Score'].round(2) 463 | 464 | for rect, label in zip(rects, labels): 465 | height = rect.get_height() 466 | ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom') 467 | 468 | plt.show() 469 | #plt.savefig("image.png") 470 | 471 | 472 | # In[61]: 473 | 474 | 475 | cm = pd.DataFrame() 476 | cm['Satisfaction'] = y_test 477 | cm['Predict'] = model.predict(X_test) 478 | 479 | 480 | # In[63]: 481 | 482 | 483 | mappingSatisfaction = {0:'Unsatisfied', 1: 'Neutral', 2: 'Satisfied'} 484 | mappingPredict = {0.0:'Unsatisfied', 1.0: 'Neutral', 2.0: 'Satisfied'} 485 | cm = cm.replace({'Satisfaction': mappingSatisfaction, 'Predict': mappingPredict}) 486 | 487 | 488 | # In[64]: 489 | 490 | 491 | pd.crosstab(cm['Satisfaction'], cm['Predict'], margins=True) 492 | 493 | 494 | # In[65]: 495 | 496 | 497 | model.score(X_test, y_test) 498 | 499 | -------------------------------------------------------------------------------- /images/output_24_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KwokHing/YandexCatBoost-Python-Demo/54e70cc5ccb39f0bcaf06a58aec5ac003fd9f985/images/output_24_0.png -------------------------------------------------------------------------------- /images/output_30_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KwokHing/YandexCatBoost-Python-Demo/54e70cc5ccb39f0bcaf06a58aec5ac003fd9f985/images/output_30_0.png -------------------------------------------------------------------------------- /images/output_69_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KwokHing/YandexCatBoost-Python-Demo/54e70cc5ccb39f0bcaf06a58aec5ac003fd9f985/images/output_69_0.png --------------------------------------------------------------------------------