├── IBM-HR-Employee-Attrition.csv ├── LICENSE ├── README.md ├── YandexCatBoost-Demo.ipynb ├── YandexCatBoost-Demo.py └── images ├── output_24_0.png ├── output_30_0.png └── output_69_0.png /LICENSE: -------------------------------------------------------------------------------- 1 | This is free and unencumbered software released into the public domain. 2 | 3 | Anyone is free to copy, modify, publish, use, compile, sell, or 4 | distribute this software, either in source code form or as a compiled 5 | binary, for any purpose, commercial or non-commercial, and by any 6 | means. 7 | 8 | In jurisdictions that recognize copyright laws, the author or authors 9 | of this software dedicate any and all copyright interest in the 10 | software to the public domain. We make this dedication for the benefit 11 | of the public at large and to the detriment of our heirs and 12 | successors. We intend this dedication to be an overt act of 13 | relinquishment in perpetuity of all present and future rights to this 14 | software under copyright law. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 19 | IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR 20 | OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 21 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR 22 | OTHER DEALINGS IN THE SOFTWARE. 23 | 24 | For more information, please refer to 25 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Exploration of Yandex CatBoost in Python ## 3 | 4 | This demo provides a brief introduction to 5 | - performing data exploration and preprocessing 6 | - feature subset selection: low variance filter 7 | - feature subset selection: high correlation filter 8 | - catboost model tuning 9 | - importance of data preprocessing: data normalization 10 | - exploration of catboost's feature importance ranking 11 | 12 | ## Getting started 13 | Open `YandexCatBoost-Demo.ipynb` in a Jupyter notebook environment or on Google Colab. The notebook contains further technical details. 14 | 15 | ## Future Improvements ## 16 | 17 | Results from the feature importance ranking show that the attribute ‘MaritalStatus' contributes minimally to class label prediction and could potentially be a noise attribute. Removing it might increase the model's accuracy, as outlined in the sketch below.
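A minimal sketch of how this could be checked, reusing the `df` and `y` constructed in Parts 3 and 4 of the walkthrough below together with the Part 6/7 classifier settings (illustrative only, not part of the original notebook):

```python
# Sketch: drop MaritalStatus, retrain with the same classifier settings, and compare test accuracy
import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

# assumes df (with ClassLabel) and the numeric y from the walkthrough below already exist
X_reduced = pd.get_dummies(df.drop(['ClassLabel', 'MaritalStatus'], axis=1))
X_reduced = (X_reduced - X_reduced.mean()) / (X_reduced.max() - X_reduced.min())  # same normalisation as Part 7

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, train_size=0.7, random_state=1234)

model = CatBoostClassifier(l2_leaf_reg=5, iterations=1000, fold_len_multiplier=1.1,
                           custom_loss=['Accuracy'], random_seed=100, loss_function='MultiClass')
model.fit(X_train, y_train, verbose=False)
print(model.score(X_test, y_test))  # compare against the ~0.94 accuracy obtained with MaritalStatus included (Part 7)
```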
18 | 19 | 20 | ## Code Walkthrough 21 | 22 | Installing the open source Yandex CatBoost package 23 | 24 | 25 | ```python 26 | pip install catboost 27 | ``` 28 | 29 | Importing the required packages: Numpy, Pandas, Matplotlib, Seaborn, Scikit-learn and CatBoost 30 | 31 | 32 | ```python 33 | import numpy as np 34 | import pandas as pd 35 | import matplotlib.pyplot as plt 36 | # plt.style.use('ggplot') 37 | import seaborn as sns 38 | from catboost import Pool, CatBoostClassifier, cv, CatboostIpythonWidget 39 | from sklearn.preprocessing import MinMaxScaler 40 | from sklearn.feature_selection import VarianceThreshold 41 | ``` 42 | 43 | Loading the [IBM HR Dataset](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/data) into a pandas dataframe 44 | 45 | 46 | ```python 47 | ibm_hr_df = pd.read_csv("IBM-HR-Employee-Attrition.csv") 48 | ``` 49 | 50 | ### Part 1a: Data Exploration - Summary Statistics ### 51 | 52 | Getting the summary statistics of the IBM HR dataset 53 | 54 | 55 | ```python 56 | ibm_hr_df.describe() 57 | ``` 58 | 59 | 60 | 61 | 62 |
63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 |
AgeDailyRateDistanceFromHomeEducationEmployeeCountEmployeeNumberEnvironmentSatisfactionHourlyRateJobInvolvementJobLevel...RelationshipSatisfactionStandardHoursStockOptionLevelTotalWorkingYearsTrainingTimesLastYearWorkLifeBalanceYearsAtCompanyYearsInCurrentRoleYearsSinceLastPromotionYearsWithCurrManager
count1470.0000001470.0000001470.0000001470.0000001470.01470.0000001470.0000001470.0000001470.0000001470.000000...1470.0000001470.01470.0000001470.0000001470.0000001470.0000001470.0000001470.0000001470.0000001470.000000
mean36.923810802.4857149.1925172.9129251.01024.8653062.72176965.8911562.7299322.063946...2.71224580.00.79387811.2795922.7993202.7612247.0081634.2292522.1877554.123129
std9.135373403.5091008.1068641.0241650.0602.0243351.09308220.3294280.7115611.106940...1.0812090.00.8520777.7807821.2892710.7064766.1265253.6231373.2224303.568136
min18.000000102.0000001.0000001.0000001.01.0000001.00000030.0000001.0000001.000000...1.00000080.00.0000000.0000000.0000001.0000000.0000000.0000000.0000000.000000
25%30.000000465.0000002.0000002.0000001.0491.2500002.00000048.0000002.0000001.000000...2.00000080.00.0000006.0000002.0000002.0000003.0000002.0000000.0000002.000000
50%36.000000802.0000007.0000003.0000001.01020.5000003.00000066.0000003.0000002.000000...3.00000080.01.00000010.0000003.0000003.0000005.0000003.0000001.0000003.000000
75%43.0000001157.00000014.0000004.0000001.01555.7500004.00000083.7500003.0000003.000000...4.00000080.01.00000015.0000003.0000003.0000009.0000007.0000003.0000007.000000
max60.0000001499.00000029.0000005.0000001.02068.0000004.000000100.0000004.0000005.000000...4.00000080.03.00000040.0000006.0000004.00000040.00000018.00000015.00000017.000000
285 |

8 rows × 26 columns

286 |
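As a side note, `describe()` only summarises the 26 numeric columns; a small sketch (using the same `ibm_hr_df`) for inspecting the categorical attributes as well:

```python
# Sketch: summary statistics of the non-numeric (object) columns
ibm_hr_df.describe(include=['object'])
```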
287 | 288 | 289 | 290 | Zooming in on the summary statistics of irrelevant attributes __*EmployeeCount*__ and __*StandardHours*__ 291 | 292 | 293 | ```python 294 | irrList = ['EmployeeCount', 'StandardHours'] 295 | ibm_hr_df[irrList].describe() 296 | ``` 297 | 298 | 299 | 300 | 301 |
302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 |
EmployeeCountStandardHours
count1470.01470.0
mean1.080.0
std0.00.0
min1.080.0
25%1.080.0
50%1.080.0
75%1.080.0
max1.080.0
353 |
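Both columns show a standard deviation of 0.0, i.e. they are constant and carry no information for prediction; the same check is automated later by the low variance filter in Part 2b.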
354 | 355 | 356 | 357 | Zooming in on the summary statistics of irrelevant attribute __*Over18*__ 358 | 359 | 360 | ```python 361 | ibm_hr_df["Over18"].value_counts() 362 | ``` 363 | 364 | 365 | 366 | 367 | Y 1470 368 | Name: Over18, dtype: int64 369 | 370 | 371 | 372 | From the summary statistics, one could see that the attributes __*EmployeeCount*__, __*StandardHours*__ and __*Over18*__ hold only a single value for all of the 1470 records
373 | 374 | __*EmployeeCount*__ only holds a single value - 1.0
375 | __*StandardHours*__ only holds a single value - 80.0
376 | __*Over18*__ only holds a single value - 'Y'
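The same observation can be confirmed programmatically; a small sketch (using the `ibm_hr_df` dataframe loaded above):

```python
# Sketch: list the columns that hold a single value across all 1470 records
constant_cols = [col for col in ibm_hr_df.columns if ibm_hr_df[col].nunique() == 1]
print(constant_cols)  # expected: ['EmployeeCount', 'Over18', 'StandardHours']
```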
377 | 378 | These irrelevant attributes are duly dropped from the dataset 379 | 380 | ### Part 1b: Data Exploration - Missing Values and Duplicate Records ### 381 | 382 | Checking for 'NA' and missing values in the dataset. 383 | 384 | 385 | ```python 386 | ibm_hr_df.isnull().sum(axis=0) 387 | ``` 388 | 389 | 390 | 391 | 392 | Age 0 393 | Attrition 0 394 | BusinessTravel 0 395 | DailyRate 0 396 | Department 0 397 | DistanceFromHome 0 398 | Education 0 399 | EducationField 0 400 | EmployeeCount 0 401 | EmployeeNumber 0 402 | EnvironmentSatisfaction 0 403 | Gender 0 404 | HourlyRate 0 405 | JobInvolvement 0 406 | JobLevel 0 407 | JobRole 0 408 | JobSatisfaction 0 409 | MaritalStatus 0 410 | MonthlyIncome 0 411 | MonthlyRate 0 412 | NumCompaniesWorked 0 413 | Over18 0 414 | OverTime 0 415 | PercentSalaryHike 0 416 | PerformanceRating 0 417 | RelationshipSatisfaction 0 418 | StandardHours 0 419 | StockOptionLevel 0 420 | TotalWorkingYears 0 421 | TrainingTimesLastYear 0 422 | WorkLifeBalance 0 423 | YearsAtCompany 0 424 | YearsInCurrentRole 0 425 | YearsSinceLastPromotion 0 426 | YearsWithCurrManager 0 427 | dtype: int64 428 | 429 | 430 | 431 | Well, we got lucky here: there aren't any missing values in this dataset 432 | 433 | Next, let's check for the existence of duplicate records in the dataset 434 | 435 | 436 | ```python 437 | ibm_hr_df.duplicated().sum() 438 | ``` 439 | 440 | 441 | 442 | 443 | 0 444 | 445 | 446 | 447 | There are also no duplicate records in the dataset 448 | 449 | Converting the binary categorical attribute __*OverTime*__ to {1, 0} 450 | 451 | 452 | ```python 453 | ibm_hr_df['OverTime'].replace(to_replace=dict(Yes=1, No=0), inplace=True) 454 | ``` 455 | 456 | ### Part 2a: Data Preprocessing - Removal of Irrelevant Attributes ### 457 | 458 | 459 | ```python 460 | ibm_hr_df = ibm_hr_df.drop(['EmployeeCount', 'StandardHours', 'Over18'], axis=1) 461 | ``` 462 | 463 | ### Part 2b: Data Preprocessing - Feature Subset Selection - Low Variance Filter ### 464 | 465 | Performing variance analysis to aid in feature selection 466 | 467 | 468 | ```python 469 | variance_x = ibm_hr_df.drop('Attrition', axis=1) 470 | variance_one_hot = pd.get_dummies(variance_x) 471 | ``` 472 | 473 | 474 | ```python 475 | #Normalise the dataset.
This is required for getting the variance threshold 476 | scaler = MinMaxScaler() 477 | scaler.fit(variance_one_hot) 478 | MinMaxScaler(copy=True, feature_range=(0, 1)) 479 | scaled_variance_one_hot = scaler.transform(variance_one_hot) 480 | ``` 481 | 482 | 483 | ```python 484 | #Set the threshold value and run VarianceThreshold (.85*(1-.85) = 0.1275, i.e. the variance of a boolean feature that takes the same value in 85% of the samples) 485 | thres = .85* (1 - .85) 486 | sel = VarianceThreshold(threshold=thres) 487 | sel.fit(scaled_variance_one_hot) 488 | variance = sel.variances_ 489 | ``` 490 | 491 | 492 | ```python 493 | #Sort the variance scores in descending order for plotting 494 | indices = np.argsort(variance)[::-1] 495 | feature_list = list(variance_one_hot) 496 | sorted_feature_list = [] 497 | thres_list = [] 498 | for f in range(len(variance_one_hot.columns)): 499 | sorted_feature_list.append(feature_list[indices[f]]) 500 | thres_list.append(thres) 501 | ``` 502 | 503 | 504 | ```python 505 | plt.figure(figsize=(14,6)) 506 | plt.title("Feature Variance: %f" %(thres), fontsize = 14) 507 | plt.bar(range(len(variance_one_hot.columns)), variance[indices], color="c") 508 | plt.xticks(range(len(variance_one_hot.columns)), sorted_feature_list, rotation = 90) 509 | plt.xlim([-0.5, len(variance_one_hot.columns)]) 510 | plt.plot(range(len(variance_one_hot.columns)), thres_list, "k-", color="r") 511 | plt.tight_layout() 512 | plt.show() 513 | ``` 514 | 515 | 516 | ![png](images/output_30_0.png) 517 | 518 | 519 | ### Part 2c: Data Preprocessing - Feature Subset Selection - High Correlation Filter ### 520 | 521 | Performing Pearson correlation analysis between attributes to aid in feature selection 522 | 523 | 524 | ```python 525 | plt.figure(figsize=(16,16)) 526 | sns.heatmap(ibm_hr_df.corr(), annot=True, fmt=".2f") 527 | 528 | plt.show() 529 | ``` 530 | 531 | 532 | ![png](images/output_24_0.png) 533 | 534 | 535 | 536 | ### Part 3: Multi-Class Label Generation ### 537 | 538 | 539 | ```python 540 | rAttrList = ['Department', 'OverTime', 'HourlyRate', 541 | 'StockOptionLevel', 'DistanceFromHome', 542 | 'YearsInCurrentRole', 'Age'] 543 | ``` 544 | 545 | 546 | ```python 547 | #keep only the attributes listed in rAttrList 548 | label_hr_df = ibm_hr_df[rAttrList].copy() # .copy() avoids a pandas SettingWithCopyWarning when CatDistanceFromHome is added below 549 | ``` 550 | 551 | 552 | ```python 553 | #convert continuous attribute DistanceFromHome to categorical 554 | #: 1: near, 2: mid distance, 3: far 555 | maxValues = label_hr_df['DistanceFromHome'].max() 556 | minValues = label_hr_df['DistanceFromHome'].min() 557 | intervals = (maxValues - minValues)/3 558 | bins = [0, (minValues + intervals), (maxValues - intervals), maxValues] 559 | groupName = [1, 2, 3] 560 | label_hr_df['CatDistanceFromHome'] = pd.cut(label_hr_df['DistanceFromHome'], bins, labels = groupName) 561 | ``` 562 | 563 | 564 | ```python 565 | # convert col type from cat to int64 566 | label_hr_df['CatDistanceFromHome'] = pd.to_numeric(label_hr_df['CatDistanceFromHome']) 567 | label_hr_df.drop(['DistanceFromHome'], axis = 1, inplace = True) 568 | ``` 569 | 570 | 571 | ```python 572 | #replace Department with 0 & 1, 0: R&D, and 1: Non-R&D 573 | label_hr_df['Department'].replace(['Research & Development', 'Human Resources', 'Sales'], 574 | [0, 1, 1], inplace = True) 575 | ``` 576 | 577 | 578 | ```python 579 | #normalise data 580 | label_hr_df_norm = (label_hr_df - label_hr_df.min()) / (label_hr_df.max() - label_hr_df.min()) 581 | ``` 582 | 583 | 584 | ```python 585 | #create a data frame for the function value and class labels 586 | value_df = pd.DataFrame(columns = ['ClassValue']) 587 | ``` 588 | 589 | 590 | ```python 591 | #compute the
class value 592 | for row in range (0, ibm_hr_df.shape[0]): 593 | if label_hr_df_norm['Department'][row] == 0: 594 | value = 0.3 * label_hr_df_norm['HourlyRate'][row] - 0.2 * label_hr_df_norm['OverTime'][row] + \ 595 | - 0.2 * label_hr_df_norm['CatDistanceFromHome'][row] + 0.15 * label_hr_df_norm['StockOptionLevel'][row] + \ 596 | 0.1 * label_hr_df_norm['Age'][row] - 0.05 * label_hr_df_norm['YearsInCurrentRole'][row] 597 | 598 | else: 599 | value = 0.2 * label_hr_df_norm['HourlyRate'][row] - 0.3 * label_hr_df_norm['OverTime'][row] + \ 600 | - 0.15 * label_hr_df_norm['CatDistanceFromHome'][row] + 0.2 * label_hr_df_norm['StockOptionLevel'][row] + \ 601 | 0.05 * label_hr_df_norm['Age'][row] - 0.1 * label_hr_df_norm['YearsInCurrentRole'][row] 602 | value_df.loc[row] = value 603 | ``` 604 | 605 | 606 | ```python 607 | # top 500 records by class value are labelled satisfied with their job 608 | v1 = value_df.sort_values('ClassValue', ascending = False).reset_index(drop = True)\ 609 | ['ClassValue'][499] 610 | # next 500 are neutral 611 | v2 = value_df.sort_values('ClassValue', ascending = False).reset_index(drop = True)\ 612 | ['ClassValue'][999] 613 | # rest are unsatisfied 614 | ``` 615 | 616 | 617 | ```python 618 | label_df = pd.DataFrame(columns = ['ClassLabel']) 619 | ``` 620 | 621 | 622 | ```python 623 | #compute the classlabel 624 | for row in range (0, value_df.shape[0]): 625 | if value_df['ClassValue'][row] >= v1: 626 | cat = "Satisfied" 627 | elif value_df['ClassValue'][row] >= v2: 628 | cat = "Neutral" 629 | else: 630 | cat = "Unsatisfied" 631 | label_df.loc[row] = cat 632 | ``` 633 | 634 | 635 | ```python 636 | df = pd.concat([ibm_hr_df, label_df], axis = 1) 637 | ``` 638 | 639 | ### Part 4: Classification with CatBoost ### 640 | 641 | 642 | ```python 643 | df = df[['Age', 'Department', 'DistanceFromHome', 'HourlyRate', 'OverTime', 'StockOptionLevel', 644 | 'MaritalStatus', 'YearsInCurrentRole', 'EmployeeNumber', 'ClassLabel']] 645 | ``` 646 | 647 | Split the dataset into attributes/features __*X*__ and label/class __*y*__ 648 | 649 | 650 | ```python 651 | X = df.drop('ClassLabel', axis=1) 652 | y = df.ClassLabel 653 | ``` 654 | 655 | Replacing the label/class values __*'Satisfied'*__, __*'Neutral'*__ and *__'Unsatisfied'__* with *__2__*, __*1*__ and __*0*__ 656 | 657 | 658 | ```python 659 | y.replace(to_replace=dict(Satisfied=2, Neutral=1, Unsatisfied=0), inplace=True) 660 | ``` 661 | 662 | Applying the __'one hot encoding'__ method 663 | 664 | 665 | ```python 666 | one_hot = pd.get_dummies(X) 667 | ``` 668 | 669 | ```python 670 | categorical_features_indices = np.where(one_hot.dtypes != float)[0] 671 | ``` 672 | 673 | ### Part 5: Model training with CatBoost ### 674 | Now let's split our data into train (70%) and test (30%) sets: 675 | 676 | 677 | ```python 678 | from sklearn.model_selection import train_test_split 679 | 680 | X_train, X_test, y_train, y_test = train_test_split(one_hot, y, train_size=0.7, random_state=1234) 681 | ``` 682 | 683 | 684 | ```python 685 | model = CatBoostClassifier( 686 | custom_loss = ['Accuracy'], 687 | random_seed = 100, 688 | loss_function = 'MultiClass' 689 | ) 690 | ``` 691 | 692 | 693 | ```python 694 | model.fit( 695 | X_train, y_train, 696 | cat_features = categorical_features_indices, 697 | verbose = True, # you can uncomment this for text output 698 | #plot = True 699 | ) 700 | ``` 701 | 702 | 703 | ```python 704 | cm = pd.DataFrame() 705 | cm['Satisfaction'] = y_test 706 | cm['Predict'] = model.predict(X_test) 707 | ``` 708 | 709 | 710 | ```python 711 |
mappingSatisfaction = {0:'Unsatisfied', 1: 'Neutral', 2: 'Satisfied'} 712 | mappingPredict = {0.0:'Unsatisfied', 1.0: 'Neutral', 2.0: 'Satisfied'} 713 | cm = cm.replace({'Satisfaction': mappingSatisfaction, 'Predict': mappingPredict}) 714 | ``` 715 | 716 | 717 | ```python 718 | pd.crosstab(cm['Satisfaction'], cm['Predict'], margins=True) 719 | ``` 720 | 721 | 722 | 723 | 724 |
725 | 726 | 727 | 728 | 729 | 730 | 731 | 732 | 733 | 734 | 735 | 736 | 737 | 738 | 739 | 740 | 741 | 742 | 743 | 744 | 745 | 746 | 747 | 748 | 749 | 750 | 751 | 752 | 753 | 754 | 755 | 756 | 757 | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 | 768 | 769 | 770 | 771 | 772 |
PredictNeutralSatisfiedUnsatisfiedAll
Satisfaction
Neutral14388159
Satisfied201231144
Unsatisfied180120138
All181131129441
773 |
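Reading off the diagonal of the confusion matrix above: (143 + 123 + 120) / 441 ≈ 0.875 of the test records are classified correctly, which matches the accuracy score computed below.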
774 | 775 | 776 | 777 | 778 | ```python 779 | model.score(X_test, y_test) 780 | ``` 781 | 782 | 783 | 784 | 785 | 0.87528344671201819 786 | 787 | 788 | ### Part 6: CatBoost Classifier Tuning ### 789 | 790 | 791 | ```python 792 | model = CatBoostClassifier( 793 | l2_leaf_reg = 5, 794 | iterations = 1000, 795 | fold_len_multiplier = 1.1, 796 | custom_loss = ['Accuracy'], 797 | random_seed = 100, 798 | loss_function = 'MultiClass' 799 | ) 800 | ``` 801 | 802 | 803 | ```python 804 | model.fit( 805 | X_train, y_train, 806 | cat_features = categorical_features_indices, 807 | verbose = True, # you can uncomment this for text output 808 | #plot = True 809 | ) 810 | ``` 811 | 812 | 813 | ```python 814 | cm = pd.DataFrame() 815 | cm['Satisfaction'] = y_test 816 | cm['Predict'] = model.predict(X_test) 817 | ``` 818 | 819 | 820 | ```python 821 | mappingSatisfaction = {0:'Unsatisfied', 1: 'Neutral', 2: 'Satisfied'} 822 | mappingPredict = {0.0:'Unsatisfied', 1.0: 'Neutral', 2.0: 'Satisfied'} 823 | cm = cm.replace({'Satisfaction': mappingSatisfaction, 'Predict': mappingPredict}) 824 | ``` 825 | 826 | 827 | ```python 828 | pd.crosstab(cm['Satisfaction'], cm['Predict'], margins=True) 829 | ``` 830 | 831 | 832 | 833 | 834 |
835 | 836 | 837 | 838 | 839 | 840 | 841 | 842 | 843 | 844 | 845 | 846 | 847 | 848 | 849 | 850 | 851 | 852 | 853 | 854 | 855 | 856 | 857 | 858 | 859 | 860 | 861 | 862 | 863 | 864 | 865 | 866 | 867 | 868 | 869 | 870 | 871 | 872 | 873 | 874 | 875 | 876 | 877 | 878 | 879 | 880 | 881 | 882 |
PredictNeutralSatisfiedUnsatisfiedAll
Satisfaction
Neutral14298159
Satisfied171261144
Unsatisfied120126138
All171135135441
883 |
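Again reading off the diagonal: (142 + 126 + 126) / 441 ≈ 0.894, matching the score below, a modest improvement over the untuned model.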
884 | 885 | 886 | 887 | 888 | ```python 889 | model.score(X_test, y_test) 890 | ``` 891 | 892 | 893 | 894 | 895 | 0.89342403628117917 896 | 897 | 898 | ### Part 7: Data Preprocessing: Attribute Value Normalization ### 899 | 900 | 901 | Normalization of features, after realizing that further tuning no longer improves the model's accuracy 902 | 903 | 904 | ```python 905 | one_hot = (one_hot - one_hot.mean()) / (one_hot.max() - one_hot.min()) 906 | ``` 907 | 908 | 909 | ```python 910 | categorical_features_indices = np.where(one_hot.dtypes != float)[0] 911 | ``` 912 | 913 | 914 | ```python 915 | from sklearn.model_selection import train_test_split 916 | 917 | X_train, X_test, y_train, y_test = train_test_split(one_hot, y, train_size=0.7, random_state=1234) 918 | ``` 919 | 920 | 921 | ```python 922 | model = CatBoostClassifier( 923 | l2_leaf_reg = 5, 924 | iterations = 1000, 925 | fold_len_multiplier = 1.1, 926 | custom_loss = ['Accuracy'], 927 | random_seed = 100, 928 | loss_function = 'MultiClass' 929 | ) 930 | ``` 931 | 932 | 933 | ```python 934 | model.fit( 935 | X_train, y_train, 936 | cat_features = categorical_features_indices, 937 | verbose = True, # you can uncomment this for text output 938 | #plot = True 939 | ) 940 | ``` 941 | 942 | 943 | ```python 944 | feature_score = pd.DataFrame(list(zip(one_hot.dtypes.index, model.get_feature_importance(Pool(one_hot, label=y, cat_features=categorical_features_indices)))), 945 | columns=['Feature','Score']) 946 | ``` 947 | 948 | 949 | ```python 950 | feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last') 951 | ``` 952 | 953 | 954 | ```python 955 | plt.rcParams["figure.figsize"] = (12,7) 956 | ax = feature_score.plot('Feature', 'Score', kind='bar', color='c') 957 | ax.set_title("Catboost Feature Importance Ranking", fontsize = 14) 958 | ax.set_xlabel('') 959 | 960 | rects = ax.patches 961 | 962 | # get feature scores as labels, rounded to 2 decimals 963 | labels = feature_score['Score'].round(2) 964 | 965 | for rect, label in zip(rects, labels): 966 | height = rect.get_height() 967 | ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom') 968 | 969 | plt.show() 970 | ``` 971 | 972 | 973 | ![png](images/output_69_0.png) 974 | 975 | 976 | 977 | ```python 978 | cm = pd.DataFrame() 979 | cm['Satisfaction'] = y_test 980 | cm['Predict'] = model.predict(X_test) 981 | ``` 982 | 983 | 984 | ```python 985 | mappingSatisfaction = {0:'Unsatisfied', 1: 'Neutral', 2: 'Satisfied'} 986 | mappingPredict = {0.0:'Unsatisfied', 1.0: 'Neutral', 2.0: 'Satisfied'} 987 | cm = cm.replace({'Satisfaction': mappingSatisfaction, 'Predict': mappingPredict}) 988 | ``` 989 | 990 | 991 | ```python 992 | pd.crosstab(cm['Satisfaction'], cm['Predict'], margins=True) 993 | ``` 994 | 995 | 996 | 997 | 998 |
999 | 1000 | 1001 | 1002 | 1003 | 1004 | 1005 | 1006 | 1007 | 1008 | 1009 | 1010 | 1011 | 1012 | 1013 | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 | 1024 | 1025 | 1026 | 1027 | 1028 | 1029 | 1030 | 1031 | 1032 | 1033 | 1034 | 1035 | 1036 | 1037 | 1038 | 1039 | 1040 | 1041 | 1042 | 1043 | 1044 | 1045 | 1046 |
PredictNeutralSatisfiedUnsatisfiedAll
Satisfaction
Neutral146112159
Satisfied71370144
Unsatisfied80130138
All161148132441
1047 |
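Diagonal of the final confusion matrix: (146 + 137 + 130) / 441 ≈ 0.937, matching the score below; of the three runs, attribute value normalization gives the largest single gain in accuracy.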
1048 | 1049 | 1050 | 1051 | 1052 | ```python 1053 | model.score(X_test, y_test) 1054 | ``` 1055 | 1056 | 1057 | 1058 | 1059 | 0.93650793650793651 1060 | 1061 | 1062 | 1063 | -------------------------------------------------------------------------------- /YandexCatBoost-Demo.py: -------------------------------------------------------------------------------- 1 | 2 | # coding: utf-8 3 | 4 | # ## CI6227 Assignment ## 5 | 6 | # Installing the open source Yandex CatBoost package 7 | 8 | # In[2]: 9 | 10 | 11 | get_ipython().system(u'pip install catboost') 12 | 13 | 14 | # Importing the required packages: Numpy, Pandas, Matplotlib, Seaborn, Scikit-learn and CatBoost 15 | 16 | # In[3]: 17 | 18 | 19 | import numpy as np 20 | import pandas as pd 21 | import matplotlib.pyplot as plt 22 | # plt.style.use('ggplot') 23 | import seaborn as sns 24 | from catboost import Pool, CatBoostClassifier, cv, CatboostIpythonWidget 25 | from sklearn.preprocessing import MinMaxScaler 26 | from sklearn.feature_selection import VarianceThreshold 27 | 28 | 29 | # Loading the [IBM HR Dataset](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset/data) into a pandas dataframe 30 | 31 | # In[4]: 32 | 33 | 34 | ibm_hr_df = pd.read_csv("/home/nbuser/library/IBM-HR-Employee-Attrition.csv") 35 | 36 | 37 | # ### Part 1a: Data Exploration - Summary Statistics ### 38 | # 39 | # Getting the summary statistics of the IBM HR dataset 40 | 41 | # In[5]: 42 | 43 | 44 | ibm_hr_df.describe() 45 | 46 | 47 | # Zooming in on the summary statistics of irrelevant attributes __*EmployeeCount*__ and __*StandardHours*__ 48 | 49 | # In[6]: 50 | 51 | 52 | irrList = ['EmployeeCount', 'StandardHours'] 53 | ibm_hr_df[irrList].describe() 54 | 55 | 56 | # Zooming in on the summary statistics of irrelevant attribute __*Over18*__ 57 | 58 | # In[7]: 59 | 60 | 61 | ibm_hr_df["Over18"].value_counts() 62 | 63 | 64 | # From the summary statistics, one could see that the attributes __*EmployeeCount*__, __*StandardHours*__ and __*Over18*__ hold only a single value for all of the 1470 records
65 | # 66 | # __*EmployeeCount*__ only holds a single value - 1.0
67 | # __*StandardHours*__ only holds a single value - 80.0
68 | # __*Over18*__ only holds a single value - 'Y'
# These irrelevant attributes are duly dropped from the dataset 71 | 72 | # ### Part 1b: Data Exploration - Missing Values and Duplicate Records ### 73 | # 74 | # Checking for 'NA' and missing values in the dataset. 75 | 76 | # In[8]: 77 | 78 | 79 | ibm_hr_df.isnull().sum(axis=0) 80 | 81 | 82 | # Well, we got lucky here: there aren't any missing values in this dataset 83 | # 84 | # Next, let's check for the existence of duplicate records in the dataset 85 | 86 | # In[9]: 87 | 88 | 89 | ibm_hr_df.duplicated().sum() 90 | 91 | 92 | # There are also no duplicate records in the dataset 93 | # 94 | # Converting the binary categorical attribute __*OverTime*__ to {1, 0} 95 | 96 | # In[10]: 97 | 98 | 99 | ibm_hr_df['OverTime'].replace(to_replace=dict(Yes=1, No=0), inplace=True) 100 | 101 | 102 | # ### Part 2a: Data Preprocessing - Removal of Irrelevant Attributes ### 103 | 104 | # In[12]: 105 | 106 | 107 | ibm_hr_df = ibm_hr_df.drop(['EmployeeCount', 'StandardHours', 'Over18'], axis=1) 108 | 109 | 110 | # ### Part 2b: Data Preprocessing - Feature Subset Selection - Low Variance Filter ### 111 | # 112 | # Performing variance analysis 113 | 114 | # Performing Pearson correlation analysis between attributes to aid in dimension reduction 115 | 116 | # In[15]: 117 | 118 | 119 | plt.figure(figsize=(16,16)) 120 | sns.heatmap(ibm_hr_df.corr(), annot=True, fmt=".2f") 121 | 122 | plt.show() 123 | 124 | 125 | # Performing variance analysis to aid in dimension reduction 126 | 127 | # In[16]: 128 | 129 | 130 | variance_x = ibm_hr_df.drop('Attrition', axis=1) 131 | variance_one_hot = pd.get_dummies(variance_x) 132 | 133 | 134 | # In[17]: 135 | 136 | 137 | #Normalise the dataset. This is required for getting the variance threshold 138 | scaler = MinMaxScaler() 139 | scaler.fit(variance_one_hot) 140 | MinMaxScaler(copy=True, feature_range=(0, 1)) 141 | scaled_variance_one_hot = scaler.transform(variance_one_hot) 142 | 143 | 144 | # In[18]: 145 | 146 | 147 | #Set the threshold values and run VarianceThreshold 148 | thres = .85* (1 - .85) 149 | sel = VarianceThreshold(threshold=thres) 150 | sel.fit(scaled_variance_one_hot) 151 | variance = sel.variances_ 152 | 153 | 154 | # In[19]: 155 | 156 | 157 | #Sort the variance scores in descending order for plotting 158 | indices = np.argsort(variance)[::-1] 159 | feature_list = list(variance_one_hot) 160 | sorted_feature_list = [] 161 | thres_list = [] 162 | for f in range(len(variance_one_hot.columns)): 163 | sorted_feature_list.append(feature_list[indices[f]]) 164 | thres_list.append(thres) 165 | 166 | 167 | # In[20]: 168 | 169 | 170 | plt.figure(figsize=(14,6)) 171 | plt.title("Feature Variance: %f" %(thres), fontsize = 14) 172 | plt.bar(range(len(variance_one_hot.columns)), variance[indices], color="c") 173 | plt.xticks(range(len(variance_one_hot.columns)), sorted_feature_list, rotation = 90) 174 | plt.xlim([-0.5, len(variance_one_hot.columns)]) 175 | plt.plot(range(len(variance_one_hot.columns)), thres_list, "k-", color="r") 176 | plt.tight_layout() 177 | plt.show() 178 | 179 | 180 | # Performing Pearson correlation analysis between attributes to aid in dimension reduction 181 | 182 | # ### Part 3 ### 183 | 184 | # In[21]: 185 | 186 | 187 | rAttrList = ['Department', 'OverTime', 'HourlyRate', 188 | 'StockOptionLevel', 'DistanceFromHome', 189 | 'YearsInCurrentRole', 'Age'] 190 | 191 | 192 | # In[22]: 193 | 194 | 195 | #keep only the attributes listed in rAttrList 196 | label_hr_df = ibm_hr_df[rAttrList] 197 | 198 | 199 | # In[23]: 200 | 201 | 202 | #convert continous
attribute DistanceFromHome to Catergorical 203 | #: 1: near, 2: mid distance, 3: far 204 | maxValues = label_hr_df['DistanceFromHome'].max() 205 | minValues = label_hr_df['DistanceFromHome'].min() 206 | intervals = (maxValues - minValues)/3 207 | bins = [0, (minValues + intervals), (maxValues - intervals), maxValues] 208 | groupName = [1, 2, 3] 209 | label_hr_df['CatDistanceFromHome'] = pd.cut(label_hr_df['DistanceFromHome'], bins, labels = groupName) 210 | 211 | 212 | # In[24]: 213 | 214 | 215 | # convert col type from cat to int64 216 | label_hr_df['CatDistanceFromHome'] = pd.to_numeric(label_hr_df['CatDistanceFromHome']) 217 | label_hr_df.drop(['DistanceFromHome'], axis = 1, inplace = True) 218 | 219 | 220 | # In[25]: 221 | 222 | 223 | #replace department into 0 & 1, 0: R&D, and 1: Non-R&D 224 | label_hr_df['Department'].replace(['Research & Development', 'Human Resources', 'Sales'], 225 | [0, 1, 1], inplace = True) 226 | 227 | 228 | # In[26]: 229 | 230 | 231 | #normalise data 232 | label_hr_df_norm = (label_hr_df - label_hr_df.min()) / (label_hr_df.max() - label_hr_df.min()) 233 | 234 | 235 | # In[27]: 236 | 237 | 238 | #create a data frame for the function value and class labels 239 | value_df = pd.DataFrame(columns = ['ClassValue']) 240 | 241 | 242 | # In[28]: 243 | 244 | 245 | #compute the class value 246 | for row in range (0, ibm_hr_df.shape[0]): 247 | if label_hr_df_norm['Department'][row] == 0: 248 | value = 0.3 * label_hr_df_norm['HourlyRate'][row] - 0.2 * label_hr_df_norm['OverTime'][row] + - 0.2 * label_hr_df_norm['CatDistanceFromHome'][row] + 0.15 * label_hr_df_norm['StockOptionLevel'][row] + 0.1 * label_hr_df_norm['Age'][row] - 0.05 * label_hr_df_norm['YearsInCurrentRole'][row] 249 | 250 | else: 251 | value = 0.2 * label_hr_df_norm['HourlyRate'][row] - 0.3 * label_hr_df_norm['OverTime'][row] + - 0.15 * label_hr_df_norm['CatDistanceFromHome'][row] + 0.2 * label_hr_df_norm['StockOptionLevel'][row] + 0.05 * label_hr_df_norm['Age'][row] - 0.1 * label_hr_df_norm['YearsInCurrentRole'][row] 252 | value_df.loc[row] = value 253 | 254 | 255 | # In[29]: 256 | 257 | 258 | # top 500 highest class value is satisfied with their job 259 | v1 = value_df.sort_values('ClassValue', ascending = False).reset_index(drop = True) ['ClassValue'][499] 260 | # next top 500 is neutral 261 | v2 = value_df.sort_values('ClassValue', ascending = False).reset_index(drop = True) ['ClassValue'][999] 262 | # rest is unsatisfied 263 | 264 | 265 | # In[30]: 266 | 267 | 268 | label_df = pd.DataFrame(columns = ['ClassLabel']) 269 | 270 | 271 | # In[31]: 272 | 273 | 274 | #compute the classlabel 275 | for row in range (0, value_df.shape[0]): 276 | if value_df['ClassValue'][row] >= v1: 277 | cat = "Satisfied" 278 | elif value_df['ClassValue'][row] >= v2: 279 | cat = "Neutral" 280 | else: 281 | cat = "Unsatisfied" 282 | label_df.loc[row] = cat 283 | 284 | 285 | # In[32]: 286 | 287 | 288 | df = pd.concat([ibm_hr_df, label_df], axis = 1) 289 | 290 | 291 | # ### Part 3: Classification with CatBoost ### 292 | 293 | # In[26]: 294 | 295 | 296 | #df = pd.read_csv("/home/nbuser/library/HR_dataset_generated_label.csv") 297 | 298 | 299 | # In[33]: 300 | 301 | 302 | df = df[['Age', 'Department', 'DistanceFromHome', 'HourlyRate', 'OverTime', 'StockOptionLevel', 303 | 'MaritalStatus', 'YearsInCurrentRole', 'EmployeeNumber', 'ClassLabel']] 304 | 305 | 306 | # Split dataset into attributes/features __*X*__ and label/class __*y*__ 307 | 308 | # In[34]: 309 | 310 | 311 | X = df.drop('ClassLabel', axis=1) 312 | y = df.ClassLabel 313 | 
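# Sketch (illustrative): sanity-check the generated class balance.
# By construction of v1/v2 above, roughly the top 500 ClassValues map to "Satisfied", the next 500
# to "Neutral" and the remaining ~470 records to "Unsatisfied"; ties at the cut-offs may shift the counts slightly.
print(y.value_counts())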
314 | 315 | # Replacing label/class value from __*'Satisfied'*__, __*'Neutral'*__ and *__'Unsatisfied'__* to *__2__*, __*1*__ and __*0*__ 316 | 317 | # In[35]: 318 | 319 | 320 | y.replace(to_replace=dict(Satisfied=2, Neutral=1, Unsatisfied=0), inplace=True) 321 | 322 | 323 | # Performing __'one hot encoding'__ method 324 | 325 | # In[36]: 326 | 327 | 328 | one_hot = pd.get_dummies(X) 329 | 330 | 331 | # Normalisation of features 332 | 333 | # In[37]: 334 | 335 | 336 | one_hot = (one_hot - one_hot.mean()) / (one_hot.max() - one_hot.min()) 337 | 338 | 339 | # In[38]: 340 | 341 | 342 | categorical_features_indices = np.where(one_hot.dtypes != np.float)[0] 343 | 344 | 345 | # ### Part 3a: Model training with CatBoost ### 346 | # Now lets split our data to train (70%) and test (30%) set: 347 | 348 | # In[39]: 349 | 350 | 351 | from sklearn.model_selection import train_test_split 352 | 353 | X_train, X_test, y_train, y_test = train_test_split(one_hot, y, train_size=0.7, random_state=1234) 354 | 355 | 356 | # In[44]: 357 | 358 | 359 | model = CatBoostClassifier( 360 | custom_loss = ['Accuracy'], 361 | random_seed = 100, 362 | loss_function = 'MultiClass' 363 | ) 364 | 365 | 366 | # In[51]: 367 | 368 | 369 | model.fit( 370 | X_train, y_train, 371 | cat_features = categorical_features_indices, 372 | verbose = True, # you can uncomment this for text output 373 | #plot = True 374 | ) 375 | 376 | 377 | # In[48]: 378 | 379 | 380 | feature_score = pd.DataFrame(list(zip(one_hot.dtypes.index, model.get_feature_importance(Pool(one_hot, label=y, cat_features=categorical_features_indices)))), 381 | columns=['Feature','Score']) 382 | feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last') 383 | 384 | 385 | # In[49]: 386 | 387 | 388 | plt.rcParams["figure.figsize"] = (12,7) 389 | ax = feature_score.plot('Feature', 'Score', kind='bar', color='c') 390 | ax.set_title("Catboost Feature Importance Ranking", fontsize = 14) 391 | ax.set_xlabel('') 392 | 393 | rects = ax.patches 394 | 395 | # get feature score as labels round to 2 decimal 396 | labels = feature_score['Score'].round(2) 397 | 398 | for rect, label in zip(rects, labels): 399 | height = rect.get_height() 400 | ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom') 401 | 402 | plt.show() 403 | 404 | 405 | # In[50]: 406 | 407 | 408 | model.score(X_test, y_test) 409 | 410 | 411 | # ### Part 4: CatBoost Classifier Tuning ### 412 | 413 | # In[40]: 414 | 415 | 416 | model = CatBoostClassifier( 417 | l2_leaf_reg = 3, 418 | iterations = 1000, 419 | fold_len_multiplier = 1.05, 420 | learning_rate = 0.05, 421 | custom_loss = ['Accuracy'], 422 | random_seed = 100, 423 | loss_function = 'MultiClass' 424 | ) 425 | 426 | 427 | # In[41]: 428 | 429 | 430 | model.fit( 431 | X_train, y_train, 432 | cat_features = categorical_features_indices, 433 | verbose = True, # you can uncomment this for text output 434 | #plot = True 435 | ) 436 | 437 | 438 | # In[42]: 439 | 440 | 441 | feature_score = pd.DataFrame(list(zip(one_hot.dtypes.index, model.get_feature_importance(Pool(one_hot, label=y, cat_features=categorical_features_indices)))), 442 | columns=['Feature','Score']) 443 | 444 | 445 | # In[43]: 446 | 447 | 448 | feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last') 449 | 450 | 451 | # In[44]: 452 | 453 | 454 | plt.rcParams["figure.figsize"] = (12,7) 455 | ax = feature_score.plot('Feature', 'Score', 
kind='bar', color='c') 456 | ax.set_title("Catboost Feature Importance Ranking", fontsize = 14) 457 | ax.set_xlabel('') 458 | 459 | rects = ax.patches 460 | 461 | # get feature score as labels round to 2 decimal 462 | labels = feature_score['Score'].round(2) 463 | 464 | for rect, label in zip(rects, labels): 465 | height = rect.get_height() 466 | ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom') 467 | 468 | plt.show() 469 | #plt.savefig("image.png") 470 | 471 | 472 | # In[61]: 473 | 474 | 475 | cm = pd.DataFrame() 476 | cm['Satisfaction'] = y_test 477 | cm['Predict'] = model.predict(X_test) 478 | 479 | 480 | # In[63]: 481 | 482 | 483 | mappingSatisfaction = {0:'Unsatisfied', 1: 'Neutral', 2: 'Satisfied'} 484 | mappingPredict = {0.0:'Unsatisfied', 1.0: 'Neutral', 2.0: 'Satisfied'} 485 | cm = cm.replace({'Satisfaction': mappingSatisfaction, 'Predict': mappingPredict}) 486 | 487 | 488 | # In[64]: 489 | 490 | 491 | pd.crosstab(cm['Satisfaction'], cm['Predict'], margins=True) 492 | 493 | 494 | # In[65]: 495 | 496 | 497 | model.score(X_test, y_test) 498 | 499 | -------------------------------------------------------------------------------- /images/output_24_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KwokHing/YandexCatBoost-Python-Demo/54e70cc5ccb39f0bcaf06a58aec5ac003fd9f985/images/output_24_0.png -------------------------------------------------------------------------------- /images/output_30_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KwokHing/YandexCatBoost-Python-Demo/54e70cc5ccb39f0bcaf06a58aec5ac003fd9f985/images/output_30_0.png -------------------------------------------------------------------------------- /images/output_69_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KwokHing/YandexCatBoost-Python-Demo/54e70cc5ccb39f0bcaf06a58aec5ac003fd9f985/images/output_69_0.png --------------------------------------------------------------------------------