├── Credit Risk Modeling.ipynb
├── List of dummy variables.txt
├── List of reference variables.txt
└── README.md

/Credit Risk Modeling.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consumer loans are the most typical retail product where credit risk modeling is applied."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PD (Probability of default): Logistic regression\n",
    "\n",
    "LGD (Loss given default): Beta regression\n",
    "\n",
    "EAD (Exposure at default): Beta regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PD needs a flag of whether the borrower defaulted or not (the loan_status column helps).\n",
    "\n",
    "LGD: How much of the loan was recovered after the borrower had defaulted (the recoveries column helps).\n",
    "\n",
    "EAD: Total exposure at the moment the borrower defaulted compared to the total exposure in the past (the total_rec_prncp column helps)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grade is the external agency's rating, given as letters from A to G.\n",
    "\n",
    "DTI is the debt-to-income ratio."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PD model: All independent features need to be categorical. We group multiple categories into one to reduce the number of categories, and we make continuous variables (annual income, number of credit inquiries in the last 6 months) discrete using dummy variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Fine classing:` Dividing the data into finite intervals (e.g., the number of months since the loan was granted can be grouped as less than 1, 1 to 3, 4 to 6, ...). The grouping of the intervals is based on whether adjacent categories discriminate between defaulted and non-defaulted borrowers. If they don't discriminate, they are merged into one."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Coarse classing:` After grouping based on discrimination between defaulted and non-defaulted borrowers, the final categories are obtained. There is no need for the intervals to be equal."
   ]
  },
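  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration of the two steps (a toy sketch, not part of this notebook's pipeline; the `months` and `default` columns are made up): `pd.cut` performs the fine classing, and the per-bin default rates suggest which adjacent bins to merge during coarse classing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Toy fine/coarse classing sketch (illustration only, made-up data)\n",
    "import pandas as pd\n",
    "\n",
    "toy = pd.DataFrame({'months': [0, 2, 5, 7, 11, 14, 20, 26, 33, 41],\n",
    "                    'default': [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]})\n",
    "\n",
    "#Fine classing: equal-width intervals\n",
    "toy['months_fine'] = pd.cut(toy['months'], 4)\n",
    "print(toy.groupby('months_fine')['default'].mean())\n",
    "\n",
    "#Coarse classing: adjacent bins with similar default rates merged into unequal intervals\n",
    "toy['months_coarse'] = pd.cut(toy['months'], bins=[-1, 10, 41])\n",
    "print(toy.groupby('months_coarse')['default'].mean())"
   ]
  },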
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Import libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Import Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_backup = pd.read_csv('loan.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's always good practice to store a copy of our data before making any changes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data = loan_data_backup.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Explore Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "loan_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Displaying `loan_data` shows only a few of the columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.options.display.max_columns = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hereafter, pandas displays all the columns of all objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "loan_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To view the first 5 records of the dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To view the last 5 records of a dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the names of all columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data.columns.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the datatype of all columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "loan_data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The object datatype is for text strings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# General Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessing a few continuous variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Term and emp_length are strings rather than numeric. Let's correct them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#To see the values that emp_length takes\n",
    "loan_data['emp_length'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the values contain the word 'years', the datatype is string. We need to get rid of this word as well as '+' and '<'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data['emp_length_int'] = loan_data['emp_length'].str.replace('\\+ years', '')\n",
    "loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('< 1 year', str(0))\n",
    "loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('n/a', str(0))\n",
    "loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' years', '')\n",
    "loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' year', '')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(loan_data['emp_length_int'][0]) #The type is still string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data['emp_length_int'] = pd.to_numeric(loan_data['emp_length_int']) #Converts a series into numeric type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(loan_data['emp_length_int'][0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocessing term variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data['term'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data['term_int'] = loan_data['term'].str.replace(' months', '')\n",
    "loan_data['term_int'] = loan_data['term_int'].str.replace(' ', '')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(loan_data['term_int'][0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data['term_int'] = pd.to_numeric(loan_data['term_int'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(loan_data['term_int'][0])"
   ]
  },
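  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, an equivalent one-line alternative (a sketch, not the approach used in this notebook): pull the digits out with a single regular expression via `str.extract`. Note one behavioural difference: it maps '< 1 year' to 1, whereas the chained replacements above map it to 0, and 'n/a' becomes NaN rather than 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Alternative sketch: extract the numeric part with one regex instead of chained replaces\n",
    "#Note: '< 1 year' becomes 1 here and 'n/a' becomes NaN\n",
    "emp_length_alt = loan_data['emp_length'].str.extract(r'(\\d+)', expand = False).astype(float)\n",
    "term_alt = loan_data['term'].str.extract(r'(\\d+)', expand = False).astype(float)"
   ]
  },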
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocessing date variables (earliest_cr_line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Earliest credit line should be of type date but is an object instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data['earliest_cr_line']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert this into the number of months that have passed since the given month and year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data['earliest_cr_line_date'] = pd.to_datetime(loan_data['earliest_cr_line'], format = '%b-%y')\n",
    "#%b indicates the first 3 letters of the month, %y indicates the last 2 digits of the year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(loan_data['earliest_cr_line_date'][0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to provide a reference date to calculate the number of months that have passed. Taking today's date is standard practice. Since this is older data, let's take December 1, 2017 as the reference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.to_datetime('2017-12-01') - loan_data['earliest_cr_line_date']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we prefer working with months, we could simply divide the number of days by 30. A better approach is to get the difference in months directly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take the difference and divide by `np.timedelta64(1, 'M')` to get it in months, and round to get a whole number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data['mths_since_earliest_cr_line'] = round(pd.to_numeric((pd.to_datetime('2017-12-01') - loan_data['earliest_cr_line_date']) / np.timedelta64(1, 'M')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Descriptive statistics\n",
    "loan_data['mths_since_earliest_cr_line'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The time difference cannot be negative. Let's find out what's wrong."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data.loc[: , ['earliest_cr_line', 'earliest_cr_line_date', 'mths_since_earliest_cr_line']][loan_data['mths_since_earliest_cr_line'] < 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The problem is that when converting the month-year strings, two-digit years such as 65 and 67 were interpreted as 2065 and 2067 instead of 1965 and 1967: with the '%y' format, two-digit years below 69 are parsed into the 2000s. Removing these rows would not affect our conclusions, but instead of removing them we will impute the values (substitute the maximum observed difference, as these dates are in the distant past, somewhere in the 60s)."
   ]
  },
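  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick check of this parsing behaviour (illustration only):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Two-digit years below 69 are parsed into the 2000s, which is what broke the 1960s dates\n",
    "pd.to_datetime('Jan-65', format = '%b-%y')  #Timestamp('2065-01-01 00:00:00')"
   ]
  },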
" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "loan_data['mths_since_earliest_cr_line'][loan_data['mths_since_earliest_cr_line'] < 0] = loan_data['mths_since_earliest_cr_line'].max()" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": null, 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "min(loan_data['mths_since_earliest_cr_line'])" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "### Preprocessing date variable (issue_date)" 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": null, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [ 526 | "loan_data['issue_d']" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": null, 532 | "metadata": {}, 533 | "outputs": [], 534 | "source": [ 535 | "loan_data['issue_d_date'] = pd.to_datetime(loan_data['issue_d'], format = '%b-%y')" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "type(loan_data['issue_d_date'][0])" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [ 553 | "pd.to_datetime('2017-12-01') - loan_data['issue_d_date']" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": null, 559 | "metadata": {}, 560 | "outputs": [], 561 | "source": [ 562 | "loan_data['mths_since_issue_d_date'] = round(pd.to_numeric((pd.to_datetime('2017-12-01') - loan_data['issue_d_date']) / np.timedelta64(1, 'M')))" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": null, 568 | "metadata": {}, 569 | "outputs": [], 570 | "source": [ 571 | "loan_data['mths_since_issue_d_date'].describe()" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "### Preprocessing few discrete variables" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "Dummy variables have to be created which are binary indicators: 1 if an observation belongs to a category, else 0.\n", 586 | "For example, gender. Dummy variables would be Male (1 for Male, 0 for Female) and Female (1 for Female, 0 for Male). The other dummy variable is redundant. \n", 587 | "\n", 588 | "`Conclusion:` 1 dummy variable is enough to represent 2 categories. k - 1 dummy variables for k categories" 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": {}, 594 | "source": [ 595 | "We will create a new dataframe for dummy variables. Concatenate this with `loan_data` dataframe" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "### Grade column" 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": null, 608 | "metadata": {}, 609 | "outputs": [], 610 | "source": [ 611 | "pd.get_dummies(loan_data['grade'])" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": {}, 617 | "source": [ 618 | "`get_dummies` gets number of dummy variables equal to number of categories. The names of the dummy variables are same as the categories." 
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#To make the names more descriptive:\n",
    "pd.get_dummies(loan_data['grade'], prefix = 'Grade', prefix_sep = ' : ')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_dummies = [pd.get_dummies(loan_data['grade'], prefix = 'Grade', prefix_sep = ' : '),\n",
    "                     pd.get_dummies(loan_data['sub_grade'], prefix = 'sub_grade', prefix_sep = ' : '),\n",
    "                     pd.get_dummies(loan_data['home_ownership'], prefix = 'home_ownership', prefix_sep = ' : '),\n",
    "                     pd.get_dummies(loan_data['verification_status'], prefix = 'verification_status', prefix_sep = ' : '),\n",
    "                     pd.get_dummies(loan_data['loan_status'], prefix = 'loan_status', prefix_sep = ' : '),\n",
    "                     pd.get_dummies(loan_data['purpose'], prefix = 'purpose', prefix_sep = ' : '),\n",
    "                     pd.get_dummies(loan_data['addr_state'], prefix = 'addr_state', prefix_sep = ' : '),\n",
    "                     pd.get_dummies(loan_data['initial_list_status'], prefix = 'initial_list_status', prefix_sep = ' : ')\n",
    "                     ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Concatenating the list of dummy dataframes into a single dataframe\n",
    "loan_data_dummies = pd.concat(loan_data_dummies, axis = 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(loan_data_dummies)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Appending this dataframe to the original one\n",
    "loan_data = pd.concat([loan_data, loan_data_dummies], axis = 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data.columns.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Check for missing values and clean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data.isnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Number of missing values in each column\n",
    "pd.options.display.max_rows = None\n",
    "loan_data.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Filling missing total revolving credit limits with the funded amount from the same row\n",
    "loan_data['total_rev_hi_lim'].fillna(loan_data['funded_amnt'], inplace = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data['total_rev_hi_lim'].isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#df_inputs_prepr['total_rev_hi_lim'].unique()"
   ]
  },
| "source": [ 747 | "#df_inputs_prepr['total_rev_hi_lim_factor'] = pd.cut(df_inputs_prepr['total_rev_hi_lim'], 50)" 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": null, 753 | "metadata": {}, 754 | "outputs": [], 755 | "source": [ 756 | "#df_temp = woe_ordered_continuous(df_inputs_prepr, df_targets_prepr)" 757 | ] 758 | }, 759 | { 760 | "cell_type": "code", 761 | "execution_count": null, 762 | "metadata": {}, 763 | "outputs": [], 764 | "source": [] 765 | }, 766 | { 767 | "cell_type": "markdown", 768 | "metadata": {}, 769 | "source": [ 770 | "### Preprocessing annual income" 771 | ] 772 | }, 773 | { 774 | "cell_type": "code", 775 | "execution_count": null, 776 | "metadata": {}, 777 | "outputs": [], 778 | "source": [ 779 | "loan_data['annual_inc'].isnull().sum()" 780 | ] 781 | }, 782 | { 783 | "cell_type": "code", 784 | "execution_count": null, 785 | "metadata": {}, 786 | "outputs": [], 787 | "source": [ 788 | "mean_annual_income = loan_data['annual_inc'].mean()\n", 789 | "loan_data['annual_inc'].fillna(mean_annual_income, inplace = True)" 790 | ] 791 | }, 792 | { 793 | "cell_type": "code", 794 | "execution_count": null, 795 | "metadata": {}, 796 | "outputs": [], 797 | "source": [ 798 | "loan_data['annual_inc'].isnull().sum()" 799 | ] 800 | }, 801 | { 802 | "cell_type": "code", 803 | "execution_count": null, 804 | "metadata": {}, 805 | "outputs": [], 806 | "source": [ 807 | "loan_data['mths_since_earliest_cr_line'].fillna(0, inplace = True)\n", 808 | "loan_data['acc_now_delinq'].fillna(0, inplace = True)\n", 809 | "loan_data['total_acc'].fillna(0, inplace = True)\n", 810 | "loan_data['pub_rec'].fillna(0, inplace = True)\n", 811 | "loan_data['open_acc'].fillna(0, inplace = True)\n", 812 | "loan_data['inq_last_6mths'].fillna(0, inplace = True)\n", 813 | "loan_data['delinq_2yrs'].fillna(0, inplace = True)\n", 814 | "loan_data['emp_length_int'].fillna(0, inplace = True)" 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": null, 820 | "metadata": {}, 821 | "outputs": [], 822 | "source": [ 823 | "loan_data['mths_since_earliest_cr_line'].isnull().sum()" 824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": null, 829 | "metadata": {}, 830 | "outputs": [], 831 | "source": [ 832 | "loan_data['acc_now_delinq'].isnull().sum()" 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": null, 838 | "metadata": {}, 839 | "outputs": [], 840 | "source": [ 841 | "loan_data['total_acc'].isnull().sum()" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": null, 847 | "metadata": {}, 848 | "outputs": [], 849 | "source": [ 850 | "loan_data['pub_rec'].isnull().sum()" 851 | ] 852 | }, 853 | { 854 | "cell_type": "code", 855 | "execution_count": null, 856 | "metadata": {}, 857 | "outputs": [], 858 | "source": [ 859 | "loan_data['open_acc'].isnull().sum()" 860 | ] 861 | }, 862 | { 863 | "cell_type": "code", 864 | "execution_count": null, 865 | "metadata": {}, 866 | "outputs": [], 867 | "source": [ 868 | "loan_data['inq_last_6mths'].isnull().sum()" 869 | ] 870 | }, 871 | { 872 | "cell_type": "code", 873 | "execution_count": null, 874 | "metadata": {}, 875 | "outputs": [], 876 | "source": [ 877 | "loan_data['delinq_2yrs'].isnull().sum()" 878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "execution_count": null, 883 | "metadata": {}, 884 | "outputs": [], 885 | "source": [ 886 | "loan_data['emp_length_int'].isnull().sum()" 887 | ] 888 | }, 889 | { 890 | "cell_type": "markdown", 891 | "metadata": 
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Credit risk calculation means calculating the expected loss.\n",
    "\n",
    "`Expected loss = Probability of default * Loss given default * Exposure at default`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Default: 0, non-default: 1.\n",
    "There are many definitions of default: payment more than 90 days overdue, fraud committed, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PD model: Logistic regression (LR). The dependent variable (output variable) ranges between 0 (default) and 1 (non-default)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LR estimates the relationship between the ln(odds) of the outcome variable and a linear combination of the independent variables.\n",
    "ln(odds) = ln(non-defaults / defaults)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The PD model should be easy to use and understand, so all independent variables in the PD model should be dummy variables. Discrete variables have already been converted into dummy variables; continuous variables should be converted too."
   ]
  },
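  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A tiny numeric illustration of that relationship (toy coefficients, purely hypothetical): the linear combination of inputs gives ln(odds), and the logistic transform turns it back into a probability of being good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Toy logistic-regression arithmetic: ln(odds) -> odds -> probability (made-up numbers)\n",
    "import numpy as np\n",
    "\n",
    "b0, b1 = 1.2, 0.8  #hypothetical intercept and coefficient\n",
    "x = 1              #a dummy variable that is 'on'\n",
    "\n",
    "log_odds = b0 + b1 * x      #ln(non-defaults / defaults) = 2.0\n",
    "odds = np.exp(log_odds)     #about 7.39 good borrowers per bad one\n",
    "p_good = odds / (1 + odds)  #about 0.88\n",
    "print(log_odds, odds, p_good)"
   ]
  },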
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PD model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data['loan_status'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Charged off:` The lender has written the loan off, considering it highly unlikely that the debt will be repaid.\n",
    "\n",
    "`Grace period:` A set length of time after the due date during which payment may be made without penalty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Number of people with each loan_status\n",
    "loan_data['loan_status'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to find the coefficients of the independent variables (logistic regression)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Proportion of each count\n",
    "loan_data['loan_status'].value_counts() / loan_data['loan_status'].count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Applying the loan default and non-default definition\n",
    "#np.where works like if/else\n",
    "#isin checks if values are in a list\n",
    "#2nd arg (0): if the condition is True, returns 0, else 1 (3rd arg)\n",
    "loan_data['good_bad'] = np.where(loan_data['loan_status'].isin(['Charged Off', 'Default', 'Does not meet the credit policy. Status:Charged Off',\n",
    "                                                                'Late (31-120 days)']), 0, 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "pd.options.display.max_rows = 10\n",
    "loan_data['good_bad']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Weight of evidence: WOE_i = ln(%good_i / %bad_i)\n",
    "\n",
    "%good_i is the number of good borrowers in category i divided by the total number of good borrowers.\n",
    "\n",
    "%bad_i is the number of bad borrowers in category i divided by the total number of bad borrowers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Weight of evidence shows to what extent an independent variable predicts the dependent variable. WOE is used to group multiple categories: categories with similar WOE are bundled together. The further WOE is from 0, the better the category differentiates good from bad borrowers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Information value: how much information an independent variable brings in explaining the dependent variable. It helps with the pre-selection of features. In practice, IV ranges from 0 to 1."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Information value = sum over all categories of ((%good_i - %bad_i) * WOE_i)`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scale of information value (PP means predictive power):\n",
    "\n",
    "    IV < 0.02: No PP\n",
    "    0.02 <= IV < 0.1: Weak PP\n",
    "    0.1 <= IV < 0.3: Medium PP\n",
    "    0.3 <= IV < 0.5: Strong PP\n",
    "    IV >= 0.5: Suspiciously high, too good to be true"
   ]
  },
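  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A worked toy example of the two formulas (made-up counts, for intuition only):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Toy WOE/IV arithmetic on made-up counts (illustration only)\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "toy = pd.DataFrame({'category': ['A', 'B'],\n",
    "                    'n_good': [600, 400],\n",
    "                    'n_bad': [45, 55]})\n",
    "toy['prop_n_good'] = toy['n_good'] / toy['n_good'].sum()  #0.60, 0.40\n",
    "toy['prop_n_bad'] = toy['n_bad'] / toy['n_bad'].sum()     #0.45, 0.55\n",
    "toy['WOE'] = np.log(toy['prop_n_good'] / toy['prop_n_bad'])  #about 0.29 and -0.32\n",
    "#IV of about 0.09: weak predictive power on the scale above\n",
    "toy['IV'] = ((toy['prop_n_good'] - toy['prop_n_bad']) * toy['WOE']).sum()\n",
    "toy"
   ]
  },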
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Splitting data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_inputs_train, loan_data_inputs_test, loan_data_targets_train, loan_data_targets_test = train_test_split(loan_data.drop('good_bad', axis = 1), loan_data['good_bad'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_inputs_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_targets_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_inputs_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_targets_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, sklearn splits train : test = 75% : 25% (349713 : 116572). To get an 80% : 20% split, use `test_size`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_inputs_train, loan_data_inputs_test, loan_data_targets_train, loan_data_targets_test = train_test_split(\n",
    "    loan_data.drop('good_bad', axis = 1), loan_data['good_bad'], test_size = 0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, `shuffle = True`, meaning sklearn shuffles the data, so every run gives a different split (and hence a different loss and accuracy). To get the same split every time we run, set `random_state`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_inputs_train, loan_data_inputs_test, loan_data_targets_train, loan_data_targets_test = train_test_split(\n",
    "    loan_data.drop('good_bad', axis = 1), loan_data['good_bad'], test_size = 0.2, random_state = 42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_inputs_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_targets_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_targets_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loan_data_inputs_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Preparation: An Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_inputs_prepr = loan_data_inputs_train\n",
    "df_targets_prepr = loan_data_targets_train\n",
    "\n",
    "#Assigning the test dataset for test dataset preprocessing\n",
    "#df_inputs_prepr = loan_data_inputs_test\n",
    "#df_targets_prepr = loan_data_targets_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_inputs_prepr.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_inputs_prepr['grade'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.options.display.max_columns = 5\n",
    "df1 = pd.concat([df_inputs_prepr['grade'], df_targets_prepr], axis = 1)\n",
    "df1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A = highest creditworthiness, G = lowest creditworthiness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Count of each grade\n",
    "df1.groupby(df1.columns.values[0], as_index=False)[df1.columns.values[1]].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Proportion of good borrowers = 1 - proportion of bad borrowers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Proportion of good for each grade\n",
    "df1.groupby(df1.columns.values[0], as_index=False)[df1.columns.values[1]].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Merging both of the above outputs into one dataframe\n",
    "df1 = pd.concat([df1.groupby(df1.columns.values[0], as_index=False)[df1.columns.values[1]].count(),\n",
    "                 df1.groupby(df1.columns.values[0], as_index=False)[df1.columns.values[1]].mean()], axis = 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can be certain that the concatenation has happened correctly, since the grade column matches for every row. Let's get rid of the second `grade` column using `df1.iloc[:, [0, 1, 3]]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Dropping the duplicate grade column and renaming the columns\n",
    "df1 = df1.iloc[:, [0,1,3]]\n",
    "df1.columns = [df1.columns.values[0], 'n_obs', 'prop_good']\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Calculating the proportion of observations\n",
    "df1['prop_n_obs'] = df1['n_obs'] / df1['n_obs'].sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Calculating the number of good and bad borrowers for each grade\n",
    "pd.options.display.max_columns = None\n",
    "df1['n_good'] = df1['prop_good'] * df1['n_obs']\n",
    "df1['n_bad'] = (1 - df1['prop_good']) * df1['n_obs']\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Calculating the proportions of good and bad\n",
    "df1['prop_n_good'] = df1['n_good'] / df1['n_good'].sum()\n",
    "df1['prop_n_bad'] = df1['n_bad'] / df1['n_bad'].sum()\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Calculating the weight of evidence\n",
    "df1['WOE'] = np.log(df1['prop_n_good'] / df1['prop_n_bad'])\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1 = df1.sort_values(['WOE'])\n",
    "df1 = df1.reset_index(drop=True)\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Calculating the absolute difference between adjacent rows\n",
    "df1['diff_prop_good'] = df1['prop_good'].diff().abs()\n",
    "df1['diff_WOE'] = df1['WOE'].diff().abs()\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Calculating Information Value (IV)\n",
    "df1['IV'] = (df1['prop_n_good'] - df1['prop_n_bad']) * df1['WOE']\n",
    "df1['IV'] = df1['IV'].sum()\n",
    "df1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "IV is the same for all rows, as it is the value for the grade variable overall."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing discrete variables: Automating calculations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def woe_discrete(df, discrete_variable_name, good_bad_variable_df):\n",
    "    df = pd.concat([df[discrete_variable_name], good_bad_variable_df], axis=1)\n",
    "    df = pd.concat([df.groupby(df.columns.values[0], as_index=False)[df.columns.values[1]].count(),\n",
    "                    df.groupby(df.columns.values[0], as_index=False)[df.columns.values[1]].mean()], axis=1)\n",
    "    df = df.iloc[:, [0,1,3]]\n",
    "    df.columns = [df.columns.values[0], 'n_obs', 'prop_good']\n",
    "    df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()\n",
    "    df['n_good'] = df['prop_good'] * df['n_obs']\n",
    "    df['n_bad'] = (1 - df['prop_good']) * df['n_obs']\n",
    "    df['prop_n_good'] = df['n_good'] / df['n_good'].sum()\n",
    "    df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()\n",
    "    df['WOE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])\n",
    "    df = df.sort_values(['WOE'])\n",
    "    df = df.reset_index(drop=True)\n",
    "    df['diff_prop_good'] = df['prop_good'].diff().abs()\n",
    "    df['diff_WOE'] = df['WOE'].diff().abs()\n",
    "    df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WOE']\n",
    "    df['IV'] = df['IV'].sum()\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_temp = woe_discrete(df_inputs_prepr, 'grade', df_targets_prepr)\n",
    "df_temp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing discrete variables: Visualizing results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "sns.set()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_by_woe(df_WOE, rotation_of_x_axis_labels = 0):\n",
    "    #Converting the independent variable categories into strings and making an array\n",
    "    x = np.array(df_WOE.iloc[:,0].apply(str))\n",
    "    y = df_WOE['WOE']\n",
    "    #width = 18 inches, height = 6 inches\n",
    "    plt.figure(figsize = (18,6))\n",
    "    plt.plot(x, y, marker = 'o', linestyle = '--', color = 'k') #marker='o': a dot for each point; dashed line; black color\n",
    "    plt.xlabel(df_WOE.columns[0])\n",
    "    plt.ylabel('Weight of Evidence')\n",
    "    plt.title(str('Weight of Evidence by ' + df_WOE.columns[0]))\n",
    "    plt.xticks(rotation = rotation_of_x_axis_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_by_woe(df_temp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The category with the lowest weight of evidence is the reference category."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_temp = woe_discrete(df_inputs_prepr, 'home_ownership', df_targets_prepr)\n",
    "df_temp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_by_woe(df_temp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing discrete variables: Creating dummy variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Home ownership variable"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since OTHER, NONE, and ANY have such a low number of observations, we will combine them with RENT into one dummy variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_inputs_prepr['home_ownership : RENT_OTHER_NONE_ANY'] = sum([df_inputs_prepr['home_ownership : RENT'],\n",
    "df_inputs_prepr['home_ownership : OTHER'], df_inputs_prepr['home_ownership : NONE'], df_inputs_prepr['home_ownership : ANY']])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Addr_state variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_inputs_prepr['addr_state'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "pd.options.display.max_rows = None\n",
    "df_temp = woe_discrete(df_inputs_prepr, 'addr_state', df_targets_prepr)\n",
    "df_temp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_by_woe(df_temp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "North Dakota (ND) is not present on the x-axis of the plot, since there are no borrowers from that state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#If there is no ND dummy column, create one filled with 0s; otherwise leave it as is\n",
    "if 'addr_state : ND' in df_inputs_prepr.columns.values:\n",
    "    pass\n",
    "else:\n",
    "    df_inputs_prepr['addr_state : ND'] = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nebraska (NE) and Iowa (IA) have the lowest WOE. The WOE of Maine (ME) and Idaho (ID) couldn't be calculated, as there are no bad borrowers from those states (the denominator is 0, so WOE = inf)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NE, IA, ME, and ID have a low number of observations, which might be the reason for their extreme WOE values. The graph suggests that for the other 46 states WOE is more or less the same (except for the first 2 and last 2)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Plotting without the first 2 and last 2 states\n",
    "plot_by_woe(df_temp.iloc[2:-2, :])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The earlier plot misled us into thinking that WOE is almost the same for the 46 states; now we see it from a different perspective. We can group the first 6 states plus ND (a state with no information is assumed to belong to the worst category) into one category, and the last 6 into another. Let's plot the remaining ones:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_by_woe(df_temp.iloc[6:-6, :])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Categories:\n",
    "* NE to AL + ND\n",
    "* NM, VA\n",
    "* NY (2nd highest)\n",
    "* OK to NC\n",
    "* CA (highest number of borrowers)\n",
    "* UT to NJ\n",
    "* AR to MN\n",
    "* RI to IN\n",
    "* GA to OR\n",
    "* WI, MT\n",
    "* TX (3rd highest)\n",
    "* IL, CT\n",
    "* KS to MS\n",
    "* WV to ID (low number of observations)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We only need to create new dummy variables when combining more than one state into a category, since the single-state dummy variables are already present."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Creating the combined-category dummy variables\n",
    "df_inputs_prepr['addr_state : ND_NE_IA_NV_FL_HI_AL'] = sum([df_inputs_prepr['addr_state : ND'], df_inputs_prepr['addr_state : NE'],\n",
    "df_inputs_prepr['addr_state : IA'], df_inputs_prepr['addr_state : NV'], df_inputs_prepr['addr_state : FL'],\n",
    "df_inputs_prepr['addr_state : HI'], df_inputs_prepr['addr_state : AL']])\n",
    "\n",
    "df_inputs_prepr['addr_state : NM_VA'] = sum([df_inputs_prepr['addr_state : NM'], df_inputs_prepr['addr_state : VA']])\n",
    "\n",
    "df_inputs_prepr['addr_state : OK_TN_MO_LA_MD_NC'] = sum([df_inputs_prepr['addr_state : OK'],\n",
    "df_inputs_prepr['addr_state : TN'], df_inputs_prepr['addr_state : MO'], df_inputs_prepr['addr_state : LA'],\n",
    "df_inputs_prepr['addr_state : MD'], df_inputs_prepr['addr_state : NC']])\n",
    "\n",
    "df_inputs_prepr['addr_state : UT_KY_AZ_NJ'] = sum([df_inputs_prepr['addr_state : UT'], df_inputs_prepr['addr_state : KY'],\n",
    "df_inputs_prepr['addr_state : AZ'], df_inputs_prepr['addr_state : NJ']])\n",
    "\n",
    "df_inputs_prepr['addr_state : AR_MI_PA_OH_MN'] = sum([df_inputs_prepr['addr_state : AR'], df_inputs_prepr['addr_state : MI'],\n",
    "df_inputs_prepr['addr_state : PA'], df_inputs_prepr['addr_state : OH'], df_inputs_prepr['addr_state : MN']])\n",
    "\n",
    "df_inputs_prepr['addr_state : RI_MA_DE_SD_IN'] = sum([df_inputs_prepr['addr_state : RI'], df_inputs_prepr['addr_state : MA'],\n",
    "df_inputs_prepr['addr_state : DE'], df_inputs_prepr['addr_state : SD'], df_inputs_prepr['addr_state : IN']])\n",
    "\n",
    "df_inputs_prepr['addr_state : GA_WA_OR'] = sum([df_inputs_prepr['addr_state : GA'], df_inputs_prepr['addr_state : WA'],\n",
    "df_inputs_prepr['addr_state : OR']])\n",
    "\n",
    "df_inputs_prepr['addr_state : WI_MT'] = sum([df_inputs_prepr['addr_state : WI'], df_inputs_prepr['addr_state : MT']])\n",
    "\n",
    "df_inputs_prepr['addr_state : IL_CT'] = sum([df_inputs_prepr['addr_state : IL'], df_inputs_prepr['addr_state : CT']])\n",
    "\n",
    "df_inputs_prepr['addr_state : KS_SC_CO_VT_AK_MS'] = sum([df_inputs_prepr['addr_state : KS'],\n",
    "df_inputs_prepr['addr_state : SC'], df_inputs_prepr['addr_state : CO'], df_inputs_prepr['addr_state : VT'],\n",
    "df_inputs_prepr['addr_state : AK'], df_inputs_prepr['addr_state : MS']])\n",
    "\n",
    "df_inputs_prepr['addr_state : WV_NH_WY_DC_ME_ID'] = sum([df_inputs_prepr['addr_state : WV'],\n",
    "df_inputs_prepr['addr_state : NH'], df_inputs_prepr['addr_state : WY'], df_inputs_prepr['addr_state : DC'],\n",
    "df_inputs_prepr['addr_state : ME'], df_inputs_prepr['addr_state : ID']])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Verification_status variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_temp = woe_discrete(df_inputs_prepr, 'verification_status', df_targets_prepr)\n",
    "df_temp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_by_woe(df_temp)"
   ]
  },
"cell_type": "markdown", 1752 | "metadata": {}, 1753 | "source": [ 1754 | "`Verification_status` should not be used in the PD model as it's IV is only approx. 0.02 which means it has no predictive power" 1755 | ] 1756 | }, 1757 | { 1758 | "cell_type": "markdown", 1759 | "metadata": {}, 1760 | "source": [ 1761 | "## Purpose variable" 1762 | ] 1763 | }, 1764 | { 1765 | "cell_type": "code", 1766 | "execution_count": null, 1767 | "metadata": {}, 1768 | "outputs": [], 1769 | "source": [ 1770 | "df_temp = woe_discrete(df_inputs_prepr, 'purpose', df_targets_prepr)\n", 1771 | "df_temp" 1772 | ] 1773 | }, 1774 | { 1775 | "cell_type": "code", 1776 | "execution_count": null, 1777 | "metadata": {}, 1778 | "outputs": [], 1779 | "source": [ 1780 | "plot_by_woe(df_temp, rotation_of_x_axis_labels = 90)" 1781 | ] 1782 | }, 1783 | { 1784 | "cell_type": "markdown", 1785 | "metadata": {}, 1786 | "source": [ 1787 | "`Purpose` should not be used in the PD model since it has weak predictive power since it's IV falls in the second category" 1788 | ] 1789 | }, 1790 | { 1791 | "cell_type": "code", 1792 | "execution_count": null, 1793 | "metadata": {}, 1794 | "outputs": [], 1795 | "source": [ 1796 | "df_inputs_prepr['purpose : SB_ED'] = sum([df_inputs_prepr['purpose : small_business'], df_inputs_prepr['purpose : educational']])\n", 1797 | "\n", 1798 | "df_inputs_prepr['purpose : HO_OT_RE_ME'] = sum([df_inputs_prepr['purpose : house'], df_inputs_prepr['purpose : other'],\n", 1799 | "df_inputs_prepr['purpose : renewable_energy'], df_inputs_prepr['purpose : medical']])\n", 1800 | "\n", 1801 | "df_inputs_prepr['purpose : WE_VA_DC'] = sum([df_inputs_prepr['purpose : wedding'], df_inputs_prepr['purpose : vacation'],\n", 1802 | "df_inputs_prepr['purpose : debt_consolidation']])\n", 1803 | "\n", 1804 | "df_inputs_prepr['purpose : HI_MP_CA_CC'] = sum([df_inputs_prepr['purpose : home_improvement'], df_inputs_prepr['purpose : major_purchase'],\n", 1805 | "df_inputs_prepr['purpose : car'], df_inputs_prepr['purpose : credit_card']])\n" 1806 | ] 1807 | }, 1808 | { 1809 | "cell_type": "markdown", 1810 | "metadata": {}, 1811 | "source": [ 1812 | "## initial_list_status variable" 1813 | ] 1814 | }, 1815 | { 1816 | "cell_type": "code", 1817 | "execution_count": null, 1818 | "metadata": {}, 1819 | "outputs": [], 1820 | "source": [ 1821 | "df_temp = woe_discrete(df_inputs_prepr, 'initial_list_status', df_targets_prepr)\n", 1822 | "df_temp" 1823 | ] 1824 | }, 1825 | { 1826 | "cell_type": "code", 1827 | "execution_count": null, 1828 | "metadata": {}, 1829 | "outputs": [], 1830 | "source": [ 1831 | "plot_by_woe(df_temp)" 1832 | ] 1833 | }, 1834 | { 1835 | "cell_type": "markdown", 1836 | "metadata": {}, 1837 | "source": [ 1838 | "`initial_list_status` should not be used in the PD model since it has no predictive power as it's IV is approx. 0.02" 1839 | ] 1840 | }, 1841 | { 1842 | "cell_type": "markdown", 1843 | "metadata": {}, 1844 | "source": [ 1845 | "# Preprocessing continuous variables: Automating calculations and visualizing results" 1846 | ] 1847 | }, 1848 | { 1849 | "cell_type": "markdown", 1850 | "metadata": {}, 1851 | "source": [ 1852 | "The same function used for discrete variables can be used for continuous variables. Discrete variables don't have qualitative comparison (so, sorted them by WOE). But continuous variables have a quantitative comparison (so,leave them in natural order). 
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def woe_ordered_continuous(df, discrete_variable_name, good_bad_variable_df):\n",
    "    df = pd.concat([df[discrete_variable_name], good_bad_variable_df], axis=1)\n",
    "    df = pd.concat([df.groupby(df.columns.values[0], as_index=False)[df.columns.values[1]].count(),\n",
    "                    df.groupby(df.columns.values[0], as_index=False)[df.columns.values[1]].mean()], axis=1)\n",
    "    df = df.iloc[:, [0,1,3]]\n",
    "    df.columns = [df.columns.values[0], 'n_obs', 'prop_good']\n",
    "    df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()\n",
    "    df['n_good'] = df['prop_good'] * df['n_obs']\n",
    "    df['n_bad'] = (1 - df['prop_good']) * df['n_obs']\n",
    "    df['prop_n_good'] = df['n_good'] / df['n_good'].sum()\n",
    "    df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()\n",
    "    df['WOE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])\n",
    "    df['diff_prop_good'] = df['prop_good'].diff().abs()\n",
    "    df['diff_WOE'] = df['WOE'].diff().abs()\n",
    "    df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WOE']\n",
    "    df['IV'] = df['IV'].sum()\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_temp = woe_ordered_continuous(df_inputs_prepr, 'term_int', df_targets_prepr)\n",
    "df_temp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_by_woe(df_temp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "60-month loans are much riskier. We create a dummy for each term: 1 if the loan has that term, else 0."
   ]
  },
Else, 0" 1904 | ] 1905 | }, 1906 | { 1907 | "cell_type": "code", 1908 | "execution_count": null, 1909 | "metadata": {}, 1910 | "outputs": [], 1911 | "source": [ 1912 | "df_inputs_prepr['term : 36'] = np.where((df_inputs_prepr['term_int'] == 36), 1, 0)\n", 1913 | "df_inputs_prepr['term : 60'] = np.where((df_inputs_prepr['term_int'] == 60), 1, 0)" 1914 | ] 1915 | }, 1916 | { 1917 | "cell_type": "markdown", 1918 | "metadata": {}, 1919 | "source": [ 1920 | "## emp_length variable" 1921 | ] 1922 | }, 1923 | { 1924 | "cell_type": "code", 1925 | "execution_count": null, 1926 | "metadata": {}, 1927 | "outputs": [], 1928 | "source": [ 1929 | "df_inputs_prepr['emp_length_int'].unique()" 1930 | ] 1931 | }, 1932 | { 1933 | "cell_type": "code", 1934 | "execution_count": null, 1935 | "metadata": {}, 1936 | "outputs": [], 1937 | "source": [ 1938 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'emp_length_int', df_targets_prepr)\n", 1939 | "df_temp" 1940 | ] 1941 | }, 1942 | { 1943 | "cell_type": "code", 1944 | "execution_count": null, 1945 | "metadata": {}, 1946 | "outputs": [], 1947 | "source": [ 1948 | "plot_by_woe(df_temp)" 1949 | ] 1950 | }, 1951 | { 1952 | "cell_type": "code", 1953 | "execution_count": null, 1954 | "metadata": {}, 1955 | "outputs": [], 1956 | "source": [ 1957 | "df_inputs_prepr['emp_length : 0'] = np.where(df_inputs_prepr['emp_length_int'].isin([0]), 1, 0)\n", 1958 | "df_inputs_prepr['emp_length : 1'] = np.where(df_inputs_prepr['emp_length_int'].isin([1]), 1, 0)\n", 1959 | "df_inputs_prepr['emp_length : 2-4'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(2,5)), 1, 0)\n", 1960 | "df_inputs_prepr['emp_length : 5-6'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(5,7)), 1, 0)\n", 1961 | "df_inputs_prepr['emp_length : 7-9'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(7,10)), 1, 0)\n", 1962 | "df_inputs_prepr['emp_length : 10'] = np.where(df_inputs_prepr['emp_length_int'].isin([10]), 1, 0)" 1963 | ] 1964 | }, 1965 | { 1966 | "cell_type": "markdown", 1967 | "metadata": {}, 1968 | "source": [ 1969 | "## mths_since_issue_date variable" 1970 | ] 1971 | }, 1972 | { 1973 | "cell_type": "code", 1974 | "execution_count": null, 1975 | "metadata": {}, 1976 | "outputs": [], 1977 | "source": [ 1978 | "df_inputs_prepr['mths_since_issue_d_date'].unique()" 1979 | ] 1980 | }, 1981 | { 1982 | "cell_type": "code", 1983 | "execution_count": null, 1984 | "metadata": {}, 1985 | "outputs": [], 1986 | "source": [ 1987 | "#We want to divide the above values into 50 categories since it's easy to work with < 50 categories\n", 1988 | "#Fine classing\n", 1989 | "df_inputs_prepr['mths_since_issue_d_date_factor'] = pd.cut(df_inputs_prepr['mths_since_issue_d_date'], 50)\n", 1990 | "df_inputs_prepr['mths_since_issue_d_date_factor']" 1991 | ] 1992 | }, 1993 | { 1994 | "cell_type": "markdown", 1995 | "metadata": {}, 1996 | "source": [ 1997 | "Each interval is from greater than 1st no. and less than or equal to 2nd no." 
2000 | { 2001 | "cell_type": "code", 2002 | "execution_count": null, 2003 | "metadata": {}, 2004 | "outputs": [], 2005 | "source": [ 2006 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'mths_since_issue_d_date_factor', df_targets_prepr)\n", 2007 | "df_temp" 2008 | ] 2009 | }, 2010 | { 2011 | "cell_type": "code", 2012 | "execution_count": null, 2013 | "metadata": {}, 2014 | "outputs": [], 2015 | "source": [ 2016 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2017 | ] 2018 | }, 2019 | { 2020 | "cell_type": "markdown", 2021 | "metadata": {}, 2022 | "source": [ 2023 | "Since the first 3 categories have a much higher WOE than the rest, let's re-plot without them" 2024 | ] 2025 | }, 2026 | { 2027 | "cell_type": "code", 2028 | "execution_count": null, 2029 | "metadata": {}, 2030 | "outputs": [], 2031 | "source": [ 2032 | "plot_by_woe(df_temp.iloc[3:, :], rotation_of_x_axis_labels=90)" 2033 | ] 2034 | }, 2035 | { 2036 | "cell_type": "markdown", 2037 | "metadata": {}, 2038 | "source": [ 2039 | "When the plot zig-zags like this, always check the number of observations: erratic WOE usually comes from sparsely populated categories." 2040 | ] 2041 | }, 2042 | { 2043 | "cell_type": "code", 2044 | "execution_count": null, 2045 | "metadata": {}, 2046 | "outputs": [], 2047 | "source": [ "#Coarse classing: we band on the underlying numeric months variable -- the factor column holds Interval objects, which .isin(range(...)) would never match\n", 2048 | "df_inputs_prepr['mths_since_issue_d_date_factor : <38'] = np.where(df_inputs_prepr['mths_since_issue_d_date'] < 38, 1, 0)\n", 2049 | "df_inputs_prepr['mths_since_issue_d_date_factor : 38-39'] = np.where(df_inputs_prepr['mths_since_issue_d_date'].isin(range(38,40)), 1, 0)\n", 2050 | "df_inputs_prepr['mths_since_issue_d_date_factor : 40-41'] = np.where(df_inputs_prepr['mths_since_issue_d_date'].isin(range(40,42)), 1, 0)\n", 2051 | "df_inputs_prepr['mths_since_issue_d_date_factor : 42-48'] = np.where(df_inputs_prepr['mths_since_issue_d_date'].isin(range(42,49)), 1, 0)\n", 2052 | "df_inputs_prepr['mths_since_issue_d_date_factor : 49-52'] = np.where(df_inputs_prepr['mths_since_issue_d_date'].isin(range(49,53)), 1, 0)\n", 2053 | "df_inputs_prepr['mths_since_issue_d_date_factor : 53-64'] = np.where(df_inputs_prepr['mths_since_issue_d_date'].isin(range(53,65)), 1, 0)\n", 2054 | "df_inputs_prepr['mths_since_issue_d_date_factor : 65-84'] = np.where(df_inputs_prepr['mths_since_issue_d_date'].isin(range(65,85)), 1, 0)\n", 2055 | "df_inputs_prepr['mths_since_issue_d_date_factor : >84'] = np.where(df_inputs_prepr['mths_since_issue_d_date'] > 84, 1, 0)" 2056 | ] 2057 | }, 2058 | { 2059 | "cell_type": "markdown", 2060 | "metadata": {}, 2061 | "source": [ 2062 | "## Interest rate variable" 2063 | ] 2064 | }, 2065 | { 2066 | "cell_type": "code", 2067 | "execution_count": null, 2068 | "metadata": {}, 2069 | "outputs": [], 2070 | "source": [ 2071 | "df_inputs_prepr['int_rate_factor'] = pd.cut(df_inputs_prepr['int_rate'], 50)" 2072 | ] 2073 | }, 2074 | { 2075 | "cell_type": "code", 2076 | "execution_count": null, 2077 | "metadata": {}, 2078 | "outputs": [], 2079 | "source": [ 2080 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'int_rate_factor' , df_targets_prepr)\n", 2081 | "df_temp" 2082 | ] 2083 | }, 2084 | { 2085 | "cell_type": "code", 2086 | "execution_count": null, 2087 | "metadata": {}, 2088 | "outputs": [], 2089 | "source": [ 2090 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2091 | ] 2092 | }, 2093 | { 2094 | "cell_type": "markdown", 2095 | "metadata": {}, 2096 | "source": [ 2097 | "The higher the interest 
rate, the lower the WOE and the higher the probability of default." 2098 | ] 2099 | }, 2100 | { 2101 | "cell_type": "code", 2102 | "execution_count": null, 2103 | "metadata": {}, 2104 | "outputs": [], 2105 | "source": [ 2106 | "df_inputs_prepr['int_rate : <9.548'] = np.where((df_inputs_prepr['int_rate'] <= 9.548), 1, 0)\n", 2107 | "df_inputs_prepr['int_rate : 9.548-12.025'] = np.where((df_inputs_prepr['int_rate'] > 9.548) & (df_inputs_prepr['int_rate'] <= 12.025), 1, 0)\n", 2108 | "df_inputs_prepr['int_rate : 12.025-15.74'] = np.where((df_inputs_prepr['int_rate'] > 12.025) & (df_inputs_prepr['int_rate'] <= 15.74), 1, 0)\n", 2109 | "df_inputs_prepr['int_rate : 15.74-20.281'] = np.where((df_inputs_prepr['int_rate'] > 15.74) & (df_inputs_prepr['int_rate'] <= 20.281), 1, 0)\n", 2110 | "df_inputs_prepr['int_rate : >20.281'] = np.where((df_inputs_prepr['int_rate'] > 20.281), 1, 0)" 2111 | ] 2112 | }, 2113 | { 2114 | "cell_type": "markdown", 2115 | "metadata": {}, 2116 | "source": [ 2117 | "## funded_amnt_factor variable" 2118 | ] 2119 | }, 2120 | { 2121 | "cell_type": "code", 2122 | "execution_count": null, 2123 | "metadata": {}, 2124 | "outputs": [], 2125 | "source": [ 2126 | "df_inputs_prepr['funded_amnt_factor'] = pd.cut(df_inputs_prepr['funded_amnt'], 50)" 2127 | ] 2128 | }, 2129 | { 2130 | "cell_type": "code", 2131 | "execution_count": null, 2132 | "metadata": { 2133 | "scrolled": true 2134 | }, 2135 | "outputs": [], 2136 | "source": [ 2137 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'funded_amnt_factor', df_targets_prepr)\n", 2138 | "df_temp" 2139 | ] 2140 | }, 2141 | { 2142 | "cell_type": "code", 2143 | "execution_count": null, 2144 | "metadata": {}, 2145 | "outputs": [], 2146 | "source": [ 2147 | "plot_by_woe(df_temp, 90)" 2148 | ] 2149 | }, 2150 | { 2151 | "cell_type": "markdown", 2152 | "metadata": {}, 2153 | "source": [ 2154 | "The WOE of `funded_amnt_factor` jumps around with no clear trend: there seems to be no systematic association between WOE and `funded_amnt_factor`. So, we won't use this variable in our PD model." 2155 | ] 2156 | },
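The coarse-classing dummies are written one `np.where` line at a time throughout this notebook. A small helper in the same spirit could generate them from a sorted list of cut-points; this is only a sketch (the name `make_band_dummies` and the demo frame are made up, not part of the notebook):

```python
import pandas as pd

def make_band_dummies(df, column, cut_points):
    # One dummy per band: <= first cut-point, each (lo, hi] band, > last cut-point.
    bands = {f'{column} : <={cut_points[0]}': (df[column] <= cut_points[0]).astype(int)}
    for lo, hi in zip(cut_points[:-1], cut_points[1:]):
        bands[f'{column} : {lo}-{hi}'] = ((df[column] > lo) & (df[column] <= hi)).astype(int)
    bands[f'{column} : >{cut_points[-1]}'] = (df[column] > cut_points[-1]).astype(int)
    return pd.DataFrame(bands, index=df.index)

# Demo on a throwaway frame, reusing the int_rate cut-points from above:
demo = pd.DataFrame({'int_rate': [5.0, 11.0, 14.0, 25.0]})
print(make_band_dummies(demo, 'int_rate', [9.548, 12.025, 15.74, 20.281]))
```

Because the bands come from one list, each row gets exactly one band equal to 1, which removes the risk of hand-typed gaps or overlaps.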
2157 | { 2158 | "cell_type": "markdown", 2159 | "metadata": {}, 2160 | "source": [ 2161 | "## mths_since_earliest_cr_line variable" 2162 | ] 2163 | }, 2164 | { 2165 | "cell_type": "code", 2166 | "execution_count": null, 2167 | "metadata": { 2168 | "scrolled": true 2169 | }, 2170 | "outputs": [], 2171 | "source": [ 2172 | "df_inputs_prepr['mths_since_earliest_cr_line'].unique()" 2173 | ] 2174 | }, 2175 | { 2176 | "cell_type": "code", 2177 | "execution_count": null, 2178 | "metadata": {}, 2179 | "outputs": [], 2180 | "source": [ 2181 | "df_inputs_prepr['mths_since_earliest_cr_line_factor'] = pd.cut(df_inputs_prepr['mths_since_earliest_cr_line'], 50)\n", 2182 | "df_inputs_prepr['mths_since_earliest_cr_line_factor']" 2183 | ] 2184 | }, 2185 | { 2186 | "cell_type": "code", 2187 | "execution_count": null, 2188 | "metadata": { 2189 | "scrolled": true 2190 | }, 2191 | "outputs": [], 2192 | "source": [ 2193 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'mths_since_earliest_cr_line_factor', df_targets_prepr)\n", 2194 | "df_temp" 2195 | ] 2196 | }, 2197 | { 2198 | "cell_type": "code", 2199 | "execution_count": null, 2200 | "metadata": {}, 2201 | "outputs": [], 2202 | "source": [ 2203 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2204 | ] 2205 | }, 2206 | { 2207 | "cell_type": "code", 2208 | "execution_count": null, 2209 | "metadata": {}, 2210 | "outputs": [], 2211 | "source": [ "#We band on the numeric variable (the factor column holds Interval objects) and close the gaps and the range(141-270) typo in the original ranges\n", 2212 | "df_inputs_prepr['mths_since_earliest_cr_line_factor : <70'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'] < 70, 1, 0)\n", 2213 | "df_inputs_prepr['mths_since_earliest_cr_line_factor : 70-93'] = np.where((df_inputs_prepr['mths_since_earliest_cr_line'] >= 70) & (df_inputs_prepr['mths_since_earliest_cr_line'] <= 93), 1, 0)\n", 2214 | "df_inputs_prepr['mths_since_earliest_cr_line_factor : 94-140'] = np.where((df_inputs_prepr['mths_since_earliest_cr_line'] >= 94) & (df_inputs_prepr['mths_since_earliest_cr_line'] <= 140), 1, 0)\n", 2215 | "df_inputs_prepr['mths_since_earliest_cr_line_factor : 141-270'] = np.where((df_inputs_prepr['mths_since_earliest_cr_line'] >= 141) & (df_inputs_prepr['mths_since_earliest_cr_line'] <= 270), 1, 0)\n", 2216 | "df_inputs_prepr['mths_since_earliest_cr_line_factor : 271-352'] = np.where((df_inputs_prepr['mths_since_earliest_cr_line'] >= 271) & (df_inputs_prepr['mths_since_earliest_cr_line'] <= 352), 1, 0)\n", 2217 | "df_inputs_prepr['mths_since_earliest_cr_line_factor : 353-410'] = np.where((df_inputs_prepr['mths_since_earliest_cr_line'] >= 353) & (df_inputs_prepr['mths_since_earliest_cr_line'] <= 410), 1, 0)\n", 2218 | "df_inputs_prepr['mths_since_earliest_cr_line_factor : 411-563'] = np.where((df_inputs_prepr['mths_since_earliest_cr_line'] >= 411) & (df_inputs_prepr['mths_since_earliest_cr_line'] <= 563), 1, 0)\n", 2219 | "df_inputs_prepr['mths_since_earliest_cr_line_factor : >563'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'] > 563, 1, 0)" 2220 | ] 2221 | }, 2222 | { 2223 | "cell_type": "markdown", 2224 | "metadata": {}, 2225 | "source": [ 2226 | "## Installment variable" 2227 | ] 2228 | }, 2229 | { 2230 | "cell_type": "code", 2231 | "execution_count": null, 2232 | "metadata": {}, 2233 | "outputs": [], 2234 | "source": [ 2235 | "unique_installment = []\n", 2236 | "for value in df_inputs_prepr['installment']:\n", 2237 | "    if value not in unique_installment:\n", 2238 | "        unique_installment.append(value)" 2239 | ] 2240 | }, 2241 | { 2242 | "cell_type": "code", 2243 | "execution_count": null, 2244 | "metadata": {}, 2245 | "outputs": [], 2246 | "source": [ 2247 | "len(unique_installment)" 2248 | ] 2249 | },
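As an aside, pandas has built-ins that avoid the manual loop; a quick sketch (it assumes the dataframe above):

```python
import pandas as pd

print(df_inputs_prepr['installment'].nunique())   # number of distinct values
values = df_inputs_prepr['installment'].unique()  # ndarray of the distinct values
pd.options.display.max_seq_items = None           # stop pandas eliding long sequences with '...'
```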
2250 | { 2251 | "cell_type": "markdown", 2252 | "metadata": {}, 2253 | "source": [ 2254 | "The .unique() method elides long outputs with an ellipsis, which is why I collected the values with a for loop and an if statement above; len() then gives the number of unique values." 2255 | ] 2256 | }, 2257 | { 2258 | "cell_type": "code", 2259 | "execution_count": null, 2260 | "metadata": {}, 2261 | "outputs": [], 2262 | "source": [ 2263 | "df_inputs_prepr['installment_factor'] = pd.cut(df_inputs_prepr['installment'], 50)" 2264 | ] 2265 | }, 2266 | { 2267 | "cell_type": "code", 2268 | "execution_count": null, 2269 | "metadata": { 2270 | "scrolled": true 2271 | }, 2272 | "outputs": [], 2273 | "source": [ 2274 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'installment_factor', df_targets_prepr)\n", 2275 | "df_temp" 2276 | ] 2277 | }, 2278 | { 2279 | "cell_type": "code", 2280 | "execution_count": null, 2281 | "metadata": {}, 2282 | "outputs": [], 2283 | "source": [ 2284 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2285 | ] 2286 | }, 2287 | { 2288 | "cell_type": "code", 2289 | "execution_count": null, 2290 | "metadata": {}, 2291 | "outputs": [], 2292 | "source": [ "#installment is a float variable, so we band it with inequalities rather than .isin(range(...)), which would skip non-integer values\n", 2293 | "df_inputs_prepr['installment_factor : <183'] = np.where(df_inputs_prepr['installment'] < 183, 1, 0)\n", 2294 | "df_inputs_prepr['installment_factor : 183-266'] = np.where((df_inputs_prepr['installment'] >= 183) & (df_inputs_prepr['installment'] <= 266), 1, 0)\n", 2295 | "df_inputs_prepr['installment_factor : 267-517'] = np.where((df_inputs_prepr['installment'] > 266) & (df_inputs_prepr['installment'] <= 517), 1, 0)\n", 2296 | "df_inputs_prepr['installment_factor : 518-601'] = np.where((df_inputs_prepr['installment'] > 517) & (df_inputs_prepr['installment'] <= 601), 1, 0)\n", 2297 | "df_inputs_prepr['installment_factor : 602-880'] = np.where((df_inputs_prepr['installment'] > 601) & (df_inputs_prepr['installment'] <= 880), 1, 0)\n", 2298 | "df_inputs_prepr['installment_factor : 881-963'] = np.where((df_inputs_prepr['installment'] > 880) & (df_inputs_prepr['installment'] <= 963), 1, 0)\n", 2299 | "df_inputs_prepr['installment_factor : 964-1075'] = np.where((df_inputs_prepr['installment'] > 963) & (df_inputs_prepr['installment'] <= 1075), 1, 0)\n", 2300 | "df_inputs_prepr['installment_factor : 1076-1242'] = np.where((df_inputs_prepr['installment'] > 1075) & (df_inputs_prepr['installment'] <= 1242), 1, 0)\n", 2301 | "df_inputs_prepr['installment_factor : >1242'] = np.where(df_inputs_prepr['installment'] > 1242, 1, 0)" 2302 | ] 2303 | }, 2304 | { 2305 | "cell_type": "markdown", 2306 | "metadata": {}, 2307 | "source": [ 2308 | "## delinq_2yrs variable" 2309 | ] 2310 | }, 2311 | { 2312 | "cell_type": "code", 2313 | "execution_count": null, 2314 | "metadata": {}, 2315 | "outputs": [], 2316 | "source": [ 2317 | "df_inputs_prepr['delinq_2yrs'].unique()" 2318 | ] 2319 | }, 2320 | { 2321 | "cell_type": "markdown", 2322 | "metadata": {}, 2323 | "source": [ 2324 | "Since there are only a few distinct values, there is no need for `pd.cut()`" 2325 | ] 2326 | }, 2327 | { 2328 | "cell_type": "code", 2329 | "execution_count": null, 2330 | "metadata": { 2331 | "scrolled": true 2332 | }, 2333 | "outputs": [], 2334 | "source": [ 2335 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'delinq_2yrs', df_targets_prepr)\n", 2336 | "df_temp" 2337 | ] 2338 | }, 2339 | { 2340 | "cell_type": "code", 2341 | "execution_count": null, 2342 | "metadata": {}, 2343 | "outputs": [], 2344 | "source": [ 2345 | "plot_by_woe(df_temp.iloc[:15,:])" 2346 | ] 2347 | }, 2348 | { 2349 | "cell_type": "markdown", 2350 | "metadata": {}, 2351 | "source": [ 2352 | "To get rid of rows with 
WOE `inf`, one way is to keep the remaining rows by naming their indices explicitly:\n", 2353 | "    \n", 2354 | "    df_temp = df_temp.iloc[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,20], :]" 2355 | ] 2356 | }, 2357 | { 2358 | "cell_type": "code", 2359 | "execution_count": null, 2360 | "metadata": {}, 2361 | "outputs": [], 2362 | "source": [ 2363 | "df_temp" 2364 | ] 2365 | }, 2366 | { 2367 | "cell_type": "code", 2368 | "execution_count": null, 2369 | "metadata": {}, 2370 | "outputs": [], 2371 | "source": [ 2372 | "plot_by_woe(df_temp)" 2373 | ] 2374 | }, 2375 | { 2376 | "cell_type": "code", 2377 | "execution_count": null, 2378 | "metadata": {}, 2379 | "outputs": [], 2380 | "source": [ "#Comparisons cover every value; the original range(0,3) missed the value 3\n", 2381 | "df_inputs_prepr['delinq_2yrs : <4'] = np.where(df_inputs_prepr['delinq_2yrs'] < 4, 1, 0)\n", 2382 | "df_inputs_prepr['delinq_2yrs : >=4'] = np.where(df_inputs_prepr['delinq_2yrs'] >= 4, 1, 0)" 2383 | ] 2384 | }, 2385 | { 2386 | "cell_type": "markdown", 2387 | "metadata": {}, 2388 | "source": [ 2389 | "## inq_last_6mths variable" 2390 | ] 2391 | }, 2392 | { 2393 | "cell_type": "code", 2394 | "execution_count": null, 2395 | "metadata": {}, 2396 | "outputs": [], 2397 | "source": [ 2398 | "df_inputs_prepr['inq_last_6mths'].unique()" 2399 | ] 2400 | }, 2401 | { 2402 | "cell_type": "code", 2403 | "execution_count": null, 2404 | "metadata": { 2405 | "scrolled": true 2406 | }, 2407 | "outputs": [], 2408 | "source": [ 2409 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'inq_last_6mths', df_targets_prepr)\n", 2410 | "df_temp" 2411 | ] 2412 | }, 2413 | { 2414 | "cell_type": "markdown", 2415 | "metadata": {}, 2416 | "source": [ 2417 | "There are rows with `inf` or `-inf` as WOE. To get rid of those, filter as below (the earlier approach is lengthier, as it requires naming the row indices explicitly)" 2418 | ] 2419 | }, 2420 | { 2421 | "cell_type": "code", 2422 | "execution_count": null, 2423 | "metadata": {}, 2424 | "outputs": [], 2425 | "source": [ 2426 | "df_temp = df_temp[(df_temp['WOE'] != float('inf')) & (df_temp['WOE'] != float('-inf'))]\n", 2427 | "df_temp" 2428 | ] 2429 | }, 2430 | { 2431 | "cell_type": "code", 2432 | "execution_count": null, 2433 | "metadata": {}, 2434 | "outputs": [], 2435 | "source": [ 2436 | "plot_by_woe(df_temp)" 2437 | ] 2438 | }, 2439 | { 2440 | "cell_type": "code", 2441 | "execution_count": null, 2442 | "metadata": {}, 2443 | "outputs": [], 2444 | "source": [ "#The top band starts at 7: the original '>=6' version overlapped the 4-6 band at the value 6\n", 2445 | "df_inputs_prepr['inq_last_6mths : 0-3'] = np.where(df_inputs_prepr['inq_last_6mths'].isin(range(0,4)), 1, 0)\n", 2446 | "df_inputs_prepr['inq_last_6mths : 4-6'] = np.where(df_inputs_prepr['inq_last_6mths'].isin(range(4,7)), 1, 0)\n", 2447 | "df_inputs_prepr['inq_last_6mths : >=7'] = np.where(df_inputs_prepr['inq_last_6mths'] >= 7, 1, 0)" 2448 | ] 2449 | },
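The `inf`/`-inf` WOE values arise whenever a fine class contains no good or no bad borrowers, so one of the proportions in the log-ratio is zero. Instead of dropping such rows, a common alternative (not used in this notebook; shown only as a sketch) is to smooth the counts before taking the log:

```python
import numpy as np

def smoothed_woe(n_good, n_bad, total_good, total_bad, eps=0.5):
    # Adding a small constant to each cell keeps empty categories finite:
    # a large-but-finite WOE instead of +/-inf (a Laplace-style adjustment).
    prop_good = (n_good + eps) / (total_good + 2 * eps)
    prop_bad = (n_bad + eps) / (total_bad + 2 * eps)
    return np.log(prop_good / prop_bad)

# A category with 12 goods and 0 bads (out of 1000 goods / 100 bads overall)
# now yields ~0.93 rather than +inf:
print(smoothed_woe(12, 0, 1000, 100))
```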
2450 | { 2451 | "cell_type": "markdown", 2452 | "metadata": {}, 2453 | "source": [ 2454 | "## open_acc variable" 2455 | ] 2456 | }, 2457 | { 2458 | "cell_type": "code", 2459 | "execution_count": null, 2460 | "metadata": {}, 2461 | "outputs": [], 2462 | "source": [ 2463 | "df_inputs_prepr['open_acc'].unique()" 2464 | ] 2465 | }, 2466 | { 2467 | "cell_type": "code", 2468 | "execution_count": null, 2469 | "metadata": {}, 2470 | "outputs": [], 2471 | "source": [ 2472 | "df_inputs_prepr['open_acc_factor'] = pd.cut(df_inputs_prepr['open_acc'], 50)\n", 2473 | "df_inputs_prepr['open_acc_factor']" 2474 | ] 2475 | }, 2476 | { 2477 | "cell_type": "code", 2478 | "execution_count": null, 2479 | "metadata": { 2480 | "scrolled": true 2481 | }, 2482 | "outputs": [], 2483 | "source": [ 2484 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'open_acc_factor', df_targets_prepr)\n", 2485 | "df_temp" 2486 | ] 2487 | }, 2488 | { 2489 | "cell_type": "code", 2490 | "execution_count": null, 2491 | "metadata": {}, 2492 | "outputs": [], 2493 | "source": [ 2494 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2495 | ] 2496 | }, 2497 | { 2498 | "cell_type": "markdown", 2499 | "metadata": {}, 2500 | "source": [ 2501 | "There is no need to remove `inf`, `-inf` and `NaN` here, as the plotting function itself ignores those observations" 2502 | ] 2503 | }, 2504 | { 2505 | "cell_type": "code", 2506 | "execution_count": null, 2507 | "metadata": {}, 2508 | "outputs": [], 2509 | "source": [ "#We test the numeric column; the original version tested the factor column, and its range(0) was empty, so the '0' dummy could never fire\n", 2510 | "df_inputs_prepr['open_acc_factor : 0'] = np.where(df_inputs_prepr['open_acc'] == 0, 1, 0)\n", 2511 | "df_inputs_prepr['open_acc_factor : 1-3'] = np.where(df_inputs_prepr['open_acc'].isin(range(1,3)), 1, 0)\n", 2512 | "df_inputs_prepr['open_acc_factor : 3-11'] = np.where(df_inputs_prepr['open_acc'].isin(range(3,12)), 1, 0)\n", 2513 | "df_inputs_prepr['open_acc_factor : 12-25'] = np.where(df_inputs_prepr['open_acc'].isin(range(12,26)), 1, 0)\n", 2514 | "df_inputs_prepr['open_acc_factor : 26-33'] = np.where(df_inputs_prepr['open_acc'].isin(range(26,34)), 1, 0)\n", 2515 | "df_inputs_prepr['open_acc_factor : >=34'] = np.where(df_inputs_prepr['open_acc'] >= 34, 1, 0)" 2516 | ] 2517 | }, 2518 | { 2519 | "cell_type": "markdown", 2520 | "metadata": {}, 2521 | "source": [ 2522 | "## pub_rec variable" 2523 | ] 2524 | }, 2525 | { 2526 | "cell_type": "code", 2527 | "execution_count": null, 2528 | "metadata": {}, 2529 | "outputs": [], 2530 | "source": [ 2531 | "df_inputs_prepr['pub_rec'].unique()" 2532 | ] 2533 | }, 2534 | { 2535 | "cell_type": "code", 2536 | "execution_count": null, 2537 | "metadata": {}, 2538 | "outputs": [], 2539 | "source": [ 2540 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'pub_rec', df_targets_prepr) \n", 2541 | "df_temp" 2542 | ] 2543 | }, 2544 | { 2545 | "cell_type": "code", 2546 | "execution_count": null, 2547 | "metadata": {}, 2548 | "outputs": [], 2549 | "source": [ 2550 | "plot_by_woe(df_temp)" 2551 | ] 2552 | }, 2553 | { 2554 | "cell_type": "code", 2555 | "execution_count": null, 2556 | "metadata": {}, 2557 | "outputs": [], 2558 | "source": [ 2559 | "#removing all inf values\n", 2560 | "df_temp = df_temp[df_temp['WOE'] != float('inf')]\n", 2561 | "df_temp" 2562 | ] 2563 | }, 2564 | { 2565 | "cell_type": "code", 2566 | "execution_count": null, 2567 | "metadata": {}, 2568 | "outputs": [], 2569 | "source": [ 2570 | "plot_by_woe(df_temp)" 2571 | ] 2572 | }, 2573 | { 2574 | "cell_type": "code", 2575 | "execution_count": null, 2576 | "metadata": {}, 2577 | "outputs": [], 2578 | "source": [ "#The original range(5,7) dropped the value 7, and range(7, max) overlapped it while missing the maximum\n", 2579 | "df_inputs_prepr['pub_rec : 0-2'] = np.where(df_inputs_prepr['pub_rec'].isin(range(0,3)), 1, 0)\n", 2580 | "df_inputs_prepr['pub_rec : 3-4'] = np.where(df_inputs_prepr['pub_rec'].isin(range(3,5)), 1, 0)\n", 2581 | "df_inputs_prepr['pub_rec : 5-7'] = np.where(df_inputs_prepr['pub_rec'].isin(range(5,8)), 1, 0)\n", 2582 | "df_inputs_prepr['pub_rec : >7'] = np.where(df_inputs_prepr['pub_rec'] > 7, 1, 0)" 2583 | ] 2584 | },
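With this many hand-written bands, it is easy to leave a gap or an overlap. A quick sanity check (a sketch; it relies on the `'variable : band'` naming used throughout) is that the dummies of one variable sum to exactly 1 in every row:

```python
def check_partition(df, prefix):
    # The row-wise sum over one variable's band dummies should always be 1:
    # a 0 reveals a gap between bands, a 2 an overlap.
    cols = [c for c in df.columns if c.startswith(prefix + ' : ')]
    return df[cols].sum(axis=1).value_counts()

print(check_partition(df_inputs_prepr, 'pub_rec'))  # ideally a single row: 1 -> n_obs
```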
2585 | { 2586 | "cell_type": "markdown", 2587 | "metadata": {}, 2588 | "source": [ 2589 | "## total_acc variable" 2590 | ] 2591 | }, 2592 | { 2593 | "cell_type": "code", 2594 | "execution_count": null, 2595 | "metadata": {}, 2596 | "outputs": [], 2597 | "source": [ 2598 | "df_inputs_prepr['total_acc'].unique()" 2599 | ] 2600 | }, 2601 | { 2602 | "cell_type": "code", 2603 | "execution_count": null, 2604 | "metadata": {}, 2605 | "outputs": [], 2606 | "source": [ 2607 | "df_inputs_prepr['total_acc_factor'] = pd.cut(df_inputs_prepr['total_acc'], 50)" 2608 | ] 2609 | }, 2610 | { 2611 | "cell_type": "code", 2612 | "execution_count": null, 2613 | "metadata": { 2614 | "scrolled": true 2615 | }, 2616 | "outputs": [], 2617 | "source": [ 2618 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_acc_factor', df_targets_prepr)\n", 2619 | "df_temp" 2620 | ] 2621 | }, 2622 | { 2623 | "cell_type": "code", 2624 | "execution_count": null, 2625 | "metadata": {}, 2626 | "outputs": [], 2627 | "source": [ 2628 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2629 | ] 2630 | }, 2631 | { 2632 | "cell_type": "code", 2633 | "execution_count": null, 2634 | "metadata": {}, 2635 | "outputs": [], 2636 | "source": [ "#Left-open, right-closed bands on the numeric column; the original 28-30 band used range(18,30), a typo that overlapped the <=27 band\n", 2637 | "df_inputs_prepr['total_acc_factor : <=27'] = np.where(df_inputs_prepr['total_acc'] <= 27, 1, 0)\n", 2638 | "df_inputs_prepr['total_acc_factor : 28-30'] = np.where((df_inputs_prepr['total_acc'] > 27) & (df_inputs_prepr['total_acc'] <= 30), 1, 0)\n", 2639 | "df_inputs_prepr['total_acc_factor : 30-45'] = np.where((df_inputs_prepr['total_acc'] > 30) & (df_inputs_prepr['total_acc'] <= 45), 1, 0)\n", 2640 | "df_inputs_prepr['total_acc_factor : 45-60'] = np.where((df_inputs_prepr['total_acc'] > 45) & (df_inputs_prepr['total_acc'] <= 60), 1, 0)\n", 2641 | "df_inputs_prepr['total_acc_factor : 60-72'] = np.where((df_inputs_prepr['total_acc'] > 60) & (df_inputs_prepr['total_acc'] <= 72), 1, 0)\n", 2642 | "df_inputs_prepr['total_acc_factor : >72'] = np.where(df_inputs_prepr['total_acc'] > 72, 1, 0)" 2643 | ] 2644 | }, 2645 | { 2646 | "cell_type": "markdown", 2647 | "metadata": {}, 2648 | "source": [ 2649 | "## acc_now_delinq variable" 2650 | ] 2651 | }, 2652 | { 2653 | "cell_type": "code", 2654 | "execution_count": null, 2655 | "metadata": {}, 2656 | "outputs": [], 2657 | "source": [ 2658 | "df_inputs_prepr['acc_now_delinq'].unique()" 2659 | ] 2660 | }, 2661 | { 2662 | "cell_type": "code", 2663 | "execution_count": null, 2664 | "metadata": {}, 2665 | "outputs": [], 2666 | "source": [ 2667 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'acc_now_delinq', df_targets_prepr)\n", 2668 | "df_temp" 2669 | ] 2670 | }, 2671 | { 2672 | "cell_type": "code", 2673 | "execution_count": null, 2674 | "metadata": {}, 2675 | "outputs": [], 2676 | "source": [ 2677 | "#df_temp = df_temp.iloc[[0,1,2,3,5], :]" 2678 | ] 2679 | }, 2680 | { 2681 | "cell_type": "code", 2682 | "execution_count": null, 2683 | "metadata": {}, 2684 | "outputs": [], 2685 | "source": [ 2686 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2687 | ] 2688 | }, 2689 | { 2690 | "cell_type": "code", 2691 | "execution_count": null, 2692 | "metadata": {}, 2693 | "outputs": [], 2694 | "source": [ "#The original range(0) is empty and range(1, 5) dropped the value 5\n", 2695 | "df_inputs_prepr['acc_now_delinq : 0'] = np.where(df_inputs_prepr['acc_now_delinq'] == 0, 1, 0)\n", 2696 | "df_inputs_prepr['acc_now_delinq : 1-5'] = np.where(df_inputs_prepr['acc_now_delinq'].isin(range(1, 6)), 1, 0) " 2697 | ] 2698 | }, 2699 | { 2700 | "cell_type": "code", 2701 | "execution_count": null, 2702 | "metadata": {}, 2703 | "outputs": [], 2704 | "source": [ 2705 | "df_inputs_prepr.columns.values" 2706 | ] 2707 | },
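`columns.values` prints original and dummy columns alike. Since every new dummy follows the `'variable : band'` naming convention, the separator alone is enough to pull them out (a small sketch):

```python
# Select only the coarse-classing dummies via their ' : ' separator.
dummy_cols = [c for c in df_inputs_prepr.columns if ' : ' in c]
print(len(dummy_cols))
print(dummy_cols[:5])
```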
"cell_type": "code", 2717 | "execution_count": null, 2718 | "metadata": {}, 2719 | "outputs": [], 2720 | "source": [ 2721 | "df_inputs_prepr['annual_inc_factor'] = pd.cut(df_inputs_prepr['annual_inc'], 50)\n", 2722 | "df_inputs_prepr['annual_inc_factor']" 2723 | ] 2724 | }, 2725 | { 2726 | "cell_type": "code", 2727 | "execution_count": null, 2728 | "metadata": { 2729 | "scrolled": true 2730 | }, 2731 | "outputs": [], 2732 | "source": [ 2733 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'annual_inc_factor', df_targets_prepr)\n", 2734 | "df_temp" 2735 | ] 2736 | }, 2737 | { 2738 | "cell_type": "markdown", 2739 | "metadata": {}, 2740 | "source": [ 2741 | "94% of the observations are in one category. It seems that 50 categories weren't enough to split our data well. Let's split into 100 categories." 2742 | ] 2743 | }, 2744 | { 2745 | "cell_type": "code", 2746 | "execution_count": null, 2747 | "metadata": { 2748 | "scrolled": true 2749 | }, 2750 | "outputs": [], 2751 | "source": [ 2752 | "df_inputs_prepr['annual_inc_factor'] = pd.cut(df_inputs_prepr['annual_inc'], 100)\n", 2753 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'annual_inc_factor', df_targets_prepr)\n", 2754 | "df_temp" 2755 | ] 2756 | }, 2757 | { 2758 | "cell_type": "markdown", 2759 | "metadata": {}, 2760 | "source": [ 2761 | "This makes sense since there are very few people with high incomes but a lot with low incomes. Let's set a threshold of $140,000 (first 2 categories have a lot of obs.). Above threshold: high income, Below threshold: low income" 2762 | ] 2763 | }, 2764 | { 2765 | "cell_type": "code", 2766 | "execution_count": null, 2767 | "metadata": {}, 2768 | "outputs": [], 2769 | "source": [ 2770 | "#low income dataframe\n", 2771 | "df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['annual_inc'] <= 140000, :]" 2772 | ] 2773 | }, 2774 | { 2775 | "cell_type": "code", 2776 | "execution_count": null, 2777 | "metadata": { 2778 | "scrolled": true 2779 | }, 2780 | "outputs": [], 2781 | "source": [ 2782 | "df_inputs_prepr_temp['annual_inc_factor'] = pd.cut(df_inputs_prepr_temp['annual_inc'], 50)\n", 2783 | "df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'annual_inc_factor', df_targets_prepr.loc[df_inputs_prepr_temp.index])\n", 2784 | "df_temp" 2785 | ] 2786 | }, 2787 | { 2788 | "cell_type": "code", 2789 | "execution_count": null, 2790 | "metadata": {}, 2791 | "outputs": [], 2792 | "source": [ 2793 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2794 | ] 2795 | }, 2796 | { 2797 | "cell_type": "markdown", 2798 | "metadata": {}, 2799 | "source": [ 2800 | " We will split with a width of $10000, but first and last category will have low no. of values.So, let's make sure first and last 2 categories have a width of $20000. The intervals from $20k to $100k will have a width of $10k." 
2803 | { 2804 | "cell_type": "code", 2805 | "execution_count": null, 2806 | "metadata": {}, 2807 | "outputs": [], 2808 | "source": [ 2809 | "df_inputs_prepr['annual_inc : <20k'] = np.where((df_inputs_prepr['annual_inc'] <= 20000), 1, 0)\n", 2810 | "df_inputs_prepr['annual_inc : 20k-30k'] = np.where((df_inputs_prepr['annual_inc'] > 20000) & (df_inputs_prepr['annual_inc'] <= 30000), 1, 0)\n", 2811 | "df_inputs_prepr['annual_inc : 30k-40k'] = np.where((df_inputs_prepr['annual_inc'] > 30000) & (df_inputs_prepr['annual_inc'] <= 40000), 1, 0)\n", 2812 | "df_inputs_prepr['annual_inc : 40k-50k'] = np.where((df_inputs_prepr['annual_inc'] > 40000) & (df_inputs_prepr['annual_inc'] <= 50000), 1, 0)\n", 2813 | "df_inputs_prepr['annual_inc : 50k-60k'] = np.where((df_inputs_prepr['annual_inc'] > 50000) & (df_inputs_prepr['annual_inc'] <= 60000), 1, 0)\n", 2814 | "df_inputs_prepr['annual_inc : 60k-70k'] = np.where((df_inputs_prepr['annual_inc'] > 60000) & (df_inputs_prepr['annual_inc'] <= 70000), 1, 0)\n", 2815 | "df_inputs_prepr['annual_inc : 70k-80k'] = np.where((df_inputs_prepr['annual_inc'] > 70000) & (df_inputs_prepr['annual_inc'] <= 80000), 1, 0)\n", 2816 | "df_inputs_prepr['annual_inc : 80k-90k'] = np.where((df_inputs_prepr['annual_inc'] > 80000) & (df_inputs_prepr['annual_inc'] <= 90000), 1, 0)\n", 2817 | "df_inputs_prepr['annual_inc : 90k-100k'] = np.where((df_inputs_prepr['annual_inc'] > 90000) & (df_inputs_prepr['annual_inc'] <= 100000), 1, 0)\n", 2818 | "df_inputs_prepr['annual_inc : 100k-120k'] = np.where((df_inputs_prepr['annual_inc'] > 100000) & (df_inputs_prepr['annual_inc'] <= 120000), 1, 0)\n", 2819 | "df_inputs_prepr['annual_inc : 120k-140k'] = np.where((df_inputs_prepr['annual_inc'] > 120000) & (df_inputs_prepr['annual_inc'] <= 140000), 1, 0)\n", 2820 | "df_inputs_prepr['annual_inc : >140k'] = np.where((df_inputs_prepr['annual_inc'] > 140000), 1, 0)" 2821 | ] 2822 | }, 2823 | { 2824 | "cell_type": "markdown", 2825 | "metadata": {}, 2826 | "source": [ 2827 | "## mths_since_last_delinq variable" 2828 | ] 2829 | }, 2830 | { 2831 | "cell_type": "code", 2832 | "execution_count": null, 2833 | "metadata": {}, 2834 | "outputs": [], 2835 | "source": [ 2836 | "df_inputs_prepr['mths_since_last_delinq'].isnull().sum()" 2837 | ] 2838 | }, 2839 | { 2840 | "cell_type": "markdown", 2841 | "metadata": {}, 2842 | "source": [ 2843 | "There are a lot of missing values in this variable. 
So, we will create a dummy variable that is 1 when the value is missing and 0 otherwise.\n" 2844 | ] 2845 | }, 2846 | { 2847 | "cell_type": "code", 2848 | "execution_count": null, 2849 | "metadata": {}, 2850 | "outputs": [], 2851 | "source": [ 2852 | "df_inputs_prepr_temp = df_inputs_prepr[pd.notnull(df_inputs_prepr['mths_since_last_delinq'])].copy()" 2853 | ] 2854 | }, 2855 | { 2856 | "cell_type": "code", 2857 | "execution_count": null, 2858 | "metadata": { 2859 | "scrolled": true 2860 | }, 2861 | "outputs": [], 2862 | "source": [ 2863 | "df_inputs_prepr_temp['mths_since_last_delinq_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_last_delinq'], 50)\n", 2864 | "df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mths_since_last_delinq_factor', df_targets_prepr[df_inputs_prepr_temp.index])\n", 2865 | "df_temp" 2866 | ] 2867 | }, 2868 | { 2869 | "cell_type": "code", 2870 | "execution_count": null, 2871 | "metadata": {}, 2872 | "outputs": [], 2873 | "source": [ 2874 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2875 | ] 2876 | }, 2877 | { 2878 | "cell_type": "code", 2879 | "execution_count": null, 2880 | "metadata": {}, 2881 | "outputs": [], 2882 | "source": [ "#Inclusive upper bounds: the original strict '<' comparisons left the values 3, 30 and 56 out of every band\n", 2883 | "df_inputs_prepr['mths_since_last_delinq : Missing'] = np.where(df_inputs_prepr['mths_since_last_delinq'].isnull(), 1, 0)\n", 2884 | "df_inputs_prepr['mths_since_last_delinq : 0-3'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 0) & (df_inputs_prepr['mths_since_last_delinq'] <= 3), 1, 0)\n", 2885 | "df_inputs_prepr['mths_since_last_delinq : 4-30'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 4) & (df_inputs_prepr['mths_since_last_delinq'] <= 30), 1, 0)\n", 2886 | "df_inputs_prepr['mths_since_last_delinq : 31-56'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 31) & (df_inputs_prepr['mths_since_last_delinq'] <= 56), 1, 0)\n", 2887 | "df_inputs_prepr['mths_since_last_delinq : >=57'] = np.where((df_inputs_prepr['mths_since_last_delinq'] >= 57), 1, 0)" 2888 | ] 2889 | },
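This missing-indicator trick generalizes to any variable whose missingness may itself carry signal; a small sketch of the pattern (the column list is illustrative):

```python
# One 'Missing' dummy per variable with potentially informative NaNs,
# created before the non-missing values are banded.
for col in ['mths_since_last_delinq', 'mths_since_last_record']:
    df_inputs_prepr[col + ' : Missing'] = df_inputs_prepr[col].isnull().astype(int)
```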
2890 | { 2891 | "cell_type": "markdown", 2892 | "metadata": {}, 2893 | "source": [ 2894 | "## DTI (debt-to-income) variable" 2895 | ] 2896 | }, 2897 | { 2898 | "cell_type": "code", 2899 | "execution_count": null, 2900 | "metadata": { 2901 | "scrolled": true 2902 | }, 2903 | "outputs": [], 2904 | "source": [ 2905 | "df_inputs_prepr['dti_factor'] = pd.cut(df_inputs_prepr['dti'], 50)\n", 2906 | "df_temp = woe_ordered_continuous(df_inputs_prepr, 'dti_factor', df_targets_prepr)\n", 2907 | "df_temp" 2908 | ] 2909 | }, 2910 | { 2911 | "cell_type": "code", 2912 | "execution_count": null, 2913 | "metadata": {}, 2914 | "outputs": [], 2915 | "source": [ 2916 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2917 | ] 2918 | }, 2919 | { 2920 | "cell_type": "code", 2921 | "execution_count": null, 2922 | "metadata": {}, 2923 | "outputs": [], 2924 | "source": [ "#The first boundary is 1.6, matching the '<=1.6' label (the original code compared against 1.7)\n", 2925 | "df_inputs_prepr['dti : <=1.6'] = np.where(df_inputs_prepr['dti'] <= 1.6, 1, 0)\n", 2926 | "df_inputs_prepr['dti : 1.7-4.1'] = np.where((df_inputs_prepr['dti'] > 1.6) & (df_inputs_prepr['dti'] <= 4.1), 1, 0)\n", 2927 | "df_inputs_prepr['dti : 4.1-8.9'] = np.where((df_inputs_prepr['dti'] > 4.1) & (df_inputs_prepr['dti'] <= 8.9), 1, 0)\n", 2928 | "df_inputs_prepr['dti : 8.9-14.4'] = np.where((df_inputs_prepr['dti'] > 8.9) & (df_inputs_prepr['dti'] <= 14.4), 1, 0)\n", 2929 | "df_inputs_prepr['dti : 14.4-16.8'] = np.where((df_inputs_prepr['dti'] > 14.4) & (df_inputs_prepr['dti'] <= 16.8), 1, 0)\n", 2930 | "df_inputs_prepr['dti : 16.8-24'] = np.where((df_inputs_prepr['dti'] > 16.8) & (df_inputs_prepr['dti'] <= 24.0), 1, 0)\n", 2931 | "df_inputs_prepr['dti : 24-35.9'] = np.where((df_inputs_prepr['dti'] > 24.0) & (df_inputs_prepr['dti'] <= 35.9), 1, 0)\n", 2932 | "df_inputs_prepr['dti : >35.9'] = np.where(df_inputs_prepr['dti'] > 35.9, 1, 0) " 2933 | ] 2934 | }, 2935 | { 2936 | "cell_type": "code", 2937 | "execution_count": null, 2938 | "metadata": {}, 2939 | "outputs": [], 2940 | "source": [ 2941 | "df_inputs_prepr.columns.values" 2942 | ] 2943 | }, 2944 | { 2945 | "cell_type": "markdown", 2946 | "metadata": {}, 2947 | "source": [ 2948 | "## mths_since_last_record variable" 2949 | ] 2950 | }, 2951 | { 2952 | "cell_type": "code", 2953 | "execution_count": null, 2954 | "metadata": {}, 2955 | "outputs": [], 2956 | "source": [ 2957 | "df_inputs_prepr['mths_since_last_record'].isnull().sum()" 2958 | ] 2959 | }, 2960 | { 2961 | "cell_type": "code", 2962 | "execution_count": null, 2963 | "metadata": {}, 2964 | "outputs": [], 2965 | "source": [ 2966 | "#Taking only the non-missing values; .copy() avoids a SettingWithCopyWarning below\n", 2967 | "df_inputs_prepr_temp = df_inputs_prepr[pd.notnull(df_inputs_prepr['mths_since_last_record'])].copy()" 2968 | ] 2969 | }, 2970 | { 2971 | "cell_type": "code", 2972 | "execution_count": null, 2973 | "metadata": { 2974 | "scrolled": true 2975 | }, 2976 | "outputs": [], 2977 | "source": [ 2978 | "df_inputs_prepr_temp['mths_since_last_record_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_last_record'], 50)\n", 2979 | "df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mths_since_last_record_factor', df_targets_prepr[df_inputs_prepr_temp.index])\n", 2980 | "df_temp" 2981 | ] 2982 | }, 2983 | { 2984 | "cell_type": "code", 2985 | "execution_count": null, 2986 | "metadata": {}, 2987 | "outputs": [], 2988 | "source": [ 2989 | "plot_by_woe(df_temp, rotation_of_x_axis_labels=90)" 2990 | ] 2991 | }, 2992 | { 2993 | "cell_type": "code", 2994 | "execution_count": null, 2995 | "metadata": {}, 2996 | "outputs": [], 2997 | "source": [ 2998 | "df_inputs_prepr['mths_since_last_record : Missing'] = np.where(df_inputs_prepr['mths_since_last_record'].isnull(), 1, 0)\n", 2999 | "df_inputs_prepr['mths_since_last_record : 0-2'] = np.where(df_inputs_prepr['mths_since_last_record'].isin(range(0,3)), 1, 0)\n", 3000 | "df_inputs_prepr['mths_since_last_record : 3-20'] = np.where(df_inputs_prepr['mths_since_last_record'].isin(range(3,21)), 1, 0)\n", 3001 | "df_inputs_prepr['mths_since_last_record : 21-40'] = np.where(df_inputs_prepr['mths_since_last_record'].isin(range(21,41)), 1, 0)\n", 3002 | "df_inputs_prepr['mths_since_last_record : 41-65'] = np.where(df_inputs_prepr['mths_since_last_record'].isin(range(41,66)), 1, 0)\n", 3003 | "df_inputs_prepr['mths_since_last_record : 66-84'] = np.where(df_inputs_prepr['mths_since_last_record'].isin(range(66,85)), 1, 0)\n", 3004 | "df_inputs_prepr['mths_since_last_record : 85-96'] = np.where(df_inputs_prepr['mths_since_last_record'].isin(range(85,97)), 1, 0)\n", 3005 | "df_inputs_prepr['mths_since_last_record : >=97'] = np.where(df_inputs_prepr['mths_since_last_record'] >= 97, 1, 0)" 3006 | ] 3007 | }, 3008 | { 3009 | "cell_type": "markdown", 3010 | "metadata": {}, 3011 | "source": [ 3012 | "# Preprocessing test dataset" 3013 | ] 3014 | }, 3015 | { 3016 | "cell_type": "code", 3017 | "execution_count": null, 3018 | "metadata": {}, 3019 | "outputs": [], 3020 | "source": [ 3021 | "#loan_data_inputs_train = df_inputs_prepr" 3022 | ] 3023 | }, 3024 | { 3025 | "cell_type": "markdown", 3026 | "metadata": {}, 3027 | 
"source": [ 3028 | " After executing the code from reassigning test_input and test_target to df_inputs_prep and df_targets_prep, now df_inputs_prep contains data for test dataset. Let's reassign this to loan_data_inputs_test" 3029 | ] 3030 | }, 3031 | { 3032 | "cell_type": "code", 3033 | "execution_count": null, 3034 | "metadata": {}, 3035 | "outputs": [], 3036 | "source": [ 3037 | "#loan_data_inputs_test = df_inputs_prepr" 3038 | ] 3039 | }, 3040 | { 3041 | "cell_type": "code", 3042 | "execution_count": null, 3043 | "metadata": {}, 3044 | "outputs": [], 3045 | "source": [ 3046 | "# #Exporting the final datasets to CSV files\n", 3047 | "# loan_data_inputs_train.to_csv('loan_data_inputs_train.csv')\n", 3048 | "# loan_data_targets_train.to_csv('loan_data_targets_train.csv')\n", 3049 | "# loan_data_inputs_test.to_csv('loan_data_inputs_test.csv')\n", 3050 | "# loan_data_targets_test.to_csv('loan_data_targets_test.csv')" 3051 | ] 3052 | }, 3053 | { 3054 | "cell_type": "markdown", 3055 | "metadata": {}, 3056 | "source": [ 3057 | "Linear regression: output = linear combination of predictors\n", 3058 | "Logistic regression: P(Y = 1) = $\\frac{\\exp (linear combi)}{1 + \\exp(linear combi)}$" 3059 | ] 3060 | }, 3061 | { 3062 | "cell_type": "markdown", 3063 | "metadata": {}, 3064 | "source": [ 3065 | "$\\frac{P(Y=1)}{P(Y=0)}$ [This ratio is called odds] = $\\exp(linear combi)$" 3066 | ] 3067 | }, 3068 | { 3069 | "cell_type": "markdown", 3070 | "metadata": {}, 3071 | "source": [ 3072 | "Taking ln on both sides: ln($\\frac{P(Y=1)}{P(Y=0)}$) = linear combi. So, linear and logistic regression are equivalent." 3073 | ] 3074 | }, 3075 | { 3076 | "cell_type": "markdown", 3077 | "metadata": {}, 3078 | "source": [ 3079 | "ln($\\frac{P(Y=1 | X1= 1)}{P(Y=0 | X1 = 1)}$) - ln($\\frac{P(Y=1 | X1= 0)}{P(Y=0 | X1 = 0)}$ = Beta1" 3080 | ] 3081 | }, 3082 | { 3083 | "cell_type": "markdown", 3084 | "metadata": {}, 3085 | "source": [ 3086 | "Which implies,\n", 3087 | "ln($\\frac{odds(Y = 1 | X1 = 1)}{odds(Y = 1 | X1 = 0)}$) = Beta1" 3088 | ] 3089 | }, 3090 | { 3091 | "cell_type": "markdown", 3092 | "metadata": {}, 3093 | "source": [ 3094 | "Also, $\\frac{odds(Y = 1 | X1 = 1)}{odds(Y = 1 | X1 = 0)}$ = $\\exp(Beta1)$" 3095 | ] 3096 | } 3097 | ], 3098 | "metadata": { 3099 | "kernelspec": { 3100 | "display_name": "Python 3", 3101 | "language": "python", 3102 | "name": "python3" 3103 | }, 3104 | "language_info": { 3105 | "codemirror_mode": { 3106 | "name": "ipython", 3107 | "version": 3 3108 | }, 3109 | "file_extension": ".py", 3110 | "mimetype": "text/x-python", 3111 | "name": "python", 3112 | "nbconvert_exporter": "python", 3113 | "pygments_lexer": "ipython3", 3114 | "version": "3.7.4" 3115 | } 3116 | }, 3117 | "nbformat": 4, 3118 | "nbformat_minor": 2 3119 | } 3120 | -------------------------------------------------------------------------------- /List of dummy variables.txt: -------------------------------------------------------------------------------- 1 | 'Grade : A' 2 | 'Grade : B' 3 | 'Grade : C' 4 | 'Grade : D' 5 | 'Grade : E' 6 | 'Grade : F' 7 | 'Grade : G' 8 | 'home_ownership : RENT_OTHER_NONE_ANY' 9 | 'home_ownership : MORTGAGE' 10 | 'home_ownership : OWN' 11 | 'addr_state : ND_NE_IA_NV_FL_HI_AL' 12 | 'addr_state : NM_VA' 13 | 'addr_state : NY' 14 | 'addr_state : OK_TN_MO_LA_MD_NC' 15 | 'addr_state : CA' 16 | 'addr_state : UT_KY_AZ_NJ' 17 | 'addr_state : AR_MI_PA_OH_MN' 18 | 'addr_state : RI_MA_DE_SD_IN' 19 | 'addr_state : GA_WA_OR' 20 | 'addr_state : WI_MT' 21 | 'addr_state : TX' 22 | 'addr_state : IL_CT' 23 | 
3097 | ], 3098 | "metadata": { 3099 | "kernelspec": { 3100 | "display_name": "Python 3", 3101 | "language": "python", 3102 | "name": "python3" 3103 | }, 3104 | "language_info": { 3105 | "codemirror_mode": { 3106 | "name": "ipython", 3107 | "version": 3 3108 | }, 3109 | "file_extension": ".py", 3110 | "mimetype": "text/x-python", 3111 | "name": "python", 3112 | "nbconvert_exporter": "python", 3113 | "pygments_lexer": "ipython3", 3114 | "version": "3.7.4" 3115 | } 3116 | }, 3117 | "nbformat": 4, 3118 | "nbformat_minor": 2 3119 | } 3120 | -------------------------------------------------------------------------------- /List of dummy variables.txt: -------------------------------------------------------------------------------- 1 | 'Grade : A' 2 | 'Grade : B' 3 | 'Grade : C' 4 | 'Grade : D' 5 | 'Grade : E' 6 | 'Grade : F' 7 | 'Grade : G' 8 | 'home_ownership : RENT_OTHER_NONE_ANY' 9 | 'home_ownership : MORTGAGE' 10 | 'home_ownership : OWN' 11 | 'addr_state : ND_NE_IA_NV_FL_HI_AL' 12 | 'addr_state : NM_VA' 13 | 'addr_state : NY' 14 | 'addr_state : OK_TN_MO_LA_MD_NC' 15 | 'addr_state : CA' 16 | 'addr_state : UT_KY_AZ_NJ' 17 | 'addr_state : AR_MI_PA_OH_MN' 18 | 'addr_state : RI_MA_DE_SD_IN' 19 | 'addr_state : GA_WA_OR' 20 | 'addr_state : WI_MT' 21 | 'addr_state : TX' 22 | 'addr_state : IL_CT' 23 | 'addr_state : KS_SC_CO_VT_AK_MS' 24 | 'addr_state : WV_NH_WY_DC_ME_ID' 25 | 'verification_status : Not Verified' 26 | 'verification_status : Source Verified' 27 | 'verification_status : Verified' 28 | 'purpose : SB_ED' 29 | 'purpose : HO_OT_RE_ME' 30 | 'purpose : WE_VA_DC' 31 | 'purpose : HI_MP_CA_CC' 32 | 'initial_list_status : f' 33 | 'initial_list_status : w' 34 | 'term : 36' 35 | 'term : 60' 36 | 'emp_length : 0' 37 | 'emp_length : 1' 38 | 'emp_length : 2-4' 39 | 'emp_length : 5-6' 40 | 'emp_length : 7-9' 41 | 'emp_length : 10' 42 | 'mths_since_issue_d_date_factor : <38' 43 | 'mths_since_issue_d_date_factor : 38-39' 44 | 'mths_since_issue_d_date_factor : 40-41' 45 | 'mths_since_issue_d_date_factor : 42-48' 46 | 'mths_since_issue_d_date_factor : 49-52' 47 | 'mths_since_issue_d_date_factor : 53-64' 48 | 'mths_since_issue_d_date_factor : 65-84' 49 | 'mths_since_issue_d_date_factor : >84' 50 | 'int_rate : <9.548' 51 | 'int_rate : 9.548-12.025' 52 | 'int_rate : 12.025-15.74' 53 | 'int_rate : 15.74-20.281' 54 | 'int_rate : >20.281' 55 | 'mths_since_earliest_cr_line_factor : <140' 56 | 'mths_since_earliest_cr_line_factor : 141-270' 57 | 'mths_since_earliest_cr_line_factor : 271-352' 58 | 'mths_since_earliest_cr_line_factor : 353-410' 59 | 'mths_since_earliest_cr_line_factor : 411-563' 60 | 'mths_since_earliest_cr_line_factor : >563' 61 | 'delinq_2yrs : <4' 62 | 'delinq_2yrs : >=4' 63 | 'inq_last_6mths : 0-3' 64 | 'inq_last_6mths : 4-6' 65 | 'inq_last_6mths : >=7' 66 | 'open_acc_factor : 0' 67 | 'open_acc_factor : 1-3' 68 | 'open_acc_factor : 3-11' 69 | 'open_acc_factor : 12-25' 70 | 'open_acc_factor : 26-33' 71 | 'open_acc_factor : >=34' 72 | 'pub_rec : 0-2' 73 | 'pub_rec : 3-4' 74 | 'pub_rec : 5-7' 75 | 'pub_rec : >7' 76 | 'total_acc_factor : <=27' 77 | 'total_acc_factor : 28-45' 78 | 'total_acc_factor : 45-60' 79 | 'total_acc_factor : 60-72' 80 | 'total_acc_factor : >72' 81 | 'acc_now_delinq : 0' 82 | 'acc_now_delinq : 1-5' 83 | 'total_rev_hi_lim_factorv : <=4k' 84 | 'total_rev_hi_lim_factorv : 5k-36k' 85 | 'total_rev_hi_lim_factorv : 37k-92k' 86 | 'total_rev_hi_lim_factorv : 92k-200k' 87 | 'total_rev_hi_lim_factorv : >200k' 88 | 'annual_inc : <20k' 89 | 'annual_inc : 20k-30k' 90 | 'annual_inc : 30k-40k' 91 | 'annual_inc : 40k-50k' 92 | 'annual_inc : 50k-60k' 93 | 'annual_inc : 60k-70k' 94 | 'annual_inc : 70k-80k' 95 | 'annual_inc : 80k-90k' 96 | 'annual_inc : 90k-100k' 97 | 'annual_inc : 100k-120k' 98 | 'annual_inc : 120k-140k' 99 | 'annual_inc : >140k' 100 | 'dti : <=1.6' 101 | 'dti : 1.7-4.1' 102 | 'dti : 4.1-8.9' 103 | 'dti : 8.9-14.4' 104 | 'dti : 14.4-16.8' 105 | 'dti : 16.8-24' 106 | 'dti : 24-35.9' 107 | 'dti : >35.9' 108 | 'mths_since_last_delinq : Missing' 109 | 'mths_since_last_delinq : 0-3' 110 | 'mths_since_last_delinq : 4-30' 111 | 'mths_since_last_delinq : 31-56' 112 | 'mths_since_last_delinq : >=57' 113 | 'mths_since_last_record : Missing' 114 | 'mths_since_last_record : 0-2' 115 | 'mths_since_last_record : 3-20' 116 | 'mths_since_last_record : 21-40' 117 | 'mths_since_last_record : 41-65' 118 | 'mths_since_last_record : 66-84' 119 | 'mths_since_last_record : 85-96' 120 | 'mths_since_last_record : >=97' -------------------------------------------------------------------------------- /List of reference variables.txt: -------------------------------------------------------------------------------- 1 | 'Grade : G', 2 | 'home_ownership : RENT_OTHER_NONE_ANY', 3 | 'addr_state : ND_NE_IA_NV_FL_HI_AL', 4 | 'verification_status : Verified', 5 | 'purpose : SB_ED', 6 | 
'initial_list_status : f', 7 | 'term : 60', 8 | 'emp_length : 0', 9 | 'mths_since_issue_d_date_factor : >84', 10 | 'int_rate : >20.281', 11 | 'mths_since_earliest_cr_line_factor : <140', 12 | 'delinq_2yrs : >=4', 13 | 'inq_last_6mths : >=7', 14 | 'open_acc_factor : 0', 15 | 'pub_rec : 0-2', 16 | 'total_acc_factor : <=27', 17 | 'acc_now_delinq : 0', 18 | 'total_rev_hi_lim_factorv : <=4k', 19 | 'annual_inc : <20k', 20 | 'dti : >35.9', 21 | 'mths_since_last_delinq : 0-3', 22 | 'mths_since_last_record : 0-2' 23 | 24 | 25 | Columns to drop: 26 | mths_since_earliest_cr_line_factor 27 | <70 28 | 70-93 29 | 94-140 30 | 31 | delinq_2yrs 32 | 0-3 33 | 4-6 34 | 7-11 35 | 12-21 36 | 37 | inq_last_6mths : 10-18 38 | inq_last_6mths : 7-9 39 | 40 | open_acc_factor : <3 41 | 42 | pub_rec : 0-1 43 | pub_rec : 2-4 44 | 45 | total_acc_factor : 12-18 46 | total_acc_factor : 3-12 47 | total_acc_factor : <3 48 | total_acc_factor : 18-30 49 | total_acc_factor : 30-45 50 | 51 | acc_now_delinq : >2 52 | acc_now_delinq : 0-1 -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Credit-Risk-Modeling-in-Python 2 | I have modeled the credit risk associated with consumer loans. The Jupyter notebook contains detailed explanations with comments, 3 | code and visualizations. 4 | 5 | List of dummy variables is a file listing the dummy variables, for all original variables (discrete and continuous), that are used in the analysis. 6 | 7 | List of reference variables is a file listing the reference (base) dummy variable for each original variable (discrete and continuous). This helps 8 | in interpreting each remaining dummy variable's coefficient relative to its reference category. 9 | --------------------------------------------------------------------------------