├── LICENSE.txt ├── README.md ├── py_mob-0.4.0.tar.gz └── py_mob ├── __init__.py ├── py_mob-0.2.7.tar.gz ├── py_mob-0.3-py3-none-any.whl ├── py_mob-0.3.tar.gz ├── py_mob.py └── py_mob1.jpg /LICENSE.txt: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright 2020 WenSui Liu (Statcompute) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 6 | 7 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

2 | 3 |

4 | 5 | ###

Python Implementation of

6 | ###

Monotonic Optimal Binning (PY_MOB)

7 | 8 | #### Introduction 9 | 10 | As an attempt to mimic the mob R package (https://CRAN.R-project.org/package=mob), py_mob is a collection of python functions that generate the monotonic binning and perform the WoE (Weight of Evidence) transformation used in consumer credit scorecard developments. The WoE transformation is a piecewise transformation that is linear with respect to the log odds. For a numeric variable, all of its monotonic functional transformations will converge to the same WoE transformation. In addition, the Information Value and the KS statistic of each independent variable are also calculated to evaluate the variable predictiveness. 11 | 12 | Different from other python packages for the same purpose, the py\_mob package is very lightweight, and the underlying computation is driven by the built-in python list or the numpy array. Functions return lists of dictionaries, which can be easily converted to other data structures, such as pandas.DataFrame or astropy.table. 13 | 14 | What's more, six different monotonic binning algorithms are implemented, namely qtl\_bin(), bad\_bin(), iso\_bin(), rng\_bin(), kmn\_bin(), and gbm\_bin(), which provide different trade-offs between predictability and cardinality. 15 | 16 | People without a background in consumer risk modeling might wonder why the monotonic binning and the subsequent WoE transformation are important. Below are a few reasons based on my experience, which generalize to other use cases of logistic regression with binary outcomes. 17 | 1. Because the WoE is a piecewise transformation based on the data discretization, all missing values fall into a standalone category, either on their own or combined with the neighboring bin with a similar bad rate. As a result, no special treatment for missing values is necessary. 18 | 2. 
After the monotonic binning of each variable, since the WoE value for each bin is a projection from the predictor into the response, defined by the log ratio between the event and non-event distributions, the raw values of the predictor no longer matter and the issue of outliers therefore disappears. 19 | 3. While many modelers like to use log or power transformations to achieve a good linear relationship between the predictor and the log odds of the response, which is heuristic at best with no guarantee of a good outcome, the WoE transformation is strictly linear with respect to the log odds of the response with unity correlation. It is also worth mentioning that a numeric variable and its strictly monotone functions should converge to the same monotonic WoE transformation. 20 | 4. Lastly, because the WoE is defined as the log ratio between the event and non-event distributions, it is indicative of the separation between cases with Y = 0 and cases with Y = 1. As the weighted sum of WoE values, with the weight being the difference between the event and non-event distributions, the IV (Information Value) is an important statistic commonly used to measure predictor importance. 21 | 22 | 23 | #### Package Dependencies 24 | 25 | ```text 26 | pandas, numpy, scipy, sklearn, lightgbm, tabulate 27 | ``` 28 | 29 | #### Installation 30 | 31 | ```shell 32 | pip3 install py_mob 33 | ``` 34 | 35 | #### Core Functions 36 | 37 | ``` 38 | py_mob 39 | |-- qtl_bin() : An iterative discretization based on quantiles of X. 40 | |-- bad_bin() : A revised iterative discretization for records with Y = 1. 41 | |-- iso_bin() : A discretization algorithm driven by the isotonic regression between X and Y. 42 | |-- rng_bin() : A revised iterative discretization based on the equal-width range of X. 43 | |-- kmn_bin() : A discretization algorithm based on the k-means clustering of X. 44 | |-- gbm_bin() : A discretization algorithm based on the gradient boosting machine. 
45 | |-- summ_bin() : Generates the statistical summary for the binning outcome. 46 | |-- view_bin() : Displays the binning outcome in a tabular form. 47 | |-- cal_woe() : Applies the WoE transformation to a numeric vector based on the binning outcome. 48 | |-- pd_bin() : Discretizes each vector in a pandas DataFrame. 49 | `-- pd_woe() : Applies the WoE transformation to each vector in the pandas DataFrame. 50 | ``` 51 | -------------------------------------------------------------------------------- /py_mob-0.4.0.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/statcompute/py_mob/1e20ceeeee03e71b075b052206cc8bb9fe7af9ff/py_mob-0.4.0.tar.gz -------------------------------------------------------------------------------- /py_mob/__init__.py: -------------------------------------------------------------------------------- 1 | # py_mob/__init__.py 2 | 3 | __version__ = "0.2.7" 4 | 5 | from .py_mob import * 6 | -------------------------------------------------------------------------------- /py_mob/py_mob-0.2.7.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/statcompute/py_mob/1e20ceeeee03e71b075b052206cc8bb9fe7af9ff/py_mob/py_mob-0.2.7.tar.gz -------------------------------------------------------------------------------- /py_mob/py_mob-0.3-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/statcompute/py_mob/1e20ceeeee03e71b075b052206cc8bb9fe7af9ff/py_mob/py_mob-0.3-py3-none-any.whl -------------------------------------------------------------------------------- /py_mob/py_mob-0.3.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/statcompute/py_mob/1e20ceeeee03e71b075b052206cc8bb9fe7af9ff/py_mob/py_mob-0.3.tar.gz 
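As a side note for readers tracing the statistics reported in the README above, the WoE, IV, and KS definitions can be sketched in plain Python. The bin counts below are made-up toy numbers for illustration, not output of the package:

```python
import math

# Toy binning outcome: (frequency, bads) per bin -- made-up numbers for illustration.
bins = [(2850, 367), (891, 193), (810, 207), (1073, 359)]

total_freq = sum(f for f, _ in bins)
total_bads = sum(b for _, b in bins)
total_goods = total_freq - total_bads

iv, cum_bad, cum_good, ks = 0.0, 0.0, 0.0, 0.0
for freq, bads in bins:
    goods = freq - bads
    dist_bad = bads / total_bads          # event distribution of the bin
    dist_good = goods / total_goods       # non-event distribution of the bin
    woe = math.log(dist_bad / dist_good)  # WoE: log ratio of the two distributions
    iv += (dist_bad - dist_good) * woe    # IV: weighted sum of WoE values
    cum_bad += dist_bad
    cum_good += dist_good
    ks = max(ks, abs(cum_bad - cum_good) * 100)  # KS: max gap between cumulative distributions

print(round(iv, 4), round(ks, 2))
```

Because each IV term is a difference multiplied by the log of the corresponding ratio, every term is non-negative, which is why IV accumulates across bins as a measure of overall separation.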
-------------------------------------------------------------------------------- /py_mob/py_mob.py: -------------------------------------------------------------------------------- 1 | # py_mob/py_mob.py 2 | 3 | import numpy, scipy.stats, sklearn.isotonic, sklearn.cluster, lightgbm, tabulate, pkg_resources 4 | 5 | 6 | def get_data(data): 7 | """ 8 | The function loads a testing dataset. 9 | 10 | Parameters: 11 | data : The name of dataset. It is either "hmeq" or "accepts", both of 12 | which are loan performance data. 13 | 14 | Returns: 15 | A dict with the dataset. 16 | 17 | Example: 18 | data = py_mob.get_data("accepts") 19 | 20 | data.keys() 21 | # ['bankruptcy', 'bad', 'app_id', 'tot_derog', 'tot_tr', 'age_oldest_tr', 22 | # 'tot_open_tr', 'tot_rev_tr', 'tot_rev_debt', 'tot_rev_line', 'rev_util', 23 | # 'bureau_score', 'purch_price', 'msrp', 'down_pyt', 'purpose', 24 | # 'loan_term', 'loan_amt', 'ltv', 'tot_income', 'used_ind', 'weight'] 25 | 26 | py_mob.view_bin(py_mob.qtl_bin(data["ltv"], data["bad"])) 27 | """ 28 | 29 | _p = pkg_resources.resource_filename("py_mob", "data/" + data + ".csv") 30 | 31 | _d = numpy.recfromcsv(_p, delimiter = ',', names = True, encoding = 'latin-1') 32 | 33 | return(dict((_2, [_[_1] for _ in _d]) for _1, _2 in enumerate(_d.dtype.fields))) 34 | 35 | 36 | ########## 01. cal_woe() ########## 37 | 38 | def cal_woe(x, bin): 39 | """ 40 | The function applies the woe transformation to a numeric vector based on 41 | the binning outcome. 42 | 43 | Parameters: 44 | x : A numeric vector, which can be a list, 1-D numpy array, or pandas 45 | series 46 | bin : An object containing the binning outcome. 
47 | 48 | Returns: 49 | A list of dictionaries with three keys 50 | 51 | Example: 52 | ltv_bin = qtl_bin(ltv, bad) 53 | 54 | for x in cal_woe(ltv[:3], ltv_bin): 55 | print(x) 56 | 57 | # {'x': 109.0, 'bin': 6, 'woe': 0.2694} 58 | # {'x': 97.0, 'bin': 3, 'woe': 0.0045} 59 | # {'x': 105.0, 'bin': 5, 'woe': 0.1829} 60 | """ 61 | 62 | _cut = sorted(bin['cut'] + [numpy.PINF, numpy.NINF]) 63 | 64 | _dat = [[_1[0], _1[1], _2] for _1, _2 in zip(enumerate(x), ~numpy.isnan(x))] 65 | 66 | _m1 = [_[:2] for _ in _dat if _[2] == 0] 67 | _l1 = [_[:2] for _ in _dat if _[2] == 1] 68 | 69 | _l2 = [[*_1, _2] for _1, _2 in zip(_l1, numpy.searchsorted(_cut, [_[1] for _ in _l1]).tolist())] 70 | 71 | flatten = lambda l: [item for subl in l for item in subl] 72 | 73 | _l3 = flatten([[[*l, b['woe']] for l in _l2 if l[2] == b['bin']] for b in bin['tbl'] if b['bin'] > 0]) 74 | 75 | if len(_m1) > 0: 76 | if len([_ for _ in bin['tbl'] if _['miss'] > 0]) > 0: 77 | _m2 = [l + [_['bin'] for _ in bin['tbl'] if _['miss'] > 0] 78 | + [_['woe'] for _ in bin['tbl'] if _['miss'] > 0] for l in _m1] 79 | else: 80 | _m2 = [l + [0, 0] for l in _m1] 81 | _l3.extend(_m2) 82 | 83 | _key = ["x", "bin", "woe"] 84 | 85 | return(list(dict(zip(_key, _[1:])) for _ in sorted(_l3, key = lambda x: x[0]))) 86 | 87 | 88 | ########## 02. summ_bin() ########## 89 | 90 | def summ_bin(x): 91 | """ 92 | The function summarizes the binning outcome generated from a binning function, 93 | e.g. qtl_bin() or iso_bin(). 94 | 95 | Parameters: 96 | x: An object containing the binning outcome. 
97 | 98 | Returns: 99 | A dictionary with statistics derived from the binning outcome 100 | 101 | Example: 102 | summ_bin(iso_bin(ltv, bad)) 103 | # {'sample size': 5837, 'bad rate': 0.2049, 'iv': 0.185, 'ks': 16.88, 'missing': 0.0002} 104 | """ 105 | 106 | _freq = sum(_['freq'] for _ in x['tbl']) 107 | _bads = sum(_['bads'] for _ in x['tbl']) 108 | _miss = sum(_['miss'] for _ in x['tbl']) 109 | 110 | _iv = round(sum(_['iv'] for _ in x['tbl']), 4) 111 | _ks = round(max(_["ks"] for _ in x["tbl"]), 2) 112 | 113 | _br = round(_bads / _freq, 4) 114 | _mr = round(_miss / _freq, 4) 115 | 116 | return({"sample size": _freq, "bad rate": _br, "iv": _iv, "ks": _ks, "missing": _mr}) 117 | 118 | 119 | ########## 03. view_bin() ########## 120 | 121 | def view_bin(x): 122 | """ 123 | The function displays the binning outcome generated from a binning function, 124 | e.g. qtl_bin() or iso_bin(). 125 | 126 | Parameters: 127 | x: An object containing the binning outcome. 128 | 129 | Returns: 130 | None 131 | 132 | Example: 133 | view_bin(qtl_bin(df.ltv, df.bad)) 134 | """ 135 | 136 | tabulate.PRESERVE_WHITESPACE = True 137 | 138 | _sel = ["bin", "freq", "miss", "bads", "rate", "woe", "iv", "ks"] 139 | 140 | _tbl = [{**(lambda v: {k: v[k] for k in _sel})(_), "rule": _["rule"].ljust(45)} for _ in x["tbl"]] 141 | 142 | print(tabulate.tabulate(_tbl, headers = "keys", tablefmt = "github", 143 | colalign = ["center"] + ["right"] * 7 + ["center"], 144 | floatfmt = (".0f", ".0f", ".0f", ".0f", ".4f", ".4f", ".4f", ".2f"))) 145 | 146 | 147 | ########## 04. qcut() ########## 148 | 149 | def qcut(x, n): 150 | """ 151 | The function discretizes a numeric vector into n pieces based on quantiles. 152 | 153 | Parameters: 154 | x: A numeric vector. 155 | n: An integer indicating the number of categories to discretize. 156 | 157 | Returns: 158 | A list of numeric values to divide the vector x into n categories. 
159 | 160 | Example: 161 | qcut(range(10), 3) 162 | # [3, 6] 163 | """ 164 | 165 | _q = numpy.linspace(0, 100, n, endpoint = False)[1:] 166 | _x = [_ for _ in x if not numpy.isnan(_)] 167 | _c = numpy.unique(numpy.percentile(_x, _q, interpolation = "lower")) 168 | return([_ for _ in _c]) 169 | 170 | 171 | ########## 05. manual_bin() ########## 172 | 173 | def manual_bin(x, y, cuts): 174 | """ 175 | The function discretizes the x vector and then summarizes over the y vector 176 | based on the discretization result. 177 | 178 | Parameters: 179 | x : A numeric vector to discretize without missing values, 180 | e.g. numpy.nan or math.nan 181 | y : A numeric vector with binary values of 0/1 and with the same length 182 | of x 183 | cuts : A list of numeric values as cut points to discretize x. 184 | 185 | Returns: 186 | A list of dictionaries for the binning outcome. 187 | 188 | Example: 189 | for x in manual_bin(scr, bad, [650, 700, 750]): 190 | print(x) 191 | 192 | # {'bin': 1, 'freq': 1311, 'miss': 0, 'bads': 520.0, 'minx': 443.0, 'maxx': 650.0} 193 | # {'bin': 2, 'freq': 1688, 'miss': 0, 'bads': 372.0, 'minx': 651.0, 'maxx': 700.0} 194 | # {'bin': 3, 'freq': 1507, 'miss': 0, 'bads': 157.0, 'minx': 701.0, 'maxx': 750.0} 195 | # {'bin': 4, 'freq': 1016, 'miss': 0, 'bads': 42.0, 'minx': 751.0, 'maxx': 848.0} 196 | """ 197 | 198 | _x = [_ for _ in x] 199 | _y = [_ for _ in y] 200 | _c = sorted([_ for _ in set(cuts)] + [numpy.NINF, numpy.PINF]) 201 | _g = numpy.searchsorted(_c, _x).tolist() 202 | 203 | _l1 = sorted(zip(_g, _x, _y), key = lambda x: x[0]) 204 | _l2 = zip(set(_g), [[l for l in _l1 if l[0] == g] for g in set(_g)]) 205 | 206 | return(sorted([dict(zip(["bin", "freq", "miss", "bads", "minx", "maxx"], 207 | [_1, len(_2), 0, 208 | sum([_[2] for _ in _2]), 209 | min([_[1] for _ in _2]), 210 | max([_[1] for _ in _2])])) for _1, _2 in _l2], 211 | key = lambda x: x["bin"])) 212 | 213 | 214 | ########## 06. 
miss_bin() ########## 215 | 216 | def miss_bin(y): 217 | """ 218 | The function summarizes the y vector with binary values of 0/1 and is not 219 | supposed to be called directly by users. 220 | 221 | Parameters: 222 | y : A numeric vector with binary values of 0/1. 223 | 224 | Returns: 225 | A dictionary. 226 | """ 227 | 228 | return({"bin": 0, "freq": len([_ for _ in y]), "miss": len([_ for _ in y]), 229 | "bads": sum([_ for _ in y]), "minx": numpy.nan, "maxx": numpy.nan}) 230 | 231 | 232 | ########## 07. gen_rule() ########## 233 | 234 | def gen_rule(tbl, pts): 235 | """ 236 | The function generates binning rules based on the binning outcome table and 237 | a list of cut points and is a utility function that is not supposed to be 238 | called directly by users. 239 | 240 | Parameters: 241 | tbl : An intermediate table of the binning outcome within each binning 242 | function 243 | pts : A list of cut points for the binning 244 | 245 | Returns: 246 | A list of dictionaries with binning rules 247 | """ 248 | 249 | for _ in tbl: 250 | if _["bin"] == 0: 251 | _["rule"] = "numpy.isnan($X$)" 252 | elif _["bin"] == len(pts) + 1: 253 | if _["miss"] == 0: 254 | _["rule"] = "$X$ > " + str(pts[-1]) 255 | else: 256 | _["rule"] = "$X$ > " + str(pts[-1]) + " or numpy.isnan($X$)" 257 | elif _["bin"] == 1: 258 | if _["miss"] == 0: 259 | _["rule"] = "$X$ <= " + str(pts[0]) 260 | else: 261 | _["rule"] = "$X$ <= " + str(pts[0]) + " or numpy.isnan($X$)" 262 | else: 263 | _["rule"] = "$X$ > " + str(pts[_["bin"] - 2]) + " and $X$ <= " + str(pts[_["bin"] - 1]) 264 | 265 | _sel = ["bin", "freq", "miss", "bads", "rate", "woe", "iv", "ks", "rule"] 266 | 267 | return([{k: _[k] for k in _sel} for _ in tbl]) 268 | 269 | 270 | ########## 08. 
gen_woe() ########## 271 | 272 | def gen_woe(x): 273 | """ 274 | The function calculates weight of evidence and information value based on the 275 | binning outcome within each binning function and is a utility function that 276 | is not supposed to be called directly by users. 277 | 278 | Parameters: 279 | x : A list of dictionaries for the binning outcome. 280 | 281 | Returns: 282 | A list of dictionaries with additional keys to the input. 283 | """ 284 | 285 | _freq = sum(_["freq"] for _ in x) 286 | _bads = sum(_["bads"] for _ in x) 287 | 288 | _l1 = sorted([{**_, 289 | "rate": round(_["bads"] / _["freq"], 4), 290 | "woe" : round(numpy.log((_["bads"] / _bads) / ((_["freq"] - _["bads"]) / (_freq - _bads))), 4), 291 | "iv" : round((_["bads"] / _bads - (_["freq"] - _["bads"]) / (_freq - _bads)) * 292 | numpy.log((_["bads"] / _bads) / ((_["freq"] - _["bads"]) / (_freq - _bads))), 4) 293 | } for _ in x], key = lambda _x: _x["bin"]) 294 | 295 | cumsum = lambda x: [sum([_ for _ in x][0:(i + 1)]) for i in range(len(x))] 296 | 297 | _cumb = cumsum([_['bads'] / _bads for _ in _l1]) 298 | _cumg = cumsum([(_['freq'] - _['bads']) / (_freq - _bads) for _ in _l1]) 299 | _ks = [round(numpy.abs(_[0] - _[1]) * 100, 2) for _ in zip(_cumb, _cumg)] 300 | 301 | return([{**_1, "ks": _2} for _1, _2 in zip(_l1, _ks)]) 302 | 303 | 304 | ########## 09. add_miss() ########## 305 | 306 | def add_miss(d, l): 307 | """ 308 | The function appends the missing value category, if any, to the binning outcome 309 | and is a utility function that is not supposed to be called directly by 310 | users. 311 | 312 | Parameters: 313 | d : A list with lists generated by input vectors of binning functions. 314 | l : A list of dicts. 315 | 316 | Returns: 317 | A list of dicts. 
318 | """ 319 | 320 | _l = l[:] 321 | 322 | if len([_ for _ in d if _[2] == 0]) > 0: 323 | _m = miss_bin([_[1] for _ in d if _[2] == 0]) 324 | if _m["bads"] == 0: 325 | for _ in ['freq', 'miss', 'bads']: 326 | _l[0][_] = _l[0][_] + _m[_] 327 | elif _m["freq"] == _m["bads"]: 328 | for _ in ['freq', 'miss', 'bads']: 329 | _l[-1][_] = _l[-1][_] + _m[_] 330 | else: 331 | _l.append(_m) 332 | 333 | return(_l) 334 | 335 | 336 | ########## 10. qtl_bin() ########## 337 | 338 | def qtl_bin(x, y): 339 | """ 340 | The function discretizes the x vector based on percentiles and summarizes 341 | over the y vector to derive weight of evidence transformation (WoE) and 342 | information value. 343 | 344 | Parameters: 345 | x : A numeric vector to discretize. It can be a list, 1-D numpy array, or 346 | pandas series. 347 | y : A numeric vector with binary values of 0/1 and with the same length 348 | of x. It can be a list, 1-D numpy array, or pandas series. 349 | 350 | Returns: 351 | A dictionary with two keys: 352 | "cut" : A numeric vector with cut points applied to the x vector. 353 | "tbl" : A list of dictionaries summarizing the binning outcome. 
354 | 355 | Example: 356 | qtl_bin(derog, bad)["cut"] 357 | # [0.0, 1.0, 3.0] 358 | 359 | view_bin(qtl_bin(derog, bad)) 360 | | bin | freq | miss | bads | rate | woe | iv | ks | rule | 361 | |-------|--------|--------|--------|--------|---------|--------|-------|-----------------------------------------------| 362 | | 0 | 213 | 213 | 70 | 0.3286 | 0.6416 | 0.0178 | 2.77 | numpy.isnan($X$) | 363 | | 1 | 2850 | 0 | 367 | 0.1288 | -0.5559 | 0.1268 | 20.04 | $X$ <= 0.0 | 364 | | 2 | 891 | 0 | 193 | 0.2166 | 0.0704 | 0.0008 | 18.95 | $X$ > 0.0 and $X$ <= 1.0 | 365 | | 3 | 810 | 0 | 207 | 0.2556 | 0.2867 | 0.0124 | 14.63 | $X$ > 1.0 and $X$ <= 3.0 | 366 | | 4 | 1073 | 0 | 359 | 0.3346 | 0.6684 | 0.0978 | 0.00 | $X$ > 3.0 | 367 | """ 368 | 369 | _data = [_ for _ in zip(x, y, ~numpy.isnan(x))] 370 | 371 | _x = [_[0] for _ in _data if _[2] == 1] 372 | _y = [_[1] for _ in _data if _[2] == 1] 373 | 374 | _n = numpy.arange(2, max(3, min(50, len(numpy.unique(_x)) - 1))) 375 | _p = set(tuple(qcut(_x, _)) for _ in _n) 376 | 377 | _l1 = [[_, manual_bin(_x, _y, _)] for _ in _p] 378 | 379 | _l2 = [[l[0], 380 | min([_["bads"] / _["freq"] for _ in l[1]]), 381 | max([_["bads"] / _["freq"] for _ in l[1]]), 382 | scipy.stats.spearmanr([_["bin"] for _ in l[1]], [_["bads"] / _["freq"] for _ in l[1]])[0] 383 | ] for l in _l1] 384 | 385 | _l3 = [l[0] for l in sorted(_l2, key = lambda x: -len(x[0])) 386 | if numpy.abs(round(l[3], 8)) == 1 and round(l[1], 8) > 0 and round(l[2], 8) < 1][0] 387 | 388 | _l4 = sorted(*[l[1] for l in _l1 if l[0] == _l3], key = lambda x: x["bads"] / x["freq"]) 389 | 390 | _l5 = add_miss(_data, _l4) 391 | 392 | return({"cut": _l3, "tbl": gen_rule(gen_woe(_l5), _l3)}) 393 | 394 | 395 | ########## 11. bad_bin() ########## 396 | 397 | def bad_bin(x, y): 398 | """ 399 | The function discretizes the x vector based on percentiles and then 400 | summarizes over the y vector with y = 1 to derive the weight of evidence 401 | transformation (WoE) and information values. 
402 | 403 | Parameters: 404 | x : A numeric vector to discretize. It is a list, 1-D numpy array, 405 | or pandas series. 406 | y : A numeric vector with binary values of 0/1 and with the same length 407 | of x. It is a list, 1-D numpy array, or pandas series. 408 | 409 | Returns: 410 | A dictionary with two keys: 411 | "cut" : A numeric vector with cut points applied to the x vector. 412 | "tbl" : A list of dictionaries summarizing the binning outcome. 413 | 414 | Example: 415 | bad_bin(derog, bad)["cut"] 416 | # [0.0, 2.0, 4.0] 417 | 418 | view_bin(bad_bin(derog, bad)) 419 | 420 | | bin | freq | miss | bads | rate | woe | iv | ks | rule | 421 | |-------|--------|--------|--------|--------|---------|--------|-------|-----------------------------------------------| 422 | | 0 | 213 | 213 | 70 | 0.3286 | 0.6416 | 0.0178 | 2.77 | numpy.isnan($X$) | 423 | | 1 | 2850 | 0 | 367 | 0.1288 | -0.5559 | 0.1268 | 20.04 | $X$ <= 0.0 | 424 | | 2 | 1369 | 0 | 314 | 0.2294 | 0.1440 | 0.0051 | 16.52 | $X$ > 0.0 and $X$ <= 2.0 | 425 | | 3 | 587 | 0 | 176 | 0.2998 | 0.5078 | 0.0298 | 10.66 | $X$ > 2.0 and $X$ <= 4.0 | 426 | | 4 | 818 | 0 | 269 | 0.3289 | 0.6426 | 0.0685 | 0.00 | $X$ > 4.0 | 427 | """ 428 | 429 | _data = [_ for _ in zip(x, y, ~numpy.isnan(x))] 430 | 431 | _x = [_[0] for _ in _data if _[2] == 1] 432 | _y = [_[1] for _ in _data if _[2] == 1] 433 | 434 | _n = numpy.arange(2, max(3, min(50, len(numpy.unique([_[0] for _ in _data if _[1] == 1 and _[2] == 1])) - 1))) 435 | 436 | _p = set(tuple(qcut([_[0] for _ in _data if _[1] == 1 and _[2] == 1], _)) for _ in _n) 437 | 438 | _l1 = [[_, manual_bin(_x, _y, _)] for _ in _p] 439 | 440 | _l2 = [[l[0], 441 | min([_["bads"] / _["freq"] for _ in l[1]]), 442 | max([_["bads"] / _["freq"] for _ in l[1]]), 443 | scipy.stats.spearmanr([_["bin"] for _ in l[1]], [_["bads"] / _["freq"] for _ in l[1]])[0] 444 | ] for l in _l1] 445 | 446 | _l3 = [l[0] for l in sorted(_l2, key = lambda x: -len(x[0])) 447 | if numpy.abs(round(l[3], 8)) == 1 and 
round(l[1], 8) > 0 and round(l[2], 8) < 1][0] 448 | 449 | _l4 = sorted(*[l[1] for l in _l1 if l[0] == _l3], key = lambda x: x["bads"] / x["freq"]) 450 | 451 | _l5 = add_miss(_data, _l4) 452 | 453 | return({"cut": _l3, "tbl": gen_rule(gen_woe(_l5), _l3)}) 454 | 455 | 456 | ########## 12. iso_bin() ########## 457 | 458 | def iso_bin(x, y): 459 | """ 460 | The function discretizes the x vector based on the isotonic regression and 461 | then summarizes over the y vector to derive the weight of evidence 462 | transformation (WoE) and information values. 463 | 464 | Parameters: 465 | x : A numeric vector to discretize. It is a list, 1-D numpy array, 466 | or pandas series. 467 | y : A numeric vector with binary values of 0/1 and with the same length 468 | of x. It is a list, 1-D numpy array, or pandas series. 469 | 470 | Returns: 471 | A dictionary with two keys: 472 | "cut" : A numeric vector with cut points applied to the x vector. 473 | "tbl" : A list of dictionaries summarizing the binning outcome. 
474 | 475 | Example: 476 | iso_bin(derog, bad)["cut"] 477 | # [1.0, 2.0, 3.0, 23.0] 478 | 479 | view_bin(iso_bin(derog, bad)) 480 | | bin | freq | miss | bads | rate | woe | iv | ks | rule | 481 | |-------|--------|--------|--------|--------|---------|--------|-------|-----------------------------------------------| 482 | | 0 | 213 | 213 | 70 | 0.3286 | 0.6416 | 0.0178 | 2.77 | numpy.isnan($X$) | 483 | | 1 | 3741 | 0 | 560 | 0.1497 | -0.3811 | 0.0828 | 18.95 | $X$ <= 1.0 | 484 | | 2 | 478 | 0 | 121 | 0.2531 | 0.2740 | 0.0066 | 16.52 | $X$ > 1.0 and $X$ <= 2.0 | 485 | | 3 | 332 | 0 | 86 | 0.2590 | 0.3050 | 0.0058 | 14.63 | $X$ > 2.0 and $X$ <= 3.0 | 486 | | 4 | 1064 | 0 | 353 | 0.3318 | 0.6557 | 0.0931 | 0.44 | $X$ > 3.0 and $X$ <= 23.0 | 487 | | 5 | 9 | 0 | 6 | 0.6667 | 2.0491 | 0.0090 | 0.00 | $X$ > 23.0 | 488 | """ 489 | 490 | _data = [_ for _ in zip(x, y, ~numpy.isnan(x))] 491 | 492 | _x = [_[0] for _ in _data if _[2] == 1] 493 | _y = [_[1] for _ in _data if _[2] == 1] 494 | 495 | _cor = scipy.stats.spearmanr(_x, _y)[0] 496 | _reg = sklearn.isotonic.IsotonicRegression() 497 | 498 | _f = numpy.abs(_reg.fit_transform(_x, list(map(lambda y: y * _cor / numpy.abs(_cor), _y)))) 499 | 500 | _l1 = sorted(list(zip(_f, _x, _y)), key = lambda x: x[0]) 501 | 502 | _l2 = [[l for l in _l1 if l[0] == f] for f in sorted(set(_f))] 503 | 504 | _l3 = [[*set(_[0] for _ in l), 505 | max(_[1] for _ in l), 506 | numpy.mean([_[2] for _ in l]), 507 | sum(_[2] for _ in l)] for l in _l2] 508 | 509 | _c = sorted([_[1] for _ in [l for l in _l3 if l[2] < 1 and l[2] > 0 and l[3] > 1]]) 510 | _p = _c[1:-1] if len(_c) > 2 else _c[:-1] 511 | 512 | _l4 = sorted(manual_bin(_x, _y, _p), key = lambda x: x["bads"] / x["freq"]) 513 | 514 | _l5 = add_miss(_data, _l4) 515 | 516 | return({"cut": _p, "tbl": gen_rule(gen_woe(_l5), _p)}) 517 | 518 | 519 | ########## 13. 
rng_bin() ########## 520 | 521 | def rng_bin(x, y): 522 | """ 523 | The function discretizes the x vector based on the equal-width range and 524 | summarizes over the y vector to derive the weight of evidence transformation 525 | (WoE) and information values. 526 | 527 | Parameters: 528 | x : A numeric vector to discretize. It is a list, 1-D numpy array, 529 | or pandas series. 530 | y : A numeric vector with binary values of 0/1 and with the same length 531 | of x. It is a list, 1-D numpy array, or pandas series. 532 | 533 | Returns: 534 | A dictionary with two keys: 535 | "cut" : A numeric vector with cut points applied to the x vector. 536 | "tbl" : A list of dictionaries summarizing the binning outcome. 537 | 538 | Example: 539 | rng_bin(derog, bad)["cut"] 540 | # [7.0, 14.0, 21.0] 541 | 542 | view_bin(rng_bin(derog, bad)) 543 | | bin | freq | miss | bads | rate | woe | iv | ks | rule | 544 | |-------|--------|--------|--------|--------|---------|--------|------|-----------------------------------------------| 545 | | 0 | 213 | 213 | 70 | 0.3286 | 0.6416 | 0.0178 | 2.77 | numpy.isnan($X$) | 546 | | 1 | 5243 | 0 | 1001 | 0.1909 | -0.0881 | 0.0068 | 4.94 | $X$ <= 7.0 | 547 | | 2 | 322 | 0 | 104 | 0.3230 | 0.6158 | 0.0246 | 0.94 | $X$ > 7.0 and $X$ <= 14.0 | 548 | | 3 | 46 | 0 | 15 | 0.3261 | 0.6300 | 0.0037 | 0.35 | $X$ > 14.0 and $X$ <= 21.0 | 549 | | 4 | 13 | 0 | 6 | 0.4615 | 1.2018 | 0.0042 | 0.00 | $X$ > 21.0 | 550 | """ 551 | 552 | _data = [_ for _ in zip(x, y, ~numpy.isnan(x))] 553 | 554 | _x = [_[0] for _ in _data if _[2] == 1] 555 | _y = [_[1] for _ in _data if _[2] == 1] 556 | 557 | _n = numpy.arange(2, max(3, min(50, len(numpy.unique(_x)) - 1))) 558 | 559 | _m = [[numpy.median([_[0] for _ in _data if _[2] == 1 and _[1] == 1])], 560 | [numpy.median([_[0] for _ in _data if _[2] == 1])]] 561 | 562 | _p = list(set(tuple(qcut(numpy.unique(_x), _)) for _ in _n)) + _m 563 | 564 | _l1 = [[_, manual_bin(_x, _y, _)] for _ in _p] 565 | 566 | _l2 = [[l[0], 567 | 
min([_["bads"] / _["freq"] for _ in l[1]]), 568 | max([_["bads"] / _["freq"] for _ in l[1]]), 569 | scipy.stats.spearmanr([_["bin"] for _ in l[1]], [_["bads"] / _["freq"] for _ in l[1]])[0] 570 | ] for l in _l1] 571 | 572 | _l3 = [l[0] for l in sorted(_l2, key = lambda x: -len(x[0])) 573 | if numpy.abs(round(l[3], 8)) == 1 and round(l[1], 8) > 0 and round(l[2], 8) < 1][0] 574 | 575 | _l4 = sorted(*[l[1] for l in _l1 if l[0] == _l3], key = lambda x: x["bads"] / x["freq"]) 576 | 577 | _l5 = add_miss(_data, _l4) 578 | 579 | return({"cut": _l3, "tbl": gen_rule(gen_woe(_l5), _l3)}) 580 | 581 | 582 | ########## 14. kmn_bin() ########## 583 | 584 | def kmn_bin(x, y): 585 | """ 586 | The function discretizes the x vector based on the k-means clustering and then 587 | summarizes over the y vector to derive the weight of evidence transformation 588 | (WoE) and information values. 589 | 590 | Parameters: 591 | x : A numeric vector to discretize. It is a list, 1-D numpy array, 592 | or pandas series. 593 | y : A numeric vector with binary values of 0/1 and with the same length 594 | of x. It is a list, 1-D numpy array, or pandas series. 595 | 596 | Returns: 597 | A dictionary with two keys: 598 | "cut" : A numeric vector with cut points applied to the x vector. 599 | "tbl" : A list of dictionaries summarizing the binning outcome. 
600 | 601 | Example: 602 | kmn_bin(derog, bad)['cut'] 603 | # [1.0, 5.0, 11.0] 604 | 605 | view_bin(kmn_bin(derog, bad)) 606 | | bin | freq | miss | bads | rate | woe | iv | ks | rule | 607 | |-------|--------|--------|--------|--------|---------|--------|-------|-----------------------------------------------| 608 | | 0 | 213 | 213 | 70 | 0.3286 | 0.6416 | 0.0178 | 2.77 | numpy.isnan($X$) | 609 | | 1 | 3741 | 0 | 560 | 0.1497 | -0.3811 | 0.0828 | 18.95 | $X$ <= 1.0 | 610 | | 2 | 1249 | 0 | 366 | 0.2930 | 0.4753 | 0.0550 | 7.37 | $X$ > 1.0 and $X$ <= 5.0 | 611 | | 3 | 504 | 0 | 157 | 0.3115 | 0.5629 | 0.0318 | 1.72 | $X$ > 5.0 and $X$ <= 11.0 | 612 | | 4 | 130 | 0 | 43 | 0.3308 | 0.6512 | 0.0112 | 0.00 | $X$ > 11.0 | 613 | """ 614 | 615 | _data = [_ for _ in zip(x, y, ~numpy.isnan(x))] 616 | 617 | _x = [_[0] for _ in _data if _[2] == 1] 618 | _y = [_[1] for _ in _data if _[2] == 1] 619 | 620 | _n = numpy.arange(2, max(3, min(20, len(numpy.unique(_x)) - 1))) 621 | 622 | _m = [[numpy.median([_[0] for _ in _data if _[2] == 1 and _[1] == 1])], 623 | [numpy.median([_[0] for _ in _data if _[2] == 1])]] 624 | 625 | _c1 = [sklearn.cluster.KMeans(n_clusters = _, random_state = 1).fit(numpy.reshape(_x, [-1, 1])).labels_ for _ in _n] 626 | 627 | _c2 = [sorted(_l, key = lambda x: x[0]) for _l in [list(zip(_, _x)) for _ in _c1]] 628 | 629 | group = lambda x: [[_l for _l in x if _l[0] == _k] for _k in set([_[0] for _ in x])] 630 | 631 | upper = lambda x: sorted([max([_2[1] for _2 in _1]) for _1 in x]) 632 | 633 | _c3 = list(set(tuple(upper(_2)[:-1]) for _2 in [group(_1) for _1 in _c2])) + _m 634 | 635 | _l1 = [[_, manual_bin(_x, _y, _)] for _ in _c3] 636 | 637 | _l2 = [[l[0], 638 | min([_["bads"] / _["freq"] for _ in l[1]]), 639 | max([_["bads"] / _["freq"] for _ in l[1]]), 640 | scipy.stats.spearmanr([_["bin"] for _ in l[1]], [_["bads"] / _["freq"] for _ in l[1]])[0] 641 | ] for l in _l1] 642 | 643 | _l3 = [l[0] for l in sorted(_l2, key = lambda x: -len(x[0])) 644 | if 
numpy.abs(round(l[3], 8)) == 1 and round(l[1], 8) > 0 and round(l[2], 8) < 1][0] 645 | 646 | _l4 = sorted(*[l[1] for l in _l1 if l[0] == _l3], key = lambda x: x["bads"] / x["freq"]) 647 | 648 | _l5 = add_miss(_data, _l4) 649 | 650 | return({"cut": _l3, "tbl": gen_rule(gen_woe(_l5), _l3)}) 651 | 652 | 653 | ########## 15. gbm_bin() ########## 654 | 655 | def gbm_bin(x, y): 656 | """ 657 | The function discretizes the x vector based on the gradient boosting machine 658 | and then summarizes over the y vector to derive the weight of evidence 659 | transformation (WoE) and information values. 660 | 661 | Parameters: 662 | x : A numeric vector to discretize. It is a list, 1-D numpy array, 663 | or pandas series. 664 | y : A numeric vector with binary values of 0/1 and with the same length 665 | of x. It is a list, 1-D numpy array, or pandas series. 666 | 667 | Returns: 668 | A dictionary with two keys: 669 | "cut" : A numeric vector with cut points applied to the x vector. 670 | "tbl" : A list of dictionaries summarizing the binning outcome. 
671 | 672 | Example: 673 | gbm_bin(derog, bad)["cut"] 674 | # [1.0, 2.0, 3.0, 22.0, 26.0] 675 | 676 | view_bin(gbm_bin(derog, bad)) 677 | | bin | freq | miss | bads | rate | woe | iv | ks | rule | 678 | |-------|--------|--------|--------|--------|---------|--------|-------|-----------------------------------------------| 679 | | 0 | 213 | 213 | 70 | 0.3286 | 0.6416 | 0.0178 | 2.77 | numpy.isnan($X$) | 680 | | 1 | 3741 | 0 | 560 | 0.1497 | -0.3811 | 0.0828 | 18.95 | $X$ <= 1.0 | 681 | | 2 | 478 | 0 | 121 | 0.2531 | 0.2740 | 0.0066 | 16.52 | $X$ > 1.0 and $X$ <= 2.0 | 682 | | 3 | 332 | 0 | 86 | 0.2590 | 0.3050 | 0.0058 | 14.63 | $X$ > 2.0 and $X$ <= 3.0 | 683 | | 4 | 1063 | 0 | 353 | 0.3321 | 0.6572 | 0.0934 | 0.42 | $X$ > 3.0 and $X$ <= 22.0 | 684 | | 5 | 6 | 0 | 3 | 0.5000 | 1.3559 | 0.0025 | 0.23 | $X$ > 22.0 and $X$ <= 26.0 | 685 | | 6 | 4 | 0 | 3 | 0.7500 | 2.4546 | 0.0056 | 0.00 | $X$ > 26.0 | 686 | """ 687 | 688 | _data = [_ for _ in zip(x, y, ~numpy.isnan(x))] 689 | 690 | _x = [_[0] for _ in _data if _[2] == 1] 691 | _y = [_[1] for _ in _data if _[2] == 1] 692 | 693 | _cor = scipy.stats.spearmanr(_x, _y)[0] 694 | _con = "1" if _cor > 0 else "-1" 695 | 696 | _gbm = lightgbm.LGBMRegressor(num_leaves = 100, min_child_samples = 3, n_estimators = 1, 697 | random_state = 1, monotone_constraints = _con) 698 | _gbm.fit(numpy.reshape(_x, [-1, 1]), _y) 699 | 700 | _f = numpy.abs(_gbm.predict(numpy.reshape(_x, [-1, 1]))) 701 | 702 | _l1 = sorted(list(zip(_f, _x, _y)), key = lambda x: x[0]) 703 | 704 | _l2 = [[l for l in _l1 if l[0] == f] for f in sorted(set(_f))] 705 | 706 | _l3 = [[*set(_[0] for _ in l), 707 | max(_[1] for _ in l), 708 | numpy.mean([_[2] for _ in l]), 709 | sum(_[2] for _ in l)] for l in _l2] 710 | 711 | _c = sorted([_[1] for _ in [l for l in _l3 if l[2] < 1 and l[2] > 0 and l[3] > 1]]) 712 | 713 | _p = _c[1:-1] if len(_c) > 2 else _c[:-1] 714 | 715 | _l4 = sorted(manual_bin(_x, _y, _p), key = lambda x: x["bads"] / x["freq"]) 716 | 717 | _l5 = 
add_miss(_data, _l4) 718 | 719 | return({"cut": _p, "tbl": gen_rule(gen_woe(_l5), _p)}) 720 | -------------------------------------------------------------------------------- /py_mob/py_mob1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/statcompute/py_mob/1e20ceeeee03e71b075b052206cc8bb9fe7af9ff/py_mob/py_mob1.jpg --------------------------------------------------------------------------------
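For readers tracing the numpy.searchsorted() calls that manual_bin() and cal_woe() use to place values into bins, the lookup step can be sketched with the standard library alone. The cut points and values below are made-up illustrations, and the boundary semantics mirror the '$X$ > a and $X$ <= b' rules emitted by gen_rule():

```python
import bisect
import math

def assign_bin(value, cuts):
    """Sketch of searchsorted-style bin lookup on cut points padded with a +inf sentinel.

    A value x falls into bin i when cuts[i-1] < x <= cuts[i], so a value equal
    to a cut point belongs to the lower bin.
    """
    edges = sorted(cuts) + [math.inf]
    return bisect.bisect_left(edges, value) + 1  # bins are numbered from 1

cuts = [650, 700, 750]  # made-up score cut points
assert assign_bin(443, cuts) == 1   # 443 <= 650
assert assign_bin(650, cuts) == 1   # boundary value goes to the lower bin ('<=')
assert assign_bin(651, cuts) == 2   # 650 < 651 <= 700
assert assign_bin(848, cuts) == 4   # above the last cut point
```

Using bisect_left here matches the default side='left' behavior of numpy.searchsorted, which is what keeps the generated rules and the actual bin assignment consistent at the cut points.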