├── .gitignore
├── binder
│   └── requirements.yml
├── data
│   └── default of credit card clients.xls
├── Makefile
├── ci
│   └── Dockerfile
├── README.md
└── src
    ├── km.py
    └── d3.min.js

/.gitignore:
--------------------------------------------------------------------------------
__pycache__/
.ipynb_checkpoints/

--------------------------------------------------------------------------------
/binder/requirements.yml:
--------------------------------------------------------------------------------
dependencies:
  - wget
  - pytz

--------------------------------------------------------------------------------
/data/default of credit card clients.xls:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/natbusa/deepcredit/HEAD/data/default of credit card clients.xls
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
build:
	docker build -f ci/Dockerfile -t natbusa/deepcredit:latest .

run:
	docker run -p 8888:8888 natbusa/deepcredit start.sh jupyter lab

--------------------------------------------------------------------------------
/ci/Dockerfile:
--------------------------------------------------------------------------------
FROM jupyter/minimal-notebook

# Only COPY the binder dir, to avoid polluting the cache
USER root
COPY binder /home/$NB_USER/binder
RUN fix-permissions /home/$NB_USER/binder

# Run the env update as a non-privileged user
USER $NB_USER
RUN conda env update -q -n base -f /home/$NB_USER/binder/requirements.yml && \
    conda clean -tipsy && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER

# Copy the rest of the files as late as possible to avoid cache busting
USER root
COPY ci /home/$NB_USER/ci
COPY src /home/$NB_USER/src
COPY data /home/$NB_USER/data
COPY README.md /home/$NB_USER
COPY Makefile /home/$NB_USER
RUN fix-permissions /home/$NB_USER

# Ready to go: switch to user $NB_USER
USER $NB_USER

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Predicting defaulting on credit card applications

When customers run into financial difficulties, it usually does not happen all at once. There are indicators which can be used to anticipate the final outcome, such as late payments, calls to customer service, enquiries about products, or a different browsing pattern on the web or mobile app. By acting on such patterns it is possible to prevent default, or at least to guide the process, providing a better service for the customer as well as reduced risk for the bank.

In this tutorial we will look at how to predict defaulting, using statistics, machine learning and deep learning. We will also look at how to summarize the data using topological data analysis (TDA). Finally, we will look at how to expose the model as an API and use it for account alerting.
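
As a minimal sketch of the "Getting the Data" phase (assuming `pandas` with Excel support is available in the environment, and that the workbook keeps the usual UCI layout with a one-row title header and a `default payment next month` label column):

```python
import pandas as pd

# Load the UCI "default of credit card clients" workbook shipped in data/.
# header=1 skips the spreadsheet's title row so the descriptive column names
# are used; adjust if the layout of the file differs.
df = pd.read_excel("data/default of credit card clients.xls", header=1)

print(df.shape)
print(df["default payment next month"].value_counts(normalize=True))
```

The last column is the binary default flag that the models described below are trained to predict.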

### Synopsis

This notebook unfolds in the following phases:

- Getting the Data
- Data Preparation
- Descriptive analytics
- Feature Engineering
- Dimensionality Reduction
- Modeling
- Explainability

### Modeling

I will compare the predictive power of four well-known classes of algorithms:

- Logistic regression (scikit-learn)
- Random Forests (scikit-learn)
- Boosted Trees (xgboost)
- Deep Learning (keras/tensorflow)

### Explainability

Understanding why a given model predicts the way it does is probably just as important as achieving good accuracy on the predicted results. I will present a few methods which can be used to interpret and explain the results and gain a better understanding of the dataset and its features:

- Extract activations from an inner layer of a neural network
- Apply TSNE on the activation data
- OPTICS for variable density clustering
- TDA with a Kepler Mapper using the TSNE lens

### Source

The dataset is available at the Center for Machine Learning and Intelligent Systems, Bren School of Information and Computer Science, University of California, Irvine: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Citation:
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

Dataset creator: I-Cheng Yeh
Email addresses: (1) icyeh '@' chu.edu.tw (2) 140910 '@' mail.tku.edu.tw
Institutions: (1) Department of Information Management, Chung Hua University, Taiwan. (2) Department of Civil Engineering, Tamkang University, Taiwan.
Other contact information: 886-2-26215656 ext. 3181

Extract from the [dataset information section][1]:
"This research aimed at the case of customers' default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients."

[1]: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

--------------------------------------------------------------------------------
/src/km.py:
--------------------------------------------------------------------------------
from __future__ import division
import numpy as np
from collections import defaultdict
import json
import itertools
from sklearn import cluster, preprocessing, manifold
from datetime import datetime
import sys

class KeplerMapper(object):
    # With this class you can build topological networks from (high-dimensional) data.
    #
    # 1) Fit a projection/lens/function to a dataset and transform it.
    #    For instance "mean_of_row(x) for x in X"
    # 2) Map this projection with overlapping intervals/hypercubes.
    #    Cluster the points inside the interval
    #    (Note: we cluster on the inverse image/original data to lessen projection loss).
    #    If two clusters/nodes have the same members (due to the overlap), then:
    #    connect these with an edge.
    # 3) Visualize the network using HTML and D3.js.
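    #
    # Minimal usage sketch (illustrative only: the array `data` and the output
    # file name below are placeholders, not defined in this module):
    #
    #   mapper = KeplerMapper(verbose=1)
    #   lens = mapper.fit_transform(data, projection="dist_mean")
    #   graph = mapper.map(lens, inverse_X=data, nr_cubes=10, overlap_perc=0.1)
    #   mapper.visualize(graph, path_html="mapper_output.html", title="Credit default")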
    #
    # functions
    # ---------
    # fit_transform: Create a projection (lens) from a dataset
    # map: Apply the Mapper algorithm on this projection and build a simplicial complex
    # visualize: Turn the complex dictionary into an HTML/D3.js visualization

    def __init__(self, verbose=2):
        self.verbose = verbose

        self.chunk_dist = []
        self.overlap_dist = []
        self.d = []
        self.nr_cubes = 0
        self.overlap_perc = 0
        self.clusterer = False

    def fit_transform(self, X, projection="sum", scaler=preprocessing.MinMaxScaler()):
        # Creates the projection/lens from X.
        #
        # Input: X. Input features as a numpy array.
        # Output: projected_X. Original data transformed to a projection (lens).
        #
        # parameters
        # ----------
        # projection:   Projection parameter is either a string,
        #               a scikit class with fit_transform, like manifold.TSNE(),
        #               or a list of dimension indices.
        # scaler:       if None, do no scaling, else apply scaling to the projection
        #               Default: Min-Max scaling

        self.scaler = scaler
        self.projection = str(projection)

        # Detect if projection is a string (for standard functions)
        if isinstance(projection, str):
            if self.verbose > 0:
                print("Projecting data using: %s"%(projection))
            # Stats lenses
            if projection == "sum":  # sum of row
                X = np.sum(X, axis=1).reshape((X.shape[0],1))
            if projection == "mean":  # mean of row
                X = np.mean(X, axis=1).reshape((X.shape[0],1))
            if projection == "median":  # median of row
                X = np.median(X, axis=1).reshape((X.shape[0],1))
            if projection == "max":  # max of row
                X = np.max(X, axis=1).reshape((X.shape[0],1))
            if projection == "min":  # min of row
                X = np.min(X, axis=1).reshape((X.shape[0],1))
            if projection == "std":  # std of row
                X = np.std(X, axis=1).reshape((X.shape[0],1))

            if projection == "dist_mean":  # Distance of x to mean of X
                X_mean = np.mean(X, axis=0)
                X = np.sum(np.sqrt((X - X_mean)**2), axis=1).reshape((X.shape[0],1))

        # Detect if projection is a list (with dimension indices)
        elif isinstance(projection, list):
            if self.verbose > 0:
                print("Projecting data using: %s"%(str(projection)))
            X = X[:,np.array(projection)]

        # Detect if projection is a class (for scikit-learn)
        elif str(type(projection))[1:6] == "class":  # TODO: de-ugly-fy
            reducer = projection
            if self.verbose > 0:
                try:
                    projection.set_params(**{"verbose":self.verbose})
                except:
                    pass
                print("Projecting data using: %s"%str(projection))
            X = reducer.fit_transform(X)

        # Scaling
        if scaler is not None:
            if self.verbose > 0:
                print("Scaling with: %s"%str(scaler))
            X = scaler.fit_transform(X)

        return X

    def map(self, projected_X, inverse_X=None, clusterer=cluster.DBSCAN(eps=0.5,min_samples=3), nr_cubes=10, overlap_perc=0.1):
        # This maps the data to a simplicial complex. Returns a dictionary with nodes and links.
        #
        # Input: projected_X. A Numpy array with the projection/lens.
        # Output: complex. A dictionary with "nodes", "links" and "meta" information.
        #
        # parameters
        # ----------
        # projected_X     A Numpy array with the projection/lens. Required.
        # inverse_X       Numpy array or None. If None then the projection itself is used for clustering.
        # clusterer       Scikit-learn API compatible clustering algorithm. Default: DBSCAN
        # nr_cubes        Int. The number of intervals/hypercubes to create.
        # overlap_perc    Float. The percentage of overlap "between" the intervals/hypercubes.

        start = datetime.now()

        # Helper function
        def cube_coordinates_all(nr_cubes, nr_dimensions):
            # Helper function to get origin coordinates for our intervals/hypercubes
            # Useful for looping no matter the number of cubes or dimensions
            # Example: if there are 4 cubes per dimension and 3 dimensions,
            #          return the bottom left (origin) coordinates of 64 hypercubes,
            #          as a sorted list of Numpy arrays
            # TODO: elegance-ify...
            l = []
            for x in range(nr_cubes):
                l += [x] * nr_dimensions
            return [np.array(list(f)) for f in sorted(set(itertools.permutations(l,nr_dimensions)))]

        nodes = defaultdict(list)
        links = defaultdict(list)
        complex = {}
        self.nr_cubes = nr_cubes
        self.clusterer = clusterer
        self.overlap_perc = overlap_perc

        if self.verbose > 0:
            print("Mapping on data shaped %s using dimensions"%(str(projected_X.shape)))

        # If inverse image is not provided, we use the projection as the inverse image (suffer projection loss)
        if inverse_X is None:
            inverse_X = projected_X

        # We chop up the min-max column ranges into 'nr_cubes' parts
        self.chunk_dist = (np.max(projected_X, axis=0) - np.min(projected_X, axis=0))/nr_cubes

        # We calculate the overlapping windows distance
        self.overlap_dist = self.overlap_perc * self.chunk_dist

        # We find our starting point
        self.d = np.min(projected_X, axis=0)

        # Use a dimension index array on the projected X
        # (For now this uses the entire dimensionality, but we keep it for experimentation)
        di = np.array([x for x in range(projected_X.shape[1])])

        # Prefixing the data with IDs
        ids = np.array([x for x in range(projected_X.shape[0])])
        projected_X = np.c_[ids,projected_X]
        inverse_X = np.c_[ids,inverse_X]

        # Subdivide the projected data X in intervals/hypercubes with overlap
        if self.verbose > 0:
            total_cubes = len(cube_coordinates_all(nr_cubes,projected_X.shape[1]-1))
            print("Creating %s hypercubes." % total_cubes)

        for i, coor in enumerate(cube_coordinates_all(nr_cubes,di.shape[0])):
            # Slice the hypercube
            hypercube = projected_X[ np.invert(np.any((projected_X[:,di+1] >= self.d[di] + (coor * self.chunk_dist[di])) &
                                                      (projected_X[:,di+1] < self.d[di] + (coor * self.chunk_dist[di]) + self.chunk_dist[di] + self.overlap_dist[di]) == False, axis=1 )) ]

            if self.verbose > 1:
                print("There are %s points in cube_%s / %s with starting range %s"%
                      (hypercube.shape[0],i,total_cubes,self.d[di] + (coor * self.chunk_dist[di])))

            # If at least one sample inside the hypercube
            if hypercube.shape[0] > 0:
                # Cluster the data point(s) in the cube, skipping the id-column
                # Note that we apply clustering on the inverse image (original data samples) that fall inside the cube.
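                # Column 0 of hypercube holds the sample ids prepended earlier, so they are
                # used to pick the matching rows of the (id-prefixed) inverse image below.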
                inverse_x = inverse_X[[int(nn) for nn in hypercube[:,0]]]

                clusterer.fit(inverse_x[:,1:])

                if self.verbose > 1:
                    print("Found %s clusters in cube_%s"%(np.unique(clusterer.labels_[clusterer.labels_ > -1]).shape[0],i))

                # Now for every (sample id in cube, predicted cluster label)
                for a in np.c_[hypercube[:,0],clusterer.labels_]:
                    if a[1] != -1:  # if not predicted as noise
                        cluster_id = str(coor[0])+"_"+str(i)+"_"+str(a[1])+"_"+str(coor)+"_"+str(self.d[di] + (coor * self.chunk_dist[di]))  # TODO: de-rudimentary-ify
                        nodes[cluster_id].append( int(a[0]) )  # Append the member ids as integers
            else:
                if self.verbose > 1:
                    print("Cube_%s is empty."%(i))

        # Create links when clusters from different hypercubes have members with the same sample id.
        candidates = itertools.combinations(nodes.keys(),2)
        for candidate in candidates:
            # if there are non-unique members in the union
            if len(nodes[candidate[0]]+nodes[candidate[1]]) != len(set(nodes[candidate[0]]+nodes[candidate[1]])):
                links[candidate[0]].append( candidate[1] )

        # Reporting
        if self.verbose > 0:
            nr_links = 0
            for k in links:
                nr_links += len(links[k])
            print("Created %s edges and %s nodes in %s."%(nr_links,len(nodes),str(datetime.now()-start)))

        complex["nodes"] = nodes
        complex["links"] = links
        complex["meta"] = self.projection

        return complex

    def visualize(self, complex, color_function="", path_html="mapper_visualization_output.html", title="My Data",
                  graph_link_distance=30, graph_gravity=0.1, graph_charge=-120, custom_tooltips=None, width_html=0,
                  height_html=0, show_tooltips=True, show_title=True, show_meta=True, bg_color_css="#111"):
        # Turns the dictionary 'complex' into an HTML file with D3.js
        #
        # Input: complex. Dictionary (output from calling .map())
        # Output: an HTML page saved as a file in 'path_html'.
        #
        # parameters
        # ----------
        # color_function        string. Not fully implemented. Default: "" (distance to origin)
        # path_html             file path as string. Where to save the HTML page.
        # title                 string. HTML page document title and first heading.
        # graph_link_distance   int. Edge length.
        # graph_gravity         float. "Gravity" to center of layout.
        # graph_charge          int. Charge between nodes.
        # custom_tooltips       None or Numpy array. You could use the "y"-label array for this.
        # width_html            int. Width of canvas. Default: 0 (full width)
        # height_html           int. Height of canvas. Default: 0 (full height)
        # show_tooltips         bool. Default: True
        # show_title            bool. Default: True
        # show_meta             bool. Default: True

        # Format JSON for D3 graph
        json_s = {}
        json_s["nodes"] = []
        json_s["links"] = []
        k2e = {}  # a key to incremental int dict, used for ids when linking

        for e, k in enumerate(complex["nodes"]):
            # Tooltip and node color formatting, TODO: de-mess-ify
            if custom_tooltips is not None:
                tooltip_s = "
[... truncated: the remainder of src/km.py and the minified src/d3.min.js library source are not reproduced in this dump ...]