├── Data
│   └── synthetic-data-from-a-financial-payment-system
│       ├── bs140513_032310.csv
│       └── bsNET140513_032310.csv
├── Fraud Detection on Bank Payments.ipynb
├── Fraud_Detection_bank_sim.py
└── Readme.md
/Fraud Detection on Bank Payments.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"markdown","source":"# Fraud Detection on Bank Payments\n\n## Fraud and detecting it\n\nFraudulent behavior can be seen across many different fields such as e-commerce, healthcare, payment and banking systems. Fraud is a billion-dollar business and it is increasing every year. The PwC global economic crime survey of 2018 [1] found that half (49 percent) of the 7,200 companies they surveyed had experienced fraud of some kind.\n\nEven if fraud seems to be scary for businesses it can be detected using intelligent systems such as rules engines or machine learning. Most people here in Kaggle are familier with machine learning but for rule engines here is a quick information. \n A rules engine is a software system that executes one or more business rules in a runtime production environment. These rules are generally written by domain experts for transferring the knowledge of the problem to the rules engine and from there to production. Two rules examples for fraud detection would be limiting the number of transactions in a time period (velocity rules), denying the transactions which come from previously known fraudulent IP's and/or domains.\n \nRules are great for detecting some type of frauds but they can fire a lot of false positives or false negatives in some cases because they have predefined threshold values. For example let's think of a rule for denying a transaction which has an amount that is bigger than 10000 dollars for a specific user. If this user is an experienced fraudster, he/she may be aware of the fact that the system would have a threshold and he/she can just make a transaction just below the threshold value (9999 dollars).\n\nFor these type of problems ML comes for help and reduce the risk of frauds and the risk of business to lose money. With the combination of rules and machine learning, detection of the fraud would be more precise and confident.\n\n## Banksim dataset\n\nWe detect the fraudulent transactions from the Banksim dataset. This synthetically generated dataset consists of payments from various customers made in different time periods and with different amounts. For\nmore information on the dataset you can check the [Kaggle page](https://www.kaggle.com/ntnu-testimon/banksim1) for this dataset which also has the link to the original paper. \n\nHere what we'll do in this kernel:\n1. [Exploratory Data Analysis (EDA)](#Explaratory-Data-Analysis)\n2. [Data Preprocessing](#Data-Preprocessing)\n3. [Oversampling with SMOTE](#Oversampling-with-SMOTE)\n4. [K-Neighbours Classifier](#K-Neighbours-Classifier)\n5. [Random Forest Classifier](#Random-Forest-Classifier)\n6. [XGBoost Classifier](#XGBoost-Classifier)\n7. 
[Conclusion](#Conclusion)"},{"metadata":{},"cell_type":"markdown","source":"## Exploratory Data Analysis\n\nIn this chapter we will perform EDA on the data and try to gain some insight from it."},{"metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","trusted":true,"_kg_hide-input":true},"cell_type":"code","source":"# Necessary imports\n\n## Data loading and processing\nimport pandas as pd\nimport numpy as np\nfrom imblearn.over_sampling import SMOTE\n\n## Visualization\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n# set seaborn style because it's prettier\nsns.set()\n\n## Metrics\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import confusion_matrix, classification_report\nfrom sklearn.metrics import roc_curve, auc\n\n## Models\nimport xgboost as xgb\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.ensemble import VotingClassifier","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"e1b7b9feedaed877bfee5bd871c627cdbcf2e31b"},"cell_type":"markdown","source":"**Data**\nAs we can see in the first rows below, the dataset has 9 feature columns and a target column. \nThe feature columns are:\n* **Step**: This feature represents the day from the start of the simulation. It has 180 steps, so the simulation ran for virtually 6 months.\n* **Customer**: This feature represents the customer id\n* **zipCodeOrigin**: The zip code of origin/source.\n* **Merchant**: The merchant's id\n* **zipMerchant**: The merchant's zip code\n* **Age**: Categorized age \n * 0: <= 18, \n * 1: 19-25, \n * 2: 26-35, \n * 3: 36-45,\n * 4: 46-55,\n * 5: 56-65,\n * 6: > 65\n * U: Unknown\n* **Gender**: Gender of the customer\n * E: Enterprise,\n * F: Female,\n * M: Male,\n * U: Unknown\n* **Category**: Category of the purchase. I won't list all the categories here; we'll see them later in the analysis.\n* **Amount**: Amount of the purchase\n* **Fraud**: Target variable which shows whether the transaction is fraudulent (1) or benign (0)"},{"metadata":{"trusted":true,"_uuid":"2bf619515c1a924cf041c419cb4a0514ba6ce820","_kg_hide-input":true},"cell_type":"code","source":"# read the data and show first 5 rows\ndata = pd.read_csv(\"../input/bs140513_032310.csv\")\ndata.head(5)","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"29a786fcbe642321727b007bd015cec05a198238"},"cell_type":"markdown","source":"Let's look at the column types and missing values in the data. There are **no** missing values, which means we don't have to perform any imputation."},{"metadata":{"trusted":true,"_uuid":"c285fad4bda30a3b5feec97b71e0a9d1e0b7ec8c"},"cell_type":"code","source":"data.info()","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"36778347862db884536c7b89e817c959acddc514"},"cell_type":"markdown","source":"**Fraud data** is imbalanced, as you can see in the plot below and from the count of instances. To balance the dataset one can apply oversampling or undersampling techniques. Oversampling increases the number of minority-class instances by generating new instances from that class. Undersampling reduces the number of instances in the majority class by randomly selecting points from it until it is equal in size to the minority class. Both operations have some risks: oversampling creates copies or similar data points, which is sometimes unhelpful for fraud detection because fraudulent transactions may vary. 
Undersampling means that we lose data points, and thus information. We will apply an oversampling technique called SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates new data points from the minority class using neighbouring instances, so the generated samples are not exact copies but are similar to the instances we have."},{"metadata":{"trusted":true,"_uuid":"ee9ed07328bb6ae710263c4e55066f3e757206a0","_kg_hide-input":true},"cell_type":"code","source":"# Create two dataframes with fraud and non-fraud data \ndf_fraud = data.loc[data.fraud == 1] \ndf_non_fraud = data.loc[data.fraud == 0]\n\nsns.countplot(x=\"fraud\",data=data)\nplt.title(\"Count of Fraudulent Payments\")\nplt.show()\nprint(\"Number of normal examples: \",df_non_fraud.fraud.count())\nprint(\"Number of fraudulent examples: \",df_fraud.fraud.count())\n#print(data.fraud.value_counts()) # does the same thing as above","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"cf77a853d6b363b91512d3c3ce22717a5ed29645"},"cell_type":"markdown","source":"We can see the mean amount and fraud percentage by category below. It looks like leisure and travel are the categories fraudsters select most. Fraudsters seem to choose the categories on which people spend more on average. Let's check this hypothesis by comparing the amounts transacted in fraudulent and non-fraudulent payments."},{"metadata":{"trusted":true,"_uuid":"44479e279e91d208d9108112327441e5d57d5fee"},"cell_type":"code","source":"print(\"Mean feature values per category\",data.groupby('category')[['amount','fraud']].mean())","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"258b20b07d771a0fd9a7a22fda617e40ce967842"},"cell_type":"markdown","source":"Our hypothesis that fraudsters choose the categories on which people spend more is only partly correct, but as we can see in the table below, we can confidently say that a fraudulent transaction is typically much larger (about four times or more) than the average for its category."},{"metadata":{"trusted":true,"_uuid":"b5a14927d2edfb847e809969fd534e9bae339a2a"},"cell_type":"code","source":"# Mean amounts for fraud and non-fraud data and fraud percentage per category \npd.concat([df_fraud.groupby('category')['amount'].mean(),df_non_fraud.groupby('category')['amount'].mean(),\\\n data.groupby('category')['fraud'].mean()*100],keys=[\"Fraudulent\",\"Non-Fraudulent\",\"Percent(%)\"],axis=1,\\\n sort=False).sort_values(by=['Non-Fraudulent'])","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"e41aeb2f2c6b898ebbcb5b4222fd44ca8e87c8ae"},"cell_type":"markdown","source":"Average amounts spent per category are similar, between 0 and 500 discarding the outliers, except for the travel category, which goes much higher. 
"},{"metadata":{"trusted":true,"_uuid":"290e7163a62fb029fcbdf45a3d35c7c0d58c423d","_kg_hide-input":false},"cell_type":"code","source":"# Plot histograms of the amounts in fraud and non-fraud data \nplt.figure(figsize=(30,10))\nsns.boxplot(x=data.category,y=data.amount)\nplt.title(\"Boxplot for the Amount spend in category\")\nplt.ylim(0,4000)\nplt.legend()\nplt.show()","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"69dbe47eb7f490030006aaa42bac0a7414b3c100"},"cell_type":"markdown","source":"Again we can see in the histogram below the fradulent transactions are less in count but more in amount."},{"metadata":{"trusted":true,"_uuid":"93272d1343f6783a0152a80b5a177541faed8e25"},"cell_type":"code","source":"# Plot histograms of the amounts in fraud and non-fraud data \nplt.hist(df_fraud.amount, alpha=0.5, label='fraud',bins=100)\nplt.hist(df_non_fraud.amount, alpha=0.5, label='nonfraud',bins=100)\nplt.title(\"Histogram for fraudulent and nonfraudulent payments\")\nplt.ylim(0,10000)\nplt.xlim(0,1000)\nplt.legend()\nplt.show()","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"76c33765a55bb89e0fb8a5d1cd55aa6bd7727dfa"},"cell_type":"markdown","source":"Looks like fraud occurs more in ages equal and below 18(0th category). Can it be because of fraudsters thinking it would be less consequences if they show their age younger, or maybe they really are young."},{"metadata":{"trusted":true,"_uuid":"d5e387216fa347c7f437c7f187095edd3912f848"},"cell_type":"code","source":"print((data.groupby('age')['fraud'].mean()*100).reset_index().rename(columns={'age':'Age','fraud' : 'Fraud Percent'}).sort_values(by='Fraud Percent'))","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"23001479b16bd3f01682dedd94b47a3f231a7c9e"},"cell_type":"markdown","source":"## Data Preprocessing\n\nIn this part we will preprocess the data and prepare for the training.\n\nThere are only one unique zipCode values so we will drop them."},{"metadata":{"trusted":true,"_uuid":"47246c26a027470f30ea0a5e4a149e6dbb951403"},"cell_type":"code","source":"print(\"Unique zipCodeOri values: \",data.zipcodeOri.nunique())\nprint(\"Unique zipMerchant values: \",data.zipMerchant.nunique())\n# dropping zipcodeori and zipMerchant since they have only one unique value\ndata_reduced = data.drop(['zipcodeOri','zipMerchant'],axis=1)","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"7f589a60028130489740b7ceaa12ca7c4879c2e7"},"cell_type":"markdown","source":"Checking the data after dropping."},{"metadata":{"trusted":true,"_uuid":"4b5689fb2da6f4ab7d4ba8e707530ea10753e8bb"},"cell_type":"code","source":"data_reduced.columns","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"5da150a203f2ab7e310af70365a4e90f7f522168"},"cell_type":"markdown","source":"Here we will transform categorical features into numerical values. It is usually better to turn these type of categorical values into dummies because they have no relation in size(i.e. customer1 is not greater than customer2) but since they are too many (over 500k customers and merchants) the features will grow 10^5 in size and it will take forever to train. 
I've put the code below for turning the categorical features into dummies if you want to give it a try.\n> data_reduced.loc[:,['customer','merchant','category']].astype('category')\n> data_dum = pd.get_dummies(data_reduced.loc[:,['customer','merchant','category','gender']],drop_first=True) # dummies\n> print(data_dum.info())"},{"metadata":{"trusted":true,"_uuid":"4d7d4e448b2fa8c9313810f7c24b6e010af719e4"},"cell_type":"code","source":"# turning object columns into the categorical dtype to ease the transformation process\ncol_categorical = data_reduced.select_dtypes(include= ['object']).columns\nfor col in col_categorical:\n data_reduced[col] = data_reduced[col].astype('category')\n# categorical values ==> numeric values\ndata_reduced[col_categorical] = data_reduced[col_categorical].apply(lambda x: x.cat.codes)\ndata_reduced.head(5)","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"37f7ed8c735c17e699e36cc34699df01a2fddda8"},"cell_type":"markdown","source":"Let's define our independent variables (X) and dependent/target variable (y)."},{"metadata":{"trusted":true,"_uuid":"bca1ece9de83b5f0db01cc4e3fbb9799bdcc3e22"},"cell_type":"code","source":"X = data_reduced.drop(['fraud'],axis=1)\ny = data['fraud']\nprint(X.head(),\"\\n\")\nprint(y.head())","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"474518cb2bb397a51477e826541fc33c62d2d492"},"cell_type":"code","source":"y[y==1].count()","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"37bca88900f65dd21a308beca5cc615fe4a40c31"},"cell_type":"markdown","source":"## Oversampling with SMOTE\n\nWe use SMOTE (Synthetic Minority Over-sampling Technique) [2] to balance the dataset. The resulting counts show that we now have an equal number of instances of each class (1 and 0)."},{"metadata":{"_uuid":"784e528eb3ea2ceb20af530a516911279d5df8f4","trusted":true},"cell_type":"code","source":"sm = SMOTE(random_state=42)\nX_res, y_res = sm.fit_resample(X, y)\ny_res = pd.DataFrame(y_res)\nprint(y_res[0].value_counts())","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"4cafd096d0bd1bbe528eb8d5f7d943d58fd277db"},"cell_type":"markdown","source":"I will do a train/test split to measure performance. I haven't done cross-validation since we have a lot of instances and I don't want to wait that long for training, but cross-validation is usually the better choice. Note that because SMOTE is applied before the split, synthetic samples can leak into the test set; oversampling only the training data would give a stricter evaluation."},{"metadata":{"trusted":true,"_uuid":"1c4f59c9cec206b7fa56e61a1202050f8966a94b"},"cell_type":"code","source":"# I won't do cross validation since we have a lot of instances\nX_train, X_test, y_train, y_test = train_test_split(X_res,y_res,test_size=0.3,random_state=42,shuffle=True,stratify=y_res)","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"47b6d299e0a3c08509066a19b2b225bf37acf0ad"},"cell_type":"markdown","source":"I will define a function for plotting the ROC curve and its AUC. 
It is a good visual way to see classification performance."},{"metadata":{"trusted":true,"_uuid":"c15c0a19c3fe5402109c9e54e58d0402c40107dd"},"cell_type":"code","source":"# %% Function for plotting the ROC curve\n\ndef plot_roc_auc(y_test, preds):\n '''\n Takes actual labels and predicted probabilities as input and plots the Receiver\n Operating Characteristic (ROC) curve\n '''\n fpr, tpr, threshold = roc_curve(y_test, preds)\n roc_auc = auc(fpr, tpr)\n plt.title('Receiver Operating Characteristic')\n plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)\n plt.legend(loc = 'lower right')\n plt.plot([0, 1], [0, 1],'r--')\n plt.xlim([0, 1])\n plt.ylim([0, 1])\n plt.ylabel('True Positive Rate')\n plt.xlabel('False Positive Rate')\n plt.show()","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"589632daf9c07f7903870ecbf183c2057e10d4be"},"cell_type":"markdown","source":"As I mentioned before, fraud datasets are imbalanced and most of the instances are non-fraudulent. Imagine that we always predicted non-fraudulent on this dataset: our accuracy would be almost 99%, and the same holds for most other fraud datasets, since the fraud percentage is very low. The accuracy is very high, but we are not detecting any fraud, so such a classifier is useless. A useful model must therefore at least beat the base accuracy of always predicting non-fraudulent."},{"metadata":{"trusted":true,"_uuid":"0bacd9ca1db82f754e402ce66b654df460d51f88"},"cell_type":"code","source":"# The base score should be better than always predicting non-fraudulent\nprint(\"Base accuracy score we must beat is: \", \n df_non_fraud.fraud.count()/ np.add(df_non_fraud.fraud.count(),df_fraud.fraud.count()) * 100)","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"03bf9d11a03798ff2411e93d3956d378719172d7"},"cell_type":"markdown","source":"## **K-Neighbours Classifier**"},{"metadata":{"trusted":true,"_uuid":"6d3a294403b1afbdbf18219ab0418a2e03e201e2"},"cell_type":"code","source":"# %% K-Nearest Neighbours\n\nknn = KNeighborsClassifier(n_neighbors=5,p=1)\n\nknn.fit(X_train,y_train)\ny_pred = knn.predict(X_test)\n\n\nprint(\"Classification Report for K-Nearest Neighbours: \\n\", classification_report(y_test, y_pred))\nprint(\"Confusion Matrix of K-Nearest Neighbours: \\n\", confusion_matrix(y_test,y_pred))\nplot_roc_auc(y_test, knn.predict_proba(X_test)[:,1])","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"7fed4c48fb1a0b6d73df873c8492847b03b069c4"},"cell_type":"markdown","source":"## **Random Forest Classifier**"},{"metadata":{"trusted":true,"_uuid":"b320aa8bea1fbf19ef198905ad7f7a365561ccdd"},"cell_type":"code","source":"# %% Random Forest Classifier\n\nrf_clf = RandomForestClassifier(n_estimators=100,max_depth=8,random_state=42,\n verbose=1,class_weight=\"balanced\")\n\nrf_clf.fit(X_train,y_train)\ny_pred = rf_clf.predict(X_test)\n\nprint(\"Classification Report for Random Forest Classifier: \\n\", classification_report(y_test, y_pred))\nprint(\"Confusion Matrix of Random Forest Classifier: \\n\", confusion_matrix(y_test,y_pred))\nplot_roc_auc(y_test, rf_clf.predict_proba(X_test)[:,1])","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"1d5898c4320b78015fd40d059e158f767399d354"},"cell_type":"markdown","source":"## XGBoost Classifier"},{"metadata":{"trusted":true,"_uuid":"1cbe2c476bcef6f4ed9523c891d0536077a6861a"},"cell_type":"code","source":"XGBoost_CLF = xgb.XGBClassifier(max_depth=6, learning_rate=0.05, n_estimators=400, \n objective=\"binary:hinge\", booster='gbtree', \n 
n_jobs=-1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, \n subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, \n scale_pos_weight=1, base_score=0.5, random_state=42, verbosity=1)\n\nXGBoost_CLF.fit(X_train,y_train)\n\ny_pred = XGBoost_CLF.predict(X_test)\n\nprint(\"Classification Report for XGBoost: \\n\", classification_report(y_test, y_pred))\nprint(\"Confusion Matrix of XGBoost: \\n\", confusion_matrix(y_test,y_pred))\nplot_roc_auc(y_test, XGBoost_CLF.predict_proba(X_test)[:,1])","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"ef0e8b46a02a3ba1e3b7df37a4b9279428569adb"},"cell_type":"markdown","source":"## Conclusion\n\nIn this kernel we performed fraud detection on bank payment data and achieved remarkable results with our classifiers. Since fraud datasets suffer from a class imbalance problem, we applied an oversampling technique called SMOTE and generated new minority-class examples. I haven't put the classification results without SMOTE here, but I added them to my GitHub repo earlier, so if you are interested in comparing both results you can check [my github repo](https://github.com/atavci/fraud-detection-on-banksim-data). \n\nThanks for taking the time to read (or just view the results of) my first kernel; I hope you enjoyed it. I would be grateful for any kind of critique, suggestion or comment, and I wish you a great day with lots of beautiful data!"},{"metadata":{"_uuid":"3d840bf8b52b3514a2ae86f17f2bed7235f4a927"},"cell_type":"markdown","source":"## Resources\n\n\n[[1]](#-1). Lavion, Didier; et al. [\"PwC's Global Economic Crime and Fraud Survey 2018\"](https://www.pwc.com/gx/en/forensics/global-economic-crime-and-fraud-survey-2018.pdf) (PDF). PwC.com. Retrieved 28 August 2018. \n\n[[2]](#2). Chawla, N. V.; et al. [\"SMOTE: Synthetic Minority Over-sampling Technique\"](https://jair.org/index.php/jair/article/view/10302). Journal of Artificial Intelligence Research 16 (2002), pp. 321-357."},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":1}
--------------------------------------------------------------------------------
/Fraud_Detection_bank_sim.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Mar 15 14:57:46 2019

@author: atavci
"""

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc

import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

# set seaborn style because it's prettier
sns.set()
# %% read and plot
data = pd.read_csv("Data/synthetic-data-from-a-financial-payment-system/bs140513_032310.csv")

print(data.head(5))

# Create two dataframes with fraud and non-fraud data
df_fraud = data.loc[data.fraud == 1]
df_non_fraud = data.loc[data.fraud == 0]


sns.countplot(x="fraud", data=data)
plt.title("Count of Fraudulent Payments")
plt.show()
print("Number of normal examples: ", df_non_fraud.fraud.count())
print("Number of fraudulent examples: ", df_fraud.fraud.count())
# print(data.fraud.value_counts())  # does the same thing as above

print("Mean feature values per category", data.groupby('category')[['amount', 'fraud']].mean())

print("Columns: ", data.columns)



# Plot histograms of the amounts in fraud and non-fraud data
plt.hist(df_fraud.amount, alpha=0.5, label='fraud', bins=100)
plt.hist(df_non_fraud.amount, alpha=0.5, label='nonfraud', bins=100)
plt.title("Histogram for fraud and nonfraud payments")
plt.ylim(0, 10000)
plt.xlim(0, 1000)
plt.legend()
plt.show()

# %% Preprocessing
print(data.zipcodeOri.nunique())
print(data.zipMerchant.nunique())

# dropping zipcodeOri and zipMerchant since they have only one unique value
data_reduced = data.drop(['zipcodeOri', 'zipMerchant'], axis=1)

print(data_reduced.columns)

# turn object columns into the categorical dtype for the encoding step below
col_categorical = data_reduced.select_dtypes(include=['object']).columns
for col in col_categorical:
    data_reduced[col] = data_reduced[col].astype('category')

# It is usually better to turn the categorical values (customer, merchant and category)
# into dummies because the codes imply no ordering (i.e. 5 > 4 means nothing here), but
# since there are so many unique values the feature matrix would grow huge and training
# would take forever. Here is the code for turning them into dummies if you want to try:
# data_reduced.loc[:,['customer','merchant','category']].astype('category')
# data_dum = pd.get_dummies(data_reduced.loc[:,['customer','merchant','category','gender']], drop_first=True)  # dummies
# print(data_dum.info())

# categorical values ==> numeric values
data_reduced[col_categorical] = data_reduced[col_categorical].apply(lambda x: x.cat.codes)
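# Note: cat.codes assigns arbitrary integer codes, imposing an order that does not
# really exist; tree-based models cope with this reasonably well, but distance-based
# models such as KNN treat the codes as magnitudes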

# define X and y
X = data_reduced.drop(['fraud'], axis=1)
y = data['fraud']


# I won't do cross validation since we have a lot of instances
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, shuffle=True, stratify=y)
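# stratify=y keeps the fraud/non-fraud ratio identical in the train and test sets,
# which matters here because the positive class is so rare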

# %% Function for plotting the ROC curve

def plot_roc_auc(y_test, preds):
    '''
    Takes actual labels and predicted probabilities as input and plots the
    Receiver Operating Characteristic (ROC) curve
    '''
    fpr, tpr, threshold = roc_curve(y_test, preds)
    roc_auc = auc(fpr, tpr)
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

# The base accuracy score should be better than always predicting non-fraudulent
print("Base accuracy score we must beat is: ",
      df_non_fraud.fraud.count() / np.add(df_non_fraud.fraud.count(), df_fraud.fraud.count()) * 100)


# %% K-Nearest Neighbours

knn = KNeighborsClassifier(n_neighbors=5, p=1)
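# p=1 uses the Manhattan (L1) distance instead of the default Euclidean (p=2)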

knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# High precision on fraudulent examples, almost perfect score on non-fraudulent examples
print("Classification Report for K-Nearest Neighbours: \n", classification_report(y_test, y_pred))
print("Confusion Matrix of K-Nearest Neighbours: \n", confusion_matrix(y_test, y_pred))
plot_roc_auc(y_test, knn.predict_proba(X_test)[:, 1])

# %% Random Forest Classifier

rf_clf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42,
                                verbose=1, class_weight="balanced")

rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)

# 98 % recall on fraudulent examples but low (24 %) precision
print("Classification Report for Random Forest Classifier: \n", classification_report(y_test, y_pred))
print("Confusion Matrix of Random Forest Classifier: \n", confusion_matrix(y_test, y_pred))
plot_roc_auc(y_test, rf_clf.predict_proba(X_test)[:, 1])

# %% XG-Boost
XGBoost_CLF = xgb.XGBClassifier(max_depth=6, learning_rate=0.05, n_estimators=400,
                                objective="binary:hinge", booster='gbtree',
                                n_jobs=-1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0,
                                subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
                                scale_pos_weight=1, base_score=0.5, random_state=42, verbosity=1)
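# Note: objective="binary:hinge" makes the booster output hard 0/1 scores rather than
# calibrated probabilities, so the predict_proba values fed to plot_roc_auc below are
# much coarser than they would be with the default "binary:logistic"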

XGBoost_CLF.fit(X_train, y_train)

y_pred = XGBoost_CLF.predict(X_test)

# relatively high precision and recall for the fraudulent class
print("Classification Report for XGBoost: \n", classification_report(y_test, y_pred))  # Accuracy for XGBoost: 0.9963059088641371
print("Confusion Matrix of XGBoost: \n", confusion_matrix(y_test, y_pred))
plot_roc_auc(y_test, XGBoost_CLF.predict_proba(X_test)[:, 1])

# %% Ensemble

estimators = [("KNN", knn), ("rf", rf_clf), ("xgb", XGBoost_CLF)]
ens = VotingClassifier(estimators=estimators, voting="soft", weights=[1, 4, 1])
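# voting="soft" averages the predicted class probabilities; the Random Forest gets 4x
# weight so its high recall dominates while KNN and XGBoost temper its false positives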

ens.fit(X_train, y_train)
y_pred = ens.predict(X_test)


# Combines the Random Forest model's recall with the other models' precision, so this
# ensemble achieves a high recall with fewer false alarms (false positives)
print("Classification Report for Ensembled Models: \n", classification_report(y_test, y_pred))
print("Confusion Matrix of Ensembled Models: \n", confusion_matrix(y_test, y_pred))
plot_roc_auc(y_test, ens.predict_proba(X_test)[:, 1])

--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
# Fraud Detection with Machine Learning On Banksim Data

Fraudulent behavior can be seen across many different fields such as e-commerce, healthcare, payment and banking systems. Fraud costs businesses millions of dollars each year.

Automated detection of fraudulent behavior can be done in various ways, including rule-based approaches and machine learning.
This repository uses the latter approach for the classification of fraudulent transactions.

For the in-depth analysis you can check out my notebook on Kaggle below, or run the ipynb file in your Jupyter environment.
https://www.kaggle.com/code/turkayavci/fraud-detection-on-bank-payments/notebook

The synthetically generated dataset consists of payments from various customers, made in different time periods and with different amounts. For more information on the dataset, see the link under the "Dataset" heading and the original paper under the "Original paper" heading below.

For those who do not wish to run the script to acquire the results, here is a quick recap of the classification results of the machine learning models used in the script:

!Update Note: These results are without the oversampling technique SMOTE. I have also added a Jupyter notebook with more insights that uses SMOTE to balance the dataset. The overall results look much better; check the file Fraud Detection on Bank Payments.ipynb inside the repo.
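For reference, here is a minimal sketch of that balancing step as the notebook applies it (assuming the imbalanced-learn package is installed; `X` and `y` are the feature matrix and target built in Fraud_Detection_bank_sim.py):

```python
# Balance the classes with SMOTE, then split, mirroring the notebook
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)  # synthetic minority samples added until 50/50

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.3, random_state=42, shuffle=True, stratify=y_res)
```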


Classification Report for K-Nearest Neighbours (1: fraudulent, 0: non-fraudulent):

| class | precision | recall | f1-score | support |
| ----- | --------- | ------ | -------- | ------- |
| 0     | 1.00      | 1.00   | 1.00     | 176233  |
| 1     | 0.83      | 0.61   | 0.70     | 2160    |

Confusion Matrix of K-Nearest Neighbours:

[175962 271]

[ 845 1315]
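
As a sanity check, the precision, recall and f1-score columns follow directly from each confusion matrix. For the fraudulent class of K-Nearest Neighbours above:

```python
# Derive the fraud-class metrics from the KNN confusion matrix
# (rows are actual [non-fraud, fraud], columns are predicted [non-fraud, fraud])
tp, fp, fn = 1315, 271, 845

precision = tp / (tp + fp)                          # 1315/1586 = 0.83
recall = tp / (tp + fn)                             # 1315/2160 = 0.61
f1 = 2 * precision * recall / (precision + recall)  # = 0.70
```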



Classification Report for XGBoost:

| class | precision | recall | f1-score | support |
| ----- | --------- | ------ | -------- | ------- |
| 0     | 1.00      | 1.00   | 1.00     | 176233  |
| 1     | 0.89      | 0.76   | 0.82     | 2160    |


Confusion Matrix of XGBoost:

[176029 204]

[ 529 1631]




Classification Report for Random Forest Classifier:

| class | precision | recall | f1-score | support |
| ----- | --------- | ------ | -------- | ------- |
| 0     | 1.00      | 0.96   | 0.98     | 176233  |
| 1     | 0.24      | 0.98   | 0.39     | 2160    |


Confusion Matrix of Random Forest Classifier:

[169552 6681]

[ 39 2121]




Classification Report for Ensembled Models (RandomForest + KNN + XGBoost):

| class | precision | recall | f1-score | support |
| ----- | --------- | ------ | -------- | ------- |
| 0     | 1.00      | 1.00   | 1.00     | 176233  |
| 1     | 0.73      | 0.81   | 0.77     | 2160    |


Confusion Matrix of Ensembled Models:

[175604 629]

[ 417 1743]


## Dataset
https://www.kaggle.com/ntnu-testimon/banksim1

## Original paper

Lopez-Rojas, Edgar Alonso; Axelsson, Stefan: "BankSim: A Bank Payments Simulator for Fraud Detection Research". In: 26th European Modeling and Simulation Symposium, EMSS 2014, Bordeaux, France, pp. 144-152, DIME University of Genoa, 2014. ISBN: 9788897999324. https://www.researchgate.net/publication/265736405_BankSim_A_Bank_Payment_Simulation_for_Fraud_Detection_Research

--------------------------------------------------------------------------------