├── .gitignore ├── README.md └── intro_to_fraud_detection_python.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *.csv -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## **Fraud Detection in Python** 2 | 3 | https://www.datacamp.com/courses/fraud-detection-in-python 4 | 5 | Charlotte Werger, 6 | Director of Advanced Analytics at Nike 7 | 8 | **Course Description** 9 | 10 | A typical organization loses an estimated 5% of its yearly revenue to fraud. In this course, learn to fight fraud by using data. Apply supervised learning algorithms to detect fraudulent behavior based upon past fraud, and use unsupervised learning methods to discover new types of fraud activities. 11 | 12 | Fraudulent transactions are rare compared to the norm. As such, learn to properly classify imbalanced datasets. 13 | 14 | The course provides technical and theoretical insights and demonstrates how to implement fraud detection models. Finally, get tips and advice from real-life experience to help prevent common mistakes in fraud analytics. 15 | -------------------------------------------------------------------------------- /intro_to_fraud_detection_python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## **Fraud Detection in Python**\n", 8 | "\n", 9 | "https://www.datacamp.com/courses/fraud-detection-in-python\n", 10 | "\n", 11 | "Charlotte Werger,\n", 12 | "Director of Advanced Analytics at Nike\n", 13 | "\n", 14 | "**Course Description**\n", 15 | "\n", 16 | "A typical organization loses an estimated 5% of its yearly revenue to fraud. In this course, learn to fight fraud by using data. Apply supervised learning algorithms to detect fraudulent behavior based upon past fraud, and use unsupervised learning methods to discover new types of fraud activities. \n", 17 | "\n", 18 | "Fraudulent transactions are rare compared to the norm. As such, learn to properly classify imbalanced datasets.\n", 19 | "\n", 20 | "The course provides technical and theoretical insights and demonstrates how to implement fraud detection models. 
Finally, get tips and advice from real-life experience to help prevent common mistakes in fraud analytics.\n", 21 | "\n", 22 | "* Most data files for the exercises can be found on the [course site](https://www.datacamp.com/courses/fraud-detection-in-python)" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "**Imports**" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "import warnings\n", 39 | "warnings.filterwarnings('ignore')\n", 40 | "warnings.simplefilter('ignore')" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "import pandas as pd\n", 50 | "import matplotlib.pyplot as plt\n", 51 | "from matplotlib.patches import Rectangle\n", 52 | "import numpy as np\n", 53 | "from pprint import pprint as pp\n", 54 | "import csv\n", 55 | "from pathlib import Path\n", 56 | "from imblearn.over_sampling import SMOTE\n", 57 | "from imblearn.pipeline import Pipeline \n", 58 | "from sklearn.linear_model import LinearRegression, LogisticRegression\n", 59 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 60 | "from sklearn.tree import DecisionTreeClassifier\n", 61 | "from sklearn.metrics import r2_score, classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve, precision_recall_curve, average_precision_score\n", 62 | "from sklearn.metrics import homogeneity_score, silhouette_score\n", 63 | "from sklearn.ensemble import RandomForestClassifier, VotingClassifier\n", 64 | "from sklearn.preprocessing import MinMaxScaler\n", 65 | "from sklearn.cluster import MiniBatchKMeans, DBSCAN\n", 66 | "import seaborn as sns\n", 67 | "from itertools import product\n", 68 | "import nltk\n", 69 | "from nltk.corpus import stopwords\n", 70 | "from nltk.stem.wordnet import WordNetLemmatizer\n", 71 | "import string\n", 72 | "import gensim\n", 73 | "from gensim import corpora\n", 74 | "import pyLDAvis.gensim" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "**Pandas Configuration Options**" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "pd.set_option('max_columns', 200)\n", 91 | "pd.set_option('max_rows', 300)\n", 92 | "pd.set_option('display.expand_frame_repr', True)" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "**Data File Objects**" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "data = Path.cwd() / 'data' / 'fraud_detection'\n", 109 | "\n", 110 | "ch1 = data / 'chapter_1'\n", 111 | "cc1_file = ch1 / 'creditcard_sampledata.csv'\n", 112 | "cc3_file = ch1 / 'creditcard_sampledata_3.csv'\n", 113 | "\n", 114 | "ch2 = data / 'chapter_2'\n", 115 | "cc2_file = ch2 / 'creditcard_sampledata_2.csv'\n", 116 | "\n", 117 | "ch3 = data / 'chapter_3'\n", 118 | "banksim_file = ch3 / 'banksim.csv'\n", 119 | "banksim_adj_file = ch3 / 'banksim_adj.csv'\n", 120 | "db_full_file = ch3 / 'db_full.pickle'\n", 121 | "labels_file = ch3 / 'labels.pickle'\n", 122 | "labels_full_file = ch3 / 'labels_full.pickle'\n", 123 | "x_scaled_file = ch3 / 'x_scaled.pickle'\n", 124 | "x_scaled_full_file = ch3 / 'x_scaled_full.pickle'\n", 125 | "\n", 126 | "ch4 = data / 'chapter_4'\n", 127 | 
"enron_emails_clean_file = ch4 / 'enron_emails_clean.csv'\n", 128 | "cleantext_file = ch4 / 'cleantext.pickle'\n", 129 | "corpus_file = ch4 / 'corpus.pickle'\n", 130 | "dict_file = ch4 / 'dict.pickle'\n", 131 | "ldamodel_file = ch4 / 'ldamodel.pickle'" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "# Introduction and preparing your data\n", 139 | "\n", 140 | "Learn about the typical challenges associated with fraud detection. Learn how to resample data in a smart way, and tackle problems with imbalanced data." 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "## Introduction to fraud detection\n", 148 | "\n", 149 | "* Types:\n", 150 | " * Insurance\n", 151 | " * Credit card\n", 152 | " * Identity theft\n", 153 | " * Money laundering\n", 154 | " * Tax evasion\n", 155 | " * Healthcare\n", 156 | " * Product warranty\n", 157 | "* e-commerce businesses must continuously assess the legitimacy of client transactions\n", 158 | "* Detecting fraud is challenging:\n", 159 | " * Uncommon; < 0.01% of transactions\n", 160 | " * Attempts are made to conceal fraud\n", 161 | " * Behavior evolves\n", 162 | " * Fraudulent activities perpetrated by networks - organized crime\n", 163 | "* Fraud detection requires training an algorithm to identify concealed observations from any normal observations\n", 164 | "* Fraud analytics teams:\n", 165 | " * Often use rules based systems, based on manually set thresholds and experience\n", 166 | " * Check the news\n", 167 | " * Receive external lists of fraudulent accounts and names\n", 168 | " * suspicious names or track an external hit list from police to reference check against the client base\n", 169 | " * Sometimes use machine learning algorithms to detect fraud or suspicious behavior\n", 170 | " * Existing sources can be used as inputs into the ML model\n", 171 | " * Verify the veracity of rules based labels" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "### Checking the fraud to non-fraud ratio\n", 179 | "\n", 180 | "In this chapter, you will work on `creditcard_sampledata.csv`, a dataset containing credit card transactions data. Fraud occurrences are fortunately an **extreme minority** in these transactions.\n", 181 | "\n", 182 | "However, Machine Learning algorithms usually work best when the different classes contained in the dataset are more or less equally present. If there are few cases of fraud, then there's little data to learn how to identify them. This is known as **class imbalance**, and it's one of the main challenges of fraud detection.\n", 183 | "\n", 184 | "Let's explore this dataset, and observe this class imbalance problem.\n", 185 | "\n", 186 | "**Instructions**\n", 187 | "\n", 188 | "* `import pandas as pd`, read the credit card data in and assign it to `df`. This has been done for you.\n", 189 | "* Use `.info()` to print information about `df`.\n", 190 | "* Use `.value_counts()` to get the count of fraudulent and non-fraudulent transactions in the `'Class'` column. Assign the result to `occ`.\n", 191 | "* Get the ratio of fraudulent transactions over the total number of transactions in the dataset." 
192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "df = pd.read_csv(cc3_file)" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "#### Explore the features available in your dataframe" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "df.info()" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "df.head()" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "# Count the occurrences of fraud and no fraud and print them\n", 235 | "occ = df['Class'].value_counts()\n", 236 | "occ" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "# Print the ratio of fraud cases\n", 246 | "ratio_cases = occ/len(df.index)\n", 247 | "print(f'Ratio of fraudulent cases: {ratio_cases[1]}\\nRatio of non-fraudulent cases: {ratio_cases[0]}')" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "**The ratio of fraudulent transactions is very low. This is a case of the class imbalance problem, and you're going to learn how to deal with it in the next exercises.**" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Data visualization\n", 262 | "\n", 263 | "From the previous exercise we know that the ratio of fraud to non-fraud observations is very low. You can do something about that, for example by **re-sampling** our data, which is explained in the next video.\n", 264 | "\n", 265 | "In this exercise, you'll look at the data and **visualize the fraud to non-fraud ratio**. It is always a good starting point in your fraud analysis to look at your data first, before you make any changes to it.\n", 266 | "\n", 267 | "Moreover, when talking to your colleagues, a picture often makes it very clear that we're dealing with heavily imbalanced data. Let's create a plot to visualize the ratio of fraud to non-fraud data points in the dataset `df`.\n", 268 | "\n", 269 | "The function `prep_data()` is already loaded in your workspace, as well as `matplotlib.pyplot as plt`.\n", 270 | "\n", 271 | "**Instructions**\n", 272 | "\n", 273 | "* Define the `plot_data(X, y)` function that will nicely plot the given feature set `X` with labels `y` in a scatter plot. This has been done for you.\n", 274 | "* Use the function `prep_data()` on your dataset `df` to create feature set `X` and labels `y`.\n", 275 | "* Run the function `plot_data()` on your newly obtained `X` and `y` to visualize your results.",
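"\n", "*A complementary view (an addition to these notes, not part of the course exercise): a bar chart of the class counts makes the imbalance just as obvious; `seaborn` is already imported as `sns`.*\n", "\n", "```python\n", "# Bar chart of class frequencies; the fraud bar is barely visible\n", "sns.countplot(x='Class', data=df)\n", "plt.show()\n", "```"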
276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "#### def prep_data" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "def prep_data(df: pd.DataFrame) -> (np.ndarray, np.ndarray):\n", 292 | " \"\"\"\n", 293 | " Convert the DataFrame into two variables\n", 294 | " X: data columns (V1 - V28)\n", 295 | " y: label column\n", 296 | " \"\"\"\n", 297 | " X = df.iloc[:, 2:30].values\n", 298 | " y = df.Class.values\n", 299 | " return X, y" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "#### def plot_data" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": null, 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [ 315 | "# Define a function to create a scatter plot of our data and labels\n", 316 | "def plot_data(X: np.ndarray, y: np.ndarray):\n", 317 | " plt.scatter(X[y == 0, 0], X[y == 0, 1], label=\"Class #0\", alpha=0.5, linewidth=0.15)\n", 318 | " plt.scatter(X[y == 1, 0], X[y == 1, 1], label=\"Class #1\", alpha=0.5, linewidth=0.15, c='r')\n", 319 | " plt.legend()\n", 320 | " return plt.show()" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "# Create X and y from the prep_data function \n", 330 | "X, y = prep_data(df)" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "# Plot our data by running our plot data function on X and y\n", 340 | "plot_data(X, y)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "**By visualizing the data, you can immediately see how our fraud cases are scattered over our data, and how few cases we have. A picture often makes the imbalance problem clear. 
In the next exercises we'll visually explore how to improve our fraud to non-fraud balance.**" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "#### Reproduced using the DataFrame" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "plt.scatter(df.V2[df.Class == 0], df.V3[df.Class == 0], label=\"Class #0\", alpha=0.5, linewidth=0.15)\n", 364 | "plt.scatter(df.V2[df.Class == 1], df.V3[df.Class == 1], label=\"Class #1\", alpha=0.5, linewidth=0.15, c='r')\n", 365 | "plt.legend()\n", 366 | "plt.show()" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "## Increase successful detections with data resampling\n", 374 | "\n", 375 | "* resampling can help model performance in cases of imbalanced data sets" 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "#### Undersampling\n", 383 | "\n", 384 | "* ![undersampling](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/undersampling.JPG)\n", 385 | "* Undersampling the majority class (non-fraud cases)\n", 386 | " * Straightforward method to adjust imbalanced data\n", 387 | " * Take random draws from the non-fraud observations, to match the occurrences of fraud observations (as shown in the picture)" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": {}, 393 | "source": [ 394 | "#### Oversampling\n", 395 | "\n", 396 | "* ![oversampling](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/oversampling.JPG)\n", 397 | "* Oversampling the minority class (fraud cases)\n", 398 | " * Take random draws from the fraud cases and copy those observations to increase the amount of fraud samples\n", 399 | "* Both methods lead to having a balance between fraud and non-fraud cases\n", 400 | "* Drawbacks\n", 401 | " * with random undersampling, a lot of information is thrown away\n", 402 | " * with oversampling, the model will be trained on a lot of duplicates" 403 | ] 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": {}, 408 | "source": [ 409 | "#### Implement resampling methods using the Python imblearn module\n", 410 | "\n", 411 | "* compatible with scikit-learn\n", 412 | "\n", 413 | "```python\n", 414 | "from imblearn.over_sampling import RandomOverSampler\n", 415 | "\n", 416 | "method = RandomOverSampler()\n", 417 | "X_resampled, y_resampled = method.fit_sample(X, y)\n", 418 | "\n", 419 | "compare_plots(X_resampled, y_resampled, X, y)\n", 420 | "```\n", 421 | "\n", 422 | "![oversampling plot](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/oversampling_plot.JPG)\n", 423 | "* The darker blue points indicate duplicated observations plotted on top of one another" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "#### SMOTE\n", 431 | "\n", 432 | "* ![smote](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/smote.JPG)\n", 433 | "* Synthetic Minority Oversampling Technique (SMOTE)\n", 434 | " * [Resampling strategies for Imbalanced Data Sets](https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets)\n", 435 | " * Another way of adjusting the imbalance by oversampling minority observations\n", 436 | " * SMOTE uses characteristics of nearest neighbors of fraud cases to create new synthetic fraud cases\n", 437 | " * avoids duplicating observations\n",
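"\n", "*Side note (an addition to these course notes): newer `imblearn` releases renamed `fit_sample()` to `fit_resample()` and replaced `SMOTE`'s `kind` argument with separate classes such as `BorderlineSMOTE`. A minimal sketch of the newer API:*\n", "\n", "```python\n", "from imblearn.over_sampling import SMOTE, BorderlineSMOTE\n", "\n", "# Regular SMOTE; BorderlineSMOTE() replaces SMOTE(kind='borderline1')\n", "method = SMOTE()\n", "X_resampled, y_resampled = method.fit_resample(X, y)\n", "```"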
437 | " * avoids duplicating observations" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "#### Determining the best resampling method is situational\n", 445 | "\n", 446 | "* Random Undersampling (RUS):\n", 447 | " * If there is a lot of data and many minority cases, then undersampling may be computationally more convenient\n", 448 | " * In most cases, throwing away data is not desirable\n", 449 | "* Random Oversampling (ROS):\n", 450 | " * Straightforward\n", 451 | " * Training the model on many duplicates\n", 452 | "* SMOTE:\n", 453 | " * more sophisticated\n", 454 | " * realistic data set\n", 455 | " * training on synthetic data\n", 456 | " * only works well if the minority case features are similar\n", 457 | " * **if fraud is spread through the data and not distinct, using nearest neighbors to create more fraud cases, introduces noise into the data, as the nearest neighbors might not be fraud cases**" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "#### When to use resmapling methods\n", 465 | "\n", 466 | "* Use resampling methods on the training set, not on the test set\n", 467 | "* The goal is to produce a better model by providing balanced data\n", 468 | " * The goal is not to predict the synthetic samples\n", 469 | "* Test data should be free of duplicates and synthetic data\n", 470 | "* Only test the model on real data\n", 471 | " * First, spit the data into train and test sets\n", 472 | " \n", 473 | "```python\n", 474 | "# Define resampling method and split into train and test\n", 475 | "method = SMOTE(kind='borderline1')\n", 476 | "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)\n", 477 | "\n", 478 | "# Apply resampling to the training data only\n", 479 | "X_resampled, y_resampled = method.fit_sample(X_train, y_train)\n", 480 | "\n", 481 | "# Continue fitting the model and obtain predictions\n", 482 | "model = LogisticRegression()\n", 483 | "model.fit(X_resampled, y_resampled)\n", 484 | "\n", 485 | "# Get model performance metrics\n", 486 | "predicted = model.predict(X_test)\n", 487 | "print(classification_report(y_test, predicted))\n", 488 | "```" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "metadata": {}, 494 | "source": [ 495 | "### Resampling methods for imbalanced data\n", 496 | "\n", 497 | "Which of these methods takes a random subsample of your majority class to account for class \"imbalancedness\"?\n", 498 | "\n", 499 | "**Possible Answers**\n", 500 | "\n", 501 | "* ~~Random Over Sampling (ROS)~~\n", 502 | "* **Random Under Sampling (RUS)**\n", 503 | "* ~~Synthetic Minority Over-sampling Technique (SMOTE)~~\n", 504 | "* ~~None of the above~~\n", 505 | "\n", 506 | "**By using ROS and SMOTE you add more examples to the minority class. RUS adjusts the balance of your data by reducing the majority class.**" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": {}, 512 | "source": [ 513 | "### Applying Synthetic Minority Oversampling Technique (SMOTE)\n", 514 | "\n", 515 | "In this exercise, you're going to re-balance our data using the **Synthetic Minority Over-sampling Technique** (SMOTE). Unlike ROS, SMOTE does not create exact copies of observations, but **creates new, synthetic, samples** that are quite similar to the existing observations in the minority class. SMOTE is therefore slightly more sophisticated than just copying observations, so let's apply SMOTE to our credit card data. 
The dataset `df` is available and the packages you need for SMOTE are imported. In the following exercise, you'll visualize the result and compare it to the original data, such that you can see the effect of applying SMOTE very clearly.\n", 516 | "\n", 517 | "**Instructions**\n", 518 | "\n", 519 | "* Use the `prep_data` function on `df` to create features `X` and labels `y`.\n", 520 | "* Define the resampling method as SMOTE of the regular kind, under the variable `method`.\n", 521 | "* Use `.fit_sample()` on the original `X` and `y` to obtain newly resampled data.\n", 522 | "* Plot the resampled data using the `plot_data()` function." 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": null, 528 | "metadata": {}, 529 | "outputs": [], 530 | "source": [ 531 | "# Run the prep_data function\n", 532 | "X, y = prep_data(df)" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": null, 538 | "metadata": {}, 539 | "outputs": [], 540 | "source": [ 541 | "print(f'X shape: {X.shape}\\ny shape: {y.shape}')" 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": null, 547 | "metadata": {}, 548 | "outputs": [], 549 | "source": [ 550 | "# Define the resampling method\n", 551 | "method = SMOTE()" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": {}, 558 | "outputs": [], 559 | "source": [ 560 | "# Create the resampled feature set\n", 561 | "X_resampled, y_resampled = method.fit_sample(X, y)" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": null, 567 | "metadata": {}, 568 | "outputs": [], 569 | "source": [ 570 | "# Plot the resampled data\n", 571 | "plot_data(X_resampled, y_resampled)" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "**The minority class is now much more prominently visible in our data. To see the results of SMOTE even better, we'll compare it to the original data in the next exercise.**" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "### Compare SMOTE to original data\n", 586 | "\n", 587 | "In the last exercise, you saw that using SMOTE suddenly gives us more observations of the minority class. Let's compare those results to our original data, to get a good feeling for what has actually happened. Let's have another look at the value counts of our old and new data, and plot the two scatter plots of the data side by side. You'll use the function compare_plot() for that, which takes the following arguments: `X`, `y`, `X_resampled`, `y_resampled`, `method=''`. The function plots your original data in a scatter plot, along with the resampled data side by side.\n", 588 | "\n", 589 | "**Instructions**\n", 590 | "\n", 591 | "* Print the value counts of our original labels, `y`. Be mindful that `y` is currently a Numpy array, so in order to use value counts, we'll assign `y` back as a pandas Series object.\n", 592 | "* Repeat the step and print the value counts on `y_resampled`. This shows you how the balance between the two classes has changed with SMOTE.\n", 593 | "* Use the `compare_plot()` function called on our original data as well as our resampled data to see the scatterplots side by side.",
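"\n", "*A quick alternative (an addition to these notes, not part of the exercise): `collections.Counter` gives the same class counts without converting to a pandas Series.*\n", "\n", "```python\n", "from collections import Counter\n", "\n", "# Class counts before and after resampling\n", "print(Counter(y))\n", "print(Counter(y_resampled))\n", "```"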
594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": null, 599 | "metadata": {}, 600 | "outputs": [], 601 | "source": [ 602 | "pd.value_counts(pd.Series(y))" 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": null, 608 | "metadata": {}, 609 | "outputs": [], 610 | "source": [ 611 | "pd.value_counts(pd.Series(y_resampled))" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": {}, 617 | "source": [ 618 | "#### def compare_plot" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": null, 624 | "metadata": {}, 625 | "outputs": [], 626 | "source": [ 627 | "def compare_plot(X: np.ndarray, y: np.ndarray, X_resampled: np.ndarray, y_resampled: np.ndarray, method: str):\n", 628 | " plt.subplot(1, 2, 1)\n", 629 | " plt.scatter(X[y == 0, 0], X[y == 0, 1], label=\"Class #0\", alpha=0.5, linewidth=0.15)\n", 630 | " plt.scatter(X[y == 1, 0], X[y == 1, 1], label=\"Class #1\", alpha=0.5, linewidth=0.15, c='r')\n", 631 | " plt.title('Original Set')\n", 632 | " plt.subplot(1, 2, 2)\n", 633 | " plt.scatter(X_resampled[y_resampled == 0, 0], X_resampled[y_resampled == 0, 1], label=\"Class #0\", alpha=0.5, linewidth=0.15)\n", 634 | " plt.scatter(X_resampled[y_resampled == 1, 0], X_resampled[y_resampled == 1, 1], label=\"Class #1\", alpha=0.5, linewidth=0.15, c='r')\n", 635 | " plt.title(method)\n", 636 | " plt.legend()\n", 637 | " plt.show()" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": {}, 644 | "outputs": [], 645 | "source": [ 646 | "compare_plot(X, y, X_resampled, y_resampled, method='SMOTE')" 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": {}, 652 | "source": [ 653 | "**It should by now be clear that SMOTE has balanced our data completely, and that the minority class is now equal in size to the majority class. Visualizing the data shows the effect on the data very clearly. The next exercise will demonstrate multiple ways to implement SMOTE and that each method will have a slightly different effect.**" 654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": {}, 659 | "source": [ 660 | "## Fraud detection algorithms in action" 661 | ] 662 | }, 663 | { 664 | "cell_type": "markdown", 665 | "metadata": {}, 666 | "source": [ 667 | "#### Rules Based Systems\n", 668 | "\n", 669 | "* ![rules based](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/rules_based.JPG)\n", 670 | "* Might block transactions from risky zip codes\n", 671 | "* Block transactions from cards used too frequently (e.g. last 30 minutes)\n", 672 | "* Can catch fraud, but also generates false alarms (false positives)\n", 673 | "* Limitations:\n", 674 | " * Fixed threshold per rule and it's difficult to determine the threshold; they don't adapt over time\n", 675 | " * Limited to yes / no outcomes, whereas ML yields a probability\n", 676 | " * probability allows for fine-tuning the outcomes (i.e. rate of occurrences of false positives and false negatives)\n", 677 | " * Fails to capture interaction between features\n", 678 | " * Ex. 
Size of the transaction only matters in combination with the frequency" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "#### ML Based Systems\n", 686 | "\n", 687 | "* Adapt to the data, thus can change over time\n", 688 | "* Uses all the data combined, rather than a threshold per feature\n", 689 | "* Produces a probability, rather than a binary score\n", 690 | "* Typically have better performance and can be combined with rules" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": null, 696 | "metadata": {}, 697 | "outputs": [], 698 | "source": [ 699 | "# Step 1: split the features and labels into train and test data\n", 700 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" 701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": null, 706 | "metadata": {}, 707 | "outputs": [], 708 | "source": [ 709 | "# Step 2: Define which model to use\n", 710 | "model = LinearRegression()" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": null, 716 | "metadata": {}, 717 | "outputs": [], 718 | "source": [ 719 | "# Step 3: Fit the model to the training data\n", 720 | "model.fit(X_train, y_train)" 721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "execution_count": null, 726 | "metadata": {}, 727 | "outputs": [], 728 | "source": [ 729 | "# Step 4: Obtain model predictions from the test data\n", 730 | "y_predicted = model.predict(X_test)" 731 | ] 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": null, 736 | "metadata": {}, 737 | "outputs": [], 738 | "source": [ 739 | "# Step 5: Compare y_test to predictions and obtain performance metrics (r^2 score)\n", 740 | "r2_score(y_test, y_predicted)" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "metadata": {}, 746 | "source": [ 747 | "### Exploring the traditional method of fraud detection\n", 748 | "\n", 749 | "In this exercise you're going to try finding fraud cases in our credit card dataset the *\"old way\"*. First you'll define threshold values using common statistics, to split fraud and non-fraud. Then, use those thresholds on your features to detect fraud. This is common practice within fraud analytics teams.\n", 750 | "\n", 751 | "Statistical thresholds are often determined by looking at the **mean** values of observations. Let's start this exercise by checking whether feature **means differ between fraud and non-fraud cases**. Then, you'll use that information to create common sense thresholds. Finally, you'll check how well this performs in fraud detection.\n", 752 | "\n", 753 | "`pandas` has already been imported as `pd`.\n", 754 | "\n", 755 | "**Instructions**\n", 756 | "\n", 757 | "* Use `groupby()` to group `df` on `Class` and obtain the mean of the features.\n", 758 | "* Create the condition `V1` smaller than -3, and `V3` smaller than -5 as a condition to flag fraud cases.\n", 759 | "* As a measure of performance, use the `crosstab` function from `pandas` to compare our flagged fraud cases to actual fraud cases." 
760 | ] 761 | }, 762 | { 763 | "cell_type": "code", 764 | "execution_count": null, 765 | "metadata": {}, 766 | "outputs": [], 767 | "source": [ 768 | "df.drop(['Unnamed: 0'], axis=1, inplace=True)" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": null, 774 | "metadata": {}, 775 | "outputs": [], 776 | "source": [ 777 | "df.groupby('Class').mean()" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "metadata": {}, 784 | "outputs": [], 785 | "source": [ 786 | "df['flag_as_fraud'] = np.where(np.logical_and(df.V1 < -3, df.V3 < -5), 1, 0)" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": null, 792 | "metadata": {}, 793 | "outputs": [], 794 | "source": [ 795 | "pd.crosstab(df.Class, df.flag_as_fraud, rownames=['Actual Fraud'], colnames=['Flagged Fraud'])" 796 | ] 797 | }, 798 | { 799 | "cell_type": "markdown", 800 | "metadata": {}, 801 | "source": [ 802 | "**With this rule, 22 out of 50 fraud cases are detected, 28 are not detected, and 16 false positives are identified.**" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": {}, 808 | "source": [ 809 | "### Using ML classification to catch fraud\n", 810 | "\n", 811 | "In this exercise you'll see what happens when you use a simple machine learning model on our credit card data instead.\n", 812 | "\n", 813 | "Do you think you can beat those results? Remember, you've predicted *22 out of 50* fraud cases, and had *16 false positives*.\n", 814 | "\n", 815 | "So with that in mind, let's implement a **Logistic Regression** model. If you have taken the class on supervised learning in Python, you should be familiar with this model. If not, you might want to refresh that at this point. But don't worry, you'll be guided through the structure of the machine learning model.\n", 816 | "\n", 817 | "The `X` and `y` variables are available in your workspace.\n", 818 | "\n", 819 | "**Instructions**\n", 820 | "\n", 821 | "* Split `X` and `y` into training and test data, keeping 30% of the data for testing.\n", 822 | "* Fit your model to your training data.\n", 823 | "* Obtain the model predicted labels by running `model.predict` on `X_test`.\n", 824 | "* Obtain a classification report comparing `y_test` with `predicted`, and use the given confusion matrix to check your results.",
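"\n", "*Side note (an addition to these notes, not part of the exercise): scikit-learn's `LogisticRegression` also accepts `class_weight='balanced'`, which reweights the loss by inverse class frequency; a model-side alternative to resampling.*\n", "\n", "```python\n", "# Hypothetical variant: penalize mistakes on the rare fraud class more heavily\n", "model = LogisticRegression(class_weight='balanced', solver='liblinear')\n", "```"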
825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": null, 830 | "metadata": {}, 831 | "outputs": [], 832 | "source": [ 833 | "# Create the training and testing sets\n", 834 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)" 835 | ] 836 | }, 837 | { 838 | "cell_type": "code", 839 | "execution_count": null, 840 | "metadata": {}, 841 | "outputs": [], 842 | "source": [ 843 | "# Fit a logistic regression model to our data\n", 844 | "model = LogisticRegression(solver='liblinear')\n", 845 | "model.fit(X_train, y_train)" 846 | ] 847 | }, 848 | { 849 | "cell_type": "code", 850 | "execution_count": null, 851 | "metadata": {}, 852 | "outputs": [], 853 | "source": [ 854 | "# Obtain model predictions\n", 855 | "predicted = model.predict(X_test)" 856 | ] 857 | }, 858 | { 859 | "cell_type": "code", 860 | "execution_count": null, 861 | "metadata": {}, 862 | "outputs": [], 863 | "source": [ 864 | "# Print the classification report and confusion matrix\n", 865 | "print('Classification report:\\n', classification_report(y_test, predicted))\n", 866 | "conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)\n", 867 | "print('Confusion matrix:\\n', conf_mat)" 868 | ] 869 | }, 870 | { 871 | "cell_type": "markdown", 872 | "metadata": {}, 873 | "source": [ 874 | "**Do you think these results are better than the rules based model? We are getting far fewer false positives, so that's an improvement. Also, we're catching a higher percentage of fraud cases, so that is also better than before. Do you understand why we have fewer observations to look at in the confusion matrix? Remember we are using only our test data to calculate the model results on. We're comparing the crosstab on the full dataset from the last exercise, with a confusion matrix of only 30% of the total dataset, so that's where that difference comes from. In the next chapter, we'll dive deeper into understanding these model performance metrics. Let's now explore whether we can improve the prediction results even further with resampling methods.**" 875 | ] 876 | }, 877 | { 878 | "cell_type": "markdown", 879 | "metadata": {}, 880 | "source": [ 881 | "### Logistic regression with SMOTE\n", 882 | "\n", 883 | "In this exercise, you're going to take the Logistic Regression model from the previous exercise, and combine that with a **SMOTE resampling method**. We'll show you how to do that efficiently by using a pipeline that combines the resampling method with the model in one go. First, you need to define the pipeline that you're going to use.\n", 884 | "\n", 885 | "**Instructions**\n", 886 | "\n", 887 | "* Import the `Pipeline` module from `imblearn`, this has been done for you.\n", 888 | "* Then define what you want to put into the pipeline, assign the `SMOTE` method with `borderline2` to `resampling`, and assign `LogisticRegression()` to the `model`.\n", 889 | "* Combine two steps in the `Pipeline()` function. You need to state you want to combine `resampling` with the `model` in the respective place in the argument. I show you how to do this.",
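"\n", "*Why `imblearn`'s `Pipeline` rather than scikit-learn's? A sampler only acts during `fit`, and it changes `y` as well as `X`, which scikit-learn pipeline steps are not allowed to do; the `imblearn` version supports this. A minimal sketch (an addition to these notes) using `make_pipeline`, which names the steps automatically, once `resampling` and `model` are defined below:*\n", "\n", "```python\n", "from imblearn.pipeline import make_pipeline\n", "\n", "# Equivalent pipeline without naming the steps manually\n", "pipeline = make_pipeline(resampling, model)\n", "```"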
890 | ] 891 | }, 892 | { 893 | "cell_type": "code", 894 | "execution_count": null, 895 | "metadata": {}, 896 | "outputs": [], 897 | "source": [ 898 | "# Define which resampling method and which ML model to use in the pipeline\n", 899 | "resampling = SMOTE(kind='borderline2')\n", 900 | "model = LogisticRegression(solver='liblinear')" 901 | ] 902 | }, 903 | { 904 | "cell_type": "code", 905 | "execution_count": null, 906 | "metadata": {}, 907 | "outputs": [], 908 | "source": [ 909 | "pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])" 910 | ] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "metadata": {}, 915 | "source": [ 916 | "### Pipelining\n", 917 | "\n", 918 | "Now that you have our pipeline defined, aka **combining a logistic regression with a SMOTE method**, let's run it on the data. You can treat the pipeline as if it were a **single machine learning model**. Our data X and y are already defined, and the pipeline is defined in the previous exercise. Are you curious to find out what the model results are? Let's give it a try!\n", 919 | "\n", 920 | "**Instructions**\n", 921 | "\n", 922 | "* Split the data 'X' and 'y' into the training and test set. Set aside 30% of the data for a test set, and set the `random_state` to zero.\n", 923 | "* Fit your pipeline onto your training data and obtain the predictions by running the `pipeline.predict()` function on our `X_test` dataset." 924 | ] 925 | }, 926 | { 927 | "cell_type": "code", 928 | "execution_count": null, 929 | "metadata": {}, 930 | "outputs": [], 931 | "source": [ 932 | "# Split your data X and y, into a training and a test set and fit the pipeline onto the training data\n", 933 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)" 934 | ] 935 | }, 936 | { 937 | "cell_type": "code", 938 | "execution_count": null, 939 | "metadata": {}, 940 | "outputs": [], 941 | "source": [ 942 | "pipeline.fit(X_train, y_train) \n", 943 | "predicted = pipeline.predict(X_test)" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": null, 949 | "metadata": {}, 950 | "outputs": [], 951 | "source": [ 952 | "# Obtain the results from the classification report and confusion matrix \n", 953 | "print('Classification report:\\n', classification_report(y_test, predicted))\n", 954 | "conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)\n", 955 | "print('Confusion matrix:\\n', conf_mat)" 956 | ] 957 | }, 958 | { 959 | "cell_type": "markdown", 960 | "metadata": {}, 961 | "source": [ 962 | "**As you can see, SMOTE slightly improves our results. We now manage to find all cases of fraud, but we have a slightly higher number of false positives, albeit only 7 cases. Remember, resampling doesn't necessarily lead to better results. When the fraud cases are very spread and scattered over the data, using SMOTE can introduce a bit of bias. Nearest neighbors aren't necessarily also fraud cases, so the synthetic samples might 'confuse' the model slightly. In the next chapters, we'll learn how to also adjust our machine learning models to better detect the minority fraud cases.**" 963 | ] 964 | }, 965 | { 966 | "cell_type": "markdown", 967 | "metadata": {}, 968 | "source": [ 969 | "# Fraud detection using labeled data\n", 970 | "\n", 971 | "Learn how to flag fraudulent transactions with supervised learning. Use classifiers, adjust and compare them to find the most efficient fraud detection model." 
972 | ] 973 | }, 974 | { 975 | "cell_type": "markdown", 976 | "metadata": {}, 977 | "source": [ 978 | "## Review classification methods\n", 979 | "\n", 980 | "* Classification:\n", 981 | " * The problem of identifying to which class a new observation belongs, on the basis of a training set of data containing observations whose class is known\n", 982 | " * Goal: use known fraud cases to train a model to recognize new cases\n", 983 | " * Classes are sometimes called targets, labels or categories\n", 984 | " * Spam detection in email service providers can be identified as a classification problem\n", 985 | " * Binary classification since there are only 2 classes, spam and not spam\n", 986 | " * Fraud detection is also a binary classification problem\n", 987 | " * Patient diagnosis\n", 988 | " * Classification problems normally have categorical output like yes/no, 1/0 or True/False\n", 989 | " * Variable to predict: $$y \in \{0,1\}$$\n", 990 | " * 0: negative class ('majority' normal cases)\n", 991 | " * 1: positive class ('minority' fraud cases)" 992 | ] 993 | }, 994 | { 995 | "cell_type": "markdown", 996 | "metadata": {}, 997 | "source": [ 998 | "#### Logistic Regression\n", 999 | "\n", 1000 | "* Logistic Regression is one of the most used ML algorithms in binary classification\n", 1001 | "* ![logistic regression](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/logistic_regression.JPG)\n", 1002 | "* Can be adjusted reasonably well to work on imbalanced data, which makes it useful for fraud detection" 1003 | ] 1004 | }, 1005 | { 1006 | "cell_type": "markdown", 1007 | "metadata": {}, 1008 | "source": [ 1009 | "#### Neural Network\n", 1010 | "\n", 1011 | "* ![neural network](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/neural_network.JPG)\n", 1012 | "* Can be used as classifiers for fraud detection\n", 1013 | "* Capable of fitting highly non-linear models to the data\n", 1014 | "* More complex to implement than other classifiers - not demonstrated here" 1015 | ] 1016 | }, 1017 | { 1018 | "cell_type": "markdown", 1019 | "metadata": {}, 1020 | "source": [ 1021 | "#### Decision Trees\n", 1022 | "\n", 1023 | "* ![decision tree](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/decision_tree.JPG)\n", 1024 | "* Commonly used for fraud detection\n", 1025 | "* Transparent results, easily interpreted by analysts\n", 1026 | "* Decision trees are prone to overfit the data" 1027 | ] 1028 | }, 1029 | { 1030 | "cell_type": "markdown", 1031 | "metadata": {}, 1032 | "source": [ 1033 | "#### Random Forests\n", 1034 | "\n", 1035 | "* ![random forest](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/random_forest.JPG)\n", 1036 | "* **Random Forests are a more robust option than a single decision tree**\n", 1037 | " * Constructs a multitude of decision trees when training the model and outputs the class that is the mode or mean predicted class of the individual trees\n", 1038 | " * A random forest consists of a collection of trees on a random subset of features\n", 1039 | " * Final predictions are the combined results of those trees\n", 1040 | " * Random forests can handle complex data and are not prone to overfit\n", 1041 | " * They are interpretable by looking at feature importance, and can be adjusted to work well on highly imbalanced data\n", 1042 | " * Their drawback is they're computationally complex\n", 1043 | " * Very popular for fraud detection\n", 1044 | " * A Random Forest 
model will be optimized in the exercises\n", 1045 | " \n", 1046 | "**Implementation:**\n", 1047 | "\n", 1048 | "```python\n", 1049 | "from sklearn.ensemble import RandomForestClassifier\n", 1050 | "model = RandomForestClassifier()\n", 1051 | "model.fit(X_train, y_train)\n", 1052 | "predicted = model.predict(X_test)\n", 1053 | "print(f'Accuracy Score:\\n{accuracy_score(y_test, predicted)}')\n", 1054 | "```" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "markdown", 1059 | "metadata": {}, 1060 | "source": [ 1061 | "### Natural hit rate\n", 1062 | "\n", 1063 | "In this exercise, you'll again use credit card transaction data. The features and labels are similar to the data in the previous chapter, and the **data is heavily imbalanced**. We've given you features `X` and labels `y` to work with already, which are both numpy arrays.\n", 1064 | "\n", 1065 | "First you need to explore how prevalent fraud is in the dataset, to understand what the **\"natural accuracy\"** is, if we were to predict everything as non-fraud. It is important to understand which level of \"accuracy\" you need to \"beat\" in order to get a **better prediction than by doing nothing**. In the following exercises, you'll create your first random forest classifier for fraud detection. That will serve as the **\"baseline\"** model that you're going to try to improve in the upcoming exercises.\n", 1066 | "\n", 1067 | "**Instructions**\n", 1068 | "\n", 1069 | "* Count the total number of observations by taking the length of your labels `y`.\n", 1070 | "* Count the non-fraud cases in our data by using list comprehension on `y`; remember `y` is a NumPy array so `.value_counts()` cannot be used in this case.\n", 1071 | "* Calculate the natural accuracy by dividing the non-fraud cases over the total observations.\n", 1072 | "* Print the percentage.",
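"\n", "*One-liner check (an addition to these notes): with 0/1 labels, `y.mean()` is the fraud rate, so the natural accuracy is simply its complement.*\n", "\n", "```python\n", "# Natural accuracy: fraction of non-fraud labels\n", "print(f'{(1 - y.mean()) * 100:0.2f}%')\n", "```"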
1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "code", 1077 | "execution_count": null, 1078 | "metadata": {}, 1079 | "outputs": [], 1080 | "source": [ 1081 | "df2 = pd.read_csv(cc2_file)\n", 1082 | "df2.head()" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "code", 1087 | "execution_count": null, 1088 | "metadata": {}, 1089 | "outputs": [], 1090 | "source": [ 1091 | "X, y = prep_data(df2)\n", 1092 | "print(f'X shape: {X.shape}\\ny shape: {y.shape}')" 1093 | ] 1094 | }, 1095 | { 1096 | "cell_type": "code", 1097 | "execution_count": null, 1098 | "metadata": {}, 1099 | "outputs": [], 1100 | "source": [ 1101 | "X[0, :]" 1102 | ] 1103 | }, 1104 | { 1105 | "cell_type": "code", 1106 | "execution_count": null, 1107 | "metadata": {}, 1108 | "outputs": [], 1109 | "source": [ 1110 | "df2.Class.value_counts()" 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "code", 1115 | "execution_count": null, 1116 | "metadata": {}, 1117 | "outputs": [], 1118 | "source": [ 1119 | "# Count the total number of observations from the length of y\n", 1120 | "total_obs = len(y)\n", 1121 | "total_obs" 1122 | ] 1123 | }, 1124 | { 1125 | "cell_type": "code", 1126 | "execution_count": null, 1127 | "metadata": {}, 1128 | "outputs": [], 1129 | "source": [ 1130 | "# Count the total number of non-fraudulent observations \n", 1131 | "non_fraud = [i for i in y if i == 0]\n", 1132 | "count_non_fraud = non_fraud.count(0)\n", 1133 | "count_non_fraud" 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "code", 1138 | "execution_count": null, 1139 | "metadata": {}, 1140 | "outputs": [], 1141 | "source": [ 1142 | "percentage = count_non_fraud/total_obs * 100\n", 1143 | "print(f'{percentage:0.2f}%')" 1144 | ] 1145 | }, 1146 | { 1147 | "cell_type": "markdown", 1148 | "metadata": {}, 1149 | "source": [ 1150 | "**This tells us that by doing nothing, we would be correct in 95.9% of the cases. If our model's accuracy falls below this number, it does not actually add any value over predicting everything as non-fraud. Let's see how a random forest does in predicting fraud in our data.**" 1151 | ] 1152 | }, 1153 | { 1154 | "cell_type": "markdown", 1155 | "metadata": {}, 1156 | "source": [ 1157 | "### Random Forest Classifier - part 1\n", 1158 | "\n", 1159 | "Let's now create a first **random forest classifier** for fraud detection. Hopefully you can do better than the baseline accuracy you've just calculated, which was roughly **96%**. This model will serve as the **\"baseline\" model** that you're going to try to improve in the upcoming exercises. Let's start first with **splitting the data into a test and training set**, and **defining the Random Forest model**. The data available are features `X` and labels `y`.\n", 1160 | "\n", 1161 | "**Instructions**\n", 1162 | "\n", 1163 | "* Import the random forest classifier from `sklearn`.\n", 1164 | "* Split your features `X` and labels `y` into a training and test set. Set aside a test set of 30%.\n", 1165 | "* Assign the random forest classifier to `model` and keep `random_state` at 5. We need to set a random state here in order to be able to compare results across different models." 
1166 | ] 1167 | }, 1168 | { 1169 | "cell_type": "markdown", 1170 | "metadata": {}, 1171 | "source": [ 1172 | "#### X_train, X_test, y_train, y_test" 1173 | ] 1174 | }, 1175 | { 1176 | "cell_type": "code", 1177 | "execution_count": null, 1178 | "metadata": {}, 1179 | "outputs": [], 1180 | "source": [ 1181 | "# Split your data into training and test set\n", 1182 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)" 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "code", 1187 | "execution_count": null, 1188 | "metadata": {}, 1189 | "outputs": [], 1190 | "source": [ 1191 | "# Define the model as the random forest\n", 1192 | "model = RandomForestClassifier(random_state=5, n_estimators=20)" 1193 | ] 1194 | }, 1195 | { 1196 | "cell_type": "markdown", 1197 | "metadata": {}, 1198 | "source": [ 1199 | "### Random Forest Classifier - part 2\n", 1200 | "\n", 1201 | "Let's see how our Random Forest model performs **without doing anything special to it**. The `model` from the previous exercise is available, and you've already split your data in `X_train, y_train, X_test, y_test`.\n", 1202 | "\n", 1203 | "**Instructions 1/3**\n", 1204 | "\n", 1205 | "* Fit the earlier defined `model` to our training data and obtain predictions by getting the model predictions on `X_test`." 1206 | ] 1207 | }, 1208 | { 1209 | "cell_type": "code", 1210 | "execution_count": null, 1211 | "metadata": {}, 1212 | "outputs": [], 1213 | "source": [ 1214 | "# Fit the model to our training set\n", 1215 | "model.fit(X_train, y_train)" 1216 | ] 1217 | }, 1218 | { 1219 | "cell_type": "code", 1220 | "execution_count": null, 1221 | "metadata": {}, 1222 | "outputs": [], 1223 | "source": [ 1224 | "# Obtain predictions from the test data \n", 1225 | "predicted = model.predict(X_test)" 1226 | ] 1227 | }, 1228 | { 1229 | "cell_type": "markdown", 1230 | "metadata": {}, 1231 | "source": [ 1232 | "**Instructions 2/3**\n", 1233 | "\n", 1234 | "* Obtain and print the accuracy score by comparing the actual labels `y_test` with our predicted labels `predicted`." 1235 | ] 1236 | }, 1237 | { 1238 | "cell_type": "code", 1239 | "execution_count": null, 1240 | "metadata": {}, 1241 | "outputs": [], 1242 | "source": [ 1243 | "print(f'Accuracy Score:\\n{accuracy_score(y_test, predicted):0.3f}')" 1244 | ] 1245 | }, 1246 | { 1247 | "cell_type": "markdown", 1248 | "metadata": {}, 1249 | "source": [ 1250 | "**Instructions 3/3**\n", 1251 | "\n", 1252 | "What is a benefit of using Random Forests versus Decision Trees?\n", 1253 | "\n", 1254 | "**Possible Answers**\n", 1255 | "\n", 1256 | "* ~~Random Forests always have a higher accuracy than Decision Trees.~~\n", 1257 | "* **Random Forests do not tend to overfit, whereas Decision Trees do.**\n", 1258 | "* ~~Random Forests are computationally more efficient than Decision Trees.~~\n", 1259 | "* ~~You can obtain \"feature importance\" from Random Forest, which makes it more transparent.~~\n", 1260 | "\n", 1261 | "**Random Forest prevents overfitting most of the time, by creating random subsets of the features and building smaller trees using these subsets. 
Afterwards, it combines the subtrees of subsamples of features, so it does not tend to overfit to your entire feature set the way \"deep\" Decision Trees do.**" 1262 | ] 1263 | }, 1264 | { 1265 | "cell_type": "markdown", 1266 | "metadata": {}, 1267 | "source": [ 1268 | "## Performance evaluation\n", 1269 | "\n", 1270 | "* Performance metrics for fraud detection models\n", 1271 | "* There are other performance metrics that are more informative and reliable than accuracy" 1272 | ] 1273 | }, 1274 | { 1275 | "cell_type": "markdown", 1276 | "metadata": {}, 1277 | "source": [ 1278 | "#### Accuracy\n", 1279 | "\n", 1280 | "![accuracy](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/accuracy.JPG)\n", 1281 | "* Accuracy isn't a reliable performance metric when working with highly imbalanced data (such as fraud detection)\n", 1282 | "* By doing nothing, aka predicting everything is the majority class (right image), a higher accuracy is obtained than by trying to build a predictive model (left image)" 1283 | ] 1284 | }, 1285 | { 1286 | "cell_type": "markdown", 1287 | "metadata": {}, 1288 | "source": [ 1289 | "#### Confusion Matrix\n", 1290 | "\n", 1291 | "![advanced confusion matrix](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/confusion_matrix_advanced.JPG)\n", 1292 | "![confusion matrix](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/confusion_matrix.JPG)\n", 1293 | "* [Confusion Matrix](https://en.wikipedia.org/wiki/Confusion_matrix)\n", 1294 | "* False Positives (FP) / False Negatives (FN)\n", 1295 | " * FN: predicts the person is not pregnant, but actually is\n", 1296 | " * Cases of fraud not caught by the model\n", 1297 | " * FP: predicts the person is pregnant, but actually is not\n", 1298 | " * Cases of 'false alarm'\n", 1299 | " * the business case determines whether FN or FP cases are more important\n", 1300 | " * a credit card company might want to catch as much fraud as possible and reduce false negatives, as fraudulent transactions can be incredibly costly\n", 1301 | " * a false alarm just means a transaction is blocked\n", 1302 | " * an insurance company can't handle many false alarms, as it means getting a team of investigators involved for each positive prediction\n", 1303 | " \n", 1304 | "* True Positives / True Negatives are the cases predicted correctly (e.g. 
fraud / non-fraud)" 1305 | ] 1306 | }, 1307 | { 1308 | "cell_type": "markdown", 1309 | "metadata": {}, 1310 | "source": [ 1311 | "#### Precision Recall\n", 1312 | "\n", 1313 | "* Credit card company wants to optimize for recall\n", 1314 | "* Insurance company wants to optimize for precision\n", 1315 | "* Precision:\n", 1316 | " * $$Precision=\frac{\#\space True\space Positives}{\#\space True\space Positives+\#\space False\space Positives}$$\n", 1317 | " * Fraction of actual fraud cases out of all predicted fraud cases\n", 1318 | " * true positives relative to the sum of true positives and false positives\n", 1319 | "* Recall:\n", 1320 | " * $$Recall=\frac{\#\space True\space Positives}{\#\space True\space Positives+\#\space False\space Negatives}$$\n", 1321 | " * Fraction of predicted fraud cases out of all actual fraud cases\n", 1322 | " * true positives relative to the sum of true positives and false negatives\n", 1323 | "* Precision and recall are typically inversely related\n", 1324 | " * As precision increases, recall falls and vice-versa\n", 1325 | " * ![precision recall inverse relation](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/precision_recall_inverse.JPG)" 1326 | ] 1327 | }, 1328 | { 1329 | "cell_type": "markdown", 1330 | "metadata": {}, 1331 | "source": [ 1332 | "#### F-Score\n", 1333 | "\n", 1334 | "* Weighs both precision and recall into one measure\n", 1335 | "\n", 1336 | "\begin{align}\n", 1337 | "F-measure = \frac{2\times{Precision}\times{Recall}}{Precision+Recall} \\ \n", 1338 | "\\\n", 1339 | "= \frac{2\times{TP}}{2\times{TP}+FP+FN}\n", 1340 | "\end{align}\n", 1341 | "\n", 1342 | "* is a performance metric that takes into account a balance between Precision and Recall" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "markdown", 1347 | "metadata": {}, 1348 | "source": [ 1349 | "#### Obtaining performance metrics from sklearn\n", 1350 | "\n", 1351 | "```python\n", 1352 | "# import the methods\n", 1353 | "from sklearn.metrics import precision_recall_curve, average_precision_score\n", 1354 | "\n", 1355 | "# Calculate average precision and the PR curve\n", 1356 | "average_precision = average_precision_score(y_test, predicted)\n", 1357 | "\n", 1358 | "# Obtain precision and recall (the function also returns thresholds)\n", 1359 | "precision, recall, _ = precision_recall_curve(y_test, predicted)\n", 1360 | "```" 1361 | ] 1362 | }, 1363 | { 1364 | "cell_type": "markdown", 1365 | "metadata": {}, 1366 | "source": [ 1367 | "#### Receiver Operating Characteristic (ROC) curve to compare algorithms\n", 1368 | "\n", 1369 | "* Created by plotting the true positive rate against the false positive rate at various threshold settings\n", 1370 | "* ![roc curve](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/roc_curve.JPG)\n", 1371 | "* Useful for comparing performance of different algorithms\n", 1372 | "\n", 1373 | "```python\n", 1374 | "# Obtain model probabilities\n", 1375 | "probs = model.predict_proba(X_test)\n", 1376 | "\n", 1377 | "# Print ROC_AUC score using probabilities\n", 1378 | "print(metrics.roc_auc_score(y_test, probs[:, 1]))\n", 1379 | "```" 1380 | ] 1381 | }, 1382 | { 1383 | "cell_type": "markdown", 1384 | "metadata": {}, 1385 | "source": [ 1386 | "#### Confusion matrix and classification report\n", 1387 | "\n", 1388 | "```python\n", 1389 | "from sklearn.metrics import classification_report, confusion_matrix\n", 1390 | "\n", 1391 | "# Obtain predictions\n", 1392 | "predicted = model.predict(X_test)\n",
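"\n", "# Added sketch (not in the original notes): the F-score from the section\n", "# above is available directly as sklearn's f1_score\n", "from sklearn.metrics import f1_score\n", "print(f1_score(y_test, predicted))\n",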
1393 | "\n", 1394 | "# Print classification report using predictions\n", 1395 | "print(classification_report(y_test, predicted))\n", 1396 | "\n", 1397 | "# Print confusion matrix using predictions\n", 1398 | "print(confusion_matrix(y_test, predicted))\n", 1399 | "```" 1400 | ] 1401 | }, 1402 | { 1403 | "cell_type": "markdown", 1404 | "metadata": {}, 1405 | "source": [ 1406 | "### Performance metrics for the RF model\n", 1407 | "\n", 1408 | "In the previous exercises you obtained an accuracy score for your random forest model. This time, we know **accuracy can be misleading** in the case of fraud detection. With highly imbalanced fraud data, the AUROC curve is a more reliable performance metric, used to compare different classifiers. Moreover, the *classification report* tells you about the precision and recall of your model, whilst the *confusion matrix* actually shows how many fraud cases you can predict correctly. So let's get these performance metrics.\n", 1409 | "\n", 1410 | "You'll continue working on the same random forest model from the previous exercise. Your model, defined as `model = RandomForestClassifier(random_state=5)` has been fitted to your training data already, and `X_train, y_train, X_test, y_test` are available.\n", 1411 | "\n", 1412 | "**Instructions**\n", 1413 | "\n", 1414 | "* Import the classification report, confusion matrix and ROC score from `sklearn.metrics`.\n", 1415 | "* Get the binary predictions from your trained random forest `model`.\n", 1416 | "* Get the predicted probabilities by running the `predict_proba()` function.\n", 1417 | "* Obtain classification report and confusion matrix by comparing `y_test` with `predicted`." 1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "code", 1422 | "execution_count": null, 1423 | "metadata": {}, 1424 | "outputs": [], 1425 | "source": [ 1426 | "# Obtain the predictions from our random forest model \n", 1427 | "predicted = model.predict(X_test)" 1428 | ] 1429 | }, 1430 | { 1431 | "cell_type": "code", 1432 | "execution_count": null, 1433 | "metadata": {}, 1434 | "outputs": [], 1435 | "source": [ 1436 | "# Predict probabilities\n", 1437 | "probs = model.predict_proba(X_test)" 1438 | ] 1439 | }, 1440 | { 1441 | "cell_type": "code", 1442 | "execution_count": null, 1443 | "metadata": {}, 1444 | "outputs": [], 1445 | "source": [ 1446 | "# Print the ROC curve, classification report and confusion matrix\n", 1447 | "print('ROC Score:')\n", 1448 | "print(roc_auc_score(y_test, probs[:,1]))\n", 1449 | "print('\\nClassification Report:')\n", 1450 | "print(classification_report(y_test, predicted))\n", 1451 | "print('\\nConfusion Matrix:')\n", 1452 | "print(confusion_matrix(y_test, predicted))" 1453 | ] 1454 | }, 1455 | { 1456 | "cell_type": "markdown", 1457 | "metadata": {}, 1458 | "source": [ 1459 | "**You have now obtained more meaningful performance metrics that tell us how well the model performs, given the highly imbalanced data that you're working with. The model predicts 76 cases of fraud, out of which 73 are actual fraud. You have only 3 false positives. This is really good, and as a result you have a very high precision score. You do however, miss 18 cases of actual fraud. Recall is therefore not as good as precision.**" 1460 | ] 1461 | }, 1462 | { 1463 | "cell_type": "markdown", 1464 | "metadata": {}, 1465 | "source": [ 1466 | "### Plotting the Precision vs. Recall Curve\n", 1467 | "\n", 1468 | "You can also plot a **Precision-Recall curve**, to investigate the trade-off between the two in your model. 
In this curve **Precision and Recall are inversely related**; as Precision increases, Recall falls and vice-versa. A balance between these two needs to be achieved in your model, otherwise you might end up with many false positives, or not enough actual fraud cases caught. To achieve this and to compare performance, the precision-recall curves come in handy.\n", 1469 | "\n", 1470 | "Your Random Forest Classifier is available as `model`, and the predictions as `predicted`. You can simply obtain the average precision score and the PR curve from the sklearn package. The function `plot_pr_curve()` plots the results for you. Let's give it a try.\n", 1471 | "\n", 1472 | "**Instructions 1/3**\n", 1473 | "\n", 1474 | "* Calculate the average precision by running the function on the actual labels `y_test` and your predicted labels `predicted`." 1475 | ] 1476 | }, 1477 | { 1478 | "cell_type": "code", 1479 | "execution_count": null, 1480 | "metadata": {}, 1481 | "outputs": [], 1482 | "source": [ 1483 | "# Calculate average precision and the PR curve\n", 1484 | "average_precision = average_precision_score(y_test, predicted)\n", 1485 | "average_precision" 1486 | ] 1487 | }, 1488 | { 1489 | "cell_type": "markdown", 1490 | "metadata": {}, 1491 | "source": [ 1492 | "**Instructions 2/3**\n", 1493 | "\n", 1494 | "* Run the `precision_recall_curve()` function on the same arguments `y_test` and `predicted` and plot the curve (this last thing has been done for you)." 1495 | ] 1496 | }, 1497 | { 1498 | "cell_type": "code", 1499 | "execution_count": null, 1500 | "metadata": {}, 1501 | "outputs": [], 1502 | "source": [ 1503 | "# Obtain precision and recall \n", 1504 | "precision, recall, _ = precision_recall_curve(y_test, predicted)\n", 1505 | "print(f'Precision: {precision}\\nRecall: {recall}')" 1506 | ] 1507 | }, 1508 | { 1509 | "cell_type": "markdown", 1510 | "metadata": {}, 1511 | "source": [ 1512 | "#### def plot_pr_curve" 1513 | ] 1514 | }, 1515 | { 1516 | "cell_type": "code", 1517 | "execution_count": null, 1518 | "metadata": {}, 1519 | "outputs": [], 1520 | "source": [ 1521 | "def plot_pr_curve(recall, precision, average_precision):\n", 1522 | " \"\"\"\n", 1523 | " https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html\n", 1524 | " \"\"\"\n", 1525 | " from inspect import signature\n", 1526 | " plt.figure()\n", 1527 | " step_kwargs = ({'step': 'post'}\n", 1528 | " if 'step' in signature(plt.fill_between).parameters\n", 1529 | " else {})\n", 1530 | "\n", 1531 | " plt.step(recall, precision, color='b', alpha=0.2, where='post')\n", 1532 | " plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)\n", 1533 | "\n", 1534 | " plt.xlabel('Recall')\n", 1535 | " plt.ylabel('Precision')\n", 1536 | " plt.ylim([0.0, 1.0])\n", 1537 | " plt.xlim([0.0, 1.0])\n", 1538 | " plt.title(f'2-class Precision-Recall curve: AP={average_precision:0.2f}')\n", 1539 | " return plt.show()" 1540 | ] 1541 | }, 1542 | { 1543 | "cell_type": "code", 1544 | "execution_count": null, 1545 | "metadata": {}, 1546 | "outputs": [], 1547 | "source": [ 1548 | "# Plot the recall precision tradeoff\n", 1549 | "plot_pr_curve(recall, precision, average_precision)" 1550 | ] 1551 | }, 1552 | { 1553 | "cell_type": "markdown", 1554 | "metadata": {}, 1555 | "source": [ 1556 | "**Instructions 3/3**\n", 1557 | "\n", 1558 | "What's the benefit of the performance metric ROC curve (AUROC) versus Precision and Recall?\n", 1559 | "\n", 1560 | "**Possible Answers**\n", 1561 | "\n", 1562 | "* **The AUROC answers the 
question: \"How well can this classifier be expected to perform in general, at a variety of different baseline probabilities?\" but precision and recall don't.**\n", 1563 | "* ~~The AUROC answers the question: \"How meaningful is a positive result from my classifier given the baseline probabilities of my problem?\" but precision and recall don't.~~\n", 1564 | "* ~~Precision and Recall are not informative when the data is imbalanced.~~\n", 1565 | "* ~~The AUROC curve allows you to visualize classifier performance and with Precision and Recall you cannot.~~\n", 1566 | "\n", 1567 | "**The ROC curve plots the true positives vs. false positives , for a classifier, as its discrimination threshold is varied. Since, a random method describes a horizontal curve through the unit interval, it has an AUC of 0.5. Minimally, classifiers should perform better than this, and the extent to which they score higher than one another (meaning the area under the ROC curve is larger), they have better expected performance.**" 1568 | ] 1569 | }, 1570 | { 1571 | "cell_type": "markdown", 1572 | "metadata": {}, 1573 | "source": [ 1574 | "## Adjusting the algorithm weights\n", 1575 | "\n", 1576 | "* Adjust model parameter to optimize for fraud detection.\n", 1577 | "* When training a model, try different options and settings to get the best recall-precision trade-off\n", 1578 | "* sklearn has two simple options to tweak the model for heavily imbalanced data\n", 1579 | " * `class_weight`:\n", 1580 | " * `balanced` mode: `model = RandomForestClassifier(class_weight='balanced')`\n", 1581 | " * uses the values of y to automatically adjust weights inversely proportional to class frequencies in the the input data\n", 1582 | " * this option is available for other classifiers\n", 1583 | " * `model = LogisticRegression(class_weight='balanced')`\n", 1584 | " * `model = SVC(kernel='linear', class_weight='balanced', probability=True)`\n", 1585 | " * `balanced_subsample` mode: `model = RandomForestClassifier(class_weight='balanced_subsample')`\n", 1586 | " * is the same as the `balanced` option, except weights are calculated again at each iteration of growing a tree in a the random forest\n", 1587 | " * this option is only applicable for the Random Forest model\n", 1588 | " * manual input\n", 1589 | " * adjust weights to any ratio, not just value counts relative to sample\n", 1590 | " * `class_weight={0:1,1:4}`\n", 1591 | " * this is a good option to slightly upsample the minority class" 1592 | ] 1593 | }, 1594 | { 1595 | "cell_type": "markdown", 1596 | "metadata": {}, 1597 | "source": [ 1598 | "#### Hyperparameter tuning\n", 1599 | "\n", 1600 | "* Random Forest takes many other options to optimize the model\n", 1601 | "\n", 1602 | "```python\n", 1603 | "model = RandomForestClassifier(n_estimators=10, \n", 1604 | " criterion=’gini’, \n", 1605 | " max_depth=None, \n", 1606 | " min_samples_split=2, \n", 1607 | " min_samples_leaf=1, \n", 1608 | " max_features=’auto’, \n", 1609 | " n_jobs=-1, class_weight=None)\n", 1610 | "```\n", 1611 | "\n", 1612 | "* the shape and size of the trees in a random forest are adjusted with **leaf size** and **tree depth**\n", 1613 | "* `n_estimators`: one of the most important setting is the number of trees in the forest\n", 1614 | "* `max_features`: the number of features considered for splitting at each leaf node\n", 1615 | "* `criterion`: change the way the data is split at each node (default is `gini` coefficient)" 1616 | ] 1617 | }, 1618 | { 1619 | "cell_type": "markdown", 1620 | "metadata": {}, 
1621 | "source": [ 1622 | "#### GridSearchCV for hyperparameter tuning\n", 1623 | "\n", 1624 | "* [sklearn.model_selection.GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)\n", 1625 | "* `from sklearn.model_selection import GridSearchCV`\n", 1626 | "* `GridSearchCV evaluates all combinations of parameters defined in the parameter grid\n", 1627 | "* Random Forest Parameter Grid:\n", 1628 | "\n", 1629 | "```python\n", 1630 | "# Create the parameter grid \n", 1631 | "param_grid = {'max_depth': [80, 90, 100, 110],\n", 1632 | " 'max_features': [2, 3],\n", 1633 | " 'min_samples_leaf': [3, 4, 5],\n", 1634 | " 'min_samples_split': [8, 10, 12],\n", 1635 | " 'n_estimators': [100, 200, 300, 1000]}\n", 1636 | "\n", 1637 | "# Define which model to use\n", 1638 | "model = RandomForestRegressor()\n", 1639 | "\n", 1640 | "# Instantiate the grid search model\n", 1641 | "grid_search_model = GridSearchCV(estimator = model, \n", 1642 | " param_grid = param_grid, \n", 1643 | " cv = 5,\n", 1644 | " n_jobs = -1, \n", 1645 | " scoring='f1')\n", 1646 | "```\n", 1647 | "\n", 1648 | "* define the ML model to be used\n", 1649 | "* put the model into `GridSearchCV`\n", 1650 | "* pass in `param_grid`\n", 1651 | "* frequency of cross-validation\n", 1652 | "* define a scoring metric to evaluate the models\n", 1653 | " * the default option is accuracy which isn't optimal for fraud detection\n", 1654 | " * use `precision`, `recall` or `f1`\n", 1655 | "\n", 1656 | "```python\n", 1657 | "# Fit the grid search to the data\n", 1658 | "grid_search_model.fit(X_train, y_train)\n", 1659 | "\n", 1660 | "# Get the optimal parameters \n", 1661 | "grid_search_model.best_params_\n", 1662 | "\n", 1663 | "{'bootstrap': True,\n", 1664 | " 'max_depth': 80,\n", 1665 | " 'max_features': 3,\n", 1666 | " 'min_samples_leaf': 5,\n", 1667 | " 'min_samples_split': 12,\n", 1668 | " 'n_estimators': 100}\n", 1669 | "```\n", 1670 | "\n", 1671 | "* once `GridSearchCV` and `model` are fit to the data, obtain the parameters belonging to the optimal model by using the `best_params_` attribute\n", 1672 | "* `GridSearchCV` is computationally heavy\n", 1673 | " * Can require many hours, depending on the amount of data and number of parameters in the grid\n", 1674 | " * __**Save the Results**__\n", 1675 | "\n", 1676 | "```python\n", 1677 | "# Get the best_estimator results\n", 1678 | "grid_search.best_estimator_\n", 1679 | "grid_search.best_score_\n", 1680 | "```\n", 1681 | "\n", 1682 | "* `best_score_`: mean cross-validated score of the `best_estimator_`, which depends on the `scoring` option" 1683 | ] 1684 | }, 1685 | { 1686 | "cell_type": "markdown", 1687 | "metadata": {}, 1688 | "source": [ 1689 | "### Model adjustments\n", 1690 | "\n", 1691 | "A simple way to adjust the random forest model to deal with highly imbalanced fraud data, is to use the **`class_weights` option** when defining the `sklearn` model. However, as you will see, it is a bit of a blunt force mechanism and might not work for your very special case.\n", 1692 | "\n", 1693 | "In this exercise you'll explore the ``weight = \"balanced_subsample\"`` mode the Random Forest model from the earlier exercise. You already have split your data in a training and test set, i.e `X_train`, `X_test`, `y_train`, `y_test` are available. 
1685 | { 1686 | "cell_type": "markdown", 1687 | "metadata": {}, 1688 | "source": [ 1689 | "### Model adjustments\n", 1690 | "\n", 1691 | "A simple way to adjust the random forest model to deal with highly imbalanced fraud data is to use the **`class_weight` option** when defining the `sklearn` model. However, as you will see, it is a bit of a blunt force mechanism and might not work for your very special case.\n", 1692 | "\n", 1693 | "In this exercise you'll explore the `class_weight=\"balanced_subsample\"` mode of the Random Forest model from the earlier exercise. You have already split your data into a training and test set, i.e. `X_train`, `X_test`, `y_train`, `y_test` are available. The metrics functions have already been imported.\n", 1694 | "\n", 1695 | "**Instructions**\n", 1696 | "\n", 1697 | "* Set the `class_weight` argument of your classifier to `balanced_subsample`.\n", 1698 | "* Fit your model to your training set.\n", 1699 | "* Obtain predictions and probabilities from `X_test`.\n", 1700 | "* Obtain the `roc_auc_score`, the classification report and confusion matrix." 1701 | ] 1702 | }, 1703 | { 1704 | "cell_type": "code", 1705 | "execution_count": null, 1706 | "metadata": {}, 1707 | "outputs": [], 1708 | "source": [ 1709 | "# Define the model with balanced subsample\n", 1710 | "model = RandomForestClassifier(class_weight='balanced_subsample', random_state=5, n_estimators=100)\n", 1711 | "\n", 1712 | "# Fit your training model to your training set\n", 1713 | "model.fit(X_train, y_train)\n", 1714 | "\n", 1715 | "# Obtain the predicted values and probabilities from the model \n", 1716 | "predicted = model.predict(X_test)\n", 1717 | "probs = model.predict_proba(X_test)\n", 1718 | "\n", 1719 | "# Print the ROC curve, classification report and confusion matrix\n", 1720 | "print('ROC Score:')\n", 1721 | "print(roc_auc_score(y_test, probs[:,1]))\n", 1722 | "print('\\nClassification Report:')\n", 1723 | "print(classification_report(y_test, predicted))\n", 1724 | "print('\\nConfusion Matrix:')\n", 1725 | "print(confusion_matrix(y_test, predicted))" 1726 | ] 1727 | }, 1728 | { 1729 | "cell_type": "markdown", 1730 | "metadata": {}, 1731 | "source": [ 1732 | "**You can see that the model results don't improve drastically. We now have 3 fewer false positives, but 19 instead of 18 false negatives, i.e. cases of fraud we are not catching. If we mostly care about catching fraud, and not so much about the false positives, this actually does not improve our model at all, although it is a simple option to try. In the next exercises you'll see how to more smartly tweak your model to focus on reducing false negatives and catch more fraud.**" 1733 | ] 1734 | }, 1735 | { 1736 | "cell_type": "markdown", 1737 | "metadata": {}, 1738 | "source": [ 1739 | "### Adjusting RF for fraud detection\n", 1740 | "\n", 1741 | "In this exercise you're going to dive into the options for the random forest classifier, as we'll **assign weights** and **tweak the shape** of the decision trees in the forest. You'll **define weights manually**, to be able to offset that imbalance slightly. In our case we have 300 fraud to 7000 non-fraud cases, so by setting the weight ratio to 1:12 (12 × 300 = 3600 weighted fraud cases against 7000 non-fraud), we get to roughly a 1/3 fraud to 2/3 non-fraud ratio, which is good enough for training the model on.\n", 1742 | "\n", 1743 | "The data in this exercise has already been split into training and test set, so you just need to focus on defining your model. You can then use the function `get_model_results()` as a shortcut. This function fits the model to your training data, predicts and obtains performance metrics similar to the steps you did in the previous exercises.\n", 1744 | "\n", 1745 | "**Instructions**\n", 1746 | "\n", 1747 | "* Change the `class_weight` option to set the ratio to 1 to 12 for the non-fraud and fraud cases, and set the split criterion to 'entropy'.\n", 1748 | "* Set the maximum depth to 10.\n", 1749 | "* Set the minimal samples in leaf nodes to 10.\n", 1750 | "* Set the number of trees to use in the model to 20."
1751 | ] 1752 | }, 1753 | { 1754 | "cell_type": "markdown", 1755 | "metadata": {}, 1756 | "source": [ 1757 | "#### def get_model_results" 1758 | ] 1759 | }, 1760 | { 1761 | "cell_type": "code", 1762 | "execution_count": null, 1763 | "metadata": {}, 1764 | "outputs": [], 1765 | "source": [ 1766 | "def get_model_results(X_train: np.ndarray, y_train: np.ndarray,\n", 1767 | " X_test: np.ndarray, y_test: np.ndarray, model):\n", 1768 | " \"\"\"\n", 1769 | " model: sklearn model (e.g. RandomForestClassifier)\n", 1770 | " \"\"\"\n", 1771 | " # Fit your training model to your training set\n", 1772 | " model.fit(X_train, y_train)\n", 1773 | "\n", 1774 | " # Obtain the predicted values and probabilities from the model \n", 1775 | " predicted = model.predict(X_test)\n", 1776 | " \n", 1777 | " try:\n", 1778 | " probs = model.predict_proba(X_test)\n", 1779 | " print('ROC Score:')\n", 1780 | " print(roc_auc_score(y_test, probs[:,1]))\n", 1781 | " except AttributeError:\n", 1782 | " pass\n", 1783 | "\n", 1784 | " # Print the ROC curve, classification report and confusion matrix\n", 1785 | " print('\\nClassification Report:')\n", 1786 | " print(classification_report(y_test, predicted))\n", 1787 | " print('\\nConfusion Matrix:')\n", 1788 | " print(confusion_matrix(y_test, predicted))" 1789 | ] 1790 | }, 1791 | { 1792 | "cell_type": "code", 1793 | "execution_count": null, 1794 | "metadata": {}, 1795 | "outputs": [], 1796 | "source": [ 1797 | "# Change the model options\n", 1798 | "model = RandomForestClassifier(bootstrap=True,\n", 1799 | " class_weight={0:1, 1:12},\n", 1800 | " criterion='entropy',\n", 1801 | " # Change depth of model\n", 1802 | " max_depth=10,\n", 1803 | " # Change the number of samples in leaf nodes\n", 1804 | " min_samples_leaf=10, \n", 1805 | " # Change the number of trees to use\n", 1806 | " n_estimators=20,\n", 1807 | " n_jobs=-1,\n", 1808 | " random_state=5)\n", 1809 | "\n", 1810 | "# Run the function get_model_results\n", 1811 | "get_model_results(X_train, y_train, X_test, y_test, model)" 1812 | ] 1813 | }, 1814 | { 1815 | "cell_type": "markdown", 1816 | "metadata": {}, 1817 | "source": [ 1818 | "**By smartly defining more options in the model, you can obtain better predictions. You have effectively reduced the number of false negatives, i.e. you are catching more cases of fraud, whilst keeping the number of false positives low. In this exercise you've manually changed the options of the model. There is a smarter way of doing it, by using `GridSearchCV`, which you'll see in the next exercise!**" 1819 | ] 1820 | }, 1821 | { 1822 | "cell_type": "markdown", 1823 | "metadata": {}, 1824 | "source": [ 1825 | "### Parameter optimization with GridSearchCV\n", 1826 | "\n", 1827 | "In this exercise you're going to **tweak our model in a less \"random\" way**, but use `GridSearchCV` to do the work for you.\n", 1828 | "\n", 1829 | "With `GridSearchCV` you can define **which performance metric to score** the options on. Since for fraud detection we are mostly interested in catching as many fraud cases as possible, you can optimize your model settings to get the best possible Recall score. 
If you also cared about reducing the number of false positives, you could optimize on the F1-score, which gives you that nice Precision-Recall trade-off.\n", 1830 | "\n", 1831 | "`GridSearchCV` has already been imported from `sklearn.model_selection`, so let's give it a try!\n", 1832 | "\n", 1833 | "**Instructions**\n", 1834 | "\n", 1835 | "* Define in the parameter grid that you want to try 1 and 30 trees, and that you want to try the `gini` and `entropy` split criterion.\n", 1836 | "* Define the model to be a simple `RandomForestClassifier`; keep `random_state` at 5 to be able to compare models.\n", 1837 | "* Set the `scoring` option such that it optimizes for recall.\n", 1838 | "* Fit the model to the training data `X_train` and `y_train` and obtain the best parameters for the model." 1839 | ] 1840 | }, 1841 | { 1842 | "cell_type": "code", 1843 | "execution_count": null, 1844 | "metadata": {}, 1845 | "outputs": [], 1846 | "source": [ 1847 | "# Define the parameter sets to test\n", 1848 | "param_grid = {'n_estimators': [1, 30],\n", 1849 | " 'max_features': ['auto', 'log2'], \n", 1850 | " 'max_depth': [4, 8, 10, 12],\n", 1851 | " 'criterion': ['gini', 'entropy']}\n", 1852 | "\n", 1853 | "# Define the model to use\n", 1854 | "model = RandomForestClassifier(random_state=5)\n", 1855 | "\n", 1856 | "# Combine the parameter sets with the defined model\n", 1857 | "CV_model = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='recall', n_jobs=-1)\n", 1858 | "\n", 1859 | "# Fit the model to our training data and obtain best parameters\n", 1860 | "CV_model.fit(X_train, y_train)\n", 1861 | "CV_model.best_params_" 1862 | ] 1863 | }, 1864 | { 1865 | "cell_type": "markdown", 1866 | "metadata": {}, 1867 | "source": [ 1868 | "### Model results with GridSearchCV\n", 1869 | "\n", 1870 | "You discovered that the **best parameters for your model** are that the split criterion should be set to `'gini'`, the number of estimators (trees) should be 30, the maximum depth of the model should be 8 and the maximum features should be set to `\"log2\"`.\n", 1871 | "\n", 1872 | "Let's give this a try and see how well our model performs. You can use the `get_model_results()` function again to save time.\n", 1873 | "\n", 1874 | "**Instructions**\n", 1875 | "\n", 1876 | "* Input the optimal settings into the model definition.\n", 1877 | "* Fit the model, obtain predictions and get the performance parameters with `get_model_results()`." 1878 | ] 1879 | }, 1880 | { 1881 | "cell_type": "code", 1882 | "execution_count": null, 1883 | "metadata": {}, 1884 | "outputs": [], 1885 | "source": [ 1886 | "# Input the optimal parameters in the model\n", 1887 | "model = RandomForestClassifier(class_weight={0:1,1:12},\n", 1888 | " criterion='gini',\n", 1889 | " max_depth=8,\n", 1890 | " max_features='log2', \n", 1891 | " min_samples_leaf=10,\n", 1892 | " n_estimators=30,\n", 1893 | " n_jobs=-1,\n", 1894 | " random_state=5)\n", 1895 | "\n", 1896 | "# Get results from your model\n", 1897 | "get_model_results(X_train, y_train, X_test, y_test, model)" 1898 | ] 1899 | }, 1900 | { 1901 | "cell_type": "markdown", 1902 | "metadata": {}, 1903 | "source": [ 1904 | "**The model has been improved even further. The number of false negatives has now been reduced even further, which means we are catching more cases of fraud. However, you see that the number of false positives actually went up. That is the Precision-Recall trade-off in action. 
To decide which final model is best, you need to take into account how bad it is not to catch fraudsters, versus how many false positives the fraud analytics team can deal with. Ultimately, this final decision should be made by you and the fraud team together.**" 1905 | ] 1906 | }, 1907 | { 1908 | "cell_type": "markdown", 1909 | "metadata": {}, 1910 | "source": [ 1911 | "## Ensemble methods\n", 1912 | "\n", 1913 | "![ensemble](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/ensemble.JPG)\n", 1914 | "* Ensemble methods are techniques that create multiple machine learning models and then combine them to produce a final result\n", 1915 | "* Usually produce more accurate predictions than a single model\n", 1916 | "* The goal of an ML problem is to find a single model that will best predict our wanted outcome\n", 1917 | " * Use ensemble methods rather than making one model and hoping it's the best, most accurate predictor\n", 1918 | "* Ensemble methods take a myriad of models into account and average them to produce one final model\n", 1919 | " * Ensures the predictions are robust\n", 1920 | " * Less likely to be the result of overfitting\n", 1921 | " * Can improve prediction performance\n", 1922 | " * Especially by combining models with different recall and precision scores\n", 1923 | " * Are a winning formula at Kaggle competitions\n", 1924 | "* The Random Forest classifier is an ensemble of Decision Trees\n", 1925 | " * **Bootstrap Aggregation** or **Bagging Ensemble** method\n", 1926 | " * In a Random Forest, models are trained on random subsamples of data and the results are aggregated by taking the average prediction of all the trees" 1927 | ] 1928 | }, 1929 | { 1930 | "cell_type": "markdown", 1931 | "metadata": {}, 1932 | "source": [ 1933 | "#### Stacking Ensemble Methods\n", 1934 | "\n", 1935 | "![stacking ensemble](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/ensemble_stacking.JPG)\n", 1936 | "* Multiple models are combined via a \"voting\" rule on the model outcome\n", 1937 | "* The base level models are each trained based on the complete training set\n", 1938 | " * Unlike the Bagging method, models are not trained on a subsample of the data\n", 1939 | "* Algorithms of different types can be combined" 1940 | ] 1941 | }, 1942 | { 1943 | "cell_type": "markdown", 1944 | "metadata": {}, 1945 | "source": [ 1946 | "#### Voting Classifier\n", 1947 | "\n", 1948 | "* available in sklearn\n", 1949 | " * easy way of implementing an ensemble model\n", 1950 | "\n", 1951 | "```python\n", 1952 | "from sklearn.ensemble import VotingClassifier\n", 1953 | "\n", 1954 | "# Define Models\n", 1955 | "clf1 = LogisticRegression(random_state=1)\n", 1956 | "clf2 = RandomForestClassifier(random_state=1)\n", 1957 | "clf3 = GaussianNB()\n", 1958 | "\n", 1959 | "# Combine models into ensemble\n", 1960 | "ensemble_model = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')\n", 1961 | "\n", 1962 | "# Fit and predict as with other models\n", 1963 | "ensemble_model.fit(X_train, y_train)\n", 1964 | "ensemble_model.predict(X_test)\n", 1965 | "```\n", 1966 | "\n", 1967 | "* the `voting='hard'` option uses the predicted class labels and takes the majority vote\n", 1968 | "* the `voting='soft'` option takes the average probability by combining the predicted probabilities of the individual models\n", 1969 | "* Weights can be assigned to the `VotingClassifier` with `weights=[2,1,1]`\n", 1970 | " * Useful when one model significantly outperforms the others (a small numeric sketch follows below)" 1971 | ] 1972 | },
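{ "cell_type": "markdown", "metadata": {}, "source": [ "#### How weighted soft voting combines probabilities (sketch)\n", "\n", "A small numeric illustration, not from the course, of what `voting='soft'` with `weights=[2, 1, 1]` does: the ensemble takes a weighted average of each model's predicted fraud probability. The probabilities below are made up:\n", "\n", "```python\n", "import numpy as np\n", "\n", "# Hypothetical fraud probabilities from three fitted models for one transaction\n", "p_models = np.array([0.4, 0.8, 0.9])\n", "weights = np.array([2, 1, 1])\n", "\n", "# Weighted average probability; 0.5 or higher is classified as fraud\n", "p_ensemble = np.average(p_models, weights=weights)\n", "print(p_ensemble)  # (2*0.4 + 1*0.8 + 1*0.9) / 4 = 0.625\n", "```" ] },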
1973 | { 1974 | "cell_type": "markdown", 1975 | "metadata": {}, 1976 | "source": [ 1977 | "#### Reliable Labels\n", 1978 | "\n", 1979 | "* In real life it's unlikely the data will have truly unbiased, reliable labels for the model\n", 1980 | "* In credit card fraud you will often have reliable labels, in which case, use the methods learned so far\n", 1981 | "* In most cases you'll need to rely on unsupervised learning techniques to detect fraud" 1982 | ] 1983 | }, 1984 | { 1985 | "cell_type": "markdown", 1986 | "metadata": {}, 1987 | "source": [ 1988 | "### Logistic Regression\n", 1989 | "\n", 1990 | "In this last lesson you'll **combine three algorithms** into one model with the **VotingClassifier**. This allows us to benefit from the different aspects of all models, and hopefully improve overall performance and detect more fraud. The first model, the Logistic Regression, has a slightly higher recall score than our optimal Random Forest model, but gives a lot more false positives. You'll also add a Decision Tree with balanced weights to it. The data is already split into a training and test set, i.e. `X_train`, `y_train`, `X_test`, `y_test` are available.\n", 1991 | "\n", 1992 | "In order to understand how the Voting Classifier can potentially improve your original model, you should check the standalone results of the Logistic Regression model first.\n", 1993 | "\n", 1994 | "**Instructions**\n", 1995 | "\n", 1996 | "* Define a LogisticRegression model with class weights that are 1:15 for the fraud cases.\n", 1997 | "* Fit the model to the training set, and obtain the model predictions.\n", 1998 | "* Print the classification report and confusion matrix." 1999 | ] 2000 | }, 2001 | { 2002 | "cell_type": "code", 2003 | "execution_count": null, 2004 | "metadata": {}, 2005 | "outputs": [], 2006 | "source": [ 2007 | "# Define the Logistic Regression model with weights\n", 2008 | "model = LogisticRegression(class_weight={0:1, 1:15}, random_state=5, solver='liblinear')\n", 2009 | "\n", 2010 | "# Get the model results\n", 2011 | "get_model_results(X_train, y_train, X_test, y_test, model)" 2012 | ] 2013 | }, 2014 | { 2015 | "cell_type": "markdown", 2016 | "metadata": {}, 2017 | "source": [ 2018 | "**As you can see, the Logistic Regression has quite different performance from the Random Forest. More false positives, but also a better Recall. It will therefore be a useful addition to the Random Forest in an ensemble model.**" 2019 | ] 2020 | }, 2021 | { 2022 | "cell_type": "markdown", 2023 | "metadata": {}, 2024 | "source": [ 2025 | "### Voting Classifier\n", 2026 | "\n", 2027 | "Let's now **combine three machine learning models into one**, to improve our Random Forest fraud detection model from before. You'll combine our usual Random Forest model with the Logistic Regression from the previous exercise and a simple Decision Tree. You can use the shortcut `get_model_results()` to see the immediate result of the ensemble model.\n", 2028 | "\n", 2029 | "**Instructions**\n", 2030 | "\n", 2031 | "* Import the Voting Classifier package.\n", 2032 | "* Define the three models; use the Logistic Regression from before, the Random Forest from previous exercises and a Decision tree with balanced class weights.\n", 2033 | "* Define the ensemble model by inputting the three classifiers with their respective labels."
2034 | ] 2035 | }, 2036 | { 2037 | "cell_type": "code", 2038 | "execution_count": null, 2039 | "metadata": {}, 2040 | "outputs": [], 2041 | "source": [ 2042 | "# Define the three classifiers to use in the ensemble\n", 2043 | "clf1 = LogisticRegression(class_weight={0:1, 1:15},\n", 2044 | " random_state=5,\n", 2045 | " solver='liblinear')\n", 2046 | "\n", 2047 | "clf2 = RandomForestClassifier(class_weight={0:1, 1:12}, \n", 2048 | " criterion='gini', \n", 2049 | " max_depth=8, \n", 2050 | " max_features='log2',\n", 2051 | " min_samples_leaf=10, \n", 2052 | " n_estimators=30, \n", 2053 | " n_jobs=-1,\n", 2054 | " random_state=5)\n", 2055 | "\n", 2056 | "clf3 = DecisionTreeClassifier(random_state=5,\n", 2057 | " class_weight=\"balanced\")\n", 2058 | "\n", 2059 | "# Combine the classifiers in the ensemble model\n", 2060 | "ensemble_model = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('dt', clf3)], voting='hard')\n", 2061 | "\n", 2062 | "# Get the results \n", 2063 | "get_model_results(X_train, y_train, X_test, y_test, ensemble_model)" 2064 | ] 2065 | }, 2066 | { 2067 | "cell_type": "markdown", 2068 | "metadata": {}, 2069 | "source": [ 2070 | "**By combining the classifiers, you can take the best of multiple models. You've increased the cases of fraud you are catching from 76 to 78, and you only have 5 extra false positives in return. If you do care about catching as many fraud cases as you can, whilst keeping the false positives low, this is a pretty good trade-off. The Logistic Regression as a standalone was quite bad in terms of false positives, and the Random Forest was worse in terms of false negatives. By combining these together you indeed managed to improve performance.**" 2071 | ] 2072 | }, 2073 | { 2074 | "cell_type": "markdown", 2075 | "metadata": {}, 2076 | "source": [ 2077 | "### Adjusting weights within the Voting Classifier\n", 2078 | "\n", 2079 | "You've just seen that the Voting Classifier allows you to improve your fraud detection performance, by combining good aspects from multiple models. Now let's try to **adjust the weights** we give to these models. By increasing or decreasing weights you can play with **how much emphasis you give to a particular model** relative to the rest. This comes in handy when a certain model has overall better performance than the rest, but you still want to combine aspects of the others to further improve your results.\n", 2080 | "\n", 2081 | "For this exercise the data is already split into a training and test set, and `clf1`, `clf2` and `clf3` are available and defined as before, i.e. they are the Logistic Regression, the Random Forest model and the Decision Tree respectively.\n", 2082 | "\n", 2083 | "**Instructions**\n", 2084 | "\n", 2085 | "* Define an ensemble method where you weight the second classifier (`clf2`) 4 to 1 relative to the rest of the classifiers.\n", 2086 | "* Fit the model to the training set, and obtain the predictions `predicted` from the ensemble model.\n", 2087 | "* Print the performance metrics; this is ready for you to run."
2088 | ] 2089 | }, 2090 | { 2091 | "cell_type": "code", 2092 | "execution_count": null, 2093 | "metadata": {}, 2094 | "outputs": [], 2095 | "source": [ 2096 | "# Define the ensemble model\n", 2097 | "ensemble_model = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('dt', clf3)], voting='soft', weights=[1, 4, 1], flatten_transform=True)\n", 2098 | "\n", 2099 | "# Get results \n", 2100 | "get_model_results(X_train, y_train, X_test, y_test, ensemble_model)" 2101 | ] 2102 | }, 2103 | { 2104 | "cell_type": "markdown", 2105 | "metadata": {}, 2106 | "source": [ 2107 | "**The weight option allows you to play with the individual models to get the best final mix for your fraud detection model. Now that you have finalized fraud detection with supervised learning, let's have a look at how fraud detection can be done when you don't have any labels to train on.**" 2108 | ] 2109 | }, 2110 | { 2111 | "cell_type": "markdown", 2112 | "metadata": {}, 2113 | "source": [ 2114 | "# Fraud detection using unlabeled data\n", 2115 | "\n", 2116 | "Use unsupervised learning techniques to detect fraud. Segment customers, use K-means clustering and other clustering algorithms to find suspicious occurrences in your data." 2117 | ] 2118 | }, 2119 | { 2120 | "cell_type": "markdown", 2121 | "metadata": {}, 2122 | "source": [ 2123 | "## Normal versus abnormal behavior\n", 2124 | "\n", 2125 | "* Explore fraud detection without reliable data labels\n", 2126 | "* Unsupervised learning to detect suspicious behavior\n", 2127 | "* Abnormal behavior isn't necessarily fraudulent\n", 2128 | "* Challenging because it's difficult to validate" 2129 | ] 2130 | }, 2131 | { 2132 | "cell_type": "markdown", 2133 | "metadata": {}, 2134 | "source": [ 2135 | "#### What's normal behavior?\n", 2136 | "\n", 2137 | "* thoroughly describe the data (a quick sketch follows after this list):\n", 2138 | " * plot histograms\n", 2139 | " * check for outliers\n", 2140 | " * investigate correlations\n", 2141 | "* Are there any known historic cases of fraud? What typifies those cases?\n", 2142 | "* Investigate whether the data is homogeneous, or whether different types of clients display different behavior\n", 2143 | "* Check patterns within subgroups of data: is your data homogeneous?\n", 2144 | "* Verify data points are the same type:\n", 2145 | " * individuals\n", 2146 | " * groups\n", 2147 | " * companies\n", 2148 | " * governmental organizations\n", 2149 | "* Do the data points differ on:\n", 2150 | " * spending patterns\n", 2151 | " * age\n", 2152 | " * location\n", 2153 | " * frequency\n", 2154 | "* For credit card fraud, location can be an indication of fraud\n", 2155 | "* The same goes for e-commerce sites\n", 2156 | " * where's the IP address located and where is the product ordered to ship?\n", 2157 | "* Create a separate model for each segment\n", 2158 | "* Decide how to aggregate the many model results back into one final list" 2159 | ] 2160 | },
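{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Describing the data (sketch)\n", "\n", "A minimal sketch, not from the course, of the describe-the-data steps above, assuming the `banksim_df` dataframe loaded in the next exercise:\n", "\n", "```python\n", "# Summary statistics and correlations of the numeric columns\n", "print(banksim_df.describe())\n", "print(banksim_df.corr())\n", "\n", "# Histogram of the transaction amounts to check for outliers\n", "banksim_df.amount.hist(bins=50)\n", "plt.show()\n", "```" ] },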
2161 | { 2162 | "cell_type": "markdown", 2163 | "metadata": {}, 2164 | "source": [ 2165 | "### Exploring the data\n", 2166 | "\n", 2167 | "In the next exercises, you will be looking at bank **payment transaction data**. The financial transactions are categorized by type of expense, as well as the amount spent. Moreover, you have some client characteristics available such as age group and gender. Some of the transactions are labeled as fraud; you'll treat these labels as given and will use those to validate the results.\n", 2168 | "\n", 2169 | "When using unsupervised learning techniques for fraud detection, you want to **distinguish normal from abnormal** (thus potentially fraudulent) behavior. As a fraud analyst, in order to understand what is \"normal\" you need to have a good understanding of the data and its characteristics. Let's explore the data in this first exercise.\n", 2170 | "\n", 2171 | "**Instructions 1/3**\n", 2172 | "\n", 2173 | "* Obtain the shape of the dataframe `df` to inspect the size of our data and display the first rows to see which features are available." 2174 | ] 2175 | }, 2176 | { 2177 | "cell_type": "code", 2178 | "execution_count": null, 2179 | "metadata": {}, 2180 | "outputs": [], 2181 | "source": [ 2182 | "banksim_df = pd.read_csv(banksim_file)\n", 2183 | "banksim_df.drop(['Unnamed: 0'], axis=1, inplace=True)\n", 2184 | "banksim_adj_df = pd.read_csv(banksim_adj_file)\n", 2185 | "banksim_adj_df.drop(['Unnamed: 0'], axis=1, inplace=True)" 2186 | ] 2187 | }, 2188 | { 2189 | "cell_type": "code", 2190 | "execution_count": null, 2191 | "metadata": {}, 2192 | "outputs": [], 2193 | "source": [ 2194 | "banksim_df.shape" 2195 | ] 2196 | }, 2197 | { 2198 | "cell_type": "code", 2199 | "execution_count": null, 2200 | "metadata": {}, 2201 | "outputs": [], 2202 | "source": [ 2203 | "banksim_df.head()" 2204 | ] 2205 | }, 2206 | { 2207 | "cell_type": "code", 2208 | "execution_count": null, 2209 | "metadata": {}, 2210 | "outputs": [], 2211 | "source": [ 2212 | "banksim_adj_df.shape" 2213 | ] 2214 | }, 2215 | { 2216 | "cell_type": "code", 2217 | "execution_count": null, 2218 | "metadata": {}, 2219 | "outputs": [], 2220 | "source": [ 2221 | "banksim_adj_df.head()" 2222 | ] 2223 | }, 2224 | { 2225 | "cell_type": "markdown", 2226 | "metadata": {}, 2227 | "source": [ 2228 | "**Instructions 2/3**\n", 2229 | "\n", 2230 | "* Group the data by transaction category and take the mean of the data." 2231 | ] 2232 | }, 2233 | { 2234 | "cell_type": "code", 2235 | "execution_count": null, 2236 | "metadata": {}, 2237 | "outputs": [], 2238 | "source": [ 2239 | "banksim_df.groupby(['category']).mean()" 2240 | ] 2241 | }, 2242 | { 2243 | "cell_type": "markdown", 2244 | "metadata": {}, 2245 | "source": [ 2246 | "**Instructions 3/3**\n", 2247 | "\n", 2248 | "Based on these results, can you already say something about fraud in our data?\n", 2249 | "\n", 2250 | "**Possible Answers**\n", 2251 | "\n", 2252 | "* ~~No, I don't have enough information.~~\n", 2253 | "* **Yes, the majority of fraud is observed in travel, leisure and sports related transactions.**" 2254 | ] 2255 | }, 2256 | { 2257 | "cell_type": "markdown", 2258 | "metadata": {}, 2259 | "source": [ 2260 | "### Customer segmentation\n", 2261 | "\n", 2262 | "In this exercise you're going to check whether there are any **obvious patterns** for the clients in this data, thus whether you need to segment your data into groups, or whether the data is rather homogeneous.\n", 2263 | "\n", 2264 | "You unfortunately don't have a lot of client information available; you can't, for example, distinguish between the wealth levels of different clients. However, there is data on **age** available, so let's see whether there is any significant difference between the behavior of age groups.\n", 2265 | "\n", 2266 | "**Instructions 1/3**\n", 2267 | "\n", 2268 | "* Group the dataframe `df` by the category `age` and get the means for each age group."
2269 | ] 2270 | }, 2271 | { 2272 | "cell_type": "code", 2273 | "execution_count": null, 2274 | "metadata": {}, 2275 | "outputs": [], 2276 | "source": [ 2277 | "banksim_df.groupby(['age']).mean()" 2278 | ] 2279 | }, 2280 | { 2281 | "cell_type": "markdown", 2282 | "metadata": {}, 2283 | "source": [ 2284 | "**Instructions 2/3**\n", 2285 | "\n", 2286 | "* Count the values of each age group." 2287 | ] 2288 | }, 2289 | { 2290 | "cell_type": "code", 2291 | "execution_count": null, 2292 | "metadata": {}, 2293 | "outputs": [], 2294 | "source": [ 2295 | "banksim_df.age.value_counts()" 2296 | ] 2297 | }, 2298 | { 2299 | "cell_type": "markdown", 2300 | "metadata": {}, 2301 | "source": [ 2302 | "**Instructions 3/3**\n", 2303 | "\n", 2304 | "Based on the results you see, does it make sense to divide your data into age segments before running a fraud detection algorithm?\n", 2305 | "\n", 2306 | "**Possible Answers**\n", 2307 | "\n", 2308 | "* **No, the age groups who are the largest are relatively similar.**\n", 2309 | "* ~~Yes, the age group \"0\" is very different and I would split that one out.~~\n", 2310 | "\n", 2311 | "**The average amount spent as well as fraud occurrence is rather similar across groups. Age group '0' stands out, but since there are only 40 cases, it does not make sense to split these out in a separate group and run a separate model on them.**" 2312 | ] 2313 | }, 2314 | { 2315 | "cell_type": "markdown", 2316 | "metadata": {}, 2317 | "source": [ 2318 | "### Using statistics to define normal behavior\n", 2319 | "\n", 2320 | "In the previous exercises we saw that fraud is **more prevalent in certain transaction categories**, but that there is no obvious way to segment our data into, for example, age groups. This time, let's investigate the **average amounts spent** in normal transactions versus fraud transactions. This gives you an idea of how fraudulent transactions **differ structurally** from normal transactions.\n", 2321 | "\n", 2322 | "**Instructions**\n", 2323 | "\n", 2324 | "* Create two new dataframes from fraud and non-fraud observations. Locate the data in `df` with `.loc` and assign the condition \"where fraud is 1\" and \"where fraud is 0\" for creation of the new dataframes.\n", 2325 | "* Plot the `amount` column of the newly created dataframes in the histogram plot functions and assign the labels `fraud` and `nonfraud` respectively to the plots." 2326 | ] 2327 | }, 2328 | { 2329 | "cell_type": "code", 2330 | "execution_count": null, 2331 | "metadata": {}, 2332 | "outputs": [], 2333 | "source": [ 2334 | "# Create two dataframes with fraud and non-fraud data \n", 2335 | "df_fraud = banksim_df[banksim_df.fraud == 1] \n", 2336 | "df_non_fraud = banksim_df[banksim_df.fraud == 0]" 2337 | ] 2338 | }, 2339 | { 2340 | "cell_type": "code", 2341 | "execution_count": null, 2342 | "metadata": {}, 2343 | "outputs": [], 2344 | "source": [ 2345 | "# Plot histograms of the amounts in fraud and non-fraud data \n", 2346 | "plt.hist(df_fraud.amount, alpha=0.5, label='fraud')\n", 2347 | "plt.hist(df_non_fraud.amount, alpha=0.5, label='nonfraud')\n", 2348 | "plt.xlabel('amount')\n", 2349 | "plt.legend()\n", 2350 | "plt.show()" 2351 | ] 2352 | }, 2353 | { 2354 | "cell_type": "markdown", 2355 | "metadata": {}, 2356 | "source": [ 2357 | "**As the number of fraud observations is much smaller, it is difficult to see the full distribution. Nonetheless, you can see that the fraudulent transactions tend to be on the larger side relative to normal observations. 
This is good news, as it helps us later in detecting fraud from non-fraud. In the next chapter you're going to implement a clustering model to distinguish between normal and abnormal transactions, when the fraud labels are no longer available.**" 2358 | ] 2359 | }, 2360 | { 2361 | "cell_type": "markdown", 2362 | "metadata": {}, 2363 | "source": [ 2364 | "## Clustering methods to detect fraud" 2365 | ] 2366 | }, 2367 | { 2368 | "cell_type": "markdown", 2369 | "metadata": {}, 2370 | "source": [ 2371 | "#### K-means clustering\n", 2372 | "\n", 2373 | "![k-means](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/k-means.JPG)\n", 2374 | "\n", 2375 | "* The objective of any clustering model is to detect patterns in the data\n", 2376 | "* More specifically, to group the data into distinct clusters made of data points that are very similar to each other, but distinct from the points in the other clusters.\n", 2377 | "* **The objective of k-means is to minimize the sum of all distances between the data samples and their associated cluster centroids**\n", 2378 | " * The score is the inverse of that minimization, so the score should be close to 0.\n", 2379 | "* **Using the distance to cluster centroids**\n", 2380 | " * Training samples are shown as dots and cluster centroids are shown as crosses\n", 2381 | " * Attempt to cluster the data in image A\n", 2382 | " * Start by putting in an initial guess for two cluster centroids, as in B\n", 2383 | " * Predefine the number of clusters at the start\n", 2384 | " * Then calculate the distances of each sample in the data to the closest centroid\n", 2385 | " * Figure C shows the data split into the two clusters\n", 2386 | " * Based on the initial clusters, the location of the centroids can be redefined (fig D) to minimize the sum of all distances in the two clusters.\n", 2387 | " * Repeat the step of reassigning points that are nearest to the centroid (fig E) until it converges to the point where no sample gets reassigned to another cluster (fig F)\n", 2388 | " * ![clustering](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/clustering.JPG)" 2389 | ] 2390 | }, 2391 | { 2392 | "cell_type": "markdown", 2393 | "metadata": {}, 2394 | "source": [ 2395 | "#### K-means clustering in Python\n", 2396 | "\n", 2397 | "* It's of utmost importance to scale the data before doing K-means clustering, or any algorithm that uses distances\n", 2398 | "* Without scaling, features on a larger scale will weigh more heavily in the algorithm. 
All features should weigh equally at the initial stage\n", 2399 | "* fix `random_state` so models can be compared\n", 2400 | "\n", 2401 | "```python\n", 2402 | "# Import the packages\n", 2403 | "from sklearn.preprocessing import MinMaxScaler\n", 2404 | "from sklearn.cluster import KMeans\n", 2405 | "\n", 2406 | "# Transform and scale your data\n", 2407 | "X = np.array(df).astype(np.float64)\n", 2408 | "scaler = MinMaxScaler()\n", 2409 | "X_scaled = scaler.fit_transform(X)\n", 2410 | "\n", 2411 | "# Define the k-means model and fit to the data\n", 2412 | "kmeans = KMeans(n_clusters=6, random_state=42).fit(X_scaled)\n", 2413 | "```" 2414 | ] 2415 | }, 2416 | { 2417 | "cell_type": "markdown", 2418 | "metadata": {}, 2419 | "source": [ 2420 | "#### The right amount of clusters\n", 2421 | "\n", 2422 | "* The drawback of K-means clustering is the need to assign the number of clusters beforehand\n", 2423 | "* There are multiple ways to check what the right number of clusters should be\n", 2424 | " * Silhouette method\n", 2425 | " * Elbow curve\n", 2426 | "* Run a k-means model for cluster counts varying from 1 to 10 and generate an **elbow curve** by saving the scores for each model under \"score\"\n", 2427 | "* Plot the scores against the number of clusters\n", 2428 | "\n", 2429 | " \n", 2430 | "```python\n", 2431 | "clust = range(1, 10) \n", 2432 | "kmeans = [KMeans(n_clusters=i) for i in clust]\n", 2433 | "\n", 2434 | "score = [kmeans[i].fit(X_scaled).score(X_scaled) for i in range(len(kmeans))]\n", 2435 | "\n", 2436 | "plt.plot(clust, score)\n", 2437 | "plt.xlabel('Number of Clusters')\n", 2438 | "plt.ylabel('Score')\n", 2439 | "plt.title('Elbow Curve')\n", 2440 | "plt.show()\n", 2441 | "```\n", 2442 | "\n", 2443 | "![elbow curve](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/elbow.JPG)\n", 2444 | "\n", 2445 | "* The slight elbow at 3 means that 3 clusters could be optimal, but it's not very pronounced" 2446 | ] 2447 | }, 2448 | { 2449 | "cell_type": "markdown", 2450 | "metadata": {}, 2451 | "source": [ 2452 | "### Scaling the data\n", 2453 | "\n", 2454 | "For ML algorithms using distance-based metrics, it is **crucial to always scale your data**, as features using different scales will distort your results. K-means uses the Euclidean distance to assess distance to cluster centroids, therefore you first need to scale your data before continuing to implement the algorithm. Let's do that first.\n", 2455 | "\n", 2456 | "Available is the dataframe `df` from the previous exercise, with some minor data preparation done so it is ready for you to use with `sklearn`. The fraud labels are separately stored under `labels`; you can use those to check the results later.\n", 2457 | "\n", 2458 | "**Instructions**\n", 2459 | "\n", 2460 | "* Import the ``MinMaxScaler``.\n", 2461 | "* Transform your dataframe `df` into a numpy array `X` by taking only the values of `df` and make sure you have all `float` values.\n", 2462 | "* Apply the defined scaler onto `X` to obtain scaled values of `X_scaled` to force all your features to a 0-1 scale."
2463 | ] 2464 | }, 2465 | { 2466 | "cell_type": "code", 2467 | "execution_count": null, 2468 | "metadata": {}, 2469 | "outputs": [], 2470 | "source": [ 2471 | "labels = banksim_adj_df.fraud" 2472 | ] 2473 | }, 2474 | { 2475 | "cell_type": "code", 2476 | "execution_count": null, 2477 | "metadata": {}, 2478 | "outputs": [], 2479 | "source": [ 2480 | "cols = ['age', 'amount', 'M', 'es_barsandrestaurants', 'es_contents',\n", 2481 | " 'es_fashion', 'es_food', 'es_health', 'es_home', 'es_hotelservices',\n", 2482 | " 'es_hyper', 'es_leisure', 'es_otherservices', 'es_sportsandtoys',\n", 2483 | " 'es_tech', 'es_transportation', 'es_travel']" 2484 | ] 2485 | }, 2486 | { 2487 | "cell_type": "code", 2488 | "execution_count": null, 2489 | "metadata": {}, 2490 | "outputs": [], 2491 | "source": [ 2492 | "# Take the float values of df for X\n", 2493 | "X = banksim_adj_df[cols].values.astype(np.float64)" 2494 | ] 2495 | }, 2496 | { 2497 | "cell_type": "code", 2498 | "execution_count": null, 2499 | "metadata": {}, 2500 | "outputs": [], 2501 | "source": [ 2502 | "X.shape" 2503 | ] 2504 | }, 2505 | { 2506 | "cell_type": "code", 2507 | "execution_count": null, 2508 | "metadata": {}, 2509 | "outputs": [], 2510 | "source": [ 2511 | "# Define the scaler and apply to the data\n", 2512 | "scaler = MinMaxScaler()\n", 2513 | "X_scaled = scaler.fit_transform(X)" 2514 | ] 2515 | }, 2516 | { 2517 | "cell_type": "markdown", 2518 | "metadata": {}, 2519 | "source": [ 2520 | "### K-means clustering\n", 2521 | "\n", 2522 | "A very commonly used clustering algorithm is **K-means clustering**. For fraud detection, K-means clustering is straightforward to implement and relatively powerful in predicting suspicious cases. It is a good algorithm to start with when working on fraud detection problems. However, fraud data is oftentimes very large, especially when you are working with transaction data. **MiniBatch K-means** is an **efficient way** to implement K-means on a large dataset, which you will use in this exercise.\n", 2523 | "\n", 2524 | "The scaled data from the previous exercise, `X_scaled`, is available. Let's give it a try.\n", 2525 | "\n", 2526 | "**Instructions**\n", 2527 | "\n", 2528 | "* Import `MiniBatchKMeans` from `sklearn`.\n", 2529 | "* Initialize the minibatch kmeans model with 8 clusters.\n", 2530 | "* Fit the model to your scaled data." 2531 | ] 2532 | }, 2533 | { 2534 | "cell_type": "code", 2535 | "execution_count": null, 2536 | "metadata": {}, 2537 | "outputs": [], 2538 | "source": [ 2539 | "# Define the model \n", 2540 | "kmeans = MiniBatchKMeans(n_clusters=8, random_state=0)\n", 2541 | "\n", 2542 | "# Fit the model to the scaled data\n", 2543 | "kmeans.fit(X_scaled)" 2544 | ] 2545 | }, 2546 | { 2547 | "cell_type": "markdown", 2548 | "metadata": {}, 2549 | "source": [ 2550 | "**You have now fitted your MiniBatch K-means model to the data. In the upcoming exercises you're going to explore whether this model is any good at flagging fraud. But before doing that, you still need to figure out what the right number of clusters to use is. Let's do that in the next exercise.**" 2551 | ] 2552 | }, 2553 | { 2554 | "cell_type": "markdown", 2555 | "metadata": {}, 2556 | "source": [ 2557 | "### Elbow method\n", 2558 | "\n", 2559 | "In the previous exercise you've implemented MiniBatch K-means with 8 clusters, without actually checking what the right number of clusters should be. 
For our first fraud detection approach, it is important to **get the number of clusters right**, especially when you want to use the outliers of those clusters as fraud predictions. To decide how many clusters you're going to use, let's apply the **Elbow method** and see what the optimal number of clusters should be based on this method.\n", 2560 | "\n", 2561 | "`X_scaled` is again available for you to use and `MiniBatchKMeans` has been imported from `sklearn`.\n", 2562 | "\n", 2563 | "**Instructions**\n", 2564 | "\n", 2565 | "* Define the range to be between 1 and 10 clusters.\n", 2566 | "* Run MiniBatch K-means on all the clusters in the range using list comprehension.\n", 2567 | "* Fit each model on the scaled data and obtain the scores from the scaled data.\n", 2568 | "* Plot the cluster numbers and their respective scores." 2569 | ] 2570 | }, 2571 | { 2572 | "cell_type": "code", 2573 | "execution_count": null, 2574 | "metadata": {}, 2575 | "outputs": [], 2576 | "source": [ 2577 | "# Define the range of clusters to try\n", 2578 | "clustno = range(1, 10)\n", 2579 | "\n", 2580 | "# Run MiniBatch Kmeans over the number of clusters\n", 2581 | "kmeans = [MiniBatchKMeans(n_clusters=i) for i in clustno]\n", 2582 | "\n", 2583 | "# Obtain the score for each model\n", 2584 | "score = [kmeans[i].fit(X_scaled).score(X_scaled) for i in range(len(kmeans))]" 2585 | ] 2586 | }, 2587 | { 2588 | "cell_type": "code", 2589 | "execution_count": null, 2590 | "metadata": {}, 2591 | "outputs": [], 2592 | "source": [ 2593 | "# Plot the models and their respective score \n", 2594 | "plt.plot(clustno, score)\n", 2595 | "plt.xlabel('Number of Clusters')\n", 2596 | "plt.ylabel('Score')\n", 2597 | "plt.title('Elbow Curve')\n", 2598 | "plt.show()" 2599 | ] 2600 | }, 2601 | { 2602 | "cell_type": "markdown", 2603 | "metadata": {}, 2604 | "source": [ 2605 | "**Now you can see that the optimal number of clusters should probably be at around 3 clusters, as that is where the elbow is in the curve. We'll use this in the next exercise as our baseline model, and see how well this does in detecting fraud. A silhouette-based cross-check is sketched below.**" 2606 | ] 2607 | },
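{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Silhouette method (sketch)\n", "\n", "The notes above also mention the silhouette method for choosing the number of clusters. A minimal sketch, not from the course, assuming the `X_scaled` array from the previous exercises; `silhouette_score` is already imported at the top of the notebook. The score is undefined for a single cluster, so the range starts at 2, and a higher mean score indicates better-separated clusters:\n", "\n", "```python\n", "# Compute the mean silhouette score for 2 to 9 clusters\n", "for i in range(2, 10):\n", "    clusters = MiniBatchKMeans(n_clusters=i, random_state=0).fit_predict(X_scaled)\n", "    print(i, silhouette_score(X_scaled, clusters))\n", "```" ] },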
2608 | { 2609 | "cell_type": "markdown", 2610 | "metadata": {}, 2611 | "source": [ 2612 | "## Assigning fraud vs. non-fraud\n", 2613 | "\n", 2614 | "* ![clusters](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/clusters_1.JPG)\n", 2615 | "* Take the outliers of each cluster, and flag those as fraud.\n", 2616 | "* ![clusters](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/clusters_2.JPG)\n", 2617 | "1. Collect and store the cluster centroids in memory\n", 2618 | " * Starting point to decide what's normal and not\n", 2619 | "1. Calculate the distance of each point in the dataset to its own cluster centroid\n", 2620 | "* ![clusters](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/clusters_3.JPG)\n", 2621 | " * Euclidean distance is depicted by the circles in this case\n", 2622 | " * Define a cut-off point for the distances to define what's an outlier\n", 2623 | " * Done based on the distributions of the distances collected\n", 2624 | " * i.e. everything with a distance larger than the 95th percentile should be considered an outlier\n", 2625 | " * the tail of the distribution of distances\n", 2626 | " * anything outside the yellow circles is an outlier\n", 2627 | " * ![clusters](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/clusters_4.JPG)\n", 2628 | " * these are definitely outliers and can be described as abnormal or suspicious\n", 2629 | " * doesn't necessarily mean they are fraudulent" 2630 | ] 2631 | }, 2632 | { 2633 | "cell_type": "markdown", 2634 | "metadata": {}, 2635 | "source": [ 2636 | "#### Flagging Fraud Based on Distance to Centroid\n", 2637 | "\n", 2638 | "```python\n", 2639 | "# Run the kmeans model on scaled data\n", 2640 | "kmeans = KMeans(n_clusters=6, random_state=42, n_jobs=-1).fit(X_scaled)\n", 2641 | "\n", 2642 | "# Get the cluster number for each datapoint\n", 2643 | "X_clusters = kmeans.predict(X_scaled)\n", 2644 | "\n", 2645 | "# Save the cluster centroids\n", 2646 | "X_clusters_centers = kmeans.cluster_centers_\n", 2647 | "\n", 2648 | "# Calculate the distance to the cluster centroid for each point\n", 2649 | "dist = np.array([np.linalg.norm(x - y) for x, y in zip(X_scaled, X_clusters_centers[X_clusters])])\n", 2650 | "\n", 2651 | "# Create predictions based on distance: the 7% furthest points are flagged as fraud\n", 2652 | "km_y_pred = np.array(dist)\n", 2653 | "km_y_pred[dist >= np.percentile(dist, 93)] = 1\n", 2654 | "km_y_pred[dist < np.percentile(dist, 93)] = 0\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create fraud predictions: flag the top 5th percentile of distances as fraud\n", "km_y_pred = np.array(dist)\n", "km_y_pred[dist >= np.percentile(dist, 95)] = 1\n", 2715 | "km_y_pred[dist < np.percentile(dist, 95)] = 0" 2716 | ] 2717 | }, 2718 | { 2719 | "cell_type": "markdown", 2720 | "metadata": {}, 2721 | "source": [ 2722 | "### Checking model results\n", 2723 | "\n", 2724 | "In the previous exercise you've flagged all observations as fraud if they are in the top 5th percentile in distance from the cluster centroid, i.e. these are the very outliers of the three clusters. For this exercise you have the scaled data and labels already split into training and test set, so `y_test` is available. The predictions from the previous exercise, `km_y_pred`, are also available. Let's create some performance metrics and see how well you did.\n", 2725 | "\n", 2726 | "**Instructions 1/3**\n", 2727 | "\n", 2728 | "* Obtain the area under the ROC curve from your test labels and predicted labels."
2729 | ] 2730 | }, 2731 | { 2732 | "cell_type": "code", 2733 | "execution_count": null, 2734 | "metadata": {}, 2735 | "outputs": [], 2736 | "source": [ 2737 | "def plot_confusion_matrix(cm, classes=['Not Fraud', 'Fraud'],\n", 2738 | " normalize=False,\n", 2739 | " title='Fraud Confusion matrix',\n", 2740 | " cmap=plt.cm.Blues):\n", 2741 | " \"\"\"\n", 2742 | " This function prints and plots the confusion matrix.\n", 2743 | " Normalization can be applied by setting `normalize=True`.\n", 2744 | " From:\n", 2745 | " http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-\n", 2746 | " examples-model-selection-plot-confusion-matrix-py\n", 2747 | " \"\"\"\n", 2748 | " if normalize:\n", 2749 | " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", 2750 | " print(\"Normalized confusion matrix\")\n", 2751 | " else:\n", 2752 | " print('Confusion matrix, without normalization')\n", 2753 | "\n", 2754 | " # print(cm)\n", 2755 | "\n", 2756 | " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", 2757 | " plt.title(title)\n", 2758 | " plt.colorbar()\n", 2759 | " tick_marks = np.arange(len(classes))\n", 2760 | " plt.xticks(tick_marks, classes, rotation=45)\n", 2761 | " plt.yticks(tick_marks, classes)\n", 2762 | "\n", 2763 | " fmt = '.2f' if normalize else 'd'\n", 2764 | " thresh = cm.max() / 2.\n", 2765 | " for i, j in product(range(cm.shape[0]), range(cm.shape[1])):\n", 2766 | " plt.text(j, i, format(cm[i, j], fmt),\n", 2767 | " horizontalalignment=\"center\",\n", 2768 | " color=\"white\" if cm[i, j] > thresh else \"black\")\n", 2769 | "\n", 2770 | " plt.tight_layout()\n", 2771 | " plt.ylabel('True label')\n", 2772 | " plt.xlabel('Predicted label')\n", 2773 | " plt.show()" 2774 | ] 2775 | }, 2776 | { 2777 | "cell_type": "code", 2778 | "execution_count": null, 2779 | "metadata": {}, 2780 | "outputs": [], 2781 | "source": [ 2782 | "# Obtain the ROC score\n", 2783 | "roc_auc_score(y_test, km_y_pred)" 2784 | ] 2785 | }, 2786 | { 2787 | "cell_type": "markdown", 2788 | "metadata": {}, 2789 | "source": [ 2790 | "**Instructions 2/3**\n", 2791 | "\n", 2792 | "* Obtain the confusion matrix from the test labels and predicted labels and plot the results." 
2793 | ]
2794 | },
2795 | {
2796 | "cell_type": "code",
2797 | "execution_count": null,
2798 | "metadata": {},
2799 | "outputs": [],
2800 | "source": [
2801 | "# Create a confusion matrix\n",
2802 | "km_cm = confusion_matrix(y_test, km_y_pred)\n",
2803 | "\n",
2804 | "# Plot the confusion matrix in a figure to visualize results \n",
2805 | "plot_confusion_matrix(km_cm)"
2806 | ]
2807 | },
2808 | {
2809 | "cell_type": "markdown",
2810 | "metadata": {},
2811 | "source": [
2812 | "**Instructions 3/3**\n",
2813 | "\n",
2814 | "If you were to decrease the percentile used as a cutoff point in the previous exercise to 93% instead of 95%, what would that do to your prediction results?\n",
2815 | "\n",
2816 | "**Possible Answers**\n",
2817 | "\n",
2818 | "* **The number of fraud cases caught increases, but false positives also increase.**\n",
2819 | "* ~~The number of fraud cases caught decreases, and false positives decrease.~~\n",
2820 | "* ~~The number of fraud cases caught increases, but false positives would decrease.~~\n",
2821 | "* ~~Nothing would happen to the number of fraud cases caught.~~"
2822 | ]
2823 | },
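{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to convince yourself of the answer (a hypothetical check, assuming the `dist` list from the distance-based exercise is still available):\n",
"\n",
"```python\n",
"# Lowering the cutoff percentile flags more observations as fraud:\n",
"# more true positives caught, but also more false positives.\n",
"dist = np.asarray(dist)\n",
"print((dist >= np.percentile(dist, 95)).sum())  # flags the top 5%\n",
"print((dist >= np.percentile(dist, 93)).sum())  # flags the top 7%\n",
"```"
]
},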
2824 | {
2825 | "cell_type": "markdown",
2826 | "metadata": {},
2827 | "source": [
2828 | "## Alternate clustering methods for fraud detection\n",
2829 | "\n",
2830 | "* In addition to K-means, there are many other clustering methods that can be used for fraud detection\n",
2831 | "* ![clustering methods](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/clustering_methods.JPG)\n",
2832 | "* K-means works well when the data is clustered in normal, round shapes\n",
2833 | "* There are ways to flag fraud other than through cluster outliers\n",
2834 | "* ![clustering outlier](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/cluster_outlier.JPG)\n",
2835 | " * Small clusters can be an indication of fraud\n",
2836 | " * This approach can be used when fraudulent behavior has commonalities, which cause it to cluster\n",
2837 | " * The fraudulent data would cluster in tiny groups, rather than be the outliers of larger clusters\n",
2838 | "* ![typical data](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/typical_data.JPG)\n",
2839 | " * In this case there are 3 obvious clusters\n",
2840 | " * The smallest dots are outliers, outside of what can be described as normal behavior\n",
2841 | " * There are also small to medium clusters closely connected to the red cluster\n",
2842 | " * Visualizing the data with something like [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) can be quite helpful"
2843 | ]
2844 | },
2845 | {
2846 | "cell_type": "markdown",
2847 | "metadata": {},
2848 | "source": [
2849 | "#### DBSCAN: Density-Based Spatial Clustering of Applications with Noise\n",
2850 | "\n",
2851 | "* [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN)\n",
2852 | "* DBSCAN vs. K-means\n",
2853 | " * The number of clusters does not need to be predefined\n",
2854 | " * The algorithm finds core samples of high density and expands clusters from them\n",
2855 | " * Works well on data containing clusters of similar density\n",
2856 | " * This type of algorithm can be used to identify fraud as very small clusters\n",
2857 | " * The maximum allowed distance between points in a cluster must be assigned\n",
2858 | " * The minimum number of data points in a cluster must be assigned\n",
2859 | " * Better performance on weirdly shaped data\n",
2860 | " * Computationally heavier than MiniBatch K-means"
2861 | ]
2862 | },
2863 | {
2864 | "cell_type": "markdown",
2865 | "metadata": {},
2866 | "source": [
2867 | "#### Implementation of DBSCAN\n",
2868 | "\n",
2869 | "```python\n",
2870 | "from sklearn.cluster import DBSCAN\n",
"from sklearn import metrics\n",
"\n",
2871 | "db = DBSCAN(eps=0.5, min_samples=10, n_jobs=-1).fit(X_scaled)\n",
2872 | "\n",
2873 | "# Get the cluster labels (aka numbers)\n",
2874 | "pred_labels = db.labels_\n",
2875 | "\n",
2876 | "# Count the total number of clusters (label -1 marks noise, not a cluster)\n",
2877 | "n_clusters_ = len(set(pred_labels)) - (1 if -1 in pred_labels else 0)\n",
2878 | "\n",
2879 | "# Print model results\n",
2880 | "print(f'Estimated number of clusters: {n_clusters_}')\n",
2881 | ">>> Estimated number of clusters: 31\n",
2882 | " \n",
2883 | "# Print model results\n",
2884 | "print(f'Silhouette Coefficient: {metrics.silhouette_score(X_scaled, pred_labels):0.3f}')\n",
2885 | ">>> Silhouette Coefficient: 0.359\n",
2886 | " \n",
2887 | "# Get sample counts in each cluster \n",
2888 | "counts = np.bincount(pred_labels[pred_labels >= 0])\n",
2889 | "print(counts)\n",
2890 | ">>> [ 763, 496, 840, 355, 1086, 676, 63, 306, 560, 134, 28, 18, 262, 128,\n",
2891 | " 332, 22, 22, 13, 31, 38, 36, 28, 14, 12, 30, 10, 11, 10, 21, 10, 5]\n",
2892 | "```\n",
2893 | "\n",
2894 | "* start by defining the epsilon `eps`\n",
2895 | " * the maximum distance allowed between two points for them to count as neighbors; clusters expand from there\n",
2896 | "* define the minimum number of samples in a cluster\n",
2897 | "* conventional DBSCAN can't produce the optimal value of epsilon by itself, so automating that choice requires more sophisticated modifications of DBSCAN (a common manual heuristic is sketched below)\n",
2898 | "* Fit DBSCAN to the **scaled data**\n",
2899 | "* Use the `labels_` attribute to get the assigned cluster label for each data point\n",
2900 | "* The cluster count can also be determined by counting the unique values in the `labels_` predictions\n",
2901 | "* You can compute performance metrics such as the **average silhouette score**\n",
2902 | "* The size of each cluster can be calculated with `np.bincount`\n",
2903 | " * counts the number of occurrences of non-negative values in a `numpy` array\n",
2904 | "* sort `counts` and decide how many of the smaller clusters to flag as fraud\n",
2905 | " * selecting the clusters to flag is a trial-and-error step and depends on the number of cases the fraud team can manage"
2906 | ]
2907 | },
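{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since plain DBSCAN won't pick `eps` for you, here is a common heuristic (an addition, not from the course): plot each point's distance to its k-th nearest neighbor, sorted, and look for a knee; that distance is a reasonable starting value for `eps`.\n",
"\n",
"```python\n",
"from sklearn.neighbors import NearestNeighbors\n",
"\n",
"# k is usually set to min_samples; note the fit set includes each point itself\n",
"k = 10\n",
"nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)\n",
"distances, _ = nn.kneighbors(X_scaled)\n",
"plt.plot(np.sort(distances[:, -1]))\n",
"plt.xlabel('Points sorted by distance')\n",
"plt.ylabel(f'Distance to {k}th nearest neighbor')\n",
"plt.show()\n",
"```"
]
},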
2908 | {
2909 | "cell_type": "markdown",
2910 | "metadata": {},
2911 | "source": [
2912 | "### DB scan\n",
2913 | "\n",
2914 | "In this exercise you're going to explore using a **density based clustering** method (DBSCAN) to detect fraud. The advantage of DBSCAN is that you **do not need to define the number of clusters** beforehand. Also, DBSCAN can handle weirdly shaped data (i.e. non-convex) much better than K-means can. This time, you are not going to take the outliers of the clusters and use those as fraud predictions, but instead take the **smallest clusters** in the data and label those as fraud. You again have the scaled dataset, i.e. `X_scaled`, available. Let's give it a try!\n",
2915 | "\n",
2916 | "**Instructions**\n",
2917 | "\n",
2918 | "* Import `DBSCAN`.\n",
2919 | "* Initialize a DBSCAN model setting the maximum distance between two samples to 0.9 and the minimum observations in the clusters to 10, and fit the model to the scaled data.\n",
2920 | "* Obtain the predicted labels; these are the cluster numbers assigned to an observation.\n",
2921 | "* Print the number of clusters and the rest of the performance metrics."
2922 | ]
2923 | },
2924 | {
2925 | "cell_type": "code",
2926 | "execution_count": null,
2927 | "metadata": {},
2928 | "outputs": [],
2929 | "source": [
2930 | "# Initialize and fit the DBscan model\n",
2931 | "db = DBSCAN(eps=0.9, min_samples=10, n_jobs=-1).fit(X_scaled)\n",
2932 | "\n",
2933 | "# Obtain the predicted labels and calculate number of clusters (-1 marks noise)\n",
2934 | "pred_labels = db.labels_\n",
2935 | "n_clusters = len(set(pred_labels)) - (1 if -1 in pred_labels else 0)"
2936 | ]
2937 | },
2938 | {
2939 | "cell_type": "code",
2940 | "execution_count": null,
2941 | "metadata": {},
2942 | "outputs": [],
2943 | "source": [
2944 | "# Print performance metrics for DBscan\n",
2945 | "print(f'Estimated number of clusters: {n_clusters}')\n",
2946 | "print(f'Homogeneity: {homogeneity_score(labels, pred_labels):0.3f}')\n",
2947 | "print(f'Silhouette Coefficient: {silhouette_score(X_scaled, pred_labels):0.3f}')"
2948 | ]
2949 | },
2950 | {
2951 | "cell_type": "markdown",
2952 | "metadata": {},
2953 | "source": [
2954 | "**The number of clusters is much higher than with K-means. For fraud detection this is fine for now, as we are only interested in the smallest clusters, since those are considered abnormal. Now have a look at those clusters and decide which ones to flag as fraud.**"
2955 | ]
2956 | },
2957 | {
2958 | "cell_type": "markdown",
2959 | "metadata": {},
2960 | "source": [
2961 | "### Assessing smallest clusters\n",
2962 | "\n",
2963 | "In this exercise you're going to have a look at the clusters that came out of DBscan, and flag certain clusters as fraud:\n",
2964 | "\n",
2965 | "* you first need to figure out how big the clusters are, and **filter out the smallest**\n",
2966 | "* then, you're going to take the smallest ones and **flag those as fraud**\n",
2967 | "* last, you'll **check with the original labels** whether this does actually do a good job in detecting fraud.\n",
2968 | "\n",
2969 | "Available are the DBscan model predictions, so `n_clusters` is available as well as the cluster labels, which are saved under `pred_labels`. Let's give it a try!\n",
2970 | "\n",
2971 | "**Instructions 1/3**\n",
2972 | "\n",
2973 | "* Count the samples within each cluster by running a bincount on the predicted cluster numbers under `pred_labels` and print the results."
2974 | ]
2975 | },
2976 | {
2977 | "cell_type": "code",
2978 | "execution_count": null,
2979 | "metadata": {},
2980 | "outputs": [],
2981 | "source": [
2982 | "# Count observations in each cluster number\n",
2983 | "counts = np.bincount(pred_labels[pred_labels >= 0])\n",
2984 | "\n",
2985 | "# Print the result\n",
2986 | "print(counts)"
2987 | ]
2988 | },
2989 | {
2990 | "cell_type": "markdown",
2991 | "metadata": {},
2992 | "source": [
2993 | "**Instructions 2/3**\n",
2994 | "\n",
2995 | "* Sort the sample `counts`, take the 3 smallest clusters, and print the results."
2996 | ]
2997 | },
2998 | {
2999 | "cell_type": "code",
3000 | "execution_count": null,
3001 | "metadata": {},
3002 | "outputs": [],
3003 | "source": [
3004 | "# Sort the sample counts of the clusters and take the 3 smallest clusters\n",
3005 | "smallest_clusters = np.argsort(counts)[:3]"
3006 | ]
3007 | },
3008 | {
3009 | "cell_type": "code",
3010 | "execution_count": null,
3011 | "metadata": {},
3012 | "outputs": [],
3013 | "source": [
3014 | "# Print the results \n",
3015 | "print(f'The smallest clusters are clusters: {smallest_clusters}')"
3016 | ]
3017 | },
3018 | {
3019 | "cell_type": "markdown",
3020 | "metadata": {},
3021 | "source": [
3022 | "**Instructions 3/3**\n",
3023 | "\n",
3024 | "* Within `counts`, select the smallest clusters only, to print the number of samples in the three smallest clusters."
3025 | ]
3026 | },
3027 | {
3028 | "cell_type": "code",
3029 | "execution_count": null,
3030 | "metadata": {},
3031 | "outputs": [],
3032 | "source": [
3033 | "# Print the counts of the smallest clusters only\n",
3034 | "print(f'Their counts are: {counts[smallest_clusters]}')"
3035 | ]
3036 | },
3037 | {
3038 | "cell_type": "markdown",
3039 | "metadata": {},
3040 | "source": [
3041 | "**So now you know which small clusters you could flag as fraud. If you were to take more of the smallest clusters, you would cast your net wider and catch more fraud, but most likely also more false positives. It is up to the fraud analyst to find the right number of cases to flag and to investigate. In the next exercise you'll check the results with the actual labels.**"
3042 | ]
3043 | },
3044 | {
3045 | "cell_type": "markdown",
3046 | "metadata": {},
3047 | "source": [
3048 | "### Results verification\n",
3049 | "\n",
3050 | "In this exercise you're going to **check the results** of your DBscan fraud detection model. In reality, you often don't have reliable labels and this is where a fraud analyst can help you validate the results. He/she can check your results and see whether the cases you flagged are indeed suspicious. You can also **check historically known cases** of fraud and see whether your model flags them.\n",
3051 | "\n",
3052 | "In this case, you'll **use the fraud labels** to check your model results. The predicted cluster numbers are available under `pred_labels` as well as the original fraud `labels`.\n",
3053 | "\n",
3054 | "**Instructions**\n",
3055 | "\n",
3056 | "* Create a dataframe combining the cluster numbers with the actual labels.\n",
3057 | "* Create a condition that flags fraud for the three smallest clusters: clusters 21, 17 and 9.\n",
3058 | "* Create a crosstab from the actual fraud labels with the newly created predicted fraud labels."
3059 | ]
3060 | },
3061 | {
3062 | "cell_type": "code",
3063 | "execution_count": null,
3064 | "metadata": {},
3065 | "outputs": [],
3066 | "source": [
3067 | "# Create a dataframe of the predicted cluster numbers and fraud labels \n",
3068 | "df = pd.DataFrame({'clusternr': pred_labels, 'fraud': labels})\n",
3069 | "\n",
3070 | "# Create a condition flagging fraud for the smallest clusters \n",
3071 | "df['predicted_fraud'] = np.where(df['clusternr'].isin([21, 17, 9]), 1, 0)"
3072 | ]
3073 | },
3074 | {
3075 | "cell_type": "code",
3076 | "execution_count": null,
3077 | "metadata": {},
3078 | "outputs": [],
3079 | "source": [
3080 | "# Run a crosstab on the results \n",
3081 | "print(pd.crosstab(df['fraud'], df['predicted_fraud'], rownames=['Actual Fraud'], colnames=['Flagged Fraud']))"
3082 | ]
3083 | },
3084 | {
3085 | "cell_type": "markdown",
3086 | "metadata": {},
3087 | "source": [
3088 | "**How does this compare to the K-means model? The good thing is: out of all flagged cases, roughly 2/3 are actually fraud! Since you only take the three smallest clusters, by definition you flag fewer cases of fraud, so you catch fewer fraudsters but also have fewer false positives. However, you are missing quite a lot of fraud cases. Increasing the number of small clusters you flag could improve that, at the cost of more false positives of course. In the next chapter you'll learn how to further improve fraud detection models by including text analysis.**"
3089 | ]
3090 | },
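{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on that \"roughly 2/3\" claim, you can compute the precision of the flagged cases directly from the columns created above (a small sketch, using the `df` from this exercise):\n",
"\n",
"```python\n",
"# Precision of the flag: of everything flagged, how much is real fraud?\n",
"tp = ((df['fraud'] == 1) & (df['predicted_fraud'] == 1)).sum()\n",
"fp = ((df['fraud'] == 0) & (df['predicted_fraud'] == 1)).sum()\n",
"print(f'Precision of flagged cases: {tp / (tp + fp):0.2f}')\n",
"```"
]
},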
3091 | {
3092 | "cell_type": "markdown",
3093 | "metadata": {},
3094 | "source": [
3095 | "# Fraud detection using text\n",
3096 | "\n",
3097 | "Use text data, text mining and topic modeling to detect fraudulent behavior."
3098 | ]
3099 | },
3100 | {
3101 | "cell_type": "markdown",
3102 | "metadata": {},
3103 | "source": [
3104 | "## Using text data\n",
3105 | "\n",
3106 | "* Types of useful text data:\n",
3107 | " 1. Emails from employees and/or clients\n",
3108 | " 1. Transaction descriptions\n",
3109 | " 1. Employee notes\n",
3110 | " 1. Insurance claim form description box\n",
3111 | " 1. Recorded telephone conversations\n",
3112 | "* Text mining techniques for fraud detection\n",
3113 | " 1. Word search\n",
3114 | " 1. Sentiment analysis\n",
3115 | " 1. Word frequencies and topic analysis\n",
3116 | " 1. Style\n",
3117 | "* Word search for fraud detection\n",
3118 | " * Flagging suspicious words:\n",
3119 | " 1. Simple, straightforward and easy to explain\n",
3120 | " 1. Match results can be used as a filter on top of a machine learning model\n",
3121 | " 1. Match results can be used as a feature in a machine learning model"
3122 | ]
3123 | },
3124 | {
3125 | "cell_type": "markdown",
3126 | "metadata": {},
3127 | "source": [
3128 | "#### Word counts to flag fraud with pandas\n",
3129 | "\n",
3130 | "```python\n",
3131 | "# Using a string operator to find words\n",
3132 | "df['email_body'].str.contains('money laundering')\n",
3133 | "\n",
3134 | "# Select data that matches \n",
3135 | "df.loc[df['email_body'].str.contains('money laundering', na=False)]\n",
3136 | "\n",
3137 | "# Create a list of words to search for\n",
3138 | "list_of_words = ['police', 'money laundering']\n",
3139 | "df.loc[df['email_body'].str.contains('|'.join(list_of_words), na=False)]\n",
3140 | "\n",
3141 | "# Create a fraud flag \n",
3142 | "df['flag'] = np.where(df['email_body'].str.contains('|'.join(list_of_words), na=False), 1, 0)\n",
3143 | "```"
3144 | ]
3145 | },
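{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two practical refinements worth knowing (additions, not from the course): `str.contains` interprets its pattern as a regular expression, so terms containing regex metacharacters should be escaped, and matching can be made case-insensitive:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Escape each term, match case-insensitively, and treat missing text as no match\n",
"pattern = '|'.join(re.escape(term) for term in list_of_words)\n",
"df['flag'] = np.where(df['email_body'].str.contains(pattern, case=False, na=False), 1, 0)\n",
"```"
]
},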
3146 | {
3147 | "cell_type": "markdown",
3148 | "metadata": {},
3149 | "source": [
3150 | "### Word search with dataframes\n",
3151 | "\n",
3152 | "In this exercise you're going to work with text data, containing emails from Enron employees. The **Enron scandal** is a famous fraud case. Enron employees covered up the bad financial position of the company, thereby keeping the stock price artificially high. Enron employees sold their own stock options, and when the truth came out, Enron investors were left with nothing. The goal is to find all emails that mention specific words, such as \"sell enron stock\".\n",
3153 | "\n",
3154 | "By using string operations on dataframes, you can easily sift through messy email data and create flags based on word hits. The Enron email data has been put into a dataframe called `df` so let's search for suspicious terms. Feel free to explore `df` in the Console before getting started.\n",
3155 | "\n",
3156 | "**Instructions 1/2**\n",
3157 | "\n",
3158 | "* Check the head of `df` in the console and look for any emails mentioning 'sell enron stock'."
3159 | ]
3160 | },
3161 | {
3162 | "cell_type": "code",
3163 | "execution_count": null,
3164 | "metadata": {},
3165 | "outputs": [],
3166 | "source": [
3167 | "df = pd.read_csv(enron_emails_clean_file)"
3168 | ]
3169 | },
3170 | {
3171 | "cell_type": "code",
3172 | "execution_count": null,
3173 | "metadata": {},
3174 | "outputs": [],
3175 | "source": [
3176 | "mask = df['clean_content'].str.contains('sell enron stock', na=False)"
3177 | ]
3178 | },
3179 | {
3180 | "cell_type": "markdown",
3181 | "metadata": {},
3182 | "source": [
3183 | "**Instructions 2/2**\n",
3184 | "\n",
3185 | "* Locate the data in `df` that meets the condition we created earlier."
3186 | ]
3187 | },
3188 | {
3189 | "cell_type": "code",
3190 | "execution_count": null,
3191 | "metadata": {},
3192 | "outputs": [],
3193 | "source": [
3194 | "# Select the data from df using the mask\n",
3195 | "df[mask]"
3196 | ]
3197 | },
3198 | {
3199 | "cell_type": "markdown",
3200 | "metadata": {},
3201 | "source": [
3202 | "**You see that searching for particular string values in a dataframe can be relatively easy, and allows you to include textual data in your model or analysis. You can use this word search as an additional flag, or as a feature in your fraud detection model. Let's look at how to filter the data using multiple search terms.**"
3203 | ]
3204 | },
3205 | {
3206 | "cell_type": "markdown",
3207 | "metadata": {},
3208 | "source": [
3209 | "### Using list of terms\n",
3210 | "\n",
3211 | "Oftentimes you don't want to search on just one term. You can probably create a full **\"fraud dictionary\"** of terms that could potentially **flag fraudulent clients** and/or transactions. Fraud analysts will often have an idea of what should be in such a dictionary. In this exercise you're going to **flag a multitude of terms**, and in the next exercise you'll create a new flag variable out of it. The flag can be used either directly in a machine learning model as a feature, or as an additional filter on top of your machine learning model results. Let's first use a list of terms to filter our data on. The dataframe containing the cleaned emails is again available as `df`.\n",
3212 | "\n",
3213 | "**Instructions**\n",
3214 | "\n",
3215 | "* Create a list to search for including 'enron stock', 'sell stock', 'stock bonus', and 'sell enron stock'.\n",
3216 | "* Join the string terms in the search conditions.\n",
3217 | "* Filter data using the emails that match with the list defined under `searchfor`."
3218 | ]
3219 | },
3220 | {
3221 | "cell_type": "code",
3222 | "execution_count": null,
3223 | "metadata": {},
3224 | "outputs": [],
3225 | "source": [
3226 | "# Create a list of terms to search for\n",
3227 | "searchfor = ['enron stock', 'sell stock', 'stock bonus', 'sell enron stock']\n",
3228 | "\n",
3229 | "# Filter cleaned emails on searchfor list and select from df \n",
3230 | "filtered_emails = df[df.clean_content.str.contains('|'.join(searchfor), na=False)]\n",
3231 | "filtered_emails.head()"
3232 | ]
3233 | },
3234 | {
3235 | "cell_type": "markdown",
3236 | "metadata": {},
3237 | "source": [
3238 | "**By joining the search terms with the 'or' sign, i.e. |, you can search on a multitude of terms in your dataset very easily. Let's now create a flag from this which you can use as a feature in a machine learning model.**"
3239 | ]
3240 | },
3241 | {
3242 | "cell_type": "markdown",
3243 | "metadata": {},
3244 | "source": [
3245 | "### Creating a flag\n",
3246 | "\n",
3247 | "This time you are going to **create an actual flag** variable that gives a **1 when the emails get a hit** on the search terms of interest, and 0 otherwise. This is the last step you need to take in order to actually use the text data content as a feature in a machine learning model, or as an actual flag on top of model results. You can continue working with the dataframe `df` containing the emails, and the `searchfor` list is the one defined in the last exercise.\n",
3248 | "\n",
3249 | "**Instructions**\n",
3250 | "\n",
3251 | "* Use a numpy where condition to flag '1' where the cleaned email contains words on the `searchfor` list and 0 otherwise.\n",
3252 | "* Join the words on the `searchfor` list with an \"or\" indicator.\n",
3253 | "* Count the values of the newly created flag variable."
3254 | ]
3255 | },
3256 | {
3257 | "cell_type": "code",
3258 | "execution_count": null,
3259 | "metadata": {},
3260 | "outputs": [],
3261 | "source": [
3262 | "# Create flag variable where the emails match the searchfor terms\n",
3263 | "df['flag'] = np.where(df['clean_content'].str.contains('|'.join(searchfor), na=False), 1, 0)\n",
3264 | "\n",
3265 | "# Count the values of the flag variable\n",
3266 | "count = df['flag'].value_counts()\n",
3267 | "print(count)"
3268 | ]
3269 | },
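{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the \"use it as a feature\" point concrete, here is a hypothetical sketch (it assumes a numeric feature matrix `X_numeric` and fraud labels `y` exist alongside `df`; neither is defined in this notebook):\n",
"\n",
"```python\n",
"# Stack the text-based flag next to the other features and fit a classifier\n",
"X = np.column_stack([X_numeric, df['flag'].values])\n",
"clf = LogisticRegression().fit(X, y)\n",
"```"
]
},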
3270 | {
3271 | "cell_type": "markdown",
3272 | "metadata": {},
3273 | "source": [
3274 | "**You have now managed to search for a list of strings in several lines of text data. These skills come in handy when you want to flag certain words based on what you discovered in your topic model, or when you know beforehand what you want to search for. In the next exercises you're going to learn how to clean text data and create your own topic model to further look for indications of fraud in your text data.**"
3275 | ]
3276 | },
3277 | {
3278 | "cell_type": "markdown",
3279 | "metadata": {},
3280 | "source": [
3281 | "## Text mining to detect fraud"
3282 | ]
3283 | },
3284 | {
3285 | "cell_type": "markdown",
3286 | "metadata": {},
3287 | "source": [
3288 | "#### Cleaning your text data\n",
3289 | "\n",
3290 | "**Must-dos when working with textual data:**\n",
3291 | "\n",
3292 | "1. Tokenization\n",
3293 | " * Split the text into sentences and the sentences into words\n",
3294 | " * transform everything to lowercase\n",
3295 | " * remove punctuation\n",
3296 | "1. Remove all stopwords\n",
3297 | "1. Lemmatize \n",
3298 | " * change from third person into first person\n",
3299 | " * change past and future tense verbs to present tense\n",
3300 | " * this makes it possible to combine all words that point to the same thing\n",
3301 | "1. Stem the words\n",
3302 | " * reduce words to their root form\n",
3303 | " * e.g. walking and walked to walk (see the example below)\n",
3304 | "\n",
3305 | "* **Unprocessed Text**\n",
3306 | " * ![](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/text_df.JPG)\n",
3307 | "* **Processed Text**\n",
3308 | " * ![](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/text_processed.JPG)"
3309 | ]
3310 | },
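{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny illustration of the difference between lemmatizing and stemming (an added example; the words are arbitrary):\n",
"\n",
"```python\n",
"from nltk.stem.wordnet import WordNetLemmatizer\n",
"from nltk.stem.porter import PorterStemmer\n",
"\n",
"print(WordNetLemmatizer().lemmatize('transactions'))  # 'transaction' (a valid word)\n",
"print(PorterStemmer().stem('walking'))                # 'walk' (the root form)\n",
"```"
]
},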
3311 | {
3312 | "cell_type": "markdown",
3313 | "metadata": {},
3314 | "source": [
3315 | "#### Data Preprocessing I\n",
3316 | "\n",
3317 | "* Tokenizers divide strings into lists of substrings\n",
3318 | "* The nltk word tokenizer can be used to find the words and punctuation in a string\n",
3319 | " * it splits the words on whitespace, and separates the punctuation out\n",
3320 | "\n",
3321 | "```python\n",
3322 | "from nltk import word_tokenize\n",
3323 | "from nltk.corpus import stopwords \n",
3324 | "import string\n",
3325 | "\n",
3326 | "# 1. Tokenization: lowercase and strip each string, then split it into word tokens\n",
3327 | "# (shown for a single email body; apply row-wise to clean a whole dataframe)\n",
3328 | "text = word_tokenize(df.loc[0, \"email_body\"].lower().rstrip())\n",
3329 | "# optionally drop non-letters first, e.g. with re.sub(r'[^a-zA-Z]', ' ', ...)\n",
3330 | "\n",
3333 | "# 2. Remove all stopwords, digits and punctuation\n",
3334 | "exclude = set(string.punctuation)\n",
3335 | "stop = set(stopwords.words('english'))\n",
3336 | "stop_free = \" \".join([word for word in text if ((word not in stop) and (not word.isdigit()))])\n",
3337 | "punc_free = ''.join(word for word in stop_free if word not in exclude)\n",
3338 | "```"
3339 | ]
3340 | },
3341 | {
3342 | "cell_type": "markdown",
3343 | "metadata": {},
3344 | "source": [
3345 | "#### Data Preprocessing II\n",
3346 | "\n",
3347 | "```python\n",
3348 | "from nltk.stem.wordnet import WordNetLemmatizer\n",
3349 | "from nltk.stem.porter import PorterStemmer\n",
3350 | "\n",
3351 | "# Lemmatize words\n",
3352 | "lemma = WordNetLemmatizer()\n",
3353 | "normalized = \" \".join(lemma.lemmatize(word) for word in punc_free.split())\n",
3354 | "\n",
3355 | "# Stem words\n",
3356 | "porter = PorterStemmer()\n",
3357 | "cleaned_text = \" \".join(porter.stem(token) for token in normalized.split())\n",
3358 | "print(cleaned_text.split())\n",
3359 | "\n",
3360 | "['philip','going','street','curious','hear','perspective','may','wish',\n",
3361 | "'offer','trading','floor','enron','stock','lower','joined','company',\n",
3362 | "'business','school','imagine','quite','happy','people','day','relate',\n",
3363 | "'somewhat','stock','around','fact','broke','day','ago','knowing',\n",
3364 | "'imagine','letting','event','get','much','taken','similar',\n",
3365 | "'problem','hope','everything','else','going','well','family','knee',\n",
3366 | "'surgery','yet','give','call','chance','later']\n",
3367 | "```"
3368 | ]
3369 | },
3370 | {
3371 | "cell_type": "markdown",
3372 | "metadata": {},
3373 | "source": [
3374 | "### Removing stopwords\n",
3375 | "\n",
3376 | "In the following exercises you're going to **clean the Enron emails**, in order to be able to use the data in a topic model. Text cleaning can be challenging, so you'll learn some steps to do this well. The dataframe containing the emails, `df`, is available. As a first step you need to **define the list of stopwords and punctuations** that are to be removed from the text data in the next exercise. Let's give it a try.\n",
3377 | "\n",
3378 | "**Instructions**\n",
3379 | "\n",
3380 | "* Import the stopwords from `nltk`.\n",
3381 | "* Define 'english' words to use as stopwords under the variable `stop`.\n",
3382 | "* Get the punctuation set from the `string` package and assign it to `exclude`."
3383 | ]
3384 | },
3385 | {
3386 | "cell_type": "code",
3387 | "execution_count": null,
3388 | "metadata": {},
3389 | "outputs": [],
3390 | "source": [
3391 | "# Define stopwords to exclude\n",
3392 | "stop = set(stopwords.words('english'))\n",
3393 | "stop.update((\"to\", \"cc\", \"subject\", \"http\", \"from\", \"sent\", \"ect\", \"u\", \"fwd\", \"www\", \"com\", 'html'))\n",
3394 | "\n",
3395 | "# Define punctuations to exclude and lemmatizer\n",
3396 | "exclude = set(string.punctuation)"
3397 | ]
3398 | },
3399 | {
3400 | "cell_type": "markdown",
3401 | "metadata": {},
3402 | "source": [
3403 | "### Cleaning text data\n",
3404 | "\n",
3405 | "Now that you've defined the **stopwords and punctuations**, let's use these to clean our Enron emails in the dataframe `df` further. The lists containing stopwords and punctuations are available under `stop` and `exclude`. There are a few more steps to take before you have cleaned data, such as **\"lemmatization\"** of words, and **stemming the verbs**. 
The verbs in the email data are already stemmed, and the lemmatization is already done for you in this exercise.\n",
3406 | "\n",
3407 | "**Instructions 1/2**\n",
3408 | "\n",
3409 | "* Use the previously defined variables `stop` and `exclude` to finish off the function: strip the words of whitespace using `rstrip`, and exclude stopwords and punctuations. Finally, lemmatize the words and assign that to `normalized`."
3410 | ]
3411 | },
3412 | {
3413 | "cell_type": "code",
3414 | "execution_count": null,
3415 | "metadata": {},
3416 | "outputs": [],
3417 | "source": [
3418 | "# Instantiate the lemmatizer from nltk\n",
3419 | "lemma = WordNetLemmatizer()\n",
3420 | "\n",
3421 | "def clean(text, stop):\n",
3422 | "    text = str(text).rstrip()\n",
3423 | "    stop_free = \" \".join([i for i in text.lower().split() if ((i not in stop) and (not i.isdigit()))])\n",
3424 | "    punc_free = ''.join(i for i in stop_free if i not in exclude)\n",
3425 | "    normalized = \" \".join(lemma.lemmatize(i) for i in punc_free.split()) \n",
3426 | "    return normalized"
3427 | ]
3428 | },
3429 | {
3430 | "cell_type": "markdown",
3431 | "metadata": {},
3432 | "source": [
3433 | "**Instructions 2/2**\n",
3434 | "\n",
3435 | "* Apply the function `clean(text, stop)` on each line of text data in our dataframe, and take the column `df['clean_content']` for this."
3436 | ]
3437 | },
3438 | {
3439 | "cell_type": "code",
3440 | "execution_count": null,
3441 | "metadata": {},
3442 | "outputs": [],
3443 | "source": [
3444 | "# Clean the emails in df and print results\n",
3445 | "text_clean = []\n",
3446 | "for text in df['clean_content']:\n",
3447 | "    text_clean.append(clean(text, stop).split()) "
3448 | ]
3449 | },
3450 | {
3451 | "cell_type": "code",
3452 | "execution_count": null,
3453 | "metadata": {},
3454 | "outputs": [],
3455 | "source": [
3456 | "text_clean[0][:10]"
3457 | ]
3458 | },
3459 | {
3460 | "cell_type": "markdown",
3461 | "metadata": {},
3462 | "source": [
3463 | "**You have now cleaned your data entirely with the necessary steps, including splitting the text into words, removing stopwords and punctuation, and lemmatizing your words. You are now ready to run a topic model on this data. In the following exercises you're going to explore how to do that.**"
3464 | ]
3465 | },
3466 | {
3467 | "cell_type": "markdown",
3468 | "metadata": {},
3469 | "source": [
3470 | "## Topic modeling on fraud\n",
3471 | "\n",
3472 | "1. Discovering topics in text data\n",
3473 | "1. \"What is the text about?\"\n",
3474 | "1. Conceptually similar to clustering data\n",
3475 | "1. Compare topics of fraud cases to non-fraud cases and use as a feature or flag\n",
3476 | "1. Or: is there a particular topic in the data that seems to point to fraud?"
3477 | ]
3478 | },
3479 | {
3480 | "cell_type": "markdown",
3481 | "metadata": {},
3482 | "source": [
3483 | "#### Latent Dirichlet Allocation (LDA)\n",
3484 | "\n",
3485 | "* With LDA you obtain:\n",
3486 | " * a \"topics per text item\" model (i.e. probabilities)\n",
3487 | " * a \"words per topic\" model\n",
3488 | "* Creating your own topic model:\n",
3489 | " * Clean your data\n",
3490 | " * Create a bag of words with a dictionary and corpus\n",
3491 | " * The dictionary contains the words and word frequencies from the entire text\n",
3492 | " * The corpus holds the word counts for each line of text\n",
3493 | " * Feed the dictionary and corpus into the LDA model\n",
3494 | "* LDA:\n",
3495 | " * ![lda](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/lda.JPG)\n",
3496 | " 1. 
[LDA2vec: Word Embeddings in Topic Models](https://www.datacamp.com/community/tutorials/lda2vec-topic-model)\n",
3497 | " 1. see how each word in the dataset is associated with each topic\n",
3498 | " 1. see how each text item in the data associates with topics (in the form of probabilities)\n",
3499 | " 1. see the image above"
3500 | ]
3501 | },
3502 | {
3503 | "cell_type": "markdown",
3504 | "metadata": {},
3505 | "source": [
3506 | "#### Bag of words: dictionary and corpus\n",
3507 | "\n",
3508 | "* use the `Dictionary` function in `corpora` to create a dictionary from the text data\n",
3509 | " * it contains the word counts\n",
3510 | "* filter out words that appear in fewer than 5 emails and keep only the 50000 most frequent words\n",
3511 | " * this is a way of cleaning up outlier noise\n",
3512 | "* create the corpus, which, for each email, counts the number of words and the count for each word (`doc2bow`)\n",
3513 | "* `doc2bow`\n",
3514 | " * Document to Bag of Words\n",
3515 | " * converts text data into bag-of-words format\n",
3516 | " * each row is then a list of words with the associated word counts\n",
3517 | " \n",
3518 | "```python\n",
3519 | "from gensim import corpora\n",
3520 | "\n",
3521 | "# Create dictionary with the number of times a word appears\n",
3522 | "dictionary = corpora.Dictionary(cleaned_emails)\n",
3523 | "\n",
3524 | "# Filter out (non)frequent words \n",
3525 | "dictionary.filter_extremes(no_below=5, keep_n=50000)\n",
3526 | "\n",
3527 | "# Create corpus\n",
3528 | "corpus = [dictionary.doc2bow(text) for text in cleaned_emails]\n",
3529 | "```"
3530 | ]
3531 | },
3532 | {
3533 | "cell_type": "markdown",
3534 | "metadata": {},
3535 | "source": [
3536 | "#### Latent Dirichlet Allocation (LDA) with gensim\n",
3537 | "\n",
3538 | "* Run the LDA model after cleaning the text data and creating the dictionary and corpus\n",
3539 | "* Pass the corpus and dictionary into the model\n",
3540 | "* As with K-means, you must pick the number of topics beforehand, even if you're not sure yet which topics exist\n",
3541 | "* The calculated LDA model will contain the associated words for each topic, and topic scores per email\n",
3542 | "* Use `print_topics` to obtain the top words from the topics\n",
3543 | "\n",
3544 | "```python\n",
3545 | "import gensim\n",
3546 | "\n",
3547 | "# Define the LDA model\n",
3548 | "ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3,\n",
3549 | "                                           id2word=dictionary, passes=15)\n",
3550 | "\n",
3551 | "# Print the three topics from the model with top words\n",
3552 | "topics = ldamodel.print_topics(num_words=4)\n",
3553 | "for topic in topics:\n",
3554 | "    print(topic)\n",
3555 | "\n",
3556 | ">>> (0, '0.029*\"email\" + 0.016*\"send\" + 0.016*\"results\" + 0.016*\"invoice\"')\n",
3557 | ">>> (1, '0.026*\"price\" + 0.026*\"work\" + 0.026*\"management\" + 0.026*\"sell\"')\n",
3558 | ">>> (2, '0.029*\"distribute\" + 0.029*\"contact\" + 0.016*\"supply\" + 0.016*\"fast\"')\n",
3559 | "```"
3560 | ]
3561 | },
3562 | {
3563 | "cell_type": "markdown",
3564 | "metadata": {},
3565 | "source": [
3566 | "### Create dictionary and corpus\n",
3567 | "\n",
3568 | "In order to run an LDA topic model, you first need to **define your dictionary and corpus**, as those need to go into the model. You're going to continue working with the cleaned text data that you created in the previous exercises. 
That means that `text_clean` is available for you already to continue working with, and you'll use that to create your dictionary and corpus.\n",
3569 | "\n",
3570 | "This exercise will take a little longer to execute than usual.\n",
3571 | "\n",
3572 | "**Instructions**\n",
3573 | "\n",
3574 | "* Import the gensim package and corpora from gensim separately.\n",
3575 | "* Define your dictionary by running the correct function on your clean data `text_clean`.\n",
3576 | "* Define the corpus by running `doc2bow` on each piece of text in `text_clean`.\n",
3577 | "* Print your results so you can see what `dictionary` and `corpus` look like."
3578 | ]
3579 | },
3580 | {
3581 | "cell_type": "code",
3582 | "execution_count": null,
3583 | "metadata": {},
3584 | "outputs": [],
3585 | "source": [
3586 | "# Define the dictionary\n",
3587 | "dictionary = corpora.Dictionary(text_clean)\n",
3588 | "\n",
3589 | "# Define the corpus \n",
3590 | "corpus = [dictionary.doc2bow(text) for text in text_clean]"
3591 | ]
3592 | },
3593 | {
3594 | "cell_type": "code",
3595 | "execution_count": null,
3596 | "metadata": {},
3597 | "outputs": [],
3598 | "source": [
3599 | "print(dictionary)"
3600 | ]
3601 | },
3602 | {
3603 | "cell_type": "code",
3604 | "execution_count": null,
3605 | "metadata": {},
3606 | "outputs": [],
3607 | "source": [
3608 | "corpus[0][:10]"
3609 | ]
3610 | },
3611 | {
3612 | "cell_type": "markdown",
3613 | "metadata": {},
3614 | "source": [
3615 | "**These are the two ingredients you need to run your topic model on the Enron emails. You are now ready for the final step: creating your first fraud detection topic model.**"
3616 | ]
3617 | },
3618 | {
3619 | "cell_type": "markdown",
3620 | "metadata": {},
3621 | "source": [
3622 | "### LDA model\n",
3623 | "\n",
3624 | "Now it's time to **build the LDA model**. Using the `dictionary` and `corpus`, you are ready to discover which topics are present in the Enron emails. With a quick print of the words assigned to the topics, you can do a first exploration of whether any obvious topics jump out. Be mindful that the topic model is **heavy to calculate**, so it will take a while to run. Let's give it a try!\n",
3625 | "\n",
3626 | "**Instructions**\n",
3627 | "\n",
3628 | "* Build the LDA model from gensim models, by inserting the `corpus` and `dictionary`.\n",
3629 | "* Save the 5 topics by running `print_topics` on the model results, and select the top 5 words."
3630 | ]
3631 | },
3632 | {
3633 | "cell_type": "code",
3634 | "execution_count": null,
3635 | "metadata": {},
3636 | "outputs": [],
3637 | "source": [
3638 | "# Define the LDA model\n",
3639 | "ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=5)\n",
3640 | "\n",
3641 | "# Save the topics and top 5 words\n",
3642 | "topics = ldamodel.print_topics(num_words=5)\n",
3643 | "\n",
3644 | "# Print the results\n",
3645 | "for topic in topics:\n",
3646 | "    print(topic)"
3647 | ]
3648 | },
3649 | {
3650 | "cell_type": "markdown",
3651 | "metadata": {},
3652 | "source": [
3653 | "**You have now successfully created your first topic model on the Enron email data. However, the print of words doesn't really give you enough information to find a topic that might lead you to signs of fraud. You'll therefore need to closely inspect the model results in order to be able to detect anything that can be related to fraud in your data. 
You'll learn more about this in the next video.**"
3654 | ]
3655 | },
3656 | {
3657 | "cell_type": "markdown",
3658 | "metadata": {},
3659 | "source": [
3660 | "## Flagging fraud based on topic"
3661 | ]
3662 | },
3663 | {
3664 | "cell_type": "markdown",
3665 | "metadata": {},
3666 | "source": [
3667 | "#### Using your LDA model results for fraud detection\n",
3668 | "\n",
3669 | "1. Are there any suspicious topics? (no labels)\n",
3670 | " 1. if you don't have labels, first check for the frequency of suspicious words within topics and check whether the topics seem to describe the fraudulent behavior\n",
3671 | " 1. for the Enron email data, a suspicious topic would be one where employees are discussing stock bonuses, selling stock, stock price, and perhaps mentions of accounting or weak financials\n",
3672 | " 1. Defining suspicious topics does require some pre-knowledge about the fraudulent behavior\n",
3673 | " 1. If the fraudulent topic is noticeable, flag all instances that have a high probability for this topic\n",
3674 | "1. Are the topics in fraud and non-fraud cases similar? (with labels)\n",
3675 | " 1. If there are previous cases of fraud, run a topic model on the fraud text only, and one on the non-fraud text\n",
3676 | " 1. Check whether the results are similar\n",
3677 | " 1. Whether the frequency of the topics is the same in fraud vs non-fraud (see the sketch below)\n",
3678 | "1. Are fraud cases associated more with certain topics? (with labels)\n",
3679 | " 1. Check whether fraud cases have a higher probability score for certain topics\n",
3680 | " 1. If so, run a topic model on new data and create a flag directly on the instances that score high on those topics"
3681 | ]
3682 | },
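{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the label-based comparison (an addition, not from the course; it assumes a binary `labels` sequence aligned with the order of `corpus`, which is not defined for the Enron data in this notebook):\n",
"\n",
"```python\n",
"# Dominant topic per document, then topic frequencies for fraud vs. non-fraud\n",
"dominant = [max(ldamodel[doc], key=lambda t: t[1])[0] for doc in corpus]\n",
"topic_freq = pd.crosstab(pd.Series(labels, name='fraud'),\n",
"                         pd.Series(dominant, name='dominant_topic'),\n",
"                         normalize='index')\n",
"print(topic_freq)\n",
"```"
]
},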
3683 | {
3684 | "cell_type": "markdown",
3685 | "metadata": {},
3686 | "source": [
3687 | "#### To understand topics, you need to visualize\n",
3688 | "\n",
3689 | "```python\n",
3690 | "import pyLDAvis.gensim\n",
3691 | "lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)\n",
3692 | "```\n",
3693 | "\n",
3694 | "![topics](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/topics2.jpg)\n",
3695 | "\n",
3696 | "* Each bubble on the left-hand side represents a topic\n",
3697 | "* The larger the bubble, the more prevalent that topic is\n",
3698 | "* Click on each topic to get the details per topic in the right panel\n",
3699 | "* The words are the most important keywords that form the selected topic\n",
3700 | "* A good topic model will have fairly big, non-overlapping bubbles, scattered throughout the chart\n",
3701 | "* A model with too many topics will typically have many overlaps, or small-sized bubbles clustered in one region\n",
3702 | "* In the case of the model above, there is a slight overlap between topics 2 and 3, which may point to one topic too many"
3703 | ]
3704 | },
3705 | {
3706 | "cell_type": "code",
3707 | "execution_count": null,
3708 | "metadata": {},
3709 | "outputs": [],
3710 | "source": [
3711 | "lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)"
3712 | ]
3713 | },
3714 | {
3715 | "cell_type": "code",
3716 | "execution_count": null,
3717 | "metadata": {},
3718 | "outputs": [],
3719 | "source": [
3720 | "pyLDAvis.display(lda_display)"
3721 | ]
3722 | },
3723 | {
3724 | "cell_type": "markdown",
3725 | "metadata": {},
3726 | "source": [
3727 | "#### Assign topics to your original data\n",
3728 | "\n",
3729 | "* One practical application of topic modeling is to determine what topic a given text is about\n",
3730 | "* To do so, find the topic number that has the highest percentage contribution in that text\n",
3731 | "* The function `get_topic_details`, shown here, neatly aggregates this information in a presentable table\n",
3732 | "* Combine the original text data with the output of the `get_topic_details` function\n",
3733 | "* Each row then contains the dominant topic number, the probability score for that topic, and the original text data\n",
3734 | "\n",
3735 | "```python\n",
3736 | "def get_topic_details(ldamodel, corpus):\n",
3737 | "    topic_details_df = pd.DataFrame()\n",
3738 | "    for i, row in enumerate(ldamodel[corpus]):\n",
3739 | "        row = sorted(row, key=lambda x: (x[1]), reverse=True)\n",
3740 | "        for j, (topic_num, prop_topic) in enumerate(row):\n",
3741 | "            if j == 0:  # => dominant topic\n",
3742 | "                wp = ldamodel.show_topic(topic_num)\n",
3743 | "                topic_details_df = topic_details_df.append(pd.Series([topic_num, prop_topic]), ignore_index=True)\n",
3744 | "    topic_details_df.columns = ['Dominant_Topic', '% Score']\n",
3745 | "    return topic_details_df\n",
3746 | "\n",
3747 | "\n",
3748 | "contents = pd.DataFrame({'Original text': text_clean})\n",
3749 | "topic_details = pd.concat([get_topic_details(ldamodel,\n",
3750 | "                           corpus), contents], axis=1)\n",
3751 | "topic_details.head()\n",
3752 | "\n",
3753 | "\n",
3754 | "   Dominant_Topic   % Score                        Original text\n",
3755 | "0             0.0  0.989108      [investools, advisory, free, ...\n",
3756 | "1             0.0  0.993513           [forwarded, richard, b, ...\n",
3757 | "2             1.0  0.964858     [hey, wearing, target, purple, ...\n",
3758 | "3             0.0  0.989241  [leslie, milosevich, santa, clara, ...\n",
3759 | "```"
3760 | ]
3761 | },
3762 | {
3763 | "cell_type": "markdown",
3764 | "metadata": {},
3765 | "source": [
3766 | "### Interpreting the topic model\n",
3767 | "\n",
3768 | "* Use the visualization results from the pyLDAvis library shown above.\n",
3769 | "* Have a look at topics 1 and 3 from the LDA model on the Enron email data. Which one would you research further for fraud detection purposes?\n",
3770 | "\n",
3771 | "**Possible Answers**\n",
3772 | "\n",
3773 | "* __**Topic 1.**__\n",
3774 | "* ~~Topic 3.~~\n",
3775 | "* ~~None of these topics seem related to fraud.~~\n",
3776 | "\n",
3777 | "\n",
3778 | "**Topic 1 seems to discuss the employee share option program, and seems to point to internal conversation (with \"please\", \"may\", \"know\", etc.), so this is more likely to be related to the internal accounting fraud and trading stock with insider knowledge. Topic 3 seems to be more related to general news around Enron.**"
3779 | ]
3780 | },
3781 | {
3782 | "cell_type": "markdown",
3783 | "metadata": {},
3784 | "source": [
3785 | "### Finding fraudsters based on topic\n",
3786 | "\n",
3787 | "In this exercise you're going to **link the results** from the topic model **back to your original data**. You now learned that you want to **flag everything related to topic 3**. As you will see, this is actually not that straightforward. You'll be given the function `get_topic_details()` which takes the arguments `ldamodel` and `corpus`. It retrieves the details of the topics for each line of text. With that function, you can append the results back to your original data. 
If you want to learn in more detail how to work with the model results, which is beyond the scope of this course, you're highly encouraged to read this [article](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/).\n",
3788 | "\n",
3789 | "Available for you are the `dictionary` and `corpus`, the text data `text_clean` as well as your model results `ldamodel`. Also defined is `get_topic_details()`.\n",
3790 | "\n",
3791 | "**Instructions 1/3**\n",
3792 | "\n",
3793 | "* Print and inspect the results from the `get_topic_details()` function by inserting your LDA model results and `corpus`."
3794 | ]
3795 | },
3796 | {
3797 | "cell_type": "markdown",
3798 | "metadata": {},
3799 | "source": [
3800 | "#### def get_topic_details"
3801 | ]
3802 | },
3803 | {
3804 | "cell_type": "code",
3805 | "execution_count": null,
3806 | "metadata": {},
3807 | "outputs": [],
3808 | "source": [
3809 | "def get_topic_details(ldamodel, corpus):\n",
3810 | "    topic_details_df = pd.DataFrame()\n",
3811 | "    for i, row in enumerate(ldamodel[corpus]):\n",
3812 | "        row = sorted(row, key=lambda x: (x[1]), reverse=True)\n",
3813 | "        for j, (topic_num, prop_topic) in enumerate(row):\n",
3814 | "            if j == 0:  # => dominant topic\n",
3815 | "                wp = ldamodel.show_topic(topic_num)\n",
3816 | "                topic_details_df = topic_details_df.append(pd.Series([topic_num, prop_topic]), ignore_index=True)\n",
3817 | "    topic_details_df.columns = ['Dominant_Topic', '% Score']\n",
3818 | "    return topic_details_df"
3819 | ]
3820 | },
3821 | {
3822 | "cell_type": "code",
3823 | "execution_count": null,
3824 | "metadata": {},
3825 | "outputs": [],
3826 | "source": [
3827 | "# Run get_topic_details function and check the results\n",
3828 | "topic_details_df = get_topic_details(ldamodel, corpus)"
3829 | ]
3830 | },
3831 | {
3832 | "cell_type": "code",
3833 | "execution_count": null,
3834 | "metadata": {},
3835 | "outputs": [],
3836 | "source": [
3837 | "topic_details_df.head()"
3838 | ]
3839 | },
3840 | {
3841 | "cell_type": "code",
3842 | "execution_count": null,
3843 | "metadata": {},
3844 | "outputs": [],
3845 | "source": [
3846 | "topic_details_df.tail()"
3847 | ]
3848 | },
3849 | {
3850 | "cell_type": "markdown",
3851 | "metadata": {},
3852 | "source": [
3853 | "**Instructions 2/3**\n",
3854 | "\n",
3855 | "* Concatenate column-wise the results from the previously defined function `get_topic_details()` to the original text data contained under `contents` and inspect the results."
3856 | ]
3857 | },
3858 | {
3859 | "cell_type": "code",
3860 | "execution_count": null,
3861 | "metadata": {},
3862 | "outputs": [],
3863 | "source": [
3864 | "# Add original text to topic details in a dataframe\n",
3865 | "contents = pd.DataFrame({'Original text': text_clean})\n",
3866 | "topic_details = pd.concat([get_topic_details(ldamodel, corpus), contents], axis=1)"
3867 | ]
3868 | },
3869 | {
3870 | "cell_type": "code",
3871 | "execution_count": null,
3872 | "metadata": {},
3873 | "outputs": [],
3874 | "source": [
3875 | "topic_details.sort_values(by=['% Score'], ascending=False).head(10).head()"
3876 | ]
3877 | },
3878 | {
3879 | "cell_type": "code",
3880 | "execution_count": null,
3881 | "metadata": {},
3882 | "outputs": [],
3883 | "source": [
3884 | "topic_details.sort_values(by=['% Score'], ascending=False).head(10).tail()"
3885 | ]
3886 | },
3887 | {
3888 | "cell_type": "markdown",
3889 | "metadata": {},
3890 | "source": [
3891 | "**Instructions 3/3**\n",
3892 | "\n",
3893 | "* Create a flag with the `np.where()` function to flag all content that has topic 3 as a dominant topic with a 1, and 0 otherwise."
3894 | ]
3895 | },
3896 | {
3897 | "cell_type": "code",
3898 | "execution_count": null,
3899 | "metadata": {},
3900 | "outputs": [],
3901 | "source": [
3902 | "# Create flag for text highest associated with topic 3\n",
3903 | "topic_details['flag'] = np.where((topic_details['Dominant_Topic'] == 3.0), 1, 0)"
3904 | ]
3905 | },
3906 | {
3907 | "cell_type": "code",
3908 | "execution_count": null,
3909 | "metadata": {},
3910 | "outputs": [],
3911 | "source": [
3912 | "topic_details_1 = topic_details[topic_details.flag == 1]"
3913 | ]
3914 | },
3915 | {
3916 | "cell_type": "code",
3917 | "execution_count": null,
3918 | "metadata": {},
3919 | "outputs": [],
3920 | "source": [
3921 | "topic_details_1.sort_values(by=['% Score'], ascending=False).head(10)"
3922 | ]
3923 | },
3924 | {
3925 | "cell_type": "markdown",
3926 | "metadata": {},
3927 | "source": [
3928 | "**You have now flagged all data that is most highly associated with topic 3, which seems to cover internal conversation about Enron stock options. You are a true detective. 
With these exercises you have demonstrated that text mining and topic modeling can be a powerful tool for fraud detection.**"
3929 | ]
3930 | },
3931 | {
3932 | "cell_type": "markdown",
3933 | "metadata": {},
3934 | "source": [
3935 | "## Lesson 4: Recap"
3936 | ]
3937 | },
3938 | {
3939 | "cell_type": "markdown",
3940 | "metadata": {},
3941 | "source": [
3942 | "### Working with imbalanced data\n",
3943 | "\n",
3944 | "* Worked with highly imbalanced fraud data\n",
3945 | "* Learned how to resample your data\n",
3946 | "* Learned about different resampling methods"
3947 | ]
3948 | },
3949 | {
3950 | "cell_type": "markdown",
3951 | "metadata": {},
3952 | "source": [
3953 | "### Fraud detection with labeled data\n",
3954 | "\n",
3955 | "* Refreshed supervised learning techniques to detect fraud\n",
3956 | "* Learned how to get reliable performance metrics and worked with the precision-recall trade-off\n",
3957 | "* Explored how to optimize your model parameters to handle fraud data\n",
3958 | "* Applied ensemble methods to fraud detection"
3959 | ]
3960 | },
3961 | {
3962 | "cell_type": "markdown",
3963 | "metadata": {},
3964 | "source": [
3965 | "### Fraud detection without labels\n",
3966 | "\n",
3967 | "* Learned about the importance of segmentation\n",
3968 | "* Refreshed your knowledge on clustering methods\n",
3969 | "* Learned how to detect fraud using outliers and small clusters with K-means clustering\n",
3970 | "* Applied a DBSCAN clustering model for fraud detection"
3971 | ]
3972 | },
3973 | {
3974 | "cell_type": "markdown",
3975 | "metadata": {},
3976 | "source": [
3977 | "### Text mining for fraud detection\n",
3978 | "\n",
3979 | "* Learned how to augment fraud detection analysis with text mining techniques\n",
3980 | "* Applied word searches to flag the use of certain words, and learned how to apply topic modeling for fraud detection\n",
3981 | "* Learned how to effectively clean messy text data"
3982 | ]
3983 | },
3984 | {
3985 | "cell_type": "markdown",
3986 | "metadata": {},
3987 | "source": [
3988 | "### Further learning for fraud detection\n",
3989 | "\n",
3990 | "* Network analysis to detect fraud\n",
3991 | "* Different supervised and unsupervised learning techniques (e.g. neural networks)\n",
3992 | "* Working with very large data"
3993 | ]
3994 | }
3995 | ],
3996 | "metadata": {
3997 | "kernelspec": {
3998 | "display_name": "Python 3",
3999 | "language": "python",
4000 | "name": "python3"
4001 | },
4002 | "language_info": {
4003 | "codemirror_mode": {
4004 | "name": "ipython",
4005 | "version": 3
4006 | },
4007 | "file_extension": ".py",
4008 | "mimetype": "text/x-python",
4009 | "name": "python",
4010 | "nbconvert_exporter": "python",
4011 | "pygments_lexer": "ipython3",
4012 | "version": "3.7.3"
4013 | },
4014 | "toc-autonumbering": true
4015 | },
4016 | "nbformat": 4,
4017 | "nbformat_minor": 4
4018 | }
4019 | --------------------------------------------------------------------------------