├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── images ├── afd_steps.png ├── all_fdb.png └── ieee_ofi_sample.png ├── scripts ├── examples │ └── Test_FDB_Loader.ipynb └── reproducibility │ ├── afd │ ├── README.md │ ├── configs │ │ ├── CreditCardFraudDetection.json │ │ ├── FakeJobPostingPrediction.json │ │ ├── Fraudecommerce.json │ │ ├── IEEECISFraudDetection.json │ │ ├── IPBlocklist.json │ │ ├── MaliciousURL.json │ │ ├── SimulatedCreditCardTransactionsSparkov.json │ │ ├── TwitterBotAccounts.json │ │ └── VehicleLoanDefaultPrediction.json │ ├── create_afd_resources.py │ └── score_afd_model.py │ ├── autogluon │ ├── README.md │ ├── benchmark_ag.py │ └── example-ag-ieeecis.ipynb │ ├── autosklearn │ ├── README.md │ └── benchmark_autosklearn.py │ ├── benchmark_utils.py │ ├── h2o │ ├── README.md │ ├── benchmark_h2o.py │ └── example-h2o-ieeecis.ipynb │ └── label-noise │ ├── benchmark_experiments.ipynb │ ├── feature_dict.py │ ├── load_fdb_datasets.py │ └── micro_models.py ├── setup.py └── src ├── __init__.py └── fdb ├── __init__.py ├── datasets.py ├── kaggle_configs.py ├── preprocessing.py ├── preprocessing_objects.py └── versioned_datasets ├── __init__.py └── ipblock ├── 20220607.zip └── __init__.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. 
Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 
38 | 39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 
60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021-2022 Prince Grover 4 | Copyright (c) 2021-2022 Zheng Li 5 | Copyright (c) 2022 Jianbo Liu 6 | Copyright (c) 2022 Jakub Zablocki 7 | Copyright (c) 2022 Jianbo Liu 8 | Copyright (c) 2022 Hao Zhou 9 | Copyright (c) 2022 Julia Xu 10 | Copyright (c) 2022 Anqi Cheng 11 | 12 | Permission is hereby granted, free of charge, to any person obtaining a copy 13 | of this software and associated documentation files (the "Software"), to deal 14 | in the Software without restriction, including without limitation the rights 15 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 16 | copies of the Software, and to permit persons to whom the Software is 17 | furnished to do so, subject to the following conditions: 18 | 19 | The above copyright notice and this permission notice shall be included in all 20 | copies or substantial portions of the Software. 21 | 22 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 23 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 24 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 25 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 26 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 27 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 28 | SOFTWARE. 
29 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # FDB: Fraud Dataset Benchmark 2 | 3 | *By [Prince Grover](groverpr), [Zheng Li](zhengli0817), [Julia Xu](SheliaXin), [Justin Tittelfitz](jtittelfitz), Anqi Cheng, [Jakub Zablocki](qbaza), Jianbo Liu, and [Hao Zhou](haozhouamzn)* 4 | 5 | 6 | [![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 7 | 8 | 9 | The **Fraud Dataset Benchmark (FDB)** is a compilation of publicly available datasets relevant to **fraud detection** ([arXiv Link](https://arxiv.org/abs/2208.14417)). The FDB aims to cover a wide variety of fraud detection tasks, including card-not-present transaction fraud, bot attacks, malicious traffic, loan risk and content moderation. The Python-based data loaders from FDB provide dataset loading, standardized train-test splits and performance evaluation metrics. The goal of our work is to provide researchers working in the field of fraud and abuse detection with a standardized set of benchmarking datasets and evaluation tools for their experiments. Using FDB tools, we demonstrate several applications of FDB that are of broad interest for fraud detection, including feature engineering, comparison of supervised learning algorithms, label noise removal, class-imbalance treatment and semi-supervised learning. 10 | 11 | 12 | ## Datasets used in FDB 13 | Brief summary of the datasets used in FDB. Each dataset is described in detail in the [data sources section](#data-sources). 
14 | 15 | | **#** | **Dataset name** | **Dataset key** | **Fraud category** | **#Train** | **#Test** | **Class ratio (train)** | **#Feats** | **#Cat** | **#Num** | **#Text** | **#Enrichable** | 16 | |-------|------------------------------------------------------------|-----------------|-------------------------------------|------------|-----------|-------------------------|------------|----------|----------|-----------|-----------------| 17 | | 1 | IEEE-CIS Fraud Detection | ieeecis | Card Not Present Transactions Fraud | 561,013 | 28,527 | 3.50% | 67 | 6 | 61 | 0 | 0 | 18 | | 2 | Credit Card Fraud Detection | ccfraud | Card Not Present Transactions Fraud | 227,845 | 56,962 | 0.18% | 28 | 0 | 28 | 0 | 0 | 19 | | 3 | Fraud ecommerce | fraudecom | Card Not Present Transactions Fraud | 120,889 | 30,223 | 10.60% | 6 | 2 | 3 | 0 | 1 | 20 | | 4 | Simulated Credit Card Transactions generated using Sparkov | sparknov | Card Not Present Transactions Fraud | 1,296,675 | 20,000 | 5.70% | 17 | 10 | 6 | 1 | 0 | 21 | | 5 | Twitter Bots Accounts | twitterbot | Bot Attacks | 29,950 | 7,488 | 33.10% | 16 | 6 | 6 | 4 | 0 | 22 | | 6 | Malicious URLs dataset | malurl | Malicious Traffic | 586,072 | 65,119 | 34.20% | 2 | 0 | 1 | 1 | 0 | 23 | | 7 | Fake Job Posting Prediction | fakejob | Content Moderation | 14,304 | 3,576 | 4.70% | 16 | 10 | 1 | 5 | 0 | 24 | | 8 | Vehicle Loan Default Prediction | vehicleloan | Credit Risk | 186,523 | 46,631 | 21.60% | 38 | 13 | 22 | 3 | 0 | 25 | | 9 | IP Blocklist | ipblock | Malicious Traffic | 172,000 | 43,000 | 7% | 1 | 0 | 0 | 0 | 1 | 26 | 27 | 28 | ## Installation 29 | 30 | ### Requirements 31 | - Kaggle account 32 | - **Important**: the `ieeecis` dataset requires you to [**join the IEEE-CIS competition**](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you can call the fdb API. Otherwise you will get ApiException: (403). 
33 | - AWS account 34 | - Python 3.7+ 35 | 36 | - Python requirements 37 | ``` 38 | autogluon==0.4.2 39 | h2o==3.36.1.2 40 | boto3==1.20.21 41 | click==8.0.3 42 | click-plugins==1.1.1 43 | Faker==4.14.2 44 | joblib==1.0.0 45 | kaggle==1.5.12 46 | numpy==1.19.5 47 | pandas==1.1.2 48 | regex==2020.7.14 49 | scikit-learn==0.22.1 50 | scipy==1.5.4 51 | auto-sklearn==0.14.7 52 | dask==2022.8.1 53 | ``` 54 | 55 | ### Step 1: Set up Kaggle CLI 56 | The `FraudDatasetBenchmark` object loads datasets from the source (which in most cases is Kaggle), modifies/standardizes them on the fly, and provides train-test splits. So, the first step is to set up the Kaggle CLI on the machine being used to run Python. 57 | 58 | Use the instructions from the [How to Use Kaggle](https://www.kaggle.com/docs/api) guide. The steps include: 59 | 60 | Remember to download the authentication token from "My Account" on Kaggle, and save the token at `~/.kaggle/kaggle.json` on Linux and OSX, and at `C:\Users\<Windows-username>\.kaggle\kaggle.json` on Windows. If the token is not there, an error will be raised. Hence, once you’ve downloaded the token, you should move it from your Downloads folder to this folder. 61 | 62 | 63 | #### Step 1.2. [Join the IEEE-CIS competition](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you can call `fdb.datasets` with `ieeecis`. Otherwise you will get ApiException: (403). 64 | 65 | 66 | ### Step 2: Clone Repo 67 | Once the Kaggle CLI is set up and installed, clone the GitHub repo using `git clone https://github.com/amazon-research/fraud-dataset-benchmark.git` if using HTTPS, or `git clone git@github.com:amazon-research/fraud-dataset-benchmark.git` if using SSH. 68 | 69 | ### Step 3: Install 70 | Once the repo is cloned, from your terminal, `cd` to the repo and type `pip install .`, which will install the required classes and methods. 
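Before moving on to usage, it can help to confirm that the Kaggle token from Step 1 is where the Kaggle client looks for it by default. The following is a minimal sketch and not part of FDB: the helper names `kaggle_token_path` and `has_kaggle_token` are hypothetical, and the check only covers the default `.kaggle/kaggle.json` location under the home directory described in Step 1.

```python
from pathlib import Path


def kaggle_token_path(home=None):
    """Return the default token location: <home>/.kaggle/kaggle.json (hypothetical helper)."""
    # Path.home() resolves the current user's home directory on Linux, macOS and Windows.
    base = Path(home) if home is not None else Path.home()
    return base / ".kaggle" / "kaggle.json"


def has_kaggle_token(home=None):
    """Return True if a token file exists at the default location."""
    return kaggle_token_path(home).is_file()


if __name__ == "__main__":
    if has_kaggle_token():
        print("Kaggle token found at", kaggle_token_path())
    else:
        print("No token at", kaggle_token_path(), "- download it from your Kaggle account first.")
```

Running the script once before the first `FraudDatasetBenchmark` call gives a clearer message than a failed download, but it does not verify that the token is valid or that you have joined the IEEE-CIS competition.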
71 | 72 | 73 | ## FraudDatasetBenchmark Usage 74 | The usage is straightforward: you create a `dataset` object of the `FraudDatasetBenchmark` class, and extract useful goodies like train/test splits and eval_metrics. 75 | 76 | **Important note**: If you are running multiple experiments that require re-loading dataframes multiple times, the default setting of downloading from Kaggle before loading into a dataframe may exceed the account-level API limits. So, use the setting that persists the downloaded dataset and then load from the persisted data. During the first call of `FraudDatasetBenchmark()`, use `load_pre_downloaded=False, delete_downloaded=False` and for subsequent calls, use `load_pre_downloaded=True, delete_downloaded=False`. The default setting is 77 | `load_pre_downloaded=False, delete_downloaded=True` 78 | ``` 79 | from fdb.datasets import FraudDatasetBenchmark 80 | 81 | # all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud', 'fraudecom', 'twitterbot', 'ipblock'] 82 | key = 'ipblock' 83 | 84 | obj = FraudDatasetBenchmark( 85 | key=key, 86 | load_pre_downloaded=False, # default 87 | delete_downloaded=True, # default 88 | add_random_values_if_real_na = { 89 | "EVENT_TIMESTAMP": True, 90 | "LABEL_TIMESTAMP": True, 91 | "ENTITY_ID": True, 92 | "ENTITY_TYPE": True, 93 | "ENTITY_ID": True, 94 | "EVENT_ID": True 95 | } # default 96 | ) 97 | print(obj.key) 98 | 99 | print('Train set: ') 100 | display(obj.train.head()) 101 | print(len(obj.train.columns)) 102 | print(obj.train.shape) 103 | 104 | print('Test set: ') 105 | display(obj.test.head()) 106 | print(obj.test.shape) 107 | 108 | print('Test scores') 109 | display(obj.test_labels.head()) 110 | print(obj.test_labels['EVENT_LABEL'].value_counts()) 111 | print(obj.train['EVENT_LABEL'].value_counts(normalize=True)) 112 | print('=========') 113 | 114 | ``` 115 | A notebook template for loading a dataset with the FDB data loader is available at 
[scripts/examples/Test_FDB_Loader.ipynb](scripts/examples/Test_FDB_Loader.ipynb) 116 | 117 | ## Reproducibility 118 | Reproducibility scripts are available at [scripts/reproducibility/](scripts/reproducibility/) in respective folders for [afd](scripts/reproducibility/afd), [autogluon](scripts/reproducibility/autogluon) and [h2o](scripts/reproducibility/h2o). Each folder also has a README with steps to reproduce. 119 | 120 | 121 | ## Benchmark Results 122 | 123 | 135 | 136 | | **Dataset key** | **AUC-ROC** | | | | | 137 | |:---------------:|:-----------:|:-----------:|:-------------:|:----------------:|:----------------:| 138 | | | **AFD OFI** | **AFD TFI** | **AutoGluon** | **H2O** | **Auto-sklearn** | 139 | | ccfraud | 0.985 | 0.99 | 0.99 | **0.992** | 0.988 | 140 | | fakejob | 0.987 | - | **0.998** | 0.99 | 0.983 | 141 | | fraudecom | 0.519 | **0.636** | 0.522 | 0.518 | 0.515 | 142 | | ieeecis | 0.938 | **0.94** | 0.855 | 0.89 | 0.932 | 143 | | malurl | 0.985 | - | **0.998** | Training failure | 0.5 | 144 | | sparknov | **0.998** | - | 0.997 | 0.997 | 0.995 | 145 | | twitterbot | 0.934 | - | **0.943** | 0.938 | 0.936 | 146 | | vehicleloan | **0.673** | - | 0.669 | 0.67 | 0.664 | 147 | | ipblock | **0.937** | - | 0.804 | Training failure | 0.5 | 148 | 149 | ### ROC Curves 150 | 151 | The numbers in the legend represent the AUC-ROC of the different models from our baseline evaluations on AutoML. 152 | ![roc curves](images/all_fdb.png) 153 | 154 | 155 | ## Data Sources 156 | 157 | 158 | 1. **IEEE-CIS Fraud Detection** 159 | - Source URL: https://www.kaggle.com/c/ieee-fraud-detection/overview 160 | - Source license: https://www.kaggle.com/competitions/ieee-fraud-detection/rules 161 | - Variables: Anonymized product, card, address, email domain, device and transaction date information. Numeric columns with name prefixes V, C, D and M, whose meaning is hidden from the public. 
162 | - Fraud category: Card Not Present Transaction Fraud 163 | - Provider: [Vesta Corporation](https://www.vesta.io/) 164 | - Release date: 2019-10-03 165 | - Description: Prepared by the IEEE Computational Intelligence Society, this card-not-present transaction fraud dataset was launched during the IEEE-CIS Fraud Detection Kaggle competition, and was provided by Vesta Corporation. The original dataset contains 393 features, which are reduced to 67 features in the benchmark. Feature selection was performed based on highly voted Kaggle kernels. The fraud rate in the training segment of the source dataset is 3.5%. We only used the training files (train transaction and train identity) containing 590,540 transactions in the benchmark, and split them into train (95%) and test (5%) segments based on time. Based on the insights from a Kaggle kernel written by the competition winner, we added a UUID (called ENTITY_ID) that represents a fingerprint and was created using card, address, time and D1 features. 166 | 167 | 2. **Credit Card Fraud Detection** 168 | - Source URL: https://www.kaggle.com/mlg-ulb/creditcardfraud/ 169 | - Source license: https://opendatacommons.org/licenses/dbcl/1-0/ 170 | - Variables: PCA transformed features, time, amount (highly imbalanced) 171 | - Fraud category: Card Not Present Transaction Fraud 172 | - Provider: [Machine Learning Group - ULB](https://mlg.ulb.ac.be/) 173 | - Release date: 2018-03-23 174 | - Description: This dataset contains anonymized credit card transactions by European cardholders in September 2013. The dataset contains 492 frauds out of 284,807 transactions over 2 days. The data only contains numerical features that are the result of a PCA transformation, plus non-transformed time and amount. 175 | 176 | 3. 
**Fraud ecommerce** 177 | - Source URL: https://www.kaggle.com/vbinh002/fraud-ecommerce 178 | - Source license: None 179 | - Variables: The features include sign up time, purchase time, purchase value, device id, user id, browser, and IP address. We added a new feature that measures the time difference between sign up and purchase, as the age of an account is often an important variable in fraud detection. 180 | - Fraud category: Card Not Present Transaction Fraud 181 | - Provider: [Binh Vu](https://www.kaggle.com/vbinh002) 182 | - Release date: 2018-12-09 183 | - Description: This dataset contains ~150k e-commerce transactions. 184 | 185 | 4. **Simulated Credit Card Transactions generated using Sparkov** 186 | - Source URL: https://www.kaggle.com/kartik2112/fraud-detection 187 | - Source license: https://creativecommons.org/publicdomain/zero/1.0/ 188 | - Variables: Transaction date, credit card number, merchant, category, amount, name, street, gender. All variables are synthetically generated using the Sparkov tool. 189 | - Fraud category: Card Not Present Transaction Fraud 190 | - Provider: [Kartik Shenoy](https://www.kaggle.com/kartik2112) 191 | - Release date: 2020-08-05 192 | - Description: This is a simulated credit card transaction dataset. The dataset was generated using the Sparkov Data Generation tool, and we modified a version of the dataset created for Kaggle. It covers transactions of 1000 customers with a pool of 800 merchants over 6 months. We used both train and test segments directly from the source and randomly downsampled the test segment. 193 | 194 | 5. 
**Twitter Bots Accounts** 195 | - Source URL: https://www.kaggle.com/code/davidmartngutirrez/bots-accounts-eda/data?select=twitter_human_bots_dataset.csv 196 | - Source license: https://creativecommons.org/publicdomain/zero/1.0/ 197 | - Variables: Features like account creation date, follower and following counts, profile description, account age, meta data about profile picture and account activity, and a label indicating whether the account is human or bot. 198 | - Fraud category: Bot Attacks 199 | - Provider: [David Martín Gutiérrez](https://www.kaggle.com/davidmartngutirrez) 200 | - Release date: 2020-08-20 201 | - Description: The dataset consists of 37,438 rows corresponding to different user accounts from Twitter. 202 | 203 | 6. **Malicious URLs dataset** 204 | - Source URL: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset 205 | - Source license: https://creativecommons.org/publicdomain/zero/1.0/ 206 | - Variables: The Kaggle dataset is curated using five different sources, and contains url and type. Even though the original dataset has a multiclass label (type), we converted it into a binary label. 207 | - Fraud category: Malicious Traffic 208 | - Provider: [Manu Siddhartha](https://www.kaggle.com/sid321axn) 209 | - Release date: 2021-07-23 210 | - Description: The Kaggle dataset is curated using five different sources, and contains url and type. Even though the original dataset has a multiclass label (type), we converted it into a binary label. There is no timestamp information from the source. Therefore, we generate a dummy timestamp column for consistency. 211 | 212 | 7. **Real / Fake Job Posting Prediction** 213 | - Source URL: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction 214 | - Source license: https://creativecommons.org/publicdomain/zero/1.0/ 215 | - Variables: Title, location, department, company, salary range, requirements, description, benefits, telecommuting. 
Most of the variables are categorical and free-form text in nature. 216 | - Fraud category: Content Moderation 217 | - Provider: [Shivam Bansal](https://www.kaggle.com/shivamb) 218 | - Release date: 2020-02-29 219 | - Description: This Kaggle dataset contains 18K job descriptions, out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The task is to train a classification model to detect which job posts are fraudulent. 220 | 221 | 8. **Vehicle Loan Default Prediction** 222 | - Source URL: https://www.kaggle.com/avikpaul4u/vehicle-loan-default-prediction 223 | - Source license: Unknown 224 | - Variables: Loanee information, loan information, credit bureau data, and history. 225 | - Fraud category: Credit Risk 226 | - Provider: [Avik Paul](https://www.kaggle.com/avikpaul4u) 227 | - Release date: 2019-11-12 228 | - Description: The task in this dataset is to determine the probability of vehicle loan default, particularly the risk of default on the first monthly installments. It contains data for 233k loans with a 21.7% default rate. 229 | 230 | 9. **IP Blocklist** 231 | - Source URL: http://cinsscore.com/list/ci-badguys.txt 232 | - Source license: Unknown 233 | - Variables: The dataset contains an IP address and a label indicating whether it is malicious or fake. A dummy categorical variable that has no relation to the label is added. 234 | - Fraud category: Malicious Traffic 235 | - Provider: [CINSscore.com](http://cinsscore.com) 236 | - Release date: 2017-09-25 237 | - Description: This dataset is built from malicious IP addresses from cinsscore.com. To the list of malicious IP addresses, we added randomly generated IP addresses, created using Faker and labeled as benign. 
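The construction of the IP Blocklist dataset described above can be sketched as follows. This is only an illustration under stated assumptions, not the FDB implementation: it uses Python's standard `random` and `ipaddress` modules in place of Faker, the function name `build_ip_dataset` and the column name `ip_address` are hypothetical, and the sample addresses come from reserved documentation ranges rather than from the real blocklist. Labels follow the FRAUD = 1, LEGIT = 0 convention used in the AFD configs.

```python
import ipaddress
import random


def build_ip_dataset(malicious_ips, n_benign, seed=0):
    """Combine known-malicious IPs (label 1) with random IPv4 addresses (label 0)."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    known = set(malicious_ips)
    rows = [{"ip_address": ip, "EVENT_LABEL": 1} for ip in malicious_ips]
    benign = set()
    while len(benign) < n_benign:
        # Draw a random 32-bit integer and format it as a dotted-quad IPv4 address.
        candidate = str(ipaddress.IPv4Address(rng.getrandbits(32)))
        if candidate not in known:  # never relabel a blocklisted address as benign
            benign.add(candidate)
    rows.extend({"ip_address": ip, "EVENT_LABEL": 0} for ip in sorted(benign))
    rng.shuffle(rows)  # mix the two classes so row order carries no label signal
    return rows


if __name__ == "__main__":
    # Placeholder addresses from documentation ranges, not real blocklist entries.
    for row in build_ip_dataset(["192.0.2.10", "198.51.100.7"], n_benign=4):
        print(row)
```

A randomly drawn address is not guaranteed to be benign in reality; the label only records how the row was produced, which is worth keeping in mind when interpreting scores on `ipblock`.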
238 | 239 | 240 | ## Citation 241 | ``` 242 | @misc{grover2023fraud, 243 | title={Fraud Dataset Benchmark and Applications}, 244 | author={Prince Grover and Julia Xu and Justin Tittelfitz and Anqi Cheng and Zheng Li and Jakub Zablocki and Jianbo Liu and Hao Zhou}, 245 | year={2023}, 246 | eprint={2208.14417}, 247 | archivePrefix={arXiv}, 248 | primaryClass={cs.LG} 249 | } 250 | ``` 251 | 252 | ## License 253 | This project is licensed under the MIT-0 License. 254 | 255 | 256 | ## Acknowledgement 257 | We thank the creators of all datasets used in the benchmark and the organizations that have helped in hosting the datasets and making them widely available for research purposes. 258 | 259 | 260 | 261 | 262 | 263 | -------------------------------------------------------------------------------- /images/afd_steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/images/afd_steps.png -------------------------------------------------------------------------------- /images/all_fdb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/images/all_fdb.png -------------------------------------------------------------------------------- /images/ieee_ofi_sample.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/images/ieee_ofi_sample.png -------------------------------------------------------------------------------- /scripts/reproducibility/afd/README.md: -------------------------------------------------------------------------------- 1 | ## Steps to reproduce AFD models 2 | Amazon Fraud Detector (AFD) models can be either run via AWS Console or 
using API calls. In this folder, we provide scripts that make API calls to create model artifacts and then to score the model on test data. 3 | 4 | High-level steps to train and deploy a model are: 5 | 6 | ![afd steps](../../../images/afd_steps.png) 7 | 8 | You can use the provided scripts to replicate the performance shown in the benchmark. 9 | 10 | 1. Set up AWS credentials in the terminal for the AWS account where you want to run AFD and store the data. You can use environment variables, as described in the [AWS CLI documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html) 11 | 12 | 13 | 2. Use the [template data-loader notebook](../../examples/Test_FDB_Loader.ipynb) to upload the benchmark data to S3. (AFD requires data to be saved in S3 and requires an S3 path.) 14 | 15 | 16 | 3. Create AFD resources including entities, event types, and model. Update the values of `IAM_ROLE`, `BUCKET`, `KEY` and `MODEL_NAME` in `create_afd_resources.py`, then run the following. 17 | 18 | ``` 19 | python create_afd_resources.py configs/{dataset-you-want-to-use} 20 | ``` 21 | 22 | You can set `MODEL_TYPE` to **ONLINE_FRAUD_INSIGHTS** or **TRANSACTION_FRAUD_INSIGHTS** to run the corresponding model. 23 | 24 | This will initiate automatic model training. Wait ~1 hour for the models to train. You can check the status in your console. 25 | 26 | 4. Create a detector and use it to score the test data. Update the values of `IAM_ROLE`, `BUCKET`, `TEST_PATH`, `TEST_LABELS_PATH` and `MODEL_NAME` in `score_afd_model.py`, then run the following. 27 | 28 | ``` 29 | python score_afd_model.py 30 | ``` 31 | This will print performance metrics in the terminal and save them in the S3 location you provide in the script. 32 | 33 | After model training is completed, the AFD console shows performance metrics like the following (trained on `ieeecis` with ONLINE_FRAUD_INSIGHTS). 
34 | 35 | ![ieee ofi sample](../../../images/ieee_ofi_sample.png) 36 | 37 | 38 | 39 | **In order to fully deep dive into working of Amazon Fraud Detector, [here](https://d1.awsstatic.com/fraud-detector/afd-technical-guide-detecting-new-account-fraud.pdf) is the link to technical guide.** 40 | 41 | -------------------------------------------------------------------------------- /scripts/reproducibility/afd/configs/CreditCardFraudDetection.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": "Credit Card Fraud Detection", 3 | "variable_mappings": [ 4 | { 5 | "variable_name": "v1", 6 | "variable_type": "NUMERIC", 7 | "data_type": "FLOAT" 8 | }, 9 | { 10 | "variable_name": "v2", 11 | "variable_type": "NUMERIC", 12 | "data_type": "FLOAT" 13 | }, 14 | { 15 | "variable_name": "v3", 16 | "variable_type": "NUMERIC", 17 | "data_type": "FLOAT" 18 | }, 19 | { 20 | "variable_name": "v4", 21 | "variable_type": "NUMERIC", 22 | "data_type": "FLOAT" 23 | }, 24 | { 25 | "variable_name": "v5", 26 | "variable_type": "NUMERIC", 27 | "data_type": "FLOAT" 28 | }, 29 | { 30 | "variable_name": "v6", 31 | "variable_type": "NUMERIC", 32 | "data_type": "FLOAT" 33 | }, 34 | { 35 | "variable_name": "v7", 36 | "variable_type": "NUMERIC", 37 | "data_type": "FLOAT" 38 | }, 39 | { 40 | "variable_name": "v8", 41 | "variable_type": "NUMERIC", 42 | "data_type": "FLOAT" 43 | }, 44 | { 45 | "variable_name": "v9", 46 | "variable_type": "NUMERIC", 47 | "data_type": "FLOAT" 48 | }, 49 | { 50 | "variable_name": "v10", 51 | "variable_type": "NUMERIC", 52 | "data_type": "FLOAT" 53 | }, 54 | { 55 | "variable_name": "v11", 56 | "variable_type": "NUMERIC", 57 | "data_type": "FLOAT" 58 | }, 59 | { 60 | "variable_name": "v12", 61 | "variable_type": "NUMERIC", 62 | "data_type": "FLOAT" 63 | }, 64 | { 65 | "variable_name": "v13", 66 | "variable_type": "NUMERIC", 67 | "data_type": "FLOAT" 68 | }, 69 | { 70 | "variable_name": "v14", 71 | "variable_type": 
"NUMERIC", 72 | "data_type": "FLOAT" 73 | }, 74 | { 75 | "variable_name": "v15", 76 | "variable_type": "NUMERIC", 77 | "data_type": "FLOAT" 78 | }, 79 | { 80 | "variable_name": "v16", 81 | "variable_type": "NUMERIC", 82 | "data_type": "FLOAT" 83 | }, 84 | { 85 | "variable_name": "v17", 86 | "variable_type": "NUMERIC", 87 | "data_type": "FLOAT" 88 | }, 89 | { 90 | "variable_name": "v18", 91 | "variable_type": "NUMERIC", 92 | "data_type": "FLOAT" 93 | }, 94 | { 95 | "variable_name": "v19", 96 | "variable_type": "NUMERIC", 97 | "data_type": "FLOAT" 98 | }, 99 | { 100 | "variable_name": "v20", 101 | "variable_type": "NUMERIC", 102 | "data_type": "FLOAT" 103 | }, 104 | { 105 | "variable_name": "v21", 106 | "variable_type": "NUMERIC", 107 | "data_type": "FLOAT" 108 | }, 109 | { 110 | "variable_name": "v22", 111 | "variable_type": "NUMERIC", 112 | "data_type": "FLOAT" 113 | }, 114 | { 115 | "variable_name": "v23", 116 | "variable_type": "NUMERIC", 117 | "data_type": "FLOAT" 118 | }, 119 | { 120 | "variable_name": "v24", 121 | "variable_type": "NUMERIC", 122 | "data_type": "FLOAT" 123 | }, 124 | { 125 | "variable_name": "v25", 126 | "variable_type": "NUMERIC", 127 | "data_type": "FLOAT" 128 | }, 129 | { 130 | "variable_name": "v26", 131 | "variable_type": "NUMERIC", 132 | "data_type": "FLOAT" 133 | }, 134 | { 135 | "variable_name": "v27", 136 | "variable_type": "NUMERIC", 137 | "data_type": "FLOAT" 138 | }, 139 | { 140 | "variable_name": "v28", 141 | "variable_type": "NUMERIC", 142 | "data_type": "FLOAT" 143 | }, 144 | { 145 | "variable_name": "amount", 146 | "variable_type": "NUMERIC", 147 | "data_type": "FLOAT" 148 | } 149 | ], 150 | "label_mappings": { 151 | "FRAUD": [ 152 | "1" 153 | ], 154 | "LEGIT": [ 155 | "0" 156 | ] 157 | } 158 | } -------------------------------------------------------------------------------- /scripts/reproducibility/afd/configs/FakeJobPostingPrediction.json: -------------------------------------------------------------------------------- 1 | { 
2 | "dataset": "Fake Job Posting Prediction", 3 | "variable_mappings": [ 4 | { 5 | "variable_name": "title", 6 | "variable_type": "FREE_FORM_TEXT", 7 | "data_type": "STRING" 8 | }, 9 | { 10 | "variable_name": "location", 11 | "variable_type": "CATEGORICAL", 12 | "data_type": "STRING" 13 | }, 14 | { 15 | "variable_name": "department", 16 | "variable_type": "CATEGORICAL", 17 | "data_type": "STRING" 18 | }, 19 | { 20 | "variable_name": "salary_range", 21 | "variable_type": "CATEGORICAL", 22 | "data_type": "STRING" 23 | }, 24 | { 25 | "variable_name": "company_profile", 26 | "variable_type": "FREE_FORM_TEXT", 27 | "data_type": "STRING" 28 | }, 29 | { 30 | "variable_name": "description", 31 | "variable_type": "FREE_FORM_TEXT", 32 | "data_type": "STRING" 33 | }, 34 | { 35 | "variable_name": "requirements", 36 | "variable_type": "FREE_FORM_TEXT", 37 | "data_type": "STRING" 38 | }, 39 | { 40 | "variable_name": "benefits", 41 | "variable_type": "FREE_FORM_TEXT", 42 | "data_type": "STRING" 43 | }, 44 | { 45 | "variable_name": "telecommuting", 46 | "variable_type": "NUMERIC", 47 | "data_type": "FLOAT" 48 | }, 49 | { 50 | "variable_name": "has_company_logo", 51 | "variable_type": "CATEGORICAL", 52 | "data_type": "STRING" 53 | }, 54 | { 55 | "variable_name": "has_questions", 56 | "variable_type": "CATEGORICAL", 57 | "data_type": "STRING" 58 | }, 59 | { 60 | "variable_name": "employment_type", 61 | "variable_type": "CATEGORICAL", 62 | "data_type": "STRING" 63 | }, 64 | { 65 | "variable_name": "required_experience", 66 | "variable_type": "CATEGORICAL", 67 | "data_type": "STRING" 68 | }, 69 | { 70 | "variable_name": "required_education", 71 | "variable_type": "CATEGORICAL", 72 | "data_type": "STRING" 73 | }, 74 | { 75 | "variable_name": "industry", 76 | "variable_type": "CATEGORICAL", 77 | "data_type": "STRING" 78 | }, 79 | { 80 | "variable_name": "function", 81 | "variable_type": "CATEGORICAL", 82 | "data_type": "STRING" 83 | } 84 | ], 85 | "label_mappings": { 86 | "FRAUD": [ 87 
| "1" 88 | ], 89 | "LEGIT": [ 90 | "0" 91 | ] 92 | } 93 | } -------------------------------------------------------------------------------- /scripts/reproducibility/afd/configs/Fraudecommerce.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": "Fraud ecommerce", 3 | "variable_mappings": [ 4 | { 5 | "variable_name": "purchase_value", 6 | "variable_type": "NUMERIC", 7 | "data_type": "FLOAT" 8 | }, 9 | { 10 | "variable_name": "source", 11 | "variable_type": "CATEGORICAL", 12 | "data_type": "STRING" 13 | }, 14 | { 15 | "variable_name": "browser", 16 | "variable_type": "CATEGORICAL", 17 | "data_type": "STRING" 18 | }, 19 | { 20 | "variable_name": "age", 21 | "variable_type": "NUMERIC", 22 | "data_type": "FLOAT" 23 | }, 24 | { 25 | "variable_name": "ip_address", 26 | "variable_type": "IP_ADDRESS", 27 | "data_type": "FLOAT" 28 | }, 29 | { 30 | "variable_name": "time_since_signup", 31 | "variable_type": "NUMERIC", 32 | "data_type": "FLOAT" 33 | } 34 | ], 35 | "label_mappings": { 36 | "FRAUD": [ 37 | "1" 38 | ], 39 | "LEGIT": [ 40 | "0" 41 | ] 42 | } 43 | } -------------------------------------------------------------------------------- /scripts/reproducibility/afd/configs/IEEECISFraudDetection.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": "IEEE-CIS Fraud Detection", 3 | "variable_mappings": [ 4 | { 5 | "variable_name": "transactionamt", 6 | "variable_type": "NUMERIC", 7 | "data_type": "FLOAT" 8 | }, 9 | { 10 | "variable_name": "productcd", 11 | "variable_type": "CATEGORICAL", 12 | "data_type": "STRING" 13 | }, 14 | { 15 | "variable_name": "card1", 16 | "variable_type": "NUMERIC", 17 | "data_type": "FLOAT" 18 | }, 19 | { 20 | "variable_name": "card2", 21 | "variable_type": "NUMERIC", 22 | "data_type": "FLOAT" 23 | }, 24 | { 25 | "variable_name": "card3", 26 | "variable_type": "NUMERIC", 27 | "data_type": "FLOAT" 28 | }, 29 | { 30 | "variable_name": "card5", 
31 | "variable_type": "NUMERIC", 32 | "data_type": "FLOAT" 33 | }, 34 | { 35 | "variable_name": "card6", 36 | "variable_type": "CATEGORICAL", 37 | "data_type": "STRING" 38 | }, 39 | { 40 | "variable_name": "addr1", 41 | "variable_type": "NUMERIC", 42 | "data_type": "FLOAT" 43 | }, 44 | { 45 | "variable_name": "dist1", 46 | "variable_type": "NUMERIC", 47 | "data_type": "FLOAT" 48 | }, 49 | { 50 | "variable_name": "p_emaildomain", 51 | "variable_type": "CATEGORICAL", 52 | "data_type": "STRING" 53 | }, 54 | { 55 | "variable_name": "r_emaildomain", 56 | "variable_type": "CATEGORICAL", 57 | "data_type": "STRING" 58 | }, 59 | { 60 | "variable_name": "c1", 61 | "variable_type": "NUMERIC", 62 | "data_type": "FLOAT" 63 | }, 64 | { 65 | "variable_name": "c2", 66 | "variable_type": "NUMERIC", 67 | "data_type": "FLOAT" 68 | }, 69 | { 70 | "variable_name": "c4", 71 | "variable_type": "NUMERIC", 72 | "data_type": "FLOAT" 73 | }, 74 | { 75 | "variable_name": "c5", 76 | "variable_type": "NUMERIC", 77 | "data_type": "FLOAT" 78 | }, 79 | { 80 | "variable_name": "c6", 81 | "variable_type": "NUMERIC", 82 | "data_type": "FLOAT" 83 | }, 84 | { 85 | "variable_name": "c7", 86 | "variable_type": "NUMERIC", 87 | "data_type": "FLOAT" 88 | }, 89 | { 90 | "variable_name": "c8", 91 | "variable_type": "NUMERIC", 92 | "data_type": "FLOAT" 93 | }, 94 | { 95 | "variable_name": "c9", 96 | "variable_type": "NUMERIC", 97 | "data_type": "FLOAT" 98 | }, 99 | { 100 | "variable_name": "c10", 101 | "variable_type": "NUMERIC", 102 | "data_type": "FLOAT" 103 | }, 104 | { 105 | "variable_name": "c11", 106 | "variable_type": "NUMERIC", 107 | "data_type": "FLOAT" 108 | }, 109 | { 110 | "variable_name": "c12", 111 | "variable_type": "NUMERIC", 112 | "data_type": "FLOAT" 113 | }, 114 | { 115 | "variable_name": "c13", 116 | "variable_type": "NUMERIC", 117 | "data_type": "FLOAT" 118 | }, 119 | { 120 | "variable_name": "c14", 121 | "variable_type": "NUMERIC", 122 | "data_type": "FLOAT" 123 | }, 124 | { 125 | 
"variable_name": "v62", 126 | "variable_type": "NUMERIC", 127 | "data_type": "FLOAT" 128 | }, 129 | { 130 | "variable_name": "v70", 131 | "variable_type": "NUMERIC", 132 | "data_type": "FLOAT" 133 | }, 134 | { 135 | "variable_name": "v76", 136 | "variable_type": "NUMERIC", 137 | "data_type": "FLOAT" 138 | }, 139 | { 140 | "variable_name": "v78", 141 | "variable_type": "NUMERIC", 142 | "data_type": "FLOAT" 143 | }, 144 | { 145 | "variable_name": "v82", 146 | "variable_type": "NUMERIC", 147 | "data_type": "FLOAT" 148 | }, 149 | { 150 | "variable_name": "v91", 151 | "variable_type": "NUMERIC", 152 | "data_type": "FLOAT" 153 | }, 154 | { 155 | "variable_name": "v127", 156 | "variable_type": "NUMERIC", 157 | "data_type": "FLOAT" 158 | }, 159 | { 160 | "variable_name": "v130", 161 | "variable_type": "NUMERIC", 162 | "data_type": "FLOAT" 163 | }, 164 | { 165 | "variable_name": "v139", 166 | "variable_type": "NUMERIC", 167 | "data_type": "FLOAT" 168 | }, 169 | { 170 | "variable_name": "v160", 171 | "variable_type": "NUMERIC", 172 | "data_type": "FLOAT" 173 | }, 174 | { 175 | "variable_name": "v165", 176 | "variable_type": "NUMERIC", 177 | "data_type": "FLOAT" 178 | }, 179 | { 180 | "variable_name": "v187", 181 | "variable_type": "NUMERIC", 182 | "data_type": "FLOAT" 183 | }, 184 | { 185 | "variable_name": "v203", 186 | "variable_type": "NUMERIC", 187 | "data_type": "FLOAT" 188 | }, 189 | { 190 | "variable_name": "v207", 191 | "variable_type": "NUMERIC", 192 | "data_type": "FLOAT" 193 | }, 194 | { 195 | "variable_name": "v209", 196 | "variable_type": "NUMERIC", 197 | "data_type": "FLOAT" 198 | }, 199 | { 200 | "variable_name": "v210", 201 | "variable_type": "NUMERIC", 202 | "data_type": "FLOAT" 203 | }, 204 | { 205 | "variable_name": "v221", 206 | "variable_type": "NUMERIC", 207 | "data_type": "FLOAT" 208 | }, 209 | { 210 | "variable_name": "v234", 211 | "variable_type": "NUMERIC", 212 | "data_type": "FLOAT" 213 | }, 214 | { 215 | "variable_name": "v257", 216 | 
"variable_type": "NUMERIC", 217 | "data_type": "FLOAT" 218 | }, 219 | { 220 | "variable_name": "v258", 221 | "variable_type": "NUMERIC", 222 | "data_type": "FLOAT" 223 | }, 224 | { 225 | "variable_name": "v261", 226 | "variable_type": "NUMERIC", 227 | "data_type": "FLOAT" 228 | }, 229 | { 230 | "variable_name": "v264", 231 | "variable_type": "NUMERIC", 232 | "data_type": "FLOAT" 233 | }, 234 | { 235 | "variable_name": "v266", 236 | "variable_type": "NUMERIC", 237 | "data_type": "FLOAT" 238 | }, 239 | { 240 | "variable_name": "v267", 241 | "variable_type": "NUMERIC", 242 | "data_type": "FLOAT" 243 | }, 244 | { 245 | "variable_name": "v271", 246 | "variable_type": "NUMERIC", 247 | "data_type": "FLOAT" 248 | }, 249 | { 250 | "variable_name": "v274", 251 | "variable_type": "NUMERIC", 252 | "data_type": "FLOAT" 253 | }, 254 | { 255 | "variable_name": "v277", 256 | "variable_type": "NUMERIC", 257 | "data_type": "FLOAT" 258 | }, 259 | { 260 | "variable_name": "v283", 261 | "variable_type": "NUMERIC", 262 | "data_type": "FLOAT" 263 | }, 264 | { 265 | "variable_name": "v285", 266 | "variable_type": "NUMERIC", 267 | "data_type": "FLOAT" 268 | }, 269 | { 270 | "variable_name": "v289", 271 | "variable_type": "NUMERIC", 272 | "data_type": "FLOAT" 273 | }, 274 | { 275 | "variable_name": "v291", 276 | "variable_type": "NUMERIC", 277 | "data_type": "FLOAT" 278 | }, 279 | { 280 | "variable_name": "v294", 281 | "variable_type": "NUMERIC", 282 | "data_type": "FLOAT" 283 | }, 284 | { 285 | "variable_name": "id_01", 286 | "variable_type": "NUMERIC", 287 | "data_type": "FLOAT" 288 | }, 289 | { 290 | "variable_name": "id_02", 291 | "variable_type": "NUMERIC", 292 | "data_type": "FLOAT" 293 | }, 294 | { 295 | "variable_name": "id_05", 296 | "variable_type": "NUMERIC", 297 | "data_type": "FLOAT" 298 | }, 299 | { 300 | "variable_name": "id_06", 301 | "variable_type": "NUMERIC", 302 | "data_type": "FLOAT" 303 | }, 304 | { 305 | "variable_name": "id_09", 306 | "variable_type": "NUMERIC", 307 
| "data_type": "FLOAT" 308 | }, 309 | { 310 | "variable_name": "id_13", 311 | "variable_type": "NUMERIC", 312 | "data_type": "FLOAT" 313 | }, 314 | { 315 | "variable_name": "id_17", 316 | "variable_type": "NUMERIC", 317 | "data_type": "FLOAT" 318 | }, 319 | { 320 | "variable_name": "id_19", 321 | "variable_type": "NUMERIC", 322 | "data_type": "FLOAT" 323 | }, 324 | { 325 | "variable_name": "id_20", 326 | "variable_type": "NUMERIC", 327 | "data_type": "FLOAT" 328 | }, 329 | { 330 | "variable_name": "devicetype", 331 | "variable_type": "CATEGORICAL", 332 | "data_type": "STRING" 333 | }, 334 | { 335 | "variable_name": "deviceinfo", 336 | "variable_type": "CATEGORICAL", 337 | "data_type": "STRING" 338 | } 339 | ], 340 | "label_mappings": { 341 | "FRAUD": [ 342 | "1" 343 | ], 344 | "LEGIT": [ 345 | "0" 346 | ] 347 | } 348 | } -------------------------------------------------------------------------------- /scripts/reproducibility/afd/configs/IPBlocklist.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": "IP-BlockList", 3 | "variable_mappings": [ 4 | { 5 | "variable_name": "ip", 6 | "variable_type": "IP_ADDRESS", 7 | "data_type": "STRING" 8 | }, 9 | { 10 | "variable_name": "dummy_cat", 11 | "variable_type": "CATEGORICAL", 12 | "data_type": "STRING" 13 | } 14 | ], 15 | "label_mappings": { 16 | "FRAUD": [ 17 | "1" 18 | ], 19 | "LEGIT": [ 20 | "0" 21 | ] 22 | } 23 | } -------------------------------------------------------------------------------- /scripts/reproducibility/afd/configs/MaliciousURL.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": "Malicious URLs Dataset", 3 | "variable_mappings": [ 4 | { 5 | "variable_name": "url", 6 | "variable_type": "FREE_FORM_TEXT", 7 | "data_type": "STRING" 8 | }, 9 | { 10 | "variable_name": "dummy_cat", 11 | "variable_type": "CATEGORICAL", 12 | "data_type": "STRING" 13 | } 14 | ], 15 | "label_mappings": { 16 | "FRAUD": [ 
17 | "malignant" 18 | ], 19 | "LEGIT": [ 20 | "benign" 21 | ] 22 | } 23 | } -------------------------------------------------------------------------------- /scripts/reproducibility/afd/configs/SimulatedCreditCardTransactionsSparkov.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": "Simulated Credit Card Transactions generated using Sparkov", 3 | "variable_mappings": [ 4 | { 5 | "variable_name": "cc_num", 6 | "variable_type": "CARD_BIN", 7 | "data_type": "INTEGER" 8 | }, 9 | { 10 | "variable_name": "category", 11 | "variable_type": "CATEGORICAL", 12 | "data_type": "STRING" 13 | }, 14 | { 15 | "variable_name": "amt", 16 | "variable_type": "NUMERIC", 17 | "data_type": "FLOAT" 18 | }, 19 | { 20 | "variable_name": "first", 21 | "variable_type": "BILLING_NAME", 22 | "data_type": "STRING" 23 | }, 24 | { 25 | "variable_name": "last", 26 | "variable_type": "BILLING_NAME", 27 | "data_type": "STRING" 28 | }, 29 | { 30 | "variable_name": "gender", 31 | "variable_type": "CATEGORICAL", 32 | "data_type": "STRING" 33 | }, 34 | { 35 | "variable_name": "street", 36 | "variable_type": "BILLING_ADDRESS_L1", 37 | "data_type": "STRING" 38 | }, 39 | { 40 | "variable_name": "city", 41 | "variable_type": "BILLING_CITY", 42 | "data_type": "STRING" 43 | }, 44 | { 45 | "variable_name": "state", 46 | "variable_type": "BILLING_STATE", 47 | "data_type": "STRING" 48 | }, 49 | { 50 | "variable_name": "zip", 51 | "variable_type": "BILLING_ZIP", 52 | "data_type": "STRING" 53 | }, 54 | { 55 | "variable_name": "lat", 56 | "variable_type": "NUMERIC", 57 | "data_type": "FLOAT" 58 | }, 59 | { 60 | "variable_name": "long", 61 | "variable_type": "NUMERIC", 62 | "data_type": "FLOAT" 63 | }, 64 | { 65 | "variable_name": "city_pop", 66 | "variable_type": "NUMERIC", 67 | "data_type": "FLOAT" 68 | }, 69 | { 70 | "variable_name": "job", 71 | "variable_type": "CATEGORICAL", 72 | "data_type": "STRING" 73 | }, 74 | { 75 | "variable_name": "dob", 76 | 
"variable_type": "FREE_FORM_TEXT", 77 | "data_type": "STRING" 78 | }, 79 | { 80 | "variable_name": "merch_lat", 81 | "variable_type": "NUMERIC", 82 | "data_type": "FLOAT" 83 | }, 84 | { 85 | "variable_name": "merch_long", 86 | "variable_type": "NUMERIC", 87 | "data_type": "FLOAT" 88 | } 89 | ], 90 | "label_mappings": { 91 | "FRAUD": [ 92 | "1" 93 | ], 94 | "LEGIT": [ 95 | "0" 96 | ] 97 | } 98 | } -------------------------------------------------------------------------------- /scripts/reproducibility/afd/configs/TwitterBotAccounts.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": "Twitter Bots Accounts", 3 | "variable_mappings": [ 4 | { 5 | "variable_name": "default_profile", 6 | "variable_type": "CATEGORICAL", 7 | "data_type": "STRING" 8 | }, 9 | { 10 | "variable_name": "default_profile_image", 11 | "variable_type": "CATEGORICAL", 12 | "data_type": "STRING" 13 | }, 14 | { 15 | "variable_name": "description", 16 | "variable_type": "FREE_FORM_TEXT", 17 | "data_type": "STRING" 18 | }, 19 | { 20 | "variable_name": "favourites_count", 21 | "variable_type": "NUMERIC", 22 | "data_type": "FLOAT" 23 | }, 24 | { 25 | "variable_name": "followers_count", 26 | "variable_type": "NUMERIC", 27 | "data_type": "FLOAT" 28 | }, 29 | { 30 | "variable_name": "friends_count", 31 | "variable_type": "NUMERIC", 32 | "data_type": "FLOAT" 33 | }, 34 | { 35 | "variable_name": "geo_enabled", 36 | "variable_type": "CATEGORICAL", 37 | "data_type": "STRING" 38 | }, 39 | { 40 | "variable_name": "lang", 41 | "variable_type": "CATEGORICAL", 42 | "data_type": "STRING" 43 | }, 44 | { 45 | "variable_name": "location", 46 | "variable_type": "FREE_FORM_TEXT", 47 | "data_type": "STRING" 48 | }, 49 | { 50 | "variable_name": "profile_background_image_url", 51 | "variable_type": "FREE_FORM_TEXT", 52 | "data_type": "STRING" 53 | }, 54 | { 55 | "variable_name": "profile_image_url", 56 | "variable_type": "FREE_FORM_TEXT", 57 | "data_type": "STRING" 58 | 
}, 59 | { 60 | "variable_name": "screen_name", 61 | "variable_type": "CATEGORICAL", 62 | "data_type": "STRING" 63 | }, 64 | { 65 | "variable_name": "statuses_count", 66 | "variable_type": "NUMERIC", 67 | "data_type": "FLOAT" 68 | }, 69 | { 70 | "variable_name": "verified", 71 | "variable_type": "CATEGORICAL", 72 | "data_type": "STRING" 73 | }, 74 | { 75 | "variable_name": "average_tweets_per_day", 76 | "variable_type": "NUMERIC", 77 | "data_type": "FLOAT" 78 | }, 79 | { 80 | "variable_name": "account_age_days", 81 | "variable_type": "NUMERIC", 82 | "data_type": "FLOAT" 83 | } 84 | ], 85 | "label_mappings": { 86 | "FRAUD": [ 87 | "bot" 88 | ], 89 | "LEGIT": [ 90 | "human" 91 | ] 92 | } 93 | } -------------------------------------------------------------------------------- /scripts/reproducibility/afd/configs/VehicleLoanDefaultPrediction.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": "Vehicle Loan Default Prediction", 3 | "variable_mappings": [ 4 | { 5 | "variable_name": "disbursed_amount", 6 | "variable_type": "NUMERIC", 7 | "data_type": "FLOAT" 8 | }, 9 | { 10 | "variable_name": "asset_cost", 11 | "variable_type": "NUMERIC", 12 | "data_type": "FLOAT" 13 | }, 14 | { 15 | "variable_name": "ltv", 16 | "variable_type": "NUMERIC", 17 | "data_type": "FLOAT" 18 | }, 19 | { 20 | "variable_name": "branch_id", 21 | "variable_type": "CATEGORICAL", 22 | "data_type": "STRING" 23 | }, 24 | { 25 | "variable_name": "supplier_id", 26 | "variable_type": "CATEGORICAL", 27 | "data_type": "STRING" 28 | }, 29 | { 30 | "variable_name": "manufacturer_id", 31 | "variable_type": "CATEGORICAL", 32 | "data_type": "STRING" 33 | }, 34 | { 35 | "variable_name": "current_pincode_id", 36 | "variable_type": "CATEGORICAL", 37 | "data_type": "STRING" 38 | }, 39 | { 40 | "variable_name": "date_of_birth", 41 | "variable_type": "FREE_FORM_TEXT", 42 | "data_type": "STRING" 43 | }, 44 | { 45 | "variable_name": "employment_type", 46 | 
"variable_type": "CATEGORICAL", 47 | "data_type": "STRING" 48 | }, 49 | { 50 | "variable_name": "state_id", 51 | "variable_type": "CATEGORICAL", 52 | "data_type": "STRING" 53 | }, 54 | { 55 | "variable_name": "employee_code_id", 56 | "variable_type": "CATEGORICAL", 57 | "data_type": "STRING" 58 | }, 59 | { 60 | "variable_name": "mobileno_avl_flag", 61 | "variable_type": "CATEGORICAL", 62 | "data_type": "STRING" 63 | }, 64 | { 65 | "variable_name": "aadhar_flag", 66 | "variable_type": "CATEGORICAL", 67 | "data_type": "STRING" 68 | }, 69 | { 70 | "variable_name": "pan_flag", 71 | "variable_type": "CATEGORICAL", 72 | "data_type": "STRING" 73 | }, 74 | { 75 | "variable_name": "voterid_flag", 76 | "variable_type": "CATEGORICAL", 77 | "data_type": "STRING" 78 | }, 79 | { 80 | "variable_name": "driving_flag", 81 | "variable_type": "CATEGORICAL", 82 | "data_type": "STRING" 83 | }, 84 | { 85 | "variable_name": "passport_flag", 86 | "variable_type": "CATEGORICAL", 87 | "data_type": "STRING" 88 | }, 89 | { 90 | "variable_name": "perform_cns_score", 91 | "variable_type": "NUMERIC", 92 | "data_type": "FLOAT" 93 | }, 94 | { 95 | "variable_name": "perform_cns_score_description", 96 | "variable_type": "FREE_FORM_TEXT", 97 | "data_type": "STRING" 98 | }, 99 | { 100 | "variable_name": "pri_no_of_accts", 101 | "variable_type": "NUMERIC", 102 | "data_type": "FLOAT" 103 | }, 104 | { 105 | "variable_name": "pri_active_accts", 106 | "variable_type": "NUMERIC", 107 | "data_type": "FLOAT" 108 | }, 109 | { 110 | "variable_name": "pri_overdue_accts", 111 | "variable_type": "NUMERIC", 112 | "data_type": "FLOAT" 113 | }, 114 | { 115 | "variable_name": "pri_current_balance", 116 | "variable_type": "NUMERIC", 117 | "data_type": "FLOAT" 118 | }, 119 | { 120 | "variable_name": "pri_sanctioned_amount", 121 | "variable_type": "NUMERIC", 122 | "data_type": "FLOAT" 123 | }, 124 | { 125 | "variable_name": "pri_disbursed_amount", 126 | "variable_type": "NUMERIC", 127 | "data_type": "FLOAT" 128 | }, 129 
| { 130 | "variable_name": "sec_no_of_accts", 131 | "variable_type": "NUMERIC", 132 | "data_type": "FLOAT" 133 | }, 134 | { 135 | "variable_name": "sec_active_accts", 136 | "variable_type": "NUMERIC", 137 | "data_type": "FLOAT" 138 | }, 139 | { 140 | "variable_name": "sec_overdue_accts", 141 | "variable_type": "NUMERIC", 142 | "data_type": "FLOAT" 143 | }, 144 | { 145 | "variable_name": "sec_current_balance", 146 | "variable_type": "NUMERIC", 147 | "data_type": "FLOAT" 148 | }, 149 | { 150 | "variable_name": "sec_sanctioned_amount", 151 | "variable_type": "NUMERIC", 152 | "data_type": "FLOAT" 153 | }, 154 | { 155 | "variable_name": "sec_disbursed_amount", 156 | "variable_type": "NUMERIC", 157 | "data_type": "FLOAT" 158 | }, 159 | { 160 | "variable_name": "primary_instal_amt", 161 | "variable_type": "NUMERIC", 162 | "data_type": "FLOAT" 163 | }, 164 | { 165 | "variable_name": "sec_instal_amt", 166 | "variable_type": "NUMERIC", 167 | "data_type": "FLOAT" 168 | }, 169 | { 170 | "variable_name": "new_accts_in_last_six_months", 171 | "variable_type": "NUMERIC", 172 | "data_type": "FLOAT" 173 | }, 174 | { 175 | "variable_name": "delinquent_accts_in_last_six_months", 176 | "variable_type": "NUMERIC", 177 | "data_type": "FLOAT" 178 | }, 179 | { 180 | "variable_name": "average_acct_age", 181 | "variable_type": "FREE_FORM_TEXT", 182 | "data_type": "STRING" 183 | }, 184 | { 185 | "variable_name": "credit_history_length", 186 | "variable_type": "NUMERIC", 187 | "data_type": "FLOAT" 188 | }, 189 | { 190 | "variable_name": "no_of_inquiries", 191 | "variable_type": "NUMERIC", 192 | "data_type": "FLOAT" 193 | } 194 | ], 195 | "label_mappings": { 196 | "FRAUD": [ 197 | "1" 198 | ], 199 | "LEGIT": [ 200 | "0" 201 | ] 202 | } 203 | } -------------------------------------------------------------------------------- /scripts/reproducibility/afd/create_afd_resources.py: -------------------------------------------------------------------------------- 1 | # TO BE UPDATED BY USER 2 | 
IAM_ROLE = "" 3 | BUCKET = "" 4 | KEY = "" 5 | MODEL_NAME = "" # lower case alphanumeric only, only _ allowed as delimiter 6 | MODEL_TYPE = "ONLINE_FRAUD_INSIGHTS" # or TRANSACTION_FRAUD_INSIGHTS 7 | 8 | import os 9 | import time 10 | import json 11 | import boto3 12 | import click 13 | import string 14 | import random 15 | import logging 16 | import pandas as pd 17 | 18 | 19 | MODEL_DESC = "Benchmarking model" 20 | EVENT_DESC = "Event for benchmarking model" 21 | ENTITY_TYPE = "user" # this is provided in the dummy data. Will need to change if using different data 22 | ENTITY_DESC = "Entity for benchmarking model" 23 | 24 | BATCH_PREDICTION_JOB = DETECTOR_NAME = EVENT_TYPE = MODEL_NAME # Others are kept same as model name 25 | 26 | # boto3 connections 27 | client = boto3.client('frauddetector') 28 | s3 = boto3.client('s3') 29 | 30 | @click.command() 31 | @click.argument("config", type=click.Path(exists=True)) 32 | def afd_train_model_demo(config): 33 | 34 | ############################################# 35 | ##### Setup ##### 36 | with open(config, "r") as f: 37 | config_file = json.load(f) 38 | 39 | 40 | EVENT_VARIABLES = [variable["variable_name"] for variable in config_file["variable_mappings"]] 41 | EVENT_LABELS = [v for k,v in config_file["label_mappings"].items()] 42 | EVENT_LABELS = [item for sublist in EVENT_LABELS for item in sublist] # flattening list of lists 43 | 44 | # Variable mappings of demo data in this use case. 
Important to teach this to customer 45 | click.echo(f'{pd.DataFrame(config_file["variable_mappings"])}') 46 | click.echo(f'{pd.DataFrame(config_file["label_mappings"])}') 47 | 48 | S3_DATA_PATH = "s3://" + os.path.join(BUCKET, KEY) 49 | 50 | ############################################# 51 | ##### Create event variables and labels ##### 52 | 53 | # -- create variable -- 54 | for variable in config_file["variable_mappings"]: 55 | 56 | DEFAULT_VALUE = '0.0' if variable["data_type"] == "FLOAT" else '' 57 | 58 | try: 59 | resp = client.get_variables(name = variable["variable_name"]) 60 | click.echo("{0} exists, data type: {1}".format(variable["variable_name"], resp['variables'][0]['dataType'])) 61 | except: 62 | click.echo("Creating variable: {0}".format(variable["variable_name"])) 63 | resp = client.create_variable( 64 | name = variable["variable_name"], 65 | dataType = variable["data_type"], 66 | dataSource ='EVENT', 67 | defaultValue = DEFAULT_VALUE, 68 | description = variable["variable_name"], 69 | variableType = variable["variable_type"]) 70 | # Putting FRAUD 71 | for f in config_file["label_mappings"]["FRAUD"]: 72 | response = client.put_label( 73 | name = f, 74 | description = "FRAUD") 75 | # Putting LEGIT 76 | for f in config_file["label_mappings"]["LEGIT"]: 77 | response = client.put_label( 78 | name = f, 79 | description = "LEGIT") 80 | 81 | ############################################# 82 | ##### Define Entity and Event Types ##### 83 | 84 | # -- create entity type -- 85 | try: 86 | response = client.get_entity_types(name = ENTITY_TYPE) 87 | click.echo("-- entity type exists --") 88 | click.echo(response) 89 | except: 90 | response = client.put_entity_type( 91 | name = ENTITY_TYPE, 92 | description = ENTITY_DESC 93 | ) 94 | click.echo("-- create entity type --") 95 | click.echo(response) 96 | 97 | 98 | # -- create event type -- 99 | try: 100 | response = client.get_event_types(name = EVENT_TYPE) 101 | click.echo("\n-- event type exists --") 102 | 
click.echo(response) 103 | except: 104 | response = client.put_event_type ( 105 | name = EVENT_TYPE, 106 | eventVariables = EVENT_VARIABLES, 107 | labels = EVENT_LABELS, 108 | entityTypes = [ENTITY_TYPE]) 109 | click.echo("\n-- create event type --") 110 | click.echo(response) 111 | 112 | ############################################# 113 | ##### Batch import training file for TFI ##### 114 | if MODEL_TYPE == "TRANSACTION_FRAUD_INSIGHTS": 115 | try: 116 | response = client.create_batch_import_job( 117 | jobId = BATCH_PREDICTION_JOB, 118 | inputPath = S3_DATA_PATH, 119 | outputPath = "s3://" + BUCKET, 120 | eventTypeName = EVENT_TYPE, 121 | iamRoleArn = IAM_ROLE 122 | ) 123 | except Exception: 124 | pass 125 | 126 | # -- wait until batch import is finished -- 127 | print("--- waiting until batch import is finished ") 128 | stime = time.time() 129 | while True: 130 | response = client.get_batch_import_jobs(jobId=BATCH_PREDICTION_JOB) 131 | if 'IN_PROGRESS' in response['batchImports'][0]['status']: 132 | print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes") 133 | time.sleep(60) # sleep for 1 minute 134 | else: 135 | print("Batch Import status : " + response['batchImports'][0]['status']) 136 | break 137 | 138 | etime = time.time() 139 | print(f"Elapsed time: {(etime - stime)/60:{3}.{3}} minutes \n" ) 140 | print(response) 141 | 142 | 143 | ############################################# 144 | ##### Create and train your model ##### 145 | try: 146 | response = client.create_model( 147 | description = MODEL_DESC, 148 | eventTypeName = EVENT_TYPE, 149 | modelId = MODEL_NAME, 150 | modelType = MODEL_TYPE) 151 | click.echo("-- initialize model --") 152 | click.echo(response) 153 | except Exception: 154 | pass 155 | 156 | # -- initialized the model, it's now ready to train -- 157 | 158 | # -- first define training_data_schema for model to use -- 159 | 160 | 161 | if MODEL_TYPE == "TRANSACTION_FRAUD_INSIGHTS": 162 | training_data_schema = { 163 | 'modelVariables' 
: EVENT_VARIABLES, 164 | 'labelSchema' : { 165 | 'labelMapper' : config_file["label_mappings"], 166 | 'unlabeledEventsTreatment': 'IGNORE' 167 | } 168 | } 169 | response = client.create_model_version( 170 | modelId = MODEL_NAME, 171 | modelType = MODEL_TYPE, 172 | trainingDataSource = 'INGESTED_EVENTS', 173 | trainingDataSchema = training_data_schema, 174 | ingestedEventsDetail={ # This needs to be changed 175 | 'ingestedEventsTimeWindow': { 176 | 'startTime': '2020-12-10T00:00:00Z', # '2021-08-28T00:00:00Z', 177 | 'endTime': '2022-06-07T00:00:00Z' #'2022-05-10T00:00:00Z' 178 | } 179 | } 180 | ) 181 | else: 182 | training_data_schema = { 183 | 'modelVariables' : EVENT_VARIABLES, 184 | 'labelSchema' : { 185 | 'labelMapper' : config_file["label_mappings"] 186 | } 187 | } 188 | response = client.create_model_version( 189 | modelId = MODEL_NAME, 190 | modelType = MODEL_TYPE, 191 | trainingDataSource = 'EXTERNAL_EVENTS', 192 | trainingDataSchema = training_data_schema, 193 | externalEventsDetail = { 194 | 'dataLocation' : S3_DATA_PATH, 195 | 'dataAccessRoleArn': IAM_ROLE 196 | } 197 | ) 198 | model_version = response['modelVersionNumber'] 199 | click.echo("-- model training --") 200 | click.echo(response) 201 | 202 | 203 | if __name__=="__main__": 204 | afd_train_model_demo() 205 | -------------------------------------------------------------------------------- /scripts/reproducibility/afd/score_afd_model.py: -------------------------------------------------------------------------------- 1 | # TO BE UPDATED BY USER 2 | IAM_ROLE = "" 3 | BUCKET = "" 4 | TEST_PATH = "" 5 | TEST_LABELS_PATH = "" 6 | MODEL_NAME = "" # lower case alphanumeric only, only _ allowed as delimiter 7 | MODEL_TYPE = "ONLINE_FRAUD_INSIGHTS" # or TRANSACTION_FRAUD_INSIGHTS 8 | 9 | import os 10 | import ast 11 | import time 12 | import json 13 | import boto3 14 | import click 15 | import string 16 | import random 17 | import logging 18 | import numpy as np 19 | import pandas as pd 20 | from 
sklearn.metrics import roc_curve, auc 21 | 22 | # boto3 connections 23 | client = boto3.client('frauddetector') 24 | s3 = boto3.client('s3') 25 | 26 | BATCH_PREDICTION_JOB = DETECTOR_NAME = EVENT_TYPE = MODEL_NAME 27 | model_version = '1.0' 28 | DETECTOR_DESC = "Benchmarking detector" 29 | 30 | 31 | def create_outcomes(outcomes): 32 | """ 33 | Create Fraud Detector Outcomes 34 | """ 35 | for outcome in outcomes: 36 | print("creating outcome variable: {0} ".format(outcome)) 37 | response = client.put_outcome(name = outcome, description = outcome) 38 | 39 | 40 | def create_rules(score_cuts, outcomes): 41 | """ 42 | Creating rules 43 | 44 | Arguments: 45 | score_cuts - list of score cuts to create rules 46 | outcomes - list of outcomes associated with the rules 47 | 48 | Returns: 49 | a rule list to use when creating the detector 50 | """ 51 | 52 | if len(score_cuts)+1 != len(outcomes): 53 | logging.error('Your score cuts and outcomes do not match.') 54 | 55 | rule_list = [] 56 | for i in range(len(outcomes)): 57 | # rule expression 58 | if i < (len(outcomes)-1): 59 | rule = "${0}_insightscore > {1}".format(MODEL_NAME,score_cuts[i]) 60 | else: 61 | rule = "${0}_insightscore <= {1}".format(MODEL_NAME,score_cuts[i-1]) 62 | 63 | # append to rule_list (used when creating the detector) 64 | rule_id = "rules{0}_{1}".format(i, MODEL_NAME[:9]) 65 | 66 | rule_list.append({ 67 | "ruleId": rule_id, 68 | "ruleVersion" : '1', 69 | "detectorId" : DETECTOR_NAME 70 | }) 71 | 72 | # create rules 73 | print("creating rule: {0}: IF {1} THEN {2}".format(rule_id, rule, outcomes[i])) 74 | try: 75 | response = client.create_rule( 76 | ruleId = rule_id, 77 | detectorId = DETECTOR_NAME, 78 | expression = rule, 79 | language = 'DETECTORPL', 80 | outcomes = [outcomes[i]] 81 | ) 82 | except: 83 | print("this rule already exists in this detector") 84 | 85 | return rule_list 86 | 87 | 88 | def ast_with_nan(x): 89 | try: 90 | return ast.literal_eval(x) 91 | except: 92 | return np.nan 93 | 94 | 95 | def 
afd_train_model_demo(): 96 | 97 | # -- activate the model version -- 98 | try: 99 | response = client.update_model_version_status ( 100 | modelId = MODEL_NAME, 101 | modelType = MODEL_TYPE, 102 | modelVersionNumber = model_version, 103 | status = 'ACTIVE' 104 | ) 105 | print("-- activating model --") 106 | print(response) 107 | except Exception: 108 | print("First train the model") 109 | 110 | # -- wait until model is active -- 111 | print("--- waiting until model status is active ") 112 | stime = time.time() 113 | while True: 114 | response = client.get_model_version(modelId=MODEL_NAME, modelType = MODEL_TYPE, modelVersionNumber = model_version) 115 | if response['status'] != 'ACTIVE': 116 | print(response['status']) 117 | print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes") 118 | time.sleep(60) # sleep for 1 minute 119 | if response['status'] == 'ACTIVE': 120 | print("Model status : " + response['status']) 121 | break 122 | 123 | etime = time.time() 124 | print("Elapsed time : %s" % (etime - stime) + " seconds \n" ) 125 | print(response) 126 | 127 | # -- put detector, initializes your detector -- 128 | response = client.put_detector( 129 | detectorId = DETECTOR_NAME, 130 | description = DETECTOR_DESC, 131 | eventTypeName = EVENT_TYPE ) 132 | 133 | # -- decide what threshold and corresponding outcome you want to add -- 134 | # here, we create two simple rules by cutting the score at [750], and create two outcomes ['fraud', 'approve'] 135 | # it will create 2 rules: 136 | # score > 750: fraud 137 | # score <= 750: approve 138 | 139 | score_cuts = [750] # recommended to fine tune this based on your business use case 140 | outcomes = ['fraud', 'approve'] # recommended to define this based on your business use case 141 | 142 | # -- create outcomes -- 143 | print(" -- create outcomes --") 144 | create_outcomes(outcomes) 145 | 146 | # -- create rules -- 147 | print(" -- create rules --") 148 | rule_list = create_rules(score_cuts, 
outcomes) 149 | 150 | # -- create detector version -- 151 | client.create_detector_version( 152 | detectorId = DETECTOR_NAME, 153 | rules = rule_list, 154 | modelVersions = [{"modelId": MODEL_NAME, 155 | "modelType": MODEL_TYPE, 156 | "modelVersionNumber": model_version}], 157 | # there are 2 options for ruleExecutionMode: 158 | # 'ALL_MATCHED' - return all matched rules' outcome 159 | # 'FIRST_MATCHED' - return first matched rule's outcome 160 | ruleExecutionMode = 'FIRST_MATCHED' 161 | ) 162 | 163 | print("\n -- detector created -- ") 164 | print(response) 165 | 166 | response = client.update_detector_version_status( 167 | detectorId = DETECTOR_NAME, 168 | detectorVersionId = '1', 169 | status = 'ACTIVE' 170 | ) 171 | print("\n -- detector activated -- ") 172 | print(response) 173 | 174 | # -- wait until detector is active -- 175 | print("\n --- waiting until detector status is active ") 176 | stime = time.time() 177 | while True: 178 | response = client.describe_detector( 179 | detectorId = DETECTOR_NAME, 180 | ) 181 | if response['detectorVersionSummaries'][0]['status'] != 'ACTIVE': 182 | print(response['detectorVersionSummaries'][0]['status']) 183 | print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes") 184 | time.sleep(60) 185 | if response['detectorVersionSummaries'][0]['status'] == 'ACTIVE': 186 | break 187 | etime = time.time() 188 | print("Elapsed time : %s" % (etime - stime) + " seconds \n" ) 189 | print(response) 190 | 191 | # -- create detector evaluation -- 192 | try: 193 | client.create_batch_prediction_job ( 194 | jobId = BATCH_PREDICTION_JOB, 195 | inputPath = os.path.join('s3://', BUCKET, TEST_PATH), 196 | outputPath =os.path.join('s3://', BUCKET), 197 | eventTypeName = EVENT_TYPE, 198 | detectorName = DETECTOR_NAME, 199 | detectorVersion = '1', 200 | iamRoleArn = IAM_ROLE) 201 | except Exception as e: 202 | print(e) 203 | print("batch prediction job already exists") 204 | 205 | # -- wait until batch prediction job is completed -- 
206 | print("\n --- waiting until batch prediction job is completed ") 207 | stime = time.time() 208 | while True: 209 | response = client.get_batch_prediction_jobs(jobId=BATCH_PREDICTION_JOB) 210 | response = response['batchPredictions'][0] 211 | if (response['status'] != 'COMPLETE') and (response['status'] != 'FAILED'): 212 | print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes") 213 | time.sleep(60) 214 | if response['status'] in ('COMPLETE', 'FAILED'): 215 | break 216 | etime = time.time() 217 | print("Elapsed time : %s" % (etime - stime) + " seconds \n" ) 218 | print(response) 219 | 220 | # -- get batch prediction job result -- 221 | contents = s3.list_objects_v2(Bucket=BUCKET, Prefix=os.path.join(TEST_PATH))['Contents'] 222 | print(contents) 223 | S3_SCORE_PATH = sorted([c['Key'] for c in contents if c['Key'].endswith('output.csv')])[-1] 224 | print(S3_SCORE_PATH) 225 | 226 | # -- get test performance -- 227 | # Predictions 228 | print(os.path.join('s3://', BUCKET, S3_SCORE_PATH)) 229 | predictions = pd.read_csv(os.path.join('s3://', BUCKET, S3_SCORE_PATH)) 230 | predictions = predictions.copy()[~predictions.MODEL_SCORES.isna()] 231 | 232 | predictions['scores'] = predictions['MODEL_SCORES'].\ 233 | apply(lambda x: ast_with_nan(x)).\ 234 | apply(lambda x: x.get(MODEL_NAME)) 235 | 236 | # Labels 237 | labels = pd.read_csv(os.path.join('s3://', BUCKET, TEST_LABELS_PATH)) 238 | # labels['EVENT_LABEL'] = labels['EVENT_LABEL'].map({'benign': 0, 'malignant': 1}) 239 | predictions = predictions.merge(labels, on='EVENT_ID', how='left') 240 | print('Test size: ', predictions.shape) 241 | 242 | fpr, tpr, threshold = roc_curve(predictions['EVENT_LABEL'], predictions['scores']) 243 | test_auc = auc(fpr,tpr) 244 | print('AUC: ', test_auc) 245 | 246 | test_metrics = {} 247 | test_metrics['auc'] = test_auc 248 | test_metrics['fpr'] = list(fpr) 249 | test_metrics['tpr'] = list(tpr) 250 | test_metrics['threshold'] = list(threshold) 251 | 252 | # -- put test metrics in s3 
-- 253 | s3.put_object( 254 | Body=json.dumps(test_metrics), 255 | Bucket=BUCKET, 256 | Key='test_metrics.json') 257 | 258 | print("\n -- test metrics saved -- ") 259 | 260 | if __name__ == "__main__": 261 | afd_train_model_demo() 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | -------------------------------------------------------------------------------- /scripts/reproducibility/autogluon/README.md: -------------------------------------------------------------------------------- 1 | - benchmark_ag.py: a script for autogluon benchmarking 2 | - example-ag-ieeecis.ipynb: an example notebook using benchmark_ag.py 3 | 4 | Note that autogluon is not perfectly reproducible because some underlying models are not deterministically seeded; you might see slightly different results from those in the paper. 5 | -------------------------------------------------------------------------------- /scripts/reproducibility/autogluon/benchmark_ag.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import os 3 | import gc 4 | import joblib 5 | import datetime 6 | 7 | import matplotlib as mpl 8 | from sklearn.metrics import roc_auc_score, roc_curve 9 | 10 | mpl.rcParams['figure.dpi'] = 150 11 | pd.set_option('display.max_columns', 500) 12 | pd.set_option('display.max_rows', 500) 13 | pd.set_option('display.width', 200) 14 | pd.set_option('display.float_format', lambda x: '%.3f' % x) 15 | 16 | import logging 17 | FORMAT = "%(levelname)s: %(name)s: %(message)s" 18 | DATE_FORMAT = "%Y-%m-%d %H:%M:%S" 19 | logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT) 20 | logger = logging.getLogger(os.path.basename(__file__)) 21 | logger.setLevel(logging.DEBUG) 22 | 23 | import sys 24 | sys.path.append('../') 25 | from benchmark_utils import load_data, get_recall 26 | 27 | from autogluon.tabular import TabularPredictor 28 | 29 | def run_ag(dataset, base_path, time_limit=3600, presets=None, hyperparameters=None, 
feature_metadata='infer', verbosity=2): 30 | gc.collect() 31 | features, df_train, df_test = load_data(dataset, base_path) 32 | 33 | dateTimeObj = datetime.datetime.now() 34 | timestampStr = dateTimeObj.strftime("%Y%m%d_%H%M%S") 35 | 36 | suffix = (f"_{presets}" if presets is not None else "") \ 37 | + (f"_{hyperparameters}" if hyperparameters is not None else "") \ 38 | + ("_feature_metadata" if feature_metadata != 'infer' else "") 39 | folder = f"ag-{timestampStr}" \ 40 | + suffix 41 | 42 | predictor = TabularPredictor(label='EVENT_LABEL', eval_metric='roc_auc', path=f"{base_path}/{dataset}/AutogluonModels/{folder}/", 43 | verbosity=verbosity) 44 | predictor.fit(df_train[features + ['EVENT_LABEL'] ], 45 | time_limit=time_limit, presets=presets, hyperparameters=hyperparameters, feature_metadata=feature_metadata) 46 | 47 | leaderboard = predictor.leaderboard(df_test[features + ['EVENT_LABEL'] ]) 48 | 49 | leaderboard_file = "leaderboard" \ 50 | + suffix \ 51 | + ".csv" 52 | leaderboard.to_csv(f"{base_path}/{dataset}/{leaderboard_file}", index=False) 53 | 54 | df_pred = predictor.predict_proba(df_test[ features ], 55 | as_multiclass=False) 56 | 57 | auc = roc_auc_score(df_test['EVENT_LABEL'], df_pred) 58 | logger.info(f"auc on test data: {auc}") 59 | pos_label = predictor.positive_class 60 | fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'], df_pred, 61 | pos_label=pos_label) 62 | 63 | y_true = df_test['EVENT_LABEL'] 64 | y_true = (y_true==pos_label) 65 | 66 | recall = get_recall(fpr, tpr, fpr_target=0.01) 67 | logger.info(f"tpr@1%fpr on test data: {recall}") 68 | 69 | test_metrics_ag_bq = { 70 | "labels": df_test['EVENT_LABEL'], 71 | "pred_prob": df_pred, 72 | "auc": auc, 73 | "tpr@1%fpr": recall, 74 | "fpr": fpr, 75 | "tpr": tpr, 76 | "thresholds": thresholds 77 | } 78 | metrics_file = "test_metrics_ag" \ 79 | + suffix \ 80 | + ".joblib" 81 | joblib.dump(test_metrics_ag_bq, f"{base_path}/{dataset}/{metrics_file}") 
-------------------------------------------------------------------------------- /scripts/reproducibility/autosklearn/README.md: -------------------------------------------------------------------------------- 1 | ## Steps to reproduce Auto-sklearn models 2 | 3 | 4 | 1. Load and save the datasets locally using [FDB Loader](../../examples/Test_FDB_Loader.ipynb). Take note of `{DATASET_PATH}`, the local directory that contains the `train.csv`, `test.csv` and `test_labels.csv` files written by the FDB loader. 5 | 6 | 2. Run `benchmark_autosklearn.py` using the following command: 7 | ``` 8 | python3 benchmark_autosklearn.py {DATASET_PATH} 9 | ``` 10 | 11 | 3. After a successful run, the script saves its results in `DATASET_PATH`. The evaluation metrics on `test.csv` are saved in `test_metrics_autosklearn.joblib`. 12 | 13 | *Note: Python 3.7+ is needed to run the used version of auto-sklearn and to reproduce the results. Similar to other auto-ml frameworks, auto-sklearn is also not perfectly reproducible because some underlying models are not deterministically seeded. 
However, the variations in results are within acceptable errors.* 14 | -------------------------------------------------------------------------------- /scripts/reproducibility/autosklearn/benchmark_autosklearn.py: -------------------------------------------------------------------------------- 1 | 2 | import json 3 | import joblib 4 | import datetime 5 | import numpy as np 6 | import pandas as pd 7 | import os, sys, shutil 8 | 9 | from autosklearn.metrics import roc_auc, log_loss 10 | from autosklearn.classification import AutoSklearnClassifier 11 | 12 | from sklearn.metrics import roc_auc_score, roc_curve 13 | from pandas.api.types import is_numeric_dtype, is_string_dtype 14 | 15 | import logging 16 | FORMAT = "%(levelname)s: %(name)s: %(message)s" 17 | DATE_FORMAT = "%Y-%m-%d %H:%M:%S" 18 | logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT) 19 | logger = logging.getLogger(os.path.basename(__file__)) 20 | logger.setLevel(logging.DEBUG) 21 | 22 | logging_config = { 23 | 'version': 1, 24 | 'disable_existing_loggers': False, 25 | 'formatters': { 26 | 'simple': { 27 | 'format': '%(levelname)-8s %(name)-15s %(message)s' 28 | } 29 | }, 30 | 'handlers':{ 31 | 'console_handler': { 32 | 'class': 'logging.StreamHandler', 33 | 'formatter': 'simple' 34 | }, 35 | 'file_handler': { 36 | 'class':'logging.FileHandler', 37 | 'mode': 'a', 38 | 'encoding': 'utf-8', 39 | 'filename':'main.log', 40 | 'formatter': 'simple' 41 | }, 42 | 'spec_handler':{ 43 | 'class':'logging.FileHandler', 44 | 'filename':'dummy_autosklearn.log', 45 | 'formatter': 'simple' 46 | }, 47 | 'distributed_logfile':{ 48 | 'filename':'distributed.log', 49 | 'class': 'logging.FileHandler', 50 | 'formatter': 'simple', 51 | 'level': 'DEBUG' 52 | } 53 | }, 54 | 'loggers': { 55 | '': { 56 | 'level': 'INFO', 57 | 'handlers':['file_handler', 'console_handler'] 58 | }, 59 | 'autosklearn': { 60 | 'level': 'INFO', 61 | 'propagate': False, 62 | 'handlers': ['spec_handler'] 63 | }, 64 | 'smac': { 65 
| 'level': 'INFO', 66 | 'propagate': False, 67 | 'handlers': ['spec_handler'] 68 | }, 69 | 'EnsembleBuilder': { 70 | 'level': 'INFO', 71 | 'propagate': False, 72 | 'handlers': ['spec_handler'] 73 | }, 74 | }, 75 | } 76 | 77 | def load_data(dataset_path): 78 | logger.info(dataset_path) 79 | 80 | df_train = pd.read_csv(f"{dataset_path}/train.csv", lineterminator='\n') 81 | logger.info(df_train.shape) 82 | 83 | df_test = pd.read_csv(f"{dataset_path}/test.csv") 84 | logger.info(df_test.shape) 85 | 86 | df_test_labels = pd.read_csv(f"{dataset_path}/test_labels.csv") 87 | logger.info(df_test_labels.shape) 88 | 89 | df_test = df_test.merge(df_test_labels, how="inner", on="EVENT_ID") 90 | logger.info(df_test.shape) 91 | 92 | 93 | features_to_exclude = ("EVENT_LABEL", "EVENT_TIMESTAMP", "LABEL_TIMESTAMP", "ENTITY_TYPE", "ENTITY_ID", "EVENT_ID") 94 | features = [x for x in df_test.columns if x not in features_to_exclude ] 95 | logger.info(len(features)) 96 | logger.info(features) 97 | 98 | return features, df_train, df_test 99 | 100 | 101 | def get_recall(fpr, tpr, fpr_target=0.01): 102 | return np.interp(fpr_target, fpr, tpr) 103 | 104 | 105 | def run_autosklearn(dataset_path): 106 | 107 | features, df_train, df_test = load_data(dataset_path) 108 | 109 | dateTimeObj = datetime.datetime.now() 110 | timestampStr = dateTimeObj.strftime("%Y%m%d_%H%M%S") 111 | 112 | numeric_features = [f for f in features if is_numeric_dtype(df_train[f])] 113 | categorical_features = [f for f in features if f not in numeric_features] 114 | logger.info(f'categorical: {categorical_features}') 115 | logger.info(f'numeric: {numeric_features}') 116 | 117 | labels = sorted(df_train['EVENT_LABEL'].unique()) 118 | df_train['EVENT_LABEL'].replace({labels[0]: 0, labels[1]: 1}, inplace=True) 119 | df_test['EVENT_LABEL'].replace({labels[0]: 0, labels[1]: 1}, inplace=True) 120 | 121 | for df in [df_train, df_test]: 122 | df[categorical_features] = df[categorical_features].fillna('') 123 | 
df[categorical_features] = df[categorical_features].astype('category') 124 | 125 | out_dir = f"{dataset_path}/AutoSklearnModels/" 126 | if os.path.exists(out_dir): 127 | shutil.rmtree(out_dir) 128 | 129 | automl = AutoSklearnClassifier( 130 | metric=roc_auc, 131 | scoring_functions=[roc_auc, log_loss], 132 | tmp_folder=out_dir, # for debugging 133 | delete_tmp_folder_after_terminate=False, 134 | logging_config=logging_config, 135 | n_jobs=-1, 136 | memory_limit=None 137 | ) 138 | 139 | assert len(categorical_features) + len(numeric_features) == len(features) 140 | 141 | logger.info('Fitting') 142 | automl.fit(df_train[features], df_train['EVENT_LABEL']) 143 | joblib.dump(automl, f"{dataset_path}/automl.joblib") 144 | 145 | cv = pd.DataFrame(automl.cv_results_) 146 | cv.to_csv(f"{dataset_path}/cv_results_autosklearn.csv", index=False) 147 | 148 | df_pred = automl.predict_proba(df_test[features])[:,1] 149 | 150 | auc_score = roc_auc_score(df_test['EVENT_LABEL'], df_pred) 151 | logger.info(f"auc on test data: {auc_score}") 152 | 153 | fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'], df_pred) 154 | 155 | recall = get_recall(fpr, tpr, fpr_target=0.01) 156 | logger.info(f"tpr@1%fpr on test data: {recall}") 157 | 158 | test_metrics = { 159 | "labels": df_test['EVENT_LABEL'], 160 | "pred_prob": df_pred, 161 | "auc": auc_score, 162 | "tpr@1%fpr": recall, 163 | "fpr": fpr, 164 | "tpr": tpr, 165 | "thresholds": thresholds 166 | } 167 | joblib.dump(test_metrics, f"{dataset_path}/test_metrics_autosklearn.joblib") 168 | 169 | if __name__ == "__main__": 170 | args = sys.argv 171 | logger.info(args) 172 | run_autosklearn(args[1]) 173 | -------------------------------------------------------------------------------- /scripts/reproducibility/benchmark_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import os 4 | 5 | import matplotlib as mpl 6 | 7 | mpl.rcParams['figure.dpi'] = 150 8 
| pd.set_option('display.max_columns', 500) 9 | pd.set_option('display.max_rows', 500) 10 | pd.set_option('display.width', 200) 11 | pd.set_option('display.float_format', lambda x: '%.3f' % x) 12 | 13 | import logging 14 | FORMAT = "%(levelname)s: %(name)s: %(message)s" 15 | DATE_FORMAT = "%Y-%m-%d %H:%M:%S" 16 | logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT) 17 | logger = logging.getLogger(os.path.basename(__file__)) 18 | logger.setLevel(logging.DEBUG) 19 | 20 | 21 | 22 | def load_data(dataset, base_path): 23 | logger.info(dataset) 24 | 25 | df_train = pd.read_csv(f"{base_path}/{dataset}/train.csv", lineterminator='\n') 26 | logger.info(df_train.shape) 27 | 28 | df_test = pd.read_csv(f"{base_path}/{dataset}/test.csv") 29 | logger.info(df_test.shape) 30 | 31 | df_test_labels = pd.read_csv(f"{base_path}/{dataset}/test_labels.csv") 32 | logger.info(df_test_labels.shape) 33 | 34 | df_test = df_test.merge(df_test_labels, how="inner", on="EVENT_ID") 35 | logger.info(df_test.shape) 36 | 37 | 38 | features_to_exclude = ("EVENT_LABEL", "EVENT_TIMESTAMP", "LABEL_TIMESTAMP", "ENTITY_TYPE", "ENTITY_ID", "EVENT_ID") 39 | features = [x for x in df_test.columns if x not in features_to_exclude ] 40 | logger.info(len(features)) 41 | logger.info(features) 42 | 43 | return features, df_train, df_test 44 | 45 | def get_recall(fpr, tpr, fpr_target=0.01): 46 | return np.interp(fpr_target, fpr, tpr) -------------------------------------------------------------------------------- /scripts/reproducibility/h2o/README.md: -------------------------------------------------------------------------------- 1 | - benchmark_h2o.py: a script for h2o benchmarking 2 | - example-h2o-ieeecis.ipynb: an example notebook using benchmark_h2o.py 3 | 4 | Note that h2o is not perfectly reproducible because some underlying models are not deterministically seeded; you might see slightly different results from those in the paper. 
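
The `tpr@1%fpr` value that `benchmark_h2o.py` logs comes from `get_recall` in `benchmark_utils.py`, which interpolates the ROC curve linearly at a target false positive rate using `np.interp`. When checking how far your own run drifts from the paper, it can help to see that interpolation spelled out. The sketch below is a dependency-free stand-in, not the repository's implementation: the name `tpr_at_fpr` and the sample ROC points are hypothetical, and it assumes `fpr` is sorted in ascending order, as the output of `roc_curve` is.

```python
def tpr_at_fpr(fpr, tpr, fpr_target=0.01):
    """Linearly interpolate the true positive rate at fpr_target.

    Hypothetical helper for illustration; assumes fpr is sorted ascending
    and that fpr and tpr have the same length.
    """
    # Clamp to the ends of the curve, as linear interpolation helpers usually do.
    if fpr_target <= fpr[0]:
        return tpr[0]
    if fpr_target >= fpr[-1]:
        return tpr[-1]
    # Find the first segment whose right end reaches the target.
    for i in range(1, len(fpr)):
        if fpr_target <= fpr[i]:
            x0, x1 = fpr[i - 1], fpr[i]
            y0, y1 = tpr[i - 1], tpr[i]
            # x1 > x0 here because fpr_target is strictly greater than x0.
            return y0 + (y1 - y0) * (fpr_target - x0) / (x1 - x0)


# Made-up ROC points, only to exercise the function.
example_fpr = [0.0, 0.0, 0.02, 1.0]
example_tpr = [0.0, 0.5, 0.7, 1.0]
print(tpr_at_fpr(example_fpr, example_tpr, fpr_target=0.01))
```

Because the metric is read off a single point on the curve, it tends to move more between runs than AUC does, so small differences at 1% FPR are not necessarily a sign that something went wrong.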
5 | -------------------------------------------------------------------------------- /scripts/reproducibility/h2o/benchmark_h2o.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import os 3 | import gc 4 | import joblib 5 | 6 | import matplotlib as mpl 7 | from sklearn.metrics import roc_auc_score, roc_curve 8 | 9 | mpl.rcParams['figure.dpi'] = 150 10 | pd.set_option('display.max_columns', 500) 11 | pd.set_option('display.max_rows', 500) 12 | pd.set_option('display.width', 200) 13 | pd.set_option('display.float_format', lambda x: '%.3f' % x) 14 | 15 | import logging 16 | FORMAT = "%(levelname)s: %(name)s: %(message)s" 17 | DATE_FORMAT = "%Y-%m-%d %H:%M:%S" 18 | logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT) 19 | logger = logging.getLogger(os.path.basename(__file__)) 20 | logger.setLevel(logging.DEBUG) 21 | 22 | import sys 23 | sys.path.append('../') 24 | from benchmark_utils import load_data, get_recall 25 | 26 | import h2o 27 | from h2o.automl import H2OAutoML 28 | 29 | def run_h2o(dataset, base_path, connect_url=None, time_limit=None, include_algos=None, exclude_algos=None, verbosity="info", seed=10): 30 | if connect_url is not None: 31 | _ = h2o.connect(url=connect_url, https=True, verbose=True) 32 | h2o.cluster().show_status(True) 33 | else: 34 | h2o.init() 35 | 36 | gc.collect() 37 | features, df_train, df_test = load_data(dataset, base_path) 38 | 39 | df_train_h2o = h2o.H2OFrame(df_train) 40 | feature_types_h2o = {k:df_train_h2o.types[k] for k in df_train_h2o.types if k in features} 41 | # force test schema the same as train schema, otherwise predict will throw errors 42 | df_test_h2o = h2o.H2OFrame(df_test, column_types=feature_types_h2o) 43 | 44 | df_train_h2o['EVENT_LABEL'] = df_train_h2o['EVENT_LABEL'].asfactor() 45 | df_test_h2o['EVENT_LABEL'] = df_test_h2o['EVENT_LABEL'].asfactor() 46 | 47 | aml = H2OAutoML(max_runtime_secs = time_limit, seed = seed, 48 | 
include_algos=include_algos, 49 | exclude_algos=exclude_algos, 50 | export_checkpoints_dir=f"{base_path}/{dataset}/H2OModels/", 51 | verbosity=verbosity) 52 | 53 | # use validation error in the leaderboard to avoid leakage when calling aml.predict 54 | aml.train(x = features, 55 | y = 'EVENT_LABEL', 56 | training_frame = df_train_h2o, 57 | ) 58 | 59 | lb = aml.leaderboard 60 | # lb.head(rows=lb.nrows) 61 | 62 | h2o.h2o.download_csv(lb, f"{base_path}/{dataset}/leaderboard_h2o.csv") 63 | 64 | lb_2 = h2o.automl.get_leaderboard(aml, extra_columns = "ALL") 65 | h2o.h2o.download_csv(lb_2, f"{base_path}/{dataset}/leaderboard_h2o_full.csv") 66 | # Get training timing info 67 | info = aml.training_info 68 | joblib.dump(info, f"{base_path}/{dataset}/training_info.joblib") 69 | 70 | df_pred_h2o = aml.predict(df_test_h2o[features]) 71 | pos_label = df_test_h2o['EVENT_LABEL'].levels()[0][-1] # levels are ordered alphabetically 72 | 73 | pos_label2 = 'p'+pos_label if pos_label=='1' else pos_label 74 | df_pred_h2o = (h2o.as_list(df_pred_h2o[pos_label2]))[pos_label2] 75 | 76 | auc = roc_auc_score(df_test['EVENT_LABEL'], df_pred_h2o) 77 | logger.info(f"auc on test data: {auc}") 78 | 79 | fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'].astype(str), df_pred_h2o, 80 | pos_label=pos_label) 81 | 82 | y_true = df_test['EVENT_LABEL'] 83 | y_true = (y_true.astype(str)==pos_label) 84 | 85 | recall = get_recall(fpr, tpr, fpr_target=0.01) 86 | logger.info(f"tpr@1%fpr on test data: {recall}") 87 | 88 | test_metrics_h2o = { 89 | "pos_label": pos_label, 90 | "labels": df_test['EVENT_LABEL'], 91 | "pred_prob": df_pred_h2o, 92 | "auc": auc, 93 | "tpr@1%fpr": recall, 94 | "fpr": fpr, 95 | "tpr": tpr, 96 | "thresholds": thresholds 97 | } 98 | joblib.dump(test_metrics_h2o, f"{base_path}/{dataset}/test_metrics_h2o.joblib") 99 | 100 | h2o.cluster().shutdown(prompt=False) -------------------------------------------------------------------------------- 
/scripts/reproducibility/label-noise/benchmark_experiments.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "c77e5eb5", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "#! pip install humanize\n", 11 | "#! pip install catboost" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "id": "f8bd366d", 17 | "metadata": {}, 18 | "source": [ 19 | "# Label noise\n", 20 | "\n", 21 | "\n", 22 | "## Problem statement \n", 23 | "Have some binary classification task, traditionally assume data of the form X,y\n", 24 | "\n", 25 | "In reality, some of the labels may be incorrect, distinguish\n", 26 | "```\n", 27 | "y - true label\n", 28 | "y* - observed, possibly incorrect label\n", 29 | "```\n", 30 | "\n", 31 | "This can obviously affect model training, validation. Would also affect benchmarking process (comparing performance on noisy data doesn't tell you about performance on actual data).\n", 32 | "\n", 33 | "## Types of noise\n", 34 | "\n", 35 | "Can be completely independent:\n", 36 | "`p(y* != y | x, y) = p(y* != y)`\n", 37 | "\n", 38 | "class-dependent, depends on y:\n", 39 | "`p(y* != y | x, y) = p(y* != y | y)`\n", 40 | "\n", 41 | "feature-dependent, depends on x:\n", 42 | "`p(y* != y | x, y) = p(y* != y | x, y)`\n", 43 | "\n", 44 | "In fraud modeling, higher likelihood of `(y*, y) = (0, 1)` than reverse.\n", 45 | "(missed fraud, label maturity, intentional data poisoning, etc.)\n", 46 | "\n", 47 | "\"feature-dependent\" is probably most realistic in fraud but fewer removal techniques and also harder to synthetically generate. 
We will work with \"boundary conditional\" noise, probability of being mislabeled is weighted by distance from some decision boundary (score from model trained on clean data), implemented in scikit-clean.\n", 48 | "\n", 49 | "## Literature/packages\n", 50 | "\n", 51 | "Many methods in the literature to address this; can build loss functions that are robust to noise, can try to identify and filter (remove) or clean (flip label) examples identified as noisy.\n", 52 | "\n", 53 | "Some packages including CleanLab and scikit-clean. Can also hand-code an ensemble method. Most of these are model-agnostic." 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "id": "9b172deb", 59 | "metadata": {}, 60 | "source": [ 61 | "## CleanLab\n", 62 | "\n", 63 | "well-established, state of the art, open source package with some theoretical guarantees\n", 64 | "\n", 65 | "score all examples with y* = 1, determine average score t_1\n", 66 | "now score all examples with y* = 0. Any that score above t_1 are marked as noise\n", 67 | "\n", 68 | "can wrap any (sklearn-compatible) model with this process. \n", 69 | "\n", 70 | "## scikit-clean \n", 71 | "\n", 72 | "library of several different approaches including filtering as well as noise generation. Is similarly designed to be model-agnostic but doesn't always do a great job (doesn't handle unencoded categorical features well). Some of its methods can also be *very* slow relative to others\n", 73 | "\n", 74 | "## micro-models\n", 75 | "\n", 76 | "slice up training data, train a model on each slice, let models vote on whether to remove data. 
Can use majority (more than half of models \"misclassify\" example), consensus (all models misclassify) or any other threshold.\n", 77 | "\n", 78 | "## experiment design\n", 79 | "\n", 80 | "take 7 of the datasets - [‘ieeecis’, ‘ccfraud’, ‘fraudecom’, ‘sparknov’, ‘fakejob’, ‘vehicleloan’,‘twitterbot’]\n", 81 | "* drop IP and malurl dataset as they are difficult to work with \"out of the box\"\n", 82 | "* use numerical and categorical features, target-encode categorical features (drop text and enrichable features)\n", 83 | "\n", 84 | "add boundary-conditional noise `n` to training data (flipping both classes).\n", 85 | "\n", 86 | "values: `n in [0, 0.1, 0.2, 0.3, 0.4, 0.5]`\n", 87 | " \n", 88 | "target encoding is done after noise is added\n", 89 | " \n", 90 | "Catboost used as base classifier in all cases (with default settings)\n", 91 | "\n", 92 | "compare following methods for cleaning training data\n", 93 | "* baseline (no cleaning done)\n", 94 | "* CleanLab\n", 95 | "* scikit-clean MCS \n", 96 | "* micro-model majority voting (hand-built)\n", 97 | "* micro-model consensus voting (hand-built)\n", 98 | "\n", 99 | "measure AUC on (clean) test data\n", 100 | "\n", 101 | "repeat process 5 times for each experiment (start with clean data, add random noise, filter noise back out, train classifier, etc.), compute mean and std. dev of AUC for each\n", 102 | "\n", 103 | "CleanLab usually winds up being the best, but not uniformly. 
Baseline is sometimes the best for zero noise (as expected), and sometimes MCS or micro-model majority will come out ahead" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "id": "846f161f", 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "# basic imports\n", 114 | "import os\n", 115 | "import numpy as np\n", 116 | "import pandas as pd\n", 117 | "import warnings\n", 118 | "import matplotlib.pyplot as plt\n", 119 | "%matplotlib inline\n", 120 | "import humanize\n", 121 | "import pickle\n", 122 | "\n", 123 | "# basics from sklearn\n", 124 | "from sklearn.metrics import roc_auc_score\n", 125 | "from category_encoders.target_encoder import TargetEncoder\n", 126 | "\n", 127 | "# noise generation\n", 128 | "from skclean.simulate_noise import flip_labels_cc, BCNoise\n", 129 | "\n", 130 | "# base classifiers\n", 131 | "from catboost import CatBoostClassifier\n", 132 | "\n", 133 | "# cleaning methods/helpers\n", 134 | "from cleanlab.classification import CleanLearning\n", 135 | "from micro_models import MicroModelCleaner\n", 136 | "from skclean.pipeline import Pipeline\n", 137 | "from skclean.handlers import Filter\n", 138 | "from skclean.detectors import MCS\n", 139 | "\n", 140 | "# dataset loader\n", 141 | "from load_fdb_datasets import prepare_noisy_dataset, dataset_stats" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "id": "85117ba5", 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "# wrapper definitions for the various types of cleaning methods we will use. 
\n", 152 | "# Each one wraps a model_class (in our case catboost, but could use xgboost, etc.)\n", 153 | "# resulting model_class can then take noisy data in its .fit() method and clean before training\n", 154 | "\n", 155 | "def baseline_model(model_class, params):\n", 156 | " return model_class(**params)\n", 157 | "\n", 158 | "def cleanlab_model(model_class, params, pulearning=False):\n", 159 | " if pulearning:\n", 160 | " return CleanLearning(model_class(**params), pulearning=pulearning)\n", 161 | " else:\n", 162 | " return CleanLearning(model_class(**params))\n", 163 | " \n", 164 | "def micromodels(model_class, pulearning, num_clfs, threshold, params):\n", 165 | " return MicroModelCleaner(model_class, pulearning=pulearning, num_clfs=num_clfs, threshold=threshold, **params)\n", 166 | "\n", 167 | "def skclean_MCS(model_class, params):\n", 168 | " skclean_pipeline = Pipeline([\n", 169 | " ('detector',MCS(classifier=model_class(**params))),\n", 170 | " ('handler',Filter(model_class(**params)))\n", 171 | " ])\n", 172 | " return skclean_pipeline" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "id": "bd6bcd08", 179 | "metadata": { 180 | "scrolled": true 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "# some high-level parameters, \n", 185 | "# the number of runs for each experiment (determine mean/std. 
dev)\n", 186 | "num_samples = 5 \n", 187 | "# whether to use target encoding on categorical features\n", 188 | "target_encoding = True\n", 189 | "# whether to save intermediate results to disk (in case of failure etc.)\n", 190 | "save_results = True\n", 191 | "\n", 192 | "# we will be creating a lot of classifiers, let's use the same parameters for each\n", 193 | "model_config_dict = {\n", 194 | " 'catboost': {\n", 195 | " 'model_class': CatBoostClassifier,\n", 196 | " 'default_params': {\n", 197 | " 'verbose': False,\n", 198 | " 'iterations': 100\n", 199 | " }\n", 200 | " }\n", 201 | "}\n", 202 | "\n", 203 | "# all of our experiments will use catboost and boundary-consistent noise\n", 204 | "base_model_type = 'catboost'\n", 205 | "noise_type = 'boundary-consistent'\n", 206 | "model_class = model_config_dict[base_model_type]['model_class']\n", 207 | "\n", 208 | "# the set of experimental parameters, we will iterate over all these datasets\n", 209 | "keys = ['ieeecis', 'sparknov', 'ccfraud', 'fraudecom', 'fakejob', 'vehicleloan', 'twitterbot']\n", 210 | "# all these cleaning methods\n", 211 | "clf_types = ['baseline', 'skclean_MCS', 'cleanlab', 'micromodels_majority', 'micromodels_consensus']\n", 212 | "# all these noise levels\n", 213 | "noise_amounts = [0, 0.1, 0.2, 0.3, 0.4, 0.5]\n", 214 | "# and we will let cleaning methods know that noise can happen for either class\n", 215 | "pulearning = None\n", 216 | "\n", 217 | "# a little bit of setup for saving intermediate results to disk\n", 218 | "if save_results:\n", 219 | " results_file_path = './results'\n", 220 | " results_file_name = '{}_noise_benchmark_results.pkl'\n", 221 | " try:\n", 222 | " os.mkdir(results_file_path)\n", 223 | " except OSError as error:\n", 224 | " print(error) " 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "id": "ef2e3bd8", 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "# initialize results dict, we will index results by 
dataset/noise_amount/cleaning_method\n", 235 | "results = {}\n", 236 | "\n", 237 | "# main experimental loop \n", 238 | "for key in keys:\n", 239 | " # check to see if we have already run this experiment and saved to disk\n", 240 | " full_result_path = os.path.join(results_file_path,results_file_name.format(key))\n", 241 | " if os.path.exists(full_result_path) and save_results:\n", 242 | " with open(full_result_path, 'rb') as results_file:\n", 243 | " results[key] = pickle.load(results_file)\n", 244 | " # otherwise start from scratch\n", 245 | " else:\n", 246 | " # initialize sub-results\n", 247 | " results[key] = {}\n", 248 | " model_params = model_config_dict[base_model_type]['default_params']\n", 249 | " \n", 250 | " for noise_amount in noise_amounts:\n", 251 | " print(f\"\\n =={key}_{noise_amount}== \\n\")\n", 252 | " \n", 253 | " # initialize sub-sub-results\n", 254 | " results[key][noise_amount] = {}\n", 255 | "\n", 256 | " # these are the cleaning classifiers we will use\n", 257 | " clfs = {\n", 258 | " 'baseline': baseline_model(model_class, model_params),\n", 259 | " 'skclean_MCS': skclean_MCS(model_class, model_params),\n", 260 | " 'cleanlab': cleanlab_model(model_class, model_params, pulearning),\n", 261 | " 'micromodels_majority': micromodels(model_class, pulearning=pulearning,\n", 262 | " num_clfs=8, threshold=0.5, params=model_params),\n", 263 | " 'micromodels_consensus': micromodels(model_class, pulearning=pulearning,\n", 264 | " num_clfs=8, threshold=1, params=model_params),\n", 265 | "\n", 266 | " }\n", 267 | " print('generating datasets')\n", 268 | " # preparing a dataset has some overhead, we want to do this five times for each dataset/noise level\n", 269 | " # we will save a little bit of time by doing this in advance and using same set of five\n", 270 | " # for each cleaning method\n", 271 | " datasets = [prepare_noisy_dataset(key, noise_type, noise_amount, split=1, target_encoding=target_encoding) \n", 272 | " for i in range(num_samples)]\n", 
273 | " \n", 274 | " # now for each cleaning method, train a \"clean\" model on noisy training data, then determine\n", 275 | " # auc on clean test data and record the results. Do this five times for each cleaning method\n", 276 | " # to determine mean/std. dev\n", 277 | " for clf_type in clfs:\n", 278 | " print(f\"testing {clf_type}\")\n", 279 | " auc = []\n", 280 | " try:\n", 281 | " for i in range(num_samples):\n", 282 | " # grab the dataset we need for this run and extract metadata and subsets\n", 283 | " dataset = datasets[i]\n", 284 | " features, cat_features, label = dataset['features'], dataset['cat_features'], dataset['label']\n", 285 | " train, test = dataset['train'], dataset['test']\n", 286 | " X_tr, y_tr = train[features], train[label].values.reshape(-1)\n", 287 | " X_ts, y_ts = test[features], test[label].values.reshape(-1)\n", 288 | " clf = clfs[clf_type]\n", 289 | " # fit the \"clean\" classifier on noisy training data\n", 290 | " clf.fit(X_tr, y_tr)\n", 291 | " # make predictions on clean test data and calculate AUC\n", 292 | " y_pred = clf.predict_proba(X_ts)[:, 1]\n", 293 | " auc.append(roc_auc_score(y_ts, y_pred))\n", 294 | " print(f\"{clf_type} auc: {auc}\", end=\"\\r\", flush=True)\n", 295 | " # store mean/std. 
dev for this run in the results dict\n", 296 | " results[key][noise_amount][clf_type] = (np.mean(auc), np.std(auc), auc)\n", 297 | " print('\\n{} auc: {:.2f} ± {:.4f}\\n'.format(clf_type,\n", 298 | " *results[key][noise_amount][clf_type][:2]))\n", 299 | " # if this run failed for some reason, handle it gracefully\n", 300 | " except Exception as e:\n", 301 | " results[key][noise_amount][clf_type] = (0, 0, [0] * num_samples)\n", 302 | " print(e)\n", 303 | " \n", 304 | " # if we are saving intermediate results to disk, do so now\n", 305 | " if save_results:\n", 306 | " with open(full_result_path, 'wb') as results_file:\n", 307 | " pickle.dump(results[key], results_file)" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "id": "7a8a4509", 314 | "metadata": { 315 | "scrolled": false 316 | }, 317 | "outputs": [], 318 | "source": [ 319 | "# a couple of helper functions to analyze/summarize results\n", 320 | "\n", 321 | "def highlight_max(s, props=''):\n", 322 | " return np.where(s == np.nanmax(s.values), props, '')\n", 323 | "\n", 324 | "def record_places(places, scores):\n", 325 | " scores = {k: v for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)}\n", 326 | " last_score, last_stddev, last_placement = (2, 0, 1)\n", 327 | " for i, clf in enumerate(scores.keys()): \n", 328 | " if scores[clf][0] + scores[clf][1] >= last_score:\n", 329 | " placement = last_placement \n", 330 | " else:\n", 331 | " placement = i+1\n", 332 | " last_score, last_stddev = scores[clf] \n", 333 | " last_placement = i+1\n", 334 | " places[clf][placement] += 1 " 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "id": "7fa49c8e", 341 | "metadata": { 342 | "scrolled": false 343 | }, 344 | "outputs": [], 345 | "source": [ 346 | "# create dataframe of results for each experiment, also process results into dict for keeping track of \n", 347 | "# 1st/2nd/etc. 
place, as well as a dict for plotting later\n", 348 | "\n", 349 | "places = {clf:{p:0 for p in range(1,len(clf_types)+1)} for clf in clf_types}\n", 350 | "plots = {key:{clf:[[],[]] for clf in clf_types} for key in keys}\n", 351 | " \n", 352 | "for key in results.keys():\n", 353 | " print(f\"\\n =={key}==\\n\")\n", 354 | " rows = pd.Index([clf_type for clf_type in clf_types])\n", 355 | " columns = pd.MultiIndex.from_product([noise_amounts, ['mean','std_dev']], names=['type 2 noise', 'auc'])\n", 356 | " df = pd.DataFrame(index=rows, columns=columns)\n", 357 | " \n", 358 | " for noise_amount in noise_amounts:\n", 359 | " scores = {}\n", 360 | " for clf_type in clf_types:\n", 361 | " auc = results[key][noise_amount][clf_type] \n", 362 | " df.loc[clf_type, (noise_amount, 'mean')] = auc[0] \n", 363 | " df.loc[clf_type, (noise_amount, 'std_dev')] = auc[1]\n", 364 | " scores[clf_type] = (auc[0], auc[1])\n", 365 | "\n", 366 | " plots[key][clf_type][0].append(noise_amount)\n", 367 | " plots[key][clf_type][1].append(auc[0])\n", 368 | " record_places(places, scores)\n", 369 | " display(df.style.set_caption(f\"{key}\")\n", 370 | " .format({(n,'mean'): \"{:.2f}\" for n in noise_amounts})\n", 371 | " .format({(n,'std_dev'): \"{:.4f}\" for n in noise_amounts})\n", 372 | " .apply(highlight_max, props='font-weight:bold;background-color:lightblue', axis=0,\n", 373 | " subset=[[n,'mean'] for n in noise_amounts]))" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": null, 379 | "id": "8cb8dbd8", 380 | "metadata": {}, 381 | "outputs": [], 382 | "source": [ 383 | "# produce \"race results\" (i.e. how many first place, second place, etc. 
finishes)\n", 384 | "\n", 385 | "race_results = pd.DataFrame.from_dict(places).rename(index=lambda x : humanize.ordinal(x))\n", 386 | "race_results['totals'] = race_results.sum(axis=1)\n", 387 | "display(race_results)\n", 388 | "print(race_results.to_latex())" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "id": "602877ee", 395 | "metadata": { 396 | "scrolled": false 397 | }, 398 | "outputs": [], 399 | "source": [ 400 | "# finally, we can plot the results of individual experiments\n", 401 | "\n", 402 | "colors = ['black','purple','green','red','orange']\n", 403 | "linestyles = ['-','--',':']\n", 404 | "ylims = {\n", 405 | " 'boundary-consistent': {\n", 406 | " 'ieeecis':[0.5,0.9],\n", 407 | " 'sparknov':[0.5,1],\n", 408 | " 'ccfraud':[0.25,1],\n", 409 | " 'fraudecom':[0.48,0.52],\n", 410 | " 'fakejob':[0.5,1],\n", 411 | " 'vehicleloan':[0.57,0.66],\n", 412 | " 'twitterbot':[0.7,0.95]\n", 413 | " },\n", 414 | " 'class-conditional': {\n", 415 | " 'ieeecis':[0.7,0.9],\n", 416 | " 'sparknov':[0.7,1],\n", 417 | " 'ccfraud':[0.8,1],\n", 418 | " 'fraudecom':[0.48,0.52],\n", 419 | " 'fakejob':[0.7,1],\n", 420 | " 'vehicleloan':[0.5,0.7],\n", 421 | " 'twitterbot':[0.8,0.95]\n", 422 | " }\n", 423 | "}\n", 424 | "\n", 425 | "x_labels = {\n", 426 | " 'boundary-consistent':'Boundary-Consistent Noise Level',\n", 427 | " 'class-conditional':'Class-Conditional Type 2 Noise Level'\n", 428 | "}\n", 429 | "\n", 430 | "legends = {\n", 431 | " 'boundary-consistent':'Cleaning Method',\n", 432 | " 'class-conditional':'Type 1 Noise, Cleaning Method'\n", 433 | "}\n", 434 | "def fix_failures(x):\n", 435 | " if x == 0:\n", 436 | " return None\n", 437 | " else:\n", 438 | " return x\n", 439 | "\n", 440 | "def labels(noise_type, noise_amount, clf_type):\n", 441 | " if noise_type == 'boundary-consistent':\n", 442 | " return '{}'.format(clf_type)\n", 443 | " elif noise_type == 'class-conditional':\n", 444 | " return '{}, {}'.format(noise_amount, 
clf_type)\n", 445 | "\n", 446 | "for key in results.keys():\n", 447 | " plt.figure(figsize=(10,10))\n", 448 | " \n", 449 | " for c, clf_type in enumerate(clf_types):\n", 450 | " a = plots[key][clf_type]\n", 451 | " plt.plot(a[0],[fix_failures(c) for c in a[1]],\n", 452 | " label=labels(noise_type, noise_amount, clf_type),\n", 453 | " color=colors[c],\n", 454 | " linestyle=linestyles[0])\n", 455 | " plt.title(key)\n", 456 | " plt.xlabel(x_labels[noise_type])\n", 457 | " plt.ylabel('Test AUC')\n", 458 | " plt.ylim(ylims[noise_type][key])\n", 459 | " plt.legend(title=legends[noise_type])\n", 460 | " plt.savefig(f\"./figures/label_noise_{key}.png\")\n", 461 | " plt.show()" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "id": "b891c49a", 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [] 471 | } 472 | ], 473 | "metadata": { 474 | "kernelspec": { 475 | "display_name": "conda_python3", 476 | "language": "python", 477 | "name": "conda_python3" 478 | }, 479 | "language_info": { 480 | "codemirror_mode": { 481 | "name": "ipython", 482 | "version": 3 483 | }, 484 | "file_extension": ".py", 485 | "mimetype": "text/x-python", 486 | "name": "python", 487 | "nbconvert_exporter": "python", 488 | "pygments_lexer": "ipython3", 489 | "version": "3.6.13" 490 | } 491 | }, 492 | "nbformat": 4, 493 | "nbformat_minor": 5 494 | } 495 | -------------------------------------------------------------------------------- /scripts/reproducibility/label-noise/feature_dict.py: -------------------------------------------------------------------------------- 1 | feature_dict = { 2 | 'ieeecis': { 3 | 'transactionamt': 'numeric', 4 | 'productcd': 'categorical', 5 | 'card1': 'numeric', 6 | 'card2': 'numeric', 7 | 'card3': 'numeric', 8 | 'card5': 'numeric', 9 | 'card6': 'categorical', 10 | 'addr1': 'numeric', 11 | 'dist1': 'numeric', 12 | 'p_emaildomain': 'categorical', 13 | 'r_emaildomain': 'categorical', 14 | 'c1': 'numeric', 15 | 'c2': 'numeric', 16 
| 'c4': 'numeric', 17 | 'c5': 'numeric', 18 | 'c6': 'numeric', 19 | 'c7': 'numeric', 20 | 'c8': 'numeric', 21 | 'c9': 'numeric', 22 | 'c10': 'numeric', 23 | 'c11': 'numeric', 24 | 'c12': 'numeric', 25 | 'c13': 'numeric', 26 | 'c14': 'numeric', 27 | 'v62': 'numeric', 28 | 'v70': 'numeric', 29 | 'v76': 'numeric', 30 | 'v78': 'numeric', 31 | 'v82': 'numeric', 32 | 'v91': 'numeric', 33 | 'v127': 'numeric', 34 | 'v130': 'numeric', 35 | 'v139': 'numeric', 36 | 'v160': 'numeric', 37 | 'v165': 'numeric', 38 | 'v187': 'numeric', 39 | 'v203': 'numeric', 40 | 'v207': 'numeric', 41 | 'v209': 'numeric', 42 | 'v210': 'numeric', 43 | 'v221': 'numeric', 44 | 'v234': 'numeric', 45 | 'v257': 'numeric', 46 | 'v258': 'numeric', 47 | 'v261': 'numeric', 48 | 'v264': 'numeric', 49 | 'v266': 'numeric', 50 | 'v267': 'numeric', 51 | 'v271': 'numeric', 52 | 'v274': 'numeric', 53 | 'v277': 'numeric', 54 | 'v283': 'numeric', 55 | 'v285': 'numeric', 56 | 'v289': 'numeric', 57 | 'v291': 'numeric', 58 | 'v294': 'numeric', 59 | 'id_01': 'numeric', 60 | 'id_02': 'numeric', 61 | 'id_05': 'numeric', 62 | 'id_06': 'numeric', 63 | 'id_09': 'numeric', 64 | 'id_13': 'numeric', 65 | 'id_17': 'numeric', 66 | 'id_19': 'numeric', 67 | 'id_20': 'numeric', 68 | 'devicetype': 'categorical', 69 | 'deviceinfo': 'categorical' 70 | }, 71 | 'ccfraud': { 72 | 'v1': 'numeric', 73 | 'v2': 'numeric', 74 | 'v3': 'numeric', 75 | 'v4': 'numeric', 76 | 'v5': 'numeric', 77 | 'v6': 'numeric', 78 | 'v7': 'numeric', 79 | 'v8': 'numeric', 80 | 'v9': 'numeric', 81 | 'v10': 'numeric', 82 | 'v11': 'numeric', 83 | 'v12': 'numeric', 84 | 'v13': 'numeric', 85 | 'v14': 'numeric', 86 | 'v15': 'numeric', 87 | 'v16': 'numeric', 88 | 'v17': 'numeric', 89 | 'v18': 'numeric', 90 | 'v19': 'numeric', 91 | 'v20': 'numeric', 92 | 'v21': 'numeric', 93 | 'v22': 'numeric', 94 | 'v23': 'numeric', 95 | 'v24': 'numeric', 96 | 'v25': 'numeric', 97 | 'v26': 'numeric', 98 | 'v27': 'numeric', 99 | 'v28': 'numeric', 100 | 'amount': 'numeric' 101 | }, 102 | 
'fraudecom': { 103 | 'purchase_value': 'numeric', 104 | 'source': 'categorical', 105 | 'browser': 'categorical', 106 | 'age': 'numeric', 107 | 'ip_address': 'enrichable', 108 | 'time_since_signup': 'numeric' 109 | }, 110 | 'sparknov': { 111 | 'cc_num': 'categorical', 112 | 'category': 'categorical', 113 | 'amt': 'numeric', 114 | 'first': 'categorical', 115 | 'last': 'categorical', 116 | 'gender': 'categorical', 117 | 'street': 'categorical', 118 | 'city': 'categorical', 119 | 'state': 'categorical', 120 | 'zip': 'categorical', 121 | 'lat': 'numeric', 122 | 'long': 'numeric', 123 | 'city_pop': 'numeric', 124 | 'job': 'categorical', 125 | 'dob': 'text', 126 | 'merch_lat': 'numeric', 127 | 'merch_long': 'numeric' 128 | }, 129 | 'twitterbot': { 130 | 'created_at' : 'text', 131 | 'default_profile': 'categorical', 132 | 'default_profile_image': 'categorical', 133 | 'description': 'text', 134 | 'favourites_count': 'numeric', 135 | 'followers_count': 'numeric', 136 | 'friends_count': 'numeric', 137 | 'geo_enabled': 'categorical', 138 | 'lang': 'categorical', 139 | 'location': 'categorical', 140 | 'profile_background_image_url': 'text', 141 | 'profile_image_url': 'text', 142 | 'screen_name': 'text', 143 | 'statuses_count': 'numeric', 144 | 'verified': 'categorical', 145 | 'average_tweets_per_day': 'numeric', 146 | 'account_age_days': 'numeric' 147 | }, 148 | 'fakejob': { 149 | 'title': 'categorical', 150 | 'location': 'categorical', 151 | 'department': 'categorical', 152 | 'salary_range': 'text', 153 | 'company_profile': 'text', 154 | 'description': 'text', 155 | 'requirements': 'text', 156 | 'benefits': 'text', 157 | 'telecommuting': 'categorical', 158 | 'has_company_logo': 'categorical', 159 | 'has_questions': 'categorical', 160 | 'employment_type': 'categorical', 161 | 'required_experience': 'categorical', 162 | 'required_education': 'categorical', 163 | 'industry': 'categorical', 164 | 'function': 'categorical' 165 | }, 166 | 'vehicleloan': { 167 | 'disbursed_amount': 
'numeric', 168 | 'asset_cost': 'numeric', 169 | 'ltv': 'numeric', 170 | 'branch_id': 'categorical', 171 | 'supplier_id': 'categorical', 172 | 'manufacturer_id': 'categorical', 173 | 'current_pincode_id': 'categorical', 174 | 'date_of_birth': 'text', 175 | 'employment_type': 'categorical', 176 | 'state_id': 'categorical', 177 | 'employee_code_id': 'categorical', 178 | 'mobileno_avl_flag': 'categorical', 179 | 'aadhar_flag': 'categorical', 180 | 'pan_flag': 'categorical', 181 | 'voterid_flag': 'categorical', 182 | 'driving_flag': 'categorical', 183 | 'passport_flag': 'categorical', 184 | 'perform_cns_score': 'numeric', 185 | 'perform_cns_score_description': 'categorical', 186 | 'pri_no_of_accts': 'numeric', 187 | 'pri_active_accts': 'numeric', 188 | 'pri_overdue_accts': 'numeric', 189 | 'pri_current_balance': 'numeric', 190 | 'pri_sanctioned_amount': 'numeric', 191 | 'pri_disbursed_amount': 'numeric', 192 | 'sec_no_of_accts': 'numeric', 193 | 'sec_active_accts': 'numeric', 194 | 'sec_overdue_accts': 'numeric', 195 | 'sec_current_balance': 'numeric', 196 | 'sec_sanctioned_amount': 'numeric', 197 | 'sec_disbursed_amount': 'numeric', 198 | 'primary_instal_amt': 'numeric', 199 | 'sec_instal_amt': 'numeric', 200 | 'new_accts_in_last_six_months': 'numeric', 201 | 'delinquent_accts_in_last_six_months': 'numeric', 202 | 'average_acct_age': 'text', 203 | 'credit_history_length': 'text', 204 | 'no_of_inquiries': 'numeric' 205 | } 206 | } -------------------------------------------------------------------------------- /scripts/reproducibility/label-noise/load_fdb_datasets.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import json 4 | import pandas as pd 5 | import numpy as np 6 | import warnings 7 | from datetime import datetime 8 | 9 | from category_encoders.target_encoder import TargetEncoder 10 | from skclean.simulate_noise import flip_labels_cc, BCNoise 11 | 12 | from fdb.datasets import 
FraudDatasetBenchmark 13 | 14 | import feature_dict 15 | 16 | DATASET_PATH = './data/dataset.csv' 17 | METADATA_PATH = './data/feature_metadata.json' 18 | FD = feature_dict.feature_dict 19 | 20 | def noise_amount(df): 21 | return df[df.noise == 1].shape[0] 22 | 23 | def noise_rate(df): 24 | if df.shape[0] > 0: 25 | return noise_amount(df)/df.shape[0] 26 | else: 27 | return None 28 | 29 | def type_1_noise_amount(df): 30 | # examples with true label 0, mislabeled as 1 31 | # here 'df.label' is the observed label, not the true one 32 | return df[(df.label==1) & (df.noise == 1)].shape[0] 33 | 34 | def type_2_noise_amount(df): 35 | # examples with true label 1, mislabeled as 0 36 | # here 'df.label' is the observed label, not the true one 37 | return df[(df.label==0) & (df.noise == 1)].shape[0] 38 | 39 | def actual_legit_amount(df): # true 0s: observed 0 and not noisy, or observed 1 that was flipped from 0 40 | return df[((df.label == 0) & (df.noise == 0)) | ((df.label == 1) & (df.noise == 1))].shape[0] 41 | 42 | def observed_legit_amount(df): 43 | return df[df.label == 0].shape[0] 44 | 45 | def actual_fraud_amount(df): 46 | return df[((df.label == 1) & (df.noise == 0)) | ((df.label == 0) & (df.noise == 1))].shape[0] 47 | 48 | def observed_fraud_amount(df): 49 | return df[df.label == 1].shape[0] 50 | 51 | def actual_fraud_rate(df): 52 | if df.shape[0] > 0: 53 | return actual_fraud_amount(df)/df.shape[0] 54 | else: 55 | return None 56 | 57 | def observed_fraud_rate(df): 58 | if df.shape[0] > 0: 59 | return observed_fraud_amount(df)/df.shape[0] 60 | else: 61 | return None 62 | 63 | def type_1_noise_rate(df): 64 | if actual_legit_amount(df) > 0: 65 | return type_1_noise_amount(df)/actual_legit_amount(df) 66 | else: 67 | return None 68 | 69 | def type_2_noise_rate(df): 70 | if actual_fraud_amount(df) > 0: 71 | return type_2_noise_amount(df)/actual_fraud_amount(df) 72 | else: 73 | return None 74 | 75 | def prepare_data_fdb(key, drop_text_enr_features=True): 76 | """ 77 | main function, gets datasets from FDB and then does some preprocessing/cleaning so they are suitable 78 | for modeling, returns data and 
metadata 79 | 80 | inputs: 81 | key - the FDB dataset to load 82 | drop_text_enr_features - whether we want to drop text/enrichable features 83 | this returns 84 | df - full pandas dataframe containing features, labels and metadata 85 | this includes training and test data, with a 'dataset' column to indicate which 86 | all of these datasets have a timestamp column (even if it is "fake") and by default 87 | data will be sorted by this column. All test > train w.r.t. this timestamp 88 | 89 | features - list of feature names 90 | cat_features - list of categorical feature names (subset of features) 91 | label - name of label column 92 | record_id - name of unique id column 93 | """ 94 | 95 | obj = FraudDatasetBenchmark(key=key) 96 | 97 | print(obj.key) 98 | 99 | # extract training and testing data (and test labels) from the return object 100 | # sort training data by event timestamp 101 | train_df = obj.train.sort_values(by='EVENT_TIMESTAMP',ignore_index=True) 102 | test_df = obj.test.reset_index(drop=True) 103 | test_labels = obj.test_labels.reset_index(drop=True) 104 | 105 | # define metadata and label column names 106 | metadata = ['EVENT_LABEL', 'EVENT_TIMESTAMP', 'ENTITY_ID', 'ENTITY_TYPE', 'EVENT_ID', 107 | 'label', 'LABEL_TIMESTAMP', 'noise', 'dataset'] 108 | label = ['label'] 109 | 110 | # we maintain a feature dictionary in another file, this helps us determine which are categorical, numerical, etc. 
111 | feature_dict = FD[key] 112 | raw_features = feature_dict.keys() 113 | num_features = [f for f in raw_features if feature_dict[f] == 'numeric'] 114 | cat_features = [f for f in raw_features if feature_dict[f] == 'categorical'] 115 | txt_features = [f for f in raw_features if feature_dict[f] == 'text'] 116 | enr_features = [f for f in raw_features if feature_dict[f] == 'enrichable'] 117 | 118 | # add / rename labels 119 | train_df.rename({'EVENT_LABEL':'label'}, axis=1, inplace=True) 120 | test_df['label'] = test_labels['EVENT_LABEL'] 121 | if key == 'twitterbot': 122 | train_df.loc[train_df.label == 'bot', 'label'] = 1 123 | test_df.loc[test_df.label == 'bot', 'label'] = 1 124 | train_df.loc[train_df.label == 'human', 'label'] = 0 125 | test_df.loc[test_df.label == 'human', 'label'] = 0 126 | 127 | # put train / test into single dataframe, create a 'dataset' column to keep track 128 | train_df['dataset'] = 'train' 129 | test_df['dataset'] = 'test' 130 | 131 | # create noise column - we won't generate any noise now but it may be useful to have (can also be ignored) 132 | train_df['noise'] = 0 133 | test_df['noise'] = 0 134 | 135 | # concatenate train/test into single dataframe 136 | # (remember we have 'dataset' column to separate them again if needed) 137 | df = pd.concat([train_df, test_df], axis=0, ignore_index=True) 138 | 139 | # there are a few date columns that are timestamps, we convert those to epoch 140 | # the new values are put into new columns, those column names are added to the numerical features 141 | if key == 'twitterbot': 142 | df['eng_created_at'] = df['created_at'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d %H:%M:%S').timestamp()) 143 | num_features.append('eng_created_at') 144 | if key == 'sparknov': 145 | df['eng_dob'] = df['dob'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d').timestamp()) 146 | num_features.append('eng_dob') 147 | 148 | # fakejob has a salary range column, e.g. 
"10000 - 20000" that can be converted into two numerical columns 149 | if key == 'fakejob': 150 | def convert(x): 151 | r = re.search(r"([0-9]*)-([0-9]*)",str(x)) 152 | try: 153 | m, M = r.group(1), r.group(2) 154 | if m == '' or M == '': 155 | m, M = 0,0 156 | except: 157 | m, M = 0,0 158 | return m,M 159 | 160 | df['salary_min'], df['salary_max'] = zip(*df['salary_range'].map(convert)) 161 | num_features = num_features + ['salary_min','salary_max'] 162 | 163 | # vehicleloan has a timestamp column that we convert to epoch 164 | # it also has "account age" and "credit history" length cols 165 | # in form "Xyrs Ymon" that can be converted to numeric 166 | if key == 'vehicleloan': 167 | df['eng_dob'] = df['date_of_birth'].apply(lambda x : datetime.strptime(x, '%d-%m-%Y').timestamp()) 168 | 169 | def convert(x): 170 | r = re.search(r"([0-9]*)yrs ([0-9]*)mon", x) 171 | try: 172 | age = 12*float(r.group(1)) + float(r.group(2)) 173 | except: 174 | age = 0 175 | return age 176 | 177 | df['eng_average_acct_age'] = df['average_acct_age'].apply(convert) 178 | df['eng_credit_history_length'] = df['credit_history_length'].apply(convert) 179 | num_features = num_features + ['eng_dob','eng_average_acct_age','eng_credit_history_length'] 180 | 181 | # by default we will drop any remaining text or enrichable (IP address) features as we won't use them 182 | # but you can pass in False for this if they are of interest 183 | if drop_text_enr_features: 184 | df.drop(txt_features + enr_features, axis=1, inplace=True) 185 | features = num_features + cat_features 186 | 187 | # cast all numeric features to float just in case they aren't 188 | for feature in num_features: 189 | df[feature] = df[feature].astype('float64') 190 | df[feature].fillna(0, inplace=True) 191 | 192 | # cast all categorical features to str in case they aren't 193 | for feature in cat_features: 194 | df[feature] = df[feature].astype(str) 195 | df[feature].fillna('', inplace=True) 196 | 197 | # rename the timestamp 
column 198 | df.rename({'EVENT_TIMESTAMP':'creation_date'}, axis=1, inplace=True) 199 | 200 | # cast the label to int just to be sure 201 | df['label'] = df['label'].astype('int') 202 | 203 | # name of unique id column will always be EVENT_ID 204 | record_id = 'EVENT_ID' 205 | 206 | if drop_text_enr_features: 207 | return df, features, cat_features, label, record_id 208 | else: 209 | return df, features, cat_features, txt_features, enr_features, label, record_id 210 | 211 | 212 | def add_noise(df, noise_type, noise_amount, *, time_index=None, features=None, cat_features=None, label=None): 213 | 214 | if noise_type not in ['random', 'time-dependent', 'boundary-consistent']: 215 | raise(Exception('Invalid Noise Type')) 216 | 217 | # if we want time-dependent noise it will be useful to convert timestamps into epoch seconds (despite the helper's name) 218 | def convert_to_millis(x): 219 | try: 220 | m = datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').timestamp() 221 | except ValueError: 222 | m = datetime.strptime(x, '%Y-%m-%d %H:%M:%S').timestamp() 223 | return m 224 | 225 | # random noise can be class-conditional in both directions (other types of noise cannot) 226 | # if noise_amount is passed in as [r,s] we can flip labels in both directions: 227 | # r is the fraction of 0s flipped to 1s 228 | # s is the fraction of 1s flipped to 0s 229 | # for random noise, if noise_amount is a single number, assume it is s, and that r=0 230 | # (i.e. 
class-conditional noise where only 1s get flipped to 0s) 231 | if isinstance(noise_amount, (tuple, list)): 232 | if noise_type != 'random': 233 | raise(Exception('For time-dependent and boundary-consistent noise, ' 234 | 'only a single value is allowed for noise_amount')) 235 | r = noise_amount[0] 236 | s = noise_amount[1] 237 | else: 238 | r = 0 239 | s = noise_amount 240 | 241 | # we will add noise to a *copy* of the dataframe 242 | df_copy = df.copy() 243 | 244 | if noise_type == 'time-dependent': 245 | df_copy['event_millis'] = df_copy[time_index].apply(convert_to_millis) 246 | df_copy['event_millis'] = df_copy['event_millis'] - df_copy['event_millis'].min() 247 | mislabel = df_copy[(df_copy.noise == 0) 248 | & (df_copy.label == 1)].sample(frac = s, 249 | weights=df_copy['event_millis']).index 250 | df_copy.loc[mislabel,'noise'] = 1 251 | df_copy.loc[mislabel,'label'] = 0 252 | else: 253 | if noise_type == 'boundary-consistent': 254 | from catboost import CatBoostClassifier 255 | warnings.filterwarnings("ignore", category=FutureWarning) 256 | target_encoder = TargetEncoder(cols=cat_features) 257 | reshaped_y = df_copy[label].values.reshape(df_copy[label].shape[0],) 258 | X = target_encoder.fit_transform(df_copy[features], reshaped_y) 259 | clf = CatBoostClassifier(verbose=False) 260 | clf.fit(X, reshaped_y) 261 | _, noisy_labels = BCNoise(clf, noise_level=s).simulate_noise(X, reshaped_y) 262 | else: 263 | lcm = np.array([[1-r,r],[s,1-s]]) 264 | noisy_labels = flip_labels_cc(df_copy.label,lcm) 265 | 266 | idx = (df_copy.label != noisy_labels) 267 | df_copy.loc[idx,'noise'] = 1 268 | df_copy['label'] = noisy_labels 269 | 270 | return df_copy 271 | 272 | 273 | def train_valid_split(df, split=0.7, shuffle=True, sort_key='creation_date'): 274 | if shuffle: 275 | df = df.sample(frac=1).reset_index(drop=True) 276 | else: 277 | df = df.sort_values(by=sort_key, ignore_index=True) 278 | train_idx = int(round(split*df.shape[0])) 279 | train = 
df[:train_idx].reset_index(drop=True) 280 | valid = df[train_idx:].reset_index(drop=True) 281 | 282 | return train, valid 283 | 284 | 285 | def prepare_noisy_dataset(key, noise_type, noise_amount, split=0.7, shuffle=True, 286 | sort_key='creation_date', target_encoding=False): 287 | """ 288 | this function can be used to fetch datasets from FDB, 289 | starts by calling prepare_data_fdb and then adding noise 290 | 291 | input: 292 | key - name of FDB dataset 293 | noise_type - what type of noise to add 294 | noise_amount - how much noise to add 295 | split - training/validation split 296 | shuffle - whether or not to shuffle or sort before doing train/valid split 297 | sort_key - key to use to sort for train/valid split as well as weight for time-dependent noise 298 | """ 299 | 300 | # start by getting clean dataset 301 | 302 | df, features, cat_features, label, record_id = prepare_data_fdb(key) 303 | 304 | if noise_type == 'boundary-consistent': 305 | train_and_valid = add_noise(df[df.dataset == 'train'], noise_type, noise_amount, 306 | time_index=sort_key, features=features, cat_features=cat_features, label=label) 307 | else: 308 | train_and_valid = add_noise(df[df.dataset == 'train'], noise_type, noise_amount, time_index=sort_key) 309 | 310 | train, valid = train_valid_split(train_and_valid, split, shuffle=shuffle, sort_key=sort_key) 311 | test = df[df.dataset == 'test'].reset_index(drop=True) 312 | 313 | train = train[features + ['noise'] + label] 314 | valid = valid[features + ['noise'] + label] 315 | test = test[features + ['noise'] + label] 316 | 317 | if target_encoding: 318 | warnings.filterwarnings("ignore", category=FutureWarning) 319 | target_encoder = TargetEncoder(cols=cat_features) 320 | reshaped_y = train[label].values.reshape(train[label].shape[0],) 321 | train.loc[:, features] = target_encoder.fit_transform(train[features], reshaped_y) 322 | valid.loc[:, features] = target_encoder.transform(valid[features]) 323 | test.loc[:, features] = 
target_encoder.transform(test[features]) 324 | cat_features = None 325 | 326 | dataset = { 327 | 'description': f"{key} dataset with noise type: {noise_type}, noise amount: {noise_amount} ", 328 | 'features':features, 329 | 'cat_features':cat_features, 330 | 'label':label, 331 | 'record_id':record_id, 332 | 'train':train, 333 | 'valid':valid, 334 | 'test':test, 335 | 'noise':(noise_rate(train), noise_rate(valid), noise_rate(test)), 336 | 'fraud_level':(actual_fraud_rate(train), actual_fraud_rate(valid), actual_fraud_rate(test)), 337 | 'observed_fraud_level':(observed_fraud_rate(train),observed_fraud_rate(valid),observed_fraud_rate(test)), 338 | 'type_1_noise_rate':(type_1_noise_rate(train),type_1_noise_rate(valid),type_1_noise_rate(test)), 339 | 'type_2_noise_rate':(type_2_noise_rate(train),type_2_noise_rate(valid),type_2_noise_rate(test)) 340 | } 341 | 342 | return dataset 343 | 344 | 345 | def dataset_stats(dataset): 346 | noise = dataset['noise'] 347 | fraud_level = dataset['fraud_level'] 348 | observed_fraud_level = dataset['observed_fraud_level'] 349 | type_1_noise_rate = dataset['type_1_noise_rate'] 350 | type_2_noise_rate = dataset['type_2_noise_rate'] 351 | stats = list(zip(['train','valid','test'],noise,type_1_noise_rate,type_2_noise_rate,fraud_level,observed_fraud_level)) 352 | print(dataset['description']) 353 | for stat in stats: 354 | print('{} - total noise rate: {:.3f}, type 1 noise rate: {:.3f}, type 2 noise rate: {:.3f},\n' 355 | '(actual) fraud rate: {:.3f}, observed fraud rate: {:.3f}'.format(*stat)) 356 | 357 | -------------------------------------------------------------------------------- /scripts/reproducibility/label-noise/micro_models.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import pandas as pd 3 | import numpy as np 4 | 5 | 6 | class MicroModelError(Exception): 7 | """ 8 | basic exception type for micro-model specific errors 9 | """ 10 | def __init__(self, error_message): 
11 | logging.error(error_message) 12 | 13 | 14 | class MicroModel: 15 | """ 16 | Basic wrapper for the model to be used in ensemble noise removal, ModelClass can be anything that implements 17 | fit and predict_proba. Mainly used by MicroModelEnsemble, user is probably not calling this directly 18 | """ 19 | 20 | def __init__(self, ModelClass, *args, **kwargs): 21 | """ 22 | initialization of the class, ModelClass should be a *class* not an object 23 | e.g. CatBoostClassifier, not CatBoostClassifier() 24 | """ 25 | self.clf = ModelClass(*args, **kwargs) 26 | self.thresh = None 27 | 28 | def set_thresh(self, thresh): 29 | # can set a threshold to be used in model predictions 30 | self.thresh = thresh 31 | 32 | def fit(self, x, y, *args, **kwargs): 33 | # pass-through method to call model.fit() 34 | self.clf.fit(x, y.values.ravel(), *args, **kwargs) 35 | 36 | def predict_proba(self, x, *args, **kwargs): 37 | # pass-through method to call model.predict_proba() 38 | if 'predict_proba' in dir(self.clf): 39 | return self.clf.predict_proba(x, *args, **kwargs) 40 | else: 41 | raise (MicroModelError('ModelClass must implement predict_proba')) 42 | 43 | def predict(self, x): 44 | # make predictions, using either defined threshold (if set) or default value of 0.5 45 | if self.thresh is not None: 46 | t = self.thresh 47 | else: 48 | t = 0.5 49 | scores = self.predict_proba(x)[:, 1] 50 | preds = [int(s > t) for s in scores] 51 | return scores, preds 52 | 53 | 54 | class MicroModelEnsemble: 55 | """ 56 | Ensemble of micro-models used to remove noise 57 | """ 58 | 59 | def __init__(self, ModelClass, num_clfs=16, score_type='preds_avg', *args, **kwargs): 60 | """ 61 | initialization of the class, ModelClass should be a *class* not an object 62 | e.g. 
CatBoostClassifier, not CatBoostClassifier() 63 | params: 64 | ModelClass - base class to use, needs to implement fit and predict_proba 65 | num_clfs - number of classifiers to use in cleaning ensemble 66 | score_type - means of computing anomaly score from micro-model scores 67 | args/kwargs - any other parameters to pass to model constructor, e.g. cat_features or iterations for CatBoost 68 | """ 69 | self.score_type = score_type 70 | 71 | if type(num_clfs) is not int or num_clfs <= 0: 72 | raise (MicroModelError('num_clfs must be a positive integer')) 73 | self.ModelClass = ModelClass 74 | 75 | # one classifier that will be trained over entire dataset 76 | self.big_clf = MicroModel(ModelClass=ModelClass, *args, **kwargs) 77 | 78 | # micro-models to later be trained over slices 79 | self.num_clfs = num_clfs 80 | self.clfs = [] 81 | for i in range(num_clfs): 82 | self.clfs.append(MicroModel(ModelClass=ModelClass, *args, **kwargs)) 83 | self.thresholds = {} 84 | 85 | def fit(self, x, y, *args, **kwargs): 86 | # assumption that data is already shuffled or sorted (by date or other appropriate key) 87 | # according to the usecase 88 | 89 | if not isinstance(y, pd.DataFrame): 90 | y = pd.DataFrame(y) 91 | 92 | # fit one classifier on all the data 93 | self.big_clf.fit(x, y, *args, **kwargs) 94 | 95 | # now fit individual models on slices of data 96 | stride = round(x.shape[0] / self.num_clfs) 97 | for i, clf in enumerate(self.clfs): 98 | idx = slice(i * stride, min((i + 1) * stride, x.shape[0])) 99 | x_i = x.iloc[idx, :] 100 | y_i = y.iloc[idx, :] 101 | clf.fit(x_i, y_i, *args, **kwargs) 102 | 103 | def predict_proba(self, x, *args, **kwargs): 104 | # output is the mean of the (binary) predictions of all models in the ensemble 105 | # e.g. 
the percentage of models that voted on the example 106 | results = pd.DataFrame(index=np.arange(x.shape[0])) 107 | if self.score_type == 'preds_avg': 108 | for i, clf in enumerate(self.clfs): 109 | _, results[i] = clf.predict(x, *args, **kwargs) 110 | elif self.score_type == 'score_avg': 111 | for i, clf in enumerate(self.clfs): 112 | results[i] = clf.predict_proba(x, *args, **kwargs)[:, 1] 113 | 114 | scores = results.mean(axis=1, numeric_only=True) 115 | return scores 116 | 117 | def predict(self, x, threshold=0.5, *args, **kwargs): 118 | # compare output of predict_proba to a threshold in order to make a binary prediction, default is 0.5 119 | scores = self.predict_proba(x) 120 | preds = np.array([int(s >= threshold) for s in scores]) 121 | return scores, preds 122 | 123 | def filter_noise(self, x, y, pulearning=True, threshold=0.5): 124 | # compare ensemble predictions to observed labels and return the examples that are NOT considered noise 125 | # i.e. this is noise REMOVAL 126 | # pulearning=True means a class-conditional assumption is being made: 127 | # there are no examples of true 0s mislabeled as 1s 128 | scores, susp = self.predict(x, threshold) 129 | if pulearning: 130 | conf = ((y == 1) | ((y == 0) & (susp == 0))) 131 | else: 132 | conf = (((y == 1) & (scores > 1 - threshold)) | ((y == 0) & (scores < threshold))) 133 | 134 | return x[conf].reset_index(drop=True), y[conf] 135 | 136 | def clean_noise(self, x, y, pulearning=True, threshold=0.5): 137 | # compare ensemble predictions to observed labels and return all examples with corrected labels 138 | # i.e. 
this is noise CLEANING 139 | # pulearning=True means a class-conditional assumption is being made: 140 | # there are no examples of true 0s mislabeled as 1s 141 | x = x.copy() 142 | y = y.copy() 143 | _, susp = self.predict(x, threshold) 144 | # flip all the probable 1s to actual 1s 145 | probable_1 = (y == 0) & (susp == 1) 146 | y[probable_1] = 1 147 | if not pulearning: 148 | # if there are both types of noise, flip probable 0s to actual 0s 149 | probable_0 = (y == 1) & (susp == 0) 150 | y[probable_0] = 0 151 | 152 | return x, y 153 | 154 | 155 | class MicroModelCleaner: 156 | """ 157 | This class performs the entire model training process end-to-end: given a dataset, it first trains an ensemble, 158 | then removes (or relabels) noise, then trains a final model on the clean data 159 | """ 160 | 161 | def __init__(self, ModelClass, strategy='filter', pulearning=True, num_clfs=16, threshold=0.5, *args, **kwargs): 162 | """ 163 | initialization of the class, ModelClass should be a *class* not an object 164 | e.g. CatBoostClassifier, not CatBoostClassifier() 165 | params: 166 | ModelClass - base class to use, needs to implement fit and predict_proba 167 | strategy - whether to remove noise ('filter') or flip labels ('clean') 168 | pulearning - class-conditional assumption, if True assume there are no true 0s mislabeled as 1s 169 | num_clfs - number of classifiers to use in cleaning ensemble 170 | threshold - fraction of classifiers that have to vote to remove noise (0.5 is majority voting) 171 | args/kwargs - any other parameters to pass to model constructor, e.g. 
cat_features or iterations for CatBoost 172 | """ 173 | self.detector = MicroModelEnsemble(ModelClass, num_clfs, *args, **kwargs) 174 | self.clf = ModelClass(*args, **kwargs) 175 | if strategy.lower() not in ['filter', 'clean']: 176 | raise (MicroModelError('strategy must be filter or clean')) 177 | self.strategy = strategy.lower() 178 | self.pulearning = pulearning 179 | self.threshold = threshold 180 | 181 | def fit(self, x, y, *args, **kwargs): 182 | # first train the Ensemble to deal with the noise 183 | self.detector.fit(x, y, *args, **kwargs) 184 | if self.strategy == 'filter': 185 | x_clean, y_clean = self.detector.filter_noise(x, y, self.pulearning, self.threshold) 186 | else: 187 | x_clean, y_clean = self.detector.clean_noise(x, y, self.pulearning, self.threshold) 188 | 189 | # then train final model on clean data 190 | self.clf.fit(x_clean, y_clean, *args, **kwargs) 191 | 192 | def predict(self, x, *args, **kwargs): 193 | return self.clf.predict(x, *args, **kwargs) 194 | 195 | def predict_proba(self, x, *args, **kwargs): 196 | return self.clf.predict_proba(x, *args, **kwargs) 197 | 198 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import os 2 | from glob import glob 3 | 4 | from setuptools import find_packages, setup 5 | 6 | 7 | setup( 8 | name='fraud_dataset_benchmark', 9 | version='1.0', 10 | 11 | # declare your packages 12 | packages=find_packages(where='src', exclude=('test',)), 13 | package_dir={'': 'src'}, 14 | include_package_data=True, 15 | data_files=[('.',[ 16 | 'src/fdb/versioned_datasets/ipblock/20220607.zip', 17 | ])], 18 | 19 | # Enable build-time format checking 20 | check_format=False, 21 | 22 | # Enable type checking 23 | test_mypy=False, 24 | 25 | # Enable linting at build time 26 | test_flake8=False, 27 | 28 | # exclude_package_data={ 29 | # '': glob('fdb/*/__pycache__', recursive=True), 30 | # } 31 | ) 32 
| -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/src/__init__.py -------------------------------------------------------------------------------- /src/fdb/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/src/fdb/__init__.py -------------------------------------------------------------------------------- /src/fdb/datasets.py: -------------------------------------------------------------------------------- 1 | from abc import abstractmethod, ABC 2 | from fdb.preprocessing import * 3 | from fdb.preprocessing_objects import load_data 4 | from sklearn.metrics import roc_auc_score, roc_curve, auc 5 | 6 | class FraudDatasetBenchmark(ABC): 7 | def __init__( 8 | self, 9 | key, 10 | load_pre_downloaded=False, 11 | delete_downloaded=True, 12 | add_random_values_if_real_na = { 13 | "EVENT_TIMESTAMP": True, 14 | "LABEL_TIMESTAMP": True, 15 | "ENTITY_ID": True, 16 | "ENTITY_TYPE": True, 17 | "EVENT_ID": True 18 | }): 19 | self.key = key 20 | self.obj = load_data(self.key, load_pre_downloaded, delete_downloaded, add_random_values_if_real_na) 21 | 22 | @property 23 | def train(self): 24 | return self.obj.train 25 | 26 | @property 27 | def test(self): 28 | return self.obj.test 29 | 30 | @property 31 | def test_labels(self): 32 | return self.obj.test_labels 33 | 34 | def eval(self, y_pred): 35 | 36 | """ 37 | Method to evaluate predictions against the test set 38 | """ 39 | roc_score = roc_auc_score(self.test_labels['EVENT_LABEL'], y_pred) 40 | fpr, tpr, thres = roc_curve(self.test_labels['EVENT_LABEL'], y_pred) 41 | tpr_1fpr = np.interp(0.01, fpr, tpr) 42 | metrics = {'roc_score': 
roc_score, 'tpr_1fpr': tpr_1fpr} 43 | return metrics 44 | 45 | 46 | -------------------------------------------------------------------------------- /src/fdb/kaggle_configs.py: -------------------------------------------------------------------------------- 1 | KAGGLE_CONFIGS = { 2 | 3 | "fakejob": 4 | { 5 | "owner": "shivamb", 6 | "dataset": "real-or-fake-fake-jobposting-prediction", 7 | "filename": 'fake_job_postings.csv', 8 | "name": "Real / Fake Job Posting Prediction", 9 | "type": "datasets", 10 | "version": 1 11 | }, 12 | 13 | "vehicleloan": 14 | { 15 | "owner": "avikpaul4u", 16 | "dataset": "vehicle-loan-default-prediction", 17 | "filename": 'train.csv', 18 | "name": "Vehicle Loan Default Prediction", 19 | "type": "datasets", 20 | "version": 4 21 | }, 22 | 23 | "malurl": 24 | { 25 | "owner": "sid321axn", 26 | "dataset": "malicious-urls-dataset", 27 | "filename": 'malicious_phish.csv', 28 | "name": "Malicious URLs Dataset", 29 | "type": "datasets", 30 | "version": 1 31 | }, 32 | 33 | "ieeecis": 34 | { 35 | "owner": "ieee-fraud-detection", 36 | "name": "IEEE-CIS Fraud Detection", 37 | "type": "competitions", 38 | }, 39 | 40 | "ccfraud": 41 | { 42 | "owner": "mlg-ulb", 43 | "dataset": "creditcardfraud", 44 | "filename": 'creditcard.csv', 45 | "name": "Credit Card Fraud Detection", 46 | "type": "datasets", 47 | "version": 3 48 | }, 49 | 50 | "fraudecom": 51 | { 52 | "owner": "vbinh002", 53 | "dataset": "fraud-ecommerce", 54 | "filename": 'Fraud_Data.csv', 55 | "name": "Fraud ecommerce", 56 | "type": "datasets", 57 | "version": 1 58 | }, 59 | 60 | "sparknov": 61 | { 62 | "owner": "kartik2112", 63 | "dataset": "fraud-detection", 64 | "name": "Simulated Credit Card Transactions generated using Sparkov", 65 | "type": "datasets", 66 | "version": 1 67 | }, 68 | 69 | "twitterbot": 70 | { 71 | "owner": "davidmartngutirrez", 72 | "dataset": "twitter-bots-accounts", 73 | "filename": "twitter_human_bots_dataset.csv", 74 | "name": "Twitter Bots Accounts", 75 | "type": 
"datasets", 76 | "version": 2 77 | } 78 | } -------------------------------------------------------------------------------- /src/fdb/preprocessing.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | import os 4 | import re 5 | import shutil 6 | import kaggle 7 | import pkgutil 8 | import requests 9 | import zipfile 10 | import numpy as np 11 | from abc import ABC 12 | import pandas as pd 13 | import socket, struct 14 | from faker import Faker 15 | from zipfile import ZipFile 16 | from datetime import datetime 17 | from datetime import timedelta 18 | from io import StringIO, BytesIO 19 | from dateutil.relativedelta import relativedelta 20 | 21 | from fdb.kaggle_configs import KAGGLE_CONFIGS 22 | 23 | fake = Faker(['en_US']) 24 | 25 | 26 | # Naming convention for the meta data columns in standardized datasets 27 | _EVENT_TIMESTAMP = 'EVENT_TIMESTAMP' # timestamp column 28 | _ENTITY_TYPE = 'ENTITY_TYPE' # afd specific requirement 29 | _EVENT_LABEL = 'EVENT_LABEL' # label column 30 | _EVENT_ID = 'EVENT_ID' # transaction/event id 31 | _ENTITY_ID = 'ENTITY_ID' # represents user/account id 32 | _LABEL_TIMESTAMP = 'LABEL_TIMESTAMP' # added in cases where entity id is meaningful 33 | 34 | # Kaggle config related strings 35 | _OWNER = 'owner' 36 | _COMPETITIONS = 'competitions' 37 | _TYPE = 'type' 38 | _FILENAME = 'filename' 39 | _DATASETS = 'datasets' 40 | _DATASET = 'dataset' 41 | _VERSION = 'version' 42 | 43 | # Some fixed parameters 44 | _RANDOM_STATE = 1 45 | _CWD = os.getcwd() 46 | _DOWNLOAD_LOCATION = os.path.join(_CWD, 'tmp') 47 | _TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%SZ' 48 | _DEFAULT_LABEL_TIMESTAMP = datetime.now().strftime(_TIMESTAMP_FORMAT) 49 | 50 | 51 | class BasePreProcessor(ABC): 52 | def __init__( 53 | self, 54 | key = None, 55 | train_percentage = 0.8, 56 | timestamp_col = None, 57 | label_col = None, 58 | label_timestamp_col = None, 59 | event_id_col = None, 60 | entity_id_col = None, 61 | features_to_drop = 
[], 62 | load_pre_downloaded = False, 63 | delete_downloaded = True, 64 | add_random_values_if_real_na = { 65 | "EVENT_TIMESTAMP": True, 66 | "LABEL_TIMESTAMP": True, 67 | "ENTITY_ID": True, 68 | "ENTITY_TYPE": True, 69 | "EVENT_ID": True 70 | } 71 | ): 72 | 73 | self.key = key 74 | self.train_percentage = train_percentage 75 | self.features_to_drop = features_to_drop 76 | self.delete_downloaded = delete_downloaded 77 | 78 | self._timestamp_col = timestamp_col 79 | self._label_col = label_col 80 | self._label_timestamp_col = label_timestamp_col 81 | self._event_id_col = event_id_col 82 | self._entity_id_col = entity_id_col 83 | self._add_random_values_if_real_na = add_random_values_if_real_na 84 | 85 | # Simply get all required objects at the time of object creation 86 | if KAGGLE_CONFIGS.get(self.key) and not load_pre_downloaded: 87 | self.download_kaggle_data() # download the data when an object is created 88 | self.load_data() 89 | self.preprocess() 90 | self.train_test_split() 91 | 92 | 93 | def _download_kaggle_data_from_competetions(self): 94 | file_name = KAGGLE_CONFIGS[self.key][_OWNER] 95 | kaggle.api.competition_download_files( 96 | competition = KAGGLE_CONFIGS[self.key][_OWNER], 97 | path = _DOWNLOAD_LOCATION 98 | ) 99 | return file_name 100 | 101 | def _download_kaggle_data_from_datasets_with_given_filename(self): 102 | file_name = KAGGLE_CONFIGS[self.key][_FILENAME] 103 | response = kaggle.api.datasets_download_file( 104 | owner_slug = KAGGLE_CONFIGS[self.key][_OWNER], 105 | dataset_slug = KAGGLE_CONFIGS[self.key][_DATASET], 106 | file_name = file_name, 107 | dataset_version_number=KAGGLE_CONFIGS[self.key][_VERSION], 108 | _preload_content = False, 109 | ) 110 | with open(os.path.join(_DOWNLOAD_LOCATION, file_name + '.zip'), 'wb') as f: 111 | f.write(response.data) 112 | return file_name 113 | 114 | def _download_kaggle_data_from_datasets_containing_single_file(self): 115 | file_name = KAGGLE_CONFIGS[self.key][_DATASET] 116 | 
kaggle.api.dataset_download_files( 117 | dataset = os.path.join(KAGGLE_CONFIGS[self.key][_OWNER], KAGGLE_CONFIGS[self.key][_DATASET]), 118 | path = _DOWNLOAD_LOCATION 119 | ) 120 | return file_name 121 | 122 | def download_kaggle_data(self): 123 | """ 124 | Download and extract the data from Kaggle. Puts the data in tmp directory within current directory. 125 | """ 126 | 127 | if not os.path.exists(_DOWNLOAD_LOCATION): 128 | os.mkdir(_DOWNLOAD_LOCATION) 129 | 130 | print('Data download location', _DOWNLOAD_LOCATION) 131 | 132 | 133 | if KAGGLE_CONFIGS[self.key][_TYPE] == _COMPETITIONS: 134 | file_name = self._download_kaggle_data_from_competetions() 135 | 136 | elif KAGGLE_CONFIGS[self.key][_TYPE] == _DATASETS: 137 | # If filename is given, download single file, 138 | # Else download all files. 139 | if KAGGLE_CONFIGS[self.key].get(_FILENAME): 140 | file_name = self._download_kaggle_data_from_datasets_with_given_filename() 141 | else: 142 | file_name = self._download_kaggle_data_from_datasets_containing_single_file() 143 | 144 | else: 145 | raise ValueError('Type should be among competitions or datasets in config') 146 | 147 | with zipfile.ZipFile(os.path.join(_DOWNLOAD_LOCATION, file_name + '.zip'), 'r') as zip_ref: 148 | zip_ref.extractall(_DOWNLOAD_LOCATION) 149 | 150 | def load_data(self): 151 | self.df = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION, KAGGLE_CONFIGS[self.key]['filename']), dtype='object') 152 | # delete downloaded data after loading in memory 153 | if self.delete_downloaded: shutil.rmtree(_DOWNLOAD_LOCATION) 154 | 155 | @property 156 | def timestamp_col(self): 157 | return self._timestamp_col # If timestamp not available, will create fake timestamps 158 | 159 | @property 160 | def label_col(self): 161 | if self._label_col is None: 162 | raise ValueError('Label column not specified') 163 | else: 164 | return self._label_col 165 | 166 | @property 167 | def event_id_col(self): 168 | return self._event_id_col # If event id not available, will create 
fake event ids 169 | 170 | @property 171 | def entity_id_col(self): 172 | return self._entity_id_col 173 | 174 | def standardize_timestamp_col(self): 175 | if self.timestamp_col is not None: 176 | self.df[_EVENT_TIMESTAMP] = pd.to_datetime(self.df[self.timestamp_col]).apply(lambda x: x.strftime(_TIMESTAMP_FORMAT)) 177 | self.df.drop(self.timestamp_col, axis=1, inplace=True) 178 | elif self.timestamp_col is None and self._add_random_values_if_real_na[_EVENT_TIMESTAMP]: 179 | self.df[_EVENT_TIMESTAMP] = self.df[_EVENT_LABEL].apply( 180 | lambda x: fake.date_time_between( 181 | start_date='-1y', # think about making it to fixed date. vs from now? 182 | end_date='now', 183 | tzinfo=None).strftime(_TIMESTAMP_FORMAT)) 184 | 185 | if self._label_timestamp_col is None and self._add_random_values_if_real_na[_LABEL_TIMESTAMP]: 186 | self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date 187 | elif self._label_timestamp_col is not None: 188 | self.df[_LABEL_TIMESTAMP] = pd.to_datetime(self.df[self._label_timestamp_col]).apply(lambda x: x.strftime(_TIMESTAMP_FORMAT)) 189 | self.df.drop(self._label_timestamp_col, axis=1, inplace=True) 190 | 191 | def standardize_label_col(self): 192 | self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True) 193 | self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].astype(int) 194 | 195 | def standardize_event_id_col(self): 196 | if self.event_id_col is not None: 197 | self.df.rename({self.event_id_col: _EVENT_ID}, axis=1, inplace=True) 198 | self.df[_EVENT_ID] = self.df[_EVENT_ID].astype(str) 199 | elif self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]: # add fake one if not exist 200 | self.df[_EVENT_ID] = self.df[_EVENT_LABEL].apply( 201 | lambda x: fake.uuid4()) 202 | 203 | 204 | def standardize_entity_id_col(self): 205 | if self.entity_id_col is not None: 206 | self.df.rename({self.entity_id_col: _ENTITY_ID}, axis=1, inplace=True) 207 | elif self.entity_id_col is None and 
self._add_random_values_if_real_na[_ENTITY_ID]: # add fake one if not exist 208 | self.df[_ENTITY_ID] = self.df[_EVENT_LABEL].apply( 209 | lambda x: fake.uuid4()) 210 | 211 | def rename_features(self): 212 | rename_map = {} # default is empty map that won't rename any columns 213 | self.df.rename(rename_map, axis=1, inplace=True) 214 | 215 | def subset_features(self): 216 | features_to_select = self.df.columns.tolist() 217 | self.df = self.df[features_to_select] # all by default 218 | 219 | def drop_features(self): 220 | self.df.drop(self.features_to_drop, axis=1, inplace=True) 221 | 222 | def add_meta_data(self): 223 | if self._add_random_values_if_real_na[_ENTITY_TYPE]: 224 | self.df[_ENTITY_TYPE] = 'user' 225 | 226 | def sort_by_timestamp(self): 227 | self.df.sort_values(by=_EVENT_TIMESTAMP, ascending=True, inplace=True) 228 | 229 | def lower_case_col_names(self): 230 | self.df.columns = [s.lower() for s in self.df.columns] 231 | 232 | def preprocess(self): 233 | self.lower_case_col_names() 234 | self.standardize_label_col() 235 | self.standardize_event_id_col() 236 | self.standardize_entity_id_col() 237 | self.standardize_timestamp_col() 238 | self.add_meta_data() 239 | self.rename_features() 240 | self.subset_features() 241 | self.drop_features() 242 | if self.timestamp_col: 243 | self.sort_by_timestamp() 244 | 245 | def train_test_split(self): 246 | """ 247 | Default setting is out of time with 80%-20% into training and testing respectively 248 | """ 249 | if self.timestamp_col: 250 | split_pt = int(self.df.shape[0]*self.train_percentage) 251 | self.train = self.df.copy().iloc[:split_pt, :] 252 | self.test = self.df.copy().iloc[split_pt:, :] 253 | else: # random if no timestamp col available 254 | self.train = self.df.sample(frac=self.train_percentage, random_state=_RANDOM_STATE) 255 | self.test = self.df.copy()[~self.df.index.isin(self.train.index)] 256 | self.test.reset_index(drop=True, inplace=True) 257 | 258 | self.test_labels = self.test[[_EVENT_LABEL]] 
259 | if self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]: 260 | self.test_labels[_EVENT_ID] = self.test[_EVENT_ID] 261 | self.test.drop([_EVENT_LABEL, _LABEL_TIMESTAMP], axis=1, inplace=True, errors="ignore") 262 | 263 | 264 | class FakejobPreProcessor(BasePreProcessor): 265 | def __init__(self, **kw): 266 | super(FakejobPreProcessor, self).__init__(**kw) 267 | 268 | 269 | class VehicleloanPreProcessor(BasePreProcessor): 270 | def __init__(self, **kw): 271 | super(VehicleloanPreProcessor, self).__init__(**kw) 272 | 273 | 274 | class MalurlPreProcessor(BasePreProcessor): 275 | """ 276 | This dataset originally has multiple classes for malignant URLs. 277 | We combine all malignant classes into one class to keep the benchmark binary for now 278 | 279 | """ 280 | def __init__(self, **kw): 281 | super(MalurlPreProcessor, self).__init__(**kw) 282 | 283 | def standardize_label_col(self): 284 | self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True) 285 | binary_mapper = { 286 | 'defacement': 1, 287 | 'phishing': 1, 288 | 'malware': 1, 289 | 'benign': 0 290 | } 291 | 292 | self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].map(binary_mapper) 293 | 294 | def add_dummy_col(self): 295 | self.df['dummy_cat'] = self.df[_EVENT_LABEL].apply(lambda x: fake.uuid4()) 296 | 297 | def preprocess(self): 298 | super(MalurlPreProcessor, self).preprocess() 299 | self.add_dummy_col() 300 | 301 | class IEEEPreProcessor(BasePreProcessor): 302 | """ 303 | Some pre-processing was done using kaggle kernels below. 
304 | 305 | References: 306 | Data Source: https://www.kaggle.com/c/ieee-fraud-detection/data 307 | 308 | Some processing from: https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600 309 | Feature selection to reduce to 100: https://www.kaggle.com/code/pavelvpster/ieee-fraud-feature-selection-rfecv/notebook 310 | 311 | """ 312 | def __init__(self, **kw): 313 | super(IEEEPreProcessor, self).__init__(**kw) 314 | 315 | @staticmethod 316 | def _dtypes_cols(): 317 | 318 | # FIRST 53 COLUMNS 319 | cols = ['TransactionID', 'TransactionDT', 'TransactionAmt', 320 | 'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6', 321 | 'addr1', 'addr2', 'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain', 322 | 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 323 | 'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 324 | 'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'M1', 'M2', 'M3', 'M4', 325 | 'M5', 'M6', 'M7', 'M8', 'M9'] 326 | 327 | # V COLUMNS TO LOAD DECIDED BY CORRELATION EDA 328 | # https://www.kaggle.com/cdeotte/eda-for-columns-v-and-id 329 | v = [1, 3, 4, 6, 8, 11] 330 | v += [13, 14, 17, 20, 23, 26, 27, 30] 331 | v += [36, 37, 40, 41, 44, 47, 48] 332 | v += [54, 56, 59, 62, 65, 67, 68, 70] 333 | v += [76, 78, 80, 82, 86, 88, 89, 91] 334 | 335 | #v += [96, 98, 99, 104] #relates to groups, no NAN 336 | v += [107, 108, 111, 115, 117, 120, 121, 123] # maybe group, no NAN 337 | v += [124, 127, 129, 130, 136] # relates to groups, no NAN 338 | 339 | # LOTS OF NAN BELOW 340 | v += [138, 139, 142, 147, 156, 162] #b1 341 | v += [165, 160, 166] #b1 342 | v += [178, 176, 173, 182] #b2 343 | v += [187, 203, 205, 207, 215] #b2 344 | v += [169, 171, 175, 180, 185, 188, 198, 210, 209] #b2 345 | v += [218, 223, 224, 226, 228, 229, 235] #b3 346 | v += [240, 258, 257, 253, 252, 260, 261] #b3 347 | v += [264, 266, 267, 274, 277] #b3 348 | v += [220, 221, 234, 238, 250, 271] #b3 349 | 350 | v += [294, 284, 285, 286, 291, 297] # relates to 
groups, no NAN 351 | v += [303, 305, 307, 309, 310, 320] # relates to groups, no NAN 352 | v += [281, 283, 289, 296, 301, 314] # relates to groups, no NAN 353 | 354 | # COLUMNS WITH STRINGS 355 | str_type = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'R_emaildomain','M1', 'M2', 'M3', 'M4','M5', 356 | 'M6', 'M7', 'M8', 'M9', 'id_12', 'id_15', 'id_16', 'id_23', 'id_27', 'id_28', 'id_29', 'id_30', 357 | 'id_31', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38', 'DeviceType', 'DeviceInfo'] 358 | str_type += ['id-12', 'id-15', 'id-16', 'id-23', 'id-27', 'id-28', 'id-29', 'id-30', 359 | 'id-31', 'id-33', 'id-34', 'id-35', 'id-36', 'id-37', 'id-38'] 360 | 361 | 362 | cols += ['V'+str(x) for x in v] 363 | dtypes = {} 364 | for c in cols+['id_0'+str(x) for x in range(1,10)]+['id_'+str(x) for x in range(10,34)]+\ 365 | ['id-0'+str(x) for x in range(1,10)]+['id-'+str(x) for x in range(10,34)]: 366 | dtypes[c] = 'float32' 367 | for c in str_type: dtypes[c] = 'category' 368 | 369 | return dtypes, cols 370 | 371 | 372 | def load_data(self): 373 | """ 374 | Hard coded file names for this dataset as it contains multiple files to be combined 375 | """ 376 | 377 | dtypes, cols = IEEEPreProcessor._dtypes_cols() 378 | 379 | self.df = pd.read_csv( 380 | os.path.join(_DOWNLOAD_LOCATION, 381 | 'train_transaction.csv'), 382 | index_col='TransactionID', 383 | dtype=dtypes, 384 | usecols=cols+['isFraud']) 385 | 386 | self.df_id = pd.read_csv( 387 | os.path.join(_DOWNLOAD_LOCATION, 388 | 'train_identity.csv'), 389 | index_col='TransactionID', 390 | dtype=dtypes) 391 | self.df = self.df.merge(self.df_id, how='left', left_index=True, right_index=True) 392 | 393 | # delete downloaded data after loading in memory 394 | if self.delete_downloaded: shutil.rmtree(_DOWNLOAD_LOCATION) 395 | 396 | def normalization(self): 397 | # NORMALIZE D COLUMNS 398 | for i in range(1,16): 399 | if i in [1,2,3,5,9]: continue 400 | self.df['d'+str(i)] = self.df['d'+str(i)] - 
self.df[self.timestamp_col]/np.float32(24*60*60) 401 | 402 | def standardize_entity_id_col(self): 403 | def _encode_CB(col1, col2, df): 404 | nm = col1+'_'+col2 405 | df[nm] = df[col1].astype(str)+'_'+df[col2].astype(str) 406 | 407 | _encode_CB('card1', 'addr1', self.df) 408 | self.df['day'] = self.df[self.timestamp_col] / (24*60*60) 409 | self.df[_ENTITY_ID] = self.df['card1_addr1'].astype(str) + '_' + np.floor(self.df['day'] - self.df['d1']).astype(str) 410 | 411 | @staticmethod 412 | def _add_seconds(x): 413 | init_time = '2021-01-01T00:00:00Z' 414 | dt_format = _TIMESTAMP_FORMAT 415 | init_time = datetime.strptime(init_time, dt_format) # start date from last 18 months 416 | final_time = init_time + timedelta(seconds=x) 417 | return final_time.strftime(_TIMESTAMP_FORMAT) 418 | 419 | def standardize_timestamp_col(self): 420 | self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: IEEEPreProcessor._add_seconds(x)) 421 | self.df.drop(self.timestamp_col, axis=1, inplace=True) 422 | if self._add_random_values_if_real_na["LABEL_TIMESTAMP"]: 423 | self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date 424 | 425 | def subset_features(self): 426 | features_to_select = \ 427 | ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1', 428 | 'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11', 429 | 'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160', 430 | 'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264', 431 | 'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02', 432 | 'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo', 433 | 'EVENT_TIMESTAMP', 'ENTITY_ID', 'ENTITY_TYPE', 'EVENT_ID', 'EVENT_LABEL', 'LABEL_TIMESTAMP'] 434 | self.df = self.df.loc[:, self.df.columns.isin(features_to_select)] 435 
| 436 | def preprocess(self): 437 | self.lower_case_col_names() 438 | self.normalization() # normalize D columns 439 | self.standardize_label_col() 440 | self.standardize_event_id_col() 441 | self.standardize_entity_id_col() 442 | self.standardize_timestamp_col() 443 | self.add_meta_data() 444 | self.rename_features() 445 | self.subset_features() 446 | if self.timestamp_col: 447 | self.sort_by_timestamp() 448 | 449 | 450 | class CCFraudPreProcessor(BasePreProcessor): 451 | def __init__(self, **kw): 452 | super(CCFraudPreProcessor, self).__init__(**kw) 453 | 454 | @staticmethod 455 | def _add_minutes(x): 456 | dt_format = _TIMESTAMP_FORMAT 457 | init_time = datetime.strptime('2021-09-01T00:00:00Z', dt_format) # chose randomly but in last 18 months 458 | final_time = init_time + timedelta(minutes=x) 459 | return final_time.strftime(_TIMESTAMP_FORMAT) 460 | 461 | def standardize_timestamp_col(self): 462 | self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].astype(float).apply(lambda x: CCFraudPreProcessor._add_minutes(x)) 463 | self.df.drop(self.timestamp_col, axis=1, inplace=True) 464 | if self._add_random_values_if_real_na[_LABEL_TIMESTAMP]: 465 | self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date 466 | 467 | class FraudecomPreProcessor(BasePreProcessor): 468 | def __init__(self, ip_address_col, signup_time_col, **kw): 469 | self.ip_address_col = ip_address_col 470 | self.signup_time_col = signup_time_col 471 | super(FraudecomPreProcessor, self).__init__(**kw) 472 | 473 | @staticmethod 474 | def _add_years(init_time): 475 | dt_format = '%Y-%m-%d %H:%M:%S' 476 | init_time = datetime.strptime(init_time, dt_format) 477 | final_time = init_time + relativedelta(years=6) # move to more recent time range 478 | return final_time.strftime(_TIMESTAMP_FORMAT) 479 | 480 | 481 | def standardize_timestamp_col(self): 482 | 483 | self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: FraudecomPreProcessor._add_years(x)) 484 | 
self.df.drop(self.timestamp_col, axis=1, inplace=True) 485 | 486 | # Also add _LABEL_TIMESTAMP to allow training of this dataset with TFI 487 | if self._add_random_values_if_real_na[_LABEL_TIMESTAMP]: 488 | self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date 489 | 490 | def process_ip(self): 491 | """ 492 | This dataset has ip address as a feature, but needs to be converted into standard IPV4. 493 | """ 494 | self.df[self.ip_address_col] = self.df[self.ip_address_col].astype(float).astype(int).\ 495 | apply(lambda x: socket.inet_ntoa(struct.pack('!L', x))) 496 | 497 | def create_time_since_signup(self): 498 | self.df['time_since_signup'] = ( 499 | pd.to_datetime(self.df[self.timestamp_col]) -\ 500 | pd.to_datetime(self.df[self.signup_time_col])).dt.seconds 501 | 502 | def preprocess(self): 503 | self.lower_case_col_names() 504 | self.standardize_label_col() 505 | self.standardize_event_id_col() 506 | self.standardize_entity_id_col() 507 | self.create_time_since_signup() # One manually engineered feature 508 | self.standardize_timestamp_col() 509 | self.add_meta_data() 510 | self.process_ip() # This extra step added 511 | self.rename_features() 512 | self.drop_features() # Replace select with drop 513 | if self.timestamp_col: 514 | self.sort_by_timestamp() 515 | 516 | 517 | class SparknovPreProcessor(BasePreProcessor): 518 | def __init__(self, **kw): 519 | super(SparknovPreProcessor, self).__init__(**kw) 520 | 521 | def load_data(self): 522 | """ 523 | Hard coded file names for this dataset as it contains multiple files to be combined 524 | """ 525 | 526 | df_train = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION,'fraudTrain.csv')) 527 | df_train['seg'] = 'train' 528 | 529 | df_test = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION,'fraudTest.csv')) 530 | df_test['seg'] = 'test' 531 | 532 | self.df = pd.concat([df_train, df_test], ignore_index=True) 533 | 534 | # delete downloaded data after loading in memory 535 | if self.delete_downloaded: 
shutil.rmtree(_DOWNLOAD_LOCATION) 536 | 537 | @staticmethod 538 | def _add_months(x): 539 | _TIMESTAMP_FORMAT_SPARKNOV = '%Y-%m-%d %H:%M:%S' 540 | 541 | x = datetime.strptime(x, _TIMESTAMP_FORMAT_SPARKNOV) 542 | final_time = x + relativedelta(months=20) # chosen to move dates close to now() 543 | return final_time.strftime(_TIMESTAMP_FORMAT) 544 | 545 | def standardize_timestamp_col(self): 546 | 547 | self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: SparknovPreProcessor._add_months(x)) 548 | self.df.drop(self.timestamp_col, axis=1, inplace=True) 549 | self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date 550 | 551 | def standardize_entity_id_col(self): 552 | 553 | self.df.rename({self.entity_id_col: _ENTITY_ID}, axis=1, inplace=True) 554 | self.df[_ENTITY_ID] = self.df[_ENTITY_ID].\ 555 | str.lower().\ 556 | apply(lambda x: re.sub(r'[^A-Za-z0-9]+', '_', x)) 557 | 558 | def train_test_split(self): 559 | self.train = self.df.copy()[self.df['seg'] == 'train'] 560 | self.train.reset_index(drop=True, inplace=True) 561 | self.train.drop(['seg'], axis=1, inplace=True) 562 | 563 | self.test = self.df.copy()[self.df['seg'] == 'test'] 564 | self.test.reset_index(drop=True, inplace=True) 565 | self.test.drop(['seg'], axis=1, inplace=True) 566 | self.test = self.test.sample(n=20000, random_state=1) 567 | 568 | self.test_labels = self.test[[_EVENT_LABEL]] 569 | if self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]: 570 | self.test_labels[_EVENT_ID] = self.test[_EVENT_ID] 571 | self.test.drop([_EVENT_LABEL, _LABEL_TIMESTAMP], axis=1, inplace=True, errors="ignore") 572 | 573 | 574 | class TwitterbotPreProcessor(BasePreProcessor): 575 | def __init__(self, **kw): 576 | super(TwitterbotPreProcessor, self).__init__(**kw) 577 | 578 | def standardize_label_col(self): 579 | self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True) 580 | binary_mapper = { 581 | 'bot': 1, 582 | 'human': 0 583 | } 584 | 585 | 
self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].map(binary_mapper) 586 | 587 | 588 | class IPBlocklistPreProcessor(BasePreProcessor): 589 | """ 590 | The dataset source is http://cinsscore.com/list/ci-badguys.txt. 591 | No sign-in or sign-up is required to download or access the latest version of this dataset. 592 | 593 | Since this dataset is not version controlled at the source, we added the version of the dataset we used for the experiments 594 | discussed in the paper. The versioned dataset is as of 2022-06-07. 595 | The code is set to pick the fixed version. To use the latest version instead, 596 | the 'version' argument needs to be turned off (i.e. set to None) 597 | """ 598 | def __init__(self, version, **kw): 599 | self.version = version # string or None. If string, picks one from versioned_datasets, else creates one from source 600 | super(IPBlocklistPreProcessor, self).__init__(**kw) 601 | 602 | def load_data(self): 603 | if self.version is None: 604 | # load malicious IPs from the source 605 | _URL = 'http://cinsscore.com/list/ci-badguys.txt' # contains confirmed malicious IPs 606 | _N_BENIGN = 200000 607 | 608 | res = requests.get(_URL) 609 | ip_mal = pd.read_csv(StringIO(res.text), sep='\n', names=['ip'], header=None) 610 | ip_mal['is_ip_malign'] = 1 611 | 612 | # add fake IPs as benign 613 | ip_ben = pd.DataFrame({ 614 | 'ip': [fake.ipv4() for i in range(_N_BENIGN)], 615 | 'is_ip_malign': 0 616 | }) 617 | 618 | self.df = pd.concat([ip_mal, ip_ben], axis=0, ignore_index=True) 619 | else: 620 | 621 | _VERSIONED_DATA_PATH = f'versioned_datasets/{self.key}/{self.version}.zip' 622 | data = pkgutil.get_data(__name__, _VERSIONED_DATA_PATH) 623 | with zipfile.ZipFile(BytesIO(data)) as f: 624 | self.train = pd.read_csv(f.open('train.csv')) 625 | self.test = pd.read_csv(f.open('test.csv')) 626 | self.test_labels = pd.read_csv(f.open('test_labels.csv')) 627 | 628 | def add_dummy_col(self): 629 | self.df['dummy_cat'] = 
self.df[_EVENT_LABEL].apply(lambda x: fake.uuid4()) 630 | 631 | def train_test_split(self): 632 | if self.version is None: 633 | super(IPBlocklistPreProcessor, self).train_test_split() 634 | 635 | def preprocess(self): 636 | if self.version is None: 637 | super(IPBlocklistPreProcessor, self).preprocess() 638 | self.add_dummy_col() 639 | -------------------------------------------------------------------------------- /src/fdb/preprocessing_objects.py: -------------------------------------------------------------------------------- 1 | from fdb.preprocessing import * 2 | 3 | 4 | def load_data(key, load_pre_downloaded, delete_downloaded, add_random_values_if_real_na): 5 | common_kw = { 6 | "key": key, 7 | "load_pre_downloaded": load_pre_downloaded, 8 | "delete_downloaded": delete_downloaded, 9 | "add_random_values_if_real_na": add_random_values_if_real_na 10 | } 11 | 12 | if key == 'fakejob': 13 | obj = FakejobPreProcessor( 14 | train_percentage = 0.8, 15 | timestamp_col = None, 16 | label_col = 'fraudulent', 17 | event_id_col = 'job_id', 18 | **common_kw 19 | ) 20 | 21 | elif key == 'vehicleloan': 22 | obj = VehicleloanPreProcessor( 23 | train_percentage = 0.8, 24 | timestamp_col = None, 25 | label_col = 'loan_default', 26 | event_id_col = 'uniqueid', 27 | features_to_drop = ['disbursal_date'], 28 | **common_kw 29 | ) 30 | 31 | elif key == 'malurl': 32 | obj = MalurlPreProcessor( 33 | train_percentage = 0.9, 34 | timestamp_col = None, 35 | label_col = 'type', 36 | event_id_col = None, 37 | **common_kw 38 | ) 39 | 40 | elif key == 'ieeecis': 41 | obj = IEEEPreProcessor( 42 | train_percentage = 0.95, 43 | timestamp_col = 'transactiondt', 44 | label_col = 'isfraud', 45 | event_id_col = None, 46 | entity_id_col = None, # manually created in code 47 | **common_kw 48 | ) 49 | 50 | elif key == 'ccfraud': 51 | obj = CCFraudPreProcessor( 52 | train_percentage = 0.8, 53 | timestamp_col = 'time', 54 | label_col = 'class', 55 | event_id_col = None, 56 | **common_kw 57 | ) 58 | 
    elif key == 'fraudecom':
        obj = FraudecomPreProcessor(
            train_percentage = 0.8,
            timestamp_col = 'purchase_time',
            signup_time_col = 'signup_time',
            label_col = 'class',
            event_id_col = 'user_id',
            entity_id_col = 'device_id',
            ip_address_col = 'ip_address',
            features_to_drop = ['signup_time', 'sex'],
            **common_kw
        )

    elif key == 'sparknov':
        obj = SparknovPreProcessor(
            timestamp_col = 'trans_date_trans_time',
            label_col = 'is_fraud',
            event_id_col = 'trans_num',
            entity_id_col = 'merchant',
            features_to_drop = ['unix_time', 'unnamed: 0'],
            **common_kw
        )

    elif key == 'twitterbot':
        obj = TwitterbotPreProcessor(
            train_percentage = 0.8,
            timestamp_col = None,
            label_col = 'account_type',
            event_id_col = 'id',
            **common_kw
        )

    elif key == 'ipblock':
        obj = IPBlocklistPreProcessor(
            label_col = 'is_ip_malign',
            version = '20220607',
            **common_kw
        )

    else:
        raise ValueError('Invalid key')

    return obj
--------------------------------------------------------------------------------
/src/fdb/versioned_datasets/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/src/fdb/versioned_datasets/__init__.py
--------------------------------------------------------------------------------
/src/fdb/versioned_datasets/ipblock/20220607.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/src/fdb/versioned_datasets/ipblock/20220607.zip
--------------------------------------------------------------------------------
/src/fdb/versioned_datasets/ipblock/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/src/fdb/versioned_datasets/ipblock/__init__.py
--------------------------------------------------------------------------------
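The `version=None` path of `IPBlocklistPreProcessor.load_data` labels every address on the downloaded blocklist as malicious and pads the data with generated addresses labeled benign. The following is a minimal, standard-library-only sketch of that labeling idea, not the repository's implementation. The function name `build_ip_rows`, the `seed` parameter, and the sample text are hypothetical; random 32-bit addresses stand in for the `fake.ipv4()` generator used in the source; and skipping invalid lines and excluding collisions with listed addresses are assumptions of this sketch that the original code does not make.

```python
import ipaddress
import random


def build_ip_rows(blocklist_text, n_benign, seed=0):
    """Return (ip, is_ip_malign) rows: listed IPs get 1, generated IPs get 0.

    Hypothetical helper for illustration only; it is not part of fdb.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible

    # The blocklist has one address per line. Skip blank lines and anything
    # that does not parse as an IPv4 address (an assumption of this sketch).
    malicious = []
    for line in blocklist_text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue
        try:
            ipaddress.IPv4Address(candidate)
        except ipaddress.AddressValueError:
            continue
        malicious.append(candidate)

    rows = [(ip, 1) for ip in malicious]

    # Generate benign addresses, never reusing an address already present,
    # so no IP ends up with both labels.
    known = set(malicious)
    added = 0
    while added < n_benign:
        ip = str(ipaddress.IPv4Address(rng.getrandbits(32)))
        if ip in known:
            continue
        known.add(ip)
        rows.append((ip, 0))
        added += 1
    return rows


# Documentation-range addresses (192.0.2.0/24) used as stand-in blocklist content.
sample_text = "192.0.2.10\n192.0.2.11\n\nnot-an-ip\n"
rows = build_ip_rows(sample_text, n_benign=3)
```

Splitting the text with `splitlines()` avoids depending on a CSV parser for a file that has no delimiter, and the resulting rows can be handed to `pd.DataFrame(rows, columns=['ip', 'is_ip_malign'])` if a DataFrame is needed.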