├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── images
│   ├── afd_steps.png
│   ├── all_fdb.png
│   └── ieee_ofi_sample.png
├── scripts
│   ├── examples
│   │   └── Test_FDB_Loader.ipynb
│   └── reproducibility
│       ├── afd
│       │   ├── README.md
│       │   ├── configs
│       │   │   ├── CreditCardFraudDetection.json
│       │   │   ├── FakeJobPostingPrediction.json
│       │   │   ├── Fraudecommerce.json
│       │   │   ├── IEEECISFraudDetection.json
│       │   │   ├── IPBlocklist.json
│       │   │   ├── MaliciousURL.json
│       │   │   ├── SimulatedCreditCardTransactionsSparkov.json
│       │   │   ├── TwitterBotAccounts.json
│       │   │   └── VehicleLoanDefaultPrediction.json
│       │   ├── create_afd_resources.py
│       │   └── score_afd_model.py
│       ├── autogluon
│       │   ├── README.md
│       │   ├── benchmark_ag.py
│       │   └── example-ag-ieeecis.ipynb
│       ├── autosklearn
│       │   ├── README.md
│       │   └── benchmark_autosklearn.py
│       ├── benchmark_utils.py
│       ├── h2o
│       │   ├── README.md
│       │   ├── benchmark_h2o.py
│       │   └── example-h2o-ieeecis.ipynb
│       └── label-noise
│           ├── benchmark_experiments.ipynb
│           ├── feature_dict.py
│           ├── load_fdb_datasets.py
│           └── micro_models.py
├── setup.py
└── src
    ├── __init__.py
    └── fdb
        ├── __init__.py
        ├── datasets.py
        ├── kaggle_configs.py
        ├── preprocessing.py
        ├── preprocessing_objects.py
        └── versioned_datasets
            ├── __init__.py
            └── ipblock
                ├── 20220607.zip
                └── __init__.py
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 |
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 |
9 |
10 | ## Reporting Bugs/Feature Requests
11 |
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 |
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 |
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 |
22 |
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 |
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 |
30 | To send us a pull request, please:
31 |
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 |
39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 |
42 |
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 |
46 |
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 |
52 |
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
55 |
56 |
57 | ## Licensing
58 |
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021-2022 Prince Grover
4 | Copyright (c) 2021-2022 Zheng Li
5 | Copyright (c) 2022 Jianbo Liu
6 | Copyright (c) 2022 Jakub Zablocki
8 | Copyright (c) 2022 Hao Zhou
9 | Copyright (c) 2022 Julia Xu
10 | Copyright (c) 2022 Anqi Cheng
11 |
12 | Permission is hereby granted, free of charge, to any person obtaining a copy
13 | of this software and associated documentation files (the "Software"), to deal
14 | in the Software without restriction, including without limitation the rights
15 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
16 | copies of the Software, and to permit persons to whom the Software is
17 | furnished to do so, subject to the following conditions:
18 |
19 | The above copyright notice and this permission notice shall be included in all
20 | copies or substantial portions of the Software.
21 |
22 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
23 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
24 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
25 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
26 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
27 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
28 | SOFTWARE.
29 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # FDB: Fraud Dataset Benchmark
2 |
3 | *By [Prince Grover](https://github.com/groverpr), [Zheng Li](https://github.com/zhengli0817), [Julia Xu](https://github.com/SheliaXin), [Justin Tittelfitz](https://github.com/jtittelfitz), Anqi Cheng, [Jakub Zablocki](https://github.com/qbaza), Jianbo Liu, and [Hao Zhou](https://github.com/haozhouamzn)*
4 |
5 |
6 | [](https://www.python.org/) [](https://opensource.org/licenses/MIT)
7 |
8 |
9 | The **Fraud Dataset Benchmark (FDB)** is a compilation of publicly available datasets relevant to **fraud detection** ([arXiv Link](https://arxiv.org/abs/2208.14417)). The FDB aims to cover a wide variety of fraud detection tasks, ranging from card-not-present transaction fraud and bot attacks to malicious traffic, loan risk, and content moderation. The Python-based data loaders from FDB provide dataset loading, standardized train-test splits, and performance evaluation metrics. The goal of our work is to provide researchers working in the field of fraud and abuse detection a standardized set of benchmarking datasets and evaluation tools for their experiments. Using FDB tools, we demonstrate several applications of FDB that are of broad interest for fraud detection, including feature engineering, comparison of supervised learning algorithms, label noise removal, class-imbalance treatment, and semi-supervised learning.
10 |
11 |
12 | ## Datasets used in FDB
13 | Brief summary of the datasets used in FDB. Each dataset is described in detail in [data source section](#data-sources).
14 |
15 | | **#** | **Dataset name** | **Dataset key** | **Fraud category** | **#Train** | **#Test** | **Class ratio (train)** | **#Feats** | **#Cat** | **#Num** | **#Text** | **#Enrichable** |
16 | |-------|------------------------------------------------------------|-----------------|-------------------------------------|------------|-----------|-------------------------|------------|----------|----------|-----------|-----------------|
17 | | 1 | IEEE-CIS Fraud Detection | ieeecis | Card Not Present Transactions Fraud | 561,013 | 28,527 | 3.50% | 67 | 6 | 61 | 0 | 0 |
18 | | 2 | Credit Card Fraud Detection | ccfraud | Card Not Present Transactions Fraud | 227,845 | 56,962 | 0.18% | 28 | 0 | 28 | 0 | 0 |
19 | | 3 | Fraud ecommerce | fraudecom | Card Not Present Transactions Fraud | 120,889 | 30,223 | 10.60% | 6 | 2 | 3 | 0 | 1 |
20 | | 4 | Simulated Credit Card Transactions generated using Sparkov | sparknov | Card Not Present Transactions Fraud | 1,296,675 | 20,000 | 5.70% | 17 | 10 | 6 | 1 | 0 |
21 | | 5 | Twitter Bots Accounts | twitterbot | Bot Attacks | 29,950 | 7,488 | 33.10% | 16 | 6 | 6 | 4 | 0 |
22 | | 6 | Malicious URLs dataset | malurl | Malicious Traffic | 586,072 | 65,119 | 34.20% | 2 | 0 | 1 | 1 | 0 |
23 | | 7 | Fake Job Posting Prediction | fakejob | Content Moderation | 14,304 | 3,576 | 4.70% | 16 | 10 | 1 | 5 | 0 |
24 | | 8 | Vehicle Loan Default Prediction | vehicleloan | Credit Risk | 186,523 | 46,631 | 21.60% | 38 | 13 | 22 | 3 | 0 |
25 | | 9 | IP Blocklist | ipblock | Malicious Traffic | 172,000 | 43,000 | 7% | 1 | 0 | 0 | 0 | 1 |
26 |
27 |
28 | ## Installation
29 |
30 | ### Requirements
31 | - Kaggle account
32 |   - **Important**: the `ieeecis` dataset requires you to [**join the IEEE-CIS competition**](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you can call the fdb API. Otherwise you will get `ApiException: (403)`.
33 | - AWS account
34 | - Python 3.7+
35 |
36 | - Python requirements
37 | ```
38 | autogluon==0.4.2
39 | h2o==3.36.1.2
40 | boto3==1.20.21
41 | click==8.0.3
42 | click-plugins==1.1.1
43 | Faker==4.14.2
44 | joblib==1.0.0
45 | kaggle==1.5.12
46 | numpy==1.19.5
47 | pandas==1.1.2
48 | regex==2020.7.14
49 | scikit-learn==0.22.1
50 | scipy==1.5.4
51 | auto-sklearn==0.14.7
52 | dask==2022.8.1
53 | ```
54 |
55 | ### Step 1: Setup Kaggle CLI
56 | The `FraudDatasetBenchmark` object loads datasets from the source (in most cases Kaggle), modifies/standardizes them on the fly, and provides train-test splits. So, the first step is to set up the Kaggle CLI on the machine being used to run Python.
57 |
58 | Use the instructions from the [How to Use Kaggle](https://www.kaggle.com/docs/api) guide.
59 |
60 | Remember to download the authentication token from "My Account" on Kaggle, and save the token at `~/.kaggle/kaggle.json` on Linux/macOS and at `C:\Users\<username>\.kaggle\kaggle.json` on Windows. If the token is not there, an error will be raised. Hence, once you've downloaded the token, move it from your Downloads folder to this folder.
61 |
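As a quick sanity check (a minimal sketch, not part of the fdb API), you can verify the token sits where the Kaggle CLI expects it before calling the loader:

```python
from pathlib import Path

def expected_kaggle_token_path(home: Path) -> Path:
    # The Kaggle CLI looks for the API token under <home>/.kaggle/kaggle.json
    # (on Windows, <home> is your user profile folder).
    return Path(home) / ".kaggle" / "kaggle.json"

def token_is_ready(home: Path = Path.home()) -> bool:
    """True if the Kaggle API token file exists for this user."""
    return expected_kaggle_token_path(home).exists()
```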
62 |
63 | #### Step 1.2: [Join the IEEE-CIS competition](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you can call `fdb.datasets` with `ieeecis`. Otherwise you will get `ApiException: (403)`.
64 |
65 |
66 | ### Step 2: Clone Repo
67 | Once the Kaggle CLI is set up, clone the GitHub repo using `git clone https://github.com/amazon-research/fraud-dataset-benchmark.git` if using HTTPS, or `git clone git@github.com:amazon-research/fraud-dataset-benchmark.git` if using SSH.
68 |
69 | ### Step 3: Install
70 | Once the repo is cloned, `cd` into it from your terminal and run `pip install .`, which will install the required classes and methods.
71 |
72 |
73 | ## FraudDatasetBenchmark Usage
74 | The usage is straightforward: create a `dataset` object of the `FraudDatasetBenchmark` class and extract useful goodies like train/test splits and evaluation metrics.
75 |
76 | **Important note**: If you are running multiple experiments that re-load dataframes many times, the default setting of downloading from Kaggle before every load can exceed the account-level API limits. In that case, persist the downloaded dataset and load from the persisted copy: during the first call of `FraudDatasetBenchmark()`, use `load_pre_downloaded=False, delete_downloaded=False`, and for subsequent calls use `load_pre_downloaded=True, delete_downloaded=False`. The default setting is
77 | `load_pre_downloaded=False, delete_downloaded=True`.
78 | ```
79 | from fdb.datasets import FraudDatasetBenchmark
80 |
81 | # all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud', 'fraudecom', 'twitterbot', 'ipblock']
82 | key = 'ipblock'
83 |
84 | obj = FraudDatasetBenchmark(
85 | key=key,
86 | load_pre_downloaded=False, # default
87 | delete_downloaded=True, # default
88 |     add_random_values_if_real_na = {
89 |         "EVENT_TIMESTAMP": True,
90 |         "LABEL_TIMESTAMP": True,
91 |         "ENTITY_ID": True,
92 |         "ENTITY_TYPE": True,
94 |         "EVENT_ID": True
95 |     } # default
96 | )
97 | print(obj.key)
98 |
99 | print('Train set: ')
100 | display(obj.train.head())
101 | print(len(obj.train.columns))
102 | print(obj.train.shape)
103 |
104 | print('Test set: ')
105 | display(obj.test.head())
106 | print(obj.test.shape)
107 |
108 | print('Test scores')
109 | display(obj.test_labels.head())
110 | print(obj.test_labels['EVENT_LABEL'].value_counts())
111 | print(obj.train['EVENT_LABEL'].value_counts(normalize=True))
112 | print('=========')
113 |
114 | ```
115 | A notebook template to load datasets using the FDB data loader is available at [scripts/examples/Test_FDB_Loader.ipynb](scripts/examples/Test_FDB_Loader.ipynb).
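To evaluate any model's scores against `obj.test_labels`, the benchmark uses AUC-ROC; a minimal, dependency-free sketch of that rank statistic (in practice `sklearn.metrics.roc_auc_score` computes the same value; the labels and scores below are placeholders):

```python
def auc_roc(y_true, y_score):
    """Probability that a random positive outscores a random negative
    (ties count half) -- i.e. the AUC-ROC rank statistic."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Placeholder example; in practice y_true comes from
# obj.test_labels['EVENT_LABEL'] and y_score from your model.
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```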
116 |
117 | ## Reproducibility
118 | Reproducibility scripts are available at [scripts/reproducibility/](scripts/reproducibility/) in respective folders for [afd](scripts/reproducibility/afd), [autogluon](scripts/reproducibility/autogluon) and [h2o](scripts/reproducibility/h2o). Each folder also has a README with steps to reproduce.
119 |
120 |
121 | ## Benchmark Results
122 |
123 |
135 |
136 | | **Dataset key** | **AUC-ROC** | | | | |
137 | |:---------------:|:-----------:|:-----------:|:-------------:|:----------------:|:----------------:|
138 | | | **AFD OFI** | **AFD TFI** | **AutoGluon** | **H2O** | **Auto-sklearn** |
139 | | ccfraud | 0.985 | 0.99 | 0.99 | **0.992** | 0.988 |
140 | | fakejob | 0.987 | - | **0.998** | 0.99 | 0.983 |
141 | | fraudecom | 0.519 | **0.636** | 0.522 | 0.518 | 0.515 |
142 | | ieeecis | 0.938 | **0.94** | 0.855 | 0.89 | 0.932 |
143 | | malurl | 0.985 | - | **0.998** | Training failure | 0.5 |
144 | | sparknov | **0.998** | - | 0.997 | 0.997 | 0.995 |
145 | | twitterbot | 0.934 | - | **0.943** | 0.938 | 0.936 |
146 | | vehicleloan | **0.673** | - | 0.669 | 0.67 | 0.664 |
147 | | ipblock | **0.937** | - | 0.804 | Training failure | 0.5 |
148 |
149 | ### ROC Curves
150 |
151 | The numbers in the legend represent the AUC-ROC of different models from our baseline AutoML evaluations.
152 | 
153 |
154 |
155 | ## Data Sources
156 |
157 |
158 | 1. **IEEE-CIS Fraud Detection**
159 | - Source URL: https://www.kaggle.com/c/ieee-fraud-detection/overview
160 | - Source license: https://www.kaggle.com/competitions/ieee-fraud-detection/rules
161 |     - Variables: Anonymized product, card, address, email domain, device, and transaction date information. Numeric columns with name prefixes V, C, D and M, whose meanings are hidden from the public.
162 | - Fraud category: Card Not Present Transaction Fraud
163 | - Provider: [Vesta Corporation](https://www.vesta.io/)
164 | - Release date: 2019-10-03
165 |     - Description: Prepared by the IEEE Computational Intelligence Society, this card-not-present transaction fraud dataset was launched during the IEEE-CIS Fraud Detection Kaggle competition, and was provided by Vesta Corporation. The original dataset contains 393 features, which are reduced to 67 features in the benchmark; feature selection was performed based on highly voted Kaggle kernels. The fraud rate in the training segment of the source dataset is 3.5%. We only used the training files (train transaction and train identity) containing 590,540 transactions in the benchmark, and split them into train (95%) and test (5%) segments based on time. Based on insights from a Kaggle kernel written by the competition winner, we added a UUID (called ENTITY_ID) that represents a fingerprint and was created using card, address, time and D1 features.
166 |
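The ENTITY_ID fingerprint described above boils down to hashing a few identity-like fields into a stable key; a hypothetical sketch (the field names `card1`, `addr1`, and the D1-derived reference day are illustrative, not the repo's actual implementation):

```python
import hashlib

def make_entity_id(card1, addr1, d1_reference_day) -> str:
    # Hash identity-like fields into a stable pseudo-UUID so transactions
    # from the same underlying entity share one ENTITY_ID.
    raw = f"{card1}|{addr1}|{d1_reference_day}"
    return hashlib.sha1(raw.encode()).hexdigest()[:16]
```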
167 | 2. **Credit Card Fraud Detection**
168 | - Source URL: https://www.kaggle.com/mlg-ulb/creditcardfraud/
169 | - Source license: https://opendatacommons.org/licenses/dbcl/1-0/
170 | - Variables: PCA transformed features, time, amount (highly imbalanced)
171 | - Fraud category: Card Not Present Transaction Fraud
172 | - Provider: [Machine Learning Group - ULB](https://mlg.ulb.ac.be/)
173 | - Release date: 2018-03-23
174 | - Description: This dataset contains anonymized credit card transactions by European cardholders in September 2013. The dataset contains 492 frauds out of 284,807 transactions over 2 days. Data only contains numerical features that are the result of a PCA transformation, plus non transformed time and amount.
175 |
176 | 3. **Fraud ecommerce**
177 | - Source URL: https://www.kaggle.com/vbinh002/fraud-ecommerce
178 | - Source license: None
179 |     - Variables: The features include sign-up time, purchase time, purchase value, device ID, user ID, browser, and IP address. We added a new feature measuring the time difference between sign-up and purchase, as the age of an account is often an important variable in fraud detection.
180 | - Fraud category: Card Not Present Transaction Fraud
181 | - Provider: [Binh Vu](https://www.kaggle.com/vbinh002)
182 | - Release date: 2018-12-09
183 | - Description: This dataset contains ~150k e-commerce transactions.
184 |
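The added time-difference feature mentioned above is a simple timestamp subtraction; a sketch (the timestamp format string is an assumption):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format in the source CSV

def time_since_signup(signup_time: str, purchase_time: str) -> float:
    """Seconds elapsed between account sign-up and purchase."""
    return (datetime.strptime(purchase_time, FMT)
            - datetime.strptime(signup_time, FMT)).total_seconds()
```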
185 | 4. **Simulated Credit Card Transactions generated using Sparkov**
186 | - Source URL: https://www.kaggle.com/kartik2112/fraud-detection
187 | - Source license: https://creativecommons.org/publicdomain/zero/1.0/
188 |     - Variables: Transaction date, credit card number, merchant, category, amount, name, street, gender. All variables are synthetically generated using the Sparkov tool.
189 | - Fraud category: Card Not Present Transaction Fraud
190 | - Provider: [Kartik Shenoy](https://www.kaggle.com/kartik2112)
191 | - Release date: 2020-08-05
192 |     - Description: This is a simulated credit card transaction dataset. The dataset was generated using the Sparkov Data Generation tool, and we modified a version of the dataset created for Kaggle. It covers transactions of 1,000 customers with a pool of 800 merchants over 6 months. We used both train and test segments directly from the source and randomly down-sampled the test segment.
193 |
194 | 5. **Twitter Bots Accounts**
195 | - Source URL: https://www.kaggle.com/code/davidmartngutirrez/bots-accounts-eda/data?select=twitter_human_bots_dataset.csv
196 | - Source license: https://creativecommons.org/publicdomain/zero/1.0/
197 |     - Variables: Features like account creation date, follower and following counts, profile description, account age, metadata about the profile picture and account activity, and a label indicating whether the account is human or bot.
198 | - Fraud category: Bot Attacks
199 | - Provider: [David Martín Gutiérrez](https://www.kaggle.com/davidmartngutirrez)
200 | - Release date: 2020-08-20
201 |     - Description: The dataset consists of 37,438 rows corresponding to different user accounts from Twitter.
202 |
203 | 6. **Malicious URLs dataset**
204 | - Source URL: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset
205 | - Source license: https://creativecommons.org/publicdomain/zero/1.0/
206 |     - Variables: url and type. Even though the original dataset has a multiclass label (type), we converted it into a binary label.
207 | - Fraud category: Malicious Traffic
208 | - Provider: [Manu Siddhartha](https://www.kaggle.com/sid321axn)
209 | - Release date: 2021-07-23
210 |     - Description: The Kaggle dataset is curated using five different sources and contains two columns: url and type. Even though the original dataset has a multiclass label (type), we converted it into a binary label. There is no timestamp information in the source, so we generate a dummy timestamp column for consistency.
211 |
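The multiclass-to-binary label conversion described above is a one-liner; a sketch assuming the Kaggle `type` column uses class names like `benign`, `phishing`, `defacement`, and `malware`:

```python
def to_binary_label(url_type: str) -> int:
    # benign -> 0 (legit); every malicious class -> 1 (fraud)
    return 0 if url_type == "benign" else 1
```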
212 | 7. **Real / Fake Job Posting Prediction**
213 | - Source URL: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
214 | - Source license: https://creativecommons.org/publicdomain/zero/1.0/
215 | - Variables: Title, location, department, company, salary range, requirements, description, benefits, telecommuting. Most of the variables are categorical and free form text in nature.
216 | - Fraud category: Content Moderation
217 | - Provider: [Shivam Bansal](https://www.kaggle.com/shivamb)
218 | - Release date: 2020-02-29
219 |     - Description: This Kaggle dataset contains 18K job descriptions, out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The task is to train a classification model to detect which job posts are fraudulent.
220 |
221 | 8. **Vehicle Loan Default Prediction**
222 | - Source URL: https://www.kaggle.com/avikpaul4u/vehicle-loan-default-prediction
223 | - Source license: Unknown
224 | - Variables: Loanee information, loan information, credit bureau data, and history.
225 | - Fraud category: Credit Risk
226 | - Provider: [Avik Paul](https://www.kaggle.com/avikpaul4u)
227 | - Release date: 2019-11-12
228 |     - Description: The task in this dataset is to determine the probability of vehicle loan default, particularly the risk of default on the first monthly installment. It contains data for 233k loans with a 21.7% default rate.
229 |
230 | 9. **IP Blocklist**
231 | - Source URL: http://cinsscore.com/list/ci-badguys.txt
232 | - Source license: Unknown
233 |     - Variables: The dataset contains an IP address and a label indicating whether it is malicious or benign. A dummy categorical variable with no relation to the label is added.
234 | - Fraud category: Malicious Traffic
235 | - Provider: [CINSscore.com](http://cinsscore.com)
236 | - Release date: 2017-09-25
237 |     - Description: This dataset is made up of malicious IP addresses from cinsscore.com. To the list of malicious IP addresses, we added randomly generated IP addresses, created using Faker and labeled as benign.
238 |
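The benign-IP generation described above can be sketched with the standard library alone (Faker's `ipv4()` provider yields similar random addresses):

```python
import ipaddress
import random

def random_benign_ipv4(rng: random.Random) -> str:
    # Draw a random 32-bit address; in the benchmark such synthetic
    # rows are labeled benign, in contrast to the cinsscore.com list.
    return str(ipaddress.IPv4Address(rng.getrandbits(32)))
```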
239 |
240 | ## Citation
241 | ```
242 | @misc{grover2023fraud,
243 | title={Fraud Dataset Benchmark and Applications},
244 | author={Prince Grover and Julia Xu and Justin Tittelfitz and Anqi Cheng and Zheng Li and Jakub Zablocki and Jianbo Liu and Hao Zhou},
245 | year={2023},
246 | eprint={2208.14417},
247 | archivePrefix={arXiv},
248 | primaryClass={cs.LG}
249 | }
250 | ```
251 |
252 | ## License
253 | This project is licensed under the MIT License.
254 |
255 |
256 | ## Acknowledgement
257 | We thank the creators of all datasets used in the benchmark and the organizations that have helped host the datasets and make them widely available for research purposes.
258 |
259 |
260 |
261 |
262 |
263 |
--------------------------------------------------------------------------------
/images/afd_steps.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/images/afd_steps.png
--------------------------------------------------------------------------------
/images/all_fdb.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/images/all_fdb.png
--------------------------------------------------------------------------------
/images/ieee_ofi_sample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/images/ieee_ofi_sample.png
--------------------------------------------------------------------------------
/scripts/reproducibility/afd/README.md:
--------------------------------------------------------------------------------
1 | ## Steps to reproduce AFD models
2 | Amazon Fraud Detector (AFD) models can be run either via the AWS Console or using API calls. In this folder, we provide scripts that make API calls to create model artifacts and then score the model on test data.
3 |
4 | The high-level steps to train and deploy a model are:
5 |
6 | 
7 |
8 | You can use the provided scripts to replicate the performance shown in the benchmark.
9 |
10 | 1. Set up AWS credentials in your terminal for the AWS account where you want to run AFD and store the data. You can use environment variables as described [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html).
11 |
12 |
13 | 2. Use the [template data-loader notebook](../../examples/Test_FDB_Loader.ipynb) to upload the benchmark data to S3. (AFD requires data to be saved in S3 and expects an S3 path.)
14 |
15 |
16 | 3. Create AFD resources including entities, event types, and the model. Update the values of `IAM_ROLE`, `BUCKET`, `KEY` and `MODEL_NAME` in `create_afd_resources.py`, then run the following.
17 |
18 | ```
19 | python create_afd_resources.py configs/{dataset-you-want-to-use}
20 | ```
21 |
22 | You can keep `MODEL_TYPE` as **ONLINE_FRAUD_INSIGHTS** or **TRANSACTION_FRAUD_INSIGHTS** to run corresponding models.
23 |
24 | This will initiate automatic model training. Wait ~1 hour for the model to train. You can check the status in your console.
25 |
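Instead of watching the console, training status can also be polled through the `boto3` Fraud Detector client; a sketch under the assumption that the model was created as ONLINE_FRAUD_INSIGHTS with version 1.0 (the model ID is a placeholder, and the call obviously requires AWS credentials):

```python
def get_model_version_request(model_id: str, model_type: str, version: str) -> dict:
    # Request shape for frauddetector.get_model_version
    return {"modelId": model_id,
            "modelType": model_type,
            "modelVersionNumber": version}

def model_training_status(model_id: str, version: str = "1.0") -> str:
    """Return the training status of an AFD model version (needs AWS creds)."""
    import boto3  # deferred import so the helper above stays usable without boto3
    client = boto3.client("frauddetector")
    resp = client.get_model_version(
        **get_model_version_request(model_id, "ONLINE_FRAUD_INSIGHTS", version))
    return resp["status"]
```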
26 | 4. Create a detector and use it to score the test data. Update the values of `IAM_ROLE`, `BUCKET`, `TEST_PATH`, `TEST_LABELS_PATH` and `MODEL_NAME` in `score_afd_model.py`, then run the following.
27 |
28 | ```
29 | python score_afd_model.py
30 | ```
31 | This will print performance metrics in the terminal and save them to the S3 location you provide in the script.
32 |
33 | After model training is completed, the AFD console will show performance metrics like the following (trained on `ieeecis` with ONLINE_FRAUD_INSIGHTS).
34 |
35 | 
36 |
37 |
38 |
39 | **To dive deeper into the workings of Amazon Fraud Detector, see the [technical guide](https://d1.awsstatic.com/fraud-detector/afd-technical-guide-detecting-new-account-fraud.pdf).**
40 |
41 |
--------------------------------------------------------------------------------
/scripts/reproducibility/afd/configs/CreditCardFraudDetection.json:
--------------------------------------------------------------------------------
1 | {
2 | "dataset": "Credit Card Fraud Detection",
3 | "variable_mappings": [
4 | {
5 | "variable_name": "v1",
6 | "variable_type": "NUMERIC",
7 | "data_type": "FLOAT"
8 | },
9 | {
10 | "variable_name": "v2",
11 | "variable_type": "NUMERIC",
12 | "data_type": "FLOAT"
13 | },
14 | {
15 | "variable_name": "v3",
16 | "variable_type": "NUMERIC",
17 | "data_type": "FLOAT"
18 | },
19 | {
20 | "variable_name": "v4",
21 | "variable_type": "NUMERIC",
22 | "data_type": "FLOAT"
23 | },
24 | {
25 | "variable_name": "v5",
26 | "variable_type": "NUMERIC",
27 | "data_type": "FLOAT"
28 | },
29 | {
30 | "variable_name": "v6",
31 | "variable_type": "NUMERIC",
32 | "data_type": "FLOAT"
33 | },
34 | {
35 | "variable_name": "v7",
36 | "variable_type": "NUMERIC",
37 | "data_type": "FLOAT"
38 | },
39 | {
40 | "variable_name": "v8",
41 | "variable_type": "NUMERIC",
42 | "data_type": "FLOAT"
43 | },
44 | {
45 | "variable_name": "v9",
46 | "variable_type": "NUMERIC",
47 | "data_type": "FLOAT"
48 | },
49 | {
50 | "variable_name": "v10",
51 | "variable_type": "NUMERIC",
52 | "data_type": "FLOAT"
53 | },
54 | {
55 | "variable_name": "v11",
56 | "variable_type": "NUMERIC",
57 | "data_type": "FLOAT"
58 | },
59 | {
60 | "variable_name": "v12",
61 | "variable_type": "NUMERIC",
62 | "data_type": "FLOAT"
63 | },
64 | {
65 | "variable_name": "v13",
66 | "variable_type": "NUMERIC",
67 | "data_type": "FLOAT"
68 | },
69 | {
70 | "variable_name": "v14",
71 | "variable_type": "NUMERIC",
72 | "data_type": "FLOAT"
73 | },
74 | {
75 | "variable_name": "v15",
76 | "variable_type": "NUMERIC",
77 | "data_type": "FLOAT"
78 | },
79 | {
80 | "variable_name": "v16",
81 | "variable_type": "NUMERIC",
82 | "data_type": "FLOAT"
83 | },
84 | {
85 | "variable_name": "v17",
86 | "variable_type": "NUMERIC",
87 | "data_type": "FLOAT"
88 | },
89 | {
90 | "variable_name": "v18",
91 | "variable_type": "NUMERIC",
92 | "data_type": "FLOAT"
93 | },
94 | {
95 | "variable_name": "v19",
96 | "variable_type": "NUMERIC",
97 | "data_type": "FLOAT"
98 | },
99 | {
100 | "variable_name": "v20",
101 | "variable_type": "NUMERIC",
102 | "data_type": "FLOAT"
103 | },
104 | {
105 | "variable_name": "v21",
106 | "variable_type": "NUMERIC",
107 | "data_type": "FLOAT"
108 | },
109 | {
110 | "variable_name": "v22",
111 | "variable_type": "NUMERIC",
112 | "data_type": "FLOAT"
113 | },
114 | {
115 | "variable_name": "v23",
116 | "variable_type": "NUMERIC",
117 | "data_type": "FLOAT"
118 | },
119 | {
120 | "variable_name": "v24",
121 | "variable_type": "NUMERIC",
122 | "data_type": "FLOAT"
123 | },
124 | {
125 | "variable_name": "v25",
126 | "variable_type": "NUMERIC",
127 | "data_type": "FLOAT"
128 | },
129 | {
130 | "variable_name": "v26",
131 | "variable_type": "NUMERIC",
132 | "data_type": "FLOAT"
133 | },
134 | {
135 | "variable_name": "v27",
136 | "variable_type": "NUMERIC",
137 | "data_type": "FLOAT"
138 | },
139 | {
140 | "variable_name": "v28",
141 | "variable_type": "NUMERIC",
142 | "data_type": "FLOAT"
143 | },
144 | {
145 | "variable_name": "amount",
146 | "variable_type": "NUMERIC",
147 | "data_type": "FLOAT"
148 | }
149 | ],
150 | "label_mappings": {
151 | "FRAUD": [
152 | "1"
153 | ],
154 | "LEGIT": [
155 | "0"
156 | ]
157 | }
158 | }
--------------------------------------------------------------------------------
/scripts/reproducibility/afd/configs/FakeJobPostingPrediction.json:
--------------------------------------------------------------------------------
1 | {
2 | "dataset": "Fake Job Posting Prediction",
3 | "variable_mappings": [
4 | {
5 | "variable_name": "title",
6 | "variable_type": "FREE_FORM_TEXT",
7 | "data_type": "STRING"
8 | },
9 | {
10 | "variable_name": "location",
11 | "variable_type": "CATEGORICAL",
12 | "data_type": "STRING"
13 | },
14 | {
15 | "variable_name": "department",
16 | "variable_type": "CATEGORICAL",
17 | "data_type": "STRING"
18 | },
19 | {
20 | "variable_name": "salary_range",
21 | "variable_type": "CATEGORICAL",
22 | "data_type": "STRING"
23 | },
24 | {
25 | "variable_name": "company_profile",
26 | "variable_type": "FREE_FORM_TEXT",
27 | "data_type": "STRING"
28 | },
29 | {
30 | "variable_name": "description",
31 | "variable_type": "FREE_FORM_TEXT",
32 | "data_type": "STRING"
33 | },
34 | {
35 | "variable_name": "requirements",
36 | "variable_type": "FREE_FORM_TEXT",
37 | "data_type": "STRING"
38 | },
39 | {
40 | "variable_name": "benefits",
41 | "variable_type": "FREE_FORM_TEXT",
42 | "data_type": "STRING"
43 | },
44 | {
45 | "variable_name": "telecommuting",
46 | "variable_type": "NUMERIC",
47 | "data_type": "FLOAT"
48 | },
49 | {
50 | "variable_name": "has_company_logo",
51 | "variable_type": "CATEGORICAL",
52 | "data_type": "STRING"
53 | },
54 | {
55 | "variable_name": "has_questions",
56 | "variable_type": "CATEGORICAL",
57 | "data_type": "STRING"
58 | },
59 | {
60 | "variable_name": "employment_type",
61 | "variable_type": "CATEGORICAL",
62 | "data_type": "STRING"
63 | },
64 | {
65 | "variable_name": "required_experience",
66 | "variable_type": "CATEGORICAL",
67 | "data_type": "STRING"
68 | },
69 | {
70 | "variable_name": "required_education",
71 | "variable_type": "CATEGORICAL",
72 | "data_type": "STRING"
73 | },
74 | {
75 | "variable_name": "industry",
76 | "variable_type": "CATEGORICAL",
77 | "data_type": "STRING"
78 | },
79 | {
80 | "variable_name": "function",
81 | "variable_type": "CATEGORICAL",
82 | "data_type": "STRING"
83 | }
84 | ],
85 | "label_mappings": {
86 | "FRAUD": [
87 | "1"
88 | ],
89 | "LEGIT": [
90 | "0"
91 | ]
92 | }
93 | }
--------------------------------------------------------------------------------
/scripts/reproducibility/afd/configs/Fraudecommerce.json:
--------------------------------------------------------------------------------
1 | {
2 | "dataset": "Fraud ecommerce",
3 | "variable_mappings": [
4 | {
5 | "variable_name": "purchase_value",
6 | "variable_type": "NUMERIC",
7 | "data_type": "FLOAT"
8 | },
9 | {
10 | "variable_name": "source",
11 | "variable_type": "CATEGORICAL",
12 | "data_type": "STRING"
13 | },
14 | {
15 | "variable_name": "browser",
16 | "variable_type": "CATEGORICAL",
17 | "data_type": "STRING"
18 | },
19 | {
20 | "variable_name": "age",
21 | "variable_type": "NUMERIC",
22 | "data_type": "FLOAT"
23 | },
24 | {
25 | "variable_name": "ip_address",
26 | "variable_type": "IP_ADDRESS",
27 | "data_type": "FLOAT"
28 | },
29 | {
30 | "variable_name": "time_since_signup",
31 | "variable_type": "NUMERIC",
32 | "data_type": "FLOAT"
33 | }
34 | ],
35 | "label_mappings": {
36 | "FRAUD": [
37 | "1"
38 | ],
39 | "LEGIT": [
40 | "0"
41 | ]
42 | }
43 | }
--------------------------------------------------------------------------------
/scripts/reproducibility/afd/configs/IEEECISFraudDetection.json:
--------------------------------------------------------------------------------
1 | {
2 | "dataset": "IEEE-CIS Fraud Detection",
3 | "variable_mappings": [
4 | {
5 | "variable_name": "transactionamt",
6 | "variable_type": "NUMERIC",
7 | "data_type": "FLOAT"
8 | },
9 | {
10 | "variable_name": "productcd",
11 | "variable_type": "CATEGORICAL",
12 | "data_type": "STRING"
13 | },
14 | {
15 | "variable_name": "card1",
16 | "variable_type": "NUMERIC",
17 | "data_type": "FLOAT"
18 | },
19 | {
20 | "variable_name": "card2",
21 | "variable_type": "NUMERIC",
22 | "data_type": "FLOAT"
23 | },
24 | {
25 | "variable_name": "card3",
26 | "variable_type": "NUMERIC",
27 | "data_type": "FLOAT"
28 | },
29 | {
30 | "variable_name": "card5",
31 | "variable_type": "NUMERIC",
32 | "data_type": "FLOAT"
33 | },
34 | {
35 | "variable_name": "card6",
36 | "variable_type": "CATEGORICAL",
37 | "data_type": "STRING"
38 | },
39 | {
40 | "variable_name": "addr1",
41 | "variable_type": "NUMERIC",
42 | "data_type": "FLOAT"
43 | },
44 | {
45 | "variable_name": "dist1",
46 | "variable_type": "NUMERIC",
47 | "data_type": "FLOAT"
48 | },
49 | {
50 | "variable_name": "p_emaildomain",
51 | "variable_type": "CATEGORICAL",
52 | "data_type": "STRING"
53 | },
54 | {
55 | "variable_name": "r_emaildomain",
56 | "variable_type": "CATEGORICAL",
57 | "data_type": "STRING"
58 | },
59 | {
60 | "variable_name": "c1",
61 | "variable_type": "NUMERIC",
62 | "data_type": "FLOAT"
63 | },
64 | {
65 | "variable_name": "c2",
66 | "variable_type": "NUMERIC",
67 | "data_type": "FLOAT"
68 | },
69 | {
70 | "variable_name": "c4",
71 | "variable_type": "NUMERIC",
72 | "data_type": "FLOAT"
73 | },
74 | {
75 | "variable_name": "c5",
76 | "variable_type": "NUMERIC",
77 | "data_type": "FLOAT"
78 | },
79 | {
80 | "variable_name": "c6",
81 | "variable_type": "NUMERIC",
82 | "data_type": "FLOAT"
83 | },
84 | {
85 | "variable_name": "c7",
86 | "variable_type": "NUMERIC",
87 | "data_type": "FLOAT"
88 | },
89 | {
90 | "variable_name": "c8",
91 | "variable_type": "NUMERIC",
92 | "data_type": "FLOAT"
93 | },
94 | {
95 | "variable_name": "c9",
96 | "variable_type": "NUMERIC",
97 | "data_type": "FLOAT"
98 | },
99 | {
100 | "variable_name": "c10",
101 | "variable_type": "NUMERIC",
102 | "data_type": "FLOAT"
103 | },
104 | {
105 | "variable_name": "c11",
106 | "variable_type": "NUMERIC",
107 | "data_type": "FLOAT"
108 | },
109 | {
110 | "variable_name": "c12",
111 | "variable_type": "NUMERIC",
112 | "data_type": "FLOAT"
113 | },
114 | {
115 | "variable_name": "c13",
116 | "variable_type": "NUMERIC",
117 | "data_type": "FLOAT"
118 | },
119 | {
120 | "variable_name": "c14",
121 | "variable_type": "NUMERIC",
122 | "data_type": "FLOAT"
123 | },
124 | {
125 | "variable_name": "v62",
126 | "variable_type": "NUMERIC",
127 | "data_type": "FLOAT"
128 | },
129 | {
130 | "variable_name": "v70",
131 | "variable_type": "NUMERIC",
132 | "data_type": "FLOAT"
133 | },
134 | {
135 | "variable_name": "v76",
136 | "variable_type": "NUMERIC",
137 | "data_type": "FLOAT"
138 | },
139 | {
140 | "variable_name": "v78",
141 | "variable_type": "NUMERIC",
142 | "data_type": "FLOAT"
143 | },
144 | {
145 | "variable_name": "v82",
146 | "variable_type": "NUMERIC",
147 | "data_type": "FLOAT"
148 | },
149 | {
150 | "variable_name": "v91",
151 | "variable_type": "NUMERIC",
152 | "data_type": "FLOAT"
153 | },
154 | {
155 | "variable_name": "v127",
156 | "variable_type": "NUMERIC",
157 | "data_type": "FLOAT"
158 | },
159 | {
160 | "variable_name": "v130",
161 | "variable_type": "NUMERIC",
162 | "data_type": "FLOAT"
163 | },
164 | {
165 | "variable_name": "v139",
166 | "variable_type": "NUMERIC",
167 | "data_type": "FLOAT"
168 | },
169 | {
170 | "variable_name": "v160",
171 | "variable_type": "NUMERIC",
172 | "data_type": "FLOAT"
173 | },
174 | {
175 | "variable_name": "v165",
176 | "variable_type": "NUMERIC",
177 | "data_type": "FLOAT"
178 | },
179 | {
180 | "variable_name": "v187",
181 | "variable_type": "NUMERIC",
182 | "data_type": "FLOAT"
183 | },
184 | {
185 | "variable_name": "v203",
186 | "variable_type": "NUMERIC",
187 | "data_type": "FLOAT"
188 | },
189 | {
190 | "variable_name": "v207",
191 | "variable_type": "NUMERIC",
192 | "data_type": "FLOAT"
193 | },
194 | {
195 | "variable_name": "v209",
196 | "variable_type": "NUMERIC",
197 | "data_type": "FLOAT"
198 | },
199 | {
200 | "variable_name": "v210",
201 | "variable_type": "NUMERIC",
202 | "data_type": "FLOAT"
203 | },
204 | {
205 | "variable_name": "v221",
206 | "variable_type": "NUMERIC",
207 | "data_type": "FLOAT"
208 | },
209 | {
210 | "variable_name": "v234",
211 | "variable_type": "NUMERIC",
212 | "data_type": "FLOAT"
213 | },
214 | {
215 | "variable_name": "v257",
216 | "variable_type": "NUMERIC",
217 | "data_type": "FLOAT"
218 | },
219 | {
220 | "variable_name": "v258",
221 | "variable_type": "NUMERIC",
222 | "data_type": "FLOAT"
223 | },
224 | {
225 | "variable_name": "v261",
226 | "variable_type": "NUMERIC",
227 | "data_type": "FLOAT"
228 | },
229 | {
230 | "variable_name": "v264",
231 | "variable_type": "NUMERIC",
232 | "data_type": "FLOAT"
233 | },
234 | {
235 | "variable_name": "v266",
236 | "variable_type": "NUMERIC",
237 | "data_type": "FLOAT"
238 | },
239 | {
240 | "variable_name": "v267",
241 | "variable_type": "NUMERIC",
242 | "data_type": "FLOAT"
243 | },
244 | {
245 | "variable_name": "v271",
246 | "variable_type": "NUMERIC",
247 | "data_type": "FLOAT"
248 | },
249 | {
250 | "variable_name": "v274",
251 | "variable_type": "NUMERIC",
252 | "data_type": "FLOAT"
253 | },
254 | {
255 | "variable_name": "v277",
256 | "variable_type": "NUMERIC",
257 | "data_type": "FLOAT"
258 | },
259 | {
260 | "variable_name": "v283",
261 | "variable_type": "NUMERIC",
262 | "data_type": "FLOAT"
263 | },
264 | {
265 | "variable_name": "v285",
266 | "variable_type": "NUMERIC",
267 | "data_type": "FLOAT"
268 | },
269 | {
270 | "variable_name": "v289",
271 | "variable_type": "NUMERIC",
272 | "data_type": "FLOAT"
273 | },
274 | {
275 | "variable_name": "v291",
276 | "variable_type": "NUMERIC",
277 | "data_type": "FLOAT"
278 | },
279 | {
280 | "variable_name": "v294",
281 | "variable_type": "NUMERIC",
282 | "data_type": "FLOAT"
283 | },
284 | {
285 | "variable_name": "id_01",
286 | "variable_type": "NUMERIC",
287 | "data_type": "FLOAT"
288 | },
289 | {
290 | "variable_name": "id_02",
291 | "variable_type": "NUMERIC",
292 | "data_type": "FLOAT"
293 | },
294 | {
295 | "variable_name": "id_05",
296 | "variable_type": "NUMERIC",
297 | "data_type": "FLOAT"
298 | },
299 | {
300 | "variable_name": "id_06",
301 | "variable_type": "NUMERIC",
302 | "data_type": "FLOAT"
303 | },
304 | {
305 | "variable_name": "id_09",
306 | "variable_type": "NUMERIC",
307 | "data_type": "FLOAT"
308 | },
309 | {
310 | "variable_name": "id_13",
311 | "variable_type": "NUMERIC",
312 | "data_type": "FLOAT"
313 | },
314 | {
315 | "variable_name": "id_17",
316 | "variable_type": "NUMERIC",
317 | "data_type": "FLOAT"
318 | },
319 | {
320 | "variable_name": "id_19",
321 | "variable_type": "NUMERIC",
322 | "data_type": "FLOAT"
323 | },
324 | {
325 | "variable_name": "id_20",
326 | "variable_type": "NUMERIC",
327 | "data_type": "FLOAT"
328 | },
329 | {
330 | "variable_name": "devicetype",
331 | "variable_type": "CATEGORICAL",
332 | "data_type": "STRING"
333 | },
334 | {
335 | "variable_name": "deviceinfo",
336 | "variable_type": "CATEGORICAL",
337 | "data_type": "STRING"
338 | }
339 | ],
340 | "label_mappings": {
341 | "FRAUD": [
342 | "1"
343 | ],
344 | "LEGIT": [
345 | "0"
346 | ]
347 | }
348 | }
--------------------------------------------------------------------------------
/scripts/reproducibility/afd/configs/IPBlocklist.json:
--------------------------------------------------------------------------------
1 | {
2 | "dataset": "IP-BlockList",
3 | "variable_mappings": [
4 | {
5 | "variable_name": "ip",
6 | "variable_type": "IP_ADDRESS",
7 | "data_type": "STRING"
8 | },
9 | {
10 | "variable_name": "dummy_cat",
11 | "variable_type": "CATEGORICAL",
12 | "data_type": "STRING"
13 | }
14 | ],
15 | "label_mappings": {
16 | "FRAUD": [
17 | "1"
18 | ],
19 | "LEGIT": [
20 | "0"
21 | ]
22 | }
23 | }
--------------------------------------------------------------------------------
/scripts/reproducibility/afd/configs/MaliciousURL.json:
--------------------------------------------------------------------------------
1 | {
2 | "dataset": "Malicious URLs Dataset",
3 | "variable_mappings": [
4 | {
5 | "variable_name": "url",
6 | "variable_type": "FREE_FORM_TEXT",
7 | "data_type": "STRING"
8 | },
9 | {
10 | "variable_name": "dummy_cat",
11 | "variable_type": "CATEGORICAL",
12 | "data_type": "STRING"
13 | }
14 | ],
15 | "label_mappings": {
16 | "FRAUD": [
17 | "malignant"
18 | ],
19 | "LEGIT": [
20 | "benign"
21 | ]
22 | }
23 | }
--------------------------------------------------------------------------------
/scripts/reproducibility/afd/configs/SimulatedCreditCardTransactionsSparkov.json:
--------------------------------------------------------------------------------
1 | {
2 | "dataset": "Simulated Credit Card Transactions generated using Sparkov",
3 | "variable_mappings": [
4 | {
5 | "variable_name": "cc_num",
6 | "variable_type": "CARD_BIN",
7 | "data_type": "INTEGER"
8 | },
9 | {
10 | "variable_name": "category",
11 | "variable_type": "CATEGORICAL",
12 | "data_type": "STRING"
13 | },
14 | {
15 | "variable_name": "amt",
16 | "variable_type": "NUMERIC",
17 | "data_type": "FLOAT"
18 | },
19 | {
20 | "variable_name": "first",
21 | "variable_type": "BILLING_NAME",
22 | "data_type": "STRING"
23 | },
24 | {
25 | "variable_name": "last",
26 | "variable_type": "BILLING_NAME",
27 | "data_type": "STRING"
28 | },
29 | {
30 | "variable_name": "gender",
31 | "variable_type": "CATEGORICAL",
32 | "data_type": "STRING"
33 | },
34 | {
35 | "variable_name": "street",
36 | "variable_type": "BILLING_ADDRESS_L1",
37 | "data_type": "STRING"
38 | },
39 | {
40 | "variable_name": "city",
41 | "variable_type": "BILLING_CITY",
42 | "data_type": "STRING"
43 | },
44 | {
45 | "variable_name": "state",
46 | "variable_type": "BILLING_STATE",
47 | "data_type": "STRING"
48 | },
49 | {
50 | "variable_name": "zip",
51 | "variable_type": "BILLING_ZIP",
52 | "data_type": "STRING"
53 | },
54 | {
55 | "variable_name": "lat",
56 | "variable_type": "NUMERIC",
57 | "data_type": "FLOAT"
58 | },
59 | {
60 | "variable_name": "long",
61 | "variable_type": "NUMERIC",
62 | "data_type": "FLOAT"
63 | },
64 | {
65 | "variable_name": "city_pop",
66 | "variable_type": "NUMERIC",
67 | "data_type": "FLOAT"
68 | },
69 | {
70 | "variable_name": "job",
71 | "variable_type": "CATEGORICAL",
72 | "data_type": "STRING"
73 | },
74 | {
75 | "variable_name": "dob",
76 | "variable_type": "FREE_FORM_TEXT",
77 | "data_type": "STRING"
78 | },
79 | {
80 | "variable_name": "merch_lat",
81 | "variable_type": "NUMERIC",
82 | "data_type": "FLOAT"
83 | },
84 | {
85 | "variable_name": "merch_long",
86 | "variable_type": "NUMERIC",
87 | "data_type": "FLOAT"
88 | }
89 | ],
90 | "label_mappings": {
91 | "FRAUD": [
92 | "1"
93 | ],
94 | "LEGIT": [
95 | "0"
96 | ]
97 | }
98 | }
--------------------------------------------------------------------------------
/scripts/reproducibility/afd/configs/TwitterBotAccounts.json:
--------------------------------------------------------------------------------
1 | {
2 | "dataset": "Twitter Bots Accounts",
3 | "variable_mappings": [
4 | {
5 | "variable_name": "default_profile",
6 | "variable_type": "CATEGORICAL",
7 | "data_type": "STRING"
8 | },
9 | {
10 | "variable_name": "default_profile_image",
11 | "variable_type": "CATEGORICAL",
12 | "data_type": "STRING"
13 | },
14 | {
15 | "variable_name": "description",
16 | "variable_type": "FREE_FORM_TEXT",
17 | "data_type": "STRING"
18 | },
19 | {
20 | "variable_name": "favourites_count",
21 | "variable_type": "NUMERIC",
22 | "data_type": "FLOAT"
23 | },
24 | {
25 | "variable_name": "followers_count",
26 | "variable_type": "NUMERIC",
27 | "data_type": "FLOAT"
28 | },
29 | {
30 | "variable_name": "friends_count",
31 | "variable_type": "NUMERIC",
32 | "data_type": "FLOAT"
33 | },
34 | {
35 | "variable_name": "geo_enabled",
36 | "variable_type": "CATEGORICAL",
37 | "data_type": "STRING"
38 | },
39 | {
40 | "variable_name": "lang",
41 | "variable_type": "CATEGORICAL",
42 | "data_type": "STRING"
43 | },
44 | {
45 | "variable_name": "location",
46 | "variable_type": "FREE_FORM_TEXT",
47 | "data_type": "STRING"
48 | },
49 | {
50 | "variable_name": "profile_background_image_url",
51 | "variable_type": "FREE_FORM_TEXT",
52 | "data_type": "STRING"
53 | },
54 | {
55 | "variable_name": "profile_image_url",
56 | "variable_type": "FREE_FORM_TEXT",
57 | "data_type": "STRING"
58 | },
59 | {
60 | "variable_name": "screen_name",
61 | "variable_type": "CATEGORICAL",
62 | "data_type": "STRING"
63 | },
64 | {
65 | "variable_name": "statuses_count",
66 | "variable_type": "NUMERIC",
67 | "data_type": "FLOAT"
68 | },
69 | {
70 | "variable_name": "verified",
71 | "variable_type": "CATEGORICAL",
72 | "data_type": "STRING"
73 | },
74 | {
75 | "variable_name": "average_tweets_per_day",
76 | "variable_type": "NUMERIC",
77 | "data_type": "FLOAT"
78 | },
79 | {
80 | "variable_name": "account_age_days",
81 | "variable_type": "NUMERIC",
82 | "data_type": "FLOAT"
83 | }
84 | ],
85 | "label_mappings": {
86 | "FRAUD": [
87 | "bot"
88 | ],
89 | "LEGIT": [
90 | "human"
91 | ]
92 | }
93 | }
--------------------------------------------------------------------------------
/scripts/reproducibility/afd/configs/VehicleLoanDefaultPrediction.json:
--------------------------------------------------------------------------------
1 | {
2 | "dataset": "Vehicle Loan Default Prediction",
3 | "variable_mappings": [
4 | {
5 | "variable_name": "disbursed_amount",
6 | "variable_type": "NUMERIC",
7 | "data_type": "FLOAT"
8 | },
9 | {
10 | "variable_name": "asset_cost",
11 | "variable_type": "NUMERIC",
12 | "data_type": "FLOAT"
13 | },
14 | {
15 | "variable_name": "ltv",
16 | "variable_type": "NUMERIC",
17 | "data_type": "FLOAT"
18 | },
19 | {
20 | "variable_name": "branch_id",
21 | "variable_type": "CATEGORICAL",
22 | "data_type": "STRING"
23 | },
24 | {
25 | "variable_name": "supplier_id",
26 | "variable_type": "CATEGORICAL",
27 | "data_type": "STRING"
28 | },
29 | {
30 | "variable_name": "manufacturer_id",
31 | "variable_type": "CATEGORICAL",
32 | "data_type": "STRING"
33 | },
34 | {
35 | "variable_name": "current_pincode_id",
36 | "variable_type": "CATEGORICAL",
37 | "data_type": "STRING"
38 | },
39 | {
40 | "variable_name": "date_of_birth",
41 | "variable_type": "FREE_FORM_TEXT",
42 | "data_type": "STRING"
43 | },
44 | {
45 | "variable_name": "employment_type",
46 | "variable_type": "CATEGORICAL",
47 | "data_type": "STRING"
48 | },
49 | {
50 | "variable_name": "state_id",
51 | "variable_type": "CATEGORICAL",
52 | "data_type": "STRING"
53 | },
54 | {
55 | "variable_name": "employee_code_id",
56 | "variable_type": "CATEGORICAL",
57 | "data_type": "STRING"
58 | },
59 | {
60 | "variable_name": "mobileno_avl_flag",
61 | "variable_type": "CATEGORICAL",
62 | "data_type": "STRING"
63 | },
64 | {
65 | "variable_name": "aadhar_flag",
66 | "variable_type": "CATEGORICAL",
67 | "data_type": "STRING"
68 | },
69 | {
70 | "variable_name": "pan_flag",
71 | "variable_type": "CATEGORICAL",
72 | "data_type": "STRING"
73 | },
74 | {
75 | "variable_name": "voterid_flag",
76 | "variable_type": "CATEGORICAL",
77 | "data_type": "STRING"
78 | },
79 | {
80 | "variable_name": "driving_flag",
81 | "variable_type": "CATEGORICAL",
82 | "data_type": "STRING"
83 | },
84 | {
85 | "variable_name": "passport_flag",
86 | "variable_type": "CATEGORICAL",
87 | "data_type": "STRING"
88 | },
89 | {
90 | "variable_name": "perform_cns_score",
91 | "variable_type": "NUMERIC",
92 | "data_type": "FLOAT"
93 | },
94 | {
95 | "variable_name": "perform_cns_score_description",
96 | "variable_type": "FREE_FORM_TEXT",
97 | "data_type": "STRING"
98 | },
99 | {
100 | "variable_name": "pri_no_of_accts",
101 | "variable_type": "NUMERIC",
102 | "data_type": "FLOAT"
103 | },
104 | {
105 | "variable_name": "pri_active_accts",
106 | "variable_type": "NUMERIC",
107 | "data_type": "FLOAT"
108 | },
109 | {
110 | "variable_name": "pri_overdue_accts",
111 | "variable_type": "NUMERIC",
112 | "data_type": "FLOAT"
113 | },
114 | {
115 | "variable_name": "pri_current_balance",
116 | "variable_type": "NUMERIC",
117 | "data_type": "FLOAT"
118 | },
119 | {
120 | "variable_name": "pri_sanctioned_amount",
121 | "variable_type": "NUMERIC",
122 | "data_type": "FLOAT"
123 | },
124 | {
125 | "variable_name": "pri_disbursed_amount",
126 | "variable_type": "NUMERIC",
127 | "data_type": "FLOAT"
128 | },
129 | {
130 | "variable_name": "sec_no_of_accts",
131 | "variable_type": "NUMERIC",
132 | "data_type": "FLOAT"
133 | },
134 | {
135 | "variable_name": "sec_active_accts",
136 | "variable_type": "NUMERIC",
137 | "data_type": "FLOAT"
138 | },
139 | {
140 | "variable_name": "sec_overdue_accts",
141 | "variable_type": "NUMERIC",
142 | "data_type": "FLOAT"
143 | },
144 | {
145 | "variable_name": "sec_current_balance",
146 | "variable_type": "NUMERIC",
147 | "data_type": "FLOAT"
148 | },
149 | {
150 | "variable_name": "sec_sanctioned_amount",
151 | "variable_type": "NUMERIC",
152 | "data_type": "FLOAT"
153 | },
154 | {
155 | "variable_name": "sec_disbursed_amount",
156 | "variable_type": "NUMERIC",
157 | "data_type": "FLOAT"
158 | },
159 | {
160 | "variable_name": "primary_instal_amt",
161 | "variable_type": "NUMERIC",
162 | "data_type": "FLOAT"
163 | },
164 | {
165 | "variable_name": "sec_instal_amt",
166 | "variable_type": "NUMERIC",
167 | "data_type": "FLOAT"
168 | },
169 | {
170 | "variable_name": "new_accts_in_last_six_months",
171 | "variable_type": "NUMERIC",
172 | "data_type": "FLOAT"
173 | },
174 | {
175 | "variable_name": "delinquent_accts_in_last_six_months",
176 | "variable_type": "NUMERIC",
177 | "data_type": "FLOAT"
178 | },
179 | {
180 | "variable_name": "average_acct_age",
181 | "variable_type": "FREE_FORM_TEXT",
182 | "data_type": "STRING"
183 | },
184 | {
185 | "variable_name": "credit_history_length",
186 | "variable_type": "NUMERIC",
187 | "data_type": "FLOAT"
188 | },
189 | {
190 | "variable_name": "no_of_inquiries",
191 | "variable_type": "NUMERIC",
192 | "data_type": "FLOAT"
193 | }
194 | ],
195 | "label_mappings": {
196 | "FRAUD": [
197 | "1"
198 | ],
199 | "LEGIT": [
200 | "0"
201 | ]
202 | }
203 | }
--------------------------------------------------------------------------------
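The config files above all share one schema: a `dataset` name, a `variable_mappings` list, and a `label_mappings` dict. A minimal standalone sketch (using a hypothetical two-variable config, not one of the real files) of how `create_afd_resources.py` derives its event variables and flattened label list from such a file:

```python
import json

# Hypothetical minimal config following the same schema as the files above
config_file = json.loads("""
{
  "dataset": "Example",
  "variable_mappings": [
    {"variable_name": "amount", "variable_type": "NUMERIC", "data_type": "FLOAT"},
    {"variable_name": "browser", "variable_type": "CATEGORICAL", "data_type": "STRING"}
  ],
  "label_mappings": {"FRAUD": ["1"], "LEGIT": ["0"]}
}
""")

# Mirrors the extraction logic at the top of afd_train_model_demo():
# collect variable names, then flatten the per-class label lists
event_variables = [v["variable_name"] for v in config_file["variable_mappings"]]
event_labels = [label for values in config_file["label_mappings"].values()
                for label in values]

print(event_variables)  # ['amount', 'browser']
print(event_labels)     # ['1', '0']
```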
/scripts/reproducibility/afd/create_afd_resources.py:
--------------------------------------------------------------------------------
1 | # TO BE UPDATED BY USER
2 | IAM_ROLE = ""
3 | BUCKET = ""
4 | KEY = ""
5 | MODEL_NAME = "" # lower case alphanumeric only, only _ allowed as delimiter
6 | MODEL_TYPE = "ONLINE_FRAUD_INSIGHTS" # or TRANSACTION_FRAUD_INSIGHTS
7 |
8 | import os
9 | import time
10 | import json
11 | import boto3
12 | import click
13 | import string
14 | import random
15 | import logging
16 | import pandas as pd
17 |
18 |
19 | MODEL_DESC = "Benchmarking model"
20 | EVENT_DESC = "Event for benchmarking model"
21 | ENTITY_TYPE = "user" # provided in the dummy data; change this if using different data
22 | ENTITY_DESC = "Entity for benchmarking model"
23 |
24 | BATCH_PREDICTION_JOB = DETECTOR_NAME = EVENT_TYPE = MODEL_NAME # other resource names reuse the model name
25 |
26 | # boto3 connections
27 | client = boto3.client('frauddetector')
28 | s3 = boto3.client('s3')
29 |
30 | @click.command()
31 | @click.argument("config", type=click.Path(exists=True))
32 | def afd_train_model_demo(config):
33 |
34 | #############################################
35 | ##### Setup #####
36 | with open(config, "r") as f:
37 | config_file = json.load(f)
38 |
39 |
40 | EVENT_VARIABLES = [variable["variable_name"] for variable in config_file["variable_mappings"]]
41 | EVENT_LABELS = [v for k,v in config_file["label_mappings"].items()]
42 | EVENT_LABELS = [item for sublist in EVENT_LABELS for item in sublist] # flattening list of lists
43 |
44 | # Echo the variable and label mappings of the demo data so the user can verify them
45 | click.echo(f'{pd.DataFrame(config_file["variable_mappings"])}')
46 | click.echo(f'{pd.DataFrame(config_file["label_mappings"])}')
47 |
48 | S3_DATA_PATH = "s3://" + os.path.join(BUCKET, KEY)
49 |
50 | #############################################
51 | ##### Create event variables and labels #####
52 |
53 | # -- create variable --
54 | for variable in config_file["variable_mappings"]:
55 |
56 | DEFAULT_VALUE = '0.0' if variable["data_type"] == "FLOAT" else ''
57 |
58 | try:
59 | resp = client.get_variables(name = variable["variable_name"])
60 | click.echo("{0} exists, data type: {1}".format(variable["variable_name"], resp['variables'][0]['dataType']))
61 |         except Exception:
62 | click.echo("Creating variable: {0}".format(variable["variable_name"]))
63 | resp = client.create_variable(
64 | name = variable["variable_name"],
65 | dataType = variable["data_type"],
66 | dataSource ='EVENT',
67 | defaultValue = DEFAULT_VALUE,
68 | description = variable["variable_name"],
69 | variableType = variable["variable_type"])
70 | # Putting FRAUD
71 | for f in config_file["label_mappings"]["FRAUD"]:
72 | response = client.put_label(
73 | name = f,
74 | description = "FRAUD")
75 | # Putting LEGIT
76 | for f in config_file["label_mappings"]["LEGIT"]:
77 | response = client.put_label(
78 | name = f,
79 | description = "LEGIT")
80 |
81 | #############################################
82 | ##### Define Entity and Event Types #####
83 |
84 | # -- create entity type --
85 | try:
86 | response = client.get_entity_types(name = ENTITY_TYPE)
87 | click.echo("-- entity type exists --")
88 | click.echo(response)
89 |     except Exception:
90 | response = client.put_entity_type(
91 | name = ENTITY_TYPE,
92 | description = ENTITY_DESC
93 | )
94 | click.echo("-- create entity type --")
95 | click.echo(response)
96 |
97 |
98 | # -- create event type --
99 | try:
100 | response = client.get_event_types(name = EVENT_TYPE)
101 | click.echo("\n-- event type exists --")
102 | click.echo(response)
103 |     except Exception:
104 | response = client.put_event_type (
105 | name = EVENT_TYPE,
106 | eventVariables = EVENT_VARIABLES,
107 | labels = EVENT_LABELS,
108 | entityTypes = [ENTITY_TYPE])
109 | click.echo("\n-- create event type --")
110 | click.echo(response)
111 |
112 | #############################################
113 | ##### Batch import training file for TFI #####
114 | if MODEL_TYPE == "TRANSACTION_FRAUD_INSIGHTS":
115 | try:
116 | response = client.create_batch_import_job(
117 | jobId = BATCH_PREDICTION_JOB,
118 | inputPath = S3_DATA_PATH,
119 | outputPath = "s3://" + BUCKET,
120 | eventTypeName = EVENT_TYPE,
121 | iamRoleArn = IAM_ROLE
122 | )
123 | except Exception:
124 | pass
125 |
126 | # -- wait until batch import is finished --
127 | print("--- waiting until batch import is finished ")
128 | stime = time.time()
129 | while True:
130 | response = client.get_batch_import_jobs(jobId=BATCH_PREDICTION_JOB)
131 | if 'IN_PROGRESS' in response['batchImports'][0]['status']:
132 | print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes")
133 | time.sleep(60) # sleep for 1 minute
134 | else:
135 |             print("Batch import status: " + response['batchImports'][0]['status'])
136 | break
137 |
138 | etime = time.time()
139 | print(f"Elapsed time: {(etime - stime)/60:{3}.{3}} minutes \n" )
140 | print(response)
141 |
142 |
143 | #############################################
144 | ##### Create and train your model #####
145 | try:
146 | response = client.create_model(
147 | description = MODEL_DESC,
148 | eventTypeName = EVENT_TYPE,
149 | modelId = MODEL_NAME,
150 | modelType = MODEL_TYPE)
151 |         click.echo("-- initialize model --")
152 | click.echo(response)
153 | except Exception:
154 | pass
155 |
156 |     # -- initialized the model; it's now ready to train --
157 |
158 | # -- first define training_data_schema for model to use --
159 |
160 |
161 | if MODEL_TYPE == "TRANSACTION_FRAUD_INSIGHTS":
162 | training_data_schema = {
163 | 'modelVariables' : EVENT_VARIABLES,
164 | 'labelSchema' : {
165 | 'labelMapper' : config_file["label_mappings"],
166 | 'unlabeledEventsTreatment': 'IGNORE'
167 | }
168 | }
169 | response = client.create_model_version(
170 | modelId = MODEL_NAME,
171 | modelType = MODEL_TYPE,
172 | trainingDataSource = 'INGESTED_EVENTS',
173 | trainingDataSchema = training_data_schema,
174 |         ingestedEventsDetail={ # update this window to match your ingested events
175 | 'ingestedEventsTimeWindow': {
176 | 'startTime': '2020-12-10T00:00:00Z', # '2021-08-28T00:00:00Z',
177 | 'endTime': '2022-06-07T00:00:00Z' #'2022-05-10T00:00:00Z'
178 | }
179 | }
180 | )
181 | else:
182 | training_data_schema = {
183 | 'modelVariables' : EVENT_VARIABLES,
184 | 'labelSchema' : {
185 | 'labelMapper' : config_file["label_mappings"]
186 | }
187 | }
188 | response = client.create_model_version(
189 | modelId = MODEL_NAME,
190 | modelType = MODEL_TYPE,
191 | trainingDataSource = 'EXTERNAL_EVENTS',
192 | trainingDataSchema = training_data_schema,
193 | externalEventsDetail = {
194 | 'dataLocation' : S3_DATA_PATH,
195 | 'dataAccessRoleArn': IAM_ROLE
196 | }
197 | )
198 | model_version = response['modelVersionNumber']
199 | click.echo("-- model training --")
200 | click.echo(response)
201 |
202 |
203 | if __name__ == "__main__":
204 | afd_train_model_demo()
205 |
--------------------------------------------------------------------------------
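The rule expressions that `create_rules()` in `score_afd_model.py` builds follow a simple pattern: every outcome except the last gets a `> cut` rule, and the last outcome catches everything at or below the final cut. A self-contained sketch of just that expression logic (the `demo_model` name is a hypothetical placeholder, not a real resource):

```python
# Standalone sketch of the DETECTORPL expression logic used by
# create_rules() in score_afd_model.py
MODEL_NAME = "demo_model"

def build_rule_expressions(score_cuts, outcomes):
    """Return one DETECTORPL expression per outcome.

    score_cuts must be in descending order and contain exactly one
    fewer entry than outcomes.
    """
    if len(score_cuts) + 1 != len(outcomes):
        raise ValueError("need exactly one more outcome than score cut")
    rules = []
    for i in range(len(outcomes)):
        if i < len(outcomes) - 1:
            # all but the last outcome: score strictly above its cut
            rules.append(f"${MODEL_NAME}_insightscore > {score_cuts[i]}")
        else:
            # last outcome: score at or below the final cut
            rules.append(f"${MODEL_NAME}_insightscore <= {score_cuts[i - 1]}")
    return rules

print(build_rule_expressions([750], ["fraud", "approve"]))
# ['$demo_model_insightscore > 750', '$demo_model_insightscore <= 750']
```

With `ruleExecutionMode = 'FIRST_MATCHED'`, the descending order of the cuts matters: the highest-score rule must come first so a score of, say, 960 matches `> 950` before any lower cut.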
/scripts/reproducibility/afd/score_afd_model.py:
--------------------------------------------------------------------------------
1 | # TO BE UPDATED BY USER
2 | IAM_ROLE = ""
3 | BUCKET = ""
4 | TEST_PATH = ""
5 | TEST_LABELS_PATH = ""
6 | MODEL_NAME = "" # lower case alphanumeric only, only _ allowed as delimiter
7 | MODEL_TYPE = "ONLINE_FRAUD_INSIGHTS" # or TRANSACTION_FRAUD_INSIGHTS
8 |
9 | import os
10 | import ast
11 | import time
12 | import json
13 | import boto3
14 | import click
15 | import string
16 | import random
17 | import logging
18 | import numpy as np
19 | import pandas as pd
20 | from sklearn.metrics import roc_curve, auc
21 |
22 | # boto3 connections
23 | client = boto3.client('frauddetector')
24 | s3 = boto3.client('s3')
25 |
26 | BATCH_PREDICTION_JOB = DETECTOR_NAME = EVENT_TYPE = MODEL_NAME
27 | model_version = '1.0'
28 | DETECTOR_DESC = "Benchmarking detector"
29 |
30 |
31 | def create_outcomes(outcomes):
32 | """
33 | Create Fraud Detector Outcomes
34 | """
35 | for outcome in outcomes:
36 | print("creating outcome variable: {0} ".format(outcome))
37 | response = client.put_outcome(name = outcome, description = outcome)
38 |
39 |
40 | def create_rules(score_cuts, outcomes):
41 | """
42 | Creating rules
43 |
44 | Arguments:
45 | score_cuts - list of score cuts to create rules
46 | outcomes - list of outcomes associated with the rules
47 |
48 | Returns:
49 |     a rule list to be used when creating the detector
50 | """
51 |
52 | if len(score_cuts)+1 != len(outcomes):
53 |         logging.error('Score cuts and outcomes do not match: expected one more outcome than score cuts.')
54 |
55 | rule_list = []
56 | for i in range(len(outcomes)):
57 | # rule expression
58 | if i < (len(outcomes)-1):
59 | rule = "${0}_insightscore > {1}".format(MODEL_NAME,score_cuts[i])
60 | else:
61 | rule = "${0}_insightscore <= {1}".format(MODEL_NAME,score_cuts[i-1])
62 |
63 |         # append to rule_list (used when creating the detector)
64 | rule_id = "rules{0}_{1}".format(i, MODEL_NAME[:9])
65 |
66 | rule_list.append({
67 | "ruleId": rule_id,
68 | "ruleVersion" : '1',
69 | "detectorId" : DETECTOR_NAME
70 | })
71 |
72 | # create rules
73 | print("creating rule: {0}: IF {1} THEN {2}".format(rule_id, rule, outcomes[i]))
74 | try:
75 | response = client.create_rule(
76 | ruleId = rule_id,
77 | detectorId = DETECTOR_NAME,
78 | expression = rule,
79 | language = 'DETECTORPL',
80 | outcomes = [outcomes[i]]
81 | )
82 |         except Exception:
83 | print("this rule already exists in this detector")
84 |
85 | return rule_list
86 |
87 |
88 | def ast_with_nan(x):
89 | try:
90 | return ast.literal_eval(x)
91 |     except (ValueError, SyntaxError, TypeError):
92 | return np.nan
93 |
94 |
95 | def afd_train_model_demo():
96 |
97 | # -- activate the model version --
98 | try:
99 | response = client.update_model_version_status (
100 | modelId = MODEL_NAME,
101 | modelType = MODEL_TYPE,
102 | modelVersionNumber = model_version,
103 | status = 'ACTIVE'
104 | )
105 | print("-- activating model --")
106 | print(response)
107 | except Exception:
108 | print("First train the model")
109 |
110 | # -- wait until model is active --
111 | print("--- waiting until model status is active ")
112 | stime = time.time()
113 | while True:
114 | response = client.get_model_version(modelId=MODEL_NAME, modelType = MODEL_TYPE, modelVersionNumber = model_version)
115 | if response['status'] != 'ACTIVE':
116 | print(response['status'])
117 | print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes")
118 | time.sleep(60) # sleep for 1 minute
119 | if response['status'] == 'ACTIVE':
120 | print("Model status : " + response['status'])
121 | break
122 |
123 | etime = time.time()
124 | print("Elapsed time : %s" % (etime - stime) + " seconds \n" )
125 | print(response)
126 |
127 |     # -- put detector, initializes your detector --
128 | response = client.put_detector(
129 | detectorId = DETECTOR_NAME,
130 | description = DETECTOR_DESC,
131 | eventTypeName = EVENT_TYPE )
132 |
133 | # -- decide what threshold and corresponding outcome you want to add --
134 |     # here, we create two simple rules by cutting the score at 750, with two outcomes ['fraud', 'approve']
135 |     # it will create 2 rules:
136 |     #   score > 750:  fraud
137 |     #   score <= 750: approve
138 |
139 | score_cuts = [750] # recommended to fine tune this based on your business use case
140 | outcomes = ['fraud', 'approve'] # recommended to define this based on your business use case
141 |
142 | # -- create outcomes --
143 | print(" -- create outcomes --")
144 | create_outcomes(outcomes)
145 |
146 | # -- create rules --
147 | print(" -- create rules --")
148 | rule_list = create_rules(score_cuts, outcomes)
149 |
150 | # -- create detector version --
151 |     response = client.create_detector_version(
152 | detectorId = DETECTOR_NAME,
153 | rules = rule_list,
154 | modelVersions = [{"modelId": MODEL_NAME,
155 | "modelType": MODEL_TYPE,
156 | "modelVersionNumber": model_version}],
157 | # there are 2 options for ruleExecutionMode:
158 | # 'ALL_MATCHED' - return all matched rules' outcome
159 | # 'FIRST_MATCHED' - return first matched rule's outcome
160 | ruleExecutionMode = 'FIRST_MATCHED'
161 | )
162 |
163 | print("\n -- detector created -- ")
164 | print(response)
165 |
166 | response = client.update_detector_version_status(
167 | detectorId = DETECTOR_NAME,
168 | detectorVersionId = '1',
169 | status = 'ACTIVE'
170 | )
171 | print("\n -- detector activated -- ")
172 | print(response)
173 |
174 | # -- wait until detector is active --
175 | print("\n --- waiting until detector status is active ")
176 | stime = time.time()
177 | while True:
178 | response = client.describe_detector(
179 | detectorId = DETECTOR_NAME,
180 | )
181 | if response['detectorVersionSummaries'][0]['status'] != 'ACTIVE':
182 | print(response['detectorVersionSummaries'][0]['status'])
183 |             print(f"current progress: {(time.time() - stime)/60:.2f} minutes")
184 | time.sleep(60)
185 | if response['detectorVersionSummaries'][0]['status'] == 'ACTIVE':
186 | break
187 | etime = time.time()
188 |     print(f"Elapsed time : {etime - stime:.1f} seconds\n")
189 | print(response)
190 |
191 | # -- create detector evaluation --
192 | try:
193 | client.create_batch_prediction_job (
194 | jobId = BATCH_PREDICTION_JOB,
195 | inputPath = os.path.join('s3://', BUCKET, TEST_PATH),
196 | outputPath =os.path.join('s3://', BUCKET),
197 | eventTypeName = EVENT_TYPE,
198 | detectorName = DETECTOR_NAME,
199 | detectorVersion = '1',
200 | iamRoleArn = IAM_ROLE)
201 | except Exception as e:
202 | print(e)
203 | print("batch prediction job already exists")
204 |
205 | # -- wait until batch prediction job is completed --
206 | print("\n --- waiting until batch prediction job is completed ")
207 | stime = time.time()
208 | while True:
209 | response = client.get_batch_prediction_jobs(jobId=BATCH_PREDICTION_JOB)
210 | response = response['batchPredictions'][0]
211 | if (response['status'] != 'COMPLETE') and (response['status'] != 'FAILED'):
212 |             print(f"current progress: {(time.time() - stime)/60:.2f} minutes")
213 | time.sleep(60)
214 |         if response['status'] in ('COMPLETE', 'FAILED'):  # also break on FAILED to avoid a busy loop
215 |             break
216 | etime = time.time()
217 |     print(f"Elapsed time : {etime - stime:.1f} seconds\n")
218 | print(response)
219 |
220 | # -- get batch prediction job result --
221 |     contents = s3.list_objects_v2(Bucket=BUCKET, Prefix=TEST_PATH)['Contents']
222 | print(contents)
223 | S3_SCORE_PATH = sorted([c['Key'] for c in contents if c['Key'].endswith('output.csv')])[-1]
224 | print(S3_SCORE_PATH)
225 |
226 | # -- get test performance --
227 | # Predictions
228 | print(os.path.join('s3://', BUCKET, S3_SCORE_PATH))
229 | predictions = pd.read_csv(os.path.join('s3://', BUCKET, S3_SCORE_PATH))
230 |     predictions = predictions[~predictions.MODEL_SCORES.isna()].copy()
231 |
232 |     predictions['scores'] = predictions['MODEL_SCORES'].\
233 |         apply(ast_with_nan).\
234 |         apply(lambda x: x.get(MODEL_NAME))
235 |
236 | # Labels
237 | labels = pd.read_csv(os.path.join('s3://', BUCKET, TEST_LABELS_PATH))
238 | # labels['EVENT_LABEL'] = labels['EVENT_LABEL'].map({'benign': 0, 'malignant': 1})
239 | predictions = predictions.merge(labels, on='EVENT_ID', how='left')
240 | print('Test size: ', predictions.shape)
241 |
242 | fpr, tpr, threshold = roc_curve(predictions['EVENT_LABEL'], predictions['scores'])
243 | test_auc = auc(fpr,tpr)
244 | print('AUC: ', test_auc)
245 |
246 | test_metrics = {}
247 | test_metrics['auc'] = test_auc
248 | test_metrics['fpr'] = list(fpr)
249 | test_metrics['tpr'] = list(tpr)
250 | test_metrics['threshold'] = list(threshold)
251 |
252 | # -- put test metrics in s3 --
253 | s3.put_object(
254 | Body=json.dumps(test_metrics),
255 | Bucket=BUCKET,
256 | Key='test_metrics.json')
257 |
258 | print("\n -- test metrics saved -- ")
259 |
260 | if __name__ == "__main__":
261 | afd_train_model_demo()
262 |
263 |
264 |
265 |
266 |
267 |
268 |
269 |
--------------------------------------------------------------------------------
/scripts/reproducibility/autogluon/README.md:
--------------------------------------------------------------------------------
1 | - benchmark_ag.py: a script for AutoGluon benchmarking
2 | - example-ag-ieeecis.ipynb: an example notebook using benchmark_ag.py
3 |
4 | Note that AutoGluon is not perfectly reproducible because some underlying models are not deterministically seeded; you might see slightly different results than in the paper.
5 |
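The tpr@1%fpr metric reported by `benchmark_ag.py` comes from the `get_recall` helper in `../benchmark_utils.py`, which linearly interpolates the ROC curve at the target false-positive rate. A minimal sketch with toy ROC values (not real benchmark output):

```python
import numpy as np

def get_recall(fpr, tpr, fpr_target=0.01):
    # linear interpolation of the ROC curve at the target FPR
    return np.interp(fpr_target, fpr, tpr)

# toy ROC curve points (must be sorted by fpr)
fpr = np.array([0.0, 0.01, 0.1, 1.0])
tpr = np.array([0.0, 0.40, 0.80, 1.0])
print(get_recall(fpr, tpr))  # 0.4
```

When the target FPR falls between two ROC points, the recall is interpolated linearly between them.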
--------------------------------------------------------------------------------
/scripts/reproducibility/autogluon/benchmark_ag.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import os
3 | import gc
4 | import joblib
5 | import datetime
6 |
7 | import matplotlib as mpl
8 | from sklearn.metrics import roc_auc_score, roc_curve
9 |
10 | mpl.rcParams['figure.dpi'] = 150
11 | pd.set_option('display.max_columns', 500)
12 | pd.set_option('display.max_rows', 500)
13 | pd.set_option('display.width', 200)
14 | pd.set_option('display.float_format', lambda x: '%.3f' % x)
15 |
16 | import logging
17 | FORMAT = "%(levelname)s: %(name)s: %(message)s"
18 | DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
19 | logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
20 | logger = logging.getLogger(os.path.basename(__file__))
21 | logger.setLevel(logging.DEBUG)
22 |
23 | import sys
24 | sys.path.append('../')
25 | from benchmark_utils import load_data, get_recall
26 |
27 | from autogluon.tabular import TabularPredictor
28 |
29 | def run_ag(dataset, base_path, time_limit=3600, presets=None, hyperparameters=None, feature_metadata='infer', verbosity=2):
30 | gc.collect()
31 | features, df_train, df_test = load_data(dataset, base_path)
32 |
33 | dateTimeObj = datetime.datetime.now()
34 | timestampStr = dateTimeObj.strftime("%Y%m%d_%H%M%S")
35 |
36 | suffix = (f"_{presets}" if presets is not None else "") \
37 | + (f"_{hyperparameters}" if hyperparameters is not None else "") \
38 | + ("_feature_metadata" if feature_metadata != 'infer' else "")
39 | folder = f"ag-{timestampStr}" \
40 | + suffix
41 |
42 | predictor = TabularPredictor(label='EVENT_LABEL', eval_metric='roc_auc', path=f"{base_path}/{dataset}/AutogluonModels/{folder}/",
43 | verbosity=verbosity)
44 | predictor.fit(df_train[features + ['EVENT_LABEL'] ],
45 | time_limit=time_limit, presets=presets, hyperparameters=hyperparameters, feature_metadata=feature_metadata)
46 |
47 | leaderboard = predictor.leaderboard(df_test[features + ['EVENT_LABEL'] ])
48 |
49 | leaderboard_file = "leaderboard" \
50 | + suffix \
51 | + ".csv"
52 | leaderboard.to_csv(f"{base_path}/{dataset}/{leaderboard_file}", index=False)
53 |
54 | df_pred = predictor.predict_proba(df_test[ features ],
55 | as_multiclass=False)
56 |
57 | auc = roc_auc_score(df_test['EVENT_LABEL'], df_pred)
58 | logger.info(f"auc on test data: {auc}")
59 | pos_label = predictor.positive_class
60 | fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'], df_pred,
61 | pos_label=pos_label)
62 |
63 | y_true = df_test['EVENT_LABEL']
64 | y_true = (y_true==pos_label)
65 |
66 | recall = get_recall(fpr, tpr, fpr_target=0.01)
67 | logger.info(f"tpr@1%fpr on test data: {recall}")
68 |
69 | test_metrics_ag_bq = {
70 | "labels": df_test['EVENT_LABEL'],
71 | "pred_prob": df_pred,
72 | "auc": auc,
73 | "tpr@1%fpr": recall,
74 | "fpr": fpr,
75 | "tpr": tpr,
76 | "thresholds": thresholds
77 | }
78 | metrics_file = "test_metrics_ag" \
79 | + suffix \
80 | + ".joblib"
81 | joblib.dump(test_metrics_ag_bq, f"{base_path}/{dataset}/{metrics_file}")
--------------------------------------------------------------------------------
/scripts/reproducibility/autosklearn/README.md:
--------------------------------------------------------------------------------
1 | ## Steps to reproduce Auto-sklearn models
2 |
3 |
4 | 1. Load and save the datasets locally using the [FDB Loader](../../examples/Test_FDB_Loader.ipynb). Make a note of the local `{DATASET_PATH}` for each dataset, which contains the `train.csv`, `test.csv` and `test_labels.csv` files produced by the loader.
5 |
6 | 2. Run `benchmark_autosklearn.py` as follows:
7 | ```
8 | python3 benchmark_autosklearn.py {DATASET_PATH}
9 | ```
10 |
11 | 3. After running successfully, the script saves its results in `{DATASET_PATH}`. The evaluation metrics on `test.csv` are saved in `test_metrics_autosklearn.joblib`.
12 |
13 | *Note: Python 3.7+ is required by the version of auto-sklearn used here to reproduce the results. Like other AutoML frameworks, auto-sklearn is not perfectly reproducible because some underlying models are not deterministically seeded; however, the variation in results is within acceptable error.*
14 |
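The saved metrics file is a plain Python dict serialized with joblib; a sketch of writing and reading it back (toy values stand in for the real metrics the script computes):

```python
import joblib

# toy stand-in for the metrics dict benchmark_autosklearn.py writes
test_metrics = {"auc": 0.93, "tpr@1%fpr": 0.41}
joblib.dump(test_metrics, "test_metrics_autosklearn.joblib")

loaded = joblib.load("test_metrics_autosklearn.joblib")
print(loaded["auc"], loaded["tpr@1%fpr"])  # 0.93 0.41
```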
--------------------------------------------------------------------------------
/scripts/reproducibility/autosklearn/benchmark_autosklearn.py:
--------------------------------------------------------------------------------
1 |
2 | import json
3 | import joblib
4 | import datetime
5 | import numpy as np
6 | import pandas as pd
7 | import os, sys, shutil
8 |
9 | from autosklearn.metrics import roc_auc, log_loss
10 | from autosklearn.classification import AutoSklearnClassifier
11 |
12 | from sklearn.metrics import roc_auc_score, roc_curve
13 | from pandas.api.types import is_numeric_dtype, is_string_dtype
14 |
15 | import logging
16 | FORMAT = "%(levelname)s: %(name)s: %(message)s"
17 | DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
18 | logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
19 | logger = logging.getLogger(os.path.basename(__file__))
20 | logger.setLevel(logging.DEBUG)
21 |
22 | logging_config = {
23 | 'version': 1,
24 | 'disable_existing_loggers': False,
25 | 'formatters': {
26 | 'simple': {
27 | 'format': '%(levelname)-8s %(name)-15s %(message)s'
28 | }
29 | },
30 | 'handlers':{
31 | 'console_handler': {
32 | 'class': 'logging.StreamHandler',
33 | 'formatter': 'simple'
34 | },
35 | 'file_handler': {
36 | 'class':'logging.FileHandler',
37 | 'mode': 'a',
38 | 'encoding': 'utf-8',
39 | 'filename':'main.log',
40 | 'formatter': 'simple'
41 | },
42 | 'spec_handler':{
43 | 'class':'logging.FileHandler',
44 | 'filename':'dummy_autosklearn.log',
45 | 'formatter': 'simple'
46 | },
47 | 'distributed_logfile':{
48 | 'filename':'distributed.log',
49 | 'class': 'logging.FileHandler',
50 | 'formatter': 'simple',
51 | 'level': 'DEBUG'
52 | }
53 | },
54 | 'loggers': {
55 | '': {
56 | 'level': 'INFO',
57 | 'handlers':['file_handler', 'console_handler']
58 | },
59 | 'autosklearn': {
60 | 'level': 'INFO',
61 | 'propagate': False,
62 | 'handlers': ['spec_handler']
63 | },
64 | 'smac': {
65 | 'level': 'INFO',
66 | 'propagate': False,
67 | 'handlers': ['spec_handler']
68 | },
69 | 'EnsembleBuilder': {
70 | 'level': 'INFO',
71 | 'propagate': False,
72 | 'handlers': ['spec_handler']
73 | },
74 | },
75 | }
76 |
77 | def load_data(dataset_path):
78 | logger.info(dataset_path)
79 |
80 | df_train = pd.read_csv(f"{dataset_path}/train.csv", lineterminator='\n')
81 | logger.info(df_train.shape)
82 |
83 | df_test = pd.read_csv(f"{dataset_path}/test.csv")
84 | logger.info(df_test.shape)
85 |
86 | df_test_labels = pd.read_csv(f"{dataset_path}/test_labels.csv")
87 | logger.info(df_test_labels.shape)
88 |
89 | df_test = df_test.merge(df_test_labels, how="inner", on="EVENT_ID")
90 | logger.info(df_test.shape)
91 |
92 |
93 | features_to_exclude = ("EVENT_LABEL", "EVENT_TIMESTAMP", "LABEL_TIMESTAMP", "ENTITY_TYPE", "ENTITY_ID", "EVENT_ID")
94 | features = [x for x in df_test.columns if x not in features_to_exclude ]
95 | logger.info(len(features))
96 | logger.info(features)
97 |
98 | return features, df_train, df_test
99 |
100 |
101 | def get_recall(fpr, tpr, fpr_target=0.01):
102 | return np.interp(fpr_target, fpr, tpr)
103 |
104 |
105 | def run_autosklearn(dataset_path):
106 |
107 | features, df_train, df_test = load_data(dataset_path)
108 |
109 | dateTimeObj = datetime.datetime.now()
110 | timestampStr = dateTimeObj.strftime("%Y%m%d_%H%M%S")
111 |
112 | numeric_features = [f for f in features if is_numeric_dtype(df_train[f])]
113 | categorical_features = [f for f in features if f not in numeric_features]
114 | logger.info(f'categorical: {categorical_features}')
115 | logger.info(f'numeric: {numeric_features}')
116 |
117 | labels = sorted(df_train['EVENT_LABEL'].unique())
118 | df_train['EVENT_LABEL'].replace({labels[0]: 0, labels[1]: 1}, inplace=True)
119 | df_test['EVENT_LABEL'].replace({labels[0]: 0, labels[1]: 1}, inplace=True)
120 |
121 | for df in [df_train, df_test]:
122 | df[categorical_features] = df[categorical_features].fillna('')
123 | df[categorical_features] = df[categorical_features].astype('category')
124 |
125 | out_dir = f"{dataset_path}/AutoSklearnModels/"
126 | if os.path.exists(out_dir):
127 | shutil.rmtree(out_dir)
128 |
129 | automl = AutoSklearnClassifier(
130 | metric=roc_auc,
131 | scoring_functions=[roc_auc, log_loss],
132 | tmp_folder=out_dir, # for debugging
133 | delete_tmp_folder_after_terminate=False,
134 | logging_config=logging_config,
135 | n_jobs=-1,
136 | memory_limit=None
137 | )
138 |
139 | assert len(categorical_features) + len(numeric_features) == len(features)
140 |
141 | logger.info('Fitting')
142 | automl.fit(df_train[features], df_train['EVENT_LABEL'])
143 | joblib.dump(automl, f"{dataset_path}/automl.joblib")
144 |
145 | cv = pd.DataFrame(automl.cv_results_)
146 | cv.to_csv(f"{dataset_path}/cv_results_autosklearn.csv", index=False)
147 |
148 | df_pred = automl.predict_proba(df_test[features])[:,1]
149 |
150 | auc_score = roc_auc_score(df_test['EVENT_LABEL'], df_pred)
151 | logger.info(f"auc on test data: {auc_score}")
152 |
153 | fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'], df_pred)
154 |
155 | recall = get_recall(fpr, tpr, fpr_target=0.01)
156 | logger.info(f"tpr@1%fpr on test data: {recall}")
157 |
158 | test_metrics = {
159 | "labels": df_test['EVENT_LABEL'],
160 | "pred_prob": df_pred,
161 | "auc": auc_score,
162 | "tpr@1%fpr": recall,
163 | "fpr": fpr,
164 | "tpr": tpr,
165 | "thresholds": thresholds
166 | }
167 | joblib.dump(test_metrics, f"{dataset_path}/test_metrics_autosklearn.joblib")
168 |
169 | if __name__ == "__main__":
170 | args = sys.argv
171 | logger.info(args)
172 | run_autosklearn(args[1])
173 |
--------------------------------------------------------------------------------
/scripts/reproducibility/benchmark_utils.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import os
4 |
5 | import matplotlib as mpl
6 |
7 | mpl.rcParams['figure.dpi'] = 150
8 | pd.set_option('display.max_columns', 500)
9 | pd.set_option('display.max_rows', 500)
10 | pd.set_option('display.width', 200)
11 | pd.set_option('display.float_format', lambda x: '%.3f' % x)
12 |
13 | import logging
14 | FORMAT = "%(levelname)s: %(name)s: %(message)s"
15 | DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
16 | logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
17 | logger = logging.getLogger(os.path.basename(__file__))
18 | logger.setLevel(logging.DEBUG)
19 |
20 |
21 |
22 | def load_data(dataset, base_path):
23 | logger.info(dataset)
24 |
25 | df_train = pd.read_csv(f"{base_path}/{dataset}/train.csv", lineterminator='\n')
26 | logger.info(df_train.shape)
27 |
28 | df_test = pd.read_csv(f"{base_path}/{dataset}/test.csv")
29 | logger.info(df_test.shape)
30 |
31 | df_test_labels = pd.read_csv(f"{base_path}/{dataset}/test_labels.csv")
32 | logger.info(df_test_labels.shape)
33 |
34 | df_test = df_test.merge(df_test_labels, how="inner", on="EVENT_ID")
35 | logger.info(df_test.shape)
36 |
37 |
38 | features_to_exclude = ("EVENT_LABEL", "EVENT_TIMESTAMP", "LABEL_TIMESTAMP", "ENTITY_TYPE", "ENTITY_ID", "EVENT_ID")
39 | features = [x for x in df_test.columns if x not in features_to_exclude ]
40 | logger.info(len(features))
41 | logger.info(features)
42 |
43 | return features, df_train, df_test
44 |
45 | def get_recall(fpr, tpr, fpr_target=0.01):
46 | return np.interp(fpr_target, fpr, tpr)
--------------------------------------------------------------------------------
/scripts/reproducibility/h2o/README.md:
--------------------------------------------------------------------------------
1 | - benchmark_h2o.py: a script for H2O benchmarking
2 | - example-h2o-ieeecis.ipynb: an example notebook using benchmark_h2o.py
3 |
4 | Note that H2O is not perfectly reproducible because some underlying models are not deterministically seeded; you might see slightly different results than in the paper.
5 |
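`benchmark_h2o.py` evaluates against string labels and therefore passes `pos_label` to scikit-learn's `roc_curve` explicitly. A small standalone illustration with toy data (the label names here are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve

# toy string labels and scores; 'fraud' is the positive class
y_true = np.array(['fraud', 'legit', 'fraud', 'legit'])
scores = np.array([0.9, 0.2, 0.8, 0.4])

# roc_curve cannot infer the positive class from arbitrary strings,
# so pos_label must be given explicitly
fpr, tpr, thresholds = roc_curve(y_true, scores, pos_label='fraud')
print(fpr, tpr)
```

On this toy data the classes separate perfectly, so the curve reaches tpr = 1 at fpr = 0.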
--------------------------------------------------------------------------------
/scripts/reproducibility/h2o/benchmark_h2o.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import os
3 | import gc
4 | import joblib
5 |
6 | import matplotlib as mpl
7 | from sklearn.metrics import roc_auc_score, roc_curve
8 |
9 | mpl.rcParams['figure.dpi'] = 150
10 | pd.set_option('display.max_columns', 500)
11 | pd.set_option('display.max_rows', 500)
12 | pd.set_option('display.width', 200)
13 | pd.set_option('display.float_format', lambda x: '%.3f' % x)
14 |
15 | import logging
16 | FORMAT = "%(levelname)s: %(name)s: %(message)s"
17 | DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
18 | logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
19 | logger = logging.getLogger(os.path.basename(__file__))
20 | logger.setLevel(logging.DEBUG)
21 |
22 | import sys
23 | sys.path.append('../')
24 | from benchmark_utils import load_data, get_recall
25 |
26 | import h2o
27 | from h2o.automl import H2OAutoML
28 |
29 | def run_h2o(dataset, base_path, connect_url=None, time_limit=None, include_algos=None, exclude_algos=None, verbosity="info", seed=10):
30 | if connect_url is not None:
31 | _ = h2o.connect(url=connect_url, https=True, verbose=True)
32 | h2o.cluster().show_status(True)
33 | else:
34 | h2o.init()
35 |
36 | gc.collect()
37 | features, df_train, df_test = load_data(dataset, base_path)
38 |
39 | df_train_h2o = h2o.H2OFrame(df_train)
40 | feature_types_h2o = {k:df_train_h2o.types[k] for k in df_train_h2o.types if k in features}
41 | # force test schema the same as train schema, otherwise predict will throw errors
42 | df_test_h2o = h2o.H2OFrame(df_test, column_types=feature_types_h2o)
43 |
44 | df_train_h2o['EVENT_LABEL'] = df_train_h2o['EVENT_LABEL'].asfactor()
45 | df_test_h2o['EVENT_LABEL'] = df_test_h2o['EVENT_LABEL'].asfactor()
46 |
47 | aml = H2OAutoML(max_runtime_secs = time_limit, seed = seed,
48 | include_algos=include_algos,
49 | exclude_algos=exclude_algos,
50 | export_checkpoints_dir=f"{base_path}/{dataset}/H2OModels/",
51 | verbosity=verbosity)
52 |
53 | # use validation error in the leaderboard to avoid leakage when calling aml.predict
54 | aml.train(x = features,
55 | y = 'EVENT_LABEL',
56 | training_frame = df_train_h2o,
57 | )
58 |
59 | lb = aml.leaderboard
60 | # lb.head(rows=lb.nrows)
61 |
62 | h2o.h2o.download_csv(lb, f"{base_path}/{dataset}/leaderboard_h2o.csv")
63 |
64 | lb_2 = h2o.automl.get_leaderboard(aml, extra_columns = "ALL")
65 | h2o.h2o.download_csv(lb_2, f"{base_path}/{dataset}/leaderboard_h2o_full.csv")
66 | # Get training timing info
67 | info = aml.training_info
68 | joblib.dump(info, f"{base_path}/{dataset}/training_info.joblib")
69 |
70 | df_pred_h2o = aml.predict(df_test_h2o[features])
71 | pos_label = df_test_h2o['EVENT_LABEL'].levels()[0][-1] # levels are ordered alphabetically
72 |
73 |     pos_label2 = 'p' + pos_label if pos_label == '1' else pos_label  # h2o names probability columns 'p0'/'p1' when labels are numeric
74 | df_pred_h2o = (h2o.as_list(df_pred_h2o[pos_label2]))[pos_label2]
75 |
76 | auc = roc_auc_score(df_test['EVENT_LABEL'], df_pred_h2o)
77 | logger.info(f"auc on test data: {auc}")
78 |
79 | fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'].astype(str), df_pred_h2o,
80 | pos_label=pos_label)
81 |
82 | y_true = df_test['EVENT_LABEL']
83 | y_true = (y_true.astype(str)==pos_label)
84 |
85 | recall = get_recall(fpr, tpr, fpr_target=0.01)
86 | logger.info(f"tpr@1%fpr on test data: {recall}")
87 |
88 | test_metrics_h2o = {
89 | "pos_label": pos_label,
90 | "labels": df_test['EVENT_LABEL'],
91 | "pred_prob": df_pred_h2o,
92 | "auc": auc,
93 | "tpr@1%fpr": recall,
94 | "fpr": fpr,
95 | "tpr": tpr,
96 | "thresholds": thresholds
97 | }
98 | joblib.dump(test_metrics_h2o, f"{base_path}/{dataset}/test_metrics_h2o.joblib")
99 |
100 | h2o.cluster().shutdown(prompt=False)
--------------------------------------------------------------------------------
/scripts/reproducibility/label-noise/benchmark_experiments.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "id": "c77e5eb5",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "#! pip install humanize\n",
11 | "#! pip install catboost"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "id": "f8bd366d",
17 | "metadata": {},
18 | "source": [
19 | "# Label noise\n",
20 | "\n",
21 | "\n",
22 | "## Problem statement \n",
23 |     "We have some binary classification task; traditionally we assume data of the form X, y.\n",
24 |     "\n",
25 |     "In reality, some of the labels may be incorrect; we distinguish:\n",
26 | "```\n",
27 | "y - true label\n",
28 | "y* - observed, possibly incorrect label\n",
29 | "```\n",
30 | "\n",
31 |     "This can obviously affect model training and validation. It would also affect the benchmarking process (comparing performance on noisy data doesn't tell you about performance on actual data).\n",
32 | "\n",
33 | "## Types of noise\n",
34 | "\n",
35 | "Can be completely independent:\n",
36 | "`p(y* != y | x, y) = p(y* != y)`\n",
37 | "\n",
38 | "class-dependent, depends on y:\n",
39 | "`p(y* != y | x, y) = p(y* != y | y)`\n",
40 | "\n",
41 |     "feature-dependent, depends on x (and possibly y); the general form does not simplify:\n",
42 |     "`p(y* != y | x, y)`\n",
43 | "\n",
44 | "In fraud modeling, higher likelihood of `(y*, y) = (0, 1)` than reverse.\n",
45 | "(missed fraud, label maturity, intentional data poisoning, etc.)\n",
46 | "\n",
47 |     "\"feature-dependent\" noise is probably most realistic in fraud, but there are fewer removal techniques for it and it is also harder to generate synthetically. We will work with \"boundary-consistent\" noise, where the probability of being mislabeled is weighted by distance from some decision boundary (the score from a model trained on clean data), as implemented in scikit-clean.\n",
48 | "\n",
49 | "## Literature/packages\n",
50 | "\n",
51 | "Many methods in the literature to address this; can build loss functions that are robust to noise, can try to identify and filter (remove) or clean (flip label) examples identified as noisy.\n",
52 | "\n",
53 | "Some packages including CleanLab and scikit-clean. Can also hand-code an ensemble method. Most of these are model-agnostic."
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "id": "9b172deb",
59 | "metadata": {},
60 | "source": [
61 | "## CleanLab\n",
62 | "\n",
63 | "well-established, state of the art, open source package with some theoretical guarantees\n",
64 | "\n",
65 | "score all examples with y* = 1, determine average score t_1\n",
66 | "now score all examples with y* = 0. Any that score above t_1 are marked as noise\n",
67 | "\n",
68 | "can wrap any (sklearn-compatible) model with this process. \n",
69 | "\n",
70 | "## scikit-clean \n",
71 | "\n",
72 | "library of several different approaches including filtering as well as noise generation. Is similarly designed to be model-agnostic but doesn't always do a great job (doesn't handle unencoded categorical features well). Some of its methods can also be *very* slow relative to others\n",
73 | "\n",
74 | "## micro-models\n",
75 | "\n",
76 | "slice up training data, train a model on each slice, let models vote on whether to remove data. Can use majority (more than half of models \"misclassify\" example), consensus (all models misclassify) or any other threshold.\n",
77 | "\n",
78 | "## experiment design\n",
79 | "\n",
80 |     "take 7 of the datasets - ['ieeecis', 'ccfraud', 'fraudecom', 'sparknov', 'fakejob', 'vehicleloan', 'twitterbot']\n",
81 | "* drop IP and malurl dataset as they are difficult to work with \"out of the box\"\n",
82 | "* use numerical and categorical features, target-encode categorical features (drop text and enrichable features)\n",
83 | "\n",
84 |     "add boundary-consistent noise `n` to training data (flipping both classes).\n",
85 | "\n",
86 | "values: `n in [0, 0.1, 0.2, 0.3, 0.4, 0.5]`\n",
87 | " \n",
88 | "target encoding is done after noise is added\n",
89 | " \n",
90 | "Catboost used as base classifier in all cases (with default settings)\n",
91 | "\n",
92 | "compare following methods for cleaning training data\n",
93 | "* baseline (no cleaning done)\n",
94 | "* CleanLab\n",
95 | "* scikit-clean MCS \n",
96 | "* micro-model majority voting (hand-built)\n",
97 | "* micro-model consensus voting (hand-built)\n",
98 | "\n",
99 | "measure AUC on (clean) test data\n",
100 | "\n",
101 | "repeat process 5 times for each experiment (start with clean data, add random noise, filter noise back out, train classifier, etc.), compute mean and std. dev of AUC for each\n",
102 | "\n",
103 | "CleanLab usually winds up being the best, but not uniformly. Baseline is sometimes the best for zero noise (as expected), and sometimes MCS or micro-model majority will come out ahead"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "id": "846f161f",
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "# basic imports\n",
114 | "import os\n",
115 | "import numpy as np\n",
116 | "import pandas as pd\n",
117 | "import warnings\n",
118 | "import matplotlib.pyplot as plt\n",
119 | "%matplotlib inline\n",
120 | "import humanize\n",
121 | "import pickle\n",
122 | "\n",
123 | "# basics from sklearn\n",
124 | "from sklearn.metrics import roc_auc_score\n",
125 | "from category_encoders.target_encoder import TargetEncoder\n",
126 | "\n",
127 | "# noise generation\n",
128 | "from skclean.simulate_noise import flip_labels_cc, BCNoise\n",
129 | "\n",
130 | "# base classifiers\n",
131 | "from catboost import CatBoostClassifier\n",
132 | "\n",
133 | "# cleaning methods/helpers\n",
134 | "from cleanlab.classification import CleanLearning\n",
135 | "from micro_models import MicroModelCleaner\n",
136 | "from skclean.pipeline import Pipeline\n",
137 | "from skclean.handlers import Filter\n",
138 | "from skclean.detectors import MCS\n",
139 | "\n",
140 | "# dataset loader\n",
141 | "from load_fdb_datasets import prepare_noisy_dataset, dataset_stats"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": null,
147 | "id": "85117ba5",
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
151 | "# wrapper definitions for the various types of cleaning methods we will use. \n",
152 | "# Each one wraps a model_class (in our case catboost, but could use xgboost, etc.)\n",
153 | "# resulting model_class can then take noisy data in its .fit() method and clean before training\n",
154 | "\n",
155 | "def baseline_model(model_class, params):\n",
156 | " return model_class(**params)\n",
157 | "\n",
158 | "def cleanlab_model(model_class, params, pulearning=False):\n",
159 | " if pulearning:\n",
160 | " return CleanLearning(model_class(**params), pulearning=pulearning)\n",
161 | " else:\n",
162 | " return CleanLearning(model_class(**params))\n",
163 | " \n",
164 | "def micromodels(model_class, pulearning, num_clfs, threshold, params):\n",
165 | " return MicroModelCleaner(model_class, pulearning=pulearning, num_clfs=num_clfs, threshold=threshold, **params)\n",
166 | "\n",
167 | "def skclean_MCS(model_class, params):\n",
168 | " skclean_pipeline = Pipeline([\n",
169 | " ('detector',MCS(classifier=model_class(**params))),\n",
170 | " ('handler',Filter(model_class(**params)))\n",
171 | " ])\n",
172 | " return skclean_pipeline"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "id": "bd6bcd08",
179 | "metadata": {
180 | "scrolled": true
181 | },
182 | "outputs": [],
183 | "source": [
184 | "# some high-level parameters, \n",
185 | "# the number of runs for each experiment (determine mean/std. dev)\n",
186 | "num_samples = 5 \n",
187 | "# whether to use target encoding on categorical features\n",
188 | "target_encoding = True\n",
189 | "# whether to save intermediate results to disk (in case of failure etc.)\n",
190 | "save_results = True\n",
191 | "\n",
192 | "# we will be creating a lot of classifiers, let's use the same parameters for each\n",
193 | "model_config_dict = {\n",
194 | " 'catboost': {\n",
195 | " 'model_class': CatBoostClassifier,\n",
196 | " 'default_params': {\n",
197 | " 'verbose': False,\n",
198 | " 'iterations': 100\n",
199 | " }\n",
200 | " }\n",
201 | "}\n",
202 | "\n",
203 | "# all of our experiments will use catboost and boundary-consistent noise\n",
204 | "base_model_type = 'catboost'\n",
205 | "noise_type = 'boundary-consistent'\n",
206 | "model_class = model_config_dict[base_model_type]['model_class']\n",
207 | "\n",
208 | "# the set of experimental parameters, we will iterate over all these datasets\n",
209 | "keys = ['ieeecis', 'sparknov', 'ccfraud', 'fraudecom', 'fakejob', 'vehicleloan', 'twitterbot']\n",
210 | "# all these cleaning methods\n",
211 | "clf_types = ['baseline', 'skclean_MCS', 'cleanlab', 'micromodels_majority', 'micromodels_consensus']\n",
212 | "# all these noise levels\n",
213 | "noise_amounts = [0, 0.1, 0.2, 0.3, 0.4, 0.5]\n",
214 | "# and we will let cleaning methods know that noise can happen for either class\n",
215 | "pulearning = None\n",
216 | "\n",
217 | "# a little bit of setup for saving intermediate results to disk\n",
218 |     "results_file_path = './results'\n",
219 |     "results_file_name = '{}_noise_benchmark_results.pkl'\n",
220 |     "if save_results:\n",
221 |     "    try:\n",
222 |     "        os.mkdir(results_file_path)\n",
223 |     "    except OSError as error:\n",
224 |     "        print(error) "
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": null,
230 | "id": "ef2e3bd8",
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "# initialize results dict, we will index results by dataset/noise_amount/cleaning_method\n",
235 | "results = {}\n",
236 | "\n",
237 | "# main experimental loop \n",
238 | "for key in keys:\n",
239 | " # check to see if we have already run this experiment and saved to disk\n",
240 | " full_result_path = os.path.join(results_file_path,results_file_name.format(key))\n",
241 | " if os.path.exists(full_result_path) and save_results:\n",
242 | " with open(full_result_path, 'rb') as results_file:\n",
243 | " results[key] = pickle.load(results_file)\n",
244 | " # otherwise start from scratch\n",
245 | " else:\n",
246 | " # initialize sub-results\n",
247 | " results[key] = {}\n",
248 | " model_params = model_config_dict[base_model_type]['default_params']\n",
249 | " \n",
250 | " for noise_amount in noise_amounts:\n",
251 | " print(f\"\\n =={key}_{noise_amount}== \\n\")\n",
252 | " \n",
253 | " # initialize sub-sub-results\n",
254 | " results[key][noise_amount] = {}\n",
255 | "\n",
256 | " # these are the cleaning classifiers we will use\n",
257 | " clfs = {\n",
258 | " 'baseline': baseline_model(model_class, model_params),\n",
259 | " 'skclean_MCS': skclean_MCS(model_class, model_params),\n",
260 | " 'cleanlab': cleanlab_model(model_class, model_params, pulearning),\n",
261 | " 'micromodels_majority': micromodels(model_class, pulearning=pulearning,\n",
262 | " num_clfs=8, threshold=0.5, params=model_params),\n",
263 | " 'micromodels_consensus': micromodels(model_class, pulearning=pulearning,\n",
264 | " num_clfs=8, threshold=1, params=model_params),\n",
265 | "\n",
266 | " }\n",
267 | " print('generating datasets')\n",
268 | "    # preparing a dataset has some overhead; we need num_samples copies for each dataset/noise level.\n",
269 | "    # We save a little time by generating them all in advance and reusing the same set\n",
270 | "    # for each cleaning method\n",
271 | " datasets = [prepare_noisy_dataset(key, noise_type, noise_amount, split=1, target_encoding=target_encoding) \n",
272 | " for i in range(num_samples)]\n",
273 | " \n",
274 | "    # now for each cleaning method, train a \"clean\" model on the noisy training data, then compute\n",
275 | "    # AUC on the clean test data and record the results. Do this num_samples times per cleaning method\n",
276 | "    # to estimate the mean/std. dev\n",
277 | " for clf_type in clfs:\n",
278 | " print(f\"testing {clf_type}\")\n",
279 | " auc = []\n",
280 | " try:\n",
281 | " for i in range(num_samples):\n",
282 | " # grab the dataset we need for this run and extract metadata and subsets\n",
283 | " dataset = datasets[i]\n",
284 | " features, cat_features, label = dataset['features'], dataset['cat_features'], dataset['label']\n",
285 | " train, test = dataset['train'], dataset['test']\n",
286 | " X_tr, y_tr = train[features], train[label].values.reshape(-1)\n",
287 | " X_ts, y_ts = test[features], test[label].values.reshape(-1)\n",
288 | " clf = clfs[clf_type]\n",
289 | " # fit the \"clean\" classifier on noisy training data\n",
290 | " clf.fit(X_tr, y_tr)\n",
291 | " # make predictions on clean test data and calculate AUC\n",
292 | " y_pred = clf.predict_proba(X_ts)[:, 1]\n",
293 | " auc.append(roc_auc_score(y_ts, y_pred))\n",
294 | " print(f\"{clf_type} auc: {auc}\", end=\"\\r\", flush=True)\n",
295 | " # store mean/std. dev for this run in the results dict\n",
296 | " results[key][noise_amount][clf_type] = (np.mean(auc), np.std(auc), auc)\n",
297 | " print('\\n{} auc: {:.2f} ± {:.4f}\\n'.format(clf_type,\n",
298 | " *results[key][noise_amount][clf_type][:2]))\n",
299 | " # if this run failed for some reason, handle it gracefully\n",
300 | " except Exception as e:\n",
301 | " results[key][noise_amount][clf_type] = (0, 0, [0] * num_samples)\n",
302 | " print(e)\n",
303 | " \n",
304 | " # if we are saving intermediate results to disk, do so now\n",
305 | " if save_results:\n",
306 | " with open(full_result_path, 'wb') as results_file:\n",
307 | " pickle.dump(results[key], results_file)"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": null,
313 | "id": "7a8a4509",
314 | "metadata": {
315 | "scrolled": false
316 | },
317 | "outputs": [],
318 | "source": [
319 | "# a couple of helper functions to analyze/summarize results\n",
320 | "\n",
321 | "def highlight_max(s, props=''):\n",
322 | " return np.where(s == np.nanmax(s.values), props, '')\n",
323 | "\n",
324 | "def record_places(places, scores):\n",
325 | " scores = {k: v for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)}\n",
326 | " last_score, last_stddev, last_placement = (2, 0, 1)\n",
327 | " for i, clf in enumerate(scores.keys()): \n",
328 | " if scores[clf][0] + scores[clf][1] >= last_score:\n",
329 | " placement = last_placement \n",
330 | " else:\n",
331 | " placement = i+1\n",
332 | " last_score, last_stddev = scores[clf] \n",
333 | " last_placement = i+1\n",
334 | " places[clf][placement] += 1 "
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": null,
340 | "id": "7fa49c8e",
341 | "metadata": {
342 | "scrolled": false
343 | },
344 | "outputs": [],
345 | "source": [
346 | "# create dataframe of results for each experiment, also process results into dict for keeping track of \n",
347 | "# 1st/2nd/etc. place, as well as a dict for plotting later\n",
348 | "\n",
349 | "places = {clf:{p:0 for p in range(1,len(clf_types)+1)} for clf in clf_types}\n",
350 | "plots = {key:{clf:[[],[]] for clf in clf_types} for key in keys}\n",
351 | " \n",
352 | "for key in results.keys():\n",
353 | " print(f\"\\n =={key}==\\n\")\n",
354 | " rows = pd.Index([clf_type for clf_type in clf_types])\n",
355 | " columns = pd.MultiIndex.from_product([noise_amounts, ['mean','std_dev']], names=['type 2 noise', 'auc'])\n",
356 | " df = pd.DataFrame(index=rows, columns=columns)\n",
357 | " \n",
358 | " for noise_amount in noise_amounts:\n",
359 | " scores = {}\n",
360 | " for clf_type in clf_types:\n",
361 | " auc = results[key][noise_amount][clf_type] \n",
362 | " df.loc[clf_type, (noise_amount, 'mean')] = auc[0] \n",
363 | " df.loc[clf_type, (noise_amount, 'std_dev')] = auc[1]\n",
364 | " scores[clf_type] = (auc[0], auc[1])\n",
365 | "\n",
366 | " plots[key][clf_type][0].append(noise_amount)\n",
367 | " plots[key][clf_type][1].append(auc[0])\n",
368 | " record_places(places, scores)\n",
369 | " display(df.style.set_caption(f\"{key}\")\n",
370 | " .format({(n,'mean'): \"{:.2f}\" for n in noise_amounts})\n",
371 | " .format({(n,'std_dev'): \"{:.4f}\" for n in noise_amounts})\n",
372 | " .apply(highlight_max, props='font-weight:bold;background-color:lightblue', axis=0,\n",
373 | " subset=[[n,'mean'] for n in noise_amounts]))"
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": null,
379 | "id": "8cb8dbd8",
380 | "metadata": {},
381 | "outputs": [],
382 | "source": [
383 | "# produce \"race results\" (i.e. how many first place, second place, etc. finishes)\n",
384 | "\n",
385 | "race_results = pd.DataFrame.from_dict(places).rename(index=lambda x : humanize.ordinal(x))\n",
386 | "race_results['totals'] = race_results.sum(axis=1)\n",
387 | "display(race_results)\n",
388 | "print(race_results.to_latex())"
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "execution_count": null,
394 | "id": "602877ee",
395 | "metadata": {
396 | "scrolled": false
397 | },
398 | "outputs": [],
399 | "source": [
400 | "# finally, we can plot the results of individual experiments\n",
401 | "\n",
402 | "colors = ['black','purple','green','red','orange']\n",
403 | "linestyles = ['-','--',':']\n",
404 | "ylims = {\n",
405 | " 'boundary-consistent': {\n",
406 | " 'ieeecis':[0.5,0.9],\n",
407 | " 'sparknov':[0.5,1],\n",
408 | " 'ccfraud':[0.25,1],\n",
409 | " 'fraudecom':[0.48,0.52],\n",
410 | " 'fakejob':[0.5,1],\n",
411 | " 'vehicleloan':[0.57,0.66],\n",
412 | " 'twitterbot':[0.7,0.95]\n",
413 | " },\n",
414 | " 'class-conditional': {\n",
415 | " 'ieeecis':[0.7,0.9],\n",
416 | " 'sparknov':[0.7,1],\n",
417 | " 'ccfraud':[0.8,1],\n",
418 | " 'fraudecom':[0.48,0.52],\n",
419 | " 'fakejob':[0.7,1],\n",
420 | " 'vehicleloan':[0.5,0.7],\n",
421 | " 'twitterbot':[0.8,0.95]\n",
422 | " }\n",
423 | "}\n",
424 | "\n",
425 | "x_labels = {\n",
426 | " 'boundary-consistent':'Boundary-Consistent Noise Level',\n",
427 | " 'class-conditional':'Class-Conditional Type 2 Noise Level'\n",
428 | "}\n",
429 | "\n",
430 | "legends = {\n",
431 | " 'boundary-consistent':'Cleaning Method',\n",
432 | " 'class-conditional':'Type 1 Noise, Cleaning Method'\n",
433 | "}\n",
434 | "def fix_failures(x):\n",
435 | " if x == 0:\n",
436 | " return None\n",
437 | " else:\n",
438 | " return x\n",
439 | "\n",
440 | "def labels(noise_type, noise_amount, clf_type):\n",
441 | " if noise_type == 'boundary-consistent':\n",
442 | " return '{}'.format(clf_type)\n",
443 | " elif noise_type == 'class-conditional':\n",
444 | " return '{}, {}'.format(noise_amount, clf_type)\n",
445 | "\n",
446 | "for key in results.keys():\n",
447 | " plt.figure(figsize=(10,10))\n",
448 | " \n",
449 | " for c, clf_type in enumerate(clf_types):\n",
450 | " a = plots[key][clf_type]\n",
451 | " plt.plot(a[0],[fix_failures(c) for c in a[1]],\n",
452 | " label=labels(noise_type, noise_amount, clf_type),\n",
453 | " color=colors[c],\n",
454 | " linestyle=linestyles[0])\n",
455 | " plt.title(key)\n",
456 | " plt.xlabel(x_labels[noise_type])\n",
457 | " plt.ylabel('Test AUC')\n",
458 | " plt.ylim(ylims[noise_type][key])\n",
459 | " plt.legend(title=legends[noise_type])\n",
460 | "    os.makedirs('./figures', exist_ok=True); plt.savefig(f\"./figures/label_noise_{key}.png\")\n",
461 | " plt.show()"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": null,
467 | "id": "b891c49a",
468 | "metadata": {},
469 | "outputs": [],
470 | "source": []
471 | }
472 | ],
473 | "metadata": {
474 | "kernelspec": {
475 | "display_name": "conda_python3",
476 | "language": "python",
477 | "name": "conda_python3"
478 | },
479 | "language_info": {
480 | "codemirror_mode": {
481 | "name": "ipython",
482 | "version": 3
483 | },
484 | "file_extension": ".py",
485 | "mimetype": "text/x-python",
486 | "name": "python",
487 | "nbconvert_exporter": "python",
488 | "pygments_lexer": "ipython3",
489 | "version": "3.6.13"
490 | }
491 | },
492 | "nbformat": 4,
493 | "nbformat_minor": 5
494 | }
495 |
--------------------------------------------------------------------------------
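The notebook's `record_places` helper implements a tie-aware ranking: a method shares the previous placement whenever its mean AUC plus one standard deviation reaches the previous method's mean. A standalone sketch of that logic (the classifier names and AUC values here are illustrative, not results from the benchmark):

```python
from collections import defaultdict

def record_places(places, scores):
    """scores: {clf_name: (mean_auc, std_auc)} -- tally tie-aware placements into places."""
    # sort best-first by mean AUC (ties broken by std. dev)
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    last_score, last_placement = 2, 1  # 2 is an impossible AUC, so the leader always places
    for i, (clf, (mean, std)) in enumerate(ordered):
        if mean + std >= last_score:
            # within one std. dev of the previous mean: share its placement
            placement = last_placement
        else:
            placement = i + 1
            last_score, last_placement = mean, i + 1
        places[clf][placement] += 1

places = defaultdict(lambda: defaultdict(int))
record_places(places, {'a': (0.90, 0.01), 'b': (0.89, 0.02), 'c': (0.70, 0.01)})
# 'b' is within one std. dev of 'a', so both count as 1st place; 'c' takes 3rd
```

Note that a tied method does not update `last_score`, so a long chain of methods can all share the leader's placement as long as each stays within one standard deviation of the leader's mean.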
/scripts/reproducibility/label-noise/feature_dict.py:
--------------------------------------------------------------------------------
1 | feature_dict = {
2 | 'ieeecis': {
3 | 'transactionamt': 'numeric',
4 | 'productcd': 'categorical',
5 | 'card1': 'numeric',
6 | 'card2': 'numeric',
7 | 'card3': 'numeric',
8 | 'card5': 'numeric',
9 | 'card6': 'categorical',
10 | 'addr1': 'numeric',
11 | 'dist1': 'numeric',
12 | 'p_emaildomain': 'categorical',
13 | 'r_emaildomain': 'categorical',
14 | 'c1': 'numeric',
15 | 'c2': 'numeric',
16 | 'c4': 'numeric',
17 | 'c5': 'numeric',
18 | 'c6': 'numeric',
19 | 'c7': 'numeric',
20 | 'c8': 'numeric',
21 | 'c9': 'numeric',
22 | 'c10': 'numeric',
23 | 'c11': 'numeric',
24 | 'c12': 'numeric',
25 | 'c13': 'numeric',
26 | 'c14': 'numeric',
27 | 'v62': 'numeric',
28 | 'v70': 'numeric',
29 | 'v76': 'numeric',
30 | 'v78': 'numeric',
31 | 'v82': 'numeric',
32 | 'v91': 'numeric',
33 | 'v127': 'numeric',
34 | 'v130': 'numeric',
35 | 'v139': 'numeric',
36 | 'v160': 'numeric',
37 | 'v165': 'numeric',
38 | 'v187': 'numeric',
39 | 'v203': 'numeric',
40 | 'v207': 'numeric',
41 | 'v209': 'numeric',
42 | 'v210': 'numeric',
43 | 'v221': 'numeric',
44 | 'v234': 'numeric',
45 | 'v257': 'numeric',
46 | 'v258': 'numeric',
47 | 'v261': 'numeric',
48 | 'v264': 'numeric',
49 | 'v266': 'numeric',
50 | 'v267': 'numeric',
51 | 'v271': 'numeric',
52 | 'v274': 'numeric',
53 | 'v277': 'numeric',
54 | 'v283': 'numeric',
55 | 'v285': 'numeric',
56 | 'v289': 'numeric',
57 | 'v291': 'numeric',
58 | 'v294': 'numeric',
59 | 'id_01': 'numeric',
60 | 'id_02': 'numeric',
61 | 'id_05': 'numeric',
62 | 'id_06': 'numeric',
63 | 'id_09': 'numeric',
64 | 'id_13': 'numeric',
65 | 'id_17': 'numeric',
66 | 'id_19': 'numeric',
67 | 'id_20': 'numeric',
68 | 'devicetype': 'categorical',
69 | 'deviceinfo': 'categorical'
70 | },
71 | 'ccfraud': {
72 | 'v1': 'numeric',
73 | 'v2': 'numeric',
74 | 'v3': 'numeric',
75 | 'v4': 'numeric',
76 | 'v5': 'numeric',
77 | 'v6': 'numeric',
78 | 'v7': 'numeric',
79 | 'v8': 'numeric',
80 | 'v9': 'numeric',
81 | 'v10': 'numeric',
82 | 'v11': 'numeric',
83 | 'v12': 'numeric',
84 | 'v13': 'numeric',
85 | 'v14': 'numeric',
86 | 'v15': 'numeric',
87 | 'v16': 'numeric',
88 | 'v17': 'numeric',
89 | 'v18': 'numeric',
90 | 'v19': 'numeric',
91 | 'v20': 'numeric',
92 | 'v21': 'numeric',
93 | 'v22': 'numeric',
94 | 'v23': 'numeric',
95 | 'v24': 'numeric',
96 | 'v25': 'numeric',
97 | 'v26': 'numeric',
98 | 'v27': 'numeric',
99 | 'v28': 'numeric',
100 | 'amount': 'numeric'
101 | },
102 | 'fraudecom': {
103 | 'purchase_value': 'numeric',
104 | 'source': 'categorical',
105 | 'browser': 'categorical',
106 | 'age': 'numeric',
107 | 'ip_address': 'enrichable',
108 | 'time_since_signup': 'numeric'
109 | },
110 | 'sparknov': {
111 | 'cc_num': 'categorical',
112 | 'category': 'categorical',
113 | 'amt': 'numeric',
114 | 'first': 'categorical',
115 | 'last': 'categorical',
116 | 'gender': 'categorical',
117 | 'street': 'categorical',
118 | 'city': 'categorical',
119 | 'state': 'categorical',
120 | 'zip': 'categorical',
121 | 'lat': 'numeric',
122 | 'long': 'numeric',
123 | 'city_pop': 'numeric',
124 | 'job': 'categorical',
125 | 'dob': 'text',
126 | 'merch_lat': 'numeric',
127 | 'merch_long': 'numeric'
128 | },
129 | 'twitterbot': {
130 | 'created_at' : 'text',
131 | 'default_profile': 'categorical',
132 | 'default_profile_image': 'categorical',
133 | 'description': 'text',
134 | 'favourites_count': 'numeric',
135 | 'followers_count': 'numeric',
136 | 'friends_count': 'numeric',
137 | 'geo_enabled': 'categorical',
138 | 'lang': 'categorical',
139 | 'location': 'categorical',
140 | 'profile_background_image_url': 'text',
141 | 'profile_image_url': 'text',
142 | 'screen_name': 'text',
143 | 'statuses_count': 'numeric',
144 | 'verified': 'categorical',
145 | 'average_tweets_per_day': 'numeric',
146 | 'account_age_days': 'numeric'
147 | },
148 | 'fakejob': {
149 | 'title': 'categorical',
150 | 'location': 'categorical',
151 | 'department': 'categorical',
152 | 'salary_range': 'text',
153 | 'company_profile': 'text',
154 | 'description': 'text',
155 | 'requirements': 'text',
156 | 'benefits': 'text',
157 | 'telecommuting': 'categorical',
158 | 'has_company_logo': 'categorical',
159 | 'has_questions': 'categorical',
160 | 'employment_type': 'categorical',
161 | 'required_experience': 'categorical',
162 | 'required_education': 'categorical',
163 | 'industry': 'categorical',
164 | 'function': 'categorical'
165 | },
166 | 'vehicleloan': {
167 | 'disbursed_amount': 'numeric',
168 | 'asset_cost': 'numeric',
169 | 'ltv': 'numeric',
170 | 'branch_id': 'categorical',
171 | 'supplier_id': 'categorical',
172 | 'manufacturer_id': 'categorical',
173 | 'current_pincode_id': 'categorical',
174 | 'date_of_birth': 'text',
175 | 'employment_type': 'categorical',
176 | 'state_id': 'categorical',
177 | 'employee_code_id': 'categorical',
178 | 'mobileno_avl_flag': 'categorical',
179 | 'aadhar_flag': 'categorical',
180 | 'pan_flag': 'categorical',
181 | 'voterid_flag': 'categorical',
182 | 'driving_flag': 'categorical',
183 | 'passport_flag': 'categorical',
184 | 'perform_cns_score': 'numeric',
185 | 'perform_cns_score_description': 'categorical',
186 | 'pri_no_of_accts': 'numeric',
187 | 'pri_active_accts': 'numeric',
188 | 'pri_overdue_accts': 'numeric',
189 | 'pri_current_balance': 'numeric',
190 | 'pri_sanctioned_amount': 'numeric',
191 | 'pri_disbursed_amount': 'numeric',
192 | 'sec_no_of_accts': 'numeric',
193 | 'sec_active_accts': 'numeric',
194 | 'sec_overdue_accts': 'numeric',
195 | 'sec_current_balance': 'numeric',
196 | 'sec_sanctioned_amount': 'numeric',
197 | 'sec_disbursed_amount': 'numeric',
198 | 'primary_instal_amt': 'numeric',
199 | 'sec_instal_amt': 'numeric',
200 | 'new_accts_in_last_six_months': 'numeric',
201 | 'delinquent_accts_in_last_six_months': 'numeric',
202 | 'average_acct_age': 'text',
203 | 'credit_history_length': 'text',
204 | 'no_of_inquiries': 'numeric'
205 | }
206 | }
--------------------------------------------------------------------------------
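Code downstream (e.g. `prepare_data_fdb` in `load_fdb_datasets.py`) consumes this dictionary by partitioning each dataset's features by their declared type. A minimal sketch of that pattern, using a hypothetical `toy_features` entry rather than one of the real datasets above:

```python
# hypothetical entry in the same shape as the per-dataset dicts in feature_dict
toy_features = {
    'amount': 'numeric',
    'browser': 'categorical',
    'description': 'text',
    'ip_address': 'enrichable',
}

def partition_features(fd):
    """Split a dataset's feature dict into per-type lists of feature names."""
    by_type = {'numeric': [], 'categorical': [], 'text': [], 'enrichable': []}
    for feature, ftype in fd.items():
        by_type[ftype].append(feature)
    return by_type

parts = partition_features(toy_features)
# → {'numeric': ['amount'], 'categorical': ['browser'],
#    'text': ['description'], 'enrichable': ['ip_address']}
```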
/scripts/reproducibility/label-noise/load_fdb_datasets.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import json
4 | import pandas as pd
5 | import numpy as np
6 | import warnings
7 | from datetime import datetime
8 |
9 | from category_encoders.target_encoder import TargetEncoder
10 | from skclean.simulate_noise import flip_labels_cc, BCNoise
11 |
12 | from fdb.datasets import FraudDatasetBenchmark
13 |
14 | import feature_dict
15 |
16 | DATASET_PATH = './data/dataset.csv'
17 | METADATA_PATH = './data/feature_metadata.json'
18 | FD = feature_dict.feature_dict
19 |
20 | def noise_amount(df):
21 | return df[df.noise == 1].shape[0]
22 |
23 | def noise_rate(df):
24 | if df.shape[0] > 0:
25 | return noise_amount(df)/df.shape[0]
26 | else:
27 | return None
28 |
29 | def type_1_noise_amount(df):
30 | # examples with true label 0, mislabeled as 1
31 | # here 'df.label' is the observed label, not the true one
32 | return df[(df.label==1) & (df.noise == 1)].shape[0]
33 |
34 | def type_2_noise_amount(df):
35 | # examples with true label 1, mislabeled as 0
36 | # here 'df.label' is the observed label, not the true one
37 | return df[(df.label==0) & (df.noise == 1)].shape[0]
38 |
39 | def actual_legit_amount(df):
40 | return df[(df.label == 0) | (df.noise == 1)].shape[0]
41 |
42 | def observed_legit_amount(df):
43 | return df[df.label == 0].shape[0]
44 |
45 | def actual_fraud_amount(df):
46 | return df[((df.label == 1) & (df.noise == 0)) | ((df.label == 0) & (df.noise == 1))].shape[0]
47 |
48 | def observed_fraud_amount(df):
49 | return df[df.label == 1].shape[0]
50 |
51 | def actual_fraud_rate(df):
52 | if df.shape[0] > 0:
53 | return actual_fraud_amount(df)/df.shape[0]
54 | else:
55 | return None
56 |
57 | def observed_fraud_rate(df):
58 | if df.shape[0] > 0:
59 | return observed_fraud_amount(df)/df.shape[0]
60 | else:
61 | return None
62 |
63 | def type_1_noise_rate(df):
64 | if df.shape[0] > 0:
65 | return type_1_noise_amount(df)/actual_legit_amount(df)
66 | else:
67 | return None
68 |
69 | def type_2_noise_rate(df):
70 | if df.shape[0] > 0:
71 | return type_2_noise_amount(df)/actual_fraud_amount(df)
72 | else:
73 | return None
74 |
75 | def prepare_data_fdb(key, drop_text_enr_features=True):
76 |     """
77 |     Main entry point: fetches a dataset from FDB, then does some preprocessing/cleaning so it is
78 |     suitable for modeling. Returns the data and its metadata.
79 | 
80 |     inputs:
81 |         key - the FDB dataset to load
82 |         drop_text_enr_features - whether to drop text/enrichable features
83 |     returns:
84 |         df - full pandas dataframe containing features, labels and metadata.
85 |             This includes both training and test data, with a 'dataset' column indicating which is which.
86 |             All of these datasets have a timestamp column (even if it is "fake") and by default the
87 |             data is sorted by this column; every test timestamp is later than every train timestamp.
88 | 
89 |         features - list of feature names
90 |         cat_features - list of categorical feature names (subset of features)
91 |         label - name of the label column
92 |         record_id - name of the unique id column
93 |     """
94 |
95 | obj = FraudDatasetBenchmark(key=key)
96 |
97 | print(obj.key)
98 |
99 | # extract training and testing data (and test labels) from the return object
100 | # sort training data by event timestamp
101 | train_df = obj.train.sort_values(by='EVENT_TIMESTAMP',ignore_index=True)
102 | test_df = obj.test.reset_index(drop=True)
103 | test_labels = obj.test_labels.reset_index(drop=True)
104 |
105 | # define metadata and label column names
106 | metadata = ['EVENT_LABEL', 'EVENT_TIMESTAMP', 'ENTITY_ID', 'ENTITY_TYPE', 'EVENT_ID',
107 | 'label', 'LABEL_TIMESTAMP', 'noise', 'dataset']
108 | label = ['label']
109 |
110 | # we maintain a feature dictionary in another file, this helps us determine which are categorical, numerical, etc.
111 | feature_dict = FD[key]
112 | raw_features = feature_dict.keys()
113 | num_features = [f for f in raw_features if feature_dict[f] == 'numeric']
114 | cat_features = [f for f in raw_features if feature_dict[f] == 'categorical']
115 | txt_features = [f for f in raw_features if feature_dict[f] == 'text']
116 | enr_features = [f for f in raw_features if feature_dict[f] == 'enrichable']
117 |
118 | # add / rename labels
119 | train_df.rename({'EVENT_LABEL':'label'}, axis=1, inplace=True)
120 | test_df['label'] = test_labels['EVENT_LABEL']
121 | if key == 'twitterbot':
122 | train_df.loc[train_df.label == 'bot', 'label'] = 1
123 | test_df.loc[test_df.label == 'bot', 'label'] = 1
124 | train_df.loc[train_df.label == 'human', 'label'] = 0
125 | test_df.loc[test_df.label == 'human', 'label'] = 0
126 |
127 | # put train / test into single dataframe, create a 'dataset' column to keep track
128 | train_df['dataset'] = 'train'
129 | test_df['dataset'] = 'test'
130 |
131 | # create noise column - we won't generate any noise now but it may be useful to have (can also be ignored)
132 | train_df['noise'] = 0
133 | test_df['noise'] = 0
134 |
135 | # concatenate train/test into single dataframe
136 | # (remember we have 'dataset' column to separate them again if needed)
137 | df = pd.concat([train_df, test_df], axis=0, ignore_index=True)
138 |
139 | # there are a few date columns that are timestamps, we convert those to epoch
140 | # the new values are put into new columns, those column names are added to the numerical features
141 | if key == 'twitterbot':
142 | df['eng_created_at'] = df['created_at'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d %H:%M:%S').timestamp())
143 | num_features.append('eng_created_at')
144 | if key == 'sparknov':
145 | df['eng_dob'] = df['dob'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d').timestamp())
146 | num_features.append('eng_dob')
147 |
148 | # fakejob has a salary range column, e.g. "10000 - 20000" that can be converted into two numerical columns
149 | if key == 'fakejob':
150 |         def convert(x):
151 |             # tolerate optional whitespace around the dash, e.g. "10000 - 20000"
152 |             r = re.search(r"([0-9]+)\s*-\s*([0-9]+)", str(x))
153 |             try:
154 |                 m, M = float(r.group(1)), float(r.group(2))
155 |             except (AttributeError, ValueError):
156 |                 # unparsable or missing range, default to zero
157 |                 m, M = 0.0, 0.0
158 |             return m, M
159 |
160 | df['salary_min'], df['salary_max'] = zip(*df['salary_range'].map(convert))
161 | num_features = num_features + ['salary_min','salary_max']
162 |
163 | # vehicleloan has a timestamp column that we convert to epoch
164 | # it also has "account age" and "credit history" length cols
165 | # in form "Xyrs Ymon" that can be converted to numeric
166 | if key == 'vehicleloan':
167 | df['eng_dob'] = df['date_of_birth'].apply(lambda x : datetime.strptime(x, '%d-%m-%Y').timestamp())
168 |
169 |         def convert(x):
170 |             r = re.search(r"([0-9]+)yrs ([0-9]+)mon", x)
171 |             try:
172 |                 age = 12*float(r.group(1)) + float(r.group(2))
173 |             except (AttributeError, TypeError):
174 |                 age = 0
175 |             return age
176 |
177 | df['eng_average_acct_age'] = df['average_acct_age'].apply(convert)
178 | df['eng_credit_history_length'] = df['credit_history_length'].apply(convert)
179 | num_features = num_features + ['eng_dob','eng_average_acct_age','eng_credit_history_length']
180 |
181 | # by default we will drop any remaining text or enrichable (IP address) features as we won't use them
182 | # but you can pass in False for this if they are of interest
183 | if drop_text_enr_features:
184 | df.drop(txt_features + enr_features, axis=1, inplace=True)
185 | features = num_features + cat_features
186 |
187 | # cast all numeric features to float just in case they aren't
188 | for feature in num_features:
189 | df[feature] = df[feature].astype('float64')
190 | df[feature].fillna(0, inplace=True)
191 |
192 |     # cast all categorical features to str in case they aren't
193 |     for feature in cat_features:
194 |         # fill NaNs before casting; astype(str) would turn them into the literal string 'nan'
195 |         df[feature] = df[feature].fillna('').astype(str)
196 |
197 | # rename the timestamp column
198 | df.rename({'EVENT_TIMESTAMP':'creation_date'}, axis=1, inplace=True)
199 |
200 | # cast the label to int just to be sure
201 | df['label'] = df['label'].astype('int')
202 |
203 | # name of unique id column will always be EVENT_ID
204 | record_id = 'EVENT_ID'
205 |
206 | if drop_text_enr_features:
207 | return df, features, cat_features, label, record_id
208 | else:
209 | return df, features, cat_features, txt_features, enr_features, label, record_id
210 |
211 |
212 | def add_noise(df, noise_type, noise_amount, *, time_index=None, features=None, cat_features=None, label=None):
213 |
214 |     if noise_type not in ['random', 'time-dependent', 'boundary-consistent']:
215 |         raise ValueError(f'invalid noise_type: {noise_type}')
216 |
217 |     # if we want time-dependent noise it will be useful to convert timestamps to epoch seconds
218 |     def convert_to_millis(x):
219 |         try:
220 |             m = datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').timestamp()
221 |         except ValueError:
222 |             m = datetime.strptime(x, '%Y-%m-%d %H:%M:%S').timestamp()
223 |         return m
224 |
225 | # random noise can be class-conditional in both directions (other types of noise cannot)
226 | # if noise_amount is passed in as [r,s] we can flip labels in both directions:
227 | # r is percent of 0s flipped to 1s
228 | # s is percent of 1s flipped to 0s
229 | # for random noise, if noise_amount is a single number, assume it is s, and that r=0
230 | # (i.e. class-conditional noise where only 1s get flipped to 0s)
231 |     if isinstance(noise_amount, (tuple, list)):
232 | if noise_type != 'random':
233 |             raise ValueError('For time-dependent and boundary-consistent noise, '
234 |                              'only a single value is allowed for noise_amount')
235 | r = noise_amount[0]
236 | s = noise_amount[1]
237 | else:
238 | r = 0
239 | s = noise_amount
240 |
241 | # we will add noise to a *copy* of the dataframe
242 | df_copy = df.copy()
243 |
244 | if noise_type == 'time-dependent':
245 | df_copy['event_millis'] = df_copy[time_index].apply(convert_to_millis)
246 | df_copy['event_millis'] = df_copy['event_millis'] - df_copy['event_millis'].min()
247 | mislabel = df_copy[(df_copy.noise == 0)
248 | & (df_copy.label == 1)].sample(frac = s,
249 | weights=df_copy['event_millis']).index
250 | df_copy.loc[mislabel,'noise'] = 1
251 | df_copy.loc[mislabel,'label'] = 0
252 | else:
253 | if noise_type == 'boundary-consistent':
254 | from catboost import CatBoostClassifier
255 | warnings.filterwarnings("ignore", category=FutureWarning)
256 | target_encoder = TargetEncoder(cols=cat_features)
257 | reshaped_y = df_copy[label].values.reshape(df_copy[label].shape[0],)
258 | X = target_encoder.fit_transform(df_copy[features], reshaped_y)
259 | clf = CatBoostClassifier(verbose=False)
260 | clf.fit(X, reshaped_y)
261 | _, noisy_labels = BCNoise(clf, noise_level=s).simulate_noise(X, reshaped_y)
262 | else:
263 | lcm = np.array([[1-r,r],[s,1-s]])
264 | noisy_labels = flip_labels_cc(df_copy.label,lcm)
265 |
266 | idx = (df_copy.label != noisy_labels)
267 | df_copy.loc[idx,'noise'] = 1
268 | df_copy['label'] = noisy_labels
269 |
270 | return df_copy
271 |
272 |
273 | def train_valid_split(df, split=0.7, shuffle=True, sort_key='creation_date'):
274 | if shuffle:
275 | df = df.sample(frac=1).reset_index(drop=True)
276 | else:
277 | df = df.sort_values(by=sort_key, ignore_index=True)
278 | train_idx = int(round(split*df.shape[0]))
279 | train = df[:train_idx].reset_index(drop=True)
280 | valid = df[train_idx:].reset_index(drop=True)
281 |
282 | return train, valid
283 |
284 |
285 | def prepare_noisy_dataset(key, noise_type, noise_amount, split=0.7, shuffle=True,
286 | sort_key='creation_date', target_encoding=False):
287 | """
288 | this function can be used to fetch datasets from FDB,
289 | starts by calling prepare_data_fdb and then adding noise
290 |
291 | input:
292 | key - name of FDB dataset
293 | noise_type - what type of noise to add
294 | noise_amount - how much noise to add
295 | split - training/validation split
296 | shuffle - whether or not to shuffle or sort before doing train/valid split
297 | sort_key - key to use to sort for train/valid split as well as weight for time-dependent noise
298 | """
299 |
300 | # start by getting clean dataset
301 |
302 | df, features, cat_features, label, record_id = prepare_data_fdb(key)
303 |
304 | if noise_type == 'boundary-consistent':
305 | train_and_valid = add_noise(df[df.dataset == 'train'], noise_type, noise_amount,
306 | time_index=sort_key, features=features, cat_features=cat_features, label=label)
307 | else:
308 | train_and_valid = add_noise(df[df.dataset == 'train'], noise_type, noise_amount, time_index=sort_key)
309 |
310 | train, valid = train_valid_split(train_and_valid, split, shuffle=shuffle, sort_key=sort_key)
311 | test = df[df.dataset == 'test'].reset_index(drop=True)
312 |
313 | train = train[features + ['noise'] + label]
314 | valid = valid[features + ['noise'] + label]
315 | test = test[features + ['noise'] + label]
316 |
317 | if target_encoding:
318 | warnings.filterwarnings("ignore", category=FutureWarning)
319 | target_encoder = TargetEncoder(cols=cat_features)
320 | reshaped_y = train[label].values.reshape(train[label].shape[0],)
321 | train.loc[:, features] = target_encoder.fit_transform(train[features], reshaped_y)
322 | valid.loc[:, features] = target_encoder.transform(valid[features])
323 | test.loc[:, features] = target_encoder.transform(test[features])
324 | cat_features = None
325 |
326 | dataset = {
327 | 'description': f"{key} dataset with noise type: {noise_type}, noise amount: {noise_amount} ",
328 | 'features':features,
329 | 'cat_features':cat_features,
330 | 'label':label,
331 | 'record_id':record_id,
332 | 'train':train,
333 | 'valid':valid,
334 | 'test':test,
335 | 'noise':(noise_rate(train), noise_rate(valid), noise_rate(test)),
336 | 'fraud_level':(actual_fraud_rate(train), actual_fraud_rate(valid), actual_fraud_rate(test)),
337 | 'observed_fraud_level':(observed_fraud_rate(train),observed_fraud_rate(valid),observed_fraud_rate(test)),
338 | 'type_1_noise_rate':(type_1_noise_rate(train),type_1_noise_rate(valid),type_1_noise_rate(test)),
339 | 'type_2_noise_rate':(type_2_noise_rate(train),type_2_noise_rate(valid),type_2_noise_rate(test))
340 | }
341 |
342 | return dataset
343 |
344 |
345 | def dataset_stats(dataset):
346 | noise = dataset['noise']
347 | fraud_level = dataset['fraud_level']
348 | observed_fraud_level = dataset['observed_fraud_level']
349 | type_1_noise_rate = dataset['type_1_noise_rate']
350 | type_2_noise_rate = dataset['type_2_noise_rate']
351 | stats = list(zip(['train','valid','test'],noise,type_1_noise_rate,type_2_noise_rate,fraud_level,observed_fraud_level))
352 | print(dataset['description'])
353 | for stat in stats:
354 | print('{} - total noise rate: {:.3f}, type 1 noise rate: {:.3f}, type 2 noise rate: {:.3f},\n'
355 | '(actual) fraud rate: {:.3f}, observed fraud rate: {:.3f}'.format(*stat))
356 |
357 |
--------------------------------------------------------------------------------
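The `lcm` matrix in `add_noise` (`[[1-r, r], [s, 1-s]]`) is a row-stochastic label confusion matrix: row `y` is the distribution of the *observed* label given true label `y`. A self-contained sketch of the flipping step, standing in for skclean's `flip_labels_cc` (whose internals may differ):

```python
import numpy as np

def flip_labels_cc_sketch(labels, lcm, rng):
    """Draw each observed label from row labels[i] of the row-stochastic matrix lcm."""
    return np.array([rng.choice(len(lcm[y]), p=lcm[y]) for y in np.asarray(labels)])

rng = np.random.default_rng(0)
r, s = 0.0, 0.5                          # flip no 0s; flip roughly half of the 1s
lcm = np.array([[1 - r, r], [s, 1 - s]])
y_true = np.array([0] * 1000 + [1] * 1000)
y_obs = flip_labels_cc_sketch(y_true, lcm, rng)
noise = (y_true != y_obs).astype(int)    # same bookkeeping as the 'noise' column above
```

With `r = 0` this reduces to the single-number case handled in `add_noise`: only true 1s are flipped to 0s, i.e. type 2 noise.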
/scripts/reproducibility/label-noise/micro_models.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import pandas as pd
3 | import numpy as np
4 |
5 |
6 | class MicroModelError(Exception):
7 | """
8 | basic exception type for micro-model specific errors
9 | """
10 |     def __init__(self, error_message):
11 |         logging.error(error_message)
12 |         super().__init__(error_message)
13 |
14 | class MicroModel:
15 | """
16 | Basic wrapper for the model to be used in ensemble noise removal, ModelClass can be anything that implements
17 | fit and predict_proba. Mainly used by MicroModelEnsemble, user is probably not calling this directly
18 | """
19 |
20 | def __init__(self, ModelClass, *args, **kwargs):
21 | """
22 | initialization of the class, ModelClass should be a *class* not an object
23 | e.g. CatBoostClassifier, not CatBoostClassifier()
24 | """
25 | self.clf = ModelClass(*args, **kwargs)
26 | self.thresh = None
27 |
28 | def set_thresh(self, thresh):
29 | # can set a threshold to be used in model predictions
30 | self.thresh = thresh
31 |
32 | def fit(self, x, y, *args, **kwargs):
33 | # pass-through method to call model.fit()
34 | self.clf.fit(x, y.values.ravel(), *args, **kwargs)
35 |
36 | def predict_proba(self, x, *args, **kwargs):
37 | # pass-through method to call model.predict_proba()
38 | if 'predict_proba' in dir(self.clf):
39 | return self.clf.predict_proba(x, *args, **kwargs)
40 | else:
41 | raise (MicroModelError('ModelClass must implement predict_proba'))
42 |
43 |     def predict(self, x, *args, **kwargs):
44 |         # make predictions, using either the set threshold (if any) or the default of 0.5
45 |         if self.thresh is not None:
46 |             t = self.thresh
47 |         else:
48 |             t = 0.5
49 |         scores = self.predict_proba(x, *args, **kwargs)[:, 1]
50 | preds = [int(s > t) for s in scores]
51 | return scores, preds
52 |
53 |
54 | class MicroModelEnsemble:
55 | """
56 | Ensemble of micro-models used to remove noise
57 | """
58 |
59 | def __init__(self, ModelClass, num_clfs=16, score_type='preds_avg', *args, **kwargs):
60 | """
61 | initialization of the class, ModelClass should be a *class* not an object
62 | e.g. CatBoostClassifier, not CatBoostClassifier()
63 | params:
64 | ModelClass - base class to use, needs to implement fit and predict_proba
65 | num_clfs - number of classifiers to use in cleaning ensemble
66 | score_type - means of computing anomaly score from micro-model scores
67 | args/kwargs - any other parameters to pass to model constructor, e.g. cat_features or iterations for CatBoost
68 | """
69 | self.score_type = score_type
70 |
71 |         if not isinstance(num_clfs, int) or num_clfs <= 0:
72 |             raise MicroModelError('num_clfs must be a positive integer')
73 | self.ModelClass = ModelClass
74 |
75 | # one classifier that will be trained over entire dataset
76 | self.big_clf = MicroModel(ModelClass=ModelClass, *args, **kwargs)
77 |
78 | # micro-models to later be trained over slices
79 | self.num_clfs = num_clfs
80 | self.clfs = []
81 | for i in range(num_clfs):
82 | self.clfs.append(MicroModel(ModelClass=ModelClass, *args, **kwargs))
83 | self.thresholds = {}
84 |
85 | def fit(self, x, y, *args, **kwargs):
86 |         # assumes the data is already shuffled or sorted (by date or another appropriate key)
87 |         # according to the use case
88 |
89 | if not isinstance(y, pd.DataFrame):
90 | y = pd.DataFrame(y)
91 |
92 | # fit one classifier on all the data
93 | self.big_clf.fit(x, y, *args, **kwargs)
94 |
95 | # now fit individual models on slices of data
96 | stride = round(x.shape[0] / self.num_clfs)
97 | for i, clf in enumerate(self.clfs):
98 | idx = slice(i * stride, min((i + 1) * stride, x.shape[0]))
99 | x_i = x.iloc[idx, :]
100 | y_i = y.iloc[idx, :]
101 | clf.fit(x_i, y_i, *args, **kwargs)
102 |
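As a standalone illustration of the stride computation above (a sketch, not using the class itself): because the stride is computed with `round()`, the final rows can fall outside every slice when `num_clfs` does not divide the row count evenly.

```python
# Slice layout used by MicroModelEnsemble.fit: 10 rows across 3 micro-models,
# stride = round(10 / 3) = 3.
n_rows, num_clfs = 10, 3
stride = round(n_rows / num_clfs)
slices = [slice(i * stride, min((i + 1) * stride, n_rows)) for i in range(num_clfs)]

print([(s.start, s.stop) for s in slices])  # [(0, 3), (3, 6), (6, 9)] -- row 9 falls in no slice
```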
103 | def predict_proba(self, x, *args, **kwargs):
104 |         # output is the mean over all models in the ensemble: either of their binary predictions
105 |         # (i.e. the fraction of models voting positive) or of their raw scores, per score_type
106 | results = pd.DataFrame(index=np.arange(x.shape[0]))
107 | if self.score_type == 'preds_avg':
108 | for i, clf in enumerate(self.clfs):
109 |                 _, results[i] = clf.predict(x)  # MicroModel.predict takes no extra args
110 | elif self.score_type == 'score_avg':
111 | for i, clf in enumerate(self.clfs):
112 | results[i] = clf.predict_proba(x, *args, **kwargs)[:, 1]
113 |
114 | scores = results.mean(axis=1, numeric_only=True)
115 | return scores
116 |
117 | def predict(self, x, threshold=0.5, *args, **kwargs):
118 | # compare output of predict_proba to a threshold in order to make a binary prediction, default is 0.5
119 | scores = self.predict_proba(x)
120 | preds = np.array([int(s >= threshold) for s in scores])
121 | return scores, preds
122 |
123 | def filter_noise(self, x, y, pulearning=True, threshold=0.5):
124 | # compare ensemble predictions to observed labels and return the examples that are NOT considered noise
125 | # i.e. this is noise REMOVAL
126 |         # pulearning=True makes a class-conditional assumption:
127 |         # there are no examples of true 0s mislabeled as 1s
128 | scores, susp = self.predict(x, threshold)
129 | if pulearning:
130 | conf = ((y == 1) | ((y == 0) & (susp == 0)))
131 | else:
132 | conf = (((y == 1) & (scores > 1 - threshold)) | ((y == 0) & (scores < threshold)))
133 |
134 | return x[conf].reset_index(drop=True), y[conf]
135 |
136 | def clean_noise(self, x, y, pulearning=True, threshold=0.5):
137 | # compare ensemble predictions to observed labels and return all examples with corrected labels
138 | # i.e. this is noise CLEANING
139 |         # pulearning=True makes a class-conditional assumption:
140 |         # there are no examples of true 0s mislabeled as 1s
141 | x = x.copy()
142 | y = y.copy()
143 | _, susp = self.predict(x, threshold)
144 | # flip all the probable 1s to actual 1s
145 | probable_1 = (y == 0) & (susp == 1)
146 | y[probable_1] = 1
147 | if not pulearning:
148 | # if there are both types of noise, flip probable 0s to actual 0s
149 | probable_0 = (y == 1) & (susp == 0)
150 | y[probable_0] = 0
151 |
152 | return x, y
153 |
154 |
155 | class MicroModelCleaner:
156 | """
157 |     This class performs the entire training process end-to-end: given a dataset, it first trains an
158 |     ensemble, then filters or cleans the noise, then trains a final model on the resulting data
159 | """
160 |
161 | def __init__(self, ModelClass, strategy='filter', pulearning=True, num_clfs=16, threshold=0.5, *args, **kwargs):
162 | """
163 | initialization of the class, ModelClass should be a *class* not an object
164 | e.g. CatBoostClassifier, not CatBoostClassifier()
165 | params:
166 | ModelClass - base class to use, needs to implement fit and predict_proba
167 | strategy - whether to remove noise ('filter') or flip labels ('clean')
168 |         pulearning - class-conditional assumption; if True, assume there are no true 0s mislabeled as 1s
169 | num_clfs - number of classifiers to use in cleaning ensemble
170 | threshold - percentage of classifiers that have to vote to remove noise (0.5 is majority voting)
171 | args/kwargs - any other parameters to pass to model constructor, e.g. cat_features or iterations for CatBoost
172 | """
173 | self.detector = MicroModelEnsemble(ModelClass, num_clfs, *args, **kwargs)
174 | self.clf = ModelClass(*args, **kwargs)
175 |         if strategy.lower() not in ['filter', 'clean']:
176 |             raise MicroModelError("strategy must be 'filter' or 'clean'")
177 | self.strategy = strategy.lower()
178 | self.pulearning = pulearning
179 | self.threshold = threshold
180 |
181 | def fit(self, x, y, *args, **kwargs):
182 | # first train the Ensemble to deal with the noise
183 | self.detector.fit(x, y, *args, **kwargs)
184 | if self.strategy == 'filter':
185 | x_clean, y_clean = self.detector.filter_noise(x, y, self.pulearning, self.threshold)
186 | else:
187 | x_clean, y_clean = self.detector.clean_noise(x, y, self.pulearning, self.threshold)
188 |
189 | # then train final model on clean data
190 | self.clf.fit(x_clean, y_clean, *args, **kwargs)
191 |
192 | def predict(self, x, *args, **kwargs):
193 | return self.clf.predict(x, *args, **kwargs)
194 |
195 | def predict_proba(self, x, *args, **kwargs):
196 | return self.clf.predict_proba(x, *args, **kwargs)
197 |
198 |
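To make the filter/clean masks above concrete, here is a minimal sketch of the two confidence rules using hand-picked labels and hypothetical ensemble outputs (nothing here is trained; `scores` and `susp` are stand-ins for the ensemble's `predict_proba`/`predict` results):

```python
# Observed labels and hypothetical ensemble outputs: `scores` is the fraction of
# micro-models voting positive; `susp` is its thresholded version.
y = [0, 0, 1, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.4]
threshold = 0.5
susp = [int(s >= threshold) for s in scores]

# PU assumption (pulearning=True): observed 1s are trusted, so a row is noise
# only when a 0 label gets a positive ensemble vote.
conf_pu = [yi == 1 or (yi == 0 and si == 0) for yi, si in zip(y, susp)]

# Symmetric noise (pulearning=False): both classes must agree with the ensemble.
conf_sym = [(yi == 1 and sc > 1 - threshold) or (yi == 0 and sc < threshold)
            for yi, sc in zip(y, scores)]

print(conf_pu)   # [False, True, True, True, True]
print(conf_sym)  # [False, True, True, False, True]
```

Only the first row (labeled 0, but voted positive by the ensemble) is dropped under the PU rule; the symmetric rule additionally drops the fourth row, whose observed 1 the ensemble disagrees with.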
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import os
2 | from glob import glob
3 |
4 | from setuptools import find_packages, setup
5 |
6 |
7 | setup(
8 | name='fraud_dataset_benchmark',
9 | version='1.0',
10 |
11 | # declare your packages
12 | packages=find_packages(where='src', exclude=('test',)),
13 | package_dir={'': 'src'},
14 | include_package_data=True,
15 | data_files=[('.',[
16 | 'src/fdb/versioned_datasets/ipblock/20220607.zip',
17 | ])],
18 |
19 | # Enable build-time format checking
20 | check_format=False,
21 |
22 | # Enable type checking
23 | test_mypy=False,
24 |
25 | # Enable linting at build time
26 | test_flake8=False,
27 |
28 | # exclude_package_data={
29 | # '': glob('fdb/*/__pycache__', recursive=True),
30 | # }
31 | )
32 |
--------------------------------------------------------------------------------
/src/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/src/__init__.py
--------------------------------------------------------------------------------
/src/fdb/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/src/fdb/__init__.py
--------------------------------------------------------------------------------
/src/fdb/datasets.py:
--------------------------------------------------------------------------------
1 | from abc import abstractmethod, ABC
2 | from fdb.preprocessing import *
3 | from fdb.preprocessing_objects import load_data
4 | import numpy as np  # used by eval(); previously relied on the star import above
5 | from sklearn.metrics import roc_auc_score, roc_curve
5 |
6 | class FraudDatasetBenchmark(ABC):
7 | def __init__(
8 | self,
9 | key,
10 | load_pre_downloaded=False,
11 | delete_downloaded=True,
12 | add_random_values_if_real_na = {
13 | "EVENT_TIMESTAMP": True,
14 | "LABEL_TIMESTAMP": True,
15 | "ENTITY_ID": True,
16 | "ENTITY_TYPE": True,
17 | "EVENT_ID": True
18 | }):
19 | self.key = key
20 | self.obj = load_data(self.key, load_pre_downloaded, delete_downloaded, add_random_values_if_real_na)
21 |
22 | @property
23 | def train(self):
24 | return self.obj.train
25 |
26 | @property
27 | def test(self):
28 | return self.obj.test
29 |
30 | @property
31 | def test_labels(self):
32 | return self.obj.test_labels
33 |
34 |     def eval(self, y_pred):
35 |         """
36 |         Evaluate predictions against the held-out test labels.
37 |         Returns ROC AUC and the true-positive rate at a 1% false-positive rate.
38 |         """
39 | roc_score = roc_auc_score(self.test_labels['EVENT_LABEL'], y_pred)
40 | fpr, tpr, thres = roc_curve(self.test_labels['EVENT_LABEL'], y_pred)
41 | tpr_1fpr = np.interp(0.01, fpr, tpr)
42 | metrics = {'roc_score': roc_score, 'tpr_1fpr': tpr_1fpr}
43 | return metrics
44 |
45 |
46 |
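The `tpr_1fpr` metric in `eval` interpolates the ROC curve at a 1% false-positive rate. A minimal sketch with made-up ROC points (not drawn from any benchmark dataset):

```python
import numpy as np

# Hypothetical (fpr, tpr) points such as roc_curve would return; fpr must be
# non-decreasing for np.interp to be valid.
fpr = np.array([0.0, 0.005, 0.02, 0.1, 1.0])
tpr = np.array([0.0, 0.40, 0.55, 0.80, 1.0])

# TPR at 1% FPR, interpolated between (0.005, 0.40) and (0.02, 0.55).
tpr_1fpr = np.interp(0.01, fpr, tpr)
print(round(tpr_1fpr, 2))  # 0.45
```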
--------------------------------------------------------------------------------
/src/fdb/kaggle_configs.py:
--------------------------------------------------------------------------------
1 | KAGGLE_CONFIGS = {
2 |
3 | "fakejob":
4 | {
5 | "owner": "shivamb",
6 | "dataset": "real-or-fake-fake-jobposting-prediction",
7 | "filename": 'fake_job_postings.csv',
8 | "name": "Real / Fake Job Posting Prediction",
9 | "type": "datasets",
10 | "version": 1
11 | },
12 |
13 | "vehicleloan":
14 | {
15 | "owner": "avikpaul4u",
16 | "dataset": "vehicle-loan-default-prediction",
17 | "filename": 'train.csv',
18 | "name": "Vehicle Loan Default Prediction",
19 | "type": "datasets",
20 | "version": 4
21 | },
22 |
23 | "malurl":
24 | {
25 | "owner": "sid321axn",
26 | "dataset": "malicious-urls-dataset",
27 | "filename": 'malicious_phish.csv',
28 | "name": "Malicious URLs Dataset",
29 | "type": "datasets",
30 | "version": 1
31 | },
32 |
33 | "ieeecis":
34 | {
35 | "owner": "ieee-fraud-detection",
36 | "name": "IEEE-CIS Fraud Detection",
37 | "type": "competitions",
38 | },
39 |
40 | "ccfraud":
41 | {
42 | "owner": "mlg-ulb",
43 | "dataset": "creditcardfraud",
44 | "filename": 'creditcard.csv',
45 | "name": "Credit Card Fraud Detection",
46 | "type": "datasets",
47 | "version": 3
48 | },
49 |
50 | "fraudecom":
51 | {
52 | "owner": "vbinh002",
53 | "dataset": "fraud-ecommerce",
54 | "filename": 'Fraud_Data.csv',
55 | "name": "Fraud ecommerce",
56 | "type": "datasets",
57 | "version": 1
58 | },
59 |
60 | "sparknov":
61 | {
62 | "owner": "kartik2112",
63 | "dataset": "fraud-detection",
64 | "name": "Simulated Credit Card Transactions generated using Sparkov",
65 | "type": "datasets",
66 | "version": 1
67 | },
68 |
69 | "twitterbot":
70 | {
71 | "owner": "davidmartngutirrez",
72 | "dataset": "twitter-bots-accounts",
73 | "filename": "twitter_human_bots_dataset.csv",
74 | "name": "Twitter Bots Accounts",
75 | "type": "datasets",
76 | "version": 2
77 | }
78 | }
--------------------------------------------------------------------------------
/src/fdb/preprocessing.py:
--------------------------------------------------------------------------------
1 |
2 |
3 | import os
4 | import re
5 | import shutil
6 | import kaggle
7 | import pkgutil
8 | import requests
9 | import zipfile
10 | import numpy as np
11 | from abc import ABC
12 | import pandas as pd
13 | import socket, struct
14 | from faker import Faker
15 | from zipfile import ZipFile
16 | from datetime import datetime
17 | from datetime import timedelta
18 | from io import StringIO, BytesIO
19 | from dateutil.relativedelta import relativedelta
20 |
21 | from fdb.kaggle_configs import KAGGLE_CONFIGS
22 |
23 | fake = Faker(['en_US'])
24 |
25 |
26 | # Naming convention for the meta data columns in standardized datasets
27 | _EVENT_TIMESTAMP = 'EVENT_TIMESTAMP' # timestamp column
28 | _ENTITY_TYPE = 'ENTITY_TYPE' # AFD-specific requirement
29 | _EVENT_LABEL = 'EVENT_LABEL' # label column
30 | _EVENT_ID = 'EVENT_ID' # transaction/event id
31 | _ENTITY_ID = 'ENTITY_ID' # represents user/account id
32 | _LABEL_TIMESTAMP = 'LABEL_TIMESTAMP' # added in cases where the entity id is meaningful
33 |
34 | # Kaggle config related strings
35 | _OWNER = 'owner'
36 | _COMPETITIONS = 'competitions'
37 | _TYPE = 'type'
38 | _FILENAME = 'filename'
39 | _DATASETS = 'datasets'
40 | _DATASET = 'dataset'
41 | _VERSION = 'version'
42 |
43 | # Some fixed parameters
44 | _RANDOM_STATE = 1
45 | _CWD = os.getcwd()
46 | _DOWNLOAD_LOCATION = os.path.join(_CWD, 'tmp')
47 | _TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%SZ'
48 | _DEFAULT_LABEL_TIMESTAMP = datetime.now().strftime(_TIMESTAMP_FORMAT)
49 |
50 |
51 | class BasePreProcessor(ABC):
52 | def __init__(
53 | self,
54 | key = None,
55 | train_percentage = 0.8,
56 | timestamp_col = None,
57 | label_col = None,
58 | label_timestamp_col = None,
59 | event_id_col = None,
60 | entity_id_col = None,
61 | features_to_drop = [],
62 | load_pre_downloaded = False,
63 | delete_downloaded = True,
64 | add_random_values_if_real_na = {
65 | "EVENT_TIMESTAMP": True,
66 | "LABEL_TIMESTAMP": True,
67 | "ENTITY_ID": True,
68 | "ENTITY_TYPE": True,
69 | "EVENT_ID": True
70 | }
71 | ):
72 |
73 | self.key = key
74 | self.train_percentage = train_percentage
75 | self.features_to_drop = features_to_drop
76 | self.delete_downloaded = delete_downloaded
77 |
78 | self._timestamp_col = timestamp_col
79 | self._label_col = label_col
80 | self._label_timestamp_col = label_timestamp_col
81 | self._event_id_col = event_id_col
82 | self._entity_id_col = entity_id_col
83 | self._add_random_values_if_real_na = add_random_values_if_real_na
84 |
85 | # Simply get all required objects at the time of object creation
86 | if KAGGLE_CONFIGS.get(self.key) and not load_pre_downloaded:
87 | self.download_kaggle_data() # download the data when an object is created
88 | self.load_data()
89 | self.preprocess()
90 | self.train_test_split()
91 |
92 |
93 | def _download_kaggle_data_from_competetions(self):
94 | file_name = KAGGLE_CONFIGS[self.key][_OWNER]
95 | kaggle.api.competition_download_files(
96 | competition = KAGGLE_CONFIGS[self.key][_OWNER],
97 | path = _DOWNLOAD_LOCATION
98 | )
99 | return file_name
100 |
101 | def _download_kaggle_data_from_datasets_with_given_filename(self):
102 | file_name = KAGGLE_CONFIGS[self.key][_FILENAME]
103 | response = kaggle.api.datasets_download_file(
104 | owner_slug = KAGGLE_CONFIGS[self.key][_OWNER],
105 | dataset_slug = KAGGLE_CONFIGS[self.key][_DATASET],
106 | file_name = file_name,
107 | dataset_version_number=KAGGLE_CONFIGS[self.key][_VERSION],
108 | _preload_content = False,
109 | )
110 | with open(os.path.join(_DOWNLOAD_LOCATION, file_name + '.zip'), 'wb') as f:
111 | f.write(response.data)
112 | return file_name
113 |
114 | def _download_kaggle_data_from_datasets_containing_single_file(self):
115 | file_name = KAGGLE_CONFIGS[self.key][_DATASET]
116 | kaggle.api.dataset_download_files(
117 | dataset = os.path.join(KAGGLE_CONFIGS[self.key][_OWNER], KAGGLE_CONFIGS[self.key][_DATASET]),
118 | path = _DOWNLOAD_LOCATION
119 | )
120 | return file_name
121 |
122 | def download_kaggle_data(self):
123 | """
124 | Download and extract the data from Kaggle. Puts the data in tmp directory within current directory.
125 | """
126 |
127 | if not os.path.exists(_DOWNLOAD_LOCATION):
128 | os.mkdir(_DOWNLOAD_LOCATION)
129 |
130 | print('Data download location', _DOWNLOAD_LOCATION)
131 |
132 |
133 | if KAGGLE_CONFIGS[self.key][_TYPE] == _COMPETITIONS:
134 | file_name = self._download_kaggle_data_from_competetions()
135 |
136 | elif KAGGLE_CONFIGS[self.key][_TYPE] == _DATASETS:
137 |             # If a filename is given, download that single file;
138 |             # else download all files.
139 | if KAGGLE_CONFIGS[self.key].get(_FILENAME):
140 | file_name = self._download_kaggle_data_from_datasets_with_given_filename()
141 | else:
142 | file_name = self._download_kaggle_data_from_datasets_containing_single_file()
143 |
144 | else:
145 |             raise ValueError('Type should be either competitions or datasets in config')
146 |
147 | with zipfile.ZipFile(os.path.join(_DOWNLOAD_LOCATION, file_name + '.zip'), 'r') as zip_ref:
148 | zip_ref.extractall(_DOWNLOAD_LOCATION)
149 |
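The branching in `download_kaggle_data` can be sketched without touching the Kaggle API; the configs below are a trimmed subset of `KAGGLE_CONFIGS`, and `download_route` is a hypothetical helper used here only to illustrate the dispatch:

```python
configs = {
    "ieeecis": {"type": "competitions", "owner": "ieee-fraud-detection"},
    "ccfraud": {"type": "datasets", "owner": "mlg-ulb",
                "dataset": "creditcardfraud", "filename": "creditcard.csv"},
    "sparknov": {"type": "datasets", "owner": "kartik2112", "dataset": "fraud-detection"},
}

def download_route(cfg):
    # Mirrors the dispatch in download_kaggle_data: competitions are downloaded
    # wholesale; datasets either as one named file or as the full archive.
    if cfg["type"] == "competitions":
        return "competition", cfg["owner"]
    if cfg["type"] == "datasets":
        if cfg.get("filename"):
            return "single_file", cfg["filename"]
        return "whole_dataset", cfg["dataset"]
    raise ValueError("Type should be either competitions or datasets in config")

print(download_route(configs["ieeecis"]))   # ('competition', 'ieee-fraud-detection')
print(download_route(configs["ccfraud"]))   # ('single_file', 'creditcard.csv')
print(download_route(configs["sparknov"]))  # ('whole_dataset', 'fraud-detection')
```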
150 | def load_data(self):
151 | self.df = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION, KAGGLE_CONFIGS[self.key]['filename']), dtype='object')
152 | # delete downloaded data after loading in memory
153 | if self.delete_downloaded: shutil.rmtree(_DOWNLOAD_LOCATION)
154 |
155 | @property
156 | def timestamp_col(self):
157 | return self._timestamp_col # If timestamp not available, will create fake timestamps
158 |
159 | @property
160 | def label_col(self):
161 | if self._label_col is None:
162 | raise ValueError('Label column not specified')
163 | else:
164 | return self._label_col
165 |
166 | @property
167 | def event_id_col(self):
168 | return self._event_id_col # If event id not available, will create fake event ids
169 |
170 | @property
171 | def entity_id_col(self):
172 | return self._entity_id_col
173 |
174 | def standardize_timestamp_col(self):
175 | if self.timestamp_col is not None:
176 | self.df[_EVENT_TIMESTAMP] = pd.to_datetime(self.df[self.timestamp_col]).apply(lambda x: x.strftime(_TIMESTAMP_FORMAT))
177 | self.df.drop(self.timestamp_col, axis=1, inplace=True)
178 | elif self.timestamp_col is None and self._add_random_values_if_real_na[_EVENT_TIMESTAMP]:
179 | self.df[_EVENT_TIMESTAMP] = self.df[_EVENT_LABEL].apply(
180 | lambda x: fake.date_time_between(
181 |                     start_date='-1y',  # TODO: consider a fixed start date instead of one relative to now
182 | end_date='now',
183 | tzinfo=None).strftime(_TIMESTAMP_FORMAT))
184 |
185 | if self._label_timestamp_col is None and self._add_random_values_if_real_na[_LABEL_TIMESTAMP]:
186 | self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date
187 | elif self._label_timestamp_col is not None:
188 | self.df[_LABEL_TIMESTAMP] = pd.to_datetime(self.df[self._label_timestamp_col]).apply(lambda x: x.strftime(_TIMESTAMP_FORMAT))
189 | self.df.drop(self._label_timestamp_col, axis=1, inplace=True)
190 |
191 | def standardize_label_col(self):
192 | self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True)
193 | self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].astype(int)
194 |
195 | def standardize_event_id_col(self):
196 | if self.event_id_col is not None:
197 | self.df.rename({self.event_id_col: _EVENT_ID}, axis=1, inplace=True)
198 | self.df[_EVENT_ID] = self.df[_EVENT_ID].astype(str)
199 |         elif self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]: # add a fake one if absent
200 | self.df[_EVENT_ID] = self.df[_EVENT_LABEL].apply(
201 | lambda x: fake.uuid4())
202 |
203 |
204 | def standardize_entity_id_col(self):
205 | if self.entity_id_col is not None:
206 | self.df.rename({self.entity_id_col: _ENTITY_ID}, axis=1, inplace=True)
207 |         elif self.entity_id_col is None and self._add_random_values_if_real_na[_ENTITY_ID]: # add a fake one if absent
208 | self.df[_ENTITY_ID] = self.df[_EVENT_LABEL].apply(
209 | lambda x: fake.uuid4())
210 |
211 | def rename_features(self):
212 | rename_map = {} # default is empty map that won't rename any columns
213 | self.df.rename(rename_map, axis=1, inplace=True)
214 |
215 | def subset_features(self):
216 | features_to_select = self.df.columns.tolist()
217 | self.df = self.df[features_to_select] # all by default
218 |
219 | def drop_features(self):
220 | self.df.drop(self.features_to_drop, axis=1, inplace=True)
221 |
222 | def add_meta_data(self):
223 | if self._add_random_values_if_real_na[_ENTITY_TYPE]:
224 | self.df[_ENTITY_TYPE] = 'user'
225 |
226 | def sort_by_timestamp(self):
227 | self.df.sort_values(by=_EVENT_TIMESTAMP, ascending=True, inplace=True)
228 |
229 | def lower_case_col_names(self):
230 | self.df.columns = [s.lower() for s in self.df.columns]
231 |
232 | def preprocess(self):
233 | self.lower_case_col_names()
234 | self.standardize_label_col()
235 | self.standardize_event_id_col()
236 | self.standardize_entity_id_col()
237 | self.standardize_timestamp_col()
238 | self.add_meta_data()
239 | self.rename_features()
240 | self.subset_features()
241 | self.drop_features()
242 | if self.timestamp_col:
243 | self.sort_by_timestamp()
244 |
245 | def train_test_split(self):
246 | """
247 | Default setting is out of time with 80%-20% into training and testing respectively
248 | """
249 | if self.timestamp_col:
250 | split_pt = int(self.df.shape[0]*self.train_percentage)
251 | self.train = self.df.copy().iloc[:split_pt, :]
252 | self.test = self.df.copy().iloc[split_pt:, :]
253 | else: # random if no timestamp col available
254 | self.train = self.df.sample(frac=self.train_percentage, random_state=_RANDOM_STATE)
255 | self.test = self.df.copy()[~self.df.index.isin(self.train.index)]
256 | self.test.reset_index(drop=True, inplace=True)
257 |
258 |         self.test_labels = self.test[[_EVENT_LABEL]].copy()  # copy to avoid SettingWithCopyWarning below
259 | if self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]:
260 | self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]
261 | self.test.drop([_EVENT_LABEL, _LABEL_TIMESTAMP], axis=1, inplace=True, errors="ignore")
262 |
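The default out-of-time split above reduces to simple index arithmetic once rows are sorted by `EVENT_TIMESTAMP`; a sketch with ten stand-in rows:

```python
# Rows stand in for a DataFrame already sorted by EVENT_TIMESTAMP (oldest first).
rows = [f"event_{i}" for i in range(10)]
train_percentage = 0.8

split_pt = int(len(rows) * train_percentage)
train, test = rows[:split_pt], rows[split_pt:]  # oldest 80% train, newest 20% test

print(len(train), len(test))  # 8 2
print(test)                   # ['event_8', 'event_9']
```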
263 |
264 | class FakejobPreProcessor(BasePreProcessor):
265 | def __init__(self, **kw):
266 | super(FakejobPreProcessor, self).__init__(**kw)
267 |
268 |
269 | class VehicleloanPreProcessor(BasePreProcessor):
270 | def __init__(self, **kw):
271 | super(VehicleloanPreProcessor, self).__init__(**kw)
272 |
273 |
274 | class MalurlPreProcessor(BasePreProcessor):
275 | """
276 |     This dataset originally has multiple classes of malicious URLs.
277 |     We combine all malicious classes into one to keep the benchmark binary for now.
278 |
279 | """
280 | def __init__(self, **kw):
281 | super(MalurlPreProcessor, self).__init__(**kw)
282 |
283 | def standardize_label_col(self):
284 | self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True)
285 | binary_mapper = {
286 | 'defacement': 1,
287 | 'phishing': 1,
288 | 'malware': 1,
289 | 'benign': 0
290 | }
291 |
292 | self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].map(binary_mapper)
293 |
294 | def add_dummy_col(self):
295 | self.df['dummy_cat'] = self.df[_EVENT_LABEL].apply(lambda x: fake.uuid4())
296 |
297 | def preprocess(self):
298 | super(MalurlPreProcessor, self).preprocess()
299 | self.add_dummy_col()
300 |
301 | class IEEEPreProcessor(BasePreProcessor):
302 | """
303 | Some pre-processing was done using kaggle kernels below.
304 |
305 | References:
306 | Data Source: https://www.kaggle.com/c/ieee-fraud-detection/data
307 |
308 | Some processing from: https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600
309 | Feature selection to reduce to 100: https://www.kaggle.com/code/pavelvpster/ieee-fraud-feature-selection-rfecv/notebook
310 |
311 | """
312 | def __init__(self, **kw):
313 | super(IEEEPreProcessor, self).__init__(**kw)
314 |
315 | @staticmethod
316 | def _dtypes_cols():
317 |
318 | # FIRST 53 COLUMNS
319 | cols = ['TransactionID', 'TransactionDT', 'TransactionAmt',
320 | 'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
321 | 'addr1', 'addr2', 'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain',
322 | 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11',
323 | 'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8',
324 | 'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'M1', 'M2', 'M3', 'M4',
325 | 'M5', 'M6', 'M7', 'M8', 'M9']
326 |
327 | # V COLUMNS TO LOAD DECIDED BY CORRELATION EDA
328 | # https://www.kaggle.com/cdeotte/eda-for-columns-v-and-id
329 | v = [1, 3, 4, 6, 8, 11]
330 | v += [13, 14, 17, 20, 23, 26, 27, 30]
331 | v += [36, 37, 40, 41, 44, 47, 48]
332 | v += [54, 56, 59, 62, 65, 67, 68, 70]
333 | v += [76, 78, 80, 82, 86, 88, 89, 91]
334 |
335 | #v += [96, 98, 99, 104] #relates to groups, no NAN
336 | v += [107, 108, 111, 115, 117, 120, 121, 123] # maybe group, no NAN
337 | v += [124, 127, 129, 130, 136] # relates to groups, no NAN
338 |
339 | # LOTS OF NAN BELOW
340 | v += [138, 139, 142, 147, 156, 162] #b1
341 | v += [165, 160, 166] #b1
342 | v += [178, 176, 173, 182] #b2
343 | v += [187, 203, 205, 207, 215] #b2
344 | v += [169, 171, 175, 180, 185, 188, 198, 210, 209] #b2
345 | v += [218, 223, 224, 226, 228, 229, 235] #b3
346 | v += [240, 258, 257, 253, 252, 260, 261] #b3
347 | v += [264, 266, 267, 274, 277] #b3
348 | v += [220, 221, 234, 238, 250, 271] #b3
349 |
350 |         v += [294, 284, 285, 286, 291, 297] # relates to groups, no NAN
351 | v += [303, 305, 307, 309, 310, 320] # relates to groups, no NAN
352 | v += [281, 283, 289, 296, 301, 314] # relates to groups, no NAN
353 |
354 | # COLUMNS WITH STRINGS
355 | str_type = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'R_emaildomain','M1', 'M2', 'M3', 'M4','M5',
356 | 'M6', 'M7', 'M8', 'M9', 'id_12', 'id_15', 'id_16', 'id_23', 'id_27', 'id_28', 'id_29', 'id_30',
357 | 'id_31', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38', 'DeviceType', 'DeviceInfo']
358 | str_type += ['id-12', 'id-15', 'id-16', 'id-23', 'id-27', 'id-28', 'id-29', 'id-30',
359 | 'id-31', 'id-33', 'id-34', 'id-35', 'id-36', 'id-37', 'id-38']
360 |
361 |
362 | cols += ['V'+str(x) for x in v]
363 | dtypes = {}
364 | for c in cols+['id_0'+str(x) for x in range(1,10)]+['id_'+str(x) for x in range(10,34)]+\
365 | ['id-0'+str(x) for x in range(1,10)]+['id-'+str(x) for x in range(10,34)]:
366 | dtypes[c] = 'float32'
367 | for c in str_type: dtypes[c] = 'category'
368 |
369 | return dtypes, cols
370 |
371 |
372 | def load_data(self):
373 | """
374 | Hard coded file names for this dataset as it contains multiple files to be combined
375 | """
376 |
377 | dtypes, cols = IEEEPreProcessor._dtypes_cols()
378 |
379 | self.df = pd.read_csv(
380 | os.path.join(_DOWNLOAD_LOCATION,
381 | 'train_transaction.csv'),
382 | index_col='TransactionID',
383 | dtype=dtypes,
384 | usecols=cols+['isFraud'])
385 |
386 | self.df_id = pd.read_csv(
387 | os.path.join(_DOWNLOAD_LOCATION,
388 | 'train_identity.csv'),
389 | index_col='TransactionID',
390 | dtype=dtypes)
391 | self.df = self.df.merge(self.df_id, how='left', left_index=True, right_index=True)
392 |
393 | # delete downloaded data after loading in memory
394 | if self.delete_downloaded: shutil.rmtree(_DOWNLOAD_LOCATION)
395 |
396 | def normalization(self):
397 | # NORMALIZE D COLUMNS
398 | for i in range(1,16):
399 | if i in [1,2,3,5,9]: continue
400 | self.df['d'+str(i)] = self.df['d'+str(i)] - self.df[self.timestamp_col]/np.float32(24*60*60)
401 |
402 | def standardize_entity_id_col(self):
403 | def _encode_CB(col1, col2, df):
404 | nm = col1+'_'+col2
405 | df[nm] = df[col1].astype(str)+'_'+df[col2].astype(str)
406 |
407 | _encode_CB('card1', 'addr1', self.df)
408 | self.df['day'] = self.df[self.timestamp_col] / (24*60*60)
409 | self.df[_ENTITY_ID] = self.df['card1_addr1'].astype(str) + '_' + np.floor(self.df['day'] - self.df['d1']).astype(str)
410 |
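The entity-id construction above follows the card1_addr1 "UID" trick from the referenced Kaggle kernel: anchor each transaction to its account's first-transaction day (transaction day minus D1), so rows from one card/address pair collapse to a single id. A stdlib sketch with invented values (`entity_id` is a hypothetical helper, not part of the class):

```python
import math

# Mirrors standardize_entity_id_col for a single row (values are made up).
def entity_id(card1, addr1, transaction_dt, d1):
    day = transaction_dt / (24 * 60 * 60)       # TransactionDT is in seconds
    return f"{card1}_{addr1}_{math.floor(day - d1)}"

# Two transactions by the same account, two days apart, share an id:
print(entity_id(1111, 204.0, 86400 * 10, 3.0))  # 1111_204.0_7
print(entity_id(1111, 204.0, 86400 * 12, 5.0))  # 1111_204.0_7
```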
411 | @staticmethod
412 | def _add_seconds(x):
413 | init_time = '2021-01-01T00:00:00Z'
414 | dt_format = _TIMESTAMP_FORMAT
415 | init_time = datetime.strptime(init_time, dt_format) # start date from last 18 months
416 | final_time = init_time + timedelta(seconds=x)
417 | return final_time.strftime(_TIMESTAMP_FORMAT)
418 |
419 | def standardize_timestamp_col(self):
420 | self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: IEEEPreProcessor._add_seconds(x))
421 | self.df.drop(self.timestamp_col, axis=1, inplace=True)
422 | if self._add_random_values_if_real_na["LABEL_TIMESTAMP"]:
423 | self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date
424 |
425 | def subset_features(self):
426 | features_to_select = \
427 | ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1',
428 | 'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11',
429 | 'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160',
430 | 'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264',
431 | 'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02',
432 | 'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo',
433 | 'EVENT_TIMESTAMP', 'ENTITY_ID', 'ENTITY_TYPE', 'EVENT_ID', 'EVENT_LABEL', 'LABEL_TIMESTAMP']
434 | self.df = self.df.loc[:, self.df.columns.isin(features_to_select)]
435 |
436 | def preprocess(self):
437 | self.lower_case_col_names()
438 | self.normalization() # normalize D columns
439 | self.standardize_label_col()
440 | self.standardize_event_id_col()
441 | self.standardize_entity_id_col()
442 | self.standardize_timestamp_col()
443 | self.add_meta_data()
444 | self.rename_features()
445 | self.subset_features()
446 | if self.timestamp_col:
447 | self.sort_by_timestamp()
448 |
449 |
450 | class CCFraudPreProcessor(BasePreProcessor):
451 | def __init__(self, **kw):
452 | super(CCFraudPreProcessor, self).__init__(**kw)
453 |
454 | @staticmethod
455 | def _add_minutes(x):
456 | dt_format = _TIMESTAMP_FORMAT
457 |         init_time = datetime.strptime('2021-09-01T00:00:00Z', dt_format) # arbitrary epoch within the last 18 months
458 | final_time = init_time + timedelta(minutes=x)
459 | return final_time.strftime(_TIMESTAMP_FORMAT)
460 |
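Several preprocessors map a relative time offset (seconds in IEEE-CIS, minutes here) onto a fixed epoch to synthesize absolute timestamps; a stdlib sketch of the pattern (`offset_to_timestamp` is an illustrative helper, not part of the class):

```python
from datetime import datetime, timedelta

FMT = '%Y-%m-%dT%H:%M:%SZ'

def offset_to_timestamp(offset_minutes, epoch='2021-09-01T00:00:00Z'):
    # Same idea as _add_minutes: anchor a relative offset to a chosen epoch.
    start = datetime.strptime(epoch, FMT)
    return (start + timedelta(minutes=offset_minutes)).strftime(FMT)

print(offset_to_timestamp(90))  # 2021-09-01T01:30:00Z
```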
461 | def standardize_timestamp_col(self):
462 | self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].astype(float).apply(lambda x: CCFraudPreProcessor._add_minutes(x))
463 | self.df.drop(self.timestamp_col, axis=1, inplace=True)
464 | if self._add_random_values_if_real_na[_LABEL_TIMESTAMP]:
465 | self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date
466 |
467 | class FraudecomPreProcessor(BasePreProcessor):
468 | def __init__(self, ip_address_col, signup_time_col, **kw):
469 | self.ip_address_col = ip_address_col
470 | self.signup_time_col = signup_time_col
471 | super(FraudecomPreProcessor, self).__init__(**kw)
472 |
473 | @staticmethod
474 | def _add_years(init_time):
475 | dt_format = '%Y-%m-%d %H:%M:%S'
476 | init_time = datetime.strptime(init_time, dt_format)
477 | final_time = init_time + relativedelta(years=6) # move to more recent time range
478 | return final_time.strftime(_TIMESTAMP_FORMAT)
479 |
480 |
481 | def standardize_timestamp_col(self):
482 |
483 | self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: FraudecomPreProcessor._add_years(x))
484 | self.df.drop(self.timestamp_col, axis=1, inplace=True)
485 |
486 | # Also add _LABEL_TIMESTAMP to allow training of this dataset with TFI
487 | if self._add_random_values_if_real_na[_LABEL_TIMESTAMP]:
488 | self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date
489 |
490 | def process_ip(self):
491 | """
492 | This dataset has ip address as a feature, but needs to be converted into standard IPV4.
493 | """
494 | self.df[self.ip_address_col] = self.df[self.ip_address_col].astype(float).astype(int).\
495 | apply(lambda x: socket.inet_ntoa(struct.pack('!L', x)))
496 |
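`process_ip` relies on the standard `socket`/`struct` round-trip for integer-encoded IPv4 addresses; in isolation:

```python
import socket
import struct

def int_to_ipv4(n):
    # '!L' packs the integer as an unsigned 32-bit big-endian value,
    # which inet_ntoa then renders in dotted-quad notation.
    return socket.inet_ntoa(struct.pack('!L', int(n)))

print(int_to_ipv4(3232235777))  # 192.168.1.1
print(int_to_ipv4(0))           # 0.0.0.0
```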
497 | def create_time_since_signup(self):
498 | self.df['time_since_signup'] = (
499 |             pd.to_datetime(self.df[self.timestamp_col]) -
500 |             pd.to_datetime(self.df[self.signup_time_col])).dt.total_seconds()  # total_seconds(): .dt.seconds would drop whole days
501 |
502 | def preprocess(self):
503 | self.lower_case_col_names()
504 | self.standardize_label_col()
505 | self.standardize_event_id_col()
506 | self.standardize_entity_id_col()
507 | self.create_time_since_signup() # One manually engineered feature
508 | self.standardize_timestamp_col()
509 | self.add_meta_data()
510 |         self.process_ip() # extra step specific to this dataset
511 | self.rename_features()
512 |         self.drop_features() # drop configured features instead of selecting a subset
513 | if self.timestamp_col:
514 | self.sort_by_timestamp()
515 |
516 |
517 | class SparknovPreProcessor(BasePreProcessor):
518 | def __init__(self, **kw):
519 | super(SparknovPreProcessor, self).__init__(**kw)
520 |
521 | def load_data(self):
522 | """
523 | Hard coded file names for this dataset as it contains multiple files to be combined
524 | """
525 |
526 | df_train = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION,'fraudTrain.csv'))
527 | df_train['seg'] = 'train'
528 |
529 | df_test = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION,'fraudTest.csv'))
530 | df_test['seg'] = 'test'
531 |
532 | self.df = pd.concat([df_train, df_test], ignore_index=True)
533 |
534 | # delete downloaded data after loading in memory
535 | if self.delete_downloaded: shutil.rmtree(_DOWNLOAD_LOCATION)
536 |
537 | @staticmethod
538 | def _add_months(x):
539 | _TIMESTAMP_FORMAT_SPARKNOV = '%Y-%m-%d %H:%M:%S'
540 |
541 | x = datetime.strptime(x, _TIMESTAMP_FORMAT_SPARKNOV)
542 | final_time = x + relativedelta(months=20) # chosen to move dates close to now()
543 | return final_time.strftime(_TIMESTAMP_FORMAT)
544 |
545 | def standardize_timestamp_col(self):
546 |
547 | self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: SparknovPreProcessor._add_months(x))
548 | self.df.drop(self.timestamp_col, axis=1, inplace=True)
549 | self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date
550 |
551 | def standardize_entity_id_col(self):
552 |
553 | self.df.rename({self.entity_id_col: _ENTITY_ID}, axis=1, inplace=True)
554 | self.df[_ENTITY_ID] = self.df[_ENTITY_ID].\
555 | str.lower().\
556 | apply(lambda x: re.sub(r'[^A-Za-z0-9]+', '_', x))
557 |
558 | def train_test_split(self):
559 | self.train = self.df.copy()[self.df['seg'] == 'train']
560 | self.train.reset_index(drop=True, inplace=True)
561 | self.train.drop(['seg'], axis=1, inplace=True)
562 |
563 | self.test = self.df.copy()[self.df['seg'] == 'test']
564 | self.test.reset_index(drop=True, inplace=True)
565 | self.test.drop(['seg'], axis=1, inplace=True)
566 | self.test = self.test.sample(n=20000, random_state=1)
567 |
568 | self.test_labels = self.test[[_EVENT_LABEL]]
569 | if self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]:
570 | self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]
571 | self.test.drop([_EVENT_LABEL, _LABEL_TIMESTAMP], axis=1, inplace=True, errors="ignore")
572 |
573 |
574 | class TwitterbotPreProcessor(BasePreProcessor):
575 | def __init__(self, **kw):
576 | super(TwitterbotPreProcessor, self).__init__(**kw)
577 |
578 | def standardize_label_col(self):
579 | self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True)
580 | binary_mapper = {
581 | 'bot': 1,
582 | 'human': 0
583 | }
584 |
585 | self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].map(binary_mapper)
586 |
587 |
588 | class IPBlocklistPreProcessor(BasePreProcessor):
589 | """
590 | The dataset source is http://cinsscore.com/list/ci-badguys.txt.
591 | A sign-in/sign-up is not required to download or access the latest version of this dataset.
592 |
593 | Since this dataset is not version controlled at the source, we include the version of the dataset used for the experiments
594 | discussed in the paper. The versioned dataset is as of 2022-06-07.
595 | The code defaults to this fixed version. To use the latest version from the source instead,
596 | set the 'version' argument to None.
597 | """
598 | def __init__(self, version, **kw):
599 | self.version = version # string or None. If string, picks one from versioned_datasets, else creates one from source
600 | super(IPBlocklistPreProcessor, self).__init__(**kw)
601 |
602 | def load_data(self):
603 | if self.version is None:
604 | # load malicious IPs from the source
605 | _URL = 'http://cinsscore.com/list/ci-badguys.txt' # contains confirmed malicious IPs
606 | _N_BENIGN = 200000
607 |
608 | res = requests.get(_URL)
609 | ip_mal = pd.DataFrame({'ip': [line for line in res.text.splitlines() if line]})  # one IP per line; sep='\n' is rejected by pandas
610 | ip_mal['is_ip_malign'] = 1
611 |
612 | # add fake IPs as benign
613 | ip_ben = pd.DataFrame({
614 | 'ip': [fake.ipv4() for i in range(_N_BENIGN)],
615 | 'is_ip_malign': 0
616 | })
617 |
618 | self.df = pd.concat([ip_mal, ip_ben], axis=0, ignore_index=True)
619 | else:
620 |
621 | _VERSIONED_DATA_PATH = f'versioned_datasets/{self.key}/{self.version}.zip'
622 | data = pkgutil.get_data(__name__, _VERSIONED_DATA_PATH)
623 | with zipfile.ZipFile(BytesIO(data)) as f:
624 | self.train = pd.read_csv(f.open('train.csv'))
625 | self.test = pd.read_csv(f.open('test.csv'))
626 | self.test_labels = pd.read_csv(f.open('test_labels.csv'))
627 |
628 | def add_dummy_col(self):
629 | self.df['dummy_cat'] = self.df[_EVENT_LABEL].apply(lambda x: fake.uuid4())
630 |
631 | def train_test_split(self):
632 | if self.version is None:
633 | super(IPBlocklistPreProcessor, self).train_test_split()
634 |
635 | def preprocess(self):
636 | if self.version is None:
637 | super(IPBlocklistPreProcessor, self).preprocess()
638 | self.add_dummy_col()
639 |
--------------------------------------------------------------------------------
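The float-to-IPv4 conversion in `FraudecomPreProcessor.process_ip` can be illustrated in isolation. This is a minimal sketch using only the standard library; the helper name `float_ip_to_dotted` and the sample value are illustrative, not part of the fdb package:

```python
import socket
import struct

def float_ip_to_dotted(value: float) -> str:
    """Convert a float-encoded IPv4 address (as stored in the
    Fraudecom CSV) to standard dotted notation."""
    # Truncate to int, pack as a big-endian unsigned 32-bit integer,
    # then render as a dotted quad.
    return socket.inet_ntoa(struct.pack('!L', int(value)))

print(float_ip_to_dotted(3232235777.0))  # 192.168.1.1
```

The `'!L'` format string matters: network (big-endian) byte order keeps the most significant octet first, matching how IPv4 addresses are written.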
/src/fdb/preprocessing_objects.py:
--------------------------------------------------------------------------------
1 | from fdb.preprocessing import *
2 |
3 |
4 | def load_data(key, load_pre_downloaded, delete_downloaded, add_random_values_if_real_na):
5 | common_kw = {
6 | "key": key,
7 | "load_pre_downloaded": load_pre_downloaded,
8 | "delete_downloaded": delete_downloaded,
9 | "add_random_values_if_real_na": add_random_values_if_real_na
10 | }
11 |
12 | if key == 'fakejob':
13 | obj = FakejobPreProcessor(
14 | train_percentage = 0.8,
15 | timestamp_col = None,
16 | label_col = 'fraudulent',
17 | event_id_col = 'job_id',
18 | **common_kw
19 | )
20 |
21 | elif key == 'vehicleloan':
22 | obj = VehicleloanPreProcessor(
23 | train_percentage = 0.8,
24 | timestamp_col = None,
25 | label_col = 'loan_default',
26 | event_id_col = 'uniqueid',
27 | features_to_drop = ['disbursal_date'],
28 | **common_kw
29 | )
30 |
31 | elif key == 'malurl':
32 | obj = MalurlPreProcessor(
33 | train_percentage = 0.9,
34 | timestamp_col = None,
35 | label_col = 'type',
36 | event_id_col = None,
37 | **common_kw
38 | )
39 |
40 | elif key == 'ieeecis':
41 | obj = IEEEPreProcessor(
42 | train_percentage = 0.95,
43 | timestamp_col = 'transactiondt',
44 | label_col = 'isfraud',
45 | event_id_col = None,
46 | entity_id_col = None, # manually created in code
47 | **common_kw
48 | )
49 |
50 | elif key == 'ccfraud':
51 | obj = CCFraudPreProcessor(
52 | train_percentage = 0.8,
53 | timestamp_col = 'time',
54 | label_col = 'class',
55 | event_id_col = None,
56 | **common_kw
57 | )
58 |
59 | elif key == 'fraudecom':
60 | obj = FraudecomPreProcessor(
61 | train_percentage = 0.8,
62 | timestamp_col = 'purchase_time',
63 | signup_time_col = 'signup_time',
64 | label_col = 'class',
65 | event_id_col = 'user_id',
66 | entity_id_col = 'device_id',
67 | ip_address_col = 'ip_address',
68 | features_to_drop = ['signup_time', 'sex'],
69 | **common_kw
70 | )
71 |
72 | elif key == 'sparknov':
73 | obj = SparknovPreProcessor(
74 | timestamp_col = 'trans_date_trans_time',
75 | label_col = 'is_fraud',
76 | event_id_col = 'trans_num',
77 | entity_id_col = 'merchant',
78 | features_to_drop = ['unix_time', 'unnamed: 0'],
79 | **common_kw
80 | )
81 |
82 | elif key == 'twitterbot':
83 | obj = TwitterbotPreProcessor(
84 | train_percentage = 0.8,
85 | timestamp_col = None,
86 | label_col = 'account_type',
87 | event_id_col = 'id',
88 | **common_kw
89 | )
90 |
91 | elif key == 'ipblock':
92 | obj = IPBlocklistPreProcessor(
93 | label_col = 'is_ip_malign',
94 | version = '20220607',
95 | **common_kw
96 | )
97 |
98 | else:
99 | raise ValueError(f"Invalid key: '{key}'")
100 |
101 | return obj
--------------------------------------------------------------------------------
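The if/elif chain in `load_data` maps each dataset key to a preprocessor class plus per-dataset keyword arguments, layered on top of the shared `common_kw`. The same dispatch can be sketched with a registry dict; `DemoPreProcessor` and the `'demo'` key below are hypothetical stand-ins, not real fdb classes:

```python
from typing import Any

class DemoPreProcessor:
    """Stand-in for a BasePreProcessor subclass (hypothetical)."""
    def __init__(self, key: str, label_col: str, **kw: Any) -> None:
        self.key = key
        self.label_col = label_col
        self.extra = kw

# Registry mapping each dataset key to (class, per-dataset kwargs).
_REGISTRY = {
    'demo': (DemoPreProcessor, {'label_col': 'is_fraud'}),
}

def load_data(key: str, **common_kw: Any):
    try:
        cls, dataset_kw = _REGISTRY[key]
    except KeyError:
        raise ValueError(f"Invalid key: '{key}'")
    return cls(key=key, **dataset_kw, **common_kw)

obj = load_data('demo', delete_downloaded=False)
print(obj.label_col)  # is_fraud
```

A registry keeps per-dataset configuration in one table and makes the unsupported-key error fall out of a single `KeyError` check, at the cost of losing the explicitness of the original if/elif chain.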
/src/fdb/versioned_datasets/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/src/fdb/versioned_datasets/__init__.py
--------------------------------------------------------------------------------
/src/fdb/versioned_datasets/ipblock/20220607.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/src/fdb/versioned_datasets/ipblock/20220607.zip
--------------------------------------------------------------------------------
/src/fdb/versioned_datasets/ipblock/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/amazon-science/fraud-dataset-benchmark/f100cb82959938e5272848eadb38e0e996a06aea/src/fdb/versioned_datasets/ipblock/__init__.py
--------------------------------------------------------------------------------