├── requirements.txt ├── LICENSE.txt ├── NOTICE.txt ├── README.md ├── .gitignore ├── CONTRIBUTING.md ├── example_program.py ├── CODE_OF_CONDUCT.md └── rade_classifier.py /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.21 2 | pandas 3 | scikit-learn 4 | xgboost 5 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmware-labs/efficient-supervised-anomaly-detection/main/LICENSE.txt -------------------------------------------------------------------------------- /NOTICE.txt: -------------------------------------------------------------------------------- 1 | Efficient Supervised Anomaly Detection 2 | 3 | Copyright 2021 VMware, Inc. 4 | 5 | This product is licensed to you under the BSD 3-Clause "New" or "Revised" license (the "License"). You may not use this product except in compliance with the BSD-3 License. 6 | 7 | This product may include a number of subcomponents with separate copyright notices and license terms. Your use of these subcomponents is subject to the terms and conditions of the subcomponent's license, as noted in the LICENSE file. 8 | 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # RADE scikit Classifier (v1.0) 2 | 3 | RADE is a resource-efficient decision tree ensemble method (DTEM) based anomaly 4 | detection approach that augments standard DTEM classifiers, resulting in 5 | competitive anomaly detection capabilities and significant savings in resource 6 | usage. 7 | 8 | The current implementation of RADE augments either Random-Forest or XGBoost. 9 | 10 | More information about RADE can be found in:
11 | RADE: resource‑efficient supervised anomaly detection using decision tree‑based ensemble methods (Springer ML) 12 | 13 | ## Files: 14 | 15 | #### rade_classifier.py - RADE scikit-learn classifier 16 | 17 | #### example_program.py - Basic comparison example between RF, XGBoost, and RADE 18 | 19 | 20 | ## Prerequisites: 21 | numpy
22 | pandas
23 | scikit-learn
24 | xgboost
25 | 26 | or alternatively run:
27 | $ pip3 install -r requirements.txt 28 | 29 | For more information, support, and advanced examples, contact:
30 | Yaniv Ben-Itzhak, [ybenitzhak@vmware.com](mailto:ybenitzhak@vmware.com)
31 | Shay Vargaftik, [shayv@vmware.com](mailto:shayv@vmware.com)
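
## Conceptual sketch:

For intuition, the coarse-grained/fine-grained cascade that RADE builds on can be sketched with plain scikit-learn. This is a simplified, hypothetical illustration, not RADE's exact algorithm: the thresholds and the way the fine-grained training set is constructed below are illustrative assumptions (see the paper and rade_classifier.py for the real details).

```python
# A loose conceptual sketch of a confidence-gated coarse/fine cascade.
# Thresholds and the fine-grained training-set construction are simplified
# assumptions here, not RADE's exact training procedure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=4, n_informative=2,
                           n_redundant=0, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

TCT, CCT = 0.89, 0.79  # training / classification confidence thresholds

# Coarse-grained: a small, cheap model trained on all of the data.
cg = RandomForestClassifier(n_estimators=10, max_depth=5,
                            random_state=42).fit(X_tr, y_tr)

# Fine-grained "expert": trained only on samples the coarse-grained model
# is unsure about (confidence below TCT); fall back to cg if that subset
# is empty or single-class.
low = cg.predict_proba(X_tr).max(axis=1) < TCT
fg = (RandomForestClassifier(n_estimators=25, max_depth=20,
                             random_state=42).fit(X_tr[low], y_tr[low])
      if low.any() and len(np.unique(y_tr[low])) > 1 else cg)

# Prediction: accept confident coarse-grained answers (>= CCT) and defer
# the rest to the fine-grained expert.
proba = cg.predict_proba(X_te)
pred = cg.classes_[proba.argmax(axis=1)]
defer = proba.max(axis=1) < CCT
if defer.any():
    pred[defer] = fg.predict(X_te[defer])
```

In RADE itself there are two fine-grained experts (normal and anomaly) selected by the coarse-grained prediction, rather than the single expert shown here.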
32 | 33 | 34 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Editors 2 | .vscode/ 3 | .idea/ 4 | 5 | # Vagrant 6 | .vagrant/ 7 | 8 | # Mac/OSX 9 | .DS_Store 10 | 11 | # Windows 12 | Thumbs.db 13 | 14 | # Source for the following rules: https://raw.githubusercontent.com/github/gitignore/master/Python.gitignore 15 | # Byte-compiled / optimized / DLL files 16 | __pycache__/ 17 | *.py[cod] 18 | *$py.class 19 | 20 | # C extensions 21 | *.so 22 | 23 | # Distribution / packaging 24 | .Python 25 | build/ 26 | develop-eggs/ 27 | dist/ 28 | downloads/ 29 | eggs/ 30 | .eggs/ 31 | lib/ 32 | lib64/ 33 | parts/ 34 | sdist/ 35 | var/ 36 | wheels/ 37 | *.egg-info/ 38 | .installed.cfg 39 | *.egg 40 | MANIFEST 41 | 42 | # PyInstaller 43 | # Usually these files are written by a python script from a template 44 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
45 | *.manifest 46 | *.spec 47 | 48 | # Installer logs 49 | pip-log.txt 50 | pip-delete-this-directory.txt 51 | 52 | # Unit test / coverage reports 53 | htmlcov/ 54 | .tox/ 55 | .nox/ 56 | .coverage 57 | .coverage.* 58 | .cache 59 | nosetests.xml 60 | coverage.xml 61 | *.cover 62 | .hypothesis/ 63 | .pytest_cache/ 64 | 65 | # Translations 66 | *.mo 67 | *.pot 68 | 69 | # Django stuff: 70 | *.log 71 | local_settings.py 72 | db.sqlite3 73 | 74 | # Flask stuff: 75 | instance/ 76 | .webassets-cache 77 | 78 | # Scrapy stuff: 79 | .scrapy 80 | 81 | # Sphinx documentation 82 | docs/_build/ 83 | 84 | # PyBuilder 85 | target/ 86 | 87 | # Jupyter Notebook 88 | .ipynb_checkpoints 89 | 90 | # IPython 91 | profile_default/ 92 | ipython_config.py 93 | 94 | # pyenv 95 | .python-version 96 | 97 | # celery beat schedule file 98 | celerybeat-schedule 99 | 100 | # SageMath parsed files 101 | *.sage.py 102 | 103 | # Environments 104 | .env 105 | .venv 106 | env/ 107 | venv/ 108 | ENV/ 109 | env.bak/ 110 | venv.bak/ 111 | 112 | # Spyder project settings 113 | .spyderproject 114 | .spyproject 115 | 116 | # Rope project settings 117 | .ropeproject 118 | 119 | # mkdocs documentation 120 | /site 121 | 122 | # mypy 123 | .mypy_cache/ 124 | .dmypy.json 125 | dmypy.json 126 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | 2 | # Contributing to efficient-supervised-anomaly-detection 3 | 4 | The efficient-supervised-anomaly-detection project team welcomes contributions from the community. Before you start working with efficient-supervised-anomaly-detection, please 5 | read our [Developer Certificate of Origin](https://cla.vmware.com/dco). All contributions to this repository must be 6 | signed as described on that page. Your signature certifies that you wrote the patch or have the right to pass it on 7 | as an open-source patch. 
8 | 9 | ## Contribution Flow 10 | 11 | This is a rough outline of what a contributor's workflow looks like: 12 | 13 | - Create a topic branch from where you want to base your work 14 | - Make commits of logical units 15 | - Make sure your commit messages are in the proper format (see below) 16 | - Push your changes to a topic branch in your fork of the repository 17 | - Submit a pull request 18 | 19 | Example: 20 | 21 | ``` shell 22 | git remote add upstream https://github.com/vmware/efficient-supervised-anomaly-detection.git 23 | git checkout -b my-new-feature main 24 | git commit -a 25 | git push origin my-new-feature 26 | ``` 27 | 28 | ### Staying In Sync With Upstream 29 | 30 | When your branch gets out of sync with the vmware/main branch, use the following to update: 31 | 32 | ``` shell 33 | git checkout my-new-feature 34 | git fetch --all 35 | git pull --rebase upstream main 36 | git push --force-with-lease origin my-new-feature 37 | ``` 38 | 39 | ### Updating pull requests 40 | 41 | If your PR fails to pass CI or needs changes based on code review, you'll most likely want to squash these changes into 42 | existing commits. 43 | 44 | If your pull request contains a single commit or your changes are related to the most recent commit, you can simply 45 | amend the commit. 46 | 47 | ``` shell 48 | git add . 49 | git commit --amend 50 | git push --force-with-lease origin my-new-feature 51 | ``` 52 | 53 | If you need to squash changes into an earlier commit, you can use: 54 | 55 | ``` shell 56 | git add . 57 | git commit --fixup <commit> 58 | git rebase -i --autosquash main 59 | git push --force-with-lease origin my-new-feature 60 | ``` 61 | 62 | Be sure to add a comment to the PR indicating your new changes are ready to review, as GitHub does not generate a 63 | notification when you `git push`. 64 | 65 | ### Code Style 66 | 67 | ### Formatting Commit Messages 68 | 69 | We follow the conventions on [How to Write a Git Commit Message](http://chris.beams.io/posts/git-commit/). 
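For example, a commit message following those conventions keeps the subject line imperative and at most 50 characters, separates it from the body with a blank line, and wraps the body at 72 characters (the subject below is purely illustrative):

```
Add confidence-threshold warning to RadeClassifier

Print a warning from verify_parameters() when the classification
confidence threshold (CCT) is configured above the training confidence
threshold (TCT).
```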
70 | 71 | Be sure to include any related GitHub issue references in the commit message. See 72 | [GFM syntax](https://guides.github.com/features/mastering-markdown/#GitHub-flavored-markdown) for referencing issues 73 | and commits. 74 | 75 | ## Reporting Bugs and Creating Issues 76 | 77 | When opening a new issue, try to roughly follow the commit message format conventions above. 78 | -------------------------------------------------------------------------------- /example_program.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018-2021 VMware, Inc. 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | 4 | from rade_classifier import RadeClassifier 5 | from sklearn.ensemble import RandomForestClassifier 6 | from xgboost import XGBClassifier 7 | 8 | from sklearn.datasets import make_classification 9 | from sklearn.metrics import classification_report 10 | from sklearn.model_selection import train_test_split 11 | from sklearn.metrics import f1_score 12 | 13 | import time 14 | 15 | X, y = make_classification(n_samples=1000000, n_features=4, 16 | n_informative=2, n_redundant=0, 17 | random_state=0, shuffle=False, weights=[0.99, 0.01]) 18 | 19 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42) 20 | 21 | # RADE 22 | rade_clf = RadeClassifier() 23 | 24 | start = time.time() 25 | rade_clf.fit(X_train, y_train) 26 | end = time.time() 27 | rade_train_time = end - start 28 | print("RADE training time is: {:0.2f} seconds".format(rade_train_time)) 29 | 30 | start = time.time() 31 | rade_y_predicted = rade_clf.predict(X_test) 32 | end = time.time() 33 | rade_predict_time = end - start 34 | print("RADE prediction time is: {:0.2f} seconds".format(rade_predict_time)) 35 | 36 | rade_f1 = f1_score(y_test, rade_y_predicted, average='macro') 37 | print("RADE classification_report:") 38 | print(classification_report(y_test, rade_y_predicted, digits=5)) 39 | 40 | # Random Forest 41 | 
rf_clf = RandomForestClassifier() 42 | 43 | start = time.time() 44 | rf_clf.fit(X_train, y_train) 45 | end = time.time() 46 | rf_train_time = end - start 47 | print("Random Forest training time is: {:0.2f} seconds".format(rf_train_time)) 48 | 49 | start = time.time() 50 | rf_y_predicted = rf_clf.predict(X_test) 51 | end = time.time() 52 | rf_predict_time = end - start 53 | print("Random Forest prediction time is: {:0.2f} seconds".format(rf_predict_time)) 54 | 55 | rf_f1 = f1_score(y_test, rf_y_predicted, average='macro') 56 | print("Random Forest classification_report:") 57 | print(classification_report(y_test, rf_y_predicted, digits=5)) 58 | 59 | # XGBoost 60 | xgb_clf = XGBClassifier() 61 | 62 | start = time.time() 63 | xgb_clf.fit(X_train, y_train) 64 | end = time.time() 65 | xgb_train_time = end - start 66 | print("XGBoost training time is: {:0.2f} seconds".format(xgb_train_time)) 67 | 68 | start = time.time() 69 | xgb_y_predicted = xgb_clf.predict(X_test) 70 | end = time.time() 71 | xgb_predict_time = end - start 72 | print("XGBoost prediction time is: {:0.2f} seconds".format(xgb_predict_time)) 73 | 74 | xgb_f1 = f1_score(y_test, xgb_y_predicted, average='macro') 75 | print("XGBoost classification_report:") 76 | print(classification_report(y_test, xgb_y_predicted, digits=5)) 77 | 78 | print('RADE vs. Random Forest and XGBoost:') 79 | print('Training time: RADE is {:0.1f}x faster than Random Forest, and {:0.1f}x faster than XGBoost.'. 80 | format(rf_train_time / rade_train_time, xgb_train_time / rade_train_time)) 81 | print('Prediction time: RADE is {:0.1f}x faster than Random Forest, and {:0.1f}x faster than XGBoost.'. 82 | format(rf_predict_time / rade_predict_time, xgb_predict_time / rade_predict_time)) 83 | print('Macro F1: RADE is better by {:+0.5f} as compared to Random Forest, and by {:+0.5f} as compared to XGBoost.'. 
84 | format(rade_f1 - rf_f1, rade_f1 - xgb_f1)) 85 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Contributor Covenant Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | We as members, contributors, and leaders pledge to make participation in the efficient-supervised-anomaly-detection project and our 7 | community a harassment-free experience for everyone, regardless of age, body 8 | size, visible or invisible disability, ethnicity, sex characteristics, gender 9 | identity and expression, level of experience, education, socio-economic status, 10 | nationality, personal appearance, race, religion, or sexual identity 11 | and orientation. 12 | 13 | We pledge to act and interact in ways that contribute to an open, welcoming, 14 | diverse, inclusive, and healthy community. 15 | 16 | ## Our Standards 17 | 18 | Examples of behavior that contributes to a positive environment for our 19 | community include: 20 | 21 | * Demonstrating empathy and kindness toward other people 22 | * Being respectful of differing opinions, viewpoints, and experiences 23 | * Giving and gracefully accepting constructive feedback 24 | * Accepting responsibility and apologizing to those affected by our mistakes, 25 | and learning from the experience 26 | * Focusing on what is best not just for us as individuals, but for the 27 | overall community 28 | 29 | Examples of unacceptable behavior include: 30 | 31 | * The use of sexualized language or imagery, and sexual attention or 32 | advances of any kind 33 | * Trolling, insulting or derogatory comments, and personal or political attacks 34 | * Public or private harassment 35 | * Publishing others' private information, such as a physical or email 36 | address, without their explicit permission 37 | * Other conduct which could reasonably be considered inappropriate in a 38 | professional setting 39 | 40 | ## 
Enforcement Responsibilities 41 | 42 | Community leaders are responsible for clarifying and enforcing our standards of 43 | acceptable behavior and will take appropriate and fair corrective action in 44 | response to any behavior that they deem inappropriate, threatening, offensive, 45 | or harmful. 46 | 47 | Community leaders have the right and responsibility to remove, edit, or reject 48 | comments, commits, code, wiki edits, issues, and other contributions that are 49 | not aligned to this Code of Conduct, and will communicate reasons for moderation 50 | decisions when appropriate. 51 | 52 | ## Scope 53 | 54 | This Code of Conduct applies within all community spaces, and also applies when 55 | an individual is officially representing the community in public spaces. 56 | Examples of representing our community include using an official e-mail address, 57 | posting via an official social media account, or acting as an appointed 58 | representative at an online or offline event. 59 | 60 | ## Enforcement 61 | 62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 63 | reported to the community leaders responsible for enforcement at oss-coc@vmware.com. 64 | All complaints will be reviewed and investigated promptly and fairly. 65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 
82 | 83 | ### 2. Warning 84 | 85 | **Community Impact**: A violation through a single incident or series 86 | of actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or 93 | permanent ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within 113 | the community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.0, available at 119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. 120 | 121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct 122 | enforcement ladder](https://github.com/mozilla/diversity). 
123 | 124 | [homepage]: https://www.contributor-covenant.org 125 | 126 | For answers to common questions about this code of conduct, see the FAQ at 127 | https://www.contributor-covenant.org/faq. Translations are available at 128 | https://www.contributor-covenant.org/translations. -------------------------------------------------------------------------------- /rade_classifier.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018-2021 VMware, Inc. 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | 4 | #!/usr/bin/env python3 5 | 6 | 7 | import numpy as np 8 | import pandas as pd 9 | 10 | from sklearn.base import BaseEstimator, ClassifierMixin 11 | from sklearn.ensemble import RandomForestClassifier 12 | from xgboost import XGBClassifier 13 | from sklearn.utils.multiclass import unique_labels 14 | from sklearn.utils.validation import check_is_fitted, check_X_y, check_array 15 | 16 | 17 | ############################################################################### 18 | ############################################################################### 19 | 20 | class RadeClassifier(BaseEstimator, ClassifierMixin): 21 | """ 22 | A RADE classifier. 23 | 24 | An efficient classifier that augments either a Random Forest or an XGBoost classifier to 25 | obtain lower model memory size, lower training time and lower classification latency. 26 | 27 | The main building blocks of RADE are: 28 | 29 | Coarse-grained classifier - a small model that is trained using the entire training dataset. 30 | The coarse-grained classifier is sufficient to classify the majority of the classification queries correctly, such 31 | that a classification is valid only if its corresponding confidence level is greater than or equal to the 32 | classification confidence threshold. 
33 | 34 | Fine-grained classifiers - 'expert' classifiers that are trained to succeed specifically where the coarse-grained model is not sufficiently 35 | confident and is more likely to make a classification mistake. 36 | 37 | 38 | Parameters 39 | ---------- 40 | base_classifier : string, optional (default='RF') 41 | 'RF' or 'XGB'. 42 | The classifier type of the coarse-grained and fine-grained classifiers. 43 | RADE supports Random-Forest ('RF') and XGBoost ('XGB'). 44 | 45 | cg_params : dict or None, optional (default=None) 46 | If None, the classifier uses the defaults according to base_classifier. 47 | i.e., default_cg_params_RF = {'n_estimators': 10, 'max_depth': 5} for base_classifier='RF', 48 | or default_cg_params_XGB = {'n_estimators': 10, 'max_depth': 3} for base_classifier='XGB'. 49 | Parameters for the cg classifier. 50 | 51 | fg_normal_params : dict or None, optional (default=None) 52 | If None, the classifier uses the defaults according to base_classifier. 53 | i.e., default_fg_normal_params_RF = {'n_estimators': 25, 'max_depth': 20} for base_classifier='RF', 54 | or default_fg_normal_params_XGB = {'n_estimators': 30, 'max_depth': 3} for base_classifier='XGB'. 55 | Parameters for the fg normal classifier. 56 | 57 | fg_anomaly_params : dict or None, optional (default=None) 58 | If None, the classifier uses the defaults according to base_classifier. 59 | i.e., default_fg_anomaly_params_RF = {'n_estimators': 25, 'max_depth': 20}, for base_classifier='RF', 60 | or default_fg_anomaly_params_XGB = {'n_estimators': 30, 'max_depth': 3}, for base_classifier='XGB'. 61 | Parameters for the fg anomaly classifier. 62 | 63 | training_confidence_threshold : float, optional (default=None) 64 | If None, the classifier uses the defaults according to base_classifier. 65 | i.e., default_training_confidence_threshold_RF = 0.89 for base_classifier='RF', 66 | or default_training_confidence_threshold_XGB = 0.79 for base_classifier='XGB'. 67 | A value in [0,1]. 
68 | The training confidence threshold (TCT). 69 | 70 | classification_confidence_threshold : float, optional (default=None) 71 | If None, the classifier uses the defaults according to base_classifier. 72 | i.e., default_classification_confidence_threshold_RF = 0.79 for base_classifier='RF', 73 | or default_classification_confidence_threshold_XGB = 0.79 for base_classifier='XGB'. 74 | A value in [0,1]. 75 | The classification confidence threshold (CCT). 76 | 77 | collect_telemetry : boolean, (default=False) 78 | If True, collect telemetry on the training_data_fraction of the fg normal and anomaly classifiers. 79 | See also telemetry_ attribute. 80 | 81 | random_seed : int, optional (default=42) 82 | Random seed. 83 | 84 | verbose : int, optional (default=0) 85 | If 0 prints exceptions only, if equal or bigger than 1 prints also warnings. 86 | 87 | Attributes 88 | ---------- 89 | 90 | classes_ : array of shape (n_classes,) classes labels. 91 | 92 | cg_clf_ : Classifier 93 | The cg classifier, either Random Forest (base_classifier='RF') or XGBoost (base_classifier='XGB'). 94 | 95 | fg_clf_normal_ : Classifier 96 | The fg normal classifier, either Random Forest (base_classifier='RF') or XGBoost (base_classifier='XGB'). 97 | 98 | fg_clf_anomaly_ : Classifier 99 | The fg anomaly classifier, either Random Forest (base_classifier='RF') or XGBoost (base_classifier='XGB'). 100 | 101 | cg_train_using_feature_subset : list or None, optional (default=None) 102 | List of columns to use for training the cg classifier (when None, all columns are used). 103 | 104 | cg_only_ : boolean 105 | True if only the cg classifier is fitted. 106 | 107 | telemetry_ : dict (if collect_telemetry is True) 108 | Contains the training_data_fraction of the fg normal and anomaly classifiers. 
109 | 110 | Example program 111 | --------------- 112 | 113 | from rade_classifier import RadeClassifier 114 | from sklearn.datasets import make_classification 115 | from sklearn.metrics import classification_report 116 | from sklearn.model_selection import train_test_split 117 | 118 | X, y = make_classification(n_samples=1000000, n_features=4, 119 | n_informative=2, n_redundant=0, 120 | random_state=0, shuffle=False, weights=[0.99, 0.01]) 121 | 122 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42) 123 | 124 | clf = RadeClassifier() 125 | 126 | clf.fit(X_train, y_train) 127 | y_predicted = clf.predict(X_test) 128 | print(classification_report(y_test, y_predicted, digits=5)) 129 | 130 | Notes 131 | ----- 132 | More details can be found in [1]. 133 | See section 5.2 in order to tune RADE (e.g., by grid-search). 134 | 135 | References 136 | ---------- 137 | [1] Shay Vargaftik, Isaac Keslassy, Ariel Orda, Yaniv Ben-Itzhak, 138 | "RADE: Resource-Efficient Supervised Anomaly Detection Using Decision Tree-Based Ensemble Methods" 139 | https://arxiv.org/abs/1909.11877 140 | 141 | """ 142 | 143 | ########################################################################### 144 | ########################################################################### 145 | 146 | def __init__(self, 147 | 148 | base_classifier='RF', 149 | random_seed=42, 150 | 151 | cg_params=None, 152 | 153 | fg_normal_params=None, 154 | fg_anomaly_params=None, 155 | 156 | training_confidence_threshold=None, 157 | classification_confidence_threshold=None, 158 | 159 | # default configurations: 160 | # RF: 161 | default_training_confidence_threshold_RF=0.89, 162 | default_classification_confidence_threshold_RF=0.79, 163 | 164 | default_cg_params_RF= 165 | { 166 | 'n_estimators': 10, 167 | 'max_depth': 5 168 | }, 169 | 170 | default_fg_normal_params_RF= 171 | { 172 | 'n_estimators': 25, 173 | 'max_depth': 20 174 | }, 175 | 176 | 
default_fg_anomaly_params_RF= 177 | { 178 | 'n_estimators': 25, 179 | 'max_depth': 20 180 | }, 181 | 182 | # XGB: 183 | default_training_confidence_threshold_XGB=0.79, 184 | default_classification_confidence_threshold_XGB=0.79, 185 | 186 | default_cg_params_XGB= 187 | { 188 | 'n_estimators': 10, 189 | 'max_depth': 3 190 | }, 191 | 192 | default_fg_normal_params_XGB= 193 | { 194 | 'n_estimators': 30, 195 | 'max_depth': 3 196 | }, 197 | 198 | default_fg_anomaly_params_XGB= 199 | { 200 | 'n_estimators': 30, 201 | 'max_depth': 3 202 | }, 203 | 204 | cg_train_using_feature_subset=None, 205 | 206 | collect_telemetry=False, 207 | 208 | verbose=0 209 | 210 | ): 211 | 212 | self.base_classifier = base_classifier 213 | self.random_seed = random_seed 214 | 215 | self.cg_params = cg_params 216 | self.fg_normal_params = fg_normal_params 217 | self.fg_anomaly_params = fg_anomaly_params 218 | 219 | self.training_confidence_threshold = training_confidence_threshold 220 | self.classification_confidence_threshold = classification_confidence_threshold 221 | 222 | self.collect_telemetry = collect_telemetry 223 | 224 | self.cg_train_using_feature_subset = cg_train_using_feature_subset 225 | 226 | self.verbose = verbose 227 | 228 | ### RF defaults 229 | self.default_training_confidence_threshold_RF = default_training_confidence_threshold_RF 230 | self.default_classification_confidence_threshold_RF = default_classification_confidence_threshold_RF 231 | self.default_cg_params_RF = default_cg_params_RF 232 | self.default_fg_normal_params_RF = default_fg_normal_params_RF 233 | self.default_fg_anomaly_params_RF = default_fg_anomaly_params_RF 234 | 235 | ### XGBoost defaults 236 | self.default_training_confidence_threshold_XGB = default_training_confidence_threshold_XGB 237 | self.default_classification_confidence_threshold_XGB = default_classification_confidence_threshold_XGB 238 | self.default_cg_params_XGB = default_cg_params_XGB 239 | self.default_fg_normal_params_XGB = 
default_fg_normal_params_XGB 240 | self.default_fg_anomaly_params_XGB = default_fg_anomaly_params_XGB 241 | 242 | ########################################################################### 243 | ########################################################################### 244 | def verify_parameters(self, X, y): 245 | 246 | if self.classification_confidence_threshold and not self.training_confidence_threshold: 247 | if self.base_classifier == 'RF': 248 | if self.classification_confidence_threshold > self.default_training_confidence_threshold_RF: 249 | if self.verbose > 0: 250 | print( 251 | "Warning: classification_confidence_threshold ({}) > " 252 | "default_training_confidence_threshold_RF ({}).\n". 253 | format(self.classification_confidence_threshold, 254 | self.default_training_confidence_threshold_RF)) 255 | elif self.base_classifier == 'XGB': 256 | if self.classification_confidence_threshold > self.default_training_confidence_threshold_XGB: 257 | if self.verbose > 0: 258 | print( 259 | "Warning: classification_confidence_threshold ({}) > " 260 | "default_training_confidence_threshold_XGB ({}).\n". 261 | format(self.classification_confidence_threshold, 262 | self.default_training_confidence_threshold_XGB)) 263 | else: 264 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier)) 265 | 266 | elif not self.classification_confidence_threshold and self.training_confidence_threshold: 267 | if self.base_classifier == 'RF': 268 | if self.default_classification_confidence_threshold_RF > self.training_confidence_threshold: 269 | if self.verbose > 0: 270 | print( 271 | "Warning: default_classification_confidence_threshold_RF ({}) > " 272 | "training_confidence_threshold ({}).\n". 
273 | format(self.default_classification_confidence_threshold_RF, 274 | self.training_confidence_threshold)) 275 | elif self.base_classifier == 'XGB': 276 | if self.default_classification_confidence_threshold_XGB > self.training_confidence_threshold: 277 | if self.verbose > 0: 278 | print( 279 | "Warning: default_classification_confidence_threshold_XGB ({}) > " 280 | "training_confidence_threshold ({}).\n". 281 | format(self.default_classification_confidence_threshold_XGB, 282 | self.training_confidence_threshold)) 283 | else: 284 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier)) 285 | 286 | elif self.classification_confidence_threshold and self.training_confidence_threshold: 287 | if self.classification_confidence_threshold > self.training_confidence_threshold: 288 | if self.verbose > 0: 289 | print("Warning: classification_confidence_threshold ({}) > training_confidence_threshold ({}).\n". 290 | format(self.classification_confidence_threshold, self.training_confidence_threshold)) 291 | 292 | if self.cg_train_using_feature_subset is not None: 293 | ### empty is not allowed 294 | if not len(self.cg_train_using_feature_subset): 295 | raise Exception( 296 | "Illegal cg_train_using_feature_subset (err1): {}\nShould be None or specify unique columns".format( 297 | self.cg_train_using_feature_subset)) 298 | 299 | ### duplicates are not allowed 300 | if len(self.cg_train_using_feature_subset) != len(set(self.cg_train_using_feature_subset)): 301 | raise Exception( 302 | "Illegal cg_train_using_feature_subset (err2): {}\nShould be None or specify unique columns".format( 303 | self.cg_train_using_feature_subset)) 304 | 305 | ### translate column names (if X is a dataframe) to indices 306 | if isinstance(X, pd.DataFrame): 307 | if all(elem in X.columns for elem in self.cg_train_using_feature_subset): 308 | self.cg_train_using_feature_subset = [X.columns.get_loc(i) for i in 309 | self.cg_train_using_feature_subset] 310 | 311 | ### verify legal 
column values 312 | if not set(self.cg_train_using_feature_subset).issubset(set(range(X.shape[1]))): 313 | raise Exception( 314 | "Illegal cg_train_using_feature_subset (err3): {}\nShould be None or specify unique columns".format( 315 | self.cg_train_using_feature_subset)) 316 | 317 | ########################################################################### 318 | ########################################################################### 319 | 320 | def fit(self, X, y): 321 | 322 | ### set numpy seed 323 | np.random.seed(self.random_seed) 324 | 325 | ### base classifier type options 326 | baseClassifierTypes = { 327 | 328 | 'RF': RandomForestClassifier, 329 | 'XGB': XGBClassifier 330 | 331 | } 332 | 333 | ## RADE parameters input checks 334 | self.verify_parameters(X, y) 335 | 336 | ### input verification - required by scikit 337 | X, y = check_X_y(X, y) 338 | 339 | ### store the classes seen during fit - required by scikit 340 | self.classes_ = unique_labels(y) 341 | 342 | ### store the number of features passed to the fit method 343 | self.n_features_in_ = X.shape[1] 344 | 345 | ### binary classifier 346 | if len(self.classes_) >= 3: 347 | raise Exception("RADE is a binary classifier") 348 | 349 | ### collect telemetry 350 | if self.collect_telemetry: 351 | self.telemetry_ = {} 352 | self.telemetry_['normal_fg_training_data_fraction'] = 0 353 | self.telemetry_['anomaly_fg_training_data_fraction'] = 0 354 | 355 | ### init coarse-grained (cg) classifier 356 | self.cg_clf_ = baseClassifierTypes[self.base_classifier](random_state=self.random_seed) 357 | 358 | ### set cg params 359 | if self.cg_params is None: 360 | if self.verbose > 0: 361 | print("Warning: no kwargs for the coarse-grained model. 
Use the default configuration.\n") 362 | if self.base_classifier == 'RF': 363 | self.cg_clf_.set_params(**self.default_cg_params_RF) 364 | elif self.base_classifier == 'XGB': 365 | self.cg_clf_.set_params(**self.default_cg_params_XGB) 366 | else: 367 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier)) 368 | else: 369 | self.cg_clf_.set_params(**self.cg_params) 370 | 371 | ### train cg 372 | if self.cg_train_using_feature_subset == None: 373 | self.cg_clf_.fit(X, y) 374 | else: 375 | self.cg_clf_.fit(X[:, self.cg_train_using_feature_subset], y) 376 | 377 | ### tags 378 | try: 379 | self.__normal_tag_ = np.min(self.classes_) 380 | self.__anomaly_tag_ = np.max(self.classes_) 381 | except: 382 | self.__normal_tag_ = self.classes_[0] 383 | self.__anomaly_tag_ = self.classes_[1] 384 | 385 | ### single class 386 | if self.__normal_tag_ == self.__anomaly_tag_: 387 | self.cg_only_ = True 388 | if self.verbose > 0: 389 | print("Warning: received only a single class for training, no fg models.\n") 390 | return self 391 | else: 392 | self.cg_only_ = False 393 | 394 | ### init fine-grained (fg) classifiers 395 | self.fg_clf_normal_ = baseClassifierTypes[self.base_classifier](random_state=self.random_seed) 396 | self.fg_clf_anomaly_ = baseClassifierTypes[self.base_classifier](random_state=self.random_seed) 397 | 398 | ### set fg normal params 399 | if self.fg_normal_params is None: 400 | if self.verbose > 0: 401 | print("Warning: no kwards for the fine-grained normal model. 
Use the default configuration.\n") 402 | if self.base_classifier == 'RF': 403 | self.fg_clf_normal_.set_params(**self.default_fg_normal_params_RF) 404 | elif self.base_classifier == 'XGB': 405 | self.fg_clf_normal_.set_params(**self.default_fg_normal_params_XGB) 406 | else: 407 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier)) 408 | else: 409 | self.fg_clf_normal_.set_params(**self.fg_normal_params) 410 | 411 | ### set fg anomaly params 412 | if self.fg_anomaly_params is None: 413 | if self.verbose > 0: 414 | print("Warning: no kwards for the fine-grained anomaly model. Use the default configuration.\n") 415 | if self.base_classifier == 'RF': 416 | self.fg_clf_anomaly_.set_params(**self.default_fg_anomaly_params_RF) 417 | elif self.base_classifier == 'XGB': 418 | self.fg_clf_anomaly_.set_params(**self.default_fg_anomaly_params_XGB) 419 | else: 420 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier)) 421 | else: 422 | self.fg_clf_anomaly_.set_params(**self.fg_anomaly_params) 423 | 424 | ### classify training data by cg to obtain metadata 425 | if self.cg_train_using_feature_subset == None: 426 | cg_classification_distribution = self.cg_clf_.predict_proba(X) 427 | else: 428 | cg_classification_distribution = self.cg_clf_.predict_proba(X[:, self.cg_train_using_feature_subset]) 429 | 430 | cg_classification = np.take(self.classes_, np.argmax(cg_classification_distribution, axis=1)) 431 | cg_classification_confidence = np.max(cg_classification_distribution, axis=1) 432 | 433 | ### prepare train data filters 434 | if self.training_confidence_threshold is None: 435 | if self.base_classifier == 'RF': 436 | cg_low_confidence_indeces = (cg_classification_confidence < 437 | self.default_training_confidence_threshold_RF) 438 | elif self.base_classifier == 'XGB': 439 | cg_low_confidence_indeces = ( 440 | cg_classification_confidence < self.default_training_confidence_threshold_XGB) 441 | else: 442 | raise 
Exception('Unsupported base_classifier {}'.format(self.base_classifier)) 443 | else: 444 | cg_low_confidence_indeces = (cg_classification_confidence < self.training_confidence_threshold) 445 | 446 | true_anomaly_indeces = (y == self.__anomaly_tag_) 447 | cg_normal_classification_indeces = (cg_classification == self.__normal_tag_) 448 | cg_anomaly_classifications_indeces = (cg_classification == self.__anomaly_tag_) 449 | 450 | ### training data for fg models 451 | fg_normal_training_data_filter = cg_low_confidence_indeces & ( 452 | true_anomaly_indeces | cg_normal_classification_indeces) 453 | fg_normal_training_data_X = X[fg_normal_training_data_filter] 454 | fg_normal_training_data_y = y[fg_normal_training_data_filter] 455 | 456 | fg_anomaly_training_data_filter = cg_low_confidence_indeces & ( 457 | true_anomaly_indeces | cg_anomaly_classifications_indeces) 458 | fg_anomaly_training_data_X = X[fg_anomaly_training_data_filter] 459 | fg_anomaly_training_data_y = y[fg_anomaly_training_data_filter] 460 | 461 | ### train the fg models 462 | if len(unique_labels(fg_normal_training_data_y)) == 2 and sum(fg_normal_training_data_filter) > 1: 463 | ### collect telemetry 464 | if self.collect_telemetry: 465 | self.telemetry_['normal_fg_training_data_fraction'] = sum(fg_normal_training_data_filter) / len( 466 | fg_normal_training_data_filter) 467 | ### train 468 | self.fg_clf_normal_.fit(fg_normal_training_data_X, fg_normal_training_data_y) 469 | self.fg_normal_fitted_ = True 470 | else: 471 | if self.verbose > 0: 472 | print("Warning: no fine-grained normal model training.\n") 473 | self.fg_normal_fitted_ = False 474 | 475 | if len(unique_labels(fg_anomaly_training_data_y)) == 2 and sum(fg_anomaly_training_data_filter) > 1: 476 | ### collect telemetry 477 | if self.collect_telemetry: 478 | self.telemetry_['anomaly_fg_training_data_fraction'] = sum(fg_anomaly_training_data_filter) / len( 479 | fg_anomaly_training_data_filter) 480 | ### train 481 | 
self.fg_clf_anomaly_.fit(fg_anomaly_training_data_X, fg_anomaly_training_data_y) 482 | self.fg_anomaly_fitted_ = True 483 | else: 484 | if self.verbose > 0: 485 | print("Warning: no fine-grained anomaly model training.\n") 486 | self.fg_anomaly_fitted_ = False 487 | 488 | ### for speed 489 | if not self.fg_normal_fitted_ and not self.fg_anomaly_fitted_: 490 | self.cg_only_ = True 491 | 492 | ### a call to fit should return the classifier - required by scikit 493 | return self 494 | 495 | ########################################################################### 496 | ########################################################################### 497 | 498 | def predict_basic(self, X, proba=False): 499 | 500 | ### set numpy seed 501 | np.random.seed(self.random_seed) 502 | 503 | ### check is that fit had been called - required by scikit 504 | check_is_fitted(self) 505 | 506 | ### input verification - required by scikit 507 | X = check_array(X) 508 | 509 | ### collect telemetry 510 | if self.collect_telemetry: 511 | self.telemetry_['normal_fg_test_data_fraction'] = 0 512 | self.telemetry_['anomaly_fg_test_data_fraction'] = 0 513 | 514 | ### no fg models? 
515 | if self.cg_only_: 516 | if not proba: 517 | if self.cg_train_using_feature_subset == None: 518 | return self.cg_clf_.predict(X) 519 | else: 520 | return self.cg_clf_.predict(X[:, self.cg_train_using_feature_subset]) 521 | else: 522 | if self.cg_train_using_feature_subset == None: 523 | return self.cg_clf_.predict_proba(X) 524 | else: 525 | return self.cg_clf_.predict_proba(X[:, self.cg_train_using_feature_subset]) 526 | 527 | ### classify test data by cg to obtain metadata 528 | if self.cg_train_using_feature_subset == None: 529 | cg_classification_distribution = self.cg_clf_.predict_proba(X) 530 | else: 531 | cg_classification_distribution = self.cg_clf_.predict_proba(X[:, self.cg_train_using_feature_subset]) 532 | 533 | cg_classification = np.take(self.classes_, np.argmax(cg_classification_distribution, axis=1)) 534 | cg_classification_confidence = np.max(cg_classification_distribution, axis=1) 535 | 536 | ### prepare test data filters 537 | if self.classification_confidence_threshold is None: 538 | if self.base_classifier == 'RF': 539 | cg_low_confidence_indeces = (cg_classification_confidence < 540 | self.default_classification_confidence_threshold_RF) 541 | elif self.base_classifier == 'XGB': 542 | cg_low_confidence_indeces = (cg_classification_confidence < 543 | self.default_classification_confidence_threshold_XGB) 544 | else: 545 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier)) 546 | else: 547 | cg_low_confidence_indeces = (cg_classification_confidence < self.classification_confidence_threshold) 548 | 549 | normal_cg_classification_indeces = (cg_classification == self.__normal_tag_) 550 | anomaly_cg_classifications_indeces = (cg_classification == self.__anomaly_tag_) 551 | 552 | ### test data for fg models 553 | fg_normal_test_data_filter = cg_low_confidence_indeces & normal_cg_classification_indeces 554 | fg_normal_test_data = X[fg_normal_test_data_filter] 555 | 556 | fg_anomaly_test_data_filter = 
cg_low_confidence_indeces & anomaly_cg_classifications_indeces 557 | fg_anomaly_test_data = X[fg_anomaly_test_data_filter] 558 | 559 | ### predict 560 | if not proba: 561 | 562 | classification_results = cg_classification 563 | 564 | if self.fg_normal_fitted_ and np.any(fg_normal_test_data_filter): 565 | 566 | ### collect telemetry 567 | if self.collect_telemetry: 568 | self.telemetry_['normal_fg_test_data_fraction'] = sum(fg_normal_test_data_filter) / len( 569 | fg_normal_test_data_filter) 570 | 571 | classification_results[fg_normal_test_data_filter] = self.fg_clf_normal_.predict(fg_normal_test_data) 572 | 573 | if self.fg_anomaly_fitted_ and np.any(fg_anomaly_test_data_filter): 574 | 575 | ### collect telemetry 576 | if self.collect_telemetry: 577 | self.telemetry_['anomaly_fg_test_data_fraction'] = sum(fg_anomaly_test_data_filter) / len( 578 | fg_anomaly_test_data_filter) 579 | 580 | classification_results[fg_anomaly_test_data_filter] = self.fg_clf_anomaly_.predict(fg_anomaly_test_data) 581 | 582 | return classification_results 583 | 584 | ### predict proba 585 | else: 586 | 587 | classification_distribution_results = cg_classification_distribution 588 | 589 | if self.fg_normal_fitted_ and np.any(fg_normal_test_data_filter): 590 | 591 | ### collect telemetry 592 | if self.collect_telemetry: 593 | self.telemetry_['normal_fg_test_data_fraction'] = sum(fg_normal_test_data_filter) / len( 594 | fg_normal_test_data_filter) 595 | 596 | classification_distribution_results[fg_normal_test_data_filter] = self.fg_clf_normal_.predict_proba( 597 | fg_normal_test_data) 598 | 599 | if self.fg_anomaly_fitted_ and np.any(fg_anomaly_test_data_filter): 600 | 601 | ### collect telemetry 602 | if self.collect_telemetry: 603 | self.telemetry_['anomaly_fg_test_data_fraction'] = sum(fg_anomaly_test_data_filter) / len( 604 | fg_anomaly_test_data_filter) 605 | 606 | classification_distribution_results[fg_anomaly_test_data_filter] = self.fg_clf_anomaly_.predict_proba( 607 | 
fg_anomaly_test_data) 608 | 609 | return classification_distribution_results 610 | 611 | ########################################################################### 612 | 613 | ########################################################################### 614 | 615 | def predict(self, X): 616 | 617 | return self.predict_basic(X) 618 | 619 | def predict_proba(self, X): 620 | 621 | return self.predict_basic(X, proba=True) 622 | 623 | ########################################################################### 624 | ########################################################################### 625 | 626 | ### getters 627 | 628 | def get_telemetry(self): 629 | 630 | try: 631 | return self.telemetry_ 632 | except: 633 | print("\nError: get_telemetry was called but telemetry is disabled.\n") 634 | 635 | def get_sub_classifier(self, clf): 636 | 637 | if clf == 'cg': 638 | return self.cg_clf_ 639 | 640 | elif clf == 'fg_normal': 641 | if self.fg_normal_fitted_: 642 | return self.fg_clf_normal_ 643 | else: 644 | return None 645 | 646 | elif clf == 'fg_anomaly': 647 | if self.fg_anomaly_fitted_: 648 | return self.fg_clf_anomaly_ 649 | else: 650 | return None 651 | 652 | else: 653 | raise Exception("unknown sub-classifier type, possible options are: cg / fg_normal / fg_anomaly") 654 | 655 | ########################################################################### 656 | ########################################################################### 657 | def _more_tags(self): 658 | 659 | return {'binary_only': True} 660 | 661 | ########################################################################### 662 | ########################################################################### 663 | --------------------------------------------------------------------------------
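The coarse-to-fine routing that `fit` and `predict_basic` compute with `np.take`/`np.argmax`/`np.max` can be sketched in isolation: the coarse-grained (cg) model labels every sample, and only samples whose top-class probability falls below the confidence threshold are forwarded to a fine-grained (fg) model. This is an illustrative, self-contained sketch, not part of the RADE API; the function name, threshold, and toy probabilities are invented for demonstration.

```python
import numpy as np


def cg_route(proba, classes, threshold):
    """Mimic the cg routing metadata: per-sample predicted label,
    plus a boolean mask marking low-confidence samples that RADE
    would hand off to a fine-grained model."""
    classification = np.take(classes, np.argmax(proba, axis=1))
    confidence = np.max(proba, axis=1)
    return classification, confidence < threshold


if __name__ == "__main__":
    proba = [[0.95, 0.05],   # confident normal -> keep cg label
             [0.55, 0.45],   # low confidence  -> route to fg model
             [0.10, 0.90]]   # confident anomaly -> keep cg label
    labels, low_confidence = cg_route(proba, [0, 1], threshold=0.9)
    print(labels.tolist(), low_confidence.tolist())  # [0, 0, 1] [False, True, False]
```

In the real classifier the mask additionally splits by cg-predicted class, so each fg model only ever sees the region of the input space it specializes in.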
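`verify_parameters` above translates DataFrame column names in `cg_train_using_feature_subset` to positional indices (via `Index.get_loc`) so that the later `X[:, subset]` slicing still works after `check_X_y` has converted `X` to a plain array. A standalone sketch of that translation, with a helper name invented for illustration:

```python
import pandas as pd


def names_to_indices(df, subset):
    """Map column names to positions; pass through subsets
    that are already positional indices."""
    if all(col in df.columns for col in subset):
        return [df.columns.get_loc(col) for col in subset]
    return list(subset)


if __name__ == "__main__":
    df = pd.DataFrame({"f0": [1], "f1": [2], "f2": [3]})
    print(names_to_indices(df, ["f1", "f2"]))  # [1, 2]
    print(names_to_indices(df, [0, 2]))        # [0, 2]
```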