├── requirements.txt
├── LICENSE.txt
├── NOTICE.txt
├── README.md
├── .gitignore
├── CONTRIBUTING.md
├── example_program.py
├── CODE_OF_CONDUCT.md
└── rade_classifier.py
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy>=1.21
2 | pandas
3 | scikit-learn
4 | xgboost
5 |
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vmware-labs/efficient-supervised-anomaly-detection/main/LICENSE.txt
--------------------------------------------------------------------------------
/NOTICE.txt:
--------------------------------------------------------------------------------
1 | Efficient Supervised Anomaly Detection
2 |
3 | Copyright 2021 VMware, Inc.
4 |
5 | This product is licensed to you under the BSD 3-Clause "New" or "Revised" license (the "License"). You may not use this product except in compliance with the BSD-3 License.
6 |
7 | This product may include a number of subcomponents with separate copyright notices and license terms. Your use of these subcomponents is subject to the terms and conditions of the subcomponent's license, as noted in the LICENSE file.
8 |
9 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # RADE scikit-learn Classifier (v1.0)
2 |
3 | RADE is a resource-efficient decision tree ensemble method (DTEM) based
4 | anomaly detection approach. It augments standard DTEM classifiers to deliver
5 | competitive anomaly detection capabilities with significant savings in
6 | resource usage (model memory size, training time, and classification latency).
7 |
8 | The current implementation of RADE augments either Random-Forest or XGBoost.
9 |
10 | More information about RADE can be found in:
11 | [RADE: resource-efficient supervised anomaly detection using decision tree-based ensemble methods](https://arxiv.org/abs/1909.11877) (Springer Machine Learning)
12 |
13 | ## Files:
14 |
15 | #### rade_classifier.py - The RADE scikit-learn classifier
16 |
17 | #### example_program.py - A basic comparison of RADE against Random Forest and XGBoost
18 |
19 |
20 | ## Prerequisites:
21 | * numpy
22 | * pandas
23 | * scikit-learn
24 | * xgboost
25 | 
26 | or alternatively run:
27 |     $ pip3 install -r requirements.txt
28 |
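## Usage:

A minimal example, taken from the `RadeClassifier` docstring (`example_program.py` runs a fuller comparison against Random Forest and XGBoost):

```python
from rade_classifier import RadeClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Imbalanced binary data: ~1% anomalies
X, y = make_classification(n_samples=1000000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False, weights=[0.99, 0.01])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Pass base_classifier='XGB' for an XGBoost base instead of the default Random Forest
clf = RadeClassifier()
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print(classification_report(y_test, y_predicted, digits=5))
```
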
29 | For more information, support, and advanced examples, contact:
30 | * Yaniv Ben-Itzhak, [ybenitzhak@vmware.com](mailto:ybenitzhak@vmware.com)
31 | * Shay Vargaftik, [shayv@vmware.com](mailto:shayv@vmware.com)
32 |
33 |
34 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Editors
2 | .vscode/
3 | .idea/
4 |
5 | # Vagrant
6 | .vagrant/
7 |
8 | # Mac/OSX
9 | .DS_Store
10 |
11 | # Windows
12 | Thumbs.db
13 |
14 | # Source for the following rules: https://raw.githubusercontent.com/github/gitignore/master/Python.gitignore
15 | # Byte-compiled / optimized / DLL files
16 | __pycache__/
17 | *.py[cod]
18 | *$py.class
19 |
20 | # C extensions
21 | *.so
22 |
23 | # Distribution / packaging
24 | .Python
25 | build/
26 | develop-eggs/
27 | dist/
28 | downloads/
29 | eggs/
30 | .eggs/
31 | lib/
32 | lib64/
33 | parts/
34 | sdist/
35 | var/
36 | wheels/
37 | *.egg-info/
38 | .installed.cfg
39 | *.egg
40 | MANIFEST
41 |
42 | # PyInstaller
43 | # Usually these files are written by a python script from a template
44 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
45 | *.manifest
46 | *.spec
47 |
48 | # Installer logs
49 | pip-log.txt
50 | pip-delete-this-directory.txt
51 |
52 | # Unit test / coverage reports
53 | htmlcov/
54 | .tox/
55 | .nox/
56 | .coverage
57 | .coverage.*
58 | .cache
59 | nosetests.xml
60 | coverage.xml
61 | *.cover
62 | .hypothesis/
63 | .pytest_cache/
64 |
65 | # Translations
66 | *.mo
67 | *.pot
68 |
69 | # Django stuff:
70 | *.log
71 | local_settings.py
72 | db.sqlite3
73 |
74 | # Flask stuff:
75 | instance/
76 | .webassets-cache
77 |
78 | # Scrapy stuff:
79 | .scrapy
80 |
81 | # Sphinx documentation
82 | docs/_build/
83 |
84 | # PyBuilder
85 | target/
86 |
87 | # Jupyter Notebook
88 | .ipynb_checkpoints
89 |
90 | # IPython
91 | profile_default/
92 | ipython_config.py
93 |
94 | # pyenv
95 | .python-version
96 |
97 | # celery beat schedule file
98 | celerybeat-schedule
99 |
100 | # SageMath parsed files
101 | *.sage.py
102 |
103 | # Environments
104 | .env
105 | .venv
106 | env/
107 | venv/
108 | ENV/
109 | env.bak/
110 | venv.bak/
111 |
112 | # Spyder project settings
113 | .spyderproject
114 | .spyproject
115 |
116 | # Rope project settings
117 | .ropeproject
118 |
119 | # mkdocs documentation
120 | /site
121 |
122 | # mypy
123 | .mypy_cache/
124 | .dmypy.json
125 | dmypy.json
126 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 |
2 | # Contributing to efficient-supervised-anomaly-detection
3 |
4 | The efficient-supervised-anomaly-detection project team welcomes contributions from the community. Before you start working with efficient-supervised-anomaly-detection, please
5 | read our [Developer Certificate of Origin](https://cla.vmware.com/dco). All contributions to this repository must be
6 | signed as described on that page. Your signature certifies that you wrote the patch or have the right to pass it on
7 | as an open-source patch.
8 |
9 | ## Contribution Flow
10 |
11 | This is a rough outline of what a contributor's workflow looks like:
12 |
13 | - Create a topic branch from where you want to base your work
14 | - Make commits of logical units
15 | - Make sure your commit messages are in the proper format (see below)
16 | - Push your changes to a topic branch in your fork of the repository
17 | - Submit a pull request
18 |
19 | Example:
20 |
21 | ``` shell
22 | git remote add upstream https://github.com/vmware/efficient-supervised-anomaly-detection.git
23 | git checkout -b my-new-feature main
24 | git commit -a
25 | git push origin my-new-feature
26 | ```
27 |
28 | ### Staying In Sync With Upstream
29 |
30 | When your branch gets out of sync with the vmware/main branch, use the following to update:
31 |
32 | ``` shell
33 | git checkout my-new-feature
34 | git fetch --all
35 | git pull --rebase upstream main
36 | git push --force-with-lease origin my-new-feature
37 | ```
38 |
39 | ### Updating pull requests
40 |
41 | If your PR fails to pass CI or needs changes based on code review, you'll most likely want to squash these changes into
42 | existing commits.
43 |
44 | If your pull request contains a single commit or your changes are related to the most recent commit, you can simply
45 | amend the commit.
46 |
47 | ``` shell
48 | git add .
49 | git commit --amend
50 | git push --force-with-lease origin my-new-feature
51 | ```
52 |
53 | If you need to squash changes into an earlier commit, you can use:
54 |
55 | ``` shell
56 | git add .
57 | git commit --fixup <commit>
58 | git rebase -i --autosquash main
59 | git push --force-with-lease origin my-new-feature
60 | ```
61 |
62 | Be sure to add a comment to the PR indicating your new changes are ready to review, as GitHub does not generate a
63 | notification when you git push.
64 |
65 | ### Code Style
66 |
67 | ### Formatting Commit Messages
68 |
69 | We follow the conventions on [How to Write a Git Commit Message](http://chris.beams.io/posts/git-commit/).
70 |
71 | Be sure to include any related GitHub issue references in the commit message. See
72 | [GFM syntax](https://guides.github.com/features/mastering-markdown/#GitHub-flavored-markdown) for referencing issues
73 | and commits.
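
For example, a hypothetical commit message following these conventions (subject line, blank line, body, issue reference; the issue number is illustrative):

```
Add telemetry collection to RadeClassifier

Record the fraction of training and test data routed to the
fine-grained models so users can tune the confidence thresholds.

Fixes #123
```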
74 |
75 | ## Reporting Bugs and Creating Issues
76 |
77 | When opening a new issue, try to roughly follow the commit message format conventions above.
78 |
--------------------------------------------------------------------------------
/example_program.py:
--------------------------------------------------------------------------------
1 | # Copyright 2018-2021 VMware, Inc.
2 | # SPDX-License-Identifier: BSD-3-Clause
3 |
4 | from rade_classifier import RadeClassifier
5 | from sklearn.ensemble import RandomForestClassifier
6 | from xgboost import XGBClassifier
7 |
8 | from sklearn.datasets import make_classification
9 | from sklearn.metrics import classification_report, f1_score
10 | from sklearn.model_selection import train_test_split
12 |
13 | import time
14 |
15 | X, y = make_classification(n_samples=1000000, n_features=4,
16 | n_informative=2, n_redundant=0,
17 | random_state=0, shuffle=False, weights=[0.99, 0.01])
18 |
19 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
20 |
21 | # RADE
22 | rade_clf = RadeClassifier()
23 |
24 | start = time.time()
25 | rade_clf.fit(X_train, y_train)
26 | end = time.time()
27 | rade_train_time = end - start
28 | print("RADE training time is: {:0.2f} seconds".format(rade_train_time))
29 |
30 | start = time.time()
31 | rade_y_predicted = rade_clf.predict(X_test)
32 | end = time.time()
33 | rade_predict_time = end - start
34 | print("RADE prediction time is: {:0.2f} seconds".format(rade_predict_time))
35 |
36 | rade_f1 = f1_score(y_test, rade_y_predicted, average='macro')
37 | print("RADE classification_report:")
38 | print(classification_report(y_test, rade_y_predicted, digits=5))
39 |
40 | # Random Forest
41 | rf_clf = RandomForestClassifier()
42 |
43 | start = time.time()
44 | rf_clf.fit(X_train, y_train)
45 | end = time.time()
46 | rf_train_time = end - start
47 | print("Random Forest training time is: {:0.2f} seconds".format(rf_train_time))
48 |
49 | start = time.time()
50 | rf_y_predicted = rf_clf.predict(X_test)
51 | end = time.time()
52 | rf_predict_time = end - start
53 | print("Random Forest prediction time is: {:0.2f} seconds".format(rf_predict_time))
54 |
55 | rf_f1 = f1_score(y_test, rf_y_predicted, average='macro')
56 | print("Random Forest classification_report:")
57 | print(classification_report(y_test, rf_y_predicted, digits=5))
58 |
59 | # XGBoost
60 | xgb_clf = XGBClassifier()
61 |
62 | start = time.time()
63 | xgb_clf.fit(X_train, y_train)
64 | end = time.time()
65 | xgb_train_time = end - start
66 | print("XGBoost training time is: {:0.2f} seconds".format(xgb_train_time))
67 |
68 | start = time.time()
69 | xgb_y_predicted = xgb_clf.predict(X_test)
70 | end = time.time()
71 | xgb_predict_time = end - start
72 | print("XGBoost prediction time is: {:0.2f} seconds".format(xgb_predict_time))
73 |
74 | xgb_f1 = f1_score(y_test, xgb_y_predicted, average='macro')
75 | print("XGBoost classification_report:")
76 | print(classification_report(y_test, xgb_y_predicted, digits=5))
77 |
78 | print('RADE vs. Random Forest and XGBoost:')
79 | print('Training time: RADE is {:0.1f}x faster than Random Forest, and {:0.1f}x faster than XGBoost.'.
80 | format(rf_train_time / rade_train_time, xgb_train_time / rade_train_time))
81 | print('Prediction time: RADE is {:0.1f}x faster than Random Forest, and {:0.1f}x faster than XGBoost.'.
82 | format(rf_predict_time / rade_predict_time, xgb_predict_time / rade_predict_time))
83 | print('Macro F1: RADE is better by {:+0.5f} as compared to Random Forest, and by {:+0.5f} as compared to XGBoost.'.
84 | format(rade_f1 - rf_f1, rade_f1 - xgb_f1))
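
# Optional extension (a sketch, not part of the original comparison): RADE can
# also use an XGBoost base and report telemetry on the fraction of data routed
# to the fine-grained models; both options come from rade_classifier.py.
rade_xgb_clf = RadeClassifier(base_classifier='XGB', collect_telemetry=True)
rade_xgb_clf.fit(X_train, y_train)
rade_xgb_f1 = f1_score(y_test, rade_xgb_clf.predict(X_test), average='macro')
print('RADE (XGB base) macro F1: {:0.5f}'.format(rade_xgb_f1))
print('RADE (XGB base) telemetry:', rade_xgb_clf.get_telemetry())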
85 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 |
2 | # Contributor Covenant Code of Conduct
3 |
4 | ## Our Pledge
5 |
6 | We as members, contributors, and leaders pledge to make participation in the efficient-supervised-anomaly-detection project and our
7 | community a harassment-free experience for everyone, regardless of age, body
8 | size, visible or invisible disability, ethnicity, sex characteristics, gender
9 | identity and expression, level of experience, education, socio-economic status,
10 | nationality, personal appearance, race, religion, or sexual identity
11 | and orientation.
12 |
13 | We pledge to act and interact in ways that contribute to an open, welcoming,
14 | diverse, inclusive, and healthy community.
15 |
16 | ## Our Standards
17 |
18 | Examples of behavior that contributes to a positive environment for our
19 | community include:
20 |
21 | * Demonstrating empathy and kindness toward other people
22 | * Being respectful of differing opinions, viewpoints, and experiences
23 | * Giving and gracefully accepting constructive feedback
24 | * Accepting responsibility and apologizing to those affected by our mistakes,
25 | and learning from the experience
26 | * Focusing on what is best not just for us as individuals, but for the
27 | overall community
28 |
29 | Examples of unacceptable behavior include:
30 |
31 | * The use of sexualized language or imagery, and sexual attention or
32 | advances of any kind
33 | * Trolling, insulting or derogatory comments, and personal or political attacks
34 | * Public or private harassment
35 | * Publishing others' private information, such as a physical or email
36 | address, without their explicit permission
37 | * Other conduct which could reasonably be considered inappropriate in a
38 | professional setting
39 |
40 | ## Enforcement Responsibilities
41 |
42 | Community leaders are responsible for clarifying and enforcing our standards of
43 | acceptable behavior and will take appropriate and fair corrective action in
44 | response to any behavior that they deem inappropriate, threatening, offensive,
45 | or harmful.
46 |
47 | Community leaders have the right and responsibility to remove, edit, or reject
48 | comments, commits, code, wiki edits, issues, and other contributions that are
49 | not aligned to this Code of Conduct, and will communicate reasons for moderation
50 | decisions when appropriate.
51 |
52 | ## Scope
53 |
54 | This Code of Conduct applies within all community spaces, and also applies when
55 | an individual is officially representing the community in public spaces.
56 | Examples of representing our community include using an official e-mail address,
57 | posting via an official social media account, or acting as an appointed
58 | representative at an online or offline event.
59 |
60 | ## Enforcement
61 |
62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
63 | reported to the community leaders responsible for enforcement at oss-coc@vmware.com.
64 | All complaints will be reviewed and investigated promptly and fairly.
65 |
66 | All community leaders are obligated to respect the privacy and security of the
67 | reporter of any incident.
68 |
69 | ## Enforcement Guidelines
70 |
71 | Community leaders will follow these Community Impact Guidelines in determining
72 | the consequences for any action they deem in violation of this Code of Conduct:
73 |
74 | ### 1. Correction
75 |
76 | **Community Impact**: Use of inappropriate language or other behavior deemed
77 | unprofessional or unwelcome in the community.
78 |
79 | **Consequence**: A private, written warning from community leaders, providing
80 | clarity around the nature of the violation and an explanation of why the
81 | behavior was inappropriate. A public apology may be requested.
82 |
83 | ### 2. Warning
84 |
85 | **Community Impact**: A violation through a single incident or series
86 | of actions.
87 |
88 | **Consequence**: A warning with consequences for continued behavior. No
89 | interaction with the people involved, including unsolicited interaction with
90 | those enforcing the Code of Conduct, for a specified period of time. This
91 | includes avoiding interactions in community spaces as well as external channels
92 | like social media. Violating these terms may lead to a temporary or
93 | permanent ban.
94 |
95 | ### 3. Temporary Ban
96 |
97 | **Community Impact**: A serious violation of community standards, including
98 | sustained inappropriate behavior.
99 |
100 | **Consequence**: A temporary ban from any sort of interaction or public
101 | communication with the community for a specified period of time. No public or
102 | private interaction with the people involved, including unsolicited interaction
103 | with those enforcing the Code of Conduct, is allowed during this period.
104 | Violating these terms may lead to a permanent ban.
105 |
106 | ### 4. Permanent Ban
107 |
108 | **Community Impact**: Demonstrating a pattern of violation of community
109 | standards, including sustained inappropriate behavior, harassment of an
110 | individual, or aggression toward or disparagement of classes of individuals.
111 |
112 | **Consequence**: A permanent ban from any sort of public interaction within
113 | the community.
114 |
115 | ## Attribution
116 |
117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
118 | version 2.0, available at
119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
120 |
121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct
122 | enforcement ladder](https://github.com/mozilla/diversity).
123 |
124 | [homepage]: https://www.contributor-covenant.org
125 |
126 | For answers to common questions about this code of conduct, see the FAQ at
127 | https://www.contributor-covenant.org/faq. Translations are available at
128 | https://www.contributor-covenant.org/translations.
--------------------------------------------------------------------------------
/rade_classifier.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | 
3 | # Copyright 2018-2021 VMware, Inc.
4 | # SPDX-License-Identifier: BSD-3-Clause
5 | 
6 | 
7 | import numpy as np
8 | import pandas as pd
9 |
10 | from sklearn.base import BaseEstimator, ClassifierMixin
11 | from sklearn.ensemble import RandomForestClassifier
12 | from xgboost import XGBClassifier
13 | from sklearn.utils.multiclass import unique_labels
14 | from sklearn.utils.validation import check_is_fitted, check_X_y, check_array
15 |
16 |
17 | ###############################################################################
18 | ###############################################################################
19 |
20 | class RadeClassifier(BaseEstimator, ClassifierMixin):
21 | """
22 | A RADE classifier.
23 |
24 | An efficient classifier that augments either a Random Forest or an XGBoost classifier to
25 | obtain lower model memory size, lower training time and lower classification latency.
26 |
27 | The main building blocks of RADE are:
28 |
29 | Coarse-grained classifier - a small model that is trained using the entire training dataset.
30 | The coarse-grained classifier is sufficient to classify the majority of the classification queries correctly, such
31 | that a classification is valid only if its corresponding confidence level is greater than or equal to the
32 | classification confidence threshold.
33 |
34 | Fine-grained classifiers - 'expert' classifiers that are trained to succeed specifically where the coarse-grained model is not sufficiently
35 | confident and is more likely to make a classification mistake.
36 |
37 |
38 | Parameters
39 | ----------
40 | base_classifier : string, optional (default='RF')
41 | 'RF' or 'XGB'.
42 | The classifier type of the coarse-grained and fine-grained classifiers.
43 | RADE supports Random-Forest ('RF') and XGBoost ('XGB').
44 |
45 | cg_params : dict or None, optional (default=None)
46 | Parameters for the cg classifier.
47 | If None, the classifier uses the defaults according to base_classifier,
48 | i.e., default_cg_params_RF = {'n_estimators': 10, 'max_depth': 5} for base_classifier='RF',
49 | or default_cg_params_XGB = {'n_estimators': 10, 'max_depth': 3} for base_classifier='XGB'.
50 |
51 | fg_normal_params : dict or None, optional (default=None)
52 | Parameters for the fg normal classifier.
53 | If None, the classifier uses the defaults according to base_classifier,
54 | i.e., default_fg_normal_params_RF = {'n_estimators': 25, 'max_depth': 20} for base_classifier='RF',
55 | or default_fg_normal_params_XGB = {'n_estimators': 30, 'max_depth': 3} for base_classifier='XGB'.
56 |
57 | fg_anomaly_params : dict or None, optional (default=None)
58 | Parameters for the fg anomaly classifier.
59 | If None, the classifier uses the defaults according to base_classifier,
60 | i.e., default_fg_anomaly_params_RF = {'n_estimators': 25, 'max_depth': 20} for base_classifier='RF',
61 | or default_fg_anomaly_params_XGB = {'n_estimators': 30, 'max_depth': 3} for base_classifier='XGB'.
62 |
63 | training_confidence_threshold : float, optional (default=None)
64 | The training confidence threshold (TCT); a value in [0,1].
65 | If None, the classifier uses the defaults according to base_classifier,
66 | i.e., default_training_confidence_threshold_RF = 0.89 for base_classifier='RF',
67 | or default_training_confidence_threshold_XGB = 0.79 for base_classifier='XGB'.
69 |
70 | classification_confidence_threshold : float, optional (default=None)
71 | The classification confidence threshold (CCT); a value in [0,1].
72 | If None, the classifier uses the defaults according to base_classifier,
73 | i.e., default_classification_confidence_threshold_RF = 0.79 for base_classifier='RF',
74 | or default_classification_confidence_threshold_XGB = 0.79 for base_classifier='XGB'.
76 |
77 | collect_telemetry : boolean, optional (default=False)
78 | If True, collect telemetry on the training_data_fraction of the fg normal and anomaly classifiers.
79 | See also the telemetry_ attribute.
80 | 
81 | cg_train_using_feature_subset : list or None, optional (default=None)
82 | List of columns to use for training the cg classifier (when None, all columns are used).
80 |
81 | random_seed : int, optional (default=42)
82 | Random seed.
83 |
84 | verbose : int, optional (default=0)
85 | If 0, prints exceptions only; if 1 or greater, also prints warnings.
86 |
87 | Attributes
88 | ----------
89 |
90 | classes_ : array of shape (n_classes,). The class labels.
91 |
92 | cg_clf_ : Classifier
93 | The cg classifier, either Random Forest (base_classifier='RF') or XGBoost (base_classifier='XGB').
94 |
95 | fg_clf_normal_ : Classifier
96 | The fg normal classifier, either Random Forest (base_classifier='RF') or XGBoost (base_classifier='XGB').
97 |
98 | fg_clf_anomaly_ : Classifier
99 | The fg anomaly classifier, either Random Forest (base_classifier='RF') or XGBoost (base_classifier='XGB').
100 |
104 | cg_only_ : boolean
105 | True if only the cg classifier is fitted.
106 |
107 | telemetry_ : dict (if collect_telemetry is True)
108 | Contains the training_data_fraction of the fg normal and anomaly classifiers.
109 |
110 | Example program
111 | ---------------
112 |
113 | from rade_classifier import RadeClassifier
114 | from sklearn.datasets import make_classification
115 | from sklearn.metrics import classification_report
116 | from sklearn.model_selection import train_test_split
117 |
118 | X, y = make_classification(n_samples=1000000, n_features=4,
119 | n_informative=2, n_redundant=0,
120 | random_state=0, shuffle=False, weights=[0.99, 0.01])
121 |
122 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
123 |
124 | clf = RadeClassifier()
125 |
126 | clf.fit(X_train, y_train)
127 | y_predicted = clf.predict(X_test)
128 | print(classification_report(y_test, y_predicted, digits=5))
129 |
130 | Notes
131 | -----
132 | More details can be found in [1].
133 | See Section 5.2 in order to tune RADE (e.g., by grid-search); a sketch follows.
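
For example, a minimal grid-search sketch over the two confidence thresholds,
assuming X_train and y_train as above (the threshold values shown are
illustrative, not recommendations):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'training_confidence_threshold': [0.79, 0.89, 0.99],
    'classification_confidence_threshold': [0.69, 0.79, 0.89],
}
search = GridSearchCV(RadeClassifier(), param_grid, scoring='f1_macro', cv=3)
search.fit(X_train, y_train)
print(search.best_params_)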
134 |
135 | References
136 | ----------
137 | [1] Shay Vargaftik, Isaac Keslassy, Ariel Orda, Yaniv Ben-Itzhak,
138 | "RADE: Resource-Efficient Supervised Anomaly Detection Using Decision Tree-Based Ensemble Methods"
139 | https://arxiv.org/abs/1909.11877
140 |
141 | """
142 |
143 | ###########################################################################
144 | ###########################################################################
145 |
146 | def __init__(self,
147 |
148 | base_classifier='RF',
149 | random_seed=42,
150 |
151 | cg_params=None,
152 |
153 | fg_normal_params=None,
154 | fg_anomaly_params=None,
155 |
156 | training_confidence_threshold=None,
157 | classification_confidence_threshold=None,
158 |
159 | # default configurations:
160 | # RF:
161 | default_training_confidence_threshold_RF=0.89,
162 | default_classification_confidence_threshold_RF=0.79,
163 |
164 | default_cg_params_RF=
165 | {
166 | 'n_estimators': 10,
167 | 'max_depth': 5
168 | },
169 |
170 | default_fg_normal_params_RF=
171 | {
172 | 'n_estimators': 25,
173 | 'max_depth': 20
174 | },
175 |
176 | default_fg_anomaly_params_RF=
177 | {
178 | 'n_estimators': 25,
179 | 'max_depth': 20
180 | },
181 |
182 | # XGB:
183 | default_training_confidence_threshold_XGB=0.79,
184 | default_classification_confidence_threshold_XGB=0.79,
185 |
186 | default_cg_params_XGB=
187 | {
188 | 'n_estimators': 10,
189 | 'max_depth': 3
190 | },
191 |
192 | default_fg_normal_params_XGB=
193 | {
194 | 'n_estimators': 30,
195 | 'max_depth': 3
196 | },
197 |
198 | default_fg_anomaly_params_XGB=
199 | {
200 | 'n_estimators': 30,
201 | 'max_depth': 3
202 | },
203 |
204 | cg_train_using_feature_subset=None,
205 |
206 | collect_telemetry=False,
207 |
208 | verbose=0
209 |
210 | ):
211 |
212 | self.base_classifier = base_classifier
213 | self.random_seed = random_seed
214 |
215 | self.cg_params = cg_params
216 | self.fg_normal_params = fg_normal_params
217 | self.fg_anomaly_params = fg_anomaly_params
218 |
219 | self.training_confidence_threshold = training_confidence_threshold
220 | self.classification_confidence_threshold = classification_confidence_threshold
221 |
222 | self.collect_telemetry = collect_telemetry
223 |
224 | self.cg_train_using_feature_subset = cg_train_using_feature_subset
225 |
226 | self.verbose = verbose
227 |
228 | ### RF defaults
229 | self.default_training_confidence_threshold_RF = default_training_confidence_threshold_RF
230 | self.default_classification_confidence_threshold_RF = default_classification_confidence_threshold_RF
231 | self.default_cg_params_RF = default_cg_params_RF
232 | self.default_fg_normal_params_RF = default_fg_normal_params_RF
233 | self.default_fg_anomaly_params_RF = default_fg_anomaly_params_RF
234 |
235 | ### XGBoost defaults
236 | self.default_training_confidence_threshold_XGB = default_training_confidence_threshold_XGB
237 | self.default_classification_confidence_threshold_XGB = default_classification_confidence_threshold_XGB
238 | self.default_cg_params_XGB = default_cg_params_XGB
239 | self.default_fg_normal_params_XGB = default_fg_normal_params_XGB
240 | self.default_fg_anomaly_params_XGB = default_fg_anomaly_params_XGB
241 |
242 | ###########################################################################
243 | ###########################################################################
244 | def verify_parameters(self, X, y):
245 |
246 | if self.classification_confidence_threshold is not None and self.training_confidence_threshold is None:
247 | if self.base_classifier == 'RF':
248 | if self.classification_confidence_threshold > self.default_training_confidence_threshold_RF:
249 | if self.verbose > 0:
250 | print(
251 | "Warning: classification_confidence_threshold ({}) > "
252 | "default_training_confidence_threshold_RF ({}).\n".
253 | format(self.classification_confidence_threshold,
254 | self.default_training_confidence_threshold_RF))
255 | elif self.base_classifier == 'XGB':
256 | if self.classification_confidence_threshold > self.default_training_confidence_threshold_XGB:
257 | if self.verbose > 0:
258 | print(
259 | "Warning: classification_confidence_threshold ({}) > "
260 | "default_training_confidence_threshold_XGB ({}).\n".
261 | format(self.classification_confidence_threshold,
262 | self.default_training_confidence_threshold_XGB))
263 | else:
264 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier))
265 |
266 | elif self.classification_confidence_threshold is None and self.training_confidence_threshold is not None:
267 | if self.base_classifier == 'RF':
268 | if self.default_classification_confidence_threshold_RF > self.training_confidence_threshold:
269 | if self.verbose > 0:
270 | print(
271 | "Warning: default_classification_confidence_threshold_RF ({}) > "
272 | "training_confidence_threshold ({}).\n".
273 | format(self.default_classification_confidence_threshold_RF,
274 | self.training_confidence_threshold))
275 | elif self.base_classifier == 'XGB':
276 | if self.default_classification_confidence_threshold_XGB > self.training_confidence_threshold:
277 | if self.verbose > 0:
278 | print(
279 | "Warning: default_classification_confidence_threshold_XGB ({}) > "
280 | "training_confidence_threshold ({}).\n".
281 | format(self.default_classification_confidence_threshold_XGB,
282 | self.training_confidence_threshold))
283 | else:
284 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier))
285 |
286 | elif self.classification_confidence_threshold is not None and self.training_confidence_threshold is not None:
287 | if self.classification_confidence_threshold > self.training_confidence_threshold:
288 | if self.verbose > 0:
289 | print("Warning: classification_confidence_threshold ({}) > training_confidence_threshold ({}).\n".
290 | format(self.classification_confidence_threshold, self.training_confidence_threshold))
291 |
292 | if self.cg_train_using_feature_subset is not None:
293 | ### empty is not allowed
294 | if not len(self.cg_train_using_feature_subset):
295 | raise Exception(
296 | "Illegal cg_train_using_feature_subset (err1): {}\nShould be None or specify unique columns".format(
297 | self.cg_train_using_feature_subset))
298 |
299 | ### duplicates are not allowed
300 | if len(self.cg_train_using_feature_subset) != len(set(self.cg_train_using_feature_subset)):
301 | raise Exception(
302 | "Illegal cg_train_using_feature_subset (err2): {}\nShould be None or specify unique columns".format(
303 | self.cg_train_using_feature_subset))
304 |
305 | ### translate column names (if X is a dataframe) to indices
306 | if isinstance(X, pd.DataFrame):
307 | if all(elem in X.columns for elem in self.cg_train_using_feature_subset):
308 | self.cg_train_using_feature_subset = [X.columns.get_loc(i) for i in
309 | self.cg_train_using_feature_subset]
310 |
311 | ### verify legal column values
312 | if not set(self.cg_train_using_feature_subset).issubset(set(range(X.shape[1]))):
313 | raise Exception(
314 | "Illegal cg_train_using_feature_subset (err3): {}\nShould be None or specify unique columns".format(
315 | self.cg_train_using_feature_subset))
316 |
317 | ###########################################################################
318 | ###########################################################################
319 |
320 | def fit(self, X, y):
321 |
322 | ### set numpy seed
323 | np.random.seed(self.random_seed)
324 |
325 | ### base classifier type options
326 | baseClassifierTypes = {
327 |
328 | 'RF': RandomForestClassifier,
329 | 'XGB': XGBClassifier
330 |
331 | }
332 |
333 | ## RADE parameters input checks
334 | self.verify_parameters(X, y)
335 |
336 | ### input verification - required by scikit
337 | X, y = check_X_y(X, y)
338 |
339 | ### store the classes seen during fit - required by scikit
340 | self.classes_ = unique_labels(y)
341 |
342 | ### store the number of features passed to the fit method
343 | self.n_features_in_ = X.shape[1]
344 |
345 | ### binary classifier
346 | if len(self.classes_) >= 3:
347 | raise Exception("RADE is a binary classifier")
348 |
349 | ### collect telemetry
350 | if self.collect_telemetry:
351 | self.telemetry_ = {}
352 | self.telemetry_['normal_fg_training_data_fraction'] = 0
353 | self.telemetry_['anomaly_fg_training_data_fraction'] = 0
354 |
355 | ### init coarse-grained (cg) classifier
356 | self.cg_clf_ = baseClassifierTypes[self.base_classifier](random_state=self.random_seed)
357 |
358 | ### set cg params
359 | if self.cg_params is None:
360 | if self.verbose > 0:
361 | print("Warning: no kwards for the coarse-grained model. Use the default configuration.\n")
362 | if self.base_classifier == 'RF':
363 | self.cg_clf_.set_params(**self.default_cg_params_RF)
364 | elif self.base_classifier == 'XGB':
365 | self.cg_clf_.set_params(**self.default_cg_params_XGB)
366 | else:
367 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier))
368 | else:
369 | self.cg_clf_.set_params(**self.cg_params)
370 |
371 | ### train cg
372 | if self.cg_train_using_feature_subset is None:
373 | self.cg_clf_.fit(X, y)
374 | else:
375 | self.cg_clf_.fit(X[:, self.cg_train_using_feature_subset], y)
376 |
377 | ### tags
378 | try:
379 | self.__normal_tag_ = np.min(self.classes_)
380 | self.__anomaly_tag_ = np.max(self.classes_)
381 | except Exception:
382 | self.__normal_tag_ = self.classes_[0]
383 | self.__anomaly_tag_ = self.classes_[1]
384 |
385 | ### single class
386 | if self.__normal_tag_ == self.__anomaly_tag_:
387 | self.cg_only_ = True
388 | if self.verbose > 0:
389 | print("Warning: received only a single class for training, no fg models.\n")
390 | return self
391 | else:
392 | self.cg_only_ = False
393 |
394 | ### init fine-grained (fg) classifiers
395 | self.fg_clf_normal_ = baseClassifierTypes[self.base_classifier](random_state=self.random_seed)
396 | self.fg_clf_anomaly_ = baseClassifierTypes[self.base_classifier](random_state=self.random_seed)
397 |
398 | ### set fg normal params
399 | if self.fg_normal_params is None:
400 | if self.verbose > 0:
401 | print("Warning: no kwards for the fine-grained normal model. Use the default configuration.\n")
402 | if self.base_classifier == 'RF':
403 | self.fg_clf_normal_.set_params(**self.default_fg_normal_params_RF)
404 | elif self.base_classifier == 'XGB':
405 | self.fg_clf_normal_.set_params(**self.default_fg_normal_params_XGB)
406 | else:
407 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier))
408 | else:
409 | self.fg_clf_normal_.set_params(**self.fg_normal_params)
410 |
411 | ### set fg anomaly params
412 | if self.fg_anomaly_params is None:
413 | if self.verbose > 0:
414 | print("Warning: no kwards for the fine-grained anomaly model. Use the default configuration.\n")
415 | if self.base_classifier == 'RF':
416 | self.fg_clf_anomaly_.set_params(**self.default_fg_anomaly_params_RF)
417 | elif self.base_classifier == 'XGB':
418 | self.fg_clf_anomaly_.set_params(**self.default_fg_anomaly_params_XGB)
419 | else:
420 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier))
421 | else:
422 | self.fg_clf_anomaly_.set_params(**self.fg_anomaly_params)
423 |
424 | ### classify training data by cg to obtain metadata
425 | if self.cg_train_using_feature_subset is None:
426 | cg_classification_distribution = self.cg_clf_.predict_proba(X)
427 | else:
428 | cg_classification_distribution = self.cg_clf_.predict_proba(X[:, self.cg_train_using_feature_subset])
429 |
430 | cg_classification = np.take(self.classes_, np.argmax(cg_classification_distribution, axis=1))
431 | cg_classification_confidence = np.max(cg_classification_distribution, axis=1)
432 |
433 | ### prepare train data filters
434 | if self.training_confidence_threshold is None:
435 | if self.base_classifier == 'RF':
436 | cg_low_confidence_indeces = (cg_classification_confidence <
437 | self.default_training_confidence_threshold_RF)
438 | elif self.base_classifier == 'XGB':
439 | cg_low_confidence_indeces = (
440 | cg_classification_confidence < self.default_training_confidence_threshold_XGB)
441 | else:
442 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier))
443 | else:
444 | cg_low_confidence_indeces = (cg_classification_confidence < self.training_confidence_threshold)
445 |
446 | true_anomaly_indeces = (y == self.__anomaly_tag_)
447 | cg_normal_classification_indeces = (cg_classification == self.__normal_tag_)
448 | cg_anomaly_classifications_indeces = (cg_classification == self.__anomaly_tag_)
449 |
450 | ### training data for fg models
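### (each fg expert trains only on low-cg-confidence samples: the normal expert
### on samples the cg classified as normal plus low-confidence true anomalies,
### and the anomaly expert on samples the cg classified as anomalies plus
### low-confidence true anomalies)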
451 | fg_normal_training_data_filter = cg_low_confidence_indeces & (
452 | true_anomaly_indeces | cg_normal_classification_indeces)
453 | fg_normal_training_data_X = X[fg_normal_training_data_filter]
454 | fg_normal_training_data_y = y[fg_normal_training_data_filter]
455 |
456 | fg_anomaly_training_data_filter = cg_low_confidence_indeces & (
457 | true_anomaly_indeces | cg_anomaly_classifications_indeces)
458 | fg_anomaly_training_data_X = X[fg_anomaly_training_data_filter]
459 | fg_anomaly_training_data_y = y[fg_anomaly_training_data_filter]
460 |
461 | ### train the fg models
462 | if len(unique_labels(fg_normal_training_data_y)) == 2 and sum(fg_normal_training_data_filter) > 1:
463 | ### collect telemetry
464 | if self.collect_telemetry:
465 | self.telemetry_['normal_fg_training_data_fraction'] = sum(fg_normal_training_data_filter) / len(
466 | fg_normal_training_data_filter)
467 | ### train
468 | self.fg_clf_normal_.fit(fg_normal_training_data_X, fg_normal_training_data_y)
469 | self.fg_normal_fitted_ = True
470 | else:
471 | if self.verbose > 0:
472 | print("Warning: no fine-grained normal model training.\n")
473 | self.fg_normal_fitted_ = False
474 |
475 | if len(unique_labels(fg_anomaly_training_data_y)) == 2 and sum(fg_anomaly_training_data_filter) > 1:
476 | ### collect telemetry
477 | if self.collect_telemetry:
478 | self.telemetry_['anomaly_fg_training_data_fraction'] = sum(fg_anomaly_training_data_filter) / len(
479 | fg_anomaly_training_data_filter)
480 | ### train
481 | self.fg_clf_anomaly_.fit(fg_anomaly_training_data_X, fg_anomaly_training_data_y)
482 | self.fg_anomaly_fitted_ = True
483 | else:
484 | if self.verbose > 0:
485 | print("Warning: no fine-grained anomaly model training.\n")
486 | self.fg_anomaly_fitted_ = False
487 |
488 | ### for speed
489 | if not self.fg_normal_fitted_ and not self.fg_anomaly_fitted_:
490 | self.cg_only_ = True
491 |
492 | ### a call to fit should return the classifier - required by scikit
493 | return self
494 |
495 | ###########################################################################
496 | ###########################################################################
497 |
498 | def predict_basic(self, X, proba=False):
499 |
500 | ### set numpy seed
501 | np.random.seed(self.random_seed)
502 |
503 | ### check that fit has been called - required by scikit
504 | check_is_fitted(self)
505 |
506 | ### input verification - required by scikit
507 | X = check_array(X)
508 |
509 | ### collect telemetry
510 | if self.collect_telemetry:
511 | self.telemetry_['normal_fg_test_data_fraction'] = 0
512 | self.telemetry_['anomaly_fg_test_data_fraction'] = 0
513 |
514 | ### no fg models?
515 | if self.cg_only_:
516 | if not proba:
517 | if self.cg_train_using_feature_subset is None:
518 | return self.cg_clf_.predict(X)
519 | else:
520 | return self.cg_clf_.predict(X[:, self.cg_train_using_feature_subset])
521 | else:
522 | if self.cg_train_using_feature_subset is None:
523 | return self.cg_clf_.predict_proba(X)
524 | else:
525 | return self.cg_clf_.predict_proba(X[:, self.cg_train_using_feature_subset])
526 |
527 | ### classify test data by cg to obtain metadata
528 | if self.cg_train_using_feature_subset is None:
529 | cg_classification_distribution = self.cg_clf_.predict_proba(X)
530 | else:
531 | cg_classification_distribution = self.cg_clf_.predict_proba(X[:, self.cg_train_using_feature_subset])
532 |
533 | cg_classification = np.take(self.classes_, np.argmax(cg_classification_distribution, axis=1))
534 | cg_classification_confidence = np.max(cg_classification_distribution, axis=1)
535 |
536 | ### prepare test data filters
537 | if self.classification_confidence_threshold is None:
538 | if self.base_classifier == 'RF':
539 | cg_low_confidence_indeces = (cg_classification_confidence <
540 | self.default_classification_confidence_threshold_RF)
541 | elif self.base_classifier == 'XGB':
542 | cg_low_confidence_indeces = (cg_classification_confidence <
543 | self.default_classification_confidence_threshold_XGB)
544 | else:
545 | raise Exception('Unsupported base_classifier {}'.format(self.base_classifier))
546 | else:
547 | cg_low_confidence_indeces = (cg_classification_confidence < self.classification_confidence_threshold)
548 |
549 | normal_cg_classification_indeces = (cg_classification == self.__normal_tag_)
550 | anomaly_cg_classifications_indeces = (cg_classification == self.__anomaly_tag_)
551 |
552 | ### test data for fg models
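### (a low-confidence cg classification is routed to the fg expert matching the
### cg's label; the expert's prediction overrides the cg result, while
### high-confidence cg classifications are returned as-is)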
553 | fg_normal_test_data_filter = cg_low_confidence_indeces & normal_cg_classification_indeces
554 | fg_normal_test_data = X[fg_normal_test_data_filter]
555 |
556 | fg_anomaly_test_data_filter = cg_low_confidence_indeces & anomaly_cg_classifications_indeces
557 | fg_anomaly_test_data = X[fg_anomaly_test_data_filter]
558 |
559 | ### predict
560 | if not proba:
561 |
562 | classification_results = cg_classification
563 |
564 | if self.fg_normal_fitted_ and np.any(fg_normal_test_data_filter):
565 |
566 | ### collect telemetry
567 | if self.collect_telemetry:
568 | self.telemetry_['normal_fg_test_data_fraction'] = sum(fg_normal_test_data_filter) / len(
569 | fg_normal_test_data_filter)
570 |
571 | classification_results[fg_normal_test_data_filter] = self.fg_clf_normal_.predict(fg_normal_test_data)
572 |
573 | if self.fg_anomaly_fitted_ and np.any(fg_anomaly_test_data_filter):
574 |
575 | ### collect telemetry
576 | if self.collect_telemetry:
577 | self.telemetry_['anomaly_fg_test_data_fraction'] = sum(fg_anomaly_test_data_filter) / len(
578 | fg_anomaly_test_data_filter)
579 |
580 | classification_results[fg_anomaly_test_data_filter] = self.fg_clf_anomaly_.predict(fg_anomaly_test_data)
581 |
582 | return classification_results
583 |
584 | ### predict proba
585 | else:
586 |
587 | classification_distribution_results = cg_classification_distribution
588 |
589 | if self.fg_normal_fitted_ and np.any(fg_normal_test_data_filter):
590 |
591 | ### collect telemetry
592 | if self.collect_telemetry:
593 | self.telemetry_['normal_fg_test_data_fraction'] = sum(fg_normal_test_data_filter) / len(
594 | fg_normal_test_data_filter)
595 |
596 | classification_distribution_results[fg_normal_test_data_filter] = self.fg_clf_normal_.predict_proba(
597 | fg_normal_test_data)
598 |
599 | if self.fg_anomaly_fitted_ and np.any(fg_anomaly_test_data_filter):
600 |
601 | ### collect telemetry
602 | if self.collect_telemetry:
603 | self.telemetry_['anomaly_fg_test_data_fraction'] = sum(fg_anomaly_test_data_filter) / len(
604 | fg_anomaly_test_data_filter)
605 |
606 | classification_distribution_results[fg_anomaly_test_data_filter] = self.fg_clf_anomaly_.predict_proba(
607 | fg_anomaly_test_data)
608 |
609 | return classification_distribution_results
610 |
611 | ###########################################################################
612 |
613 | ###########################################################################
614 |
615 | def predict(self, X):
616 |
617 | return self.predict_basic(X)
618 |
619 | def predict_proba(self, X):
620 |
621 | return self.predict_basic(X, proba=True)
622 |
623 | ###########################################################################
624 | ###########################################################################
625 |
626 | ### getters
627 |
628 | def get_telemetry(self):
629 |
630 | try:
631 | return self.telemetry_
632 | except AttributeError:
633 | print("\nError: get_telemetry was called but telemetry is disabled.\n")
634 |
635 | def get_sub_classifier(self, clf):
636 |
637 | if clf == 'cg':
638 | return self.cg_clf_
639 |
640 | elif clf == 'fg_normal':
641 | if self.fg_normal_fitted_:
642 | return self.fg_clf_normal_
643 | else:
644 | return None
645 |
646 | elif clf == 'fg_anomaly':
647 | if self.fg_anomaly_fitted_:
648 | return self.fg_clf_anomaly_
649 | else:
650 | return None
651 |
652 | else:
653 | raise Exception("unknown sub-classifier type, possible options are: cg / fg_normal / fg_anomaly")
654 |
655 | ###########################################################################
656 | ###########################################################################
657 | def _more_tags(self):
658 |
659 | return {'binary_only': True}
660 |
661 | ###########################################################################
662 | ###########################################################################
663 |
--------------------------------------------------------------------------------