├── README.md ├── datasets ├── arrhythmia.mat ├── cardio.mat ├── letter.mat ├── mammography.mat ├── mnist.mat ├── satellite.mat └── speech.mat ├── figs ├── flowchart.png ├── results.png ├── sample_outputs.png └── t-SNE.png ├── models ├── __init__.py ├── generate_TOS.py ├── glosh.py ├── hbos.py ├── knn.py ├── select_TOS.py └── utility.py ├── plots.py ├── requirements.txt ├── xgbod_demo.py └── xgbod_full.py /README.md: -------------------------------------------------------------------------------- 1 | # XGBOD (Extreme Boosting Based Outlier Detection) 2 | 3 | ------------ 4 | 5 | Zhao, Y. and Hryniewicki, M.K., "XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning," *International Joint Conference on Neural Networks (IJCNN)*, IEEE, 2018. 6 | 7 | Please cite the paper as: 8 | 9 | @inproceedings{zhao2018xgbod, 10 | title={XGBOD: improving supervised outlier detection with unsupervised representation learning}, 11 | author={Zhao, Yue and Hryniewicki, Maciej K}, 12 | booktitle={2018 International Joint Conference on Neural Networks (IJCNN)}, 13 | pages={1--8}, 14 | year={2018}, 15 | organization={IEEE} 16 | } 17 | 18 | [PDF](https://www.andrew.cmu.edu/user/yuezhao2/papers/18-ijcnn-xgbod.pdf) | 19 | [IEEE Explore](https://ieeexplore.ieee.org/document/8489605) | 20 | [API Documentation](https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.xgbod) | 21 | [Example with PyOD](https://github.com/yzhao062/pyod/blob/master/examples/xgbod_example.py) 22 | 23 | 24 | **Update** (Dec 25th, 2018): XGBOD has been officially released in **[Python Outlier Detection (PyOD)](https://github.com/yzhao062/pyod)** V0.6.6. 25 | 26 | **Update** (Dec 6th, 2018): XGBOD has been implemented in **[Python Outlier Detection (PyOD)](https://github.com/yzhao062/pyod)**, to be released in pyod V0.6.6. 27 | 28 | ------------ 29 | 30 | Additional notes: 31 | 1. Two versions of the code are provided: 32 | 1. **Demo version** (xgbod_demo.py) is refactored for fast execution and reproduction as a proof of concept. The key difference from the full version is that the TOS are built once on both the training and test data, which can be regarded as static unsupervised feature engineering. However, note that in practice users should not expose or use the test data while building TOS. 33 | 2. **Production version** ([Python Outlier Detection (PyOD)](https://github.com/yzhao062/pyod)) is released with full optimization and testing as a framework. This version is intended for real applications; it requires fewer dependencies and executes faster. 34 | 2. It is understood that there are **small variations** in the results due to randomness in processes such as xgboost and Random TOS Selection. Running the demo code will therefore give similar, but not identical, results. Additionally, the specific setups differ slightly across datasets and have not been published yet. 35 | 3. While running *L1_Comb* and *L2_Comb*, EasyEnsemble is used to construct balanced bags. Note that the demo code uses 10 bags instead of 50 to execute efficiently; increasing to 50 bags would not change the results much but would bring better stability. You are welcome to change the "BalancedBaggingClassifier" parameters to use 50 bags, as in the sketch below. However, it is very slow, and this is one of the reasons why we propose XGBOD -- it is much more efficient :)
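For reference, a minimal sketch of such a configuration with imbalanced-learn (a hedged example rather than the exact experiment setup; the logistic regression base estimator is only an illustrative choice, and parameter names follow the imbalanced-learn 0.3.x API pinned in requirements.txt):

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.linear_model import LogisticRegression

# EasyEnsemble-style balanced bagging with 50 bags
# (the demo code keeps the library default of 10 bags for speed)
clf = BalancedBaggingClassifier(base_estimator=LogisticRegression(),
                                n_estimators=50,
                                ratio='auto',
                                replacement=False)
```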
36 | 37 | ------------ 38 | 39 | ## Introduction 40 | XGBOD is a three-phase framework (see Figure below). In the first phase, it generates new data representations: various unsupervised outlier detection methods are applied to the original data to obtain transformed outlier scores (TOS) as new data representations. In the second phase, a selection process is performed on the newly generated outlier scores to keep the useful ones. The selected outlier scores are then combined with the original features to form the new feature space. Finally, an XGBoost model is trained on the new feature space, and its output determines the outlier prediction result. 41 | 42 | ![XGBOD Flowchart](https://github.com/yzhao062/XGBOD/blob/master/figs/flowchart.png "XGBOD Flowchart") 43 | 44 | ## Dependency 45 | The experiment code is written in Python 3 and built on a number of Python packages: 46 | - matplotlib==2.0.2 47 | - xgboost==0.7 48 | - pandas==0.21.0 49 | - imbalanced_learn==0.3.2 50 | - scipy==0.19.1 51 | - numpy==1.13.1 52 | - PyNomaly==0.1.7 53 | - imblearn==0.0 54 | - scikit_learn==0.19.1 55 | 56 | Batch installation is possible using the supplied "requirements.txt": 57 | 58 | ````cmd 59 | pip install -r requirements.txt 60 | ```` 61 | 62 | ------------ 63 | 64 | 65 | ## Datasets 66 | Seven datasets are used (see the datasets folder): 67 | 68 | | Datasets | Sample Size | Dimension | Number of Outliers | 69 | | ------------ | -----------| ------------ | ------------------- | 70 | | Arrhythmia | 351 | 274 | 126 (36%) | 71 | | Letter | 1600 | 32 | 100 (6.25%) | 72 | | Cardio | 1831 | 21 | 176 (9.6%) | 73 | | Speech | 3686 | 600 | 61 (1.65%) | 74 | | Satellite | 6435 | 36 | 2036 (31.64%) | 75 | | Mnist | 7603 | 100 | 700 (9.21%) | 76 | | Mammography | 11183 | 6 | 260 (2.32%) | 77 | 78 | All datasets are accessible at http://odds.cs.stonybrook.edu/. For citing the datasets, please refer to: 79 | > Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science. 80 | 81 | ------------ 82 | 83 | 84 | ## Usage and Sample Output (Demo Version) 85 | Experiments can be reproduced by running **xgbod_demo.py** directly: simply download/clone the entire repository and execute the code with "python xgbod_demo.py". 86 | 87 | The first part of the code reads in the datasets using SciPy; five of the public outlier datasets are wired into the demo. Then various TOS are built by seven different algorithms: 88 | 1. KNN 89 | 2. K-Median 90 | 3. AvgKNN 91 | 4. LOF 92 | 5. LoOP 93 | 6. One-Class SVM 94 | 7. Isolation Forests 95 | **Note that you may include more TOS.** 96 | 97 | Taking KNN as an example, the code is as below: 98 | 99 | ```python 100 | # Generate TOS using KNN-based algorithms 101 | feature_list, roc_knn, prc_n_knn, result_knn = get_TOS_knn(X_norm, y, k_range, feature_list) 102 | ``` 103 | 104 | Then three TOS selection methods are used to select *p* TOS: 105 | 106 | ```python 107 | 108 | p = 10 # number of selected TOS 109 | # random selection 110 | X_train_new_rand, X_train_all_rand = random_select(X, X_train_new_orig, roc_list, p) 111 | # accurate selection 112 | X_train_new_accu, X_train_all_accu = accurate_select(X, X_train_new_orig, roc_list, p) 113 | # balance selection 114 | X_train_new_bal, X_train_all_bal = balance_select(X, X_train_new_orig, roc_list, p) 115 | ``` 116 | 117 | Finally, various classification methods are applied to the datasets. 
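For context, here is a condensed, illustrative sketch of this final stage, mirroring xgbod_demo.py (`X_train_all_accu` and `y` come from the snippets above; the demo's repetition loop and alternative feature sets are omitted):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost.sklearn import XGBClassifier

# split the combined feature space (original features + selected TOS)
X_train, X_test, y_train, y_test = train_test_split(X_train_all_accu, y,
                                                    test_size=0.4)

# train XGBoost on the new feature space and score the held-out data
clf = XGBClassifier()
clf.fit(X_train, y_train.ravel())
y_pred = clf.predict_proba(X_test)
print('ROC:', roc_auc_score(y_test, y_pred[:, 1]))
```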
Sample outputs are provided below: 118 | 119 | ![Sample Outputs on Arrhythmia](https://github.com/yzhao062/XGBOD/blob/master/figs/sample_outputs.png "Sample Outputs on Arrhythmia") 120 | ------------ 121 | ## Figures 122 | 123 | Running **plots.py** would generate the figures for various TOS selection algorithms: 124 | ![The effect of number of TOS and selection method](https://github.com/yzhao062/XGBOD/blob/master/figs/results.png "The effect of number of TOS and selection method") 125 | 126 | -------------------------------------------------------------------------------- /datasets/arrhythmia.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/arrhythmia.mat -------------------------------------------------------------------------------- /datasets/cardio.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/cardio.mat -------------------------------------------------------------------------------- /datasets/letter.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/letter.mat -------------------------------------------------------------------------------- /datasets/mammography.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/mammography.mat -------------------------------------------------------------------------------- /datasets/mnist.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/mnist.mat -------------------------------------------------------------------------------- /datasets/satellite.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/satellite.mat -------------------------------------------------------------------------------- /datasets/speech.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/speech.mat -------------------------------------------------------------------------------- /figs/flowchart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/figs/flowchart.png -------------------------------------------------------------------------------- /figs/results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/figs/results.png -------------------------------------------------------------------------------- /figs/sample_outputs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/figs/sample_outputs.png -------------------------------------------------------------------------------- 
/figs/t-SNE.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/figs/t-SNE.png -------------------------------------------------------------------------------- /models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/models/__init__.py -------------------------------------------------------------------------------- /models/generate_TOS.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from models.utility import get_precn 4 | from sklearn.metrics import roc_auc_score 5 | from sklearn.neighbors import NearestNeighbors 6 | from sklearn.neighbors import LocalOutlierFactor 7 | from sklearn.svm import OneClassSVM 8 | from sklearn.ensemble import IsolationForest 9 | from PyNomaly import loop 10 | from models.hbos import Hbos 11 | 12 | 13 | def knn(X, n_neighbors): 14 | ''' 15 | Utility function to return k-average, k-median, and kth-NN distances. 16 | Since these three measures are similar, they are included in the same function. 17 | :param X: train data 18 | :param n_neighbors: number of neighbors 19 | :return: k-average, k-median, and kth nearest neighbor distances 20 | ''' 21 | neigh = NearestNeighbors() 22 | neigh.fit(X) 23 | 24 | res = neigh.kneighbors(n_neighbors=n_neighbors, return_distance=True) 25 | # k-average, k-median, knn 26 | return np.mean(res[0], axis=1), np.median(res[0], axis=1), res[0][:, -1] 27 | 28 | 29 | def get_TOS_knn(X, y, k_list, feature_list): 30 | knn_clf = ["knn_mean", "knn_median", "knn_kth"] 31 | 32 | result_knn = np.zeros([X.shape[0], len(k_list) * len(knn_clf)]) 33 | roc_knn = [] 34 | prec_knn = [] 35 | 36 | for i in range(len(k_list)): 37 | k = k_list[i] 38 | k_mean, k_median, k_k = knn(X, n_neighbors=k) 39 | knn_result = [k_mean, k_median, k_k] 40 | 41 | for j in range(len(knn_result)): 42 | score_pred = knn_result[j] 43 | clf = knn_clf[j] 44 | 45 | roc = np.round(roc_auc_score(y, score_pred), decimals=4) 46 | # apc = np.round(average_precision_score(y, score_pred), decimals=4) 47 | prec_n = np.round(get_precn(y, score_pred), decimals=4) 48 | print('{clf} @ {k} - ROC: {roc} Precision@n: {pren}'. 
49 | format(clf=clf, k=k, roc=roc, pren=prec_n)) 50 | feature_list.append(clf + str(k)) 51 | roc_knn.append(roc) 52 | prec_knn.append(prec_n) 53 | result_knn[:, i * len(knn_result) + j] = score_pred 54 | 55 | print() 56 | return feature_list, roc_knn, prec_knn, result_knn 57 | 58 | 59 | def get_TOS_loop(X, y, k_list, feature_list): 60 | # LoOP is only compatible with pandas DataFrames 61 | df_X = pd.DataFrame(X) 62 | 63 | result_loop = np.zeros([X.shape[0], len(k_list)]) 64 | roc_loop = [] 65 | prec_loop = [] 66 | 67 | for i in range(len(k_list)): 68 | k = k_list[i] 69 | clf = loop.LocalOutlierProbability(df_X, n_neighbors=k).fit() 70 | score_pred = clf.local_outlier_probabilities.astype(float) 71 | 72 | roc = np.round(roc_auc_score(y, score_pred), decimals=4) 73 | # apc = np.round(average_precision_score(y, score_pred), decimals=4) 74 | prec_n = np.round(get_precn(y, score_pred), decimals=4) 75 | 76 | print('LoOP @ {k} - ROC: {roc} Precision@n: {pren}'.format(k=k, 77 | roc=roc, 78 | pren=prec_n)) 79 | 80 | feature_list.append('loop_' + str(k)) 81 | roc_loop.append(roc) 82 | prec_loop.append(prec_n) 83 | result_loop[:, i] = score_pred 84 | print() 85 | return feature_list, roc_loop, prec_loop, result_loop 86 | 87 | 88 | def get_TOS_lof(X, y, k_list, feature_list): 89 | result_lof = np.zeros([X.shape[0], len(k_list)]) 90 | roc_lof = [] 91 | prec_lof = [] 92 | 93 | for i in range(len(k_list)): 94 | k = k_list[i] 95 | clf = LocalOutlierFactor(n_neighbors=k) 96 | y_pred = clf.fit_predict(X)  # fitting also computes negative_outlier_factor_ 97 | score_pred = clf.negative_outlier_factor_ 98 | 99 | roc = np.round(roc_auc_score(y, score_pred * -1), decimals=4) 100 | # apc = np.round(average_precision_score(y, score_pred * -1), decimals=4) 101 | prec_n = np.round(get_precn(y, score_pred * -1), decimals=4) 102 | print('LOF @ {k} - ROC: {roc} Precision@n: {pren}'.format(k=k, 103 | roc=roc, 104 | pren=prec_n)) 105 | 106 | feature_list.append('lof_' + str(k)) 107 | roc_lof.append(roc) 108 | prec_lof.append(prec_n) 109 | result_lof[:, i] = score_pred * -1 110 | print() 111 | return feature_list, roc_lof, prec_lof, result_lof 112 | 113 | 114 | def get_TOS_hbos(X, y, k_list, feature_list): 115 | # HBOS tries a fixed set of bin counts; override the passed-in k_list 116 | k_list = [3, 5, 7, 9, 12, 15, 20, 25, 30, 50] 117 | result_hbos = np.zeros([X.shape[0], len(k_list)]) 118 | roc_hbos = [] 119 | prec_hbos = [] 120 | for i in range(len(k_list)): 121 | k = k_list[i] 122 | clf = Hbos(bins=k, alpha=0.3) 123 | clf.fit(X) 124 | score_pred = clf.decision_scores 125 | 126 | roc = np.round(roc_auc_score(y, score_pred), decimals=4) 127 | # apc = np.round(average_precision_score(y, score_pred * -1), decimals=4) 128 | prec_n = np.round(get_precn(y, score_pred), decimals=4) 129 | print('HBOS @ {k} - ROC: {roc} Precision@n: {pren}'.format(k=k, 130 | roc=roc, 131 | pren=prec_n)) 132 | 133 | feature_list.append('hbos_' + str(k)) 134 | roc_hbos.append(roc) 135 | prec_hbos.append(prec_n) 136 | result_hbos[:, i] = score_pred 137 | print() 138 | return feature_list, roc_hbos, prec_hbos, result_hbos 139 | 140 | 141 | def get_TOS_svm(X, y, nu_list, feature_list): 142 | result_ocsvm = np.zeros([X.shape[0], len(nu_list)]) 143 | roc_ocsvm = [] 144 | prec_ocsvm = [] 145 | 146 | for i in range(len(nu_list)): 147 | nu = nu_list[i] 148 | clf = OneClassSVM(nu=nu) 149 | clf.fit(X) 150 | score_pred = clf.decision_function(X) 151 | 152 | roc = np.round(roc_auc_score(y, score_pred * -1), decimals=4) 153 | 154 | # apc = np.round(average_precision_score(y, score_pred * -1), decimals=4) 155 | prec_n = np.round( 156 | get_precn(y, score_pred * -1), decimals=4) 157 | print('svm @ 
{nu} - ROC: {roc} Precision@n: {pren}'.format(nu=nu, 158 | roc=roc, 159 | pren=prec_n)) 160 | feature_list.append('ocsvm_' + str(nu)) 161 | roc_ocsvm.append(roc) 162 | prec_ocsvm.append(prec_n) 163 | result_ocsvm[:, i] = score_pred.reshape(score_pred.shape[0]) * -1 164 | print() 165 | return feature_list, roc_ocsvm, prec_ocsvm, result_ocsvm 166 | 167 | 168 | def get_TOS_iforest(X, y, n_list, feature_list): 169 | result_if = np.zeros([X.shape[0], len(n_list)]) 170 | roc_if = [] 171 | prec_if = [] 172 | 173 | for i in range(len(n_list)): 174 | n = n_list[i] 175 | clf = IsolationForest(n_estimators=n) 176 | clf.fit(X) 177 | score_pred = clf.decision_function(X) 178 | 179 | roc = np.round(roc_auc_score(y, score_pred * -1), decimals=4) 180 | prec_n = np.round(get_precn(y, y_pred=(score_pred * -1)), decimals=4) 181 | 182 | print('Isolation Forest @ {n} - ROC: {roc} Precision@n: {pren}'.format( 183 | n=n, 184 | roc=roc, 185 | pren=prec_n)) 186 | feature_list.append('if_' + str(n)) 187 | roc_if.append(roc) 188 | prec_if.append(prec_n) 189 | result_if[:, i] = score_pred.reshape(score_pred.shape[0]) * -1 190 | print() 191 | return feature_list, roc_if, prec_if, result_if 192 | -------------------------------------------------------------------------------- /models/glosh.py: -------------------------------------------------------------------------------- 1 | import hdbscan 2 | import numpy as np 3 | from sklearn.preprocessing import StandardScaler 4 | from models.utility import get_precn 5 | 6 | 7 | class Glosh(object): 8 | def __init__(self, min_cluster_size=5): 9 | self.min_cluster_size = min_cluster_size 10 | 11 | def fit(self, X_train): 12 | self.X_train = X_train 13 | 14 | def sample_scores(self, X_test): 15 | # initialize the outputs 16 | pred_score = np.zeros([X_test.shape[0], 1]) 17 | 18 | for i in range(X_test.shape[0]): 19 | x_i = X_test[i, :] 20 | 21 | x_i = np.asarray(x_i).reshape(1, x_i.shape[0]) 22 | x_comb = np.concatenate((self.X_train, x_i), axis=0) 23 | 24 | x_comb_norm = StandardScaler().fit_transform(x_comb) 25 | 26 | clusterer = hdbscan.HDBSCAN() 27 | clusterer.fit(x_comb_norm) 28 | 29 | # print(clusterer.outlier_scores_[-1]) 30 | # record the current item 31 | pred_score[i, :] = clusterer.outlier_scores_[-1] 32 | return pred_score 33 | 34 | def evaluate(self, X_test, y_test): 35 | pred_score = self.sample_scores(X_test) 36 | prec_n = (get_precn(y_test, pred_score)) 37 | 38 | print("precision@n", prec_n) 39 | -------------------------------------------------------------------------------- /models/hbos.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import math 3 | from sklearn.preprocessing import MinMaxScaler 4 | from scipy.stats import scoreatpercentile 5 | 6 | class Hbos(object): 7 | 8 | def __init__(self, bins=10, alpha=0.3, beta=0.5, contamination=0.05): 9 | 10 | self.bins = bins 11 | self.alpha = alpha 12 | self.beta = beta 13 | self.contamination = contamination 14 | 15 | def fit(self, X): 16 | 17 | self.n, self.d = X.shape[0], X.shape[1] 18 | out_scores = np.zeros([self.n, self.d]) 19 | 20 | hist = np.zeros([self.bins, self.d]) 21 | bin_edges = np.zeros([self.bins + 1, self.d]) 22 | 23 | # this is actually the fitting 24 | for i in range(self.d): 25 | hist[:, i], bin_edges[:, i] = np.histogram(X[:, i], bins=self.bins, 26 | density=True) 27 | # check the integrity 28 | assert ( 29 | math.isclose(np.sum(hist[:, i] * np.diff(bin_edges[:, i])), 1)) 30 | 31 | # calculate the threshold 32 | for i in range(self.d): 
33 | # find histogram assignments of data points 34 | bin_ind = np.digitize(X[:, i], bin_edges[:, i], right=False) 35 | 36 | # very important to do scaling. Not necessary to use min max 37 | density_norm = MinMaxScaler().fit_transform( 38 | hist[:, i].reshape(-1, 1)) 39 | out_score = np.log(1 / (density_norm + self.alpha)) 40 | 41 | for j in range(self.n): 42 | # out sample left 43 | if bin_ind[j] == 0: 44 | dist = np.abs(X[j, i] - bin_edges[0, i]) 45 | bin_width = bin_edges[1, i] - bin_edges[0, i] 46 | # assign it to bin 0 47 | if dist < bin_width * self.beta: 48 | out_scores[j, i] = out_score[bin_ind[j]] 49 | else: 50 | out_scores[j, i] = np.max(out_score) 51 | 52 | # out sample right 53 | elif bin_ind[j] == bin_edges.shape[0]: 54 | dist = np.abs(X[j, i] - bin_edges[-1, i]) 55 | bin_width = bin_edges[-1, i] - bin_edges[-2, i] 56 | # assign it to bin k 57 | if dist < bin_width * self.beta: 58 | out_scores[j, i] = out_score[bin_ind[j] - 2] 59 | else: 60 | out_scores[j, i] = np.max(out_score) 61 | else: 62 | out_scores[j, i] = out_score[bin_ind[j] - 1] 63 | 64 | out_scores_sum = np.sum(out_scores, axis=1) 65 | self.threshold = scoreatpercentile(out_scores_sum, 66 | 100 * (1 - self.contamination)) 67 | self.hist = hist 68 | self.bin_edges = bin_edges 69 | self.decision_scores = out_scores_sum 70 | self.y_pred = (self.decision_scores > self.threshold).astype('int') 71 | 72 | def decision_function(self, X_test): 73 | 74 | n_test = X_test.shape[0] 75 | out_scores = np.zeros([n_test, self.d]) 76 | 77 | for i in range(self.d): 78 | # find histogram assignments of data points 79 | bin_ind = np.digitize(X_test[:, i], self.bin_edges[:, i], 80 | right=False) 81 | 82 | # very important to do scaling. Not necessary to use minmax 83 | density_norm = MinMaxScaler().fit_transform( 84 | self.hist[:, i].reshape(-1, 1)) 85 | 86 | out_score = np.log(1 / (density_norm + self.alpha)) 87 | 88 | for j in range(n_test): 89 | # out sample left 90 | if bin_ind[j] == 0: 91 | dist = np.abs(X_test[j, i] - self.bin_edges[0, i]) 92 | bin_width = self.bin_edges[1, i] - self.bin_edges[0, i] 93 | # assign it to bin 0 94 | if dist < bin_width * self.beta: 95 | out_scores[j, i] = out_score[bin_ind[j]] 96 | else: 97 | out_scores[j, i] = np.max(out_score) 98 | 99 | # out sample right 100 | elif bin_ind[j] == self.bin_edges.shape[0]: 101 | dist = np.abs(X_test[j, i] - self.bin_edges[-1, i]) 102 | bin_width = self.bin_edges[-1, i] - self.bin_edges[-2, i] 103 | # assign it to bin k 104 | if dist < bin_width * self.beta: 105 | out_scores[j, i] = out_score[bin_ind[j] - 2] 106 | else: 107 | out_scores[j, i] = np.max(out_score) 108 | else: 109 | out_scores[j, i] = out_score[bin_ind[j] - 1] 110 | 111 | out_scores_sum = np.sum(out_scores, axis=1) 112 | return out_scores_sum 113 | 114 | def predict(self, X_test): 115 | pred_score = self.decision_function(X_test) 116 | return (pred_score > self.threshold).astype('int') 117 | -------------------------------------------------------------------------------- /models/knn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.neighbors import NearestNeighbors 3 | from sklearn.neighbors import KDTree 4 | from sklearn.exceptions import NotFittedError 5 | from scipy.stats import scoreatpercentile 6 | 7 | 8 | class Knn(object): 9 | ''' 10 | Knn class for outlier detection 11 | support original knn, average knn, and median knn 12 | ''' 13 | 14 | def __init__(self, n_neighbors=1, contamination=0.05, method='largest'): 15 | 
self.n_neighbors = n_neighbors 16 | self.contamination = contamination 17 | self.method = method 18 | self._isfitted = False  # set to True by fit() 19 | def fit(self, X_train): 20 | self.X_train = X_train 21 | self._isfitted = True 22 | self.tree = KDTree(X_train) 23 | 24 | neigh = NearestNeighbors() 25 | neigh.fit(self.X_train) 26 | 27 | result = neigh.kneighbors(n_neighbors=self.n_neighbors, 28 | return_distance=True) 29 | dist_arr = result[0] 30 | 31 | if self.method == 'largest': 32 | dist = dist_arr[:, -1] 33 | elif self.method == 'mean': 34 | dist = np.mean(dist_arr, axis=1) 35 | elif self.method == 'median': 36 | dist = np.median(dist_arr, axis=1) 37 | 38 | threshold = scoreatpercentile(dist, 100 * (1 - self.contamination)) 39 | 40 | self.threshold = threshold 41 | self.decision_scores = dist.ravel() 42 | self.y_pred = (self.decision_scores > self.threshold).astype('int') 43 | 44 | def decision_function(self, X_test): 45 | 46 | if not self._isfitted: 47 | raise NotFittedError('Knn is not fitted yet') 48 | 49 | # initialize the output score 50 | pred_score = np.zeros([X_test.shape[0], 1]) 51 | 52 | for i in range(X_test.shape[0]): 53 | x_i = X_test[i, :] 54 | x_i = np.asarray(x_i).reshape(1, x_i.shape[0]) 55 | 56 | # get the distance of the current point 57 | dist_arr, ind_arr = self.tree.query(x_i, k=self.n_neighbors) 58 | 59 | if self.method == 'largest': 60 | dist = dist_arr[:, -1] 61 | elif self.method == 'mean': 62 | dist = np.mean(dist_arr, axis=1) 63 | elif self.method == 'median': 64 | dist = np.median(dist_arr, axis=1) 65 | 66 | pred_score_i = dist[-1] 67 | 68 | # record the current item 69 | pred_score[i, :] = pred_score_i 70 | 71 | return pred_score 72 | 73 | def predict(self, X_test): 74 | 75 | pred_score = self.decision_function(X_test) 76 | return (pred_score > self.threshold).astype('int') 77 | 78 | 79 | ############################################################################## 80 | # samples = [[-1, 0], [0., 0.], [1., 1], [2., 5.], [3, 1]] 81 | # 82 | # clf = Knn() 83 | # clf.fit(samples) 84 | # 85 | # scores = clf.decision_function(np.asarray([[2, 3], [6, 8]])).ravel() 86 | # assert (scores[0] == [2]) 87 | # assert (scores[1] == [5]) 88 | # # 89 | # labels = clf.predict(np.asarray([[2, 3], [6, 8]])).ravel() 90 | # assert (labels[0] == [0]) 91 | # assert (labels[1] == [1]) -------------------------------------------------------------------------------- /models/select_TOS.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | from scipy.stats import pearsonr 4 | 5 | from models.utility import get_top_n 6 | 7 | 8 | def random_select(X, X_train_new_orig, roc_list, p): 9 | s_feature_rand = random.sample(range(0, len(roc_list)), p) 10 | X_train_new_rand = X_train_new_orig[:, s_feature_rand] 11 | X_train_all_rand = np.concatenate((X, X_train_new_rand), axis=1) 12 | 13 | # print(s_feature_rand) 14 | 15 | return X_train_new_rand, X_train_all_rand 16 | 17 | 18 | def accurate_select(X, X_train_new_orig, roc_list, p): 19 | s_feature_accu = get_top_n(roc_list=roc_list, n=p, top=True) 20 | X_train_new_accu = X_train_new_orig[:, s_feature_accu[0][0:p]] 21 | X_train_all_accu = np.concatenate((X, X_train_new_accu), axis=1) 22 | 23 | # print(s_feature_accu) 24 | 25 | return X_train_new_accu, X_train_all_accu 26 | 27 | 28 | def balance_select(X, X_train_new_orig, roc_list, p): 29 | s_feature_balance = [] 30 | pearson_list = np.zeros([len(roc_list), 1]) 31 | 32 | # handle the first value 33 | max_value_idx = roc_list.index(max(roc_list)) 34 | 
s_feature_balance.append(max_value_idx) 35 | roc_list[max_value_idx] = -1  # mark as selected; note this mutates the caller's list 36 | 37 | for i in range(p - 1): 38 | 39 | for j in range(len(roc_list)): 40 | pear = pearsonr(X_train_new_orig[:, max_value_idx], 41 | X_train_new_orig[:, j]) 42 | 43 | # update the pearson 44 | pearson_list[j] = np.abs(pearson_list[j]) + np.abs(pear[0]) 45 | 46 | discounted_roc = np.true_divide(roc_list, pearson_list.transpose()) 47 | 48 | max_value_idx = np.argmax(discounted_roc) 49 | s_feature_balance.append(max_value_idx) 50 | roc_list[max_value_idx] = -1 51 | 52 | X_train_new_balance = X_train_new_orig[:, s_feature_balance] 53 | X_train_all_balance = np.concatenate((X, X_train_new_balance), axis=1) 54 | 55 | # print(s_feature_balance) 56 | 57 | return X_train_new_balance, X_train_all_balance 58 | -------------------------------------------------------------------------------- /models/utility.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.stats import scoreatpercentile 3 | from sklearn.metrics import precision_score 4 | from sklearn.preprocessing import StandardScaler 5 | from sklearn.metrics import roc_auc_score 6 | 7 | 8 | def get_precn(y, y_pred): 9 | ''' 10 | Utility function to calculate precision@n 11 | :param y: ground truth 12 | :param y_pred: predicted outlier scores 13 | :return: precision@n score 14 | ''' 15 | # calculate the percentage of outliers 16 | out_perc = np.count_nonzero(y) / len(y) 17 | 18 | threshold = scoreatpercentile(y_pred, 100 * (1 - out_perc)) 19 | y_pred = (y_pred > threshold).astype('int') 20 | return precision_score(y, y_pred) 21 | 22 | 23 | def precision_n(y_pred, y, n): 24 | ''' 25 | Utility function to calculate precision@n 26 | 27 | :param y_pred: predicted value 28 | :param y: ground truth 29 | :param n: number of outliers 30 | :return: scalar score 31 | ''' 32 | y_pred = np.asarray(y_pred) 33 | y = np.asarray(y) 34 | 35 | length = y.shape[0] 36 | 37 | assert (y_pred.shape == y.shape) 38 | y_sorted = np.partition(y_pred, int(length - n)) 39 | 40 | threshold = y_sorted[int(length - n)] 41 | 42 | y_n = np.greater_equal(y_pred, threshold).astype(int) 43 | # print(threshold, y_n, precision_score(y, y_n)) 44 | 45 | return precision_score(y, y_n) 46 | 47 | 48 | def get_top_n(roc_list, n, top=True): 49 | ''' 50 | For use by Accurate Selection only 51 | :param roc_list: a list of ROC scores, one per TOS 52 | :param n: the number of TOS to select 53 | :param top: select the top n if True, otherwise the bottom n 54 | :return: indices of the selected TOS 55 | ''' 56 | roc_list = np.asarray(roc_list) 57 | length = roc_list.shape[0] 58 | 59 | roc_sorted = np.partition(roc_list, length - n) 60 | threshold = roc_sorted[int(length - n)] 61 | 62 | if top: 63 | return np.where(np.greater_equal(roc_list, threshold)) 64 | else: 65 | return np.where(np.less(roc_list, threshold)) 66 | 67 | 68 | def print_baseline(X_train_new_orig, y, roc_list, prec_list): 69 | max_value_idx = roc_list.index(max(roc_list)) 70 | print() 71 | print('Highest TOS ROC:', roc_list[max_value_idx]) 72 | print('Highest TOS Precision@n', max(prec_list)) 73 | 74 | # normalized score 75 | X_train_all_norm = StandardScaler().fit_transform(X_train_new_orig) 76 | X_train_all_norm_mean = np.mean(X_train_all_norm, axis=1) 77 | 78 | roc = np.round(roc_auc_score(y, X_train_all_norm_mean), decimals=4) 79 | prec_n = np.round(get_precn(y, X_train_all_norm_mean), decimals=4) 80 | 81 | print('Average TOS ROC:', roc) 82 | print('Average TOS Precision@n', prec_n) 83 | -------------------------------------------------------------------------------- /plots.py: 
-------------------------------------------------------------------------------- 1 | import os 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | 5 | # initialize the results of the experiments 6 | # arrhythmia 7 | prc_gr_arr = [0.5606, 0.5976, 0.5986, 0.6053, 0.6109, 0.6219, 0.6076, 0.6115] 8 | prc_ac_arr = [0.5606, 0.5976, 0.5719, 0.5961, 0.6041, 0.5792, 0.6019, 0.6115] 9 | prc_rd_arr = [0.5606, 0.5993, 0.5788, 0.6356, 0.5908, 0.6094, 0.6202, 0.6115] 10 | 11 | # letter 12 | prc_gr_lt = [0.6003, 0.7300, 0.7234, 0.7199, 0.7169, 0.7285, 0.7323, 0.7376] 13 | prc_ac_lt = [0.6003, 0.7300, 0.7210, 0.7272, 0.7477, 0.7302, 0.7308, 0.7376] 14 | prc_rd_lt = [0.6003, 0.6653, 0.7140, 0.7248, 0.7397, 0.7302, 0.7232, 0.7376] 15 | 16 | # cardio 17 | prc_gr_car = [0.9304, 0.9290, 0.9374, 0.9385, 0.9296, 0.9351, 0.9327, 0.9332] 18 | prc_ac_car = [0.9304, 0.9290, 0.9314, 0.9315, 0.9337, 0.9354, 0.9331, 0.9332] 19 | prc_rd_car = [0.9304, 0.9297, 0.9342, 0.9364, 0.9315, 0.9267, 0.9248, 0.9332] 20 | 21 | # speech 22 | prc_gr_sp = [0.1455, 0.2658, 0.2733, 0.3203, 0.3290, 0.3107, 0.3355, 0.2492] 23 | prc_ac_sp = [0.1455, 0.2658, 0.2367, 0.2630, 0.3103, 0.2983, 0.3255, 0.2492] 24 | prc_rd_sp = [0.1455, 0.1356, 0.1814, 0.2101, 0.3194, 0.3053, 0.2940, 0.2492] 25 | 26 | # mammography 27 | prc_gr_ma = [0.6974, 0.6853, 0.6719, 0.6720, 0.6620, 0.6717, 0.6687, 0.6673] 28 | prc_ac_ma = [0.6974, 0.6853, 0.6915, 0.6841, 0.6965, 0.6631, 0.6655, 0.6673] 29 | prc_rd_ma = [0.6974, 0.6812, 0.6823, 0.6649, 0.6693, 0.6619, 0.6654, 0.6673] 30 | 31 | # x-axis 32 | x = [0, 1, 5, 10, 30, 50, 70, 100] 33 | 34 | # main plots 35 | fig = plt.figure(figsize=(8, 10)) 36 | lw = 2 37 | 38 | ax = fig.add_subplot(511) 39 | 40 | plt.plot(x, prc_rd_arr, color='black', linestyle='-.', marker='s', 41 | lw=lw, label='Random Selection') 42 | plt.plot(x, prc_gr_arr, color='blue', linestyle='--', marker='^', 43 | lw=lw, label='Balance Selection') 44 | plt.plot(x, prc_ac_arr, color='red', linestyle='-', marker='o', 45 | lw=lw, label='Accurate Selection') 46 | 47 | plt.xlim([-0.5, 100.5]) 48 | plt.xticks(np.arange(0, 100, 5)) 49 | plt.ylabel('Precision@n', fontsize=12) 50 | plt.title('Arrhythmia', fontsize=12) 51 | plt.legend(loc="lower right") 52 | 53 | ######################################################################### 54 | ax = fig.add_subplot(512) 55 | plt.plot(x, prc_rd_lt, color='black', linestyle='-.', marker='s', 56 | lw=lw, label='Random Selection') 57 | plt.plot(x, prc_gr_lt, color='blue', linestyle='--', marker='^', 58 | lw=lw, label='Balance Selection') 59 | plt.plot(x, prc_ac_lt, color='red', linestyle='--', marker='o', 60 | lw=lw, label='Accurate Selection') 61 | 62 | plt.xlim([-0.5, 100.5]) 63 | plt.xticks(np.arange(0, 100, 5)) 64 | plt.ylabel('Precision@n', fontsize=12) 65 | plt.title('Letter', fontsize=12) 66 | plt.legend(loc="lower right") 67 | 68 | ######################################################################### 69 | ax = fig.add_subplot(513) 70 | plt.plot(x, prc_rd_car, color='black', linestyle='-.', marker='s', 71 | lw=lw, label='Random Selection') 72 | plt.plot(x, prc_gr_car, color='blue', linestyle='--', marker='^', 73 | lw=lw, label='Balance Selection') 74 | plt.plot(x, prc_ac_car, color='red', linestyle='--', marker='o', 75 | lw=lw, label='Accurate Selection') 76 | 77 | plt.xlim([-0.5, 100.5]) 78 | plt.xticks(np.arange(0, 100, 5)) 79 | plt.ylabel('Precision@n', fontsize=12) 80 | plt.title('Cardio', fontsize=12) 81 | plt.legend(loc="lower right") 82 | 83 | 
######################################################################### 84 | ax = fig.add_subplot(514) 85 | plt.plot(x, prc_rd_sp, color='black', linestyle='-.', marker='s', 86 | lw=lw, label='Random Selection') 87 | plt.plot(x, prc_gr_sp, color='blue', linestyle='--', marker='^', 88 | lw=lw, label='Balance Selection') 89 | plt.plot(x, prc_ac_sp, color='red', linestyle='--', marker='o', 90 | lw=lw, label='Accurate Selection') 91 | 92 | plt.xlim([-0.5, 100.5]) 93 | plt.xticks(np.arange(0, 100, 5)) 94 | plt.ylabel('Precision@n', fontsize=12) 95 | plt.title('Speech', fontsize=12) 96 | plt.legend(loc="lower right") 97 | ######################################################################### 98 | ax = fig.add_subplot(515) 99 | plt.plot(x, prc_rd_ma, color='black', linestyle='-.', marker='s', 100 | lw=lw, label='Random Selection') 101 | plt.plot(x, prc_gr_ma, color='blue', linestyle='--', marker='^', 102 | lw=lw, label='Balance Selection') 103 | plt.plot(x, prc_ac_ma, color='red', linestyle='--', marker='o', 104 | lw=lw, label='Accurate Selection') 105 | 106 | plt.xlim([-0.5, 100.5]) 107 | plt.xticks(np.arange(0, 100, 5)) 108 | plt.xlabel('Number of Selected TOS') 109 | plt.ylabel('Precision@n', fontsize=12) 110 | plt.title('Mammography', fontsize=12) 111 | plt.legend(loc="upper right") 112 | 113 | ######################################################################### 114 | plt.tight_layout() 115 | plt.savefig(os.path.join('figs', 'results.png'), dpi=300) 116 | plt.show() 117 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | imbalanced_learn>=0.3.2 2 | matplotlib>=2.0.2 3 | numpy>=1.13.1 4 | pandas>=0.21.0 5 | PyNomaly>=0.1.7 6 | scikit_learn>=0.19.1 7 | scipy>=0.19.1 8 | xgboost>=0.7 -------------------------------------------------------------------------------- /xgbod_demo.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Demo codes for XGBOD. 3 | Author: Yue Zhao 4 | 5 | notes: the demo code simulates the use of XGBOD with some changes to expedite 6 | the execution. Use the full code for production. 
7 | 8 | ''' 9 | import os 10 | import random 11 | import scipy.io as scio 12 | import numpy as np 13 | 14 | from sklearn.preprocessing import StandardScaler, normalize 15 | from sklearn.metrics import roc_auc_score 16 | from sklearn.model_selection import train_test_split 17 | from sklearn.linear_model import LogisticRegression 18 | 19 | from xgboost.sklearn import XGBClassifier 20 | from imblearn.ensemble import BalancedBaggingClassifier 21 | from models.utility import get_precn, print_baseline 22 | from models.generate_TOS import get_TOS_knn 23 | from models.generate_TOS import get_TOS_loop 24 | from models.generate_TOS import get_TOS_lof 25 | from models.generate_TOS import get_TOS_svm 26 | from models.generate_TOS import get_TOS_iforest 27 | from models.generate_TOS import get_TOS_hbos 28 | from models.select_TOS import random_select, accurate_select, balance_select 29 | 30 | # load data file 31 | # mat = scio.loadmat(os.path.join('datasets', 'speech.mat')) 32 | mat = scio.loadmat(os.path.join('datasets', 'arrhythmia.mat')) 33 | # mat = scio.loadmat(os.path.join('datasets', 'cardio.mat')) 34 | # mat = scio.loadmat(os.path.join('datasets', 'letter.mat')) 35 | # mat = scio.loadmat(os.path.join('datasets', 'mammography.mat')) 36 | 37 | X = mat['X'] 38 | y = mat['y'] 39 | 40 | # using unit-norm vectors for X improves knn, LoOP, and LOF results 41 | scaler = StandardScaler().fit(X)  # standardization kept as an alternative 42 | # X_norm = scaler.transform(X) 43 | X_norm = normalize(X) 44 | feature_list = [] 45 | 46 | # Run KNN-based algorithms to generate additional features 47 | 48 | # predefined range of k 49 | k_range = [1, 2, 3, 4, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 50 | 200, 250] 51 | # predefined range of k to be used with LoOP due to high complexity 52 | k_range_short = [1, 3, 5, 10] 53 | 54 | # validate the value of k 55 | k_range = [k for k in k_range if k < X.shape[0]] 56 | 57 | # predefined range of nu for one-class svm 58 | nu_range = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99] 59 | 60 | # predefined range for number of estimators in isolation forests 61 | n_range = [10, 20, 50, 70, 100, 150, 200, 250] 62 | ############################################################################## 63 | 64 | # Generate TOS using KNN based algorithms 65 | feature_list, roc_knn, prc_n_knn, result_knn = get_TOS_knn(X_norm, y, k_range, 66 | feature_list) 67 | # Generate TOS using LoOP 68 | feature_list, roc_loop, prc_n_loop, result_loop = get_TOS_loop(X, y, 69 | k_range_short, 70 | feature_list) 71 | # Generate TOS using LOF 72 | feature_list, roc_lof, prc_n_lof, result_lof = get_TOS_lof(X_norm, y, k_range, 73 | feature_list) 74 | # Generate TOS using one class svm 75 | feature_list, roc_ocsvm, prc_n_ocsvm, result_ocsvm = get_TOS_svm(X, y, 76 | nu_range, 77 | feature_list) 78 | # Generate TOS using isolation forests 79 | feature_list, roc_if, prc_n_if, result_if = get_TOS_iforest(X, y, n_range, 80 | feature_list) 81 | 82 | # Generate TOS using HBOS 83 | feature_list, roc_hbos, prc_n_hbos, result_hbos = get_TOS_hbos(X, y, k_range, 84 | feature_list) 85 | ############################################################################## 86 | # combine the feature space by concatenating various TOS (HBOS excluded) 87 | X_train_new_orig = np.concatenate( 88 | (result_knn, result_loop, result_lof, result_ocsvm, result_if), axis=1) 89 | 90 | X_train_all_orig = np.concatenate((X, X_train_new_orig), axis=1) 91 | 92 | # combine ROC and Precision@n list 93 | roc_list = roc_knn + roc_loop + roc_lof + roc_ocsvm + roc_if 94 | 
prc_n_list = prc_n_knn + prc_n_loop + prc_n_lof + prc_n_ocsvm + prc_n_if 95 | 96 | # get the results of baselines 97 | print_baseline(X_train_new_orig, y, roc_list, prc_n_list) 98 | 99 | ############################################################################## 100 | # select TOS using different methods 101 | 102 | p = 10 # number of selected TOS 103 | 104 | # random selection 105 | # note that the actual random selection is re-drawn within the 106 | # train-test loop below, once per iteration. 107 | X_train_new_rand, X_train_all_rand = random_select(X, X_train_new_orig, 108 | roc_list, p) 109 | # accurate selection 110 | X_train_new_accu, X_train_all_accu = accurate_select(X, X_train_new_orig, 111 | roc_list, p) 112 | # balance selection 113 | X_train_new_bal, X_train_all_bal = balance_select(X, X_train_new_orig, 114 | roc_list, p) 115 | 116 | ############################################################################### 117 | # build various classifiers 118 | 119 | # note that in practice the data split should happen as the first stage, so 120 | # test data is not exposed. However, with a relatively large number of 121 | # repetitions, the demo code generates a similar result. 122 | 123 | # the full code uses containers to save the intermediate TOS models; that 124 | # code will be shared after the cleanup. 125 | 126 | ite = 30 # number of iterations 127 | test_size = 0.4 # training = 60%, testing = 40% 128 | result_dict = {} 129 | 130 | clf_list = [XGBClassifier(), LogisticRegression(penalty="l1"), 131 | LogisticRegression(penalty="l2")] 132 | clf_name_list = ['xgb', 'lr1', 'lr2'] 133 | 134 | # initialize the result dictionary 135 | for clf_name in clf_name_list: 136 | result_dict[clf_name + 'ROC' + 'o'] = [] 137 | result_dict[clf_name + 'ROC' + 's'] = [] 138 | result_dict[clf_name + 'ROC' + 'n'] = [] 139 | 140 | result_dict[clf_name + 'PRC@n' + 'o'] = [] 141 | result_dict[clf_name + 'PRC@n' + 's'] = [] 142 | result_dict[clf_name + 'PRC@n' + 'n'] = [] 143 | 144 | for i in range(ite): 145 | s_feature_rand = random.sample(range(0, len(roc_list)), p) 146 | X_train_new_rand = X_train_new_orig[:, s_feature_rand] 147 | X_train_all_rand = np.concatenate((X, X_train_new_rand), axis=1) 148 | 149 | original_len = X.shape[1] 150 | 151 | # use all TOS 152 | X_train, X_test, y_train, y_test = train_test_split(X_train_all_orig, y, 153 | test_size=test_size) 154 | # # use Random Selection 155 | # X_train, X_test, y_train, y_test = train_test_split(X_train_all_rand, y, 156 | # test_size=test_size) 157 | # # use Accurate Selection 158 | # X_train, X_test, y_train, y_test = train_test_split(X_train_all_accu, y, 159 | # test_size=test_size) 160 | # # use Balance Selection 161 | # X_train, X_test, y_train, y_test = train_test_split(X_train_all_bal, y, 162 | # test_size=test_size) 163 | 164 | # use original features 165 | X_train_o = X_train[:, 0:original_len] 166 | X_test_o = X_test[:, 0:original_len] 167 | 168 | X_train_n = X_train[:, original_len:] 169 | X_test_n = X_test[:, original_len:] 170 | 171 | for clf, clf_name in zip(clf_list, clf_name_list): 172 | print('processing', clf_name, 'round', i + 1) 173 | if clf_name != 'xgb': 174 | clf = BalancedBaggingClassifier(base_estimator=clf, 175 | ratio='auto', 176 | replacement=False) 177 | 178 | # fully supervised 179 | clf.fit(X_train_o, y_train.ravel()) 180 | y_pred = clf.predict_proba(X_test_o) 181 | 182 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 183 | prec_n = get_precn(y_test, y_pred[:, 1]) 184 | 185 | result_dict[clf_name + 'ROC' + 
'o'].append(roc_score) 186 | result_dict[clf_name + 'PRC@n' + 'o'].append(prec_n) 187 | 188 | # unsupervised 189 | clf.fit(X_train_n, y_train.ravel()) 190 | y_pred = clf.predict_proba(X_test_n) 191 | 192 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 193 | prec_n = get_precn(y_test, y_pred[:, 1]) 194 | 195 | result_dict[clf_name + 'ROC' + 'n'].append(roc_score) 196 | result_dict[clf_name + 'PRC@n' + 'n'].append(prec_n) 197 | 198 | # semi-supervised 199 | clf.fit(X_train, y_train.ravel()) 200 | y_pred = clf.predict_proba(X_test) 201 | 202 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 203 | prec_n = get_precn(y_test, y_pred[:, 1]) 204 | 205 | result_dict[clf_name + 'ROC' + 's'].append(roc_score) 206 | result_dict[clf_name + 'PRC@n' + 's'].append(prec_n) 207 | 208 | for eva in ['ROC', 'PRC@n']: 209 | print() 210 | for clf_name in clf_name_list: 211 | print(np.round(np.mean(result_dict[clf_name + eva + 'o']), decimals=4), 212 | eva, clf_name, 'original features') 213 | print(np.round(np.mean(result_dict[clf_name + eva + 'n']), decimals=4), 214 | eva, clf_name, 'TOS only') 215 | print(np.round(np.mean(result_dict[clf_name + eva + 's']), decimals=4), 216 | eva, clf_name, 'original feature + TOS') 217 | -------------------------------------------------------------------------------- /xgbod_full.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | import numpy as np 4 | import scipy.io as scio 5 | 6 | from sklearn.preprocessing import StandardScaler, Normalizer 7 | from sklearn.model_selection import train_test_split 8 | from sklearn.metrics import roc_auc_score 9 | from sklearn.neighbors import LocalOutlierFactor 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.ensemble import IsolationForest 12 | from sklearn.svm import OneClassSVM 13 | 14 | from xgboost.sklearn import XGBClassifier 15 | from imblearn.ensemble import BalancedBaggingClassifier 16 | from PyNomaly import loop 17 | 18 | from models.knn import Knn 19 | from models.utility import get_precn, print_baseline 20 | 21 | # use one dataset at a time; more datasets could be added to /datasets folder 22 | # the experiment code needs a bit more setting up; otherwise the 23 | # exact reproduction is infeasible. 
Clean-up codes are going to be moved 24 | 25 | # load data file 26 | mat = scio.loadmat(os.path.join('datasets', 'letter.mat')) 27 | ite = 30 # number of iterations 28 | test_size = 0.4 # training = 60%, testing = 40% 29 | 30 | X_orig = mat['X'] 31 | y_orig = mat['y'] 32 | 33 | # outlier percentage 34 | out_perc = np.count_nonzero(y_orig) / len(y_orig) 35 | 36 | # define classifiers to use 37 | clf_list = [XGBClassifier(), LogisticRegression(penalty="l1"), 38 | LogisticRegression(penalty="l2")] 39 | clf_name_list = ['xgb', 'lr1', 'lr2'] 40 | 41 | # initialize the container to store the results 42 | result_dict = {} 43 | 44 | # initialize the result dictionary 45 | for clf_name in clf_name_list: 46 | result_dict[clf_name + 'roc' + 'o'] = [] 47 | result_dict[clf_name + 'roc' + 's'] = [] 48 | result_dict[clf_name + 'roc' + 'n'] = [] 49 | 50 | result_dict[clf_name + 'precn' + 'o'] = [] 51 | result_dict[clf_name + 'precn' + 's'] = [] 52 | result_dict[clf_name + 'precn' + 'n'] = [] 53 | 54 | for t in range(ite): 55 | 56 | print('\nProcessing trial', t + 1, 'out of', ite) 57 | 58 | # split X and y for training and validation 59 | X, X_test, y, y_test = train_test_split(X_orig, y_orig, 60 | test_size=test_size) 61 | 62 | # reserve the normalized data 63 | scaler = Normalizer().fit(X) 64 | X_norm = scaler.transform(X) 65 | X_test_norm = scaler.transform(X_test) 66 | 67 | feature_list = [] 68 | 69 | # predefined range of K 70 | k_list_pre = [1, 2, 3, 4, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 71 | 100, 150, 200, 250] 72 | # trim the list in case of small sample size 73 | k_list = [k for k in k_list_pre if k < X.shape[0]] 74 | 75 | ########################################################################### 76 | train_knn = np.zeros([X.shape[0], len(k_list)]) 77 | test_knn = np.zeros([X_test.shape[0], len(k_list)]) 78 | 79 | roc_knn = [] 80 | prec_n_knn = [] 81 | 82 | for i in range(len(k_list)): 83 | k = k_list[i] 84 | 85 | clf = Knn(n_neighbors=k, contamination=out_perc, method='largest') 86 | clf.fit(X_norm) 87 | train_score = clf.decision_scores 88 | pred_score = clf.decision_function(X_test_norm) 89 | 90 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 91 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 92 | print('knn roc pren @ {k} is {roc} {pren}'.format(k=k, roc=roc, 93 | pren=prec_n)) 94 | 95 | feature_list.append('knn_' + str(k)) 96 | roc_knn.append(roc) 97 | prec_n_knn.append(prec_n) 98 | 99 | train_knn[:, i] = train_score 100 | test_knn[:, i] = pred_score.ravel() 101 | ########################################################################### 102 | 103 | train_knn_mean = np.zeros([X.shape[0], len(k_list)]) 104 | test_knn_mean = np.zeros([X_test.shape[0], len(k_list)]) 105 | 106 | roc_knn_mean = [] 107 | prec_n_knn_mean = [] 108 | for i in range(len(k_list)): 109 | k = k_list[i] 110 | 111 | clf = Knn(n_neighbors=k, contamination=out_perc, method='mean') 112 | clf.fit(X_norm) 113 | train_score = clf.decision_scores 114 | pred_score = clf.decision_function(X_test_norm) 115 | 116 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 117 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 118 | print('knn_mean roc pren @ {k} is {roc} {pren}'.format(k=k, roc=roc, 119 | pren=prec_n)) 120 | 121 | feature_list.append('knn_mean_' + str(k)) 122 | roc_knn_mean.append(roc) 123 | prec_n_knn_mean.append(prec_n) 124 | 125 | train_knn_mean[:, i] = train_score 126 | test_knn_mean[:, i] = pred_score.ravel() 127 | 
########################################################################### 128 | 129 | train_knn_median = np.zeros([X.shape[0], len(k_list)]) 130 | test_knn_median = np.zeros([X_test.shape[0], len(k_list)]) 131 | 132 | roc_knn_median = [] 133 | prec_n_knn_median = [] 134 | for i in range(len(k_list)): 135 | k = k_list[i] 136 | 137 | clf = Knn(n_neighbors=k, contamination=out_perc, method='median') 138 | clf.fit(X_norm) 139 | train_score = clf.decision_scores 140 | pred_score = clf.decision_function(X_test_norm) 141 | 142 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 143 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 144 | print('knn_median roc pren @ {k} is {roc} {pren}'.format(k=k, roc=roc, 145 | pren=prec_n)) 146 | 147 | feature_list.append('knn_median_' + str(k)) 148 | roc_knn_median.append(roc) 149 | prec_n_knn_median.append(prec_n) 150 | 151 | train_knn_median[:, i] = train_score 152 | test_knn_median[:, i] = pred_score.ravel() 153 | ########################################################################### 154 | 155 | train_lof = np.zeros([X.shape[0], len(k_list)]) 156 | test_lof = np.zeros([X_test.shape[0], len(k_list)]) 157 | 158 | roc_lof = [] 159 | prec_n_lof = [] 160 | 161 | for i in range(len(k_list)): 162 | k = k_list[i] 163 | clf = LocalOutlierFactor(n_neighbors=k) 164 | clf.fit(X_norm) 165 | 166 | # save the train sets 167 | train_score = clf.negative_outlier_factor_ * -1 168 | # flip the score 169 | pred_score = clf._decision_function(X_test_norm) * -1 170 | 171 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 172 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 173 | print('lof roc pren @ {k} is {roc} {pren}'.format(k=k, roc=roc, 174 | pren=prec_n)) 175 | feature_list.append('lof_' + str(k)) 176 | roc_lof.append(roc) 177 | prec_n_lof.append(prec_n) 178 | 179 | train_lof[:, i] = train_score 180 | test_lof[:, i] = pred_score 181 | 182 | ########################################################################### 183 | # Note that LoOP is not really usable for prediction due to its high 184 | # computational complexity. 185 | # However, it is included to demonstrate the effectiveness of XGBOD. 186 | 187 | df_X = pd.DataFrame(np.concatenate([X_norm, X_test_norm], axis=0)) 188 | 189 | # predefined range of K 190 | k_list = [1, 5, 10, 20] 191 | 192 | train_loop = np.zeros([X.shape[0], len(k_list)]) 193 | test_loop = np.zeros([X_test.shape[0], len(k_list)]) 194 | 195 | roc_loop = [] 196 | prec_n_loop = [] 197 | 198 | for i in range(len(k_list)): 199 | k = k_list[i] 200 | clf = loop.LocalOutlierProbability(df_X, n_neighbors=k).fit() 201 | score = clf.local_outlier_probabilities.astype(float) 202 | 203 | # save the train sets 204 | train_score = score[0:X.shape[0]] 205 | # take the test portion of the scores 206 | pred_score = score[X.shape[0]:] 207 | 208 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 209 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 210 | print('loop roc pren @ {k} is {roc} {pren}'.format(k=k, roc=roc, 211 | pren=prec_n)) 212 | feature_list.append('loop_' + str(k)) 213 | roc_loop.append(roc) 214 | prec_n_loop.append(prec_n) 215 | 216 | train_loop[:, i] = train_score 217 | test_loop[:, i] = pred_score 218 | 219 | ########################################################################## 220 | nu_list = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99] 221 | 222 | train_svm = np.zeros([X.shape[0], len(nu_list)]) 223 | test_svm = np.zeros([X_test.shape[0], len(nu_list)]) 224 | 225 | 
roc_svm = [] 226 | prec_n_svm = [] 227 | 228 | for i in range(len(nu_list)): 229 | nu = nu_list[i] 230 | 231 | clf = OneClassSVM(nu=nu) 232 | clf.fit(X) 233 | 234 | train_score = clf.decision_function(X) * -1 235 | pred_score = clf.decision_function(X_test) * -1 236 | 237 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 238 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 239 | print('svm roc / pren @ {nu} is {roc} {pren}'.format(nu=nu, roc=roc, 240 | pren=prec_n)) 241 | 242 | feature_list.append('svm_' + str(nu)) 243 | roc_svm.append(roc) 244 | prec_n_svm.append(prec_n) 245 | 246 | train_svm[:, i] = train_score.ravel() 247 | test_svm[:, i] = pred_score.ravel() 248 | ########################################################################### 249 | 250 | n_list = [10, 20, 50, 70, 100, 150, 200, 250] 251 | 252 | train_if = np.zeros([X.shape[0], len(n_list)]) 253 | test_if = np.zeros([X_test.shape[0], len(n_list)]) 254 | 255 | roc_if = [] 256 | prec_n_if = [] 257 | 258 | for i in range(len(n_list)): 259 | n = n_list[i] 260 | clf = IsolationForest(n_estimators=n) 261 | clf.fit(X) 262 | train_score = clf.decision_function(X) * -1 263 | pred_score = clf.decision_function(X_test) * -1 264 | 265 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 266 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 267 | print('if roc / pren @ {n} is {roc} {pren}'.format(n=n, roc=roc, 268 | pren=prec_n)) 269 | 270 | feature_list.append('if_' + str(n)) 271 | roc_if.append(roc) 272 | prec_n_if.append(prec_n) 273 | 274 | train_if[:, i] = train_score 275 | test_if[:, i] = pred_score 276 | 277 | ######################################################################### 278 | X_train_new = np.concatenate((train_knn, train_knn_mean, train_knn_median, 279 | train_lof, train_loop, train_svm, train_if), 280 | axis=1) 281 | X_test_new = np.concatenate((test_knn, test_knn_mean, test_knn_median, 282 | test_lof, test_loop, test_svm, test_if), 283 | axis=1) 284 | 285 | X_train_all = np.concatenate((X, X_train_new), axis=1) 286 | X_test_all = np.concatenate((X_test, X_test_new), axis=1) 287 | 288 | roc_list = roc_knn + roc_knn_mean + roc_knn_median + roc_lof + roc_loop + roc_svm + roc_if 289 | prec_n_list = prec_n_knn + prec_n_knn_mean + prec_n_knn_median + prec_n_lof + prec_n_loop + prec_n_svm + prec_n_if 290 | 291 | # get the results of baselines 292 | print_baseline(X_test_new, y_test, roc_list, prec_n_list) 293 | 294 | ########################################################################### 295 | # select TOS using different methods 296 | 297 | p = 10 # number of selected TOS 298 | # TODO: supplement the cleaned up version for selection methods 299 | 300 | ############################################################################## 301 | for clf, clf_name in zip(clf_list, clf_name_list): 302 | print('processing', clf_name) 303 | if clf_name != 'xgb': 304 | clf = BalancedBaggingClassifier(base_estimator=clf, 305 | ratio='auto', 306 | replacement=False) 307 | # fully supervised 308 | clf.fit(X, y.ravel()) 309 | y_pred = clf.predict_proba(X_test) 310 | 311 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 312 | prec_n = get_precn(y_test, y_pred[:, 1]) 313 | 314 | result_dict[clf_name + 'roc' + 'o'].append(roc_score) 315 | result_dict[clf_name + 'precn' + 'o'].append(prec_n) 316 | 317 | # unsupervised 318 | clf.fit(X_train_new, y.ravel()) 319 | y_pred = clf.predict_proba(X_test_new) 320 | 321 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 322 | prec_n = 
get_precn(y_test, y_pred[:, 1]) 323 | 324 | result_dict[clf_name + 'roc' + 'n'].append(roc_score) 325 | result_dict[clf_name + 'precn' + 'n'].append(prec_n) 326 | 327 | # semi-supervised 328 | clf.fit(X_train_all, y.ravel()) 329 | y_pred = clf.predict_proba(X_test_all) 330 | 331 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 332 | prec_n = get_precn(y_test, y_pred[:, 1]) 333 | 334 | result_dict[clf_name + 'roc' + 's'].append(roc_score) 335 | result_dict[clf_name + 'precn' + 's'].append(prec_n) 336 | 337 | for eva in ['roc', 'precn']: 338 | print() 339 | for clf_name in clf_name_list: 340 | print(np.round(np.mean(result_dict[clf_name + eva + 'o']), decimals=4), 341 | eva, clf_name, 'old') 342 | print(np.round(np.mean(result_dict[clf_name + eva + 'n']), decimals=4), 343 | eva, clf_name, 'new') 344 | print(np.round(np.mean(result_dict[clf_name + eva + 's']), decimals=4), 345 | eva, clf_name, 'all') 346 | --------------------------------------------------------------------------------
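As noted in the README, XGBOD has also been released in PyOD. A minimal, illustrative usage sketch (assuming pyod is installed and `X_train`, `y_train`, `X_test` are prepared as in the scripts above):

```python
from pyod.models.xgbod import XGBOD

# XGBOD is semi-supervised: fit takes both features and labels
clf = XGBOD()
clf.fit(X_train, y_train)

# higher scores indicate more outlying points
test_scores = clf.decision_function(X_test)
test_labels = clf.predict(X_test)  # binary labels (0: inlier, 1: outlier)
```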