├── README.md ├── datasets ├── arrhythmia.mat ├── cardio.mat ├── letter.mat ├── mammography.mat ├── mnist.mat ├── satellite.mat └── speech.mat ├── figs ├── flowchart.png ├── results.png ├── sample_outputs.png └── t-SNE.png ├── models ├── __init__.py ├── generate_TOS.py ├── glosh.py ├── hbos.py ├── knn.py ├── select_TOS.py └── utility.py ├── plots.py ├── requirements.txt ├── xgbod_demo.py └── xgbod_full.py /README.md: -------------------------------------------------------------------------------- 1 | # XGBOD (Extreme Boosting Based Outlier Detection) 2 | 3 | ------------ 4 | 5 | Zhao, Y. and Hryniewicki, M.K., "XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning," *International Joint Conference on Neural Networks (IJCNN)*, IEEE, 2018. 6 | 7 | Please cite the paper as: 8 | 9 | @inproceedings{zhao2018xgbod, 10 | title={XGBOD: improving supervised outlier detection with unsupervised representation learning}, 11 | author={Zhao, Yue and Hryniewicki, Maciej K}, 12 | booktitle={2018 International Joint Conference on Neural Networks (IJCNN)}, 13 | pages={1--8}, 14 | year={2018}, 15 | organization={IEEE} 16 | } 17 | 18 | [PDF](https://www.andrew.cmu.edu/user/yuezhao2/papers/18-ijcnn-xgbod.pdf) | 19 | [IEEE Explore](https://ieeexplore.ieee.org/document/8489605) | 20 | [API Documentation](https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.xgbod) | 21 | [Example with PyOD](https://github.com/yzhao062/pyod/blob/master/examples/xgbod_example.py) 22 | 23 | 24 | **Update** (Dec 25th, 2018): XGBOD has been officially released in **[Python Outlier Detection (PyOD)](https://github.com/yzhao062/pyod)** V0.6.6. 25 | 26 | **Update** (Dec 6th, 2018): XGBOD has been implemented in **[Python Outlier Detection (PyOD)](https://github.com/yzhao062/pyod)**, to be released in pyod V0.6.6. 27 | 28 | ------------ 29 | 30 | Additional notes: 31 | 1. Two versions of the code are provided: 32 | 1. **Demo version** (xgbod_demo.py) is refactored for fast execution and reproduction as a proof of concept. The key difference from the full version is that the TOS are built once on both the training and test data, which can be regarded as static unsupervised feature engineering. However, note that in practice users should not expose or use the test data while building TOS. 33 | 2. **Production version** ([Python Outlier Detection (PyOD)](https://github.com/yzhao062/pyod)) is released with full optimization and testing as a framework. This version is intended for real applications; it requires fewer dependencies and executes faster. 34 | 2. It is understood that there are **small variations** in the results due to randomness in processes such as xgboost and Random TOS Selection. Running the demo code will therefore give similar, but not identical, results. Additionally, the specific setups differ slightly across datasets and have not been published yet. 35 | 3. While running *L1_Comb* and *L2_Comb*, EasyEnsemble is used to construct balanced bags. Note that the demo code uses 10 bags instead of 50 to execute efficiently; increasing to 50 bags would not change the results much but would bring better stability. You are welcome to change the "BalancedBaggingClassifier" parameters to use 50 bags, as in the sketch below. However, it is very slow, and this is one of the reasons why we propose XGBOD -- it is much more efficient :)
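For reference, a minimal sketch of such a configuration with imbalanced-learn (a hedged example rather than the exact experiment setup; the logistic regression base estimator is only an illustrative choice, and parameter names follow the imbalanced-learn 0.3.x API pinned in requirements.txt):

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.linear_model import LogisticRegression

# EasyEnsemble-style balanced bagging with 50 bags
# (the demo code keeps the library default of 10 bags for speed)
clf = BalancedBaggingClassifier(base_estimator=LogisticRegression(),
                                n_estimators=50,
                                ratio='auto',
                                replacement=False)
```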
36 | 37 | ------------ 38 | 39 | ## Introduction 40 | XGBOD is a three-phase framework (see Figure below). In the first phase, it generates new data representations: various unsupervised outlier detection methods are applied to the original data to obtain transformed outlier scores (TOS) as new data representations. In the second phase, a selection process is performed on the newly generated outlier scores to keep the useful ones. The selected outlier scores are then combined with the original features to form the new feature space. Finally, an XGBoost model is trained on the new feature space, and its output determines the outlier prediction result. 41 | 42 | ![XGBOD Flowchart](https://github.com/yzhao062/XGBOD/blob/master/figs/flowchart.png "XGBOD Flowchart") 43 | 44 | ## Dependency 45 | The experiment code is written in Python 3 and built on a number of Python packages: 46 | - matplotlib==2.0.2 47 | - xgboost==0.7 48 | - pandas==0.21.0 49 | - imbalanced_learn==0.3.2 50 | - scipy==0.19.1 51 | - numpy==1.13.1 52 | - PyNomaly==0.1.7 53 | - imblearn==0.0 54 | - scikit_learn==0.19.1 55 | 56 | Batch installation is possible using the supplied "requirements.txt": 57 | 58 | ````cmd 59 | pip install -r requirements.txt 60 | ```` 61 | 62 | ------------ 63 | 64 | 65 | ## Datasets 66 | Seven datasets are used (see the datasets folder): 67 | 68 | | Datasets | Sample Size | Dimension | Number of Outliers | 69 | | ------------ | -----------| ------------ | ------------------- | 70 | | Arrhythmia | 351 | 274 | 126 (36%) | 71 | | Letter | 1600 | 32 | 100 (6.25%) | 72 | | Cardio | 1831 | 21 | 176 (9.6%) | 73 | | Speech | 3686 | 600 | 61 (1.65%) | 74 | | Satellite | 6435 | 36 | 2036 (31.64%) | 75 | | Mnist | 7603 | 100 | 700 (9.21%) | 76 | | Mammography | 11183 | 6 | 260 (2.32%) | 77 | 78 | All datasets are accessible at http://odds.cs.stonybrook.edu/. For citing the datasets, please refer to: 79 | > Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science. 80 | 81 | ------------ 82 | 83 | 84 | ## Usage and Sample Output (Demo Version) 85 | Experiments can be reproduced by running **xgbod_demo.py** directly: simply download/clone the entire repository and execute the code with "python xgbod_demo.py". 86 | 87 | The first part of the code reads in the datasets using SciPy; five of the public outlier datasets are wired into the demo. Then various TOS are built by seven different algorithms: 88 | 1. KNN 89 | 2. K-Median 90 | 3. AvgKNN 91 | 4. LOF 92 | 5. LoOP 93 | 6. One-Class SVM 94 | 7. Isolation Forests 95 | **Note that you may include more TOS.** 96 | 97 | Taking KNN as an example, the code is as below: 98 | 99 | ```python 100 | # Generate TOS using KNN-based algorithms 101 | feature_list, roc_knn, prc_n_knn, result_knn = get_TOS_knn(X_norm, y, k_range, feature_list) 102 | ``` 103 | 104 | Then three TOS selection methods are used to select *p* TOS: 105 | 106 | ```python 107 | 108 | p = 10 # number of selected TOS 109 | # random selection 110 | X_train_new_rand, X_train_all_rand = random_select(X, X_train_new_orig, roc_list, p) 111 | # accurate selection 112 | X_train_new_accu, X_train_all_accu = accurate_select(X, X_train_new_orig, roc_list, p) 113 | # balance selection 114 | X_train_new_bal, X_train_all_bal = balance_select(X, X_train_new_orig, roc_list, p) 115 | ``` 116 | 117 | Finally, various classification methods are applied to the datasets. 
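For context, here is a condensed, illustrative sketch of this final stage, mirroring xgbod_demo.py (`X_train_all_accu` and `y` come from the snippets above; the demo's repetition loop and alternative feature sets are omitted):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost.sklearn import XGBClassifier

# split the combined feature space (original features + selected TOS)
X_train, X_test, y_train, y_test = train_test_split(X_train_all_accu, y,
                                                    test_size=0.4)

# train XGBoost on the new feature space and score the held-out data
clf = XGBClassifier()
clf.fit(X_train, y_train.ravel())
y_pred = clf.predict_proba(X_test)
print('ROC:', roc_auc_score(y_test, y_pred[:, 1]))
```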
Sample outputs are provided below: 118 | 119 | ![Sample Outputs on Arrhythmia](https://github.com/yzhao062/XGBOD/blob/master/figs/sample_outputs.png "Sample Outputs on Arrhythmia") 120 | ------------ 121 | ## Figures 122 | 123 | Running **plots.py** would generate the figures for various TOS selection algorithms: 124 | ![The effect of number of TOS and selection method](https://github.com/yzhao062/XGBOD/blob/master/figs/results.png "The effect of number of TOS and selection method") 125 | 126 | -------------------------------------------------------------------------------- /datasets/arrhythmia.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/arrhythmia.mat -------------------------------------------------------------------------------- /datasets/cardio.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/cardio.mat -------------------------------------------------------------------------------- /datasets/letter.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/letter.mat -------------------------------------------------------------------------------- /datasets/mammography.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/mammography.mat -------------------------------------------------------------------------------- /datasets/mnist.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/mnist.mat -------------------------------------------------------------------------------- /datasets/satellite.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/satellite.mat -------------------------------------------------------------------------------- /datasets/speech.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/datasets/speech.mat -------------------------------------------------------------------------------- /figs/flowchart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/figs/flowchart.png -------------------------------------------------------------------------------- /figs/results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/figs/results.png -------------------------------------------------------------------------------- /figs/sample_outputs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/figs/sample_outputs.png -------------------------------------------------------------------------------- 
/figs/t-SNE.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/figs/t-SNE.png -------------------------------------------------------------------------------- /models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/XGBOD/8017b56e3b9fe115901c910037366f27606ab611/models/__init__.py -------------------------------------------------------------------------------- /models/generate_TOS.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from models.utility import get_precn 4 | from sklearn.metrics import roc_auc_score 5 | from sklearn.neighbors import NearestNeighbors 6 | from sklearn.neighbors import LocalOutlierFactor 7 | from sklearn.svm import OneClassSVM 8 | from sklearn.ensemble import IsolationForest 9 | from PyNomaly import loop 10 | from models.hbos import Hbos 11 | 12 | 13 | def knn(X, n_neighbors): 14 | ''' 15 | Utility function to return k-average, k-median, and kth-NN distances. 16 | Since these three measures are similar, they are included in the same function. 17 | :param X: train data 18 | :param n_neighbors: number of neighbors 19 | :return: k-average, k-median, and kth nearest neighbor distances 20 | ''' 21 | neigh = NearestNeighbors() 22 | neigh.fit(X) 23 | 24 | res = neigh.kneighbors(n_neighbors=n_neighbors, return_distance=True) 25 | # k-average, k-median, knn 26 | return np.mean(res[0], axis=1), np.median(res[0], axis=1), res[0][:, -1] 27 | 28 | 29 | def get_TOS_knn(X, y, k_list, feature_list): 30 | knn_clf = ["knn_mean", "knn_median", "knn_kth"] 31 | 32 | result_knn = np.zeros([X.shape[0], len(k_list) * len(knn_clf)]) 33 | roc_knn = [] 34 | prec_knn = [] 35 | 36 | for i in range(len(k_list)): 37 | k = k_list[i] 38 | k_mean, k_median, k_k = knn(X, n_neighbors=k) 39 | knn_result = [k_mean, k_median, k_k] 40 | 41 | for j in range(len(knn_result)): 42 | score_pred = knn_result[j] 43 | clf = knn_clf[j] 44 | 45 | roc = np.round(roc_auc_score(y, score_pred), decimals=4) 46 | # apc = np.round(average_precision_score(y, score_pred), decimals=4) 47 | prec_n = np.round(get_precn(y, score_pred), decimals=4) 48 | print('{clf} @ {k} - ROC: {roc} Precision@n: {pren}'. 
49 | format(clf=clf, k=k, roc=roc, pren=prec_n)) 50 | feature_list.append(clf + str(k)) 51 | roc_knn.append(roc) 52 | prec_knn.append(prec_n) 53 | result_knn[:, i * len(knn_result) + j] = score_pred 54 | 55 | print() 56 | return feature_list, roc_knn, prec_knn, result_knn 57 | 58 | 59 | def get_TOS_loop(X, y, k_list, feature_list): 60 | # LoOP is only compatible with pandas DataFrames 61 | df_X = pd.DataFrame(X) 62 | 63 | result_loop = np.zeros([X.shape[0], len(k_list)]) 64 | roc_loop = [] 65 | prec_loop = [] 66 | 67 | for i in range(len(k_list)): 68 | k = k_list[i] 69 | clf = loop.LocalOutlierProbability(df_X, n_neighbors=k).fit() 70 | score_pred = clf.local_outlier_probabilities.astype(float) 71 | 72 | roc = np.round(roc_auc_score(y, score_pred), decimals=4) 73 | # apc = np.round(average_precision_score(y, score_pred), decimals=4) 74 | prec_n = np.round(get_precn(y, score_pred), decimals=4) 75 | 76 | print('LoOP @ {k} - ROC: {roc} Precision@n: {pren}'.format(k=k, 77 | roc=roc, 78 | pren=prec_n)) 79 | 80 | feature_list.append('loop_' + str(k)) 81 | roc_loop.append(roc) 82 | prec_loop.append(prec_n) 83 | result_loop[:, i] = score_pred 84 | print() 85 | return feature_list, roc_loop, prec_loop, result_loop 86 | 87 | 88 | def get_TOS_lof(X, y, k_list, feature_list): 89 | result_lof = np.zeros([X.shape[0], len(k_list)]) 90 | roc_lof = [] 91 | prec_lof = [] 92 | 93 | for i in range(len(k_list)): 94 | k = k_list[i] 95 | clf = LocalOutlierFactor(n_neighbors=k) 96 | y_pred = clf.fit_predict(X)  # fitting also computes negative_outlier_factor_ 97 | score_pred = clf.negative_outlier_factor_ 98 | 99 | roc = np.round(roc_auc_score(y, score_pred * -1), decimals=4) 100 | # apc = np.round(average_precision_score(y, score_pred * -1), decimals=4) 101 | prec_n = np.round(get_precn(y, score_pred * -1), decimals=4) 102 | print('LOF @ {k} - ROC: {roc} Precision@n: {pren}'.format(k=k, 103 | roc=roc, 104 | pren=prec_n)) 105 | 106 | feature_list.append('lof_' + str(k)) 107 | roc_lof.append(roc) 108 | prec_lof.append(prec_n) 109 | result_lof[:, i] = score_pred * -1 110 | print() 111 | return feature_list, roc_lof, prec_lof, result_lof 112 | 113 | 114 | def get_TOS_hbos(X, y, k_list, feature_list): 115 | # HBOS tries a fixed set of bin counts; override the passed-in k_list 116 | k_list = [3, 5, 7, 9, 12, 15, 20, 25, 30, 50] 117 | result_hbos = np.zeros([X.shape[0], len(k_list)]) 118 | roc_hbos = [] 119 | prec_hbos = [] 120 | for i in range(len(k_list)): 121 | k = k_list[i] 122 | clf = Hbos(bins=k, alpha=0.3) 123 | clf.fit(X) 124 | score_pred = clf.decision_scores 125 | 126 | roc = np.round(roc_auc_score(y, score_pred), decimals=4) 127 | # apc = np.round(average_precision_score(y, score_pred * -1), decimals=4) 128 | prec_n = np.round(get_precn(y, score_pred), decimals=4) 129 | print('HBOS @ {k} - ROC: {roc} Precision@n: {pren}'.format(k=k, 130 | roc=roc, 131 | pren=prec_n)) 132 | 133 | feature_list.append('hbos_' + str(k)) 134 | roc_hbos.append(roc) 135 | prec_hbos.append(prec_n) 136 | result_hbos[:, i] = score_pred 137 | print() 138 | return feature_list, roc_hbos, prec_hbos, result_hbos 139 | 140 | 141 | def get_TOS_svm(X, y, nu_list, feature_list): 142 | result_ocsvm = np.zeros([X.shape[0], len(nu_list)]) 143 | roc_ocsvm = [] 144 | prec_ocsvm = [] 145 | 146 | for i in range(len(nu_list)): 147 | nu = nu_list[i] 148 | clf = OneClassSVM(nu=nu) 149 | clf.fit(X) 150 | score_pred = clf.decision_function(X) 151 | 152 | roc = np.round(roc_auc_score(y, score_pred * -1), decimals=4) 153 | 154 | # apc = np.round(average_precision_score(y, score_pred * -1), decimals=4) 155 | prec_n = np.round( 156 | get_precn(y, score_pred * -1), decimals=4) 157 | print('svm @ 
{nu} - ROC: {roc} Precision@n: {pren}'.format(nu=nu, 158 | roc=roc, 159 | pren=prec_n)) 160 | feature_list.append('ocsvm_' + str(nu)) 161 | roc_ocsvm.append(roc) 162 | prec_ocsvm.append(prec_n) 163 | result_ocsvm[:, i] = score_pred.reshape(score_pred.shape[0]) * -1 164 | print() 165 | return feature_list, roc_ocsvm, prec_ocsvm, result_ocsvm 166 | 167 | 168 | def get_TOS_iforest(X, y, n_list, feature_list): 169 | result_if = np.zeros([X.shape[0], len(n_list)]) 170 | roc_if = [] 171 | prec_if = [] 172 | 173 | for i in range(len(n_list)): 174 | n = n_list[i] 175 | clf = IsolationForest(n_estimators=n) 176 | clf.fit(X) 177 | score_pred = clf.decision_function(X) 178 | 179 | roc = np.round(roc_auc_score(y, score_pred * -1), decimals=4) 180 | prec_n = np.round(get_precn(y, y_pred=(score_pred * -1)), decimals=4) 181 | 182 | print('Isolation Forest @ {n} - ROC: {roc} Precision@n: {pren}'.format( 183 | n=n, 184 | roc=roc, 185 | pren=prec_n)) 186 | feature_list.append('if_' + str(n)) 187 | roc_if.append(roc) 188 | prec_if.append(prec_n) 189 | result_if[:, i] = score_pred.reshape(score_pred.shape[0]) * -1 190 | print() 191 | return feature_list, roc_if, prec_if, result_if 192 | -------------------------------------------------------------------------------- /models/glosh.py: -------------------------------------------------------------------------------- 1 | import hdbscan 2 | import numpy as np 3 | from sklearn.preprocessing import StandardScaler 4 | from models.utility import get_precn 5 | 6 | 7 | class Glosh(object): 8 | def __init__(self, min_cluster_size=5): 9 | self.min_cluster_size = min_cluster_size 10 | 11 | def fit(self, X_train): 12 | self.X_train = X_train 13 | 14 | def sample_scores(self, X_test): 15 | # initialize the outputs 16 | pred_score = np.zeros([X_test.shape[0], 1]) 17 | 18 | for i in range(X_test.shape[0]): 19 | x_i = X_test[i, :] 20 | 21 | x_i = np.asarray(x_i).reshape(1, x_i.shape[0]) 22 | x_comb = np.concatenate((self.X_train, x_i), axis=0) 23 | 24 | x_comb_norm = StandardScaler().fit_transform(x_comb) 25 | 26 | clusterer = hdbscan.HDBSCAN() 27 | clusterer.fit(x_comb_norm) 28 | 29 | # print(clusterer.outlier_scores_[-1]) 30 | # record the current item 31 | pred_score[i, :] = clusterer.outlier_scores_[-1] 32 | return pred_score 33 | 34 | def evaluate(self, X_test, y_test): 35 | pred_score = self.sample_scores(X_test) 36 | prec_n = (get_precn(y_test, pred_score)) 37 | 38 | print("precision@n", prec_n) 39 | -------------------------------------------------------------------------------- /models/hbos.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import math 3 | from sklearn.preprocessing import MinMaxScaler 4 | from scipy.stats import scoreatpercentile 5 | 6 | class Hbos(object): 7 | 8 | def __init__(self, bins=10, alpha=0.3, beta=0.5, contamination=0.05): 9 | 10 | self.bins = bins 11 | self.alpha = alpha 12 | self.beta = beta 13 | self.contamination = contamination 14 | 15 | def fit(self, X): 16 | 17 | self.n, self.d = X.shape[0], X.shape[1] 18 | out_scores = np.zeros([self.n, self.d]) 19 | 20 | hist = np.zeros([self.bins, self.d]) 21 | bin_edges = np.zeros([self.bins + 1, self.d]) 22 | 23 | # this is actually the fitting 24 | for i in range(self.d): 25 | hist[:, i], bin_edges[:, i] = np.histogram(X[:, i], bins=self.bins, 26 | density=True) 27 | # check the integrity 28 | assert ( 29 | math.isclose(np.sum(hist[:, i] * np.diff(bin_edges[:, i])), 1)) 30 | 31 | # calculate the threshold 32 | for i in range(self.d): 
33 | # find histogram assignments of data points 34 | bin_ind = np.digitize(X[:, i], bin_edges[:, i], right=False) 35 | 36 | # very important to do scaling. Not necessary to use min max 37 | density_norm = MinMaxScaler().fit_transform( 38 | hist[:, i].reshape(-1, 1)) 39 | out_score = np.log(1 / (density_norm + self.alpha)) 40 | 41 | for j in range(self.n): 42 | # out sample left 43 | if bin_ind[j] == 0: 44 | dist = np.abs(X[j, i] - bin_edges[0, i]) 45 | bin_width = bin_edges[1, i] - bin_edges[0, i] 46 | # assign it to bin 0 47 | if dist < bin_width * self.beta: 48 | out_scores[j, i] = out_score[bin_ind[j]] 49 | else: 50 | out_scores[j, i] = np.max(out_score) 51 | 52 | # out sample right 53 | elif bin_ind[j] == bin_edges.shape[0]: 54 | dist = np.abs(X[j, i] - bin_edges[-1, i]) 55 | bin_width = bin_edges[-1, i] - bin_edges[-2, i] 56 | # assign it to bin k 57 | if dist < bin_width * self.beta: 58 | out_scores[j, i] = out_score[bin_ind[j] - 2] 59 | else: 60 | out_scores[j, i] = np.max(out_score) 61 | else: 62 | out_scores[j, i] = out_score[bin_ind[j] - 1] 63 | 64 | out_scores_sum = np.sum(out_scores, axis=1) 65 | self.threshold = scoreatpercentile(out_scores_sum, 66 | 100 * (1 - self.contamination)) 67 | self.hist = hist 68 | self.bin_edges = bin_edges 69 | self.decision_scores = out_scores_sum 70 | self.y_pred = (self.decision_scores > self.threshold).astype('int') 71 | 72 | def decision_function(self, X_test): 73 | 74 | n_test = X_test.shape[0] 75 | out_scores = np.zeros([n_test, self.d]) 76 | 77 | for i in range(self.d): 78 | # find histogram assignments of data points 79 | bin_ind = np.digitize(X_test[:, i], self.bin_edges[:, i], 80 | right=False) 81 | 82 | # very important to do scaling. Not necessary to use minmax 83 | density_norm = MinMaxScaler().fit_transform( 84 | self.hist[:, i].reshape(-1, 1)) 85 | 86 | out_score = np.log(1 / (density_norm + self.alpha)) 87 | 88 | for j in range(n_test): 89 | # out sample left 90 | if bin_ind[j] == 0: 91 | dist = np.abs(X_test[j, i] - self.bin_edges[0, i]) 92 | bin_width = self.bin_edges[1, i] - self.bin_edges[0, i] 93 | # assign it to bin 0 94 | if dist < bin_width * self.beta: 95 | out_scores[j, i] = out_score[bin_ind[j]] 96 | else: 97 | out_scores[j, i] = np.max(out_score) 98 | 99 | # out sample right 100 | elif bin_ind[j] == self.bin_edges.shape[0]: 101 | dist = np.abs(X_test[j, i] - self.bin_edges[-1, i]) 102 | bin_width = self.bin_edges[-1, i] - self.bin_edges[-2, i] 103 | # assign it to bin k 104 | if dist < bin_width * self.beta: 105 | out_scores[j, i] = out_score[bin_ind[j] - 2] 106 | else: 107 | out_scores[j, i] = np.max(out_score) 108 | else: 109 | out_scores[j, i] = out_score[bin_ind[j] - 1] 110 | 111 | out_scores_sum = np.sum(out_scores, axis=1) 112 | return out_scores_sum 113 | 114 | def predict(self, X_test): 115 | pred_score = self.decision_function(X_test) 116 | return (pred_score > self.threshold).astype('int') 117 | -------------------------------------------------------------------------------- /models/knn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.neighbors import NearestNeighbors 3 | from sklearn.neighbors import KDTree 4 | from sklearn.exceptions import NotFittedError 5 | from scipy.stats import scoreatpercentile 6 | 7 | 8 | class Knn(object): 9 | ''' 10 | Knn class for outlier detection 11 | support original knn, average knn, and median knn 12 | ''' 13 | 14 | def __init__(self, n_neighbors=1, contamination=0.05, method='largest'): 15 | 
self.n_neighbors = n_neighbors 16 | self.contamination = contamination 17 | self.method = method 18 | self._isfitted = False  # set to True by fit() 19 | def fit(self, X_train): 20 | self.X_train = X_train 21 | self._isfitted = True 22 | self.tree = KDTree(X_train) 23 | 24 | neigh = NearestNeighbors() 25 | neigh.fit(self.X_train) 26 | 27 | result = neigh.kneighbors(n_neighbors=self.n_neighbors, 28 | return_distance=True) 29 | dist_arr = result[0] 30 | 31 | if self.method == 'largest': 32 | dist = dist_arr[:, -1] 33 | elif self.method == 'mean': 34 | dist = np.mean(dist_arr, axis=1) 35 | elif self.method == 'median': 36 | dist = np.median(dist_arr, axis=1) 37 | 38 | threshold = scoreatpercentile(dist, 100 * (1 - self.contamination)) 39 | 40 | self.threshold = threshold 41 | self.decision_scores = dist.ravel() 42 | self.y_pred = (self.decision_scores > self.threshold).astype('int') 43 | 44 | def decision_function(self, X_test): 45 | 46 | if not self._isfitted: 47 | raise NotFittedError('Knn is not fitted yet') 48 | 49 | # initialize the output score 50 | pred_score = np.zeros([X_test.shape[0], 1]) 51 | 52 | for i in range(X_test.shape[0]): 53 | x_i = X_test[i, :] 54 | x_i = np.asarray(x_i).reshape(1, x_i.shape[0]) 55 | 56 | # get the distance of the current point 57 | dist_arr, ind_arr = self.tree.query(x_i, k=self.n_neighbors) 58 | 59 | if self.method == 'largest': 60 | dist = dist_arr[:, -1] 61 | elif self.method == 'mean': 62 | dist = np.mean(dist_arr, axis=1) 63 | elif self.method == 'median': 64 | dist = np.median(dist_arr, axis=1) 65 | 66 | pred_score_i = dist[-1] 67 | 68 | # record the current item 69 | pred_score[i, :] = pred_score_i 70 | 71 | return pred_score 72 | 73 | def predict(self, X_test): 74 | 75 | pred_score = self.decision_function(X_test) 76 | return (pred_score > self.threshold).astype('int') 77 | 78 | 79 | ############################################################################## 80 | # samples = [[-1, 0], [0., 0.], [1., 1], [2., 5.], [3, 1]] 81 | # 82 | # clf = Knn() 83 | # clf.fit(samples) 84 | # 85 | # scores = clf.decision_function(np.asarray([[2, 3], [6, 8]])).ravel() 86 | # assert (scores[0] == [2]) 87 | # assert (scores[1] == [5]) 88 | # # 89 | # labels = clf.predict(np.asarray([[2, 3], [6, 8]])).ravel() 90 | # assert (labels[0] == [0]) 91 | # assert (labels[1] == [1]) -------------------------------------------------------------------------------- /models/select_TOS.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | from scipy.stats import pearsonr 4 | 5 | from models.utility import get_top_n 6 | 7 | 8 | def random_select(X, X_train_new_orig, roc_list, p): 9 | s_feature_rand = random.sample(range(0, len(roc_list)), p) 10 | X_train_new_rand = X_train_new_orig[:, s_feature_rand] 11 | X_train_all_rand = np.concatenate((X, X_train_new_rand), axis=1) 12 | 13 | # print(s_feature_rand) 14 | 15 | return X_train_new_rand, X_train_all_rand 16 | 17 | 18 | def accurate_select(X, X_train_new_orig, roc_list, p): 19 | s_feature_accu = get_top_n(roc_list=roc_list, n=p, top=True) 20 | X_train_new_accu = X_train_new_orig[:, s_feature_accu[0][0:p]] 21 | X_train_all_accu = np.concatenate((X, X_train_new_accu), axis=1) 22 | 23 | # print(s_feature_accu) 24 | 25 | return X_train_new_accu, X_train_all_accu 26 | 27 | 28 | def balance_select(X, X_train_new_orig, roc_list, p): 29 | s_feature_balance = [] 30 | pearson_list = np.zeros([len(roc_list), 1]) 31 | 32 | # handle the first value 33 | max_value_idx = roc_list.index(max(roc_list)) 34 | 
s_feature_balance.append(max_value_idx) 35 | roc_list[max_value_idx] = -1  # mark as selected; note this mutates the caller's list 36 | 37 | for i in range(p - 1): 38 | 39 | for j in range(len(roc_list)): 40 | pear = pearsonr(X_train_new_orig[:, max_value_idx], 41 | X_train_new_orig[:, j]) 42 | 43 | # update the pearson 44 | pearson_list[j] = np.abs(pearson_list[j]) + np.abs(pear[0]) 45 | 46 | discounted_roc = np.true_divide(roc_list, pearson_list.transpose()) 47 | 48 | max_value_idx = np.argmax(discounted_roc) 49 | s_feature_balance.append(max_value_idx) 50 | roc_list[max_value_idx] = -1 51 | 52 | X_train_new_balance = X_train_new_orig[:, s_feature_balance] 53 | X_train_all_balance = np.concatenate((X, X_train_new_balance), axis=1) 54 | 55 | # print(s_feature_balance) 56 | 57 | return X_train_new_balance, X_train_all_balance 58 | -------------------------------------------------------------------------------- /models/utility.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.stats import scoreatpercentile 3 | from sklearn.metrics import precision_score 4 | from sklearn.preprocessing import StandardScaler 5 | from sklearn.metrics import roc_auc_score 6 | 7 | 8 | def get_precn(y, y_pred): 9 | ''' 10 | Utility function to calculate precision@n 11 | :param y: ground truth 12 | :param y_pred: predicted outlier scores 13 | :return: precision@n score 14 | ''' 15 | # calculate the percentage of outliers 16 | out_perc = np.count_nonzero(y) / len(y) 17 | 18 | threshold = scoreatpercentile(y_pred, 100 * (1 - out_perc)) 19 | y_pred = (y_pred > threshold).astype('int') 20 | return precision_score(y, y_pred) 21 | 22 | 23 | def precision_n(y_pred, y, n): 24 | ''' 25 | Utility function to calculate precision@n 26 | 27 | :param y_pred: predicted value 28 | :param y: ground truth 29 | :param n: number of outliers 30 | :return: scalar score 31 | ''' 32 | y_pred = np.asarray(y_pred) 33 | y = np.asarray(y) 34 | 35 | length = y.shape[0] 36 | 37 | assert (y_pred.shape == y.shape) 38 | y_sorted = np.partition(y_pred, int(length - n)) 39 | 40 | threshold = y_sorted[int(length - n)] 41 | 42 | y_n = np.greater_equal(y_pred, threshold).astype(int) 43 | # print(threshold, y_n, precision_score(y, y_n)) 44 | 45 | return precision_score(y, y_n) 46 | 47 | 48 | def get_top_n(roc_list, n, top=True): 49 | ''' 50 | For use by Accurate Selection only 51 | :param roc_list: a list of ROC scores, one per TOS 52 | :param n: the number of TOS to select 53 | :param top: select the top n if True, otherwise the bottom n 54 | :return: indices of the selected TOS 55 | ''' 56 | roc_list = np.asarray(roc_list) 57 | length = roc_list.shape[0] 58 | 59 | roc_sorted = np.partition(roc_list, length - n) 60 | threshold = roc_sorted[int(length - n)] 61 | 62 | if top: 63 | return np.where(np.greater_equal(roc_list, threshold)) 64 | else: 65 | return np.where(np.less(roc_list, threshold)) 66 | 67 | 68 | def print_baseline(X_train_new_orig, y, roc_list, prec_list): 69 | max_value_idx = roc_list.index(max(roc_list)) 70 | print() 71 | print('Highest TOS ROC:', roc_list[max_value_idx]) 72 | print('Highest TOS Precision@n', max(prec_list)) 73 | 74 | # normalized score 75 | X_train_all_norm = StandardScaler().fit_transform(X_train_new_orig) 76 | X_train_all_norm_mean = np.mean(X_train_all_norm, axis=1) 77 | 78 | roc = np.round(roc_auc_score(y, X_train_all_norm_mean), decimals=4) 79 | prec_n = np.round(get_precn(y, X_train_all_norm_mean), decimals=4) 80 | 81 | print('Average TOS ROC:', roc) 82 | print('Average TOS Precision@n', prec_n) 83 | -------------------------------------------------------------------------------- /plots.py: 
-------------------------------------------------------------------------------- 1 | import os 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | 5 | # initialize the results of the experiments 6 | # arrhythmia 7 | prc_gr_arr = [0.5606, 0.5976, 0.5986, 0.6053, 0.6109, 0.6219, 0.6076, 0.6115] 8 | prc_ac_arr = [0.5606, 0.5976, 0.5719, 0.5961, 0.6041, 0.5792, 0.6019, 0.6115] 9 | prc_rd_arr = [0.5606, 0.5993, 0.5788, 0.6356, 0.5908, 0.6094, 0.6202, 0.6115] 10 | 11 | # letter 12 | prc_gr_lt = [0.6003, 0.7300, 0.7234, 0.7199, 0.7169, 0.7285, 0.7323, 0.7376] 13 | prc_ac_lt = [0.6003, 0.7300, 0.7210, 0.7272, 0.7477, 0.7302, 0.7308, 0.7376] 14 | prc_rd_lt = [0.6003, 0.6653, 0.7140, 0.7248, 0.7397, 0.7302, 0.7232, 0.7376] 15 | 16 | # cardio 17 | prc_gr_car = [0.9304, 0.9290, 0.9374, 0.9385, 0.9296, 0.9351, 0.9327, 0.9332] 18 | prc_ac_car = [0.9304, 0.9290, 0.9314, 0.9315, 0.9337, 0.9354, 0.9331, 0.9332] 19 | prc_rd_car = [0.9304, 0.9297, 0.9342, 0.9364, 0.9315, 0.9267, 0.9248, 0.9332] 20 | 21 | # speech 22 | prc_gr_sp = [0.1455, 0.2658, 0.2733, 0.3203, 0.3290, 0.3107, 0.3355, 0.2492] 23 | prc_ac_sp = [0.1455, 0.2658, 0.2367, 0.2630, 0.3103, 0.2983, 0.3255, 0.2492] 24 | prc_rd_sp = [0.1455, 0.1356, 0.1814, 0.2101, 0.3194, 0.3053, 0.2940, 0.2492] 25 | 26 | # mammography 27 | prc_gr_ma = [0.6974, 0.6853, 0.6719, 0.6720, 0.6620, 0.6717, 0.6687, 0.6673] 28 | prc_ac_ma = [0.6974, 0.6853, 0.6915, 0.6841, 0.6965, 0.6631, 0.6655, 0.6673] 29 | prc_rd_ma = [0.6974, 0.6812, 0.6823, 0.6649, 0.6693, 0.6619, 0.6654, 0.6673] 30 | 31 | # x-axis 32 | x = [0, 1, 5, 10, 30, 50, 70, 100] 33 | 34 | # main plots 35 | fig = plt.figure(figsize=(8, 10)) 36 | lw = 2 37 | 38 | ax = fig.add_subplot(511) 39 | 40 | plt.plot(x, prc_rd_arr, color='black', linestyle='-.', marker='s', 41 | lw=lw, label='Random Selection') 42 | plt.plot(x, prc_gr_arr, color='blue', linestyle='--', marker='^', 43 | lw=lw, label='Balance Selection') 44 | plt.plot(x, prc_ac_arr, color='red', linestyle='-', marker='o', 45 | lw=lw, label='Accurate Selection') 46 | 47 | plt.xlim([-0.5, 100.5]) 48 | plt.xticks(np.arange(0, 100, 5)) 49 | plt.ylabel('Precision@n', fontsize=12) 50 | plt.title('Arrhythmia', fontsize=12) 51 | plt.legend(loc="lower right") 52 | 53 | ######################################################################### 54 | ax = fig.add_subplot(512) 55 | plt.plot(x, prc_rd_lt, color='black', linestyle='-.', marker='s', 56 | lw=lw, label='Random Selection') 57 | plt.plot(x, prc_gr_lt, color='blue', linestyle='--', marker='^', 58 | lw=lw, label='Balance Selection') 59 | plt.plot(x, prc_ac_lt, color='red', linestyle='--', marker='o', 60 | lw=lw, label='Accurate Selection') 61 | 62 | plt.xlim([-0.5, 100.5]) 63 | plt.xticks(np.arange(0, 100, 5)) 64 | plt.ylabel('Precision@n', fontsize=12) 65 | plt.title('Letter', fontsize=12) 66 | plt.legend(loc="lower right") 67 | 68 | ######################################################################### 69 | ax = fig.add_subplot(513) 70 | plt.plot(x, prc_rd_car, color='black', linestyle='-.', marker='s', 71 | lw=lw, label='Random Selection') 72 | plt.plot(x, prc_gr_car, color='blue', linestyle='--', marker='^', 73 | lw=lw, label='Balance Selection') 74 | plt.plot(x, prc_ac_car, color='red', linestyle='--', marker='o', 75 | lw=lw, label='Accurate Selection') 76 | 77 | plt.xlim([-0.5, 100.5]) 78 | plt.xticks(np.arange(0, 100, 5)) 79 | plt.ylabel('Precision@n', fontsize=12) 80 | plt.title('Cardio', fontsize=12) 81 | plt.legend(loc="lower right") 82 | 83 | 
######################################################################### 84 | ax = fig.add_subplot(514) 85 | plt.plot(x, prc_rd_sp, color='black', linestyle='-.', marker='s', 86 | lw=lw, label='Random Selection') 87 | plt.plot(x, prc_gr_sp, color='blue', linestyle='--', marker='^', 88 | lw=lw, label='Balance Selection') 89 | plt.plot(x, prc_ac_sp, color='red', linestyle='--', marker='o', 90 | lw=lw, label='Accurate Selection') 91 | 92 | plt.xlim([-0.5, 100.5]) 93 | plt.xticks(np.arange(0, 100, 5)) 94 | plt.ylabel('Precision@n', fontsize=12) 95 | plt.title('Speech', fontsize=12) 96 | plt.legend(loc="lower right") 97 | ######################################################################### 98 | ax = fig.add_subplot(515) 99 | plt.plot(x, prc_rd_ma, color='black', linestyle='-.', marker='s', 100 | lw=lw, label='Random Selection') 101 | plt.plot(x, prc_gr_ma, color='blue', linestyle='--', marker='^', 102 | lw=lw, label='Balance Selection') 103 | plt.plot(x, prc_ac_ma, color='red', linestyle='--', marker='o', 104 | lw=lw, label='Accurate Selection') 105 | 106 | plt.xlim([-0.5, 100.5]) 107 | plt.xticks(np.arange(0, 100, 5)) 108 | plt.xlabel('Number of Selected TOS') 109 | plt.ylabel('Precision@n', fontsize=12) 110 | plt.title('Mammography', fontsize=12) 111 | plt.legend(loc="upper right") 112 | 113 | ######################################################################### 114 | plt.tight_layout() 115 | plt.savefig(os.path.join('figs', 'results.png'), dpi=300) 116 | plt.show() 117 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | imbalanced_learn>=0.3.2 2 | matplotlib>=2.0.2 3 | numpy>=1.13.1 4 | pandas>=0.21.0 5 | PyNomaly>=0.1.7 6 | scikit_learn>=0.19.1 7 | scipy>=0.19.1 8 | xgboost>=0.7 -------------------------------------------------------------------------------- /xgbod_demo.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Demo codes for XGBOD. 3 | Author: Yue Zhao 4 | 5 | notes: the demo code simulates the use of XGBOD with some changes to expedite 6 | the execution. Use the full code for production. 
7 | 8 | ''' 9 | import os 10 | import random 11 | import scipy.io as scio 12 | import numpy as np 13 | 14 | from sklearn.preprocessing import StandardScaler, normalize 15 | from sklearn.metrics import roc_auc_score 16 | from sklearn.model_selection import train_test_split 17 | from sklearn.linear_model import LogisticRegression 18 | 19 | from xgboost.sklearn import XGBClassifier 20 | from imblearn.ensemble import BalancedBaggingClassifier 21 | from models.utility import get_precn, print_baseline 22 | from models.generate_TOS import get_TOS_knn 23 | from models.generate_TOS import get_TOS_loop 24 | from models.generate_TOS import get_TOS_lof 25 | from models.generate_TOS import get_TOS_svm 26 | from models.generate_TOS import get_TOS_iforest 27 | from models.generate_TOS import get_TOS_hbos 28 | from models.select_TOS import random_select, accurate_select, balance_select 29 | 30 | # load data file 31 | # mat = scio.loadmat(os.path.join('datasets', 'speech.mat')) 32 | mat = scio.loadmat(os.path.join('datasets', 'arrhythmia.mat')) 33 | # mat = scio.loadmat(os.path.join('datasets', 'cardio.mat')) 34 | # mat = scio.loadmat(os.path.join('datasets', 'letter.mat')) 35 | # mat = scio.loadmat(os.path.join('datasets', 'mammography.mat')) 36 | 37 | X = mat['X'] 38 | y = mat['y'] 39 | 40 | # using unit-norm vectors for X improves knn, LoOP, and LOF results 41 | scaler = StandardScaler().fit(X)  # standardization kept as an alternative 42 | # X_norm = scaler.transform(X) 43 | X_norm = normalize(X) 44 | feature_list = [] 45 | 46 | # Run KNN-based algorithms to generate additional features 47 | 48 | # predefined range of k 49 | k_range = [1, 2, 3, 4, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 50 | 200, 250] 51 | # predefined range of k to be used with LoOP due to high complexity 52 | k_range_short = [1, 3, 5, 10] 53 | 54 | # validate the value of k 55 | k_range = [k for k in k_range if k < X.shape[0]] 56 | 57 | # predefined range of nu for one-class svm 58 | nu_range = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99] 59 | 60 | # predefined range for number of estimators in isolation forests 61 | n_range = [10, 20, 50, 70, 100, 150, 200, 250] 62 | ############################################################################## 63 | 64 | # Generate TOS using KNN based algorithms 65 | feature_list, roc_knn, prc_n_knn, result_knn = get_TOS_knn(X_norm, y, k_range, 66 | feature_list) 67 | # Generate TOS using LoOP 68 | feature_list, roc_loop, prc_n_loop, result_loop = get_TOS_loop(X, y, 69 | k_range_short, 70 | feature_list) 71 | # Generate TOS using LOF 72 | feature_list, roc_lof, prc_n_lof, result_lof = get_TOS_lof(X_norm, y, k_range, 73 | feature_list) 74 | # Generate TOS using one class svm 75 | feature_list, roc_ocsvm, prc_n_ocsvm, result_ocsvm = get_TOS_svm(X, y, 76 | nu_range, 77 | feature_list) 78 | # Generate TOS using isolation forests 79 | feature_list, roc_if, prc_n_if, result_if = get_TOS_iforest(X, y, n_range, 80 | feature_list) 81 | 82 | # Generate TOS using HBOS 83 | feature_list, roc_hbos, prc_n_hbos, result_hbos = get_TOS_hbos(X, y, k_range, 84 | feature_list) 85 | ############################################################################## 86 | # combine the feature space by concatenating various TOS (HBOS excluded) 87 | X_train_new_orig = np.concatenate( 88 | (result_knn, result_loop, result_lof, result_ocsvm, result_if), axis=1) 89 | 90 | X_train_all_orig = np.concatenate((X, X_train_new_orig), axis=1) 91 | 92 | # combine ROC and Precision@n list 93 | roc_list = roc_knn + roc_loop + roc_lof + roc_ocsvm + roc_if 94 | 
prc_n_list = prc_n_knn + prc_n_loop + prc_n_lof + prc_n_ocsvm + prc_n_if 95 | 96 | # get the results of baselines 97 | print_baseline(X_train_new_orig, y, roc_list, prc_n_list) 98 | 99 | ############################################################################## 100 | # select TOS using different methods 101 | 102 | p = 10 # number of selected TOS 103 | 104 | # random selection 105 | # note that the actual random selection is re-drawn within the 106 | # train-test loop below, once per iteration. 107 | X_train_new_rand, X_train_all_rand = random_select(X, X_train_new_orig, 108 | roc_list, p) 109 | # accurate selection 110 | X_train_new_accu, X_train_all_accu = accurate_select(X, X_train_new_orig, 111 | roc_list, p) 112 | # balance selection 113 | X_train_new_bal, X_train_all_bal = balance_select(X, X_train_new_orig, 114 | roc_list, p) 115 | 116 | ############################################################################### 117 | # build various classifiers 118 | 119 | # note that in practice the data split should happen as the first stage, so 120 | # test data is not exposed. However, with a relatively large number of 121 | # repetitions, the demo code generates a similar result. 122 | 123 | # the full code uses containers to save the intermediate TOS models; that 124 | # code will be shared after the cleanup. 125 | 126 | ite = 30 # number of iterations 127 | test_size = 0.4 # training = 60%, testing = 40% 128 | result_dict = {} 129 | 130 | clf_list = [XGBClassifier(), LogisticRegression(penalty="l1"), 131 | LogisticRegression(penalty="l2")] 132 | clf_name_list = ['xgb', 'lr1', 'lr2'] 133 | 134 | # initialize the result dictionary 135 | for clf_name in clf_name_list: 136 | result_dict[clf_name + 'ROC' + 'o'] = [] 137 | result_dict[clf_name + 'ROC' + 's'] = [] 138 | result_dict[clf_name + 'ROC' + 'n'] = [] 139 | 140 | result_dict[clf_name + 'PRC@n' + 'o'] = [] 141 | result_dict[clf_name + 'PRC@n' + 's'] = [] 142 | result_dict[clf_name + 'PRC@n' + 'n'] = [] 143 | 144 | for i in range(ite): 145 | s_feature_rand = random.sample(range(0, len(roc_list)), p) 146 | X_train_new_rand = X_train_new_orig[:, s_feature_rand] 147 | X_train_all_rand = np.concatenate((X, X_train_new_rand), axis=1) 148 | 149 | original_len = X.shape[1] 150 | 151 | # use all TOS 152 | X_train, X_test, y_train, y_test = train_test_split(X_train_all_orig, y, 153 | test_size=test_size) 154 | # # use Random Selection 155 | # X_train, X_test, y_train, y_test = train_test_split(X_train_all_rand, y, 156 | # test_size=test_size) 157 | # # use Accurate Selection 158 | # X_train, X_test, y_train, y_test = train_test_split(X_train_all_accu, y, 159 | # test_size=test_size) 160 | # # use Balance Selection 161 | # X_train, X_test, y_train, y_test = train_test_split(X_train_all_bal, y, 162 | # test_size=test_size) 163 | 164 | # use original features 165 | X_train_o = X_train[:, 0:original_len] 166 | X_test_o = X_test[:, 0:original_len] 167 | 168 | X_train_n = X_train[:, original_len:] 169 | X_test_n = X_test[:, original_len:] 170 | 171 | for clf, clf_name in zip(clf_list, clf_name_list): 172 | print('processing', clf_name, 'round', i + 1) 173 | if clf_name != 'xgb': 174 | clf = BalancedBaggingClassifier(base_estimator=clf, 175 | ratio='auto', 176 | replacement=False) 177 | 178 | # fully supervised 179 | clf.fit(X_train_o, y_train.ravel()) 180 | y_pred = clf.predict_proba(X_test_o) 181 | 182 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 183 | prec_n = get_precn(y_test, y_pred[:, 1]) 184 | 185 | result_dict[clf_name + 'ROC' + 
'o'].append(roc_score) 186 | result_dict[clf_name + 'PRC@n' + 'o'].append(prec_n) 187 | 188 | # unsupervised 189 | clf.fit(X_train_n, y_train.ravel()) 190 | y_pred = clf.predict_proba(X_test_n) 191 | 192 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 193 | prec_n = get_precn(y_test, y_pred[:, 1]) 194 | 195 | result_dict[clf_name + 'ROC' + 'n'].append(roc_score) 196 | result_dict[clf_name + 'PRC@n' + 'n'].append(prec_n) 197 | 198 | # semi-supervised 199 | clf.fit(X_train, y_train.ravel()) 200 | y_pred = clf.predict_proba(X_test) 201 | 202 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 203 | prec_n = get_precn(y_test, y_pred[:, 1]) 204 | 205 | result_dict[clf_name + 'ROC' + 's'].append(roc_score) 206 | result_dict[clf_name + 'PRC@n' + 's'].append(prec_n) 207 | 208 | for eva in ['ROC', 'PRC@n']: 209 | print() 210 | for clf_name in clf_name_list: 211 | print(np.round(np.mean(result_dict[clf_name + eva + 'o']), decimals=4), 212 | eva, clf_name, 'original features') 213 | print(np.round(np.mean(result_dict[clf_name + eva + 'n']), decimals=4), 214 | eva, clf_name, 'TOS only') 215 | print(np.round(np.mean(result_dict[clf_name + eva + 's']), decimals=4), 216 | eva, clf_name, 'original feature + TOS') 217 | -------------------------------------------------------------------------------- /xgbod_full.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | import numpy as np 4 | import scipy.io as scio 5 | 6 | from sklearn.preprocessing import StandardScaler, Normalizer 7 | from sklearn.model_selection import train_test_split 8 | from sklearn.metrics import roc_auc_score 9 | from sklearn.neighbors import LocalOutlierFactor 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.ensemble import IsolationForest 12 | from sklearn.svm import OneClassSVM 13 | 14 | from xgboost.sklearn import XGBClassifier 15 | from imblearn.ensemble import BalancedBaggingClassifier 16 | from PyNomaly import loop 17 | 18 | from models.knn import Knn 19 | from models.utility import get_precn, print_baseline 20 | 21 | # use one dataset at a time; more datasets could be added to /datasets folder 22 | # the experiment code needs a bit more setting up; otherwise the 23 | # exact reproduction is infeasible. 
Clean-up codes are going to be moved 24 | 25 | # load data file 26 | mat = scio.loadmat(os.path.join('datasets', 'letter.mat')) 27 | ite = 30 # number of iterations 28 | test_size = 0.4 # training = 60%, testing = 40% 29 | 30 | X_orig = mat['X'] 31 | y_orig = mat['y'] 32 | 33 | # outlier percentage 34 | out_perc = np.count_nonzero(y_orig) / len(y_orig) 35 | 36 | # define classifiers to use 37 | clf_list = [XGBClassifier(), LogisticRegression(penalty="l1"), 38 | LogisticRegression(penalty="l2")] 39 | clf_name_list = ['xgb', 'lr1', 'lr2'] 40 | 41 | # initialize the container to store the results 42 | result_dict = {} 43 | 44 | # initialize the result dictionary 45 | for clf_name in clf_name_list: 46 | result_dict[clf_name + 'roc' + 'o'] = [] 47 | result_dict[clf_name + 'roc' + 's'] = [] 48 | result_dict[clf_name + 'roc' + 'n'] = [] 49 | 50 | result_dict[clf_name + 'precn' + 'o'] = [] 51 | result_dict[clf_name + 'precn' + 's'] = [] 52 | result_dict[clf_name + 'precn' + 'n'] = [] 53 | 54 | for t in range(ite): 55 | 56 | print('\nProcessing trial', t + 1, 'out of', ite) 57 | 58 | # split X and y for training and validation 59 | X, X_test, y, y_test = train_test_split(X_orig, y_orig, 60 | test_size=test_size) 61 | 62 | # reserve the normalized data 63 | scaler = Normalizer().fit(X) 64 | X_norm = scaler.transform(X) 65 | X_test_norm = scaler.transform(X_test) 66 | 67 | feature_list = [] 68 | 69 | # predefined range of K 70 | k_list_pre = [1, 2, 3, 4, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 71 | 100, 150, 200, 250] 72 | # trim the list in case of small sample size 73 | k_list = [k for k in k_list_pre if k < X.shape[0]] 74 | 75 | ########################################################################### 76 | train_knn = np.zeros([X.shape[0], len(k_list)]) 77 | test_knn = np.zeros([X_test.shape[0], len(k_list)]) 78 | 79 | roc_knn = [] 80 | prec_n_knn = [] 81 | 82 | for i in range(len(k_list)): 83 | k = k_list[i] 84 | 85 | clf = Knn(n_neighbors=k, contamination=out_perc, method='largest') 86 | clf.fit(X_norm) 87 | train_score = clf.decision_scores 88 | pred_score = clf.decision_function(X_test_norm) 89 | 90 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 91 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 92 | print('knn roc pren @ {k} is {roc} {pren}'.format(k=k, roc=roc, 93 | pren=prec_n)) 94 | 95 | feature_list.append('knn_' + str(k)) 96 | roc_knn.append(roc) 97 | prec_n_knn.append(prec_n) 98 | 99 | train_knn[:, i] = train_score 100 | test_knn[:, i] = pred_score.ravel() 101 | ########################################################################### 102 | 103 | train_knn_mean = np.zeros([X.shape[0], len(k_list)]) 104 | test_knn_mean = np.zeros([X_test.shape[0], len(k_list)]) 105 | 106 | roc_knn_mean = [] 107 | prec_n_knn_mean = [] 108 | for i in range(len(k_list)): 109 | k = k_list[i] 110 | 111 | clf = Knn(n_neighbors=k, contamination=out_perc, method='mean') 112 | clf.fit(X_norm) 113 | train_score = clf.decision_scores 114 | pred_score = clf.decision_function(X_test_norm) 115 | 116 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 117 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 118 | print('knn_mean roc pren @ {k} is {roc} {pren}'.format(k=k, roc=roc, 119 | pren=prec_n)) 120 | 121 | feature_list.append('knn_mean_' + str(k)) 122 | roc_knn_mean.append(roc) 123 | prec_n_knn_mean.append(prec_n) 124 | 125 | train_knn_mean[:, i] = train_score 126 | test_knn_mean[:, i] = pred_score.ravel() 127 | 
########################################################################### 128 | 129 | train_knn_median = np.zeros([X.shape[0], len(k_list)]) 130 | test_knn_median = np.zeros([X_test.shape[0], len(k_list)]) 131 | 132 | roc_knn_median = [] 133 | prec_n_knn_median = [] 134 | for i in range(len(k_list)): 135 | k = k_list[i] 136 | 137 | clf = Knn(n_neighbors=k, contamination=out_perc, method='median') 138 | clf.fit(X_norm) 139 | train_score = clf.decision_scores 140 | pred_score = clf.decision_function(X_test_norm) 141 | 142 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 143 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 144 | print('knn_median roc pren @ {k} is {roc} {pren}'.format(k=k, roc=roc, 145 | pren=prec_n)) 146 | 147 | feature_list.append('knn_median_' + str(k)) 148 | roc_knn_median.append(roc) 149 | prec_n_knn_median.append(prec_n) 150 | 151 | train_knn_median[:, i] = train_score 152 | test_knn_median[:, i] = pred_score.ravel() 153 | ########################################################################### 154 | 155 | train_lof = np.zeros([X.shape[0], len(k_list)]) 156 | test_lof = np.zeros([X_test.shape[0], len(k_list)]) 157 | 158 | roc_lof = [] 159 | prec_n_lof = [] 160 | 161 | for i in range(len(k_list)): 162 | k = k_list[i] 163 | clf = LocalOutlierFactor(n_neighbors=k) 164 | clf.fit(X_norm) 165 | 166 | # save the train sets 167 | train_score = clf.negative_outlier_factor_ * -1 168 | # flip the score 169 | pred_score = clf._decision_function(X_test_norm) * -1 170 | 171 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 172 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 173 | print('lof roc pren @ {k} is {roc} {pren}'.format(k=k, roc=roc, 174 | pren=prec_n)) 175 | feature_list.append('lof_' + str(k)) 176 | roc_lof.append(roc) 177 | prec_n_lof.append(prec_n) 178 | 179 | train_lof[:, i] = train_score 180 | test_lof[:, i] = pred_score 181 | 182 | ########################################################################### 183 | # Note that LoOP is not really usable for prediction due to its high 184 | # computational complexity. 185 | # However, it is included to demonstrate the effectiveness of XGBOD. 186 | 187 | df_X = pd.DataFrame(np.concatenate([X_norm, X_test_norm], axis=0)) 188 | 189 | # predefined range of K 190 | k_list = [1, 5, 10, 20] 191 | 192 | train_loop = np.zeros([X.shape[0], len(k_list)]) 193 | test_loop = np.zeros([X_test.shape[0], len(k_list)]) 194 | 195 | roc_loop = [] 196 | prec_n_loop = [] 197 | 198 | for i in range(len(k_list)): 199 | k = k_list[i] 200 | clf = loop.LocalOutlierProbability(df_X, n_neighbors=k).fit() 201 | score = clf.local_outlier_probabilities.astype(float) 202 | 203 | # save the train sets 204 | train_score = score[0:X.shape[0]] 205 | # take the test portion of the scores 206 | pred_score = score[X.shape[0]:] 207 | 208 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 209 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 210 | print('loop roc pren @ {k} is {roc} {pren}'.format(k=k, roc=roc, 211 | pren=prec_n)) 212 | feature_list.append('loop_' + str(k)) 213 | roc_loop.append(roc) 214 | prec_n_loop.append(prec_n) 215 | 216 | train_loop[:, i] = train_score 217 | test_loop[:, i] = pred_score 218 | 219 | ########################################################################## 220 | nu_list = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99] 221 | 222 | train_svm = np.zeros([X.shape[0], len(nu_list)]) 223 | test_svm = np.zeros([X_test.shape[0], len(nu_list)]) 224 | 225 | 
roc_svm = [] 226 | prec_n_svm = [] 227 | 228 | for i in range(len(nu_list)): 229 | nu = nu_list[i] 230 | 231 | clf = OneClassSVM(nu=nu) 232 | clf.fit(X) 233 | 234 | train_score = clf.decision_function(X) * -1 235 | pred_score = clf.decision_function(X_test) * -1 236 | 237 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 238 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 239 | print('svm roc / pren @ {nu} is {roc} {pren}'.format(nu=nu, roc=roc, 240 | pren=prec_n)) 241 | 242 | feature_list.append('svm_' + str(nu)) 243 | roc_svm.append(roc) 244 | prec_n_svm.append(prec_n) 245 | 246 | train_svm[:, i] = train_score.ravel() 247 | test_svm[:, i] = pred_score.ravel() 248 | ########################################################################### 249 | 250 | n_list = [10, 20, 50, 70, 100, 150, 200, 250] 251 | 252 | train_if = np.zeros([X.shape[0], len(n_list)]) 253 | test_if = np.zeros([X_test.shape[0], len(n_list)]) 254 | 255 | roc_if = [] 256 | prec_n_if = [] 257 | 258 | for i in range(len(n_list)): 259 | n = n_list[i] 260 | clf = IsolationForest(n_estimators=n) 261 | clf.fit(X) 262 | train_score = clf.decision_function(X) * -1 263 | pred_score = clf.decision_function(X_test) * -1 264 | 265 | roc = np.round(roc_auc_score(y_test, pred_score), decimals=4) 266 | prec_n = np.round(get_precn(y_test, pred_score), decimals=4) 267 | print('if roc / pren @ {n} is {roc} {pren}'.format(n=n, roc=roc, 268 | pren=prec_n)) 269 | 270 | feature_list.append('if_' + str(n)) 271 | roc_if.append(roc) 272 | prec_n_if.append(prec_n) 273 | 274 | train_if[:, i] = train_score 275 | test_if[:, i] = pred_score 276 | 277 | ######################################################################### 278 | X_train_new = np.concatenate((train_knn, train_knn_mean, train_knn_median, 279 | train_lof, train_loop, train_svm, train_if), 280 | axis=1) 281 | X_test_new = np.concatenate((test_knn, test_knn_mean, test_knn_median, 282 | test_lof, test_loop, test_svm, test_if), 283 | axis=1) 284 | 285 | X_train_all = np.concatenate((X, X_train_new), axis=1) 286 | X_test_all = np.concatenate((X_test, X_test_new), axis=1) 287 | 288 | roc_list = roc_knn + roc_knn_mean + roc_knn_median + roc_lof + roc_loop + roc_svm + roc_if 289 | prec_n_list = prec_n_knn + prec_n_knn_mean + prec_n_knn_median + prec_n_lof + prec_n_loop + prec_n_svm + prec_n_if 290 | 291 | # get the results of baselines 292 | print_baseline(X_test_new, y_test, roc_list, prec_n_list) 293 | 294 | ########################################################################### 295 | # select TOS using different methods 296 | 297 | p = 10 # number of selected TOS 298 | # TODO: supplement the cleaned up version for selection methods 299 | 300 | ############################################################################## 301 | for clf, clf_name in zip(clf_list, clf_name_list): 302 | print('processing', clf_name) 303 | if clf_name != 'xgb': 304 | clf = BalancedBaggingClassifier(base_estimator=clf, 305 | ratio='auto', 306 | replacement=False) 307 | # fully supervised 308 | clf.fit(X, y.ravel()) 309 | y_pred = clf.predict_proba(X_test) 310 | 311 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 312 | prec_n = get_precn(y_test, y_pred[:, 1]) 313 | 314 | result_dict[clf_name + 'roc' + 'o'].append(roc_score) 315 | result_dict[clf_name + 'precn' + 'o'].append(prec_n) 316 | 317 | # unsupervised 318 | clf.fit(X_train_new, y.ravel()) 319 | y_pred = clf.predict_proba(X_test_new) 320 | 321 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 322 | prec_n = 
get_precn(y_test, y_pred[:, 1]) 323 | 324 | result_dict[clf_name + 'roc' + 'n'].append(roc_score) 325 | result_dict[clf_name + 'precn' + 'n'].append(prec_n) 326 | 327 | # semi-supervised 328 | clf.fit(X_train_all, y.ravel()) 329 | y_pred = clf.predict_proba(X_test_all) 330 | 331 | roc_score = roc_auc_score(y_test, y_pred[:, 1]) 332 | prec_n = get_precn(y_test, y_pred[:, 1]) 333 | 334 | result_dict[clf_name + 'roc' + 's'].append(roc_score) 335 | result_dict[clf_name + 'precn' + 's'].append(prec_n) 336 | 337 | for eva in ['roc', 'precn']: 338 | print() 339 | for clf_name in clf_name_list: 340 | print(np.round(np.mean(result_dict[clf_name + eva + 'o']), decimals=4), 341 | eva, clf_name, 'old') 342 | print(np.round(np.mean(result_dict[clf_name + eva + 'n']), decimals=4), 343 | eva, clf_name, 'new') 344 | print(np.round(np.mean(result_dict[clf_name + eva + 's']), decimals=4), 345 | eva, clf_name, 'all') 346 | --------------------------------------------------------------------------------
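As noted in the README, XGBOD has also been released in PyOD. A minimal, illustrative usage sketch (assuming pyod is installed and `X_train`, `y_train`, `X_test` are prepared as in the scripts above):

```python
from pyod.models.xgbod import XGBOD

# XGBOD is semi-supervised: fit takes both features and labels
clf = XGBOD()
clf.fit(X_train, y_train)

# higher scores indicate more outlying points
test_scores = clf.decision_function(X_test)
test_labels = clf.predict(X_test)  # binary labels (0: inlier, 1: outlier)
```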