├── README.md
├── datasets
    └── cardio.mat
├── demo_knn.py
├── demo_lof.py
├── md_figs
    ├── flowchart.png
    ├── knn_prc.png
    ├── knn_roc.png
    ├── lof_prc.png
    ├── lof_roc.png
    └── tsne.png
├── models
    ├── __init__.py
    ├── combination.py
    ├── knn.py
    └── lof.py
├── requirements.txt
├── utility
    ├── __init__.py
    ├── stat_models.py
    └── utility.py
├── vis_tsne.py
└── viz
    ├── Annthyroid.png
    ├── Cardio.png
    ├── Letter.png
    ├── Mnist.png
    ├── Pendigits.png
    ├── Pima.png
    ├── Satellite.png
    ├── Thyroid.png
    └── Vowels.png


/README.md:
--------------------------------------------------------------------------------
# DCSO (Dynamic Combination of Detector Scores for Outlier Ensembles)
### Supplementary materials: datasets, demo source code and sample outputs.

Y. Zhao and M.K. Hryniewicki, "DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles," *ACM KDD Workshop on Outlier Detection De-constructed (ODD v5.0)*, 2018.

Please cite the paper as:

    @conference{zhao2018dcso,
      author       = {Zhao, Yue and Hryniewicki, Maciej K},
      title        = {{DCSO:} Dynamic Combination of Detector Scores for Outlier Ensembles},
      booktitle    = {ACM SIGKDD ODD Workshop},
      year         = {2018},
      address      = {London, UK},
      timestamp    = {Mon, 22 Oct 2018 13:07:32 +0200},
    }


**[PDF](https://www.andrew.cmu.edu/user/lakoglu/odd/accepted_papers/ODD_v50_paper_3.pdf)** |
**[Presentation Slides](https://yuezhao.squarespace.com/s/ODD-Zhao-DCSO.pdf)**

**Note**: [LSCP](https://github.com/yzhao062/lscp) is an upgraded version of DCSO, which has been accepted at SDM'19.

------------

Additional notes:
1. Three versions of the code are (going to be) provided:
    1. **Demo version** (demo_lof.py and demo_knn.py) is provided for fast reproduction of the experimental results. The demo version only compares the baseline algorithms with the DCSO algorithms. The effect of parameters, e.g., the choice of *k*, is not included.
    2. 
**Full version** (tba) will be released after moderate code cleanup and optimization. In contrast to the demo version, the full version also considers the impact of parameter settings. The full version is therefore relatively slow and will be further optimized. The demo version is sufficient to demonstrate the idea. We suggest using the demo version to experiment with DCSO while the full version is being optimized.
    3. **Production version** (tba) will be released with full optimization and testing as a framework. This version is intended for real applications, so it should require fewer dependencies and execute faster.
2. **Small variations** in the results are expected due to the random process, e.g., splitting the data into training and test sets. Running the demo code therefore yields results similar to, but not exactly the same as, those in the paper.

------------

## Introduction
In this paper, an unsupervised outlier detector combination framework called DCSO (Dynamic Combination of Detector Scores for Outlier Ensembles) is proposed, demonstrated and assessed for the dynamic selection of the most competent base detectors, with an emphasis on data locality. The proposed DCSO framework first defines the local region of a test instance by its k nearest neighbors and then identifies the top-performing base detectors within the local region.
As in classification ensembles, DCSO has two key stages. In the Generation stage, the chosen base detector algorithm is initialized with distinct parameters to build a pool of diversified detectors, and all are then fitted on the entire training dataset. In the Combination stage, DCSO picks the most competent detector in the local region defined by the test instance. Finally, the selected detector is used to predict the outlier score for the test instance.
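
The two stages described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the repository's implementation: the function names `kth_nn_distance` and `dcso_sketch`, the toy distance-based detector, the default `region_size=30`, and the choice of Pearson correlation against an averaged pseudo target are all assumptions made for the example.

```python
import numpy as np


def kth_nn_distance(X_ref, X_query, k):
    """Toy kNN detector (hypothetical helper): distance to the k-th nearest reference point."""
    dists = np.linalg.norm(X_query[:, None, :] - X_ref[None, :, :], axis=2)
    return np.sort(dists, axis=1)[:, k - 1]


def dcso_sketch(X_train, X_test, k_list, region_size=30):
    """Score each test point with the detector that is most competent in its local region."""
    # Generation stage: one detector per k, each using the full training set.
    # (Training points count themselves as a neighbour here, a simplification.)
    train_scores = np.column_stack(
        [kth_nn_distance(X_train, X_train, k) for k in k_list])
    test_scores = np.column_stack(
        [kth_nn_distance(X_train, X_test, k) for k in k_list])

    # Standardize per detector with training statistics so scores are comparable.
    mu = train_scores.mean(axis=0)
    sigma = train_scores.std(axis=0)
    sigma[sigma == 0] = 1.0
    train_norm = (train_scores - mu) / sigma
    test_norm = (test_scores - mu) / sigma

    # No labels are available, so the averaged training score acts as a pseudo target.
    pseudo_target = train_norm.mean(axis=1)

    # Combination stage: the local region is the set of nearest training points.
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    regions = np.argsort(dists, axis=1)[:, :region_size]

    selected = np.empty(X_test.shape[0])
    for i, idx in enumerate(regions):
        # Competence = Pearson correlation with the pseudo target inside the region.
        with np.errstate(invalid="ignore", divide="ignore"):
            corr = np.array([
                np.corrcoef(pseudo_target[idx], train_norm[idx, d])[0, 1]
                for d in range(len(k_list))])
        corr[np.isnan(corr)] = -1.0  # constant local scores carry no information
        selected[i] = test_norm[i, np.argmax(corr)]
    return selected


rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)
X_test = np.vstack([rng.randn(20, 2), [[6.0, 6.0]]])  # last point is far from the data
scores = dcso_sketch(X_train, X_test, k_list=[5, 10, 15, 20])
print(scores.shape, int(np.argmax(scores)))
```

In this toy setting the isolated last test point should receive the largest score. The demo scripts in this repository additionally compare several similarity measures and weighting schemes, which this sketch leaves out.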

![Flowchart](https://github.com/yzhao062/DCSO/blob/master/md_figs/flowchart.png)

## Dependency
The experiment code is written in Python 3.6 and built on a number of Python packages:
- numpy>=1.13
- scipy>=0.19
- scikit_learn>=0.19

Batch installation is possible using the supplied "requirements.txt" with pip or conda.

------------

## Datasets
Ten datasets are used (see the datasets folder):

| Datasets   | # Points (*n*) | Dimension (*d*) | # Outliers | % Outliers |
| ---------- | -------------- | --------------- | ---------- | ---------- |
| Pima       | 768            | 8               | 268        | 34.8958    |
| Vowels     | 1456           | 12              | 50         | 3.4341     |
| Letter     | 1600           | 32              | 100        | 6.2500     |
| Cardio     | 1831           | 21              | 176        | 9.6122     |
| Thyroid    | 3772           | 6               | 93         | 2.4655     |
| Satellite  | 6435           | 36              | 2036       | 31.6394    |
| Pendigits  | 6870           | 16              | 156        | 2.2707     |
| Annthyroid | 7200           | 6               | 534        | 7.4167     |
| Mnist      | 7603           | 100             | 700        | 9.2069     |
| Shuttle    | 49097          | 9               | 3511       | 7.1511     |

All datasets are accessible from http://odds.cs.stonybrook.edu/. To cite the datasets, please refer to:
> Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.

To replicate the demo, download the datasets from http://odds.cs.stonybrook.edu/ and place them in ./datasets/. We do not provide the data download.

------------

## Usage and Sample Output (Demo Version)
Experiments can be reproduced by running **demo_lof.py** and **demo_knn.py** directly. Simply download/clone the entire repository and execute the code with
```bash
python demo_lof.py
```

The difference between **demo_lof.py** and **demo_knn.py** lies only in the choice of base detector.
The former uses LOF as the base detector, while the latter uses *k*NN. We use two evaluation metrics:
1. The area under the receiver operating characteristic curve (**ROC**)
2. Precision at rank m (***P*@*m***)

The results of **demo_lof.py** and **demo_knn.py** are presented below. Tables 1 and 2 show the results when **LOF** is used as the base detector, while Tables 3 and 4 show the results when ***k*NN** is used as the base detector. The highest score is highlighted in **bold**, while the lowest is marked with an **asterisk (*)**.

![LOF_ROC](https://github.com/yzhao062/DCSO/blob/master/md_figs/lof_roc.png)
![LOF_PRC](https://github.com/yzhao062/DCSO/blob/master/md_figs/lof_prc.png)
![KNN_ROC](https://github.com/yzhao062/DCSO/blob/master/md_figs/knn_roc.png)
![KNN_PRC](https://github.com/yzhao062/DCSO/blob/master/md_figs/knn_prc.png)

## Visualizations (based on demo_lof.py)
The figure below visually compares the performance of SG and DCSO methods on **Cardio**, **Thyroid** and **Letter** using t-distributed stochastic neighbor embedding (t-SNE). Normal and outlying points are denoted as **orange dots** and **red squares**, respectively. The normal points that are correctly detected only by SG methods are named SG_N (**green triangle_down**), and those correctly detected only by DCSO are named DCSO_N (**blue cross sign**). Similarly, outliers are denoted as SG_N (**green triangle_up**) and DCSO_N (**blue plus sign**), given they can only be detected by SG or DCSO methods, respectively.

![tsne](https://github.com/yzhao062/DCSO/blob/master/md_figs/tsne.png)

The full visualizations can be found at [t-SNE](https://github.com/yzhao062/DCSO/tree/master/viz "t-SNE"). To replicate the visualization, please use "**vis_tsne.py**". Note that this script is not fully optimized and may be cumbersome to use.
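
For reference, the two evaluation measures named in the Usage section can be computed from scratch as sketched below. The demo scripts themselves rely on scikit-learn's `roc_auc_score` and the repository's `precision_n_score`; the functions `roc_auc` and `precision_at_m` here are hypothetical stand-ins written for illustration, and the default of setting *m* to the number of true outliers is an assumption, since the README does not define *m*.

```python
import numpy as np


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    outlier is scored above a random inlier (ties count as one half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def precision_at_m(y_true, scores, m=None):
    """Fraction of true outliers among the m highest-scored points."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    if m is None:
        m = int(y_true.sum())  # assumption: m = number of true outliers
    top_m = np.argsort(-scores, kind="mergesort")[:m]  # stable sort, highest first
    return float(y_true[top_m].sum()) / m


y_true = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.9, 0.8, 0.3, 0.2]
print(roc_auc(y_true, scores))         # 0.75
print(precision_at_m(y_true, scores))  # 0.5
```

In the toy example, one of the two top-ranked points is a true outlier, so *P*@*m* is 0.5, while six of the eight outlier/inlier pairs are ordered correctly, giving an ROC value of 0.75.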

--------------------------------------------------------------------------------
/datasets/cardio.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/datasets/cardio.mat


--------------------------------------------------------------------------------
/demo_knn.py:
--------------------------------------------------------------------------------
import datetime

import numpy as np
from scipy.stats import rankdata
from scipy.stats import pearsonr
from scipy.spatial.distance import euclidean

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KDTree
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score

from models.combination import aom, moa
from utility.stat_models import wpearsonr
from utility.utility import train_predict_knn
from utility.utility import print_save_result
from utility.utility import argmaxp, loaddata, precision_n_score, standardizer

# create a timestamp for logging purposes
today = datetime.datetime.now()
timestamp = today.strftime("%Y%m%d_%H%M%S")

# set numpy parameters
np.set_printoptions(suppress=True, precision=4)

###############################################################################
# parameter settings
data = 'cardio'
base_detector = 'knn'
n_ite = 20  # number of iterations
test_size = 0.4  # training = 60%, testing = 40%
n_baselines = 30  # the number of baseline algorithms, DO NOT CHANGE
loc_region_size = 100  # for consistency fixed to 100

# k list for kNN algorithms, for constructing a pool of base detectors
k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
          110, 120, 130, 140, 150, 160, 170, 180, 190, 200]

n_clf = len(k_list)  # 20 base detectors

# 
for SG_AOM and SG_MOA, choose the right number of buckets 42 | n_buckets = 5 43 | n_clf_bucket = int(n_clf / n_buckets) 44 | assert (n_clf % n_buckets == 0) # in case wrong number of buckets 45 | 46 | alpha = 0.2 # control the strength of dynamic ensemble selection 47 | 48 | # flag for printing and output saving 49 | verbose = False 50 | 51 | ############################################################################### 52 | 53 | if __name__ == '__main__': 54 | 55 | X_orig, y_orig, outlier_perc = loaddata(data) 56 | 57 | # initialize the matrix for storing scores 58 | roc_mat = np.zeros([n_ite, n_baselines]) # receiver operating curve 59 | ap_mat = np.zeros([n_ite, n_baselines]) # average precision 60 | prc_mat = np.zeros([n_ite, n_baselines]) # precision @ m 61 | 62 | for t in range(n_ite): 63 | print('\nn_ite', t + 1, data) # print status 64 | 65 | # split the data into training and testing 66 | X_train, X_test, y_train, y_test = train_test_split(X_orig, y_orig, 67 | test_size=test_size) 68 | # normalized the data 69 | X_train_norm, X_test_norm = standardizer(X_train, X_test) 70 | 71 | train_scores = np.zeros([X_train.shape[0], n_clf]) 72 | test_scores = np.zeros([X_test.shape[0], n_clf]) 73 | 74 | # initialized the list to store the results 75 | test_target_list = [] 76 | method_list = [] 77 | 78 | # generate a pool of detectors and predict on test instances 79 | train_scores, test_scores = train_predict_knn(k_list, X_train_norm, 80 | X_test_norm, 81 | train_scores, 82 | test_scores) 83 | 84 | ####################################################################### 85 | # generate normalized scores 86 | train_scores_norm, test_scores_norm = standardizer(train_scores, 87 | test_scores) 88 | # generate mean and max outputs 89 | # SG_A and SG_M 90 | target_test_mean = np.mean(test_scores_norm, axis=1) 91 | target_test_max = np.max(test_scores_norm, axis=1) 92 | test_target_list.extend([target_test_mean, target_test_max]) 93 | method_list.extend(['sg_a', 'sg_m']) 94 
| 95 | # generate pseudo target for training -> for calculating weights 96 | target_mean = np.mean(train_scores_norm, axis=1).reshape(-1, 1) 97 | target_max = np.max(train_scores_norm, axis=1).reshape(-1, 1) 98 | 99 | # higher value for more outlyingness 100 | ranks_mean = rankdata(target_mean).reshape(-1, 1) 101 | ranks_max = rankdata(target_max).reshape(-1, 1) 102 | 103 | # generate weighted mean 104 | # weights are distance or pearson in different modes 105 | clf_weights_pear = np.zeros([n_clf, 1]) 106 | for i in range(n_clf): 107 | clf_weights_pear[i] = \ 108 | pearsonr(target_mean, train_scores_norm[:, i].reshape(-1, 1))[ 109 | 0][0] 110 | 111 | # generate weighted mean 112 | target_test_weighted_pear = np.sum( 113 | test_scores_norm * clf_weights_pear.reshape(1, 114 | -1) / clf_weights_pear.sum(), 115 | axis=1) 116 | 117 | test_target_list.append(target_test_weighted_pear) 118 | method_list.append('sg_wa') 119 | 120 | # generate threshold sum 121 | target_test_threshold = np.sum(test_scores_norm.clip(0), axis=1) 122 | test_target_list.append(target_test_threshold) 123 | method_list.append('sg_thresh') 124 | 125 | # generate average of maximum (SG_AOM) and maximum of average (SG_MOA) 126 | target_test_aom = aom(test_scores_norm, n_buckets, n_clf) 127 | target_test_moa = moa(test_scores_norm, n_buckets, n_clf) 128 | test_target_list.extend([target_test_aom, target_test_moa]) 129 | method_list.extend(['aom', 'moa']) 130 | ################################################################## 131 | 132 | # define local region using KD trees 133 | tree = KDTree(X_train_norm) 134 | dist_arr, ind_arr = tree.query(X_test_norm, k=loc_region_size) 135 | 136 | # different similarity measures 137 | # s[euc]_w[rank] -> use euclidean distance for similarity measure 138 | # use outlying rank as the weight 139 | m_list = ['s[euc]_w[dist]', 's[euc]_w[rank]', 's[dist]_w[na]', 140 | 's[pear]_w[dist]', 's[pear]_w[rank]', 's[pear]_w[na]'] 141 | 142 | pred_scores_best = 
np.zeros([X_test.shape[0], len(m_list)]) 143 | pred_scores_ens = np.zeros([X_test.shape[0], len(m_list)]) 144 | 145 | for i in range(X_test.shape[0]): # iterate all test instance 146 | # get the neighbor idx of the current point 147 | ind_k = ind_arr[i, :] 148 | 149 | # get the pseudo target: mean 150 | target_k = target_mean[ind_k,].ravel() 151 | 152 | # get the current scores from all clf 153 | curr_train_k = train_scores_norm[ind_k, :] 154 | 155 | # weights by rank 156 | weights_k_rank = ranks_mean[ind_k] 157 | 158 | # weights by euclidean distance 159 | dist_k = dist_arr[i, :].reshape(-1, 1) 160 | weights_k_dist = dist_k.max() - dist_k 161 | 162 | # initialize containers for correlation 163 | corr_dist_d = np.zeros([n_clf, ]) 164 | corr_dist_r = np.zeros([n_clf, ]) 165 | corr_dist_n = np.zeros([n_clf, ]) 166 | corr_pear_d = np.zeros([n_clf, ]) 167 | corr_pear_r = np.zeros([n_clf, ]) 168 | corr_pear_n = np.zeros([n_clf, ]) 169 | 170 | for d in range(n_clf): 171 | # flip distance so larger values imply larger correlation 172 | corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d], 173 | w=weights_k_dist) * -1 174 | corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d], 175 | w=weights_k_rank) * -1 176 | corr_dist_n[d,] = euclidean(target_k, 177 | curr_train_k[:, d]) * -1 178 | corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d], 179 | w=weights_k_dist) 180 | corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d], 181 | w=weights_k_rank) 182 | corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[ 183 | 0] 184 | 185 | corr_list = [corr_dist_d, corr_dist_r, corr_dist_n, 186 | corr_pear_d, corr_pear_r, corr_pear_n] 187 | 188 | for j in range(len(m_list)): 189 | corr_k = corr_list[j] 190 | 191 | # pick the best one 192 | best_clf_ind = np.nanargmax(corr_k) 193 | pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind] 194 | 195 | # pick the p dynamically 196 | threshold = corr_k.max() - corr_k.std() * alpha 197 | p = (corr_k >= threshold).sum() 198 
| if p == 0: # in case extreme cases [nan and all -1's] 199 | p = 1 200 | pred_scores_ens[i, j] = np.max( 201 | test_scores_norm[i, argmaxp(corr_k, p)]) 202 | 203 | for m in range(len(m_list)): 204 | test_target_list.extend([pred_scores_best[:, m], 205 | pred_scores_ens[:, m]]) 206 | method_list.extend(['DCSO_a_' + m_list[m], 207 | 'DCSO_moa_' + m_list[m]]) 208 | ###################################################################### 209 | 210 | # use max for pseudo ground truth generation 211 | tree = KDTree(X_train_norm) 212 | dist_arr, ind_arr = tree.query(X_test_norm, k=loc_region_size) 213 | 214 | pred_scores_best = np.zeros([X_test.shape[0], len(m_list)]) 215 | pred_scores_ens = np.zeros([X_test.shape[0], len(m_list)]) 216 | 217 | for i in range(X_test.shape[0]): # X_test_norm.shape[0] 218 | # get the neighbor idx of the current point 219 | ind_k = ind_arr[i, :] 220 | 221 | # get the pseudo target: max 222 | target_k = target_max[ind_k,].ravel() 223 | 224 | # get the current scores from all clf 225 | curr_train_k = train_scores_norm[ind_k, :] 226 | 227 | # weights by rank 228 | weights_k_rank = ranks_max[ind_k] 229 | # weights by distance 230 | dist_k = dist_arr[i, :].reshape(-1, 1) 231 | weights_k_dist = dist_k.max() - dist_k 232 | 233 | corr_dist_d = np.zeros([n_clf, ]) 234 | corr_dist_r = np.zeros([n_clf, ]) 235 | corr_dist_n = np.zeros([n_clf, ]) 236 | corr_pear_d = np.zeros([n_clf, ]) 237 | corr_pear_r = np.zeros([n_clf, ]) 238 | corr_pear_n = np.zeros([n_clf, ]) 239 | 240 | for d in range(n_clf): 241 | corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d], 242 | w=weights_k_dist) * -1 243 | corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d], 244 | w=weights_k_rank) * -1 245 | corr_dist_n[d,] = euclidean(target_k, 246 | curr_train_k[:, d]) * -1 247 | corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d], 248 | w=weights_k_dist) 249 | corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d], 250 | w=weights_k_rank) 251 | corr_pear_n[d,] = 
wpearsonr(target_k, curr_train_k[:, d])[ 252 | 0] 253 | 254 | corr_list = [corr_dist_d, corr_dist_r, corr_dist_n, 255 | corr_pear_d, corr_pear_r, corr_pear_n] 256 | 257 | for j in range(len(m_list)): 258 | corr_k = corr_list[j] 259 | 260 | # pick the best one 261 | best_clf_ind = np.nanargmax(corr_k) 262 | pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind] 263 | 264 | # pick s detectors dynamically 265 | threshold = corr_k.max() - corr_k.std() * alpha 266 | p = (corr_k >= threshold).sum() 267 | if p == 0: # in case extreme cases [nan and all -1's] 268 | p = 1 269 | pred_scores_ens[i, j] = np.mean( 270 | test_scores_norm[i, argmaxp(corr_k, p)]) 271 | 272 | for m in range(len(m_list)): 273 | test_target_list.extend([pred_scores_best[:, m], 274 | pred_scores_ens[:, m]]) 275 | method_list.extend(['DCSO_m_' + m_list[m], 276 | 'DCSO_aom_' + m_list[m]]) 277 | 278 | # store performance information and print result 279 | for i in range(n_baselines): 280 | roc_mat[t, i] = roc_auc_score(y_test, test_target_list[i]) 281 | ap_mat[t, i] = average_precision_score(y_test, 282 | test_target_list[i]) 283 | prc_mat[t, i] = precision_n_score(y_test, test_target_list[i]) 284 | print(method_list[i], roc_mat[t, i]) 285 | 286 | # print and save the result 287 | # default location is /results/***.csv 288 | print_save_result(data, base_detector, n_baselines, n_clf, n_ite, roc_mat, 289 | ap_mat, prc_mat, method_list, timestamp, verbose) 290 | -------------------------------------------------------------------------------- /demo_lof.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | 3 | import numpy as np 4 | from scipy.stats import rankdata 5 | from scipy.stats import pearsonr 6 | from scipy.spatial.distance import euclidean 7 | 8 | from sklearn.model_selection import train_test_split 9 | from sklearn.neighbors import KDTree 10 | from sklearn.metrics import roc_auc_score 11 | from sklearn.metrics import average_precision_score 
12 | 13 | from models.combination import aom, moa 14 | from utility.stat_models import wpearsonr 15 | from utility.utility import train_predict_lof 16 | from utility.utility import print_save_result 17 | from utility.utility import argmaxp, loaddata, precision_n_score, standardizer 18 | 19 | # access the timestamp for logging purpose 20 | today = datetime.datetime.now() 21 | timestamp = today.strftime("%Y%m%d_%H%M%S") 22 | 23 | # set numpy parameters 24 | np.set_printoptions(suppress=True, precision=4) 25 | 26 | ############################################################################### 27 | # parameter settings 28 | data = 'cardio' 29 | base_detector = 'lof' 30 | n_ite = 20 # number of iterations 31 | test_size = 0.4 # training = 60%, testing = 40% 32 | n_baselines = 30 # the number of baseline algorithms, DO NOT CHANGE 33 | loc_region_size = 100 # for consistency fixed to 100 34 | 35 | # k list for LOF algorithms, for constructing a pool of base detectors 36 | k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 37 | 110, 120, 130, 140, 150, 160, 170, 180, 190, 200] 38 | 39 | n_clf = len(k_list) # 20 base detectors 40 | 41 | # for SG_AOM and SG_MOA, choose the right number of buckets 42 | n_buckets = 5 43 | n_clf_bucket = int(n_clf / n_buckets) 44 | assert (n_clf % n_buckets == 0) # in case wrong number of buckets 45 | 46 | alpha = 0.2 # control the strength of dynamic ensemble selection 47 | 48 | # flag for printing and output saving 49 | verbose = False 50 | 51 | ############################################################################### 52 | 53 | if __name__ == '__main__': 54 | 55 | X_orig, y_orig, outlier_perc = loaddata(data) 56 | 57 | # initialize the matrix for storing scores 58 | roc_mat = np.zeros([n_ite, n_baselines]) # receiver operating curve 59 | ap_mat = np.zeros([n_ite, n_baselines]) # average precision 60 | prc_mat = np.zeros([n_ite, n_baselines]) # precision @ m 61 | 62 | for t in range(n_ite): 63 | print('\nn_ite', t + 1, data) # print 
status 64 | 65 | # split the data into training and testing 66 | X_train, X_test, y_train, y_test = train_test_split(X_orig, y_orig, 67 | test_size=test_size) 68 | # normalized the data 69 | X_train_norm, X_test_norm = standardizer(X_train, X_test) 70 | 71 | train_scores = np.zeros([X_train.shape[0], n_clf]) 72 | test_scores = np.zeros([X_test.shape[0], n_clf]) 73 | 74 | # initialized the list to store the results 75 | test_target_list = [] 76 | method_list = [] 77 | 78 | # generate a pool of detectors and predict on test instances 79 | train_scores, test_scores = train_predict_lof(k_list, X_train_norm, 80 | X_test_norm, 81 | train_scores, 82 | test_scores) 83 | 84 | ####################################################################### 85 | # generate normalized scores 86 | train_scores_norm, test_scores_norm = standardizer(train_scores, 87 | test_scores) 88 | # generate mean and max outputs 89 | # SG_A and SG_M 90 | target_test_mean = np.mean(test_scores_norm, axis=1) 91 | target_test_max = np.max(test_scores_norm, axis=1) 92 | test_target_list.extend([target_test_mean, target_test_max]) 93 | method_list.extend(['sg_a', 'sg_m']) 94 | 95 | # generate pseudo target for training -> for calculating weights 96 | target_mean = np.mean(train_scores_norm, axis=1).reshape(-1, 1) 97 | target_max = np.max(train_scores_norm, axis=1).reshape(-1, 1) 98 | 99 | # higher value for more outlyingness 100 | ranks_mean = rankdata(target_mean).reshape(-1, 1) 101 | ranks_max = rankdata(target_max).reshape(-1, 1) 102 | 103 | # generate weighted mean 104 | # weights are distance or pearson in different modes 105 | clf_weights_pear = np.zeros([n_clf, 1]) 106 | for i in range(n_clf): 107 | clf_weights_pear[i] = \ 108 | pearsonr(target_mean, train_scores_norm[:, i].reshape(-1, 1))[ 109 | 0][0] 110 | 111 | # generate weighted mean 112 | target_test_weighted_pear = np.sum( 113 | test_scores_norm * clf_weights_pear.reshape(1, 114 | -1) / clf_weights_pear.sum(), 115 | axis=1) 116 | 117 | 
test_target_list.append(target_test_weighted_pear) 118 | method_list.append('sg_wa') 119 | 120 | # generate threshold sum 121 | target_test_threshold = np.sum(test_scores_norm.clip(0), axis=1) 122 | test_target_list.append(target_test_threshold) 123 | method_list.append('sg_thresh') 124 | 125 | # generate average of maximum (SG_AOM) and maximum of average (SG_MOA) 126 | target_test_aom = aom(test_scores_norm, n_buckets, n_clf) 127 | target_test_moa = moa(test_scores_norm, n_buckets, n_clf) 128 | test_target_list.extend([target_test_aom, target_test_moa]) 129 | method_list.extend(['aom', 'moa']) 130 | ################################################################## 131 | 132 | # define local region using KD trees 133 | tree = KDTree(X_train_norm) 134 | dist_arr, ind_arr = tree.query(X_test_norm, k=loc_region_size) 135 | 136 | # different similarity measures 137 | # s[euc]_w[rank] -> use euclidean distance for similarity measure 138 | # use outlying rank as the weight 139 | m_list = ['s[euc]_w[dist]', 's[euc]_w[rank]', 's[dist]_w[na]', 140 | 's[pear]_w[dist]', 's[pear]_w[rank]', 's[pear]_w[na]'] 141 | 142 | pred_scores_best = np.zeros([X_test.shape[0], len(m_list)]) 143 | pred_scores_ens = np.zeros([X_test.shape[0], len(m_list)]) 144 | 145 | for i in range(X_test.shape[0]): # iterate all test instance 146 | # get the neighbor idx of the current point 147 | ind_k = ind_arr[i, :] 148 | 149 | # get the pseudo target: mean 150 | target_k = target_mean[ind_k,].ravel() 151 | 152 | # get the current scores from all clf 153 | curr_train_k = train_scores_norm[ind_k, :] 154 | 155 | # weights by rank 156 | weights_k_rank = ranks_mean[ind_k] 157 | 158 | # weights by euclidean distance 159 | dist_k = dist_arr[i, :].reshape(-1, 1) 160 | weights_k_dist = dist_k.max() - dist_k 161 | 162 | # initialize containers for correlation 163 | corr_dist_d = np.zeros([n_clf, ]) 164 | corr_dist_r = np.zeros([n_clf, ]) 165 | corr_dist_n = np.zeros([n_clf, ]) 166 | corr_pear_d = 
np.zeros([n_clf, ]) 167 | corr_pear_r = np.zeros([n_clf, ]) 168 | corr_pear_n = np.zeros([n_clf, ]) 169 | 170 | for d in range(n_clf): 171 | # flip distance so larger values imply larger correlation 172 | corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d], 173 | w=weights_k_dist) * -1 174 | corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d], 175 | w=weights_k_rank) * -1 176 | corr_dist_n[d,] = euclidean(target_k, 177 | curr_train_k[:, d]) * -1 178 | corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d], 179 | w=weights_k_dist) 180 | corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d], 181 | w=weights_k_rank) 182 | corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[ 183 | 0] 184 | 185 | corr_list = [corr_dist_d, corr_dist_r, corr_dist_n, 186 | corr_pear_d, corr_pear_r, corr_pear_n] 187 | 188 | for j in range(len(m_list)): 189 | corr_k = corr_list[j] 190 | 191 | # pick the best one 192 | best_clf_ind = np.nanargmax(corr_k) 193 | pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind] 194 | 195 | # pick the p dynamically 196 | threshold = corr_k.max() - corr_k.std() * alpha 197 | p = (corr_k >= threshold).sum() 198 | if p == 0: # in case extreme cases [nan and all -1's] 199 | p = 1 200 | pred_scores_ens[i, j] = np.max( 201 | test_scores_norm[i, argmaxp(corr_k, p)]) 202 | 203 | for m in range(len(m_list)): 204 | test_target_list.extend([pred_scores_best[:, m], 205 | pred_scores_ens[:, m]]) 206 | method_list.extend(['DCSO_a_' + m_list[m], 207 | 'DCSO_moa_' + m_list[m]]) 208 | ###################################################################### 209 | 210 | # use max for pseudo ground truth generation 211 | tree = KDTree(X_train_norm) 212 | dist_arr, ind_arr = tree.query(X_test_norm, k=loc_region_size) 213 | 214 | pred_scores_best = np.zeros([X_test.shape[0], len(m_list)]) 215 | pred_scores_ens = np.zeros([X_test.shape[0], len(m_list)]) 216 | 217 | for i in range(X_test.shape[0]): # X_test_norm.shape[0] 218 | # get the neighbor idx of the 
current point 219 | ind_k = ind_arr[i, :] 220 | 221 | # get the pseudo target: max 222 | target_k = target_max[ind_k,].ravel() 223 | 224 | # get the current scores from all clf 225 | curr_train_k = train_scores_norm[ind_k, :] 226 | 227 | # weights by rank 228 | weights_k_rank = ranks_max[ind_k] 229 | # weights by distance 230 | dist_k = dist_arr[i, :].reshape(-1, 1) 231 | weights_k_dist = dist_k.max() - dist_k 232 | 233 | corr_dist_d = np.zeros([n_clf, ]) 234 | corr_dist_r = np.zeros([n_clf, ]) 235 | corr_dist_n = np.zeros([n_clf, ]) 236 | corr_pear_d = np.zeros([n_clf, ]) 237 | corr_pear_r = np.zeros([n_clf, ]) 238 | corr_pear_n = np.zeros([n_clf, ]) 239 | 240 | for d in range(n_clf): 241 | corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d], 242 | w=weights_k_dist) * -1 243 | corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d], 244 | w=weights_k_rank) * -1 245 | corr_dist_n[d,] = euclidean(target_k, 246 | curr_train_k[:, d]) * -1 247 | corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d], 248 | w=weights_k_dist) 249 | corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d], 250 | w=weights_k_rank) 251 | corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[ 252 | 0] 253 | 254 | corr_list = [corr_dist_d, corr_dist_r, corr_dist_n, 255 | corr_pear_d, corr_pear_r, corr_pear_n] 256 | 257 | for j in range(len(m_list)): 258 | corr_k = corr_list[j] 259 | 260 | # pick the best one 261 | best_clf_ind = np.nanargmax(corr_k) 262 | pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind] 263 | 264 | # pick s detectors dynamically 265 | threshold = corr_k.max() - corr_k.std() * alpha 266 | p = (corr_k >= threshold).sum() 267 | if p == 0: # in case extreme cases [nan and all -1's] 268 | p = 1 269 | pred_scores_ens[i, j] = np.mean( 270 | test_scores_norm[i, argmaxp(corr_k, p)]) 271 | 272 | for m in range(len(m_list)): 273 | test_target_list.extend([pred_scores_best[:, m], 274 | pred_scores_ens[:, m]]) 275 | method_list.extend(['DCSO_m_' + m_list[m], 276 | 
'DCSO_aom_' + m_list[m]]) 277 | 278 | # store performance information and print result 279 | for i in range(n_baselines): 280 | roc_mat[t, i] = roc_auc_score(y_test, test_target_list[i]) 281 | ap_mat[t, i] = average_precision_score(y_test, 282 | test_target_list[i]) 283 | prc_mat[t, i] = precision_n_score(y_test, test_target_list[i]) 284 | print(method_list[i], roc_mat[t, i]) 285 | 286 | # print and save the result 287 | # default location is /results/***.csv 288 | print_save_result(data, base_detector, n_baselines, n_clf, n_ite, roc_mat, 289 | ap_mat, prc_mat, method_list, timestamp, verbose) 290 | -------------------------------------------------------------------------------- /md_figs/flowchart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/flowchart.png -------------------------------------------------------------------------------- /md_figs/knn_prc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/knn_prc.png -------------------------------------------------------------------------------- /md_figs/knn_roc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/knn_roc.png -------------------------------------------------------------------------------- /md_figs/lof_prc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/lof_prc.png -------------------------------------------------------------------------------- /md_figs/lof_roc.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/lof_roc.png -------------------------------------------------------------------------------- /md_figs/tsne.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/tsne.png -------------------------------------------------------------------------------- /models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/models/__init__.py -------------------------------------------------------------------------------- /models/combination.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | def aom(scores, n_buckets, n_estimators, standard=True): 4 | ''' 5 | Average of Maximum - An ensemble method for outlier detection 6 | 7 | Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms 8 | for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47. 
9 | 10 | :param scores: score matrix of shape (n_samples, n_estimators) 11 | :param n_buckets: number of buckets the estimators are split into 12 | :param n_estimators: number of estimators (columns of scores) 13 | :param standard: unused in this implementation 14 | :return: combined outlier scores of shape (n_samples,) 15 | ''' 16 | scores = np.asarray(scores) 17 | if scores.shape[1] != n_estimators: 18 | raise ValueError('score matrix should be n_samples by n_estimators') 19 | 20 | scores_aom = np.zeros([scores.shape[0], n_buckets]) 21 | 22 | n_estimators_per_bucket = int(n_estimators / n_buckets) 23 | if n_estimators % n_buckets != 0: 24 | Warning('n_estimators / n_buckets leads to a remainder') 25 | 26 | # shuffle the estimator order 27 | estimators_list = list(range(0, n_estimators, 1)) 28 | np.random.shuffle(estimators_list) 29 | 30 | head = 0 31 | for i in range(0, n_estimators, n_estimators_per_bucket): 32 | tail = i + n_estimators_per_bucket 33 | batch_ind = int(i / n_estimators_per_bucket) 34 | 35 | scores_aom[:, batch_ind] = np.max( 36 | scores[:, estimators_list[head:tail]], axis=1) 37 | 38 | head = head + n_estimators_per_bucket 39 | tail = tail + n_estimators_per_bucket 40 | 41 | return np.mean(scores_aom, axis=1) 42 | 43 | 44 | def moa(scores, n_buckets, n_estimators): 45 | ''' 46 | Maximum of Average - An ensemble method for outlier detection 47 | 48 | Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms 49 | for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47. 
50 | 51 | :param scores: score matrix of shape (n_samples, n_estimators) 52 | :param n_buckets: number of buckets the estimators are split into 53 | :param n_estimators: number of estimators (columns of scores) 54 | 55 | :return: combined outlier scores of shape (n_samples,) 56 | ''' 57 | scores = np.asarray(scores) 58 | if scores.shape[1] != n_estimators: 59 | raise ValueError('score matrix should be n_samples by n_estimators') 60 | 61 | scores_moa = np.zeros([scores.shape[0], n_buckets]) 62 | 63 | n_estimators_per_bucket = int(n_estimators / n_buckets) 64 | if n_estimators % n_buckets != 0: 65 | Warning('n_estimators / n_buckets leads to a remainder') 66 | 67 | # shuffle the estimator order 68 | estimators_list = list(range(0, n_estimators, 1)) 69 | np.random.shuffle(estimators_list) 70 | 71 | head = 0 72 | for i in range(0, n_estimators, n_estimators_per_bucket): 73 | tail = i + n_estimators_per_bucket 74 | batch_ind = int(i / n_estimators_per_bucket) 75 | 76 | scores_moa[:, batch_ind] = np.mean( 77 | scores[:, estimators_list[head:tail]], axis=1) 78 | 79 | head = head + n_estimators_per_bucket 80 | tail = tail + n_estimators_per_bucket 81 | 82 | return np.max(scores_moa, axis=1) 83 | -------------------------------------------------------------------------------- /models/knn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.preprocessing import MinMaxScaler 3 | from sklearn.neighbors import NearestNeighbors 4 | from sklearn.neighbors import KDTree 5 | from sklearn.exceptions import NotFittedError 6 | from scipy.stats import scoreatpercentile 7 | from scipy.stats import rankdata 8 | from scipy.special import erf 9 | 10 | 11 | class Knn(object): 12 | ''' 13 | Knn class for outlier detection 14 | support original knn, average knn, and median knn 15 | ''' 16 | 17 | def __init__(self, n_neighbors=1, contamination=0.05, method='largest'): 18 | self.n_neighbors = n_neighbors 19 | self.contamination = contamination 20 | self.method = method 21 | 22 | def fit(self, X_train): 23 | self.X_train = X_train 24 | self._isfitted = True 25 | self.tree = 
KDTree(X_train) 26 | 27 | neigh = NearestNeighbors(n_neighbors=self.n_neighbors) 28 | neigh.fit(self.X_train) 29 | 30 | result = neigh.kneighbors(n_neighbors=self.n_neighbors, 31 | return_distance=True) 32 | dist_arr = result[0] 33 | 34 | if self.method == 'largest': 35 | dist = dist_arr[:, -1] 36 | elif self.method == 'mean': 37 | dist = np.mean(dist_arr, axis=1) 38 | elif self.method == 'median': 39 | dist = np.median(dist_arr, axis=1) 40 | 41 | self.threshold = scoreatpercentile(dist, 42 | 100 * (1 - self.contamination)) 43 | self.decision_scores = dist.ravel() 44 | self.y_pred = (self.decision_scores > self.threshold).astype('int') 45 | 46 | self.mu = np.mean(self.decision_scores) 47 | self.sigma = np.std(self.decision_scores) 48 | 49 | def decision_function(self, X_test): 50 | 51 | if not getattr(self, '_isfitted', False): 52 | raise NotFittedError('Knn is not fitted yet') 53 | 54 | # initialize the output score 55 | pred_score = np.zeros([X_test.shape[0], 1]) 56 | 57 | for i in range(X_test.shape[0]): 58 | x_i = X_test[i, :] 59 | x_i = np.asarray(x_i).reshape(1, x_i.shape[0]) 60 | 61 | # get the distance of the current point 62 | dist_arr, ind_arr = self.tree.query(x_i, k=self.n_neighbors) 63 | 64 | if self.method == 'largest': 65 | dist = dist_arr[:, -1] 66 | elif self.method == 'mean': 67 | dist = np.mean(dist_arr, axis=1) 68 | elif self.method == 'median': 69 | dist = np.median(dist_arr, axis=1) 70 | 71 | pred_score_i = dist[-1] 72 | 73 | # record the current item 74 | pred_score[i, :] = pred_score_i 75 | 76 | return pred_score 77 | 78 | def predict(self, X_test): 79 | pred_score = self.decision_function(X_test) 80 | return (pred_score > self.threshold).astype('int') 81 | 82 | def predict_proba(self, X_test, method='linear'): 83 | test_scores = self.decision_function(X_test) 84 | train_scores = self.decision_scores 85 | 86 | if method == 'linear': 87 | scaler = MinMaxScaler().fit(train_scores.reshape(-1, 1)) 88 | proba = scaler.transform(test_scores.reshape(-1, 1)) 89 | return 
proba.clip(0, 1) 90 | else: 91 | # turn output into probability 92 | pre_erf_score = (test_scores - self.mu) / (self.sigma * np.sqrt(2)) 93 | erf_score = erf(pre_erf_score) 94 | proba = erf_score.clip(0) 95 | 96 | # TODO: move to testing code 97 | assert (proba.min() >= 0) 98 | assert (proba.max() <= 1) 99 | return proba 100 | 101 | def predict_rank(self, X_test): 102 | test_scores = self.decision_function(X_test) 103 | train_scores = self.decision_scores 104 | 105 | ranks = np.zeros([X_test.shape[0], 1]) 106 | 107 | for i in range(test_scores.shape[0]): 108 | train_scores_i = np.append(train_scores.reshape(-1, 1), 109 | test_scores[i]) 110 | 111 | ranks[i] = rankdata(train_scores_i)[-1] 112 | 113 | # return normalized ranks 114 | ranks_norm = ranks / ranks.max() 115 | return ranks_norm 116 | 117 | ############################################################################## 118 | # samples = [[-1, 0], [0., 0.], [1., 1], [2., 5.], [3, 1]] 119 | # 120 | # clf = Knn() 121 | # clf.fit(samples) 122 | # 123 | # scores = clf.decision_function(np.asarray([[2, 3], [6, 8]])).ravel() 124 | # assert (scores[0] == [2]) 125 | # assert (scores[1] == [5]) 126 | # # 127 | # labels = clf.predict(np.asarray([[2, 3], [6, 8]])).ravel() 128 | # assert (labels[0] == [0]) 129 | # assert (labels[1] == [1]) 130 | -------------------------------------------------------------------------------- /models/lof.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.neighbors import LocalOutlierFactor 3 | from sklearn.preprocessing import MinMaxScaler 4 | from scipy.stats import rankdata 5 | from scipy.special import erf 6 | 7 | class Lof(LocalOutlierFactor): 8 | def fit(self, X_train, y=None): 9 | self.X_train = X_train 10 | super().fit(X=X_train, y=y) 11 | return self 12 | 13 | def predict(self, X_test): 14 | return self._predict(X=X_test) 15 | 16 | def decision_function(self, X_test): 17 | return 
self._decision_function(X_test) 18 | 19 | def predict_proba(self, X_test, method='linear'): 20 | train_scores = self.negative_outlier_factor_ * -1 21 | test_scores = self.decision_function(X_test) * -1 22 | if method == 'linear': 23 | scaler = MinMaxScaler().fit(train_scores.reshape(-1, 1)) 24 | proba = scaler.transform(test_scores.reshape(-1, 1)) 25 | return proba.clip(0, 1) 26 | else: 27 | mu = np.mean(train_scores) 28 | sigma = np.std(train_scores) 29 | 30 | # turn output into probability 31 | pre_erf_score = (test_scores - mu) / (sigma * np.sqrt(2)) 32 | erf_score = erf(pre_erf_score) 33 | proba = erf_score.clip(0) 34 | 35 | # TODO: move to testing code 36 | assert (proba.min() >= 0) 37 | assert (proba.max() <= 1) 38 | 39 | return proba 40 | 41 | def predict_rank(self, X_test): 42 | 43 | train_scores = self.decision_function(self.X_train) * -1 44 | test_scores = self.decision_function(X_test) * -1 45 | ranks = np.zeros([X_test.shape[0], 1]) 46 | 47 | for i in range(test_scores.shape[0]): 48 | train_scores_i = np.append(train_scores.reshape(-1, 1), 49 | test_scores[i]) 50 | ranks[i] = rankdata(train_scores_i)[-1] 51 | 52 | # return normalized ranks 53 | ranks_norm = ranks / ranks.max() 54 | return ranks_norm 55 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.13 2 | scipy>=0.19 3 | scikit_learn>=0.19 4 | -------------------------------------------------------------------------------- /utility/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/utility/__init__.py -------------------------------------------------------------------------------- /utility/stat_models.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from 
scipy.special import betainc 3 | from scipy.stats import pearsonr 4 | from scipy.stats import scoreatpercentile 5 | from sklearn.metrics import precision_score 6 | from sklearn.preprocessing import StandardScaler 7 | 8 | 9 | def wpearsonr(x, y, w=None): 10 | # https://stats.stackexchange.com/questions/221246/such-thing-as-a-weighted-correlation 11 | 12 | # unweighted version 13 | if w is None: 14 | return pearsonr(x, y) 15 | 16 | x = np.asarray(x) 17 | y = np.asarray(y) 18 | w = np.asarray(w) 19 | 20 | n = len(x) 21 | 22 | w_sum = w.sum() 23 | mx = np.sum(x * w) / w_sum 24 | my = np.sum(y * w) / w_sum 25 | 26 | xm, ym = (x - mx), (y - my) 27 | 28 | r_num = np.sum(xm * ym * w) / w_sum 29 | 30 | xm2 = np.sum(xm * xm * w) / w_sum 31 | ym2 = np.sum(ym * ym * w) / w_sum 32 | 33 | r_den = np.sqrt(xm2 * ym2) 34 | r = r_num / r_den 35 | 36 | r = max(min(r, 1.0), -1.0) 37 | # df = n - 2 38 | # 39 | # if abs(r) == 1.0: 40 | # prob = 0.0 41 | # else: 42 | # t_squared = r ** 2 * (df / ((1.0 - r) * (1.0 + r))) 43 | # prob = _betai(0.5 * df, 0.5, df / (df + t_squared)) 44 | return r # , prob 45 | 46 | 47 | ##################################### 48 | # PROBABILITY CALCULATIONS # 49 | ##################################### 50 | 51 | 52 | def _betai(a, b, x): 53 | x = np.asarray(x) 54 | x = np.where(x < 1.0, x, 1.0) # if x > 1 then return 1.0 55 | return betainc(a, b, x) 56 | 57 | 58 | def pearsonr_mat(mat, w=None): 59 | n_row = mat.shape[0] 60 | n_col = mat.shape[1] 61 | pear_mat = np.full([n_row, n_row], 1).astype(float) 62 | 63 | if w is not None: 64 | for cx in range(n_row): 65 | for cy in range(cx + 1, n_row): 66 | curr_pear = wpearsonr(mat[cx, :], mat[cy, :], w) 67 | pear_mat[cx, cy] = curr_pear 68 | pear_mat[cy, cx] = curr_pear 69 | else: 70 | for cx in range(n_row): 71 | for cy in range(cx + 1, n_row): 72 | curr_pear = pearsonr(mat[cx, :], mat[cy, :])[0] 73 | pear_mat[cx, cy] = curr_pear 74 | pear_mat[cy, cx] = curr_pear 75 | 76 | return pear_mat 77 | 
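The weighted correlation in `wpearsonr` above can be sketched as a standalone function. This is a minimal illustration, not the repository's code: the name `weighted_pearson` and the sample arrays are hypothetical, and all inputs are assumed to be 1-D with non-negative weights that do not sum to zero.

```python
import numpy as np


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation (hypothetical helper, for illustration)."""
    # Flatten everything to 1-D so that a column-vector weight cannot
    # broadcast against a 1-D score vector into a 2-D array.
    x, y, w = (np.asarray(v, dtype=float).ravel() for v in (x, y, w))
    w_sum = w.sum()

    # weighted means
    mx = np.sum(w * x) / w_sum
    my = np.sum(w * y) / w_sum
    xm, ym = x - mx, y - my

    # weighted covariance and variances
    cov = np.sum(w * xm * ym) / w_sum
    var_x = np.sum(w * xm * xm) / w_sum
    var_y = np.sum(w * ym * ym) / w_sum

    # clip to [-1, 1] to absorb floating-point error
    return float(np.clip(cov / np.sqrt(var_x * var_y), -1.0, 1.0))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.5]

# with uniform weights the result matches the ordinary Pearson coefficient
uniform = weighted_pearson(x, y, np.ones(4))
reference = np.corrcoef(x, y)[0, 1]

# a perfectly anti-correlated pair stays at -1 under any positive weights
anti = weighted_pearson(x, [-v for v in x], [1, 2, 3, 4])
```

Two details of `wpearsonr` are worth keeping in mind when calling it: with `w=None` it returns the full result of `scipy.stats.pearsonr`, so callers index `[0]`, whereas the weighted branch returns a bare float; and a weight array of shape `(k, 1)` multiplied with a score vector of shape `(k,)` broadcasts to `(k, k)` in NumPy, which is why the sketch flattens its inputs first.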
-------------------------------------------------------------------------------- /utility/utility.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pathlib 3 | 4 | import numpy as np 5 | import scipy.io as scio 6 | from scipy.stats import scoreatpercentile 7 | from sklearn.metrics import precision_score 8 | from sklearn.preprocessing import StandardScaler 9 | 10 | from models.lof import Lof 11 | from models.knn import Knn 12 | 13 | 14 | def get_label_n(y, y_pred): 15 | out_perc = np.count_nonzero(y) / len(y) 16 | threshold = scoreatpercentile(y_pred, 100 * (1 - out_perc)) 17 | y_pred = (y_pred > threshold).astype('int') 18 | return y_pred 19 | 20 | 21 | def standardizer(X_train, X_test): 22 | ''' 23 | normalization function wrapper 24 | :param X_train: training samples, used to fit the scaler 25 | :param X_test: test samples, transformed with the fitted scaler 26 | :return: X_train and X_test after the Z-score normalization 27 | ''' 28 | scaler = StandardScaler().fit(X_train) 29 | return scaler.transform(X_train), scaler.transform(X_test) 30 | 31 | 32 | def precision_n_score(y, y_pred): 33 | ''' 34 | Utility function to calculate precision@n 35 | :param y: ground truth 36 | :param y_pred: predicted outlier scores 37 | :return: score 38 | ''' 39 | # calculate the percentage of outliers 40 | out_perc = np.count_nonzero(y) / len(y) 41 | 42 | threshold = scoreatpercentile(y_pred, 100 * (1 - out_perc)) 43 | y_pred = (y_pred > threshold).astype('int') 44 | return precision_score(y, y_pred) 45 | 46 | 47 | def argmaxp(a, p): 48 | ''' 49 | Utility function to return the index of top p values in a 50 | :param a: list variable 51 | :param p: number of elements to select 52 | :return: index of top p elements in a 53 | ''' 54 | 55 | a = np.asarray(a).ravel() 56 | length = a.shape[0] 57 | pth = np.argpartition(a, length - p) 58 | return pth[length - p:] 59 | 60 | 61 | def loaddata(filename): 62 | ''' 63 | load data 64 | :param filename: dataset name without the '.mat' extension 65 | :return: X, y and the fraction of outliers in y 66 | ''' 67 | mat = scio.loadmat(os.path.join('datasets', filename 
+ '.mat')) 68 | X_orig = mat['X'] 69 | y_orig = mat['y'].ravel() 70 | outlier_perc = np.count_nonzero(y_orig) / len(y_orig) 71 | 72 | return X_orig, y_orig, outlier_perc 73 | 74 | 75 | def train_predict_lof(k_list, X_train_norm, X_test_norm, train_scores, 76 | test_scores): 77 | # initialize base detectors 78 | clf_list = [] 79 | for k in k_list: 80 | clf = Lof(n_neighbors=k) 81 | clf.fit(X_train_norm) 82 | train_score = clf.negative_outlier_factor_ * -1 83 | test_score = clf.decision_function(X_test_norm) * -1 84 | clf_name = 'lof_' + str(k) 85 | 86 | clf_list.append(clf_name) 87 | curr_ind = len(clf_list) - 1 88 | 89 | train_scores[:, curr_ind] = train_score.ravel() 90 | test_scores[:, curr_ind] = test_score.ravel() 91 | 92 | return train_scores, test_scores 93 | 94 | 95 | def train_predict_knn(k_list, X_train_norm, X_test_norm, train_scores, 96 | test_scores): 97 | # initialize base detectors 98 | clf_list = [] 99 | for k in k_list: 100 | clf = Knn(n_neighbors=k, method='largest') 101 | clf.fit(X_train_norm) 102 | train_score = clf.decision_scores 103 | test_score = clf.decision_function(X_test_norm) 104 | clf_name = 'knn_' + str(k) 105 | 106 | clf_list.append(clf_name) 107 | curr_ind = len(clf_list) - 1 108 | 109 | train_scores[:, curr_ind] = train_score.ravel() 110 | test_scores[:, curr_ind] = test_score.ravel() 111 | 112 | return train_scores, test_scores 113 | 114 | 115 | def print_save_result(data, base_detector, n_baselines, n_clf, n_ite, roc_mat, 116 | ap_mat, prc_mat, method_list, timestamp, verbose): 117 | ''' 118 | :param data: 119 | :param base_detector: 120 | :param n_baselines: 121 | :param n_clf: 122 | :param n_ite: 123 | :param roc_mat: 124 | :param ap_mat: 125 | :param prc_mat: 126 | :param method_list: 127 | :param timestamp: 128 | :param verbose: 129 | :return: None 130 | ''' 131 | 132 | roc_scores = np.round(np.mean(roc_mat, axis=0), decimals=4) 133 | ap_scores = np.round(np.mean(ap_mat, axis=0), decimals=4) 134 | prc_scores = 
np.round(np.mean(prc_mat, axis=0), decimals=4) 135 | 136 | method_np = np.asarray(method_list) 137 | 138 | top_roc_ind = argmaxp(roc_scores, 1) 139 | top_ap_ind = argmaxp(ap_scores, 1) 140 | top_prc_ind = argmaxp(prc_scores, 1) 141 | 142 | top_roc_clf = method_np[top_roc_ind].tolist()[0] 143 | top_ap_clf = method_np[top_ap_ind].tolist()[0] 144 | top_prc_clf = method_np[top_prc_ind].tolist()[0] 145 | 146 | top_roc = np.round(roc_scores[top_roc_ind][0], decimals=4) 147 | top_ap = np.round(ap_scores[top_ap_ind][0], decimals=4) 148 | top_prc = np.round(prc_scores[top_prc_ind][0], decimals=4) 149 | 150 | roc_diff = np.round(100 * (top_roc - roc_scores) / roc_scores, decimals=4) 151 | ap_diff = np.round(100 * (top_ap - ap_scores) / ap_scores, decimals=4) 152 | prc_diff = np.round(100 * (top_prc - prc_scores) / prc_scores, decimals=4) 153 | 154 | # initialize the log directory if it does not exist 155 | pathlib.Path('results').mkdir(parents=True, exist_ok=True) 156 | 157 | # open the result file in append mode (created if it does not exist) 158 | f = open( 159 | os.path.join('results', data + '_' + base_detector + '_' + timestamp + '.csv'), 160 | 'a') 161 | if verbose: 162 | f.writelines('method, ' 163 | 'roc, best_roc, diff_roc,' 164 | 'ap, best_ap, diff_ap,' 165 | 'p@m, best_p@m, diff_p@m,' 166 | 'best roc, best ap, best prc, n_ite, n_clf') 167 | else: 168 | f.writelines('method, ' 169 | 'roc, ap, p@m,' 170 | 'best roc, best ap, best prc, ' 171 | 'n_ite, n_clf') 172 | 173 | print('method, roc, ap, p@m, best roc, best ap, best prc') 174 | delim = ',' 175 | for i in range(n_baselines): 176 | print(method_list[i], roc_scores[i], ap_scores[i], prc_scores[i], 177 | top_roc_clf, top_ap_clf, top_prc_clf) 178 | 179 | if verbose: 180 | f.writelines( 181 | '\n' + str(method_list[i]) + delim + 182 | str(roc_scores[i]) + delim + str(top_roc) + delim + str( 183 | roc_diff[i]) + delim + 184 | str(ap_scores[i]) + delim + str(top_ap) + delim + str( 185 | ap_diff[i]) + delim + 186 | str(prc_scores[i]) + delim + str(top_prc) + 
delim + str( 187 | prc_diff[i]) + delim + 188 | top_roc_clf + delim + 189 | top_ap_clf + delim + 190 | top_prc_clf + delim + 191 | str(n_ite) + delim + 192 | str(n_clf)) 193 | else: 194 | f.writelines( 195 | '\n' + str(method_list[i]) + delim + 196 | str(roc_scores[i]) + delim + 197 | str(ap_scores[i]) + delim + 198 | str(prc_scores[i]) + delim + 199 | top_roc_clf + delim + 200 | top_ap_clf + delim + 201 | top_prc_clf + delim + 202 | str(n_ite) + delim + 203 | str(n_clf)) 204 | 205 | f.close() 206 | -------------------------------------------------------------------------------- /vis_tsne.py: -------------------------------------------------------------------------------- 1 | import math 2 | import pathlib 3 | import numpy as np 4 | from scipy.stats import rankdata 5 | from scipy.stats import pearsonr 6 | from scipy.spatial.distance import euclidean 7 | 8 | import matplotlib.pyplot as plt 9 | 10 | from sklearn.model_selection import train_test_split 11 | from sklearn.neighbors import KDTree 12 | 13 | from sklearn.manifold import TSNE 14 | 15 | from models.combination import aom, moa 16 | from models.lof import Lof 17 | from utility.stat_models import wpearsonr 18 | from utility.utility import argmaxp, loaddata, standardizer, get_label_n 19 | 20 | # set numpy parameters 21 | np.set_printoptions(suppress=True, precision=4) 22 | # generates the visualization for all datasets 23 | data_list = ["Annthyroid", 24 | "Pendigits", 25 | "Satellite", 26 | "Pima", 27 | "Letter", 28 | "Thyroid", 29 | "Vowels", 30 | "Cardio", 31 | "Mnist"] 32 | DCSO_best_list = [186, 38, 71, 103, 233, 157, 128, 127, 97] 33 | 34 | for data, DCSO_best in zip(data_list, DCSO_best_list): 35 | 36 | print('processing', data) 37 | X_test_list = [] 38 | X_test_name_list = [] 39 | DCSO_best_list = [] 40 | test_target_list_list = [] 41 | y_test_list = [] 42 | trans_data_list = [] 43 | 44 | X_orig, y_orig, outlier_perc = loaddata(data) 45 | 46 | ite = 1 # number of iterations 47 | test_size = 0.4 # training 
= 60%, testing = 40% 48 | final_k_list = [10, 30, 60, 100] 49 | n_methods = 253 50 | 51 | k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 52 | 110, 120, 130, 140, 150, 160, 170, 180, 190, 200] 53 | 54 | n_clf = len(k_list) 55 | fixed_range = [5, 10, 15] 56 | 57 | # for AOM and MOA, choose the right number of buckets 58 | n_buckets = 5 59 | n_clf_bucket = int(n_clf / n_buckets) 60 | assert (n_clf % n_buckets == 0) # in case wrong number of buckets 61 | 62 | # split the data into training and testing 63 | # fixed the visualization by random state == 42 64 | X_train, X_test, y_train, y_test = train_test_split(X_orig, y_orig, 65 | test_size=test_size, 66 | random_state=42) 67 | 68 | # generate the normalized data 69 | X_train_norm, X_test_norm = standardizer(X_train, X_test) 70 | 71 | train_scores = np.zeros([X_train.shape[0], n_clf]) 72 | test_scores = np.zeros([X_test.shape[0], n_clf]) 73 | 74 | # initialized the list to store the results 75 | test_target_list = [] 76 | method_list = [] 77 | k_rec_list = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # zeros for non dcs 78 | 79 | clf_list = [] 80 | for k in k_list: 81 | clf = Lof(n_neighbors=k) 82 | clf.fit(X_train_norm) 83 | train_score = clf.negative_outlier_factor_ * -1 84 | test_score = clf.decision_function(X_test_norm) * -1 85 | clf_name = 'lof_' + str(k) 86 | 87 | clf_list.append(clf_name) 88 | curr_ind = len(clf_list) - 1 89 | 90 | train_scores[:, curr_ind] = train_score.ravel() 91 | test_scores[:, curr_ind] = test_score.ravel() 92 | 93 | ####################################################################### 94 | # generate normalized scores 95 | train_scores_norm, test_scores_norm = standardizer(train_scores, 96 | test_scores) 97 | 98 | # make sure the scores are actually standardized 99 | assert (math.isclose(train_scores_norm.mean(), 0, abs_tol=0.1)) 100 | # assert(math.isclose(test_scores_norm.mean(), 0, abs_tol=0.1)) 101 | assert (math.isclose(train_scores_norm.std(), 1, abs_tol=0.1)) 102 | # 
assert(math.isclose(test_scores_norm.std(), 1, abs_tol=0.1)) 103 | 104 | # generate mean and max outputs 105 | target_test_mean = np.mean(test_scores_norm, axis=1) 106 | target_test_max = np.max(test_scores_norm, axis=1) 107 | test_target_list.extend([target_test_mean, target_test_max]) 108 | method_list.extend(['mean', 'max']) 109 | 110 | # generate pseudo target for training -> for calculating weights 111 | target_mean = np.mean(train_scores_norm, axis=1).reshape(-1, 1) 112 | target_max = np.max(train_scores_norm, axis=1).reshape(-1, 1) 113 | # higher value for more outlierness 114 | ranks_mean = rankdata(target_mean).reshape(-1, 1) 115 | ranks_max = rankdata(target_max).reshape(-1, 1) 116 | 117 | # generate weighted mean 118 | # weights are distance or pearson in different modes 119 | clf_weights_pear = np.zeros([n_clf, 1]) 120 | for i in range(n_clf): 121 | clf_weights_pear[i] = \ 122 | pearsonr(target_mean, train_scores_norm[:, i].reshape(-1, 1))[0][0] 123 | 124 | clf_weights_euc = np.zeros([n_clf, 1]) 125 | for i in range(n_clf): 126 | clf_weights_euc[i] = euclidean(target_mean, 127 | train_scores_norm[:, i].reshape(-1, 1)) 128 | clf_weights_euc = clf_weights_euc.max() - clf_weights_euc 129 | 130 | for i in fixed_range: 131 | target_test_max_pear = np.max( 132 | test_scores_norm[:, argmaxp(clf_weights_pear, i)], axis=1) 133 | target_test_max_euc = np.max( 134 | test_scores_norm[:, argmaxp(clf_weights_euc, i)], axis=1) 135 | test_target_list.extend([target_test_max_pear, target_test_max_euc]) 136 | method_list.extend( 137 | ['max_' + str(i) + '_pear', 'max_' + str(i) + '_euc']) 138 | 139 | # generate weighted mean 140 | target_test_weighted_pear = np.sum( 141 | test_scores_norm * clf_weights_pear.reshape(1, 142 | -1) / clf_weights_pear.sum(), 143 | axis=1) 144 | target_test_weighted_euc = np.sum( 145 | test_scores_norm * clf_weights_euc.reshape(1, 146 | -1) / clf_weights_euc.sum(), 147 | axis=1) 148 | test_target_list.extend( 149 | [target_test_weighted_pear, 
target_test_weighted_euc]) 150 | method_list.extend(['w_mean_pear', 'w_mean_euc', ]) 151 | 152 | # generate threshold sum 153 | target_test_threshold = np.sum(test_scores_norm.clip(0), axis=1) 154 | test_target_list.append(target_test_threshold) 155 | method_list.append('threshold') 156 | 157 | # generate average of maximum (AOM) and maximum of average (MOA) 158 | target_test_aom = aom(test_scores_norm, n_buckets, n_clf) 159 | target_test_moa = moa(test_scores_norm, n_buckets, n_clf) 160 | test_target_list.extend([target_test_aom, target_test_moa]) 161 | method_list.extend(['aom', 'moa']) 162 | ################################################################### 163 | # use mean as the pseudo target 164 | for k in final_k_list: 165 | tree = KDTree(X_train_norm) 166 | dist_arr, ind_arr = tree.query(X_test_norm, k=k) 167 | 168 | m_list = ['a_dist_d', 'a_dist_r', 'a_dist_n', 169 | 'a_pear_d', 'a_pear_r', 'a_pear_n'] 170 | 171 | # initialize different buckets 172 | pred_scores_best = np.zeros([X_test.shape[0], len(m_list)]) 173 | pred_scores_max_d = np.zeros([X_test.shape[0], len(m_list)]) 174 | pred_scores_max_f5 = np.zeros([X_test.shape[0], len(m_list)]) 175 | pred_scores_max_f10 = np.zeros([X_test.shape[0], len(m_list)]) 176 | pred_scores_max_f15 = np.zeros([X_test.shape[0], len(m_list)]) 177 | 178 | for i in range(X_test.shape[0]): # X_test_norm.shape[0] 179 | # get the neighbor idx of the current point 180 | ind_k = ind_arr[i, :] 181 | 182 | # get the pseudo target: mean 183 | target_k = target_mean[ind_k,].ravel() 184 | 185 | # get the current scores from all clf 186 | curr_train_k = train_scores_norm[ind_k, :] 187 | 188 | # weights by rank 189 | weights_k_rank = ranks_mean[ind_k] 190 | 191 | # weights by distance 192 | dist_k = dist_arr[i, :].reshape(-1, 1) 193 | weights_k_dist = dist_k.max() - dist_k 194 | 195 | # initialize containers for correlation 196 | corr_dist_d = np.zeros([n_clf, ]) 197 | corr_dist_r = np.zeros([n_clf, ]) 198 | corr_dist_n = 
np.zeros([n_clf, ]) 199 | corr_pear_d = np.zeros([n_clf, ]) 200 | corr_pear_r = np.zeros([n_clf, ]) 201 | corr_pear_n = np.zeros([n_clf, ]) 202 | 203 | for d in range(n_clf): 204 | # flip distance so larger values imply larger correlation 205 | corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d], 206 | w=weights_k_dist) * -1 207 | corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d], 208 | w=weights_k_rank) * -1 209 | corr_dist_n[d,] = euclidean(target_k, curr_train_k[:, d]) * -1 210 | corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d], 211 | w=weights_k_dist) 212 | corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d], 213 | w=weights_k_rank) 214 | corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[0] 215 | 216 | corr_list = [corr_dist_d, corr_dist_r, corr_dist_n, 217 | corr_pear_d, corr_pear_r, corr_pear_n] 218 | 219 | for j in range(len(m_list)): 220 | corr_k = corr_list[j] 221 | 222 | # pick the best one 223 | best_clf_ind = np.nanargmax(corr_k) 224 | pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind] 225 | # print(k, best_clf_ind) 226 | # pick the p dynamically 227 | threshold = corr_k.max() - corr_k.std() * 0.2 228 | p = (corr_k >= threshold).sum() 229 | if p == 0: # in case extreme cases [nan and all -1's] 230 | p = 1 231 | pred_scores_max_d[i, j] = np.max( 232 | test_scores_norm[i, argmaxp(corr_k, p)]) 233 | 234 | # pick the best 5 classifiers 235 | pred_scores_max_f5[i, j] = np.max( 236 | test_scores_norm[i, argmaxp(corr_k, 5)]) 237 | # pick the best 10 classifiers 238 | pred_scores_max_f10[i, j] = np.max( 239 | test_scores_norm[i, argmaxp(corr_k, 10)]) 240 | # pick the best 15 classifiers 241 | pred_scores_max_f15[i, j] = np.max( 242 | test_scores_norm[i, argmaxp(corr_k, 15)]) 243 | 244 | for m in range(len(m_list)): 245 | test_target_list.extend([pred_scores_best[:, m], 246 | pred_scores_max_d[:, m], 247 | pred_scores_max_f5[:, m], 248 | pred_scores_max_f10[:, m], 249 | pred_scores_max_f15[:, m]]) 250 | 
method_list.extend(['dcs_best_' + m_list[m] + '_' + str(k), 251 | 'dcs_dyn_' + m_list[m] + '_' + str(k), 252 | 'dcs_f5_' + m_list[m] + '_' + str(k), 253 | 'dcs_f10_' + m_list[m] + '_' + str(k), 254 | 'dcs_f15_' + m_list[m] + '_' + str(k)]) 255 | k_rec_list.extend([k, k, k, k, k]) 256 | ########################################################################## 257 | 258 | # use max for pseudo target 259 | for k in final_k_list: 260 | print('processing', k) 261 | tree = KDTree(X_train_norm) 262 | dist_arr, ind_arr = tree.query(X_test_norm, k=k) 263 | 264 | m_list = ['m_dist_d', 'm_dist_r', 'm_dist_n', 265 | 'm_pear_d', 'm_pear_r', 'm_pear_n'] 266 | 267 | pred_scores_best = np.zeros([X_test.shape[0], len(m_list)]) 268 | pred_scores_max_d = np.zeros([X_test.shape[0], len(m_list)]) 269 | pred_scores_max_f5 = np.zeros([X_test.shape[0], len(m_list)]) 270 | pred_scores_max_f10 = np.zeros([X_test.shape[0], len(m_list)]) 271 | pred_scores_max_f15 = np.zeros([X_test.shape[0], len(m_list)]) 272 | 273 | for i in range(X_test.shape[0]): # X_test_norm.shape[0] 274 | # get the neighbor idx of the current point 275 | ind_k = ind_arr[i, :] 276 | 277 | # get the pseudo target: max 278 | target_k = target_max[ind_k,].ravel() 279 | 280 | # get the current scores from all clf 281 | curr_train_k = train_scores_norm[ind_k, :] 282 | 283 | # weights by rank 284 | weights_k_rank = ranks_max[ind_k] 285 | # weights by distance 286 | dist_k = dist_arr[i, :].reshape(-1, 1) 287 | weights_k_dist = dist_k.max() - dist_k 288 | 289 | corr_dist_d = np.zeros([n_clf, ]) 290 | corr_dist_r = np.zeros([n_clf, ]) 291 | corr_dist_n = np.zeros([n_clf, ]) 292 | corr_pear_d = np.zeros([n_clf, ]) 293 | corr_pear_r = np.zeros([n_clf, ]) 294 | corr_pear_n = np.zeros([n_clf, ]) 295 | 296 | for d in range(n_clf): 297 | corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d], 298 | w=weights_k_dist) * -1 299 | corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d], 300 | w=weights_k_rank) * -1 301 | 
corr_dist_n[d,] = euclidean(target_k, curr_train_k[:, d]) * -1 302 | corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d], 303 | w=weights_k_dist) 304 | corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d], 305 | w=weights_k_rank) 306 | corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[0] 307 | 308 | corr_list = [corr_dist_d, corr_dist_r, corr_dist_n, 309 | corr_pear_d, corr_pear_r, corr_pear_n] 310 | 311 | for j in range(len(m_list)): 312 | corr_k = corr_list[j] 313 | 314 | # pick the best one 315 | best_clf_ind = np.nanargmax(corr_k) 316 | pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind] 317 | 318 | # pick the p dynamically 319 | threshold = corr_k.max() - corr_k.std() * 0.2 320 | p = (corr_k >= threshold).sum() 321 | if p == 0: # in case extreme cases [nan and all -1's] 322 | p = 1 323 | pred_scores_max_d[i, j] = np.mean( 324 | test_scores_norm[i, argmaxp(corr_k, p)]) 325 | 326 | # pick the best 5 classifiers 327 | pred_scores_max_f5[i, j] = np.mean( 328 | test_scores_norm[i, argmaxp(corr_k, 5)]) 329 | # pick the best 10 classifiers 330 | pred_scores_max_f10[i, j] = np.mean( 331 | test_scores_norm[i, argmaxp(corr_k, 10)]) 332 | # pick the best 15 classifiers 333 | pred_scores_max_f15[i, j] = np.mean( 334 | test_scores_norm[i, argmaxp(corr_k, 15)]) 335 | 336 | for m in range(len(m_list)): 337 | test_target_list.extend([pred_scores_best[:, m], 338 | pred_scores_max_d[:, m], 339 | pred_scores_max_f5[:, m], 340 | pred_scores_max_f10[:, m], 341 | pred_scores_max_f15[:, m]]) 342 | method_list.extend(['dcs_best_' + m_list[m] + '_' + str(k), 343 | 'dcs_dyn_' + m_list[m] + '_' + str(k), 344 | 'dcs_f5_' + m_list[m] + '_' + str(k), 345 | 'dcs_f10_' + m_list[m] + '_' + str(k), 346 | 'dcs_f15_' + m_list[m] + '_' + str(k)]) 347 | k_rec_list.extend([k, k, k, k, k]) 348 | 349 | trans_data_list.append( 350 | TSNE(n_components=2, init='pca').fit_transform(X_test)) 351 | X_test_list.append(X_test) 352 | X_test_name_list.append(data) 353 | 
    DCSO_best_list.append(DCSO_best)
    test_target_list_list.append(test_target_list)
    y_test_list.append(y_test)
##########################################################################
plt.figure(figsize=(12, 6))

for k in range(1):

    # binarize the scores of DCSO and of the two baselines for comparison
    dcs_target = get_label_n(y_test_list[k],
                             test_target_list_list[k][DCSO_best_list[k]])
    mean_target = get_label_n(y_test_list[k], test_target_list_list[k][0])
    max_target = get_label_n(y_test_list[k], test_target_list_list[k][1])

    normal_ind = []
    outlier_ind = []

    equal_right_mean = []
    equal_wrong_mean = []

    equal_right_max = []
    equal_wrong_max = []

    dcs_out_mean = []
    mean_out = []

    dcs_norm_mean = []
    mean_norm = []

    dcs_out_max = []
    max_out = []

    dcs_norm_max = []
    max_norm = []

    for i in range(X_test_list[k].shape[0]):
        if y_test_list[k][i] == 0:
            normal_ind.append(i)
        else:
            outlier_ind.append(i)

        # compare DCSO with the averaging baseline
        if dcs_target[i] == mean_target[i] == y_test_list[k][i]:
            print(i, 'equal & right')
            equal_right_mean.append(i)

        elif dcs_target[i] == mean_target[i] and dcs_target[i] != \
                y_test_list[k][i]:
            print(i, 'equal & wrong')
            equal_wrong_mean.append(i)

        elif dcs_target[i] != mean_target[i]:
            print(i, 'not equal')
            if y_test_list[k][i] == 1:
                if dcs_target[i] == y_test_list[k][i]:
                    dcs_out_mean.append(i)
                else:
                    mean_out.append(i)
            else:
                if dcs_target[i] == y_test_list[k][i]:
                    dcs_norm_mean.append(i)
                else:
                    mean_norm.append(i)
        ##################################################################
        # compare DCSO with the maximization baseline
        if dcs_target[i] == max_target[i] == y_test_list[k][i]:
            print(i, 'equal & right')
            equal_right_max.append(i)

        elif dcs_target[i] == max_target[i] and dcs_target[i] != \
                y_test_list[k][i]:
            print(i, 'equal & wrong')
            equal_wrong_max.append(i)

        elif dcs_target[i] != max_target[i]:
            print(i, 'not equal')
            if y_test_list[k][i] == 1:
                if dcs_target[i] == y_test_list[k][i]:
                    dcs_out_max.append(i)
                else:
                    max_out.append(i)
            else:
                if dcs_target[i] == y_test_list[k][i]:
                    dcs_norm_max.append(i)
                else:
                    max_norm.append(i)

    # plot mean
    plt.subplot(121)

    plt.scatter(trans_data_list[k][normal_ind, 0],
                trans_data_list[k][normal_ind, 1], label='Normal',
                color='orange', alpha=0.6, s=24, marker='o')
    plt.scatter(trans_data_list[k][outlier_ind, 0],
                trans_data_list[k][outlier_ind, 1], label='Outlying',
                color='red', alpha=0.6, s=28, marker='s')

    # points that only the averaging baseline labels correctly
    plt.scatter(trans_data_list[k][mean_norm, 0],
                trans_data_list[k][mean_norm, 1], label='SG_N',
                color='g', alpha=0.95, s=40, marker='v')
    plt.scatter(trans_data_list[k][mean_out, 0],
                trans_data_list[k][mean_out, 1], label='SG_O',
                color='g', alpha=0.95, s=40, marker='^')

    # points that only DCSO labels correctly (relative to averaging)
    plt.scatter(trans_data_list[k][dcs_norm_mean, 0],
                trans_data_list[k][dcs_norm_mean, 1], label='DCSO_N',
                color='b', alpha=0.95, s=54, marker='x')
    plt.scatter(trans_data_list[k][dcs_out_mean, 0],
                trans_data_list[k][dcs_out_mean, 1], label='DCSO_O',
                color='b', alpha=0.95, s=65, marker='+')

    plt.legend(ncol=3, prop={'size': 7.5}, loc='lower right',
               bbox_transform=plt.gcf().transFigure)
    plt.xticks([])
    plt.yticks([])
    plt.title('SG_A vs. DCSO (' + X_test_name_list[k] + ')', fontsize=12)

    # plot max
    plt.subplot(122)

    plt.scatter(trans_data_list[k][normal_ind, 0],
                trans_data_list[k][normal_ind, 1], label='Normal',
                color='orange', alpha=0.6, s=24, marker='o')
    plt.scatter(trans_data_list[k][outlier_ind, 0],
                trans_data_list[k][outlier_ind, 1], label='Outlying',
                color='red', alpha=0.6, s=28, marker='s')

    # points that only the maximization baseline labels correctly
    plt.scatter(trans_data_list[k][max_norm, 0],
                trans_data_list[k][max_norm, 1], label='SG_N',
                color='g', alpha=0.95, s=40, marker='v')
    plt.scatter(trans_data_list[k][max_out, 0],
                trans_data_list[k][max_out, 1], label='SG_O',
                color='g', alpha=0.95, s=40, marker='^')

    # points that only DCSO labels correctly (relative to maximization)
    plt.scatter(trans_data_list[k][dcs_norm_max, 0],
                trans_data_list[k][dcs_norm_max, 1], label='DCSO_N',
                color='b', alpha=0.95, s=54, marker='x')
    plt.scatter(trans_data_list[k][dcs_out_max, 0],
                trans_data_list[k][dcs_out_max, 1], label='DCSO_O',
                color='b', alpha=0.95, s=65, marker='+')

    plt.legend(ncol=3, prop={'size': 7.5}, loc='lower right',
               bbox_transform=plt.gcf().transFigure)
    plt.xticks([])
    plt.yticks([])
    plt.title('SG_M vs. DCSO (' + X_test_name_list[k] + ')', fontsize=12)

plt.tight_layout()

# create the output directory if it does not exist
pathlib.Path('viz').mkdir(parents=True, exist_ok=True)
# save the figure; pathlib keeps the path portable across platforms
plt.savefig(str(pathlib.Path('viz') / (data + '.png')), dpi=330)

--------------------------------------------------------------------------------
/viz/Annthyroid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Annthyroid.png

--------------------------------------------------------------------------------
/viz/Cardio.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Cardio.png

--------------------------------------------------------------------------------
/viz/Letter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Letter.png

--------------------------------------------------------------------------------
/viz/Mnist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Mnist.png

--------------------------------------------------------------------------------
/viz/Pendigits.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Pendigits.png

--------------------------------------------------------------------------------
/viz/Pima.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Pima.png

--------------------------------------------------------------------------------
/viz/Satellite.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Satellite.png

--------------------------------------------------------------------------------
/viz/Thyroid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Thyroid.png

--------------------------------------------------------------------------------
/viz/Vowels.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Vowels.png

--------------------------------------------------------------------------------