├── README.md
├── datasets
    └── cardio.mat
├── demo_knn.py
├── demo_lof.py
├── md_figs
    ├── flowchart.png
    ├── knn_prc.png
    ├── knn_roc.png
    ├── lof_prc.png
    ├── lof_roc.png
    └── tsne.png
├── models
    ├── __init__.py
    ├── combination.py
    ├── knn.py
    └── lof.py
├── requirements.txt
├── utility
    ├── __init__.py
    ├── stat_models.py
    └── utility.py
├── vis_tsne.py
└── viz
    ├── Annthyroid.png
    ├── Cardio.png
    ├── Letter.png
    ├── Mnist.png
    ├── Pendigits.png
    ├── Pima.png
    ├── Satellite.png
    ├── Thyroid.png
    └── Vowels.png


/README.md:
--------------------------------------------------------------------------------
 1 | # DCSO (Dynamic Combination of Detector Scores for Outlier Ensembles)
 2 | ### Supplementary materials: datasets, demo source code and sample outputs.
 3 | 
 4 | Y. Zhao and M.K. Hryniewicki, "DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles," *ACM KDD Workshop on Outlier Detection De-constructed (ODD v5.0)*, 2018.
 5 | 
 6 | Please cite the paper as:
 7 | 
 8 |     @conference{zhao2018dcso,
 9 |         author     = {Zhao, Yue and Hryniewicki, Maciej K},
10 |         title      = {{DCSO:} Dynamic Combination of Detector Scores for Outlier Ensembles},
11 |         booktitle  = {ACM SIGKDD ODD Workshop},
12 |         year       = {2018},
13 |         address    = {London, UK},
14 |         timestamp  = {Mon, 22 Oct 2018 13:07:32 +0200},
15 |     }
16 | 
17 |     
18 | **[PDF](https://www.andrew.cmu.edu/user/lakoglu/odd/accepted_papers/ODD_v50_paper_3.pdf)** | 
19 | **[Presentation Slides](https://yuezhao.squarespace.com/s/ODD-Zhao-DCSO.pdf)**
20 | 
21 | **Note**: [LSCP](https://github.com/yzhao062/lscp) is an upgraded version of DCSO, which has been accepted at SDM '19.
22 | 
23 | ------------
24 | 
25 | Additional notes:
26 | 1. Three versions of the code are (or will be) provided:
27 |    1. **Demo version** (demo_lof.py and demo_knn.py) is provided for fast reproduction of the experimental results. It only compares the baseline algorithms with the DCSO algorithms; the effect of parameters, e.g., the choice of *k*, is not included.
28 |    2. **Full version** (tba) will be released after moderate code cleanup and optimization. In contrast to the demo version, the full version also considers the impact of parameter settings. It is therefore relatively slow and will be optimized further. The demo version is sufficient to demonstrate the idea, so we suggest using it to experiment with DCSO while the full version is being optimized.
29 |    3. **Production version** (tba) will be released with full optimization and testing as a framework. It is intended for real applications and should therefore require fewer dependencies and run faster.
30 | 2. Results are expected to show **small variations** due to random processes, e.g., splitting the data into training and test sets. Running the demo code therefore yields results that are similar, but not identical, to those in the paper.
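
The run-to-run variation mentioned above comes from the random train/test split. As a hedged illustration only (the helper `split_indices` and the seed value are hypothetical and not part of this repository), fixing the seed makes a 60/40 split repeatable:

```python
import numpy as np


def split_indices(n_samples, test_size=0.4, seed=None):
    """Return (train_idx, test_idx) for a random split.

    Hypothetical helper for illustration; the demo scripts use
    sklearn's train_test_split instead.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)          # shuffled row indices
    n_test = int(round(n_samples * test_size))
    return perm[n_test:], perm[:n_test]


# Cardio has 1831 points; the same seed reproduces the same split.
train_a, test_a = split_indices(1831, test_size=0.4, seed=42)
train_b, test_b = split_indices(1831, test_size=0.4, seed=42)
print(len(train_a), len(test_a), np.array_equal(test_a, test_b))
```

In the demo scripts, passing a fixed `random_state` to `train_test_split` should have a comparable effect, although the resulting numbers would still not be guaranteed to match the paper.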
31 | ------------
32 | 
33 | ##  Introduction
34 | In this paper, an unsupervised outlier detector combination framework called DCSO (Dynamic Combination of Detector Scores for Outlier Ensembles) is proposed, demonstrated and assessed for the dynamic selection of the most competent base detectors, with an emphasis on data locality. The proposed DCSO framework first defines the local region of a test instance by its k nearest neighbors and then identifies the top-performing base detectors within the local region.
35 | Like classification ensembles, DCSO has two key stages. In the Generation stage, the chosen base detector algorithm is initialized with distinct parameters to build a pool of diversified detectors, all of which are then fitted on the entire training dataset. In the Combination stage, DCSO picks the most competent detector in the local region defined by the test instance. Finally, the selected detector is used to predict the outlier score for the test instance.
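
The Combination stage can be sketched for a single test instance as follows. This is a minimal illustration rather than the repository's implementation: `dcso_select` and its argument names are hypothetical, the pseudo ground truth is assumed to be the average of the detector scores, and plain Pearson correlation (`np.corrcoef`) stands in for the similarity measure.

```python
import numpy as np


def dcso_select(train_scores, neighbor_idx, test_scores_row):
    """Pick the locally most competent detector for one test instance.

    train_scores:    (n_train, n_detectors) training scores of the pool
    neighbor_idx:    indices of the test instance's nearest training points
    test_scores_row: (n_detectors,) scores of the test instance
    """
    local = train_scores[neighbor_idx]      # scores inside the local region
    pseudo_target = local.mean(axis=1)      # assumed pseudo ground truth
    # similarity of every detector to the pseudo target in the local region
    corr = np.array([np.corrcoef(pseudo_target, local[:, d])[0, 1]
                     for d in range(local.shape[1])])
    best = int(np.nanargmax(corr))          # most competent detector
    return best, test_scores_row[best]


# Toy pool of 3 detectors scored on 6 training points (made-up numbers).
train_scores = np.array([[1., 2., 4.],
                         [2., 1., 3.],
                         [3., 4., 2.],
                         [4., 3., 1.],
                         [9., 9., 9.],
                         [0., 0., 0.]])
neighbors = np.array([0, 1, 2, 3])          # local region of the test point
test_row = np.array([0.1, 0.7, -0.3])       # test scores from each detector

best, score = dcso_select(train_scores, neighbors, test_row)
print(best, score)
```

In the demo scripts the neighbor indices come from a `KDTree` query on the standardized training data, and several similarity measures and weightings are compared instead of the single one assumed here.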
36 | 
37 | ![Flowchart](https://github.com/yzhao062/DCSO/blob/master/md_figs/flowchart.png)
38 | 
39 | ## Dependency
40 | The experiment code is written in Python 3.6 and built on a number of Python packages:
41 | - numpy>=1.13
42 | - scipy>=0.19
43 | - scikit_learn>=0.19
44 | 
45 | Batch installation is possible using the supplied "requirements.txt" with pip or conda.
46 | 
47 | ------------
48 | 
49 | ## Datasets
50 | Ten datasets are used (see the datasets folder):
51 | 
52 | | Datasets   | # Points (*n*) | Dimension (*d*) | # Outliers | % Outliers |
53 | | ---------- | -------------- | --------------- | ---------- | ---------- |
54 | | Pima       | 768            | 8               | 268        | 34.8958    |
55 | | Vowels     | 1456           | 12              | 50         | 3.4341     |
56 | | Letter     | 1600           | 32              | 100        | 6.2500     |
57 | | Cardio     | 1831           | 21              | 176        | 9.6122     |
58 | | Thyroid    | 3772           | 6               | 93         | 2.4655     |
59 | | Satellite  | 6435           | 36              | 2036       | 31.6394    |
60 | | Pendigits  | 6870           | 16              | 156        | 2.2707     |
61 | | Annthyroid | 7200           | 6               | 534        | 7.4167     |
62 | | Mnist      | 7603           | 100             | 700        | 9.2069     |
63 | | Shuttle    | 49097          | 9               | 3511       | 7.1511     |
64 | 
65 | All datasets are accessible from http://odds.cs.stonybrook.edu/. To cite the datasets, please refer to:
66 | > Shebuti Rayana (2016).  ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.
67 | 
68 | To replicate the demo, you should download the datasets from http://odds.cs.stonybrook.edu/ and place them in ./datasets/. We do not provide the data download.
69 | 
70 | ------------
71 | 
72 | ## Usage and Sample Output (Demo Version)
73 | The experiments can be reproduced by running **demo_lof.py** and **demo_knn.py** directly. Simply download or clone the entire repository and execute the code with
74 | ```bash
75 | python demo_lof.py
76 | ```
77 | 
78 | The only difference between **demo_lof.py** and **demo_knn.py** is the choice of base detector: the former uses LOF, while the latter uses *k*NN. Two evaluation metrics are used:
79 | 1.  The area under the receiver operating characteristic curve (**ROC**)
80 | 2.  Precision at rank m (***P*@*m***)
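
Both metrics can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the code in `utility/utility.py`: the function names are hypothetical, *m* is assumed to be the number of true outliers, and labels are assumed to be 1 for outliers and 0 for normal points.

```python
import numpy as np


def precision_at_m(y_true, scores):
    """Fraction of true outliers among the m highest-scored points,
    with m assumed to equal the number of true outliers."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    m = int(y_true.sum())
    top_m = np.argsort(-scores, kind="stable")[:m]   # indices of top-m scores
    return y_true[top_m].sum() / m


def roc_auc(y_true, scores):
    """Probability that a random outlier scores higher than a random
    normal point (ties count as one half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]               # all outlier/normal pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


# Made-up labels and scores: 2 outliers among 6 points.
y = [0, 0, 1, 0, 1, 0]
s = [0.1, 0.4, 0.35, 0.2, 0.8, 0.3]
print(precision_at_m(y, s), roc_auc(y, s))           # 0.5 0.875
```

The pairwise ROC computation is quadratic in the number of points, which is fine for a toy example; the demo scripts rely on `sklearn.metrics.roc_auc_score` instead.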
81 | 
82 | The results of **demo_lof.py** and **demo_knn.py** are presented below. Tables 1 and 2 show the results when **LOF** is used as the base detector, while Tables 3 and 4 show the results when ***k*NN** is used as the base detector. The highest score is highlighted in **bold**, while the lowest is marked with an **asterisk (*)**.
83 | 
84 | ![ LOF_ROC](https://github.com/yzhao062/DCSO/blob/master/md_figs/lof_roc.png)
85 | ![ LOF_PRC](https://github.com/yzhao062/DCSO/blob/master/md_figs/lof_prc.png)
86 | ![ KNN_ROC](https://github.com/yzhao062/DCSO/blob/master/md_figs/knn_roc.png)
87 | ![ KNN_PRC](https://github.com/yzhao062/DCSO/blob/master/md_figs/knn_prc.png)
88 | 
89 | ## Visualizations (based on demo_lof.py)
90 | The figure below compares the performance of SG and DCSO methods on **Cardio**, **Thyroid** and **Letter** using t-distributed stochastic neighbor embedding (t-SNE). Normal and outlying points are denoted by **orange dots** and **red squares**, respectively. Normal points that are correctly identified only by SG methods are labeled SG_N (**green triangle_down**), and those identified only by DCSO are labeled DCSO_N (**blue cross sign**). Similarly, outliers that are detected only by SG or only by DCSO methods are denoted as SG_N (**green triangle_up**) and DCSO_N (**blue plus sign**), respectively.
91 | 
92 | ![ tsne](https://github.com/yzhao062/DCSO/blob/master/md_figs/tsne.png)
93 | 
94 | The full visualizations can be found at [t-SNE](https://github.com/yzhao062/DCSO/tree/master/viz "t-SNE"). To replicate the visualizations, please use "**vis_tsne.py**". Note that this script is not fully optimized and can be cumbersome to use.
95 | 


--------------------------------------------------------------------------------
/datasets/cardio.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/datasets/cardio.mat


--------------------------------------------------------------------------------
/demo_knn.py:
--------------------------------------------------------------------------------
  1 | import datetime
  2 | 
  3 | import numpy as np
  4 | from scipy.stats import rankdata
  5 | from scipy.stats import pearsonr
  6 | from scipy.spatial.distance import euclidean
  7 | 
  8 | from sklearn.model_selection import train_test_split
  9 | from sklearn.neighbors import KDTree
 10 | from sklearn.metrics import roc_auc_score
 11 | from sklearn.metrics import average_precision_score
 12 | 
 13 | from models.combination import aom, moa
 14 | from utility.stat_models import wpearsonr
 15 | from utility.utility import train_predict_knn
 16 | from utility.utility import print_save_result
 17 | from utility.utility import argmaxp, loaddata, precision_n_score, standardizer
 18 | 
 19 | # access the timestamp for logging purpose
 20 | today = datetime.datetime.now()
 21 | timestamp = today.strftime("%Y%m%d_%H%M%S")
 22 | 
 23 | # set numpy parameters
 24 | np.set_printoptions(suppress=True, precision=4)
 25 | 
 26 | ###############################################################################
 27 | # parameter settings
 28 | data = 'cardio'
 29 | base_detector = 'knn'
 30 | n_ite = 20  # number of iterations
 31 | test_size = 0.4  # training = 60%, testing = 40%
 32 | n_baselines = 30  # the number of baseline algorithms, DO NOT CHANGE
 33 | loc_region_size = 100  # for consistency fixed to 100
 34 | 
 35 | # k list for kNN algorithms, for constructing a pool of base detectors
 36 | k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
 37 |           110, 120, 130, 140, 150, 160, 170, 180, 190, 200]
 38 | 
 39 | n_clf = len(k_list)  # 20 base detectors
 40 | 
 41 | # for SG_AOM and SG_MOA, choose the right number of buckets
 42 | n_buckets = 5
 43 | n_clf_bucket = int(n_clf / n_buckets)
 44 | assert (n_clf % n_buckets == 0)  # in case wrong number of buckets
 45 | 
 46 | alpha = 0.2  # control the strength of dynamic ensemble selection
 47 | 
 48 | # flag for printing and output saving
 49 | verbose = False
 50 | 
 51 | ###############################################################################
 52 | 
 53 | if __name__ == '__main__':
 54 | 
 55 |     X_orig, y_orig, outlier_perc = loaddata(data)
 56 | 
 57 |     # initialize the matrix for storing scores
 58 |     roc_mat = np.zeros([n_ite, n_baselines])  # area under the ROC curve
 59 |     ap_mat = np.zeros([n_ite, n_baselines])  # average precision
 60 |     prc_mat = np.zeros([n_ite, n_baselines])  # precision @ m
 61 | 
 62 |     for t in range(n_ite):
 63 |         print('\nn_ite', t + 1, data)  # print status
 64 | 
 65 |         # split the data into training and testing
 66 |         X_train, X_test, y_train, y_test = train_test_split(X_orig, y_orig,
 67 |                                                             test_size=test_size)
 68 |         # normalize the data
 69 |         X_train_norm, X_test_norm = standardizer(X_train, X_test)
 70 | 
 71 |         train_scores = np.zeros([X_train.shape[0], n_clf])
 72 |         test_scores = np.zeros([X_test.shape[0], n_clf])
 73 | 
 74 |         # initialize the lists to store the results
 75 |         test_target_list = []
 76 |         method_list = []
 77 | 
 78 |         # generate a pool of detectors and predict on test instances
 79 |         train_scores, test_scores = train_predict_knn(k_list, X_train_norm,
 80 |                                                       X_test_norm,
 81 |                                                       train_scores,
 82 |                                                       test_scores)
 83 | 
 84 |         #######################################################################
 85 |         # generate normalized scores
 86 |         train_scores_norm, test_scores_norm = standardizer(train_scores,
 87 |                                                            test_scores)
 88 |         # generate mean and max outputs
 89 |         # SG_A and SG_M
 90 |         target_test_mean = np.mean(test_scores_norm, axis=1)
 91 |         target_test_max = np.max(test_scores_norm, axis=1)
 92 |         test_target_list.extend([target_test_mean, target_test_max])
 93 |         method_list.extend(['sg_a', 'sg_m'])
 94 | 
 95 |         # generate pseudo target for training -> for calculating weights
 96 |         target_mean = np.mean(train_scores_norm, axis=1).reshape(-1, 1)
 97 |         target_max = np.max(train_scores_norm, axis=1).reshape(-1, 1)
 98 | 
 99 |         # higher value for more outlyingness
100 |         ranks_mean = rankdata(target_mean).reshape(-1, 1)
101 |         ranks_max = rankdata(target_max).reshape(-1, 1)
102 | 
103 |         # generate weighted mean
104 |         # weights are distance or pearson in different modes
105 |         clf_weights_pear = np.zeros([n_clf, 1])
106 |         for i in range(n_clf):
107 |             clf_weights_pear[i] = \
108 |                 pearsonr(target_mean, train_scores_norm[:, i].reshape(-1, 1))[
109 |                     0][0]
110 | 
111 |         # generate weighted mean
112 |         target_test_weighted_pear = np.sum(
113 |             test_scores_norm * clf_weights_pear.reshape(1,
114 |                                                         -1) / clf_weights_pear.sum(),
115 |             axis=1)
116 | 
117 |         test_target_list.append(target_test_weighted_pear)
118 |         method_list.append('sg_wa')
119 | 
120 |         # generate threshold sum
121 |         target_test_threshold = np.sum(test_scores_norm.clip(0), axis=1)
122 |         test_target_list.append(target_test_threshold)
123 |         method_list.append('sg_thresh')
124 | 
125 |         # generate average of maximum (SG_AOM) and maximum of average (SG_MOA)
126 |         target_test_aom = aom(test_scores_norm, n_buckets, n_clf)
127 |         target_test_moa = moa(test_scores_norm, n_buckets, n_clf)
128 |         test_target_list.extend([target_test_aom, target_test_moa])
129 |         method_list.extend(['aom', 'moa'])
130 |         ##################################################################
131 | 
132 |         # define local region using KD trees
133 |         tree = KDTree(X_train_norm)
134 |         dist_arr, ind_arr = tree.query(X_test_norm, k=loc_region_size)
135 | 
136 |         # different similarity measures
137 |         # s[euc]_w[rank] -> use euclidean distance for similarity measure
138 |         #                   use outlying rank as the weight
139 |         m_list = ['s[euc]_w[dist]', 's[euc]_w[rank]', 's[euc]_w[na]',
140 |                   's[pear]_w[dist]', 's[pear]_w[rank]', 's[pear]_w[na]']
141 | 
142 |         pred_scores_best = np.zeros([X_test.shape[0], len(m_list)])
143 |         pred_scores_ens = np.zeros([X_test.shape[0], len(m_list)])
144 | 
145 |         for i in range(X_test.shape[0]):  # iterate over all test instances
146 |             # get the neighbor idx of the current point
147 |             ind_k = ind_arr[i, :]
148 | 
149 |             # get the pseudo target: mean
150 |             target_k = target_mean[ind_k,].ravel()
151 | 
152 |             # get the current scores from all clf
153 |             curr_train_k = train_scores_norm[ind_k, :]
154 | 
155 |             # weights by rank
156 |             weights_k_rank = ranks_mean[ind_k]
157 | 
158 |             # weights by euclidean distance
159 |             dist_k = dist_arr[i, :].reshape(-1, 1)
160 |             weights_k_dist = dist_k.max() - dist_k
161 | 
162 |             # initialize containers for correlation
163 |             corr_dist_d = np.zeros([n_clf, ])
164 |             corr_dist_r = np.zeros([n_clf, ])
165 |             corr_dist_n = np.zeros([n_clf, ])
166 |             corr_pear_d = np.zeros([n_clf, ])
167 |             corr_pear_r = np.zeros([n_clf, ])
168 |             corr_pear_n = np.zeros([n_clf, ])
169 | 
170 |             for d in range(n_clf):
171 |                 # flip distance so larger values imply larger correlation
172 |                 corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d],
173 |                                             w=weights_k_dist) * -1
174 |                 corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d],
175 |                                             w=weights_k_rank) * -1
176 |                 corr_dist_n[d,] = euclidean(target_k,
177 |                                             curr_train_k[:, d]) * -1
178 |                 corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d],
179 |                                             w=weights_k_dist)
180 |                 corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d],
181 |                                             w=weights_k_rank)
182 |                 corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[
183 |                     0]
184 | 
185 |             corr_list = [corr_dist_d, corr_dist_r, corr_dist_n,
186 |                          corr_pear_d, corr_pear_r, corr_pear_n]
187 | 
188 |             for j in range(len(m_list)):
189 |                 corr_k = corr_list[j]
190 | 
191 |                 # pick the best one
192 |                 best_clf_ind = np.nanargmax(corr_k)
193 |                 pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind]
194 | 
195 |                 # pick the p dynamically
196 |                 threshold = corr_k.max() - corr_k.std() * alpha
197 |                 p = (corr_k >= threshold).sum()
198 |                 if p == 0:  # in case extreme cases [nan and all -1's]
199 |                     p = 1
200 |                 pred_scores_ens[i, j] = np.max(
201 |                     test_scores_norm[i, argmaxp(corr_k, p)])
202 | 
203 |         for m in range(len(m_list)):
204 |             test_target_list.extend([pred_scores_best[:, m],
205 |                                      pred_scores_ens[:, m]])
206 |             method_list.extend(['DCSO_a_' + m_list[m],
207 |                                 'DCSO_moa_' + m_list[m]])
208 |         ######################################################################
209 | 
210 |         # use max for pseudo ground truth generation
211 |         tree = KDTree(X_train_norm)
212 |         dist_arr, ind_arr = tree.query(X_test_norm, k=loc_region_size)
213 | 
214 |         pred_scores_best = np.zeros([X_test.shape[0], len(m_list)])
215 |         pred_scores_ens = np.zeros([X_test.shape[0], len(m_list)])
216 | 
217 |         for i in range(X_test.shape[0]):  # X_test_norm.shape[0]
218 |             # get the neighbor idx of the current point
219 |             ind_k = ind_arr[i, :]
220 | 
221 |             # get the pseudo target: max
222 |             target_k = target_max[ind_k,].ravel()
223 | 
224 |             # get the current scores from all clf
225 |             curr_train_k = train_scores_norm[ind_k, :]
226 | 
227 |             # weights by rank
228 |             weights_k_rank = ranks_max[ind_k]
229 |             # weights by distance
230 |             dist_k = dist_arr[i, :].reshape(-1, 1)
231 |             weights_k_dist = dist_k.max() - dist_k
232 | 
233 |             corr_dist_d = np.zeros([n_clf, ])
234 |             corr_dist_r = np.zeros([n_clf, ])
235 |             corr_dist_n = np.zeros([n_clf, ])
236 |             corr_pear_d = np.zeros([n_clf, ])
237 |             corr_pear_r = np.zeros([n_clf, ])
238 |             corr_pear_n = np.zeros([n_clf, ])
239 | 
240 |             for d in range(n_clf):
241 |                 corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d],
242 |                                             w=weights_k_dist) * -1
243 |                 corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d],
244 |                                             w=weights_k_rank) * -1
245 |                 corr_dist_n[d,] = euclidean(target_k,
246 |                                             curr_train_k[:, d]) * -1
247 |                 corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d],
248 |                                             w=weights_k_dist)
249 |                 corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d],
250 |                                             w=weights_k_rank)
251 |                 corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[
252 |                     0]
253 | 
254 |             corr_list = [corr_dist_d, corr_dist_r, corr_dist_n,
255 |                          corr_pear_d, corr_pear_r, corr_pear_n]
256 | 
257 |             for j in range(len(m_list)):
258 |                 corr_k = corr_list[j]
259 | 
260 |                 # pick the best one
261 |                 best_clf_ind = np.nanargmax(corr_k)
262 |                 pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind]
263 | 
264 |                 # pick s detectors dynamically
265 |                 threshold = corr_k.max() - corr_k.std() * alpha
266 |                 p = (corr_k >= threshold).sum()
267 |                 if p == 0:  # in case extreme cases [nan and all -1's]
268 |                     p = 1
269 |                 pred_scores_ens[i, j] = np.mean(
270 |                     test_scores_norm[i, argmaxp(corr_k, p)])
271 | 
272 |         for m in range(len(m_list)):
273 |             test_target_list.extend([pred_scores_best[:, m],
274 |                                      pred_scores_ens[:, m]])
275 |             method_list.extend(['DCSO_m_' + m_list[m],
276 |                                 'DCSO_aom_' + m_list[m]])
277 | 
278 |         # store performance information and print result
279 |         for i in range(n_baselines):
280 |             roc_mat[t, i] = roc_auc_score(y_test, test_target_list[i])
281 |             ap_mat[t, i] = average_precision_score(y_test,
282 |                                                    test_target_list[i])
283 |             prc_mat[t, i] = precision_n_score(y_test, test_target_list[i])
284 |             print(method_list[i], roc_mat[t, i])
285 | 
286 |     # print and save the result
287 |     # default location is /results/***.csv
288 |     print_save_result(data, base_detector, n_baselines, n_clf, n_ite, roc_mat,
289 |                       ap_mat, prc_mat, method_list, timestamp, verbose)
290 | 


--------------------------------------------------------------------------------
/demo_lof.py:
--------------------------------------------------------------------------------
  1 | import datetime
  2 | 
  3 | import numpy as np
  4 | from scipy.stats import rankdata
  5 | from scipy.stats import pearsonr
  6 | from scipy.spatial.distance import euclidean
  7 | 
  8 | from sklearn.model_selection import train_test_split
  9 | from sklearn.neighbors import KDTree
 10 | from sklearn.metrics import roc_auc_score
 11 | from sklearn.metrics import average_precision_score
 12 | 
 13 | from models.combination import aom, moa
 14 | from utility.stat_models import wpearsonr
 15 | from utility.utility import train_predict_lof
 16 | from utility.utility import print_save_result
 17 | from utility.utility import argmaxp, loaddata, precision_n_score, standardizer
 18 | 
 19 | # access the timestamp for logging purpose
 20 | today = datetime.datetime.now()
 21 | timestamp = today.strftime("%Y%m%d_%H%M%S")
 22 | 
 23 | # set numpy parameters
 24 | np.set_printoptions(suppress=True, precision=4)
 25 | 
 26 | ###############################################################################
 27 | # parameter settings
 28 | data = 'cardio'
 29 | base_detector = 'lof'
 30 | n_ite = 20  # number of iterations
 31 | test_size = 0.4  # training = 60%, testing = 40%
 32 | n_baselines = 30  # the number of baseline algorithms, DO NOT CHANGE
 33 | loc_region_size = 100  # for consistency fixed to 100
 34 | 
 35 | # k list for LOF algorithms, for constructing a pool of base detectors
 36 | k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
 37 |           110, 120, 130, 140, 150, 160, 170, 180, 190, 200]
 38 | 
 39 | n_clf = len(k_list)  # 20 base detectors
 40 | 
 41 | # for SG_AOM and SG_MOA, choose the right number of buckets
 42 | n_buckets = 5
 43 | n_clf_bucket = int(n_clf / n_buckets)
 44 | assert (n_clf % n_buckets == 0)  # in case wrong number of buckets
 45 | 
 46 | alpha = 0.2  # control the strength of dynamic ensemble selection
 47 | 
 48 | # flag for printing and output saving
 49 | verbose = False
 50 | 
 51 | ###############################################################################
 52 | 
 53 | if __name__ == '__main__':
 54 | 
 55 |     X_orig, y_orig, outlier_perc = loaddata(data)
 56 | 
 57 |     # initialize the matrix for storing scores
 58 |     roc_mat = np.zeros([n_ite, n_baselines])  # area under the ROC curve
 59 |     ap_mat = np.zeros([n_ite, n_baselines])  # average precision
 60 |     prc_mat = np.zeros([n_ite, n_baselines])  # precision @ m
 61 | 
 62 |     for t in range(n_ite):
 63 |         print('\nn_ite', t + 1, data)  # print status
 64 | 
 65 |         # split the data into training and testing
 66 |         X_train, X_test, y_train, y_test = train_test_split(X_orig, y_orig,
 67 |                                                             test_size=test_size)
 68 |         # normalize the data
 69 |         X_train_norm, X_test_norm = standardizer(X_train, X_test)
 70 | 
 71 |         train_scores = np.zeros([X_train.shape[0], n_clf])
 72 |         test_scores = np.zeros([X_test.shape[0], n_clf])
 73 | 
 74 |         # initialize the lists to store the results
 75 |         test_target_list = []
 76 |         method_list = []
 77 | 
 78 |         # generate a pool of detectors and predict on test instances
 79 |         train_scores, test_scores = train_predict_lof(k_list, X_train_norm,
 80 |                                                       X_test_norm,
 81 |                                                       train_scores,
 82 |                                                       test_scores)
 83 | 
 84 |         #######################################################################
 85 |         # generate normalized scores
 86 |         train_scores_norm, test_scores_norm = standardizer(train_scores,
 87 |                                                            test_scores)
 88 |         # generate mean and max outputs
 89 |         # SG_A and SG_M
 90 |         target_test_mean = np.mean(test_scores_norm, axis=1)
 91 |         target_test_max = np.max(test_scores_norm, axis=1)
 92 |         test_target_list.extend([target_test_mean, target_test_max])
 93 |         method_list.extend(['sg_a', 'sg_m'])
 94 | 
 95 |         # generate pseudo target for training -> for calculating weights
 96 |         target_mean = np.mean(train_scores_norm, axis=1).reshape(-1, 1)
 97 |         target_max = np.max(train_scores_norm, axis=1).reshape(-1, 1)
 98 | 
 99 |         # higher value for more outlyingness
100 |         ranks_mean = rankdata(target_mean).reshape(-1, 1)
101 |         ranks_max = rankdata(target_max).reshape(-1, 1)
102 | 
103 |         # generate weighted mean
104 |         # weights are distance or pearson in different modes
105 |         clf_weights_pear = np.zeros([n_clf, 1])
106 |         for i in range(n_clf):
107 |             clf_weights_pear[i] = \
108 |                 pearsonr(target_mean, train_scores_norm[:, i].reshape(-1, 1))[
109 |                     0][0]
110 | 
111 |         # generate weighted mean
112 |         target_test_weighted_pear = np.sum(
113 |             test_scores_norm * clf_weights_pear.reshape(1,
114 |                                                         -1) / clf_weights_pear.sum(),
115 |             axis=1)
116 | 
117 |         test_target_list.append(target_test_weighted_pear)
118 |         method_list.append('sg_wa')
119 | 
120 |         # generate threshold sum
121 |         target_test_threshold = np.sum(test_scores_norm.clip(0), axis=1)
122 |         test_target_list.append(target_test_threshold)
123 |         method_list.append('sg_thresh')
124 | 
125 |         # generate average of maximum (SG_AOM) and maximum of average (SG_MOA)
126 |         target_test_aom = aom(test_scores_norm, n_buckets, n_clf)
127 |         target_test_moa = moa(test_scores_norm, n_buckets, n_clf)
128 |         test_target_list.extend([target_test_aom, target_test_moa])
129 |         method_list.extend(['aom', 'moa'])
130 |         ##################################################################
131 | 
132 |         # define local region using KD trees
133 |         tree = KDTree(X_train_norm)
134 |         dist_arr, ind_arr = tree.query(X_test_norm, k=loc_region_size)
135 | 
136 |         # different similarity measures
137 |         # s[euc]_w[rank] -> use euclidean distance for similarity measure
138 |         #                   use outlying rank as the weight
139 |         m_list = ['s[euc]_w[dist]', 's[euc]_w[rank]', 's[euc]_w[na]',
140 |                   's[pear]_w[dist]', 's[pear]_w[rank]', 's[pear]_w[na]']
141 | 
142 |         pred_scores_best = np.zeros([X_test.shape[0], len(m_list)])
143 |         pred_scores_ens = np.zeros([X_test.shape[0], len(m_list)])
144 | 
145 |         for i in range(X_test.shape[0]):  # iterate over all test instances
146 |             # get the neighbor idx of the current point
147 |             ind_k = ind_arr[i, :]
148 | 
149 |             # get the pseudo target: mean
150 |             target_k = target_mean[ind_k,].ravel()
151 | 
152 |             # get the current scores from all clf
153 |             curr_train_k = train_scores_norm[ind_k, :]
154 | 
155 |             # weights by rank
156 |             weights_k_rank = ranks_mean[ind_k]
157 | 
158 |             # weights by euclidean distance
159 |             dist_k = dist_arr[i, :].reshape(-1, 1)
160 |             weights_k_dist = dist_k.max() - dist_k
161 | 
162 |             # initialize containers for correlation
163 |             corr_dist_d = np.zeros([n_clf, ])
164 |             corr_dist_r = np.zeros([n_clf, ])
165 |             corr_dist_n = np.zeros([n_clf, ])
166 |             corr_pear_d = np.zeros([n_clf, ])
167 |             corr_pear_r = np.zeros([n_clf, ])
168 |             corr_pear_n = np.zeros([n_clf, ])
169 | 
170 |             for d in range(n_clf):
171 |                 # flip distance so larger values imply larger correlation
172 |                 corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d],
173 |                                             w=weights_k_dist) * -1
174 |                 corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d],
175 |                                             w=weights_k_rank) * -1
176 |                 corr_dist_n[d,] = euclidean(target_k,
177 |                                             curr_train_k[:, d]) * -1
178 |                 corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d],
179 |                                             w=weights_k_dist)
180 |                 corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d],
181 |                                             w=weights_k_rank)
182 |                 corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[
183 |                     0]
184 | 
185 |             corr_list = [corr_dist_d, corr_dist_r, corr_dist_n,
186 |                          corr_pear_d, corr_pear_r, corr_pear_n]
187 | 
188 |             for j in range(len(m_list)):
189 |                 corr_k = corr_list[j]
190 | 
191 |                 # pick the best one
192 |                 best_clf_ind = np.nanargmax(corr_k)
193 |                 pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind]
194 | 
195 |                 # pick the number of detectors p dynamically
196 |                 threshold = corr_k.max() - corr_k.std() * alpha
197 |                 p = (corr_k >= threshold).sum()
198 |                 if p == 0:  # extreme cases (NaN or all -1 correlations)
199 |                     p = 1
200 |                 pred_scores_ens[i, j] = np.max(
201 |                     test_scores_norm[i, argmaxp(corr_k, p)])
202 | 
203 |         for m in range(len(m_list)):
204 |             test_target_list.extend([pred_scores_best[:, m],
205 |                                      pred_scores_ens[:, m]])
206 |             method_list.extend(['DCSO_a_' + m_list[m],
207 |                                 'DCSO_moa_' + m_list[m]])
208 |         ######################################################################
209 | 
210 |         # use max for pseudo ground truth generation
211 |         tree = KDTree(X_train_norm)
212 |         dist_arr, ind_arr = tree.query(X_test_norm, k=loc_region_size)
213 | 
214 |         pred_scores_best = np.zeros([X_test.shape[0], len(m_list)])
215 |         pred_scores_ens = np.zeros([X_test.shape[0], len(m_list)])
216 | 
217 |         for i in range(X_test.shape[0]):  # X_test_norm.shape[0]
218 |             # get the neighbor idx of the current point
219 |             ind_k = ind_arr[i, :]
220 | 
221 |             # get the pseudo target: max
222 |             target_k = target_max[ind_k,].ravel()
223 | 
224 |             # get the current scores from all clf
225 |             curr_train_k = train_scores_norm[ind_k, :]
226 | 
227 |             # weights by rank
228 |             weights_k_rank = ranks_max[ind_k]
229 |             # weights by distance
230 |             dist_k = dist_arr[i, :].reshape(-1, 1)
231 |             weights_k_dist = dist_k.max() - dist_k
232 | 
233 |             corr_dist_d = np.zeros([n_clf, ])
234 |             corr_dist_r = np.zeros([n_clf, ])
235 |             corr_dist_n = np.zeros([n_clf, ])
236 |             corr_pear_d = np.zeros([n_clf, ])
237 |             corr_pear_r = np.zeros([n_clf, ])
238 |             corr_pear_n = np.zeros([n_clf, ])
239 | 
240 |             for d in range(n_clf):
241 |                 corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d],
242 |                                             w=weights_k_dist) * -1
243 |                 corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d],
244 |                                             w=weights_k_rank) * -1
245 |                 corr_dist_n[d,] = euclidean(target_k,
246 |                                             curr_train_k[:, d]) * -1
247 |                 corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d],
248 |                                             w=weights_k_dist)
249 |                 corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d],
250 |                                             w=weights_k_rank)
251 |                 corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[
252 |                     0]
253 | 
254 |             corr_list = [corr_dist_d, corr_dist_r, corr_dist_n,
255 |                          corr_pear_d, corr_pear_r, corr_pear_n]
256 | 
257 |             for j in range(len(m_list)):
258 |                 corr_k = corr_list[j]
259 | 
260 |                 # pick the best one
261 |                 best_clf_ind = np.nanargmax(corr_k)
262 |                 pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind]
263 | 
264 |                 # pick the number of detectors p dynamically
265 |                 threshold = corr_k.max() - corr_k.std() * alpha
266 |                 p = (corr_k >= threshold).sum()
267 |                 if p == 0:  # extreme cases (NaN or all -1 correlations)
268 |                     p = 1
269 |                 pred_scores_ens[i, j] = np.mean(
270 |                     test_scores_norm[i, argmaxp(corr_k, p)])
271 | 
272 |         for m in range(len(m_list)):
273 |             test_target_list.extend([pred_scores_best[:, m],
274 |                                      pred_scores_ens[:, m]])
275 |             method_list.extend(['DCSO_m_' + m_list[m],
276 |                                 'DCSO_aom_' + m_list[m]])
277 | 
278 |         # store performance information and print result
279 |         for i in range(n_baselines):
280 |             roc_mat[t, i] = roc_auc_score(y_test, test_target_list[i])
281 |             ap_mat[t, i] = average_precision_score(y_test,
282 |                                                    test_target_list[i])
283 |             prc_mat[t, i] = precision_n_score(y_test, test_target_list[i])
284 |             print(method_list[i], roc_mat[t, i])
285 | 
286 |     # print and save the result
287 |     # default location is /results/***.csv
288 |     print_save_result(data, base_detector, n_baselines, n_clf, n_ite, roc_mat,
289 |                       ap_mat, prc_mat, method_list, timestamp, verbose)
290 | 
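
The per-point selection step used in both loops above (keep every detector whose local correlation lies within `alpha` standard deviations of the best one, then combine the selected test scores by maximum or by mean) can be sketched in isolation. This is a minimal illustration using only the standard library, not the repository's implementation: `select_detectors`, the value `alpha=0.2`, and the sample numbers are all hypothetical, and NaN handling is left out.

```python
import statistics


def select_detectors(corr, alpha=0.2):
    """Return indices of detectors whose correlation is within
    alpha population standard deviations of the maximum."""
    # pstdev is the population standard deviation, like numpy's default std()
    threshold = max(corr) - alpha * statistics.pstdev(corr)
    # the best detector always passes the test, so the result is never empty
    return [i for i, c in enumerate(corr) if c >= threshold]


# hypothetical local correlations of four detectors with the pseudo target
corr = [0.90, 0.88, 0.40, 0.10]
# hypothetical standardized test scores of the same detectors for one point
test_scores = [1.2, 0.8, -0.3, 2.5]

idx = select_detectors(corr, alpha=0.2)
score_avg = sum(test_scores[i] for i in idx) / len(idx)  # mean of selected
score_max = max(test_scores[i] for i in idx)             # max of selected
print(idx, score_avg, score_max)
```

A larger `alpha` lowers the threshold and admits more detectors; `alpha = 0` keeps only the detectors tied for the best correlation.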


--------------------------------------------------------------------------------
/md_figs/flowchart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/flowchart.png


--------------------------------------------------------------------------------
/md_figs/knn_prc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/knn_prc.png


--------------------------------------------------------------------------------
/md_figs/knn_roc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/knn_roc.png


--------------------------------------------------------------------------------
/md_figs/lof_prc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/lof_prc.png


--------------------------------------------------------------------------------
/md_figs/lof_roc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/lof_roc.png


--------------------------------------------------------------------------------
/md_figs/tsne.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/md_figs/tsne.png


--------------------------------------------------------------------------------
/models/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/models/__init__.py


--------------------------------------------------------------------------------
/models/combination.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | 
 3 | def aom(scores, n_buckets, n_estimators, standard=True):
 4 |     '''
 5 |     Average of Maximum - An ensemble method for outlier detection
 6 | 
 7 |     Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms
 8 |     for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.
 9 | 
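10 |     :param scores: score matrix of shape (n_samples, n_estimators)
11 |     :param n_buckets: number of buckets the estimators are split into
12 |     :param n_estimators: number of estimators (columns of scores)
13 |     :param standard: currently unused
14 |     :return: mean over buckets of the within-bucket maximum, per sample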
15 |     '''
16 |     scores = np.asarray(scores)
17 |     if scores.shape[1] != n_estimators:
18 |         raise ValueError('score matrix should be n_samples by n_estimators')
19 | 
20 |     scores_aom = np.zeros([scores.shape[0], n_buckets])
21 | 
22 |     n_estimators_per_bucket = int(n_estimators / n_buckets)
23 |     if n_estimators % n_buckets != 0:  # the loop below needs equal buckets
24 |         raise ValueError('n_estimators must be divisible by n_buckets')
25 | 
26 |     # shuffle the estimator order
27 |     estimators_list = list(range(0, n_estimators, 1))
28 |     np.random.shuffle(estimators_list)
29 | 
30 |     head = 0
31 |     for i in range(0, n_estimators, n_estimators_per_bucket):
32 |         tail = i + n_estimators_per_bucket
33 |         batch_ind = int(i / n_estimators_per_bucket)
34 | 
35 |         scores_aom[:, batch_ind] = np.max(
36 |             scores[:, estimators_list[head:tail]], axis=1)
37 | 
38 |         head = head + n_estimators_per_bucket
39 |         tail = tail + n_estimators_per_bucket
40 | 
41 |     return np.mean(scores_aom, axis=1)
42 | 
43 | 
44 | def moa(scores, n_buckets, n_estimators):
45 |     '''
46 |     Maximum of Average - An ensemble method for outlier detection
47 | 
48 |     Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms
49 |     for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.
50 | 
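51 |     :param scores: score matrix of shape (n_samples, n_estimators)
52 |     :param n_buckets: number of buckets the estimators are split into
53 |     :param n_estimators: number of estimators (columns of scores)
54 |     :return: maximum over buckets of the within-bucket mean, one value
55 |         per sample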
56 |     '''
57 |     scores = np.asarray(scores)
58 |     if scores.shape[1] != n_estimators:
59 |         raise ValueError('score matrix should be n_samples by n_estimators')
60 | 
61 |     scores_moa = np.zeros([scores.shape[0], n_buckets])
62 | 
63 |     n_estimators_per_bucket = int(n_estimators / n_buckets)
64 |     if n_estimators % n_buckets != 0:  # the loop below needs equal buckets
65 |         raise ValueError('n_estimators must be divisible by n_buckets')
66 | 
67 |     # shuffle the estimator order
68 |     estimators_list = list(range(0, n_estimators, 1))
69 |     np.random.shuffle(estimators_list)
70 | 
71 |     head = 0
72 |     for i in range(0, n_estimators, n_estimators_per_bucket):
73 |         tail = i + n_estimators_per_bucket
74 |         batch_ind = int(i / n_estimators_per_bucket)
75 | 
76 |         scores_moa[:, batch_ind] = np.mean(
77 |             scores[:, estimators_list[head:tail]], axis=1)
78 | 
79 |         head = head + n_estimators_per_bucket
80 |         tail = tail + n_estimators_per_bucket
81 | 
82 |     return np.max(scores_moa, axis=1)
83 | 


--------------------------------------------------------------------------------
/models/knn.py:
--------------------------------------------------------------------------------
  1 | import numpy as np
  2 | from sklearn.preprocessing import MinMaxScaler
  3 | from sklearn.neighbors import NearestNeighbors
  4 | from sklearn.neighbors import KDTree
  5 | from sklearn.exceptions import NotFittedError
  6 | from scipy.stats import scoreatpercentile
  7 | from scipy.stats import rankdata
  8 | from scipy.special import erf
  9 | 
 10 | 
 11 | class Knn(object):
 12 |     '''
 13 |     Knn class for outlier detection
 14 |     support original knn, average knn, and median knn
 15 |     '''
 16 | 
 17 |     def __init__(self, n_neighbors=1, contamination=0.05, method='largest'):
 18 |         self.n_neighbors = n_neighbors
 19 |         self.contamination = contamination
 20 |         self.method = method
 21 | 
 22 |     def fit(self, X_train):
 23 |         self.X_train = X_train
 24 |         self._isfitted = True
 25 |         self.tree = KDTree(X_train)
 26 | 
 27 |         neigh = NearestNeighbors(n_neighbors=self.n_neighbors)
 28 |         neigh.fit(self.X_train)
 29 | 
 30 |         result = neigh.kneighbors(n_neighbors=self.n_neighbors,
 31 |                                   return_distance=True)
 32 |         dist_arr = result[0]
 33 | 
 34 |         if self.method == 'largest':
 35 |             dist = dist_arr[:, -1]
 36 |         elif self.method == 'mean':
 37 |             dist = np.mean(dist_arr, axis=1)
 38 |         elif self.method == 'median':
 39 |             dist = np.median(dist_arr, axis=1)
 40 | 
 41 |         self.threshold = scoreatpercentile(dist,
 42 |                                            100 * (1 - self.contamination))
 43 |         self.decision_scores = dist.ravel()
 44 |         self.y_pred = (self.decision_scores > self.threshold).astype('int')
 45 | 
 46 |         self.mu = np.mean(self.decision_scores)
 47 |         self.sigma = np.std(self.decision_scores)
 48 | 
 49 |     def decision_function(self, X_test):
 50 | 
 51 |         if not getattr(self, '_isfitted', False):
 52 |             raise NotFittedError('Knn is not fitted yet')
 53 | 
 54 |         # initialize the output score
 55 |         pred_score = np.zeros([X_test.shape[0], 1])
 56 | 
 57 |         for i in range(X_test.shape[0]):
 58 |             x_i = X_test[i, :]
 59 |             x_i = np.asarray(x_i).reshape(1, x_i.shape[0])
 60 | 
 61 |             # get the distance of the current point
 62 |             dist_arr, ind_arr = self.tree.query(x_i, k=self.n_neighbors)
 63 | 
 64 |             if self.method == 'largest':
 65 |                 dist = dist_arr[:, -1]
 66 |             elif self.method == 'mean':
 67 |                 dist = np.mean(dist_arr, axis=1)
 68 |             elif self.method == 'median':
 69 |                 dist = np.median(dist_arr, axis=1)
 70 | 
 71 |             pred_score_i = dist[-1]
 72 | 
 73 |             # record the current item
 74 |             pred_score[i, :] = pred_score_i
 75 | 
 76 |         return pred_score
 77 | 
 78 |     def predict(self, X_test):
 79 |         pred_score = self.decision_function(X_test)
 80 |         return (pred_score > self.threshold).astype('int')
 81 | 
 82 |     def predict_proba(self, X_test, method='linear'):
 83 |         test_scores = self.decision_function(X_test)
 84 |         train_scores = self.decision_scores
 85 | 
 86 |         if method == 'linear':
 87 |             scaler = MinMaxScaler().fit(train_scores.reshape(-1, 1))
 88 |             proba = scaler.transform(test_scores.reshape(-1, 1))
 89 |             return proba.clip(0, 1)
 90 |         else:
 91 |             # turn output into probability
 92 |             pre_erf_score = (test_scores - self.mu) / (self.sigma * np.sqrt(2))
 93 |             erf_score = erf(pre_erf_score)
 94 |             proba = erf_score.clip(0)
 95 | 
 96 |             # TODO: move to testing code
 97 |             assert (proba.min() >= 0)
 98 |             assert (proba.max() <= 1)
 99 |             return proba
100 | 
101 |     def predict_rank(self, X_test):
102 |         test_scores = self.decision_function(X_test)
103 |         train_scores = self.decision_scores
104 | 
105 |         ranks = np.zeros([X_test.shape[0], 1])
106 | 
107 |         for i in range(test_scores.shape[0]):
108 |             train_scores_i = np.append(train_scores.reshape(-1, 1),
109 |                                        test_scores[i])
110 | 
111 |             ranks[i] = rankdata(train_scores_i)[-1]
112 | 
113 |         # return normalized ranks
114 |         ranks_norm = ranks / ranks.max()
115 |         return ranks_norm
116 | 
117 | ##############################################################################
118 | # samples = [[-1, 0], [0., 0.], [1., 1], [2., 5.], [3, 1]]
119 | #
120 | # clf = Knn()
121 | # clf.fit(samples)
122 | #
123 | # scores = clf.decision_function(np.asarray([[2, 3], [6, 8]])).ravel()
124 | # assert (scores[0] == [2])
125 | # assert (scores[1] == [5])
126 | # #
127 | # labels = clf.predict(np.asarray([[2, 3], [6, 8]])).ravel()
128 | # assert (labels[0] == [0])
129 | # assert (labels[1] == [1])
130 | 


--------------------------------------------------------------------------------
/models/lof.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | from sklearn.neighbors import LocalOutlierFactor
 3 | from sklearn.preprocessing import MinMaxScaler
 4 | from scipy.stats import rankdata
 5 | from scipy.special import erf
 6 | 
 7 | class Lof(LocalOutlierFactor):
 8 |     def fit(self, X_train, y=None):
 9 |         self.X_train = X_train
10 |         super().fit(X=X_train, y=y)
11 |         return self
12 | 
13 |     def predict(self, X_test):
14 |         return self._predict(X=X_test)
15 | 
16 |     def decision_function(self, X_test):
17 |         return self._decision_function(X_test)
18 | 
19 |     def predict_proba(self, X_test, method='linear'):
20 |         train_scores = self.negative_outlier_factor_ * -1
21 |         test_scores = self.decision_function(X_test) * -1
22 |         if method == 'linear':
23 |             scaler = MinMaxScaler().fit(train_scores.reshape(-1, 1))
24 |             proba = scaler.transform(test_scores.reshape(-1, 1))
25 |             return proba.clip(0, 1)
26 |         else:
27 |             mu = np.mean(train_scores)
28 |             sigma = np.std(train_scores)
29 | 
30 |             # turn output into probability
31 |             pre_erf_score = (test_scores - mu) / (sigma * np.sqrt(2))
32 |             erf_score = erf(pre_erf_score)
33 |             proba = erf_score.clip(0)
34 | 
35 |             # TODO: move to testing code
36 |             assert (proba.min() >= 0)
37 |             assert (proba.max() <= 1)
38 | 
39 |             return proba
40 | 
41 |     def predict_rank(self, X_test):
42 | 
43 |         train_scores = self.decision_function(self.X_train) * -1
44 |         test_scores = self.decision_function(X_test) * -1
45 |         ranks = np.zeros([X_test.shape[0], 1])
46 | 
47 |         for i in range(test_scores.shape[0]):
48 |             train_scores_i = np.append(train_scores.reshape(-1, 1),
49 |                                        test_scores[i])
50 |             ranks[i] = rankdata(train_scores_i)[-1]
51 | 
52 |         # return normalized ranks
53 |         ranks_norm = ranks / ranks.max()
54 |         return ranks_norm
55 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy>=1.13
2 | scipy>=0.19
3 | scikit_learn>=0.19
4 | 


--------------------------------------------------------------------------------
/utility/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/utility/__init__.py


--------------------------------------------------------------------------------
/utility/stat_models.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | from scipy.special import betainc
 3 | from scipy.stats import pearsonr
 4 | from scipy.stats import scoreatpercentile
 5 | from sklearn.metrics import precision_score
 6 | from sklearn.preprocessing import StandardScaler
 7 | 
 8 | 
 9 | def wpearsonr(x, y, w=None):
10 |     # https://stats.stackexchange.com/questions/221246/such-thing-as-a-weighted-correlation
11 | 
12 |     # unweighted version
13 |     if w is None:
14 |         return pearsonr(x, y)
15 | 
16 |     x = np.asarray(x).ravel()
17 |     y = np.asarray(y).ravel()
18 |     w = np.asarray(w).ravel()  # (k, 1) weights would broadcast to (k, k)
19 | 
20 |     n = len(x)
21 | 
22 |     w_sum = w.sum()
23 |     mx = np.sum(x * w) / w_sum
24 |     my = np.sum(y * w) / w_sum
25 | 
26 |     xm, ym = (x - mx), (y - my)
27 | 
28 |     r_num = np.sum(xm * ym * w) / w_sum
29 | 
30 |     xm2 = np.sum(xm * xm * w) / w_sum
31 |     ym2 = np.sum(ym * ym * w) / w_sum
32 | 
33 |     r_den = np.sqrt(xm2 * ym2)
34 |     r = r_num / r_den
35 | 
36 |     r = max(min(r, 1.0), -1.0)
37 |     #    df = n - 2
38 |     #
39 |     #    if abs(r) == 1.0:
40 |     #        prob = 0.0
41 |     #    else:
42 |     #        t_squared = r ** 2 * (df / ((1.0 - r) * (1.0 + r)))
43 |     #        prob = _betai(0.5 * df, 0.5, df / (df + t_squared))
44 |     return r  # , prob
45 | 
46 | 
47 | #####################################
48 | #      PROBABILITY CALCULATIONS     #
49 | #####################################
50 | 
51 | 
52 | def _betai(a, b, x):
53 |     x = np.asarray(x)
54 |     x = np.where(x < 1.0, x, 1.0)  # if x > 1 then return 1.0
55 |     return betainc(a, b, x)
56 | 
57 | 
58 | def pearsonr_mat(mat, w=None):
59 |     n_row = mat.shape[0]
60 |     n_col = mat.shape[1]
61 |     pear_mat = np.full([n_row, n_row], 1).astype(float)
62 | 
63 |     if w is not None:
64 |         for cx in range(n_row):
65 |             for cy in range(cx + 1, n_row):
66 |                 curr_pear = wpearsonr(mat[cx, :], mat[cy, :], w)
67 |                 pear_mat[cx, cy] = curr_pear
68 |                 pear_mat[cy, cx] = curr_pear
69 |     else:
70 |         for cx in range(n_row):
71 |             for cy in range(cx + 1, n_row):
72 |                 curr_pear = pearsonr(mat[cx, :], mat[cy, :])[0]
73 |                 pear_mat[cx, cy] = curr_pear
74 |                 pear_mat[cy, cx] = curr_pear
75 | 
76 |     return pear_mat
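
The weighted correlation computed by `wpearsonr` can be checked by hand on a tiny example. The sketch below is a standard-library re-derivation for equal-length 1-D sequences, assuming one weight per observation; `weighted_pearson` and the sample data are hypothetical and only meant to show how down-weighting a point changes the result.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation for equal-length 1-D sequences."""
    w_sum = sum(w)
    # weighted means
    mx = sum(xi * wi for xi, wi in zip(x, w)) / w_sum
    my = sum(yi * wi for yi, wi in zip(y, w)) / w_sum
    # weighted covariance and variances
    cov = sum(wi * (xi - mx) * (yi - my) for xi, yi, wi in zip(x, y, w)) / w_sum
    var_x = sum(wi * (xi - mx) ** 2 for xi, wi in zip(x, w)) / w_sum
    var_y = sum(wi * (yi - my) ** 2 for yi, wi in zip(y, w)) / w_sum
    r = cov / math.sqrt(var_x * var_y)
    return max(min(r, 1.0), -1.0)  # clamp rounding noise into [-1, 1]


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]        # exactly linear in x
y_noisy = [2.0, 4.0, 6.0, 0.0]  # last point breaks the trend

r_linear = weighted_pearson(x, y, [1, 1, 1, 1])
r_equal = weighted_pearson(x, y_noisy, [1, 1, 1, 1])
r_down = weighted_pearson(x, y_noisy, [1, 1, 1, 0.01])  # down-weight last point
print(r_linear, r_equal, r_down)
```

With equal weights the formula reduces to the ordinary Pearson correlation; shrinking the weight of the last point moves the result back toward the trend of the first three.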
77 | 


--------------------------------------------------------------------------------
/utility/utility.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import pathlib
  3 | 
  4 | import numpy as np
  5 | import scipy.io as scio
  6 | from scipy.stats import scoreatpercentile
  7 | from sklearn.metrics import precision_score
  8 | from sklearn.preprocessing import StandardScaler
  9 | 
 10 | from models.lof import Lof
 11 | from models.knn import Knn
 12 | 
 13 | 
 14 | def get_label_n(y, y_pred):
 15 |     out_perc = np.count_nonzero(y) / len(y)
 16 |     threshold = scoreatpercentile(y_pred, 100 * (1 - out_perc))
 17 |     y_pred = (y_pred > threshold).astype('int')
 18 |     return y_pred
 19 | 
 20 | 
 21 | def standardizer(X_train, X_test):
 22 |     '''
 23 |     normalization function wrapper
 24 |     :param X_train:
 25 |     :param X_test:
 26 |     :return: X_train and X_test after the Z-score normalization
 27 |     '''
 28 |     scaler = StandardScaler().fit(X_train)
 29 |     return scaler.transform(X_train), scaler.transform(X_test)
 30 | 
 31 | 
 32 | def precision_n_score(y, y_pred):
 33 |     '''
 34 |     Utility function to calculate precision@n
 35 |     :param y: ground truth labels (1 for outliers, 0 for inliers)
 36 |     :param y_pred: predicted outlier scores (higher means more abnormal)
 37 |     :return: score
 38 |     '''
 39 |     # calculate the percentage of outliers
 40 |     out_perc = np.count_nonzero(y) / len(y)
 41 | 
 42 |     threshold = scoreatpercentile(y_pred, 100 * (1 - out_perc))
 43 |     y_pred = (y_pred > threshold).astype('int')
 44 |     return precision_score(y, y_pred)
 45 | 
 46 | 
 47 | def argmaxp(a, p):
 48 |     '''
 49 |     Utility function to return the index of top p values in a
 50 |     :param a: list variable
 51 |     :param p: number of elements to select
 52 |     :return: index of top p elements in a
 53 |     '''
 54 | 
 55 |     a = np.asarray(a).ravel()
 56 |     length = a.shape[0]
 57 |     pth = np.argpartition(a, length - p)
 58 |     return pth[length - p:]
 59 | 
 60 | 
 61 | def loaddata(filename):
 62 |     '''
 63 |     load data
 64 |     :param filename:
 65 |     :return:
 66 |     '''
 67 |     mat = scio.loadmat(os.path.join('datasets', filename + '.mat'))
 68 |     X_orig = mat['X']
 69 |     y_orig = mat['y'].ravel()
 70 |     outlier_perc = np.count_nonzero(y_orig) / len(y_orig)
 71 | 
 72 |     return X_orig, y_orig, outlier_perc
 73 | 
 74 | 
 75 | def train_predict_lof(k_list, X_train_norm, X_test_norm, train_scores,
 76 |                       test_scores):
 77 |     # initialize base detectors
 78 |     clf_list = []
 79 |     for k in k_list:
 80 |         clf = Lof(n_neighbors=k)
 81 |         clf.fit(X_train_norm)
 82 |         train_score = clf.negative_outlier_factor_ * -1
 83 |         test_score = clf.decision_function(X_test_norm) * -1
 84 |         clf_name = 'lof_' + str(k)
 85 | 
 86 |         clf_list.append(clf_name)
 87 |         curr_ind = len(clf_list) - 1
 88 | 
 89 |         train_scores[:, curr_ind] = train_score.ravel()
 90 |         test_scores[:, curr_ind] = test_score.ravel()
 91 | 
 92 |     return train_scores, test_scores
 93 | 
 94 | 
 95 | def train_predict_knn(k_list, X_train_norm, X_test_norm, train_scores,
 96 |                       test_scores):
 97 |     # initialize base detectors
 98 |     clf_list = []
 99 |     for k in k_list:
100 |         clf = Knn(n_neighbors=k, method='largest')
101 |         clf.fit(X_train_norm)
102 |         train_score = clf.decision_scores
103 |         test_score = clf.decision_function(X_test_norm)
104 |         clf_name = 'knn_' + str(k)
105 | 
106 |         clf_list.append(clf_name)
107 |         curr_ind = len(clf_list) - 1
108 | 
109 |         train_scores[:, curr_ind] = train_score.ravel()
110 |         test_scores[:, curr_ind] = test_score.ravel()
111 | 
112 |     return train_scores, test_scores
113 | 
114 | 
115 | def print_save_result(data, base_detector, n_baselines, n_clf, n_ite, roc_mat,
116 |                       ap_mat, prc_mat, method_list, timestamp, verbose):
117 |     '''
118 |     :param data:
119 |     :param base_detector:
120 |     :param n_baselines:
121 |     :param n_clf:
122 |     :param n_ite:
123 |     :param roc_mat:
124 |     :param ap_mat:
125 |     :param prc_mat:
126 |     :param method_list:
127 |     :param timestamp:
128 |     :param verbose:
129 |     :return: None
130 |     '''
131 | 
132 |     roc_scores = np.round(np.mean(roc_mat, axis=0), decimals=4)
133 |     ap_scores = np.round(np.mean(ap_mat, axis=0), decimals=4)
134 |     prc_scores = np.round(np.mean(prc_mat, axis=0), decimals=4)
135 | 
136 |     method_np = np.asarray(method_list)
137 | 
138 |     top_roc_ind = argmaxp(roc_scores, 1)
139 |     top_ap_ind = argmaxp(ap_scores, 1)
140 |     top_prc_ind = argmaxp(prc_scores, 1)
141 | 
142 |     top_roc_clf = method_np[top_roc_ind].tolist()[0]
143 |     top_ap_clf = method_np[top_ap_ind].tolist()[0]
144 |     top_prc_clf = method_np[top_prc_ind].tolist()[0]
145 | 
146 |     top_roc = np.round(roc_scores[top_roc_ind][0], decimals=4)
147 |     top_ap = np.round(ap_scores[top_ap_ind][0], decimals=4)
148 |     top_prc = np.round(prc_scores[top_prc_ind][0], decimals=4)
149 | 
150 |     roc_diff = np.round(100 * (top_roc - roc_scores) / roc_scores, decimals=4)
151 |     ap_diff = np.round(100 * (top_ap - ap_scores) / ap_scores, decimals=4)
152 |     prc_diff = np.round(100 * (top_prc - prc_scores) / prc_scores, decimals=4)
153 | 
154 |     # initialize the log directory if it does not exist
155 |     pathlib.Path('results').mkdir(parents=True, exist_ok=True)
156 | 
157 |     # create the file if it does not exist
158 |     f = open(os.path.join(
159 |         'results', data + '_' + base_detector + '_' + timestamp + '.csv'),
160 |         'a')
161 |     if verbose:
162 |         f.writelines('method, '
163 |                      'roc, best_roc, diff_roc,'
164 |                      'ap, best_ap, diff_ap,'
165 |                      'p@m, best_p@m, diff_p@m,'
166 |                      'best roc, best ap, best prc, n_ite, n_clf')
167 |     else:
168 |         f.writelines('method, '
169 |                      'roc, ap, p@m,'
170 |                      'best roc, best ap, best prc, '
171 |                      'n_ite, n_clf')
172 | 
173 |     print('method, roc, ap, p@m, best roc, best ap, best prc')
174 |     delim = ','
175 |     for i in range(n_baselines):
176 |         print(method_list[i], roc_scores[i], ap_scores[i], prc_scores[i],
177 |               top_roc_clf, top_ap_clf, top_prc_clf)
178 | 
179 |         if verbose:
180 |             f.writelines(
181 |                 '\n' + str(method_list[i]) + delim +
182 |                 str(roc_scores[i]) + delim + str(top_roc) + delim + str(
183 |                     roc_diff[i]) + delim +
184 |                 str(ap_scores[i]) + delim + str(top_ap) + delim + str(
185 |                     ap_diff[i]) + delim +
186 |                 str(prc_scores[i]) + delim + str(top_prc) + delim + str(
187 |                     prc_diff[i]) + delim +
188 |                 top_roc_clf + delim +
189 |                 top_ap_clf + delim +
190 |                 top_prc_clf + delim +
191 |                 str(n_ite) + delim +
192 |                 str(n_clf))
193 |         else:
194 |             f.writelines(
195 |                 '\n' + str(method_list[i]) + delim +
196 |                 str(roc_scores[i]) + delim +
197 |                 str(ap_scores[i]) + delim +
198 |                 str(prc_scores[i]) + delim +
199 |                 top_roc_clf + delim +
200 |                 top_ap_clf + delim +
201 |                 top_prc_clf + delim +
202 |                 str(n_ite) + delim +
203 |                 str(n_clf))
204 | 
205 |     f.close()
206 | 


--------------------------------------------------------------------------------
/vis_tsne.py:
--------------------------------------------------------------------------------
  1 | import math
  2 | import pathlib
  3 | import numpy as np
  4 | from scipy.stats import rankdata
  5 | from scipy.stats import pearsonr
  6 | from scipy.spatial.distance import euclidean
  7 | 
  8 | import matplotlib.pyplot as plt
  9 | 
 10 | from sklearn.model_selection import train_test_split
 11 | from sklearn.neighbors import KDTree
 12 | 
 13 | from sklearn.manifold import TSNE
 14 | 
 15 | from models.combination import aom, moa
 16 | from models.lof import Lof
 17 | from utility.stat_models import wpearsonr
 18 | from utility.utility import argmaxp, loaddata, standardizer, get_label_n
 19 | 
 20 | # set numpy parameters
 21 | np.set_printoptions(suppress=True, precision=4)
 22 | # generates the visualization for all datasets
 23 | data_list = ["Annthyroid",
 24 |              "Pendigits",
 25 |              "Satellite",
 26 |              "Pima",
 27 |              "Letter",
 28 |              "Thyroid",
 29 |              "Vowels",
 30 |              "Cardio",
 31 |              "Mnist"]
 32 | DCSO_best_list = [186, 38, 71, 103, 233, 157, 128, 127, 97]
 33 | 
 34 | for data, DCSO_best in zip(data_list, DCSO_best_list):
 35 | 
 36 |     print('processing', data)
 37 |     X_test_list = []
 38 |     X_test_name_list = []
 39 |     DCSO_best_list = []
 40 |     test_target_list_list = []
 41 |     y_test_list = []
 42 |     trans_data_list = []
 43 | 
 44 |     X_orig, y_orig, outlier_perc = loaddata(data)
 45 | 
 46 |     ite = 1  # number of iterations
 47 |     test_size = 0.4  # training = 60%, testing = 40%
 48 |     final_k_list = [10, 30, 60, 100]
 49 |     n_methods = 253
 50 | 
 51 |     k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
 52 |               110, 120, 130, 140, 150, 160, 170, 180, 190, 200]
 53 | 
 54 |     n_clf = len(k_list)
 55 |     fixed_range = [5, 10, 15]
 56 | 
 57 |     # for AOM and MOA, choose the right number of buckets
 58 |     n_buckets = 5
 59 |     n_clf_bucket = int(n_clf / n_buckets)
 60 |     assert (n_clf % n_buckets == 0)  # in case wrong number of buckets
 61 | 
 62 |     # split the data into training and testing
 63 |     # fix random_state to 42 so the visualization is reproducible
 64 |     X_train, X_test, y_train, y_test = train_test_split(X_orig, y_orig,
 65 |                                                         test_size=test_size,
 66 |                                                         random_state=42)
 67 | 
 68 |     # generate the normalized data
 69 |     X_train_norm, X_test_norm = standardizer(X_train, X_test)
 70 | 
 71 |     train_scores = np.zeros([X_train.shape[0], n_clf])
 72 |     test_scores = np.zeros([X_test.shape[0], n_clf])
 73 | 
 74 |     # initialize the lists to store the results
 75 |     test_target_list = []
 76 |     method_list = []
 77 |     k_rec_list = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # zeros for non dcs
 78 | 
 79 |     clf_list = []
 80 |     for k in k_list:
 81 |         clf = Lof(n_neighbors=k)
 82 |         clf.fit(X_train_norm)
 83 |         train_score = clf.negative_outlier_factor_ * -1
 84 |         test_score = clf.decision_function(X_test_norm) * -1
 85 |         clf_name = 'lof_' + str(k)
 86 | 
 87 |         clf_list.append(clf_name)
 88 |         curr_ind = len(clf_list) - 1
 89 | 
 90 |         train_scores[:, curr_ind] = train_score.ravel()
 91 |         test_scores[:, curr_ind] = test_score.ravel()
 92 | 
 93 |     #######################################################################
 94 |     # generate normalized scores
 95 |     train_scores_norm, test_scores_norm = standardizer(train_scores,
 96 |                                                        test_scores)
 97 | 
 98 |     # make sure the scores are actually standardized
 99 |     assert (math.isclose(train_scores_norm.mean(), 0, abs_tol=0.1))
100 |     #    assert(math.isclose(test_scores_norm.mean(), 0, abs_tol=0.1))
101 |     assert (math.isclose(train_scores_norm.std(), 1, abs_tol=0.1))
102 |     #    assert(math.isclose(test_scores_norm.std(), 1, abs_tol=0.1))
103 | 
104 |     # generate mean and max outputs
105 |     target_test_mean = np.mean(test_scores_norm, axis=1)
106 |     target_test_max = np.max(test_scores_norm, axis=1)
107 |     test_target_list.extend([target_test_mean, target_test_max])
108 |     method_list.extend(['mean', 'max'])
109 | 
110 |     # generate pseudo target for training -> for calculating weights
111 |     target_mean = np.mean(train_scores_norm, axis=1).reshape(-1, 1)
112 |     target_max = np.max(train_scores_norm, axis=1).reshape(-1, 1)
113 |     # higher value for more outlierness
114 |     ranks_mean = rankdata(target_mean).reshape(-1, 1)
115 |     ranks_max = rankdata(target_max).reshape(-1, 1)
116 | 
117 |     # compute per-detector weights against the mean pseudo target
118 |     # weights are Pearson correlation or flipped Euclidean distance
119 |     clf_weights_pear = np.zeros([n_clf, 1])
120 |     for i in range(n_clf):
121 |         clf_weights_pear[i] = \
122 |             pearsonr(target_mean, train_scores_norm[:, i].reshape(-1, 1))[0][0]
123 | 
124 |     clf_weights_euc = np.zeros([n_clf, 1])
125 |     for i in range(n_clf):
126 |         clf_weights_euc[i] = euclidean(target_mean,
127 |                                        train_scores_norm[:, i].reshape(-1, 1))
128 |     clf_weights_euc = clf_weights_euc.max() - clf_weights_euc
129 | 
130 |     for i in fixed_range:
131 |         target_test_max_pear = np.max(
132 |             test_scores_norm[:, argmaxp(clf_weights_pear, i)], axis=1)
133 |         target_test_max_euc = np.max(
134 |             test_scores_norm[:, argmaxp(clf_weights_euc, i)], axis=1)
135 |         test_target_list.extend([target_test_max_pear, target_test_max_euc])
136 |         method_list.extend(
137 |             ['max_' + str(i) + '_pear', 'max_' + str(i) + '_euc'])
138 | 
139 |     # generate weighted mean
140 |     target_test_weighted_pear = np.sum(
141 |         test_scores_norm * clf_weights_pear.reshape(1,
142 |                                                     -1) / clf_weights_pear.sum(),
143 |         axis=1)
144 |     target_test_weighted_euc = np.sum(
145 |         test_scores_norm * clf_weights_euc.reshape(1,
146 |                                                    -1) / clf_weights_euc.sum(),
147 |         axis=1)
148 |     test_target_list.extend(
149 |         [target_test_weighted_pear, target_test_weighted_euc])
150 |     method_list.extend(['w_mean_pear', 'w_mean_euc', ])
151 | 
152 |     # generate threshold sum
153 |     target_test_threshold = np.sum(test_scores_norm.clip(0), axis=1)
154 |     test_target_list.append(target_test_threshold)
155 |     method_list.append('threshold')
156 | 
157 |     # generate average of maximum (AOM) and maximum of average (MOA)
158 |     target_test_aom = aom(test_scores_norm, n_buckets, n_clf)
159 |     target_test_moa = moa(test_scores_norm, n_buckets, n_clf)
160 |     test_target_list.extend([target_test_aom, target_test_moa])
161 |     method_list.extend(['aom', 'moa'])
162 |     ###################################################################
163 |     # use mean as the pseudo target
164 |     for k in final_k_list:
165 |         tree = KDTree(X_train_norm)
166 |         dist_arr, ind_arr = tree.query(X_test_norm, k=k)
167 | 
168 |         m_list = ['a_dist_d', 'a_dist_r', 'a_dist_n',
169 |                   'a_pear_d', 'a_pear_r', 'a_pear_n']
170 | 
171 |         # initialize different buckets
172 |         pred_scores_best = np.zeros([X_test.shape[0], len(m_list)])
173 |         pred_scores_max_d = np.zeros([X_test.shape[0], len(m_list)])
174 |         pred_scores_max_f5 = np.zeros([X_test.shape[0], len(m_list)])
175 |         pred_scores_max_f10 = np.zeros([X_test.shape[0], len(m_list)])
176 |         pred_scores_max_f15 = np.zeros([X_test.shape[0], len(m_list)])
177 | 
178 |         for i in range(X_test.shape[0]):  # X_test_norm.shape[0]
179 |             # get the neighbor idx of the current point
180 |             ind_k = ind_arr[i, :]
181 | 
182 |             # get the pseudo target: mean
183 |             target_k = target_mean[ind_k,].ravel()
184 | 
185 |             # get the current scores from all clf
186 |             curr_train_k = train_scores_norm[ind_k, :]
187 | 
188 |             # weights by rank
189 |             weights_k_rank = ranks_mean[ind_k]
190 | 
191 |             # weights by distance
192 |             dist_k = dist_arr[i, :].reshape(-1, 1)
193 |             weights_k_dist = dist_k.max() - dist_k
194 | 
195 |             # initialize containers for correlation
196 |             corr_dist_d = np.zeros([n_clf, ])
197 |             corr_dist_r = np.zeros([n_clf, ])
198 |             corr_dist_n = np.zeros([n_clf, ])
199 |             corr_pear_d = np.zeros([n_clf, ])
200 |             corr_pear_r = np.zeros([n_clf, ])
201 |             corr_pear_n = np.zeros([n_clf, ])
202 | 
203 |             for d in range(n_clf):
204 |                 # flip distance so larger values imply larger correlation
205 |                 corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d],
206 |                                             w=weights_k_dist) * -1
207 |                 corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d],
208 |                                             w=weights_k_rank) * -1
209 |                 corr_dist_n[d,] = euclidean(target_k, curr_train_k[:, d]) * -1
210 |                 corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d],
211 |                                             w=weights_k_dist)
212 |                 corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d],
213 |                                             w=weights_k_rank)
214 |                 corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[0]
215 | 
216 |             corr_list = [corr_dist_d, corr_dist_r, corr_dist_n,
217 |                          corr_pear_d, corr_pear_r, corr_pear_n]
218 | 
219 |             for j in range(len(m_list)):
220 |                 corr_k = corr_list[j]
221 | 
222 |                 # pick the best one
223 |                 best_clf_ind = np.nanargmax(corr_k)
224 |                 pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind]
225 |                 #                print(k, best_clf_ind)
226 |                 # pick the p dynamically
227 |                 threshold = corr_k.max() - corr_k.std() * 0.2
228 |                 p = (corr_k >= threshold).sum()
229 |                 if p == 0:  # extreme cases: NaN or all -1's
230 |                     p = 1
231 |                 pred_scores_max_d[i, j] = np.max(
232 |                     test_scores_norm[i, argmaxp(corr_k, p)])
233 | 
234 |                 # pick the best 5 classifiers
235 |                 pred_scores_max_f5[i, j] = np.max(
236 |                     test_scores_norm[i, argmaxp(corr_k, 5)])
237 |                 # pick the best 10 classifiers
238 |                 pred_scores_max_f10[i, j] = np.max(
239 |                     test_scores_norm[i, argmaxp(corr_k, 10)])
240 |                 # pick the best 15 classifiers
241 |                 pred_scores_max_f15[i, j] = np.max(
242 |                     test_scores_norm[i, argmaxp(corr_k, 15)])
243 | 
244 |         for m in range(len(m_list)):
245 |             test_target_list.extend([pred_scores_best[:, m],
246 |                                      pred_scores_max_d[:, m],
247 |                                      pred_scores_max_f5[:, m],
248 |                                      pred_scores_max_f10[:, m],
249 |                                      pred_scores_max_f15[:, m]])
250 |             method_list.extend(['dcs_best_' + m_list[m] + '_' + str(k),
251 |                                 'dcs_dyn_' + m_list[m] + '_' + str(k),
252 |                                 'dcs_f5_' + m_list[m] + '_' + str(k),
253 |                                 'dcs_f10_' + m_list[m] + '_' + str(k),
254 |                                 'dcs_f15_' + m_list[m] + '_' + str(k)])
255 |             k_rec_list.extend([k, k, k, k, k])
256 |     ##########################################################################
257 | 
258 |     # use max for pseudo target
259 |     for k in final_k_list:
260 |         print('processing', k)
261 |         tree = KDTree(X_train_norm)
262 |         dist_arr, ind_arr = tree.query(X_test_norm, k=k)
263 | 
264 |         m_list = ['m_dist_d', 'm_dist_r', 'm_dist_n',
265 |                   'm_pear_d', 'm_pear_r', 'm_pear_n']
266 | 
267 |         pred_scores_best = np.zeros([X_test.shape[0], len(m_list)])
268 |         pred_scores_max_d = np.zeros([X_test.shape[0], len(m_list)])
269 |         pred_scores_max_f5 = np.zeros([X_test.shape[0], len(m_list)])
270 |         pred_scores_max_f10 = np.zeros([X_test.shape[0], len(m_list)])
271 |         pred_scores_max_f15 = np.zeros([X_test.shape[0], len(m_list)])
272 | 
273 |         for i in range(X_test.shape[0]):  # X_test_norm.shape[0]
274 |             # get the neighbor idx of the current point
275 |             ind_k = ind_arr[i, :]
276 | 
277 |             # get the pseudo target: max
278 |             target_k = target_max[ind_k,].ravel()
279 | 
280 |             # get the current scores from all clf
281 |             curr_train_k = train_scores_norm[ind_k, :]
282 | 
283 |             # weights by rank
284 |             weights_k_rank = ranks_max[ind_k]
285 |             # weights by distance
286 |             dist_k = dist_arr[i, :].reshape(-1, 1)
287 |             weights_k_dist = dist_k.max() - dist_k
288 | 
289 |             corr_dist_d = np.zeros([n_clf, ])
290 |             corr_dist_r = np.zeros([n_clf, ])
291 |             corr_dist_n = np.zeros([n_clf, ])
292 |             corr_pear_d = np.zeros([n_clf, ])
293 |             corr_pear_r = np.zeros([n_clf, ])
294 |             corr_pear_n = np.zeros([n_clf, ])
295 | 
296 |             for d in range(n_clf):
297 |                 corr_dist_d[d,] = euclidean(target_k, curr_train_k[:, d],
298 |                                             w=weights_k_dist) * -1
299 |                 corr_dist_r[d,] = euclidean(target_k, curr_train_k[:, d],
300 |                                             w=weights_k_rank) * -1
301 |                 corr_dist_n[d,] = euclidean(target_k, curr_train_k[:, d]) * -1
302 |                 corr_pear_d[d,] = wpearsonr(target_k, curr_train_k[:, d],
303 |                                             w=weights_k_dist)
304 |                 corr_pear_r[d,] = wpearsonr(target_k, curr_train_k[:, d],
305 |                                             w=weights_k_rank)
306 |                 corr_pear_n[d,] = wpearsonr(target_k, curr_train_k[:, d])[0]
307 | 
308 |             corr_list = [corr_dist_d, corr_dist_r, corr_dist_n,
309 |                          corr_pear_d, corr_pear_r, corr_pear_n]
310 | 
311 |             for j in range(len(m_list)):
312 |                 corr_k = corr_list[j]
313 | 
314 |                 # pick the best one
315 |                 best_clf_ind = np.nanargmax(corr_k)
316 |                 pred_scores_best[i, j] = test_scores_norm[i, best_clf_ind]
317 | 
318 |                 # pick the p dynamically
319 |                 threshold = corr_k.max() - corr_k.std() * 0.2
320 |                 p = (corr_k >= threshold).sum()
321 |                 if p == 0:  # extreme cases: NaN or all -1's
322 |                     p = 1
323 |                 pred_scores_max_d[i, j] = np.mean(
324 |                     test_scores_norm[i, argmaxp(corr_k, p)])
325 | 
326 |                 # pick the best 5 classifiers
327 |                 pred_scores_max_f5[i, j] = np.mean(
328 |                     test_scores_norm[i, argmaxp(corr_k, 5)])
329 |                 # pick the best 10 classifiers
330 |                 pred_scores_max_f10[i, j] = np.mean(
331 |                     test_scores_norm[i, argmaxp(corr_k, 10)])
332 |                 # pick the best 15 classifiers
333 |                 pred_scores_max_f15[i, j] = np.mean(
334 |                     test_scores_norm[i, argmaxp(corr_k, 15)])
335 | 
336 |         for m in range(len(m_list)):
337 |             test_target_list.extend([pred_scores_best[:, m],
338 |                                      pred_scores_max_d[:, m],
339 |                                      pred_scores_max_f5[:, m],
340 |                                      pred_scores_max_f10[:, m],
341 |                                      pred_scores_max_f15[:, m]])
342 |             method_list.extend(['dcs_best_' + m_list[m] + '_' + str(k),
343 |                                 'dcs_dyn_' + m_list[m] + '_' + str(k),
344 |                                 'dcs_f5_' + m_list[m] + '_' + str(k),
345 |                                 'dcs_f10_' + m_list[m] + '_' + str(k),
346 |                                 'dcs_f15_' + m_list[m] + '_' + str(k)])
347 |             k_rec_list.extend([k, k, k, k, k])
348 | 
349 |     trans_data_list.append(
350 |         TSNE(n_components=2, init='pca').fit_transform(X_test))
351 |     X_test_list.append(X_test)
352 |     X_test_name_list.append(data)
353 |     DCSO_best_list.append(DCSO_best)
354 |     test_target_list_list.append(test_target_list)
355 |     y_test_list.append(y_test)
356 |     ##########################################################################
357 |     plt.figure(figsize=(12, 6))
358 | 
359 |     for k in range(1):
360 | 
361 |         # find the comparison
362 |         dcs_target = get_label_n(y_test_list[k],
363 |                                  test_target_list_list[k][DCSO_best_list[k]])
364 |         mean_target = get_label_n(y_test_list[k], test_target_list_list[k][0])
365 |         max_target = get_label_n(y_test_list[k], test_target_list_list[k][1])
366 | 
367 |         normal_ind = []
368 |         outlier_ind = []
369 | 
370 |         equal_right_mean = []
371 |         equal_wrong_mean = []
372 | 
373 |         equal_right_max = []
374 |         equal_wrong_max = []
375 | 
376 |         dcs_out_mean = []
377 |         mean_out = []
378 | 
379 |         dcs_norm_mean = []
380 |         mean_norm = []
381 | 
382 |         dcs_out_max = []
383 |         max_out = []
384 | 
385 |         dcs_norm_max = []
386 |         max_norm = []
387 | 
388 |         for i in range(X_test_list[k].shape[0]):
389 |             if y_test_list[k][i] == 0:
390 |                 normal_ind.append(i)
391 |             else:
392 |                 outlier_ind.append(i)
393 | 
394 |             if dcs_target[i] == mean_target[i] == y_test_list[k][i]:
395 |                 print(i, 'equal & right')
396 |                 equal_right_mean.append(i)
397 | 
398 |             elif dcs_target[i] == mean_target[i] and dcs_target[i] != \
399 |                     y_test_list[k][i]:
400 |                 print(i, 'equal & wrong')
401 |                 equal_wrong_mean.append(i)
402 | 
403 |             elif dcs_target[i] != mean_target[i]:
404 |                 print(i, 'not equal')
405 |                 if y_test_list[k][i] == 1:
406 |                     if dcs_target[i] == y_test_list[k][i]:
407 |                         dcs_out_mean.append(i)
408 |                     else:
409 |                         mean_out.append(i)
410 |                 else:
411 |                     if dcs_target[i] == y_test_list[k][i]:
412 |                         dcs_norm_mean.append(i)
413 |                     else:
414 |                         mean_norm.append(i)
415 |             ##################################################################
416 |             if dcs_target[i] == max_target[i] == y_test_list[k][i]:
417 |                 print(i, 'equal & right')
418 |                 equal_right_max.append(i)
419 | 
420 |             elif dcs_target[i] == max_target[i] and dcs_target[i] != \
421 |                     y_test_list[k][i]:
422 |                 print(i, 'equal & wrong')
423 |                 equal_wrong_max.append(i)
424 | 
425 |             elif dcs_target[i] != max_target[i]:
426 |                 print(i, 'not equal')
427 |                 if y_test_list[k][i] == 1:
428 |                     if dcs_target[i] == y_test_list[k][i]:
429 |                         dcs_out_max.append(i)
430 |                     else:
431 |                         max_out.append(i)
432 |                 else:
433 |                     if dcs_target[i] == y_test_list[k][i]:
434 |                         dcs_norm_max.append(i)
435 |                     else:
436 |                         max_norm.append(i)
437 | 
438 |         # plot mean
439 |         plt.subplot(121)
440 | 
441 |         plt.scatter(trans_data_list[k][normal_ind, 0],
442 |                     trans_data_list[k][normal_ind, 1], label='Normal',
443 |                     color='orange', alpha=0.6, s=24, marker='o')
444 |         plt.scatter(trans_data_list[k][outlier_ind, 0],
445 |                     trans_data_list[k][outlier_ind, 1], label='Outlying',
446 |                     color='red', alpha=0.6, s=28, marker='s')
447 | 
448 |         plt.scatter(trans_data_list[k][mean_norm, 0],
449 |                     trans_data_list[k][mean_norm, 1], label='SG_N',
450 |                     color='g', alpha=0.95, s=40, marker='v')
451 |         plt.scatter(trans_data_list[k][mean_out, 0],
452 |                     trans_data_list[k][mean_out, 1], label='SG_O',
453 |                     color='g', alpha=0.95, s=40, marker='^')
454 | 
455 |         plt.scatter(trans_data_list[k][dcs_norm_mean, 0],
456 |                     trans_data_list[k][dcs_norm_mean, 1], label='DCSO_N',
457 |                     color='b', alpha=0.95, s=54, marker='x')
458 |         plt.scatter(trans_data_list[k][dcs_out_mean, 0],
459 |                     trans_data_list[k][dcs_out_mean, 1], label='DCSO_O',
460 |                     color='b', alpha=0.95, s=65, marker='+')
461 | 
462 | 
463 |         plt.legend(ncol=3, prop={'size': 7.5}, loc='lower right',
464 |                    bbox_transform=plt.gcf().transFigure)
465 |         plt.xticks([])
466 |         plt.yticks([])
467 |         plt.title('SG_A vs. DCSO (' + X_test_name_list[k] + ')', fontsize=12)
468 | 
469 |         # plot max
470 |         plt.subplot(122)
471 | 
472 |         plt.scatter(trans_data_list[k][normal_ind, 0],
473 |                     trans_data_list[k][normal_ind, 1], label='Normal',
474 |                     color='orange', alpha=0.6, s=24, marker='o')
475 |         plt.scatter(trans_data_list[k][outlier_ind, 0],
476 |                     trans_data_list[k][outlier_ind, 1], label='Outlying',
477 |                     color='red', alpha=0.6, s=28, marker='s')
478 | 
479 |         plt.scatter(trans_data_list[k][max_norm, 0],
480 |                     trans_data_list[k][max_norm, 1], label='SG_N',
481 |                     color='g', alpha=0.95, s=40, marker='v')
482 |         plt.scatter(trans_data_list[k][max_out, 0],
483 |                     trans_data_list[k][max_out, 1], label='SG_O',
484 |                     color='g', alpha=0.95, s=40, marker='^')
485 | 
486 |         plt.scatter(trans_data_list[k][dcs_norm_max, 0],
487 |                     trans_data_list[k][dcs_norm_max, 1], label='DCSO_N',
488 |                     color='b', alpha=0.95, s=54, marker='x')
489 |         plt.scatter(trans_data_list[k][dcs_out_max, 0],
490 |                     trans_data_list[k][dcs_out_max, 1], label='DCSO_O',
491 |                     color='b', alpha=0.95, s=65, marker='+')
492 | 
493 |         plt.legend(ncol=3, prop={'size': 7.5}, loc='lower right',
494 |                    bbox_transform=plt.gcf().transFigure)
495 |         plt.xticks([])
496 |         plt.yticks([])
497 |         plt.title('SG_M vs. DCSO (' + X_test_name_list[k] + ')', fontsize=12)
498 | 
499 |     plt.tight_layout()
500 | 
501 |     # create the output directory if it does not exist
502 |     pathlib.Path('viz').mkdir(parents=True, exist_ok=True)
503 |     # save the figure using an OS-independent path
504 |     plt.savefig(str(pathlib.Path('viz') / (data + '.png')), dpi=330)
505 | 


--------------------------------------------------------------------------------
/viz/Annthyroid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Annthyroid.png


--------------------------------------------------------------------------------
/viz/Cardio.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Cardio.png


--------------------------------------------------------------------------------
/viz/Letter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Letter.png


--------------------------------------------------------------------------------
/viz/Mnist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Mnist.png


--------------------------------------------------------------------------------
/viz/Pendigits.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Pendigits.png


--------------------------------------------------------------------------------
/viz/Pima.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Pima.png


--------------------------------------------------------------------------------
/viz/Satellite.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Satellite.png


--------------------------------------------------------------------------------
/viz/Thyroid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Thyroid.png


--------------------------------------------------------------------------------
/viz/Vowels.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yzhao062/DCSO/5e42a4f99134dc4cd8bf6e99c1861d2307df27e2/viz/Vowels.png


--------------------------------------------------------------------------------