├── .gitignore
├── README.md
├── ad_test_coreSet.py
├── docker
    ├── Dockerfile
    ├── container.sh
    └── image.sh
├── image
    └── RAPID_main.png
├── preprocess_rep.py
├── scripts
    └── all_at_once.sh
├── split_data.py
└── utils.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | *.pkl
 2 | *.json
 3 | *.log
 4 | *_parser
 5 | *.bin
 6 | *.tar.gz
 7 | *.gz
 8 | *.npz
 9 | *.csv
10 | *_0.2
11 | *_0.3
12 | *_0.5
13 | *.txt
14 | 
15 | # !processed_data/*/*/results/*.pkl
16 | dataset/
17 | processed_data/*/
18 | 
19 | # Ignore everything in logAD/dataset/LAnoBERT_split
20 | dataset/LAnoBERT_split/
21 | 
22 | # logAD/logbert/output
23 | logbert/output/
24 | 
25 | final_bert_model/
26 | 
27 | __pycache__/


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # RAPID
 2 | **RAPID: Training-free Retrieval-based Log Anomaly Detection with PLM considering Token-level information**
 3 | 
 4 | [Gunho No](https://www.linkedin.com/in/%EA%B1%B4%ED%98%B8-%EB%85%B8-58b4a9298/)*, [Yukyung Lee](https://www.linkedin.com/in/yukyung-lee-149681155/)\*, [Hyeongwon Kang](https://www.linkedin.com/in/hyeongwon/) and [Pilsung Kang](https://github.com/pilsung-kang) 
 5 | <br>(*equal contribution)
 6 | 
 7 | This repository is the official implementation of "RAPID".
 8 | 
 9 | ![RAPID Architecture](image/RAPID_main.png)
10 | 
11 | ```
12 | RAPID/
13 | │
14 | ├── split_data.py            # Dataset splitting and preprocessing
15 | ├── preprocess_rep.py        # Log representation generation via Language Model
16 | ├── ad_test_coreSet.py       # Anomaly detection algorithm
17 | ├── utils.py                 # Shared helpers (pickle I/O, seeding, log parsing)
18 | │
19 | ├── scripts/
20 | │   └── all_at_once.sh       # End-to-end experiment runner
21 | │
22 | ├── processed_data/          # Directory for processed datasets
23 | │   ├── bgl/
24 | │   ├── tbird/
25 | │   └── hdfs/
26 | ```
27 | 
28 | ## Datasets
29 | RAPID is evaluated on three public datasets:
30 | 
31 | * BGL (Blue Gene/L)
32 | * Thunderbird
33 | * HDFS
34 | 
35 | Place the raw datasets in the `dataset/` directory before running the preprocessing scripts.
36 | 
37 | ## Running the Experiments
38 | ### Full Pipeline
39 | To reproduce all experiments from the paper:
40 | ```
41 | bash scripts/all_at_once.sh
42 | ```
43 | This script runs the entire pipeline, including data preprocessing, representation generation, and anomaly detection across multiple configurations as described in our paper.
44 | 
45 | ### Step-by-step Execution
46 | 
47 | 1. Data Splitting and Preprocessing:
48 | ```
49 | python split_data.py --dataset [bgl/tbird/hdfs] --test_size 0.2
50 | ```
51 | 
52 | 2. Representation Generation:
53 | ```
54 | python preprocess_rep.py --dataset [bgl/tbird/hdfs] --plm bert-base-uncased --batch_size 8192 --max_token_len [128/512]
55 | ```
56 | 
57 | 3. Anomaly Detection:
58 | ```
59 | python ad_test_coreSet.py --dataset [bgl/tbird/hdfs] --train_ratio 1 --coreSet 0.01 --only_cls False
60 | ```
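As a rough illustration of what step 3 computes, the sketch below restates the scoring idea of `get_colbert_score` in `ad_test_coreSet.py` in plain Python: for each test token, take its best cosine similarity among the tokens of one training log, sum those maxima, and keep the highest sum over the candidate training logs. The names `cosine`, `maxsim`, and `retrieval_score` and the toy 2-d vectors are assumptions made for this example; the actual script works on batched PyTorch tensors on the GPU.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxsim(test_tokens, train_tokens):
    """Sum, over test tokens, of the best cosine match among the train tokens."""
    return sum(max(cosine(q, k) for k in train_tokens) for q in test_tokens)


def retrieval_score(test_tokens, train_logs):
    """Highest MaxSim between one test log and any candidate training log."""
    return max(maxsim(test_tokens, log) for log in train_logs)


# Toy data (hypothetical): one training log with two token vectors.
train_logs = [[[1.0, 0.0], [0.0, 1.0]]]
seen_log = [[1.0, 0.0], [0.0, 1.0]]       # identical to the training log
unseen_log = [[-1.0, 0.0], [0.0, -1.0]]   # points away from every train token

print(retrieval_score(seen_log, train_logs))
print(retrieval_score(unseen_log, train_logs))
```

A lower retrieval score means the log resembles nothing seen in training. The script turns these similarities into anomaly scores by subtracting each one from the maximum similarity over the test set.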
61 | 
62 | ## Key Parameters
63 | 
64 | * `--dataset`: Choose the dataset (bgl, tbird, hdfs)
65 | * `--sample`: Sample size for large datasets (e.g., 5000000 for Thunderbird)
66 | * `--plm`: Pre-trained language model (bert-base-uncased, roberta-base, google/electra-base-discriminator)
67 | * `--coreSet`: Core set size or ratio (0, 0.01, 0.1, etc.)
68 | * `--train_ratio`: Ratio of training data to use (1, 0.1, 0.01, etc.)
69 | * `--only_cls`: Whether to use only the CLS token representation (True/False)
70 | 
71 | ## Results
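72 | After running the experiments, results are saved in the `processed_data/[dataset]/[test_size]/[plm]/results/` directory (the dataset folder is named `[dataset]_sample_[sample]` when `--sample` is not 1). Each experiment produces a CSV file (`*_df.csv`) and a JSON file (`*_dict.json`) with detailed performance metrics, plus a `.txt` log of the run.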
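A saved JSON result can be read back with the standard library alone. The sketch below assumes the layout produced by `results_df.to_dict()` in `ad_test_coreSet.py`, that is, one top-level experiment key mapping to `{metric: {method: value}}`. The helper name `load_metric`, the file name, and the numbers in the dummy file are placeholders for illustration, not real results.

```python
import json
import os
import tempfile


def load_metric(json_path, metric="f1_score"):
    """Return (experiment_name, {method: value}) for one metric of a result file."""
    with open(json_path) as f:
        result = json.load(f)
    # The file holds a single top-level key: the experiment name.
    experiment, columns = next(iter(result.items()))
    return experiment, columns[metric]


# Dummy file mimicking the expected layout; the values are placeholders.
dummy = {
    "demo_exp": {
        "f1_score": {"K=1(all_each)": 0.5, "ColBERT(all_each)": 0.75},
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_exp_dict.json")
    with open(path, "w") as f:
        json.dump(dummy, f)
    experiment, f1_by_method = load_metric(path)

print(experiment)
for method, value in f1_by_method.items():
    print(method, value)
```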
73 | 
74 | ## Citation
75 | If you find this code useful for your research, please cite our paper:
76 | 
77 | ```
78 | @article{NO2024108613,
79 | title = {Training-free retrieval-based log anomaly detection with pre-trained language model considering token-level information},
80 | journal = {Engineering Applications of Artificial Intelligence},
81 | volume = {133},
82 | pages = {108613},
83 | year = {2024},
84 | issn = {0952-1976},
85 | doi = {https://doi.org/10.1016/j.engappai.2024.108613},
86 | url = {https://www.sciencedirect.com/science/article/pii/S0952197624007711},
87 | author = {Gunho No and Yukyung Lee and Hyeongwon Kang and Pilsung Kang}
88 | }
89 | ```
90 | 


--------------------------------------------------------------------------------
/ad_test_coreSet.py:
--------------------------------------------------------------------------------
  1 | import numpy as np
  2 | import pandas as pd
  3 | 
  4 | from tqdm import tqdm
  5 | import os
  6 | import pickle
  7 | import json
  8 | 
  9 | from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, roc_curve, auc, precision_recall_curve
 10 | from sklearn.neighbors import NearestNeighbors
 11 | 
 13 | 
 14 | import torch
 15 | from torch import matmul
 16 | 
 17 | # suppress all warnings (including UserWarning)
 18 | import warnings
 19 | warnings.filterwarnings("ignore")
 20 | 
 21 | import argparse
 22 | import sys
 23 | 
 24 | from utils import str2bool, load_pickle, save_pickle, set_seed
 25 | #for time check
 26 | import time
 27 | 
 28 | def get_threshold_prc(score, test_label, not_to_numpy=False):
 29 |     if not_to_numpy:
 30 |         precision, recall, thresholds = precision_recall_curve(test_label, score)
 31 |     else:
 32 |         precision, recall, thresholds = precision_recall_curve(test_label, score.to_numpy())
 33 |     #get best f1score
 34 |     f1 = np.array([2 * (pr * re) / (pr + re + 1e-10) for pr, re in zip(precision, recall)])
 35 | 
 36 |     ix = np.argmax(f1)
 37 |     best_thresh = thresholds[ix]
 38 |     return best_thresh
 39 |  
 40 | def get_threshold_roc(score, test_label, not_to_numpy=False):
 41 |     if not_to_numpy:
 42 |         fpr, tpr, thresholds =roc_curve(test_label, score)
 43 |     else:
 44 |         fpr, tpr, thresholds =roc_curve(test_label, score.to_numpy())
 45 |     J = tpr - fpr
 46 |     ix = np.argmax(J)
 47 |     best_thresh = thresholds[ix]
 48 |     print('Best Threshold=%f, sensitivity = %.3f, specificity = %.3f, J=%.3f' % (best_thresh, tpr[ix], 1-fpr[ix], J[ix]))
 49 |     return best_thresh
 50 | 
 51 | def get_detection_score(label, pred, time_list,exp_name = 'current_exp', result_df=None):
 52 |     time_for_get_coreSet, time_for_cal_maxsim, time_for_get_adscore_for_all = time_list
 53 |     # get detection score
 54 |     print(f'confusion_matrix: \n{confusion_matrix(label, pred)}')
 55 |     print(f'accuracy_score: {accuracy_score(label, pred)}')
 56 |     print(f'f1_score: {f1_score(label, pred)}')
 57 |     print(f'precision_score: {precision_score(label, pred)}')
 58 |     print(f'recall_score: {recall_score(label, pred)}')
 59 |     print(f'roc_auc_score: {roc_auc_score(label, pred)}')
 60 |     print(classification_report(label, pred))
 61 | 
 62 |     if result_df is None:
 63 |         #make new dataframe and make exp_name to be index
 64 |         result_df = pd.DataFrame()
 65 |         result_df['exp_name'] = [exp_name]
 66 |         result_df = result_df.set_index('exp_name')
 67 |         result_df['f1_score'] = [f1_score(label, pred)]
 68 |         result_df['roc_auc_score'] = [roc_auc_score(label, pred)]
 69 |         result_df['precision_score'] = [precision_score(label, pred)]
 70 |         result_df['recall_score'] = [recall_score(label, pred)]
 71 |         result_df['accuracy_score'] = [accuracy_score(label, pred)]
 72 |         result_df['coreSet_time'] = [time_for_get_coreSet]
 73 |         result_df['maxsim_time'] = [time_for_cal_maxsim]
 74 |         result_df['lookup_all_adscore_time'] = [time_for_get_adscore_for_all]
 75 |     else:
 76 |         result_df.loc[exp_name] = [f1_score(label, pred),roc_auc_score(label, pred), precision_score(label, pred), recall_score(label, pred), accuracy_score(label, pred), 
 77 |                                    time_for_get_coreSet, time_for_cal_maxsim, time_for_get_adscore_for_all]
 78 |     return result_df
 79 | 
 80 | def get_threshold_pred_distance(score, test_label, desc='make maxsim_ori using lookup'):
 81 |     # get score of all data from unique data
 82 |     time_adscore_all=time.time()
 83 |     score_ori=np.zeros((test_label.shape[0]))
 84 |     for i in tqdm(range(test_label.shape[0]), desc=desc):
 85 |         score_ori[i] = score[test_unique_lookup_table[i]]
 86 |     time_adscore_all=time.time()-time_adscore_all
 87 |     best_thresh = threshold_function(score_ori, test_label['label'], True)
 88 |     pred=(score_ori >= best_thresh).astype(int)
 89 |     return pred, time_adscore_all
 90 | 
 91 | 
 92 | def get_colbert_score(a_test_rep, train_representations, maxsim_metric='cos'): #maxsim_metric: cosine, dot
 93 |     if maxsim_metric=='cos':                
 94 |         test_score = torch.sum(torch.max(torch.div(
 95 |                                                     matmul(a_test_rep, train_representations.transpose(1,2)),
 96 |                                                     torch.mul(torch.norm(a_test_rep,dim=1).unsqueeze(0).unsqueeze(-1),
 97 |                                                             torch.norm(train_representations,dim=2).unsqueeze(1))
 98 |                                                 ), dim=2).values, dim=1)
 99 | 
100 |         maxsim_score=torch.max(test_score)
101 |         mean_maxsim_score=torch.mean(test_score)
102 |         return maxsim_score, mean_maxsim_score
103 | 
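104 |     elif maxsim_metric=='dot':
105 |         raise NotImplementedError("maxsim_metric='dot' is not implemented; use 'cos'")  # a bare pass returned None and broke tuple unpacking in divide_cal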
106 | 
107 | def divide_cal(test_rep_chunk, train_representations, train_neighbor_index, test_idx, coreSet, maxsim_metric='cos'):
108 |     test_rep_chunk_cuda = test_rep_chunk.cuda()
109 |     test_scores_log_chunk = torch.Tensor([]).to(test_rep_chunk_cuda.device)
110 |     test_mean_coreSet_scores_chunk = torch.Tensor([]).to(test_rep_chunk_cuda.device)
111 |     train_representations=train_representations.cuda()
112 | 
113 |     for a_test in test_rep_chunk_cuda:
114 |         if coreSet==0:
115 |             maxsim_score, mean_coreSet_score =get_colbert_score(a_test, train_representations, maxsim_metric=maxsim_metric)
116 |             # if test_idx==0:
117 |             #     print(f'compare against all train data,{train_representations.shape[0]}')
118 |         else:
119 |             maxsim_score, mean_coreSet_score =get_colbert_score(a_test, train_representations[train_neighbor_index[test_idx]], maxsim_metric=maxsim_metric)
120 |             # if test_idx==0:
121 |             #     print(f'compare against coreSet only,{train_representations[train_neighbor_index[test_idx]].shape[0]}')
122 |         test_scores_log_chunk = torch.cat((test_scores_log_chunk, 
123 |                                     maxsim_score.unsqueeze(0)), dim=0) 
124 |         test_mean_coreSet_scores_chunk = torch.cat((test_mean_coreSet_scores_chunk,
125 |                                     mean_coreSet_score.unsqueeze(0)), dim=0)
126 |         test_idx+=1
127 | 
128 |     test_scores_log_chunk=test_scores_log_chunk.detach().cpu().numpy()
129 |     test_mean_coreSet_scores_chunk=test_mean_coreSet_scores_chunk.detach().cpu().numpy()
130 |     return test_scores_log_chunk, test_mean_coreSet_scores_chunk, test_idx
131 | 
132 | if __name__ == '__main__':
133 |     parser = argparse.ArgumentParser()
134 |     parser.add_argument('--plm', type=str, default='bert-base-uncased')
135 |     parser.add_argument('--seed', type=int, default=1234, help='random seed (default: 1234)')
136 |    
137 |     #dataset.
138 |     parser.add_argument('--dataset', type=str, default='bgl', help='bgl, tbird, hdfs') 
139 |     parser.add_argument("--sample", help='sample size, e.g. 0.1, 0.05, 100000', default=1, type=lambda x: int(x) if x.isdigit() else float(x))
140 |     parser.add_argument("--test_size", help="test_size", default=0.2, type=float)
141 |     
142 |     # core set
143 |     parser.add_argument('--coreSet', default=0, type=lambda x: int(x) if x.isdigit() else float(x), help='0:all unique, 1, 1000, 0.1')
144 | 
145 |     parser.add_argument('--maxsim_metric', type=str, default='cos', help='cos, dot')
146 | 
147 |     #extra Experiment
148 |     parser.add_argument('--only_cls', default=False, type=str2bool, help='only cls colbert')
149 |     parser.add_argument('--train_ratio', type=float, default=1.0, help='for using exp(train ratio)')
150 |     parser.add_argument("--only_in_test", default=False, type=str2bool, help='only_in_test')
151 |     parser.add_argument('--threshold_function', type=str, default='prc', help='prc, roc')
152 | 
153 |     args = parser.parse_args()
154 |     set_seed(args.seed)
155 | 
156 |     # directory setting
157 |     if args.sample != 1:
158 |         root_data_path = os.path.join(os.getcwd(), 'processed_data', f'{args.dataset}_sample_{str(args.sample)}')
159 |         processed_data_path = os.path.join(root_data_path, f'{args.test_size}', f'{args.plm}')
160 | 
161 |     else:
162 |         root_data_path = os.path.join(os.getcwd(), 'processed_data', f'{args.dataset}')
163 |         processed_data_path = os.path.join(root_data_path, f'{args.test_size}', f'{args.plm}') 
164 | 
165 |     #save result
166 |     save_path=os.path.join(processed_data_path, 'results')
167 |     if not os.path.exists(save_path):
168 |         os.makedirs(save_path)
169 | 
170 |     if args.only_in_test:
171 |         save_path=os.path.join(save_path, f'only_in_test')
172 |         if not os.path.exists(save_path):
173 |             os.makedirs(save_path)
174 | 
175 |     #set experiment name start with data information
176 |     exp_log_file_name = f'{args.dataset}_sample-{str(args.sample)}_trainRatio-{str(args.train_ratio)}'
177 | 
178 |     #exp setting
179 |     exp_log_file_name = exp_log_file_name+f'_thrSearch-{args.threshold_function}'
180 |     
181 |     if args.only_cls: 
182 |         exp_log_file_name = exp_log_file_name+f'_C-Onlycls-{args.maxsim_metric}'
183 |     else:
184 |         exp_log_file_name = exp_log_file_name+f'_C-Wcls-{args.maxsim_metric}'
185 | 
186 |     exp_log_file_name = exp_log_file_name+f'_coreSet-{str(args.coreSet)}'
187 | 
188 |     if args.threshold_function == 'roc':
189 |         threshold_function=get_threshold_roc
190 |     elif args.threshold_function == 'prc':
191 |         threshold_function=get_threshold_prc
192 | 
193 |     if 'hdfs' in args.dataset:
194 |         result_name='(session)'
195 |     else:
196 |         result_name='(all_each)'
197 | 
198 |     if os.path.exists(os.path.join(save_path, f'{exp_log_file_name}_dict.json')):
199 |         # end of experiment
200 |         print(f'{exp_log_file_name} already exists')
201 |         sys.exit()
202 | 
203 |     # load preprocessed data
204 |     # train, val : all normal data
205 |     train_representations = load_pickle(os.path.join(processed_data_path,'train_representations'))
206 |     test_label = load_pickle(os.path.join(processed_data_path,'test_label'))
207 |     test_representations = load_pickle(os.path.join(processed_data_path,'test_representations'))
208 |     test_unique_lookup_table = load_pickle(os.path.join(processed_data_path,'test_unique_lookup_table'))  
209 | 
210 |     if args.train_ratio != 1:
211 |         print(f'original unique train size: {train_representations.shape}')
212 |         print(f'train_ratio: {args.train_ratio}')
213 |         train_unique_lookup_table=load_pickle(os.path.join(processed_data_path,'train_unique_lookup_table'))        
214 |         sampled_train = np.random.choice(train_unique_lookup_table.shape[0], int(train_unique_lookup_table.shape[0]*args.train_ratio), replace=False)
215 |         train_representations=train_representations[np.unique(train_unique_lookup_table[sampled_train]),:,:]
216 |         print(f'sampled_unique_train_size: {train_representations.shape}')
217 | 
218 |     if args.only_in_test:
219 |         test_label['lookup_table']=test_unique_lookup_table
220 |         test_label['label']=test_label['label'].astype(int)
221 |         print('only_in_test')
222 |         #compare train_representations and test_representations to find only in test
223 |         # for cheaper computation, compare only the CLS vectors
224 |         train_representations_cls=train_representations[:,0,:].numpy()
225 |         test_representations_cls=test_representations[:,0,:].numpy()
226 | 
227 |         # get only in test (hash the train CLS vectors once so each membership check is O(1))
228 |         only_in_test_idx, train_cls_set = [], {tuple(cls) for cls in train_representations_cls.tolist()}
229 |         for i, cls in tqdm(enumerate(test_representations_cls.tolist()), desc='only_in_test', total=len(test_representations_cls)):
230 |             if tuple(cls) not in train_cls_set:
231 |                 only_in_test_idx.append(i)
232 | 
233 |         only_in_test_idx=np.array(only_in_test_idx)
234 |         new_idx_dict={}
235 |         for i in range(len(only_in_test_idx)):
236 |             new_idx_dict[only_in_test_idx[i]]=np.arange(0, len(only_in_test_idx))[i]
237 | 
238 |         only_test_test_label=test_label[test_label['lookup_table'].isin(only_in_test_idx)].reset_index(drop=True)
239 |         only_test_test_label['lookup_table']=only_test_test_label['lookup_table'].map(new_idx_dict)
240 |         
241 |         test_unique_lookup_table=only_test_test_label['lookup_table'].values
242 |         test_label=only_test_test_label[['timestamp','label']]
243 |         test_representations=test_representations[only_in_test_idx]
244 |         print(f'only_in_test: {test_representations.shape}')
245 | 
246 | 
247 |     exp_log_file_name=exp_log_file_name.split('_')
248 |     exp_log_file_name[2]=exp_log_file_name[2]+'-'+str(train_representations.shape[0])
249 | 
250 |     exp_log_file_name='_'.join(exp_log_file_name)
251 |     
252 |     # if coreSet is not an integer, it is a ratio
253 |     # check whether args.coreSet is a ratio
254 |     if (args.coreSet > 0) and (args.coreSet < 1):
255 |         coreSet = int(train_representations.shape[0]*args.coreSet)
256 |         coreSet = max(coreSet, 1)
257 |         print(f'{args.coreSet} = {coreSet}')
258 |         exp_log_file_name='-'.join(exp_log_file_name.split('-')[:-1])
259 |         exp_log_file_name = exp_log_file_name+f'-{args.coreSet}-{coreSet}'
260 |     elif args.coreSet <= train_representations.shape[0]:
261 |         #this branch also covers the coreSet=0 case
262 |         coreSet = int(args.coreSet)
263 |     else:
264 |         coreSet = int(train_representations.shape[0])
265 |         print(f'train_representations.shape[0] < coreSet: {train_representations.shape[0]} < {args.coreSet}')
266 |         print('use all unique_train')
267 |     
268 |     if train_representations.shape[0] < coreSet:
269 |         exp_log_file_name='-'.join(exp_log_file_name.split('-')[:-1])
270 |         exp_log_file_name = exp_log_file_name+f'-{train_representations.shape[0]}'
271 | 
272 |     if os.path.exists(os.path.join(save_path, f'{exp_log_file_name}_dict.json')):
273 |         # end of experiment
274 |         print(f'{exp_log_file_name} already exists')
275 |         sys.exit()
276 |         
277 |     with open(os.path.join(save_path, f'{exp_log_file_name}.txt'), 'w') as f:
278 |         sys.stdout = f
279 |         
280 |         # get label
281 |         # new_label from unique to all log
282 |         test_label['lookup_table']=test_unique_lookup_table
283 |         test_label['label']=test_label['label'].astype(int)
284 | 
285 |         #fit knn with unique training data 
286 |         time_for_get_coreSet=time.time()
287 |         if coreSet==0:
288 |             #use all unique_train; only the nearest neighbor is needed here (for the K=1 baseline)
289 |             knn_cuml_cls = NearestNeighbors(n_neighbors=1)
290 |         else:
291 |             # coreSet was already capped above to the full training set if it exceeded it
292 |             knn_cuml_cls = NearestNeighbors(n_neighbors=coreSet)
293 |         knn_cuml_cls.fit(train_representations[:,0,:].numpy())
294 | 
295 |         knn_D, train_neighbor_index = knn_cuml_cls.kneighbors(test_representations[:,0,:].numpy())
296 |         time_for_get_coreSet=time.time()-time_for_get_coreSet
297 | 
298 |         del knn_cuml_cls
299 | 
300 |         # sanity check: neighbor distances should be sorted in ascending order
301 |         if (np.sort(knn_D[10])==knn_D[10]).all():
302 |             print('score is ordered by the score')
303 |         else:
304 |             print('*'*50)
305 |             print('score is not ordered by the score')
306 |             print('*'*50)
307 | 
308 |         # make D of original test data by using test_unique_lookup_table
309 |         knn_D = knn_D[:,0]
310 |         knn_pred, knn_time=get_threshold_pred_distance(knn_D, test_label, desc='make D_ori using lookup')
311 | 
312 |         if args.only_cls:
313 |             train_representations=train_representations[:,0,:].unsqueeze(1)
314 |             test_representations=test_representations[:,0,:].unsqueeze(1)
315 |             print('colbert with only cls')
316 |             print('*'*50)
317 |         
318 |         print('='*50)
319 |         print('start calculating colbert score')
320 | 
321 |         test_scores_log = np.array([])
322 |         test_mean_coreSet_scores = np.array([])
323 |         num_chunk = 100
324 |         print(f'train: {train_representations.shape}, test: {test_representations.shape}')
325 | 
326 |         test_chunks = torch.chunk(test_representations, num_chunk, dim=0)
327 |         print(f'num_chunk: {num_chunk}, num_each_chunk: {test_chunks[0].shape[0]}')
328 | 
329 |         time_for_cal_maxsim=time.time()
330 |         test_idx=0
331 |         for i in tqdm(range(num_chunk), desc='colbert score by chunk'):
332 |             if (len(test_chunks) != num_chunk) and (i >= len(test_chunks)):
333 |                 print(f'{i}th chunk is not exist: number of unique is less than num_chunk')
334 |                 break
335 |             test_scores_log_chunk, test_mean_coreSet_scores_chunk, new_test_idx = divide_cal(
336 |                 test_rep_chunk=test_chunks[i], test_idx=test_idx, train_representations=train_representations, 
337 |                 train_neighbor_index=train_neighbor_index, coreSet=coreSet, maxsim_metric=args.maxsim_metric)
338 |             test_idx=new_test_idx
339 |             
340 |             test_scores_log = np.concatenate((test_scores_log, test_scores_log_chunk), axis=0)
341 |             test_mean_coreSet_scores = np.concatenate((test_mean_coreSet_scores, test_mean_coreSet_scores_chunk), axis=0)
342 | 
343 |         del test_chunks
344 | 
345 |         test_scores_log=-test_scores_log+np.max(test_scores_log)
346 |         test_mean_coreSet_scores=-test_mean_coreSet_scores+np.max(test_mean_coreSet_scores)
347 |         time_for_cal_maxsim=time.time()-time_for_cal_maxsim
348 | 
349 |         save_pickle(test_scores_log, os.path.join(save_path, f'{exp_log_file_name}_test_scores_log'))
350 |         save_pickle(test_mean_coreSet_scores, os.path.join(save_path, f'{exp_log_file_name}_test_mean_coreSet_scores'))
351 | 
352 |         # for maxsim_ori version: main result
353 |         maxsim_pred, time_for_get_adscore_for_all = get_threshold_pred_distance(test_scores_log, test_label, desc='make maxsim_ori using lookup')
354 | 
355 |         # for mean_coreSet_score version     
356 |         mean_coreSet_maxsim_pred, mean_coreSet_time = get_threshold_pred_distance(test_mean_coreSet_scores, test_label, desc='make mean maxsim using lookup')
357 | 
358 | 
359 |         print('='*50)
360 |         print('save times')
361 |         print('time_for_get_coreSet:', time_for_get_coreSet)
362 |         print('time_for_cal_maxsim:', time_for_cal_maxsim)
363 |         print('time_for_get_adscore_for_all:', time_for_get_adscore_for_all)
364 |         print('='*50)
365 |         time_list=(time_for_get_coreSet,time_for_cal_maxsim, time_for_get_adscore_for_all)
366 | 
367 |         print('-'*50)
368 |         print('K=1')
369 |         results_df=get_detection_score(test_label['label'], knn_pred,time_list, exp_name = f'K=1{result_name}', result_df=None)
370 |         print('='*50)
371 |         print('only ColBERT by all test')
372 |         results_df=get_detection_score(test_label['label'], maxsim_pred,time_list, exp_name = f'ColBERT{result_name}', result_df=results_df)
373 |         print('='*50)
374 |         print('mean_coreSet_score version')   
375 |         results_df=get_detection_score(test_label['label'], mean_coreSet_maxsim_pred,time_list, exp_name = f'mean ColBERT{result_name}', result_df=results_df)
376 |     # Restore the standard output
377 |     sys.stdout = sys.__stdout__
378 | 
379 |     # The with-block above already closed the file, so this call is a no-op
380 |     f.close()
381 | 
382 |     #save result
383 |     results_df.to_csv(os.path.join(save_path, f'{exp_log_file_name}_df.csv'))
384 |     #also save result_df as dict
385 |     result_dict={f'{exp_log_file_name}':results_df.to_dict()}
386 | 
387 |     with open(os.path.join(save_path, f'{exp_log_file_name}_dict.json'), 'w') as f:
388 |         json.dump(result_dict, f, indent=4)
389 |         


--------------------------------------------------------------------------------
/docker/Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
 2 | 
 3 | RUN apt-get update && apt-get upgrade -y
 4 | RUN pip install --upgrade pip
 5 | 
 6 | RUN apt-get install -y fonts-nanum
 7 | RUN rm -rf ~/.cache/matplotlib/*
 8 | 
 9 | RUN apt-get -q -y update && DEBIAN_FRONTEND=noninteractive apt-get -q -y install git curl vim tmux locales lsb-release python3-pip ssh && apt-get clean
10 | RUN apt-get update && apt-get install -y sudo
11 | 
12 | ## some basic utilities
13 | RUN pip install matplotlib seaborn scikit-learn scipy pandas numpy jupyter wandb
14 | 
15 | #install all from requirements.txt
16 | #start from copying requirements.txt
17 | COPY requirements.txt .
18 | RUN pip install -r requirements.txt
19 | 
20 | ## add locale:
21 | RUN locale-gen en_US.UTF-8 && /usr/sbin/update-locale LANG=en_US.UTF-8
22 | ENV LANG en_US.UTF-8
23 | ENV LANGUAGE en_US:en
24 | ENV LC_ALL en_US.UTF-8
25 | 
26 | # # Copy code
27 | # ARG user_password
28 | 
29 | RUN pip freeze > requirements.txt
30 | ARG UNAME
31 | ARG UID
32 | ARG GID
33 | RUN groupadd -g $GID -o $UNAME
34 | RUN useradd -m -u $UID -g $GID -o -s /bin/bash $UNAME
35 | 
36 | # # grant sudo privileges
37 | # RUN usermod -aG sudo $UNAME
38 | # # set the password
39 | # RUN echo "$UNAME:$user_password" | chpasswd
40 | USER $UNAME
41 | 
42 | 


--------------------------------------------------------------------------------
/docker/container.sh:
--------------------------------------------------------------------------------
 1 | container_name=logAD
 2 | image_name=bb451/log:cuml_1.13.1_11.6
 3 | 
 4 | echo "Container_name: " $container_name
 5 | echo "Image_name: " $image_name
 6 | 
 7 | docker run -td \
 8 | 	-p 3775:3775 \
 9 |     --ipc=host \
10 |     --name $container_name \
11 | 	--gpus all \
12 | 	-v /ssd2/logAD:/home/bb451/logAD \
13 | 	-v /etc/passwd:/etc/passwd \
14 | 	$image_name


--------------------------------------------------------------------------------
/docker/image.sh:
--------------------------------------------------------------------------------
1 | image_name=bb451/log:cuml_1.13.1_11.6
2 | user_password=3775
3 | echo "Image_name: " $image_name
4 | echo "User_password: " $user_password
5 | 
6 | docker build -t $image_name --build-arg UNAME=$(whoami) --build-arg UID=$(id -u) --build-arg GID=$(id -g) --build-arg user_password=$user_password .


--------------------------------------------------------------------------------
/image/RAPID_main.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DSBA-Lab/RAPID/46b31747073548ed8489421690c8909595397a58/image/RAPID_main.png


--------------------------------------------------------------------------------
/preprocess_rep.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | os.environ["TOKENIZERS_PARALLELISM"] = "false"
  3 | import sys
  4 | import pandas as pd
  5 | import numpy as np
  6 | import argparse 
  7 | 
  8 | import torch
  9 | from torch.nn import DataParallel
 10 | 
 11 | from transformers import (
 12 |     BertModel,
 13 |     BertTokenizer,
 14 |     AutoModel,
 15 |     AutoTokenizer,
 16 | )
 17 | 
 18 | import parmap
 19 | import multiprocessing
 20 | num_processors = multiprocessing.cpu_count()
 21 | 
 22 | #for time check
 23 | from tqdm import tqdm
 24 | import time
 25 | from utils import str2bool, set_seed, load_pickle, save_pickle, normalizeString, bgl_regex, tb_regex, hdfs_regex
 26 | 
 27 | def preprocess(sentence, flag, dataset):
 28 |     if "bgl" in dataset:
 29 |         timestamp, sentence = bgl_regex(sentence)
 30 |     elif "tbird" in dataset:
 31 |         timestamp, sentence = tb_regex(sentence)
 32 |     elif "hdfs" in dataset:
 33 |         timestamp, sentence = hdfs_regex(sentence)#hdfs: timestamp=block_id
 34 | 
 35 |     if flag=='test':
 36 |         if sentence.split()[0] == '-':
 37 |             test_label=0
 38 |         else:
 39 |             test_label=1
 40 |             if ("bgl" in dataset) or ("tbird" in dataset): #bgl, tbird: abnormal case has error category, don't need this
 41 |                 sentence = " ".join(sentence.split()[1:])
 42 |     sentence = normalizeString(sentence)
 43 |     if ("bgl" in dataset) or ("tbird" in dataset):
 44 |         sentence = " ".join(sentence.split()[3:]) #useless part remove
 45 |     elif "hdfs" in dataset:
 46 |         sentence = " ".join(sentence.split()[1:])
 47 | 
 48 |     if flag=='test':
 49 |         return timestamp,sentence,test_label
 50 |     else:
 51 |         return timestamp,sentence
 52 | 
 53 | def get_time_data(raw_data, flag, args): 
 54 |     #multiprocessing
 55 |     timestamp_representation_testLabel =np.array(parmap.map(preprocess, raw_data, flag, args.dataset, pm_pbar=True, pm_processes=num_processors-2))
 56 |     timestamps=timestamp_representation_testLabel[:,0]
 57 |     sentence_list=timestamp_representation_testLabel[:,1]
 58 |     if flag=='test':
 59 |         test_labels=timestamp_representation_testLabel[:,2]
 60 |     if flag=='test':
 61 |         return timestamps, sentence_list, test_labels
 62 |     else:
 63 |         return timestamps, sentence_list
 64 |     
 65 | def get_representation(model, sentence_list, batch_size=20000, max_length=512, pooling_strategy='all'):
 66 |     all_representations=torch.tensor([], dtype=torch.float32)
 67 |     for batch_sentences in tqdm(batch(sentence_list, batch_size), total=len(sentence_list)//batch_size+1):
 68 |         tokens=tokenizer(batch_sentences, add_special_tokens=True, return_tensors='pt', padding='max_length', max_length=max_length, truncation=True).to(device) # without max_length=128 and truncation=True, some sentences apparently come out at length 299
 69 | 
 70 |         with torch.no_grad():
 71 |             if pooling_strategy=='all':
 72 |                 representations=model(**tokens).last_hidden_state.detach().cpu()
 73 | 
 74 |         all_representations=torch.cat((all_representations, representations), dim=0)
 75 |     return all_representations
 76 | 
 77 | def batch(iterable, n = 1):
 78 |    current_batch = []
 79 |    for item in iterable:
 80 |        current_batch.append(item)
 81 |        if len(current_batch) == n:
 82 |            yield current_batch
 83 |            current_batch = []
 84 |    if current_batch:
 85 |        yield current_batch
 86 | 
 87 | def get_unique_values_table(sentence_in_list, unique_sentence_list):
 88 |     indices = np.where(unique_sentence_list == sentence_in_list)[0][0]
 89 |     return indices
 90 | 
 91 | def process(model, flag, args):
 92 |     # flag: train, validation, test
 93 |     if not args.need_split:
 94 |         if os.path.exists(os.path.join(temp_data_path,f'{flag}_timestamps.pkl')):
 95 |             print(f'{flag} already exists.')
 96 |             timestamps = load_pickle(os.path.join(temp_data_path,f'{flag}_timestamps'))
 97 |             sentence_list = load_pickle(os.path.join(temp_data_path,f'{flag}_sentence_list'))
 98 |         else:
 99 |             time_data=get_time_data(globals()[f'raw_{flag}'],flag, args)
100 |             timestamps=time_data[0]
101 |             sentence_list=time_data[1]
102 |             save_pickle(timestamps, os.path.join(temp_data_path,f'{flag}_timestamps'))
103 |             save_pickle(sentence_list, os.path.join(temp_data_path,f'{flag}_sentence_list'))
104 |             if flag == 'test':
105 |                 # test_label.pkl stores a DataFrame pairing each timestamp with its label;
106 |                 # build it here because the cached branch above never defines test_label
107 |                 test_label=time_data[2]
108 |                 test_label_df=pd.DataFrame({'timestamp':timestamps,'label':test_label})
109 |                 save_pickle(test_label_df, os.path.join(temp_data_path,'test_label'))
110 | 
111 |         print('timestamps, sentence_list done')
112 | 
113 |         sentence_array = np.array(sentence_list)
114 |         unique_sentence = np.unique(sentence_array)
115 | 
116 |         unique_lookup_table= np.array(parmap.map(get_unique_values_table, sentence_array, unique_sentence, pm_pbar=True, pm_processes=num_processors-2))
117 |         print('unique_lookup_table done')
118 |         unique_sentence=unique_sentence.tolist()
119 |         save_pickle(unique_lookup_table, os.path.join(temp_data_path,f'{flag}_unique_lookup_table'))
120 |         save_pickle(unique_sentence, os.path.join(temp_data_path,f'{flag}_unique_sentence'))
121 | 
122 |         if os.path.exists(os.path.join(temp_data_path,f'{flag}_representations.pkl')):
123 |             print(f'{flag} already exists.')
124 |         else:
125 |             representations=get_representation(model, unique_sentence, batch_size=args.batch_size, max_length = args.max_token_len, pooling_strategy=args.pooling_strategy)
126 |             save_pickle(representations, os.path.join(temp_data_path,f'{flag}_representations'))
127 |         print('representations done')
128 |     else:
129 |         def save_sentence_list(flag, args):
130 |             for i in tqdm(range(args.split_num), desc=f'{flag} time_data split'):
131 |                 print(f'{flag} time_data chunk {i}: get_time_data')
132 |                 if i == args.split_num-1:
133 |                     time_data=get_time_data(globals()[f'raw_{flag}'][int(len(globals()[f'raw_{flag}'])/args.split_num)*i:],flag, args)
134 |                 else:
135 |                     time_data=get_time_data(globals()[f'raw_{flag}'][int(len(globals()[f'raw_{flag}'])/args.split_num)*i:int(len(globals()[f'raw_{flag}'])/args.split_num)*(i+1)],flag, args)
136 |                 
137 |                 timestamps=time_data[0]
138 |                 sentence_list=time_data[1]
139 |                 if flag == 'test':
140 |                     test_label=time_data[2]
141 |                     save_pickle(test_label, os.path.join(temp_data_path,f'test_label_{i}'))
142 | 
143 |                     test_label_df=pd.DataFrame({'timestamp':timestamps,'label':test_label})
144 |                     save_pickle(test_label_df, os.path.join(temp_data_path,f'test_label_{i}'))
145 | 
146 |                 save_pickle(timestamps, os.path.join(temp_data_path,f'{flag}_timestamps_{i}'))
147 |                 save_pickle(sentence_list, os.path.join(temp_data_path,f'{flag}_sentence_list_{i}'))
148 |                 print(f'{flag}_timestamps_{i}, {flag}_sentence_list_{i} done')
149 | 
150 |         print(f'Preprocessing in {args.split_num} chunks, then recombining them to save the representations.')
151 | 
152 |         save_sentence_list(flag, args)
153 |         del globals()[f'raw_{flag}']
154 |         print('Reassembling the chunks into the full dataset')
155 |         #label
156 |         if flag == 'test':
157 |             test_label_df=pd.DataFrame({'timestamp':[],'label':[]})
158 |             for i in range(args.split_num):
159 |                 test_label_df=pd.concat([test_label_df, load_pickle(os.path.join(temp_data_path,f'test_label_{i}'))], axis=0, ignore_index=True)
160 |             save_pickle(test_label_df, os.path.join(temp_data_path,'test_label'))
161 |             print('test_label done')
162 | 
163 |         sentence_list=np.array([])
164 |         for i in range(args.split_num):
165 |             # concat all sentence_list
166 |             sentence_list=np.append(sentence_list, load_pickle(os.path.join(temp_data_path,f'{flag}_sentence_list_{i}')))
167 |         
168 |         print('unique start')
169 |         sentence_array = np.array(sentence_list)
170 |         unique_sentence = np.unique(sentence_array)
171 |         print(f'{flag} unique sentence len :{unique_sentence.shape}')
172 |         
173 |         unique_lookup_table= np.array(parmap.map(get_unique_values_table, sentence_array, unique_sentence, pm_pbar=True, pm_processes=num_processors-2))
174 |         unique_sentence=unique_sentence.tolist()
175 | 
176 |         save_pickle(unique_lookup_table, os.path.join(temp_data_path,f'{flag}_unique_lookup_table'))
177 |         save_pickle(unique_sentence, os.path.join(temp_data_path,f'{flag}_unique_sentence'))
178 |         print('unique_lookup_table done')
179 |             
180 |         print('get_representation start')
181 |         representations=get_representation(model, unique_sentence, batch_size=args.batch_size, max_length = args.max_token_len, pooling_strategy=args.pooling_strategy)
182 |         save_pickle(representations, os.path.join(temp_data_path,f'{flag}_representations'))
183 |         print(f'{flag}_representations done')
184 |                     
185 | 
186 | if __name__ == '__main__':
187 |     set_seed(1234)
188 |     parser = argparse.ArgumentParser()
189 |     parser.add_argument("--dataset", help="one of: hdfs, bgl, tbird", default="tbird")
190 |     parser.add_argument("--sample", help="sampling fraction or count, e.g. 0.1, 0.05, 100000", default=1, type=lambda x: int(x) if x.isdigit() else float(x))
191 |     parser.add_argument("--shuffle", help="True or False", default=True, type=str2bool)
192 |     parser.add_argument("--test_size", help="test_size", default=0.2, type=float)
193 |     parser.add_argument("--need_split", help="True or False", default=False, type=str2bool)
194 |     parser.add_argument("--split_num", help="number of chunks, e.g. 5 or 10", default=10, type=int)
195 |     parser.add_argument('--plm', type=str, default='bert-base-uncased')
196 |     parser.add_argument("--batch_size", default=8192, type=int)
197 |     parser.add_argument("--max_token_len", default=128, type=int, help='bgl, tbird:128, hdfs:512')
198 |     parser.add_argument("--pooling_strategy", help="cls, mean, or all (get_representation only handles 'all')", default="all")
199 | 
200 |     args = parser.parse_args()
201 | 
202 |     if args.sample != 1:
203 |         output_path=os.path.join(os.getcwd(), 'processed_data' ,f'{args.dataset}_sample_{str(args.sample)}')
204 |         raw_file_path=os.path.join(os.getcwd(),'processed_data', f'{args.dataset}_sample_{str(args.sample)}')
205 |     else:
206 |         output_path=os.path.join(os.getcwd(), 'processed_data' ,f'{args.dataset}')
207 |         raw_file_path=os.path.join(os.getcwd(),'processed_data', f'{args.dataset}')
208 | 
209 |     if not os.path.exists(output_path):
210 |         os.makedirs(output_path)
211 | 
212 |     #save temp files
213 |     if args.plm == 'pretrained_bgl':
214 |         temp_data_path=os.path.join(output_path, f'{args.test_size}','temp_bgl')
215 |     else :
216 |         temp_data_path=os.path.join(output_path, f'{args.test_size}',f'{args.plm}')
217 | 
218 |     if not os.path.exists(temp_data_path):
219 |         os.makedirs(temp_data_path)
220 |         
221 |     if os.path.exists(os.path.join(temp_data_path, 'rep_time.txt')):
222 |         print('This data has already been preprocessed.')
223 |         sys.exit()
224 | 
225 |     #device
226 |     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
227 |     if (device.type == 'cuda') and (torch.cuda.device_count() > 1):
228 |         print("Let's use", torch.cuda.device_count(), "GPUs!")
229 | 
230 |     # use a general pretrained BERT
231 |     if args.plm == 'pretrained_bgl':
232 |         model = BertModel.from_pretrained("./final_bert_model")
233 |     else :
234 |         model = AutoModel.from_pretrained(args.plm)
235 |     model= DataParallel(model)
236 |     model.to(device)
237 |     model.eval()
238 |     if args.plm == 'pretrained_bgl':
239 |         tokenizer = BertTokenizer.from_pretrained("./tokenizer/BGL_lanobert-vocab.txt")
240 |     else:
241 |         tokenizer = AutoTokenizer.from_pretrained(args.plm)
242 | 
243 |     # open file
244 |     with open(os.path.join(raw_file_path,f'train_{args.test_size}'), 'r', encoding='utf-8') as f:
245 |         raw_train = f.readlines()
246 | 
247 |     with open(os.path.join(raw_file_path,f'test_{args.test_size}'), 'r', encoding='utf-8') as f:
248 |         raw_test = f.readlines()
249 | 
250 |     # time check
251 |     start_time = time.time()
252 |     print('get train data & representation')
253 |     process(model, 'train', args)
254 | 
255 |     train_end_time = time.time()
256 |     print(f'train time: {train_end_time-start_time}')
257 | 
258 |     print('get test data & representation')
259 |     process(model, 'test', args)
260 |     test_end_time = time.time()
261 |     print(f'test time: {test_end_time-train_end_time}')
262 | 
263 |     print('train_test time: ', test_end_time-start_time)
264 |    
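265 |     # record the timings; rep_time.txt also marks this run as finished
266 |     # (its presence makes later invocations exit early, see above)
267 |     with open(os.path.join(temp_data_path, 'rep_time.txt'), 'w') as f:
268 |         # print(..., file=f) writes to the file without reassigning sys.stdout
269 |         print(f'train time: {train_end_time-start_time}', file=f)
270 |         print(f'test time: {test_end_time-train_end_time}', file=f)
271 |         print(f'train_test time: {test_end_time-start_time}', file=f)
272 | 
273 |     # the with-block has already closed the file, so there is no stdout to
274 |     # restore and no f.close() to call
275 | 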
276 | 
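
The `process` function above deduplicates the sentences, encodes only the unique ones, and stores a lookup table that maps every original sentence to its row in the unique list. The following is a minimal sketch of that mapping, not the repository's code: it swaps in NumPy's `np.unique(..., return_inverse=True)` for the per-sentence `np.where` search in `get_unique_values_table`, and the sample sentences are made up for illustration.

```python
import numpy as np

# Hypothetical parsed log lines; the real pipeline gets these from get_time_data.
sentence_array = np.array([
    "TIME kernel panic NUM",
    "TIME link up",
    "TIME kernel panic NUM",
    "TIME link up",
    "TIME fan speed NUM",
])

# np.unique sorts the unique values; return_inverse gives, for every original
# element, the index of its value inside that sorted unique array.
unique_sentence, unique_lookup_table = np.unique(sentence_array, return_inverse=True)
unique_lookup_table = unique_lookup_table.reshape(-1)  # keep it flat and 1-D

# Indexing the unique array with the lookup table rebuilds the original order,
# so anything computed once per unique sentence can be expanded back.
restored = unique_sentence[unique_lookup_table]
```

Because the inverse indices come out of a single sort, this avoids scanning the unique array once per sentence; whether that matters depends on how many duplicate lines the dataset contains.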


--------------------------------------------------------------------------------
/scripts/all_at_once.sh:
--------------------------------------------------------------------------------
  1 | 
  2 | test_size=0.2
  3 | unsing_gpu=0
  4 | 
  5 | python split_data.py --dataset bgl --test_size $test_size
  6 | python split_data.py --dataset tbird --sample 5000000 --test_size $test_size
  7 | python split_data.py --dataset hdfs --test_size $test_size
  8 | 
  9 | python preprocess_rep.py --dataset bgl --batch_size 8192 --max_token_len 128 --test_size $test_size
 10 | python preprocess_rep.py --dataset tbird --sample 5000000 --batch_size 8192 --max_token_len 128 --need_split True --split_num 2 --test_size $test_size
 11 | python preprocess_rep.py --dataset hdfs --batch_size 8192 --max_token_len 512 --test_size $test_size
 12 | 
 13 | for plm in google/electra-base-discriminator roberta-base
 14 | do
 15 |     python preprocess_rep.py --dataset bgl --plm $plm --batch_size 8192 --max_token_len 128 --test_size $test_size
 16 |     python preprocess_rep.py --dataset tbird --sample 5000000 --plm $plm --batch_size 8192 --max_token_len 128 --need_split True --split_num 2 --test_size $test_size
 17 |     python preprocess_rep.py --dataset hdfs --plm $plm --batch_size 8192 --max_token_len 512 --test_size $test_size
 18 | done    
 19 | 
 20 | for train_ratio in 1
 21 | do
 22 |     for coreSet in 0.01 0
 23 |     do
 24 |         for only_cls in False
 25 |         do
 26 |             echo =====================RQ1, train_ratio:$train_ratio, coreSet:$coreSet, only_cls:$only_cls=====================
 27 |             # BGL
 28 |             echo BGL    
 29 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset bgl --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
 30 | 
 31 |             #HDFS
 32 |             echo HDFS
 33 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset hdfs --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
 34 | 
 35 |             #TBIRD
 36 |             echo TBIRD
 37 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset tbird --sample 5000000 --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
 38 |         done
 39 |     done
 40 | done
 41 | 
 42 | for train_ratio in 1 0.1 0.05 0.01 0.001
 43 | do
 44 |     for coreSet in 0.01 0.1 1 0
 45 |     do
 46 |         for only_cls in False True
 47 |         do
 48 |             echo =====================RQ6, train_ratio:$train_ratio, coreSet:$coreSet, only_cls:$only_cls=====================
 49 |             # BGL
 50 |             echo BGL    
 51 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset bgl --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
 52 | 
 53 |             #HDFS
 54 |             echo HDFS
 55 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset hdfs --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
 56 | 
 57 |             #TBIRD
 58 |             echo TBIRD
 59 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset tbird --sample 5000000 --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
 60 |         done
 61 |     done
 62 | done
 63 | 
 64 | for train_ratio in 1 0.1 0.05 0.01 0.001
 65 | do
 66 |     for coreSet in 0 1 2 5 10 0.01 0.05 0.1 0.2 0.3 0.5
 67 |     do
 68 |         for only_cls in False
 69 |         do
 70 |             echo =====================RQ2,4, train_ratio:$train_ratio, coreSet:$coreSet, only_cls:$only_cls=====================
 71 |             # BGL
 72 |             echo BGL    
 73 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset bgl --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
 74 | 
 75 |             #HDFS
 76 |             echo HDFS
 77 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset hdfs --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
 78 | 
 79 |             #TBIRD
 80 |             echo TBIRD
 81 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset tbird --sample 5000000 --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
 82 |         done
 83 |     done
 84 | done
 85 | 
 86 | 
 87 | for train_ratio in 1 0.9 0.8 0.5 0.3 0.2 0.1 0.05 0.01 0.001
 88 | do
 89 |     for coreSet in 0 0.01
 90 |     do
 91 |         for only_cls in False
 92 |         do
 93 |             echo =====================RQ5, train_ratio:$train_ratio, coreSet:$coreSet, only_cls:$only_cls=====================
 94 |             # BGL
 95 |             echo BGL    
 96 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset bgl --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
 97 | 
 98 |             #HDFS
 99 |             echo HDFS
100 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset hdfs --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
101 | 
102 |             #TBIRD
103 |             echo TBIRD
104 |             CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset tbird --sample 5000000 --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
105 |         done
106 |     done
107 | done
108 | 
109 | for plm in google/electra-base-discriminator roberta-base
110 | do
111 |     # python preprocess_rep.py --dataset bgl --plm $plm --batch_size 8192 --max_token_len 128 --test_size $test_size
112 |     # python preprocess_rep.py --dataset tbird --sample 5000000 --plm $plm --batch_size 8192 --max_token_len 128 --need_split True --split_num 2 --test_size $test_size
113 |     # python preprocess_rep.py --dataset hdfs --plm $plm --batch_size 8192 --max_token_len 512 --test_size $test_size
114 |     
115 |     for train_ratio in 1
116 |     do
117 |         for coreSet in 0.01
118 |         do
119 |             for only_cls in False
120 |             do
121 |                 echo =====================RQ3, plm:$plm, train_ratio:$train_ratio, coreSet:$coreSet, only_cls:$only_cls=====================
122 |                 # BGL
123 |                 echo BGL    
124 |                 CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --plm $plm --dataset bgl --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
125 | 
126 |                 #HDFS
127 |                 echo HDFS
128 |                 CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --plm $plm --dataset hdfs --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
129 | 
130 |                 #TBIRD
131 |                 echo TBIRD
132 |                 CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --plm $plm --dataset tbird --sample 5000000 --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
133 |             done
134 |         done
135 |     done
136 | done
137 | 
138 | 


--------------------------------------------------------------------------------
/split_data.py:
--------------------------------------------------------------------------------
  1 | import sys
  2 | import os
  3 | import pandas as pd
  4 | import numpy as np
  5 | import argparse
  6 | from tqdm import tqdm
  7 | import json
  8 | 
  9 | import re
 10 | from collections import defaultdict
 11 | 
 12 | import random
 13 | from sklearn.model_selection import train_test_split
 14 | 
 15 | import parmap
 16 | import multiprocessing
 17 | num_processors = multiprocessing.cpu_count()
 18 | 
 19 | from utils import str2bool, set_seed, get_parsed_log, get_unique_log, label_parsed_log
 20 | 
 21 | def save_processed_log(data, path, need_newline=False):
 22 |     if not need_newline:
 23 |         with open(path, 'w') as f:
 24 |             for log in data:
 25 |                 f.write(log)
 26 |     else:
 27 |         with open(path, 'w') as f:
 28 |             for log in data:
 29 |                 f.write(log)
 30 |                 f.write('\n')
 31 | 
 32 | if __name__ == '__main__':
 33 |     set_seed(1234)
 34 |     parser = argparse.ArgumentParser()
 35 |     parser.add_argument("--dataset", help="one of: hdfs, bgl, tbird", default="tbird")
 36 |     parser.add_argument("--shuffle", help="shuffle data", default=True, type=str2bool)
 37 |     parser.add_argument("--sample", help="sampling fraction or count, e.g. 0.1, 0.05", default=1, type=lambda x: int(x) if x.isdigit() else float(x))
 38 |     parser.add_argument("--test_size", help="test_size", default=0.2, type=float)
 39 |     args = parser.parse_args()
 40 |     current_dir = os.path.dirname(os.path.abspath(__file__))
 41 | 
 42 |     if args.dataset == "bgl":
 43 |         data_dir = os.path.join(current_dir, 'dataset', 'bgl')
 44 |         log_file = "BGL.log"
 45 |         output_dir = os.path.join(current_dir, 'processed_data', f'bgl')
 46 |     elif args.dataset == "tbird":
 47 |         data_dir = os.path.join(current_dir, 'dataset', 'tbird')
 48 |         log_file = "Thunderbird.log"
 49 |         if args.sample != 1:
 50 |             output_dir = os.path.join(current_dir, 'processed_data', f'tbird_sample_{str(args.sample)}')
 51 |         else:
 52 |             output_dir = os.path.join(current_dir, 'processed_data', f'tbird')
 53 |     
 54 |     elif args.dataset == "hdfs":
 55 |         # we don't split hdfs dataset with this code
 56 |         data_dir = os.path.join(current_dir, 'dataset', 'hdfs')
 57 |         output_dir = os.path.join(current_dir, 'processed_data', f'hdfs')
 58 | 
 59 |         log_file = "HDFS.log"
 60 |         blk_label_file = os.path.join(data_dir,"anomaly_label.csv")     
 61 |         
 62 |     if not os.path.exists(output_dir):
 63 |         os.makedirs(output_dir)
 64 | 
 65 |     # load dataset and get normal & abnormal
 66 |     if os.path.exists(os.path.join(output_dir, f'train_{args.test_size}')) and os.path.exists(os.path.join(output_dir, f'test_{args.test_size}')):
 67 |         print("Already split dataset")
 68 |         sys.exit()  
 69 | 
 70 | 
 71 |     print("Split dataset")
 72 |     if args.dataset != "hdfs":
 73 |         #open data_dir + log_file
 74 |         with open(os.path.join(data_dir, log_file), 'r', errors='ignore') as f:
 75 |             labels = []
 76 |             data=[]
 77 |             normal_data = []
 78 |             abnormal_data = []
 79 |             idx = 0
 80 |             for line in tqdm(f, desc='get data'):
 81 |                 labels.append(line.split()[0] != '-')
 82 |                 if labels[-1]:
 83 |                     abnormal_data.append(line)
 84 |                 else:
 85 |                     normal_data.append(line)
 86 |                 data.append(line)
 87 |                 idx += 1
 88 |     else:
 89 |         #hdfs
 90 |         if os.path.exists(os.path.join(data_dir, 'preprocessed_data_df.csv')):
 91 |             print("preprocessed hdfs:preprocessed_data_df.csv exists")
 92 |             import ast
 93 |             def str_to_list(s):
 94 |                 return ast.literal_eval(s)
 95 |             data_df=pd.read_csv(os.path.join(data_dir, 'preprocessed_data_df.csv'), converters={'Raw':str_to_list,'labeled_Raw': str_to_list, 'parsed_unique_log': str_to_list})
 96 | 
 97 |         else:
 98 |             print("preprocess hdfs:preprocessed_data_df.csv")
 99 |             
100 |             with open(os.path.join(data_dir, log_file), 'r', errors='ignore') as f:
101 |                 data=[]
102 |                 for line in tqdm(f, total=11175629, desc='get data'):
103 |                     data.append(line)
104 |             #list to dataframe
105 |             df = pd.DataFrame(data, columns=['Raw']) #raw data
106 | 
107 |             data_dict = defaultdict(list) #preserve insertion order of items
108 |             for idx, row in tqdm(df.iterrows(), total=df.shape[0], desc='find blk_id'):
109 |                 blkId_list = re.findall(r'(blk_-?\d+)', row['Raw']) #find all block ids in log Content
110 |                 blkId_set = set(blkId_list)
111 |                 for blk_Id in blkId_set:
112 |                     data_dict[blk_Id].append(row["Raw"])
113 | 
114 |             data_df = pd.DataFrame(list(data_dict.items()), columns=['BlockId', 'Raw'])
115 |             # make dataframe:blk_df to dict:blk_label_dict
116 |             blk_df=pd.read_csv(blk_label_file)
117 |             blk_label_dict = dict(zip(blk_df.BlockId, blk_df.Label))
118 |             blk_label_dict = {k: 1 if v == 'Anomaly' else 0 for k, v in blk_label_dict.items()}
119 | 
120 |             data_df["Label"] = data_df["BlockId"].apply(lambda x: blk_label_dict.get(x)) #add label to the sequence of each blockid
121 |             
122 |             parsed_unique_log=parmap.map(get_parsed_log, data_df['Raw'], pm_pbar=True, pm_processes=num_processors-2)
123 |             parsed_unique_log=parmap.map(get_unique_log, parsed_unique_log, pm_pbar=True, pm_processes=num_processors-2)
124 | 
125 |             data_df['parsed_unique_log']=parsed_unique_log
126 |             data_df=label_parsed_log(data_df)
127 |             data_df.to_csv(os.path.join(data_dir, 'preprocessed_data_df.csv'), index=False)
128 | 
129 |         normal_data = data_df[data_df['Label'] == 0]['labeled_parsed_unique_concat'].tolist()
130 |         abnormal_data = data_df[data_df['Label'] == 1]['labeled_parsed_unique_concat'].tolist()
131 | 
132 |     #split dataset
133 |     if args.sample != 1:
134 |         # sample == float or int
135 |         # sample data with max_num
136 |         #get normal, abnormal data ratio and get # of each max
137 |         ab_ratio=len(abnormal_data)/(len(normal_data)+len(abnormal_data))
138 | 
139 |         if isinstance(args.sample, float):
140 |             normal_data = random.sample(normal_data, int(len(normal_data)*args.sample))
141 |             abnormal_data = random.sample(abnormal_data, int(len(abnormal_data)*args.sample))
142 |             normal_train_val, normal_test = train_test_split(normal_data, test_size=args.test_size, random_state=1234, shuffle=args.shuffle)
143 | 
144 |         elif isinstance(args.sample, int):
145 |             print("sample data with specific integer num")
146 |             normal_data = random.sample(normal_data, int(args.sample*(1-ab_ratio)))
147 |             abnormal_data = random.sample(abnormal_data, int(args.sample*ab_ratio))
148 |             normal_train_val, normal_test = train_test_split(normal_data, test_size=args.test_size, random_state=1234, shuffle=args.shuffle)
149 |     else:
150 |         normal_train_val, normal_test = train_test_split(normal_data, test_size=args.test_size, random_state=1234, shuffle=args.shuffle)
151 | 
152 |     test = normal_test + abnormal_data
153 | 
154 |     if args.dataset == "hdfs":
155 |         need_newline=True
156 |     else:
157 |         need_newline=False
158 | 
159 |     save_processed_log(normal_train_val, os.path.join(output_dir, f'train_{args.test_size}'), need_newline)
160 |     save_processed_log(test, os.path.join(output_dir, f'test_{args.test_size}'),need_newline) 
161 | 
162 |     data_size={}
163 |     data_size['train_normal']=len(normal_train_val)
164 |     data_size['test_normal']=len(normal_test)
165 |     data_size['test_abnormal']=len(abnormal_data)
166 |     with open(os.path.join(output_dir, f'data_size_dict_{args.test_size}.json'), 'w') as f:
167 |         json.dump(data_size, f, indent=4)
168 | 
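
When `--sample` is an integer, the split above draws roughly that many lines while keeping the abnormal ratio of the full log. The sketch below isolates that step under stated assumptions: `sample_by_count` is a hypothetical helper, the data is synthetic, and a local `random.Random` stands in for the global seeding done by `set_seed`.

```python
import random

def sample_by_count(normal_data, abnormal_data, sample, seed=1234):
    """Draw about `sample` lines while keeping the abnormal ratio of the full data."""
    rng = random.Random(seed)  # local RNG (assumption); the script seeds the global one
    ab_ratio = len(abnormal_data) / (len(normal_data) + len(abnormal_data))
    n_normal = int(sample * (1 - ab_ratio))   # int() truncates, as in the script
    n_abnormal = int(sample * ab_ratio)
    return rng.sample(normal_data, n_normal), rng.sample(abnormal_data, n_abnormal)

# Synthetic stand-ins for the normal and abnormal log lines.
normal = [f"- normal {i}" for i in range(750)]
abnormal = [f"ALERT abnormal {i}" for i in range(250)]
sampled_normal, sampled_abnormal = sample_by_count(normal, abnormal, sample=200)
```

Since both counts are truncated with `int()`, the two samples together can contain slightly fewer than `sample` lines when the ratio does not divide evenly.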


--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import os
  3 | import random
  4 | import numpy as np
  5 | import torch
  6 | import pickle
  7 | 
  8 | import re
  9 | import unicodedata
 10 | from datetime import datetime
 11 | 
 12 | def str2bool(v):
 13 |     if isinstance(v, bool):
 14 |         return v
 15 |     if v.lower() in ('yes', 'true', 't', 'y', '1'):
 16 |         return True
 17 |     elif v.lower() in ('no', 'false', 'f', 'n', '0'):
 18 |         return False
 19 |     else:
 20 |         raise argparse.ArgumentTypeError('Boolean value expected.')
 21 |     
 22 | def set_seed(random_seed):
 23 |     torch.manual_seed(random_seed)
 24 |     torch.cuda.manual_seed(random_seed)
 25 |     torch.cuda.manual_seed_all(random_seed)  # if use multi-GPU
 26 |     torch.backends.cudnn.deterministic = True
 27 |     torch.backends.cudnn.benchmark = False
 28 |     np.random.seed(random_seed)
 29 |     random.seed(random_seed)
 30 | 
 31 | def load_pickle(path):
 32 |     with open(path+'.pkl', 'rb') as f:
 33 |         data = pickle.load(f)
 34 |     return data
 35 | 
 36 | def save_pickle(data, path):
 37 |     with open(path+'.pkl', 'wb') as f:
 38 |         pickle.dump(data, f)
 39 | 
 40 | def unicodeToAscii(s):
 41 |     return "".join(
 42 |         c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn"
 43 |     )
 44 | 
 45 | def normalizeString(s):
 46 |     s = unicodeToAscii(s.lower().strip())
 47 |     s = re.sub(r"[^a-zA-Z<>]+", r" ", s) # only english, del: num, special char
 48 |     s = re.sub(r"\s+", r" ", s).strip() # del white space
 49 |     return s
 50 | 
 51 | #for dataset preprocessing
 52 | # for bgl
 53 | def bgl_regex(log):
 54 |     date_time_regex = re.compile(
 55 |         r"\d{1,4}\-\d{1,2}\-\d{1,2}-\d{1,2}.\d{1,2}.\d{1,2}.\d{1,6}"
 56 |     )
 57 |     date_regex = re.compile(r"\d{1,4}\.\d{1,2}\.\d{1,2}")
 58 |     ip_regex = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d{1,5})?")
 59 |     server_regex = re.compile(r"\S+(?=.*[0-9])(?=.*[a-zA-Z])(?=[:]+)\S+")
 60 |     server_regex2 = re.compile(r"\S+(?=.*[0-9])(?=.*[a-zA-Z])(?=[-])\S+")
 61 |     ecid_regex = re.compile(r"[A-Z0-9]{28}")
 62 |     serial_regex = re.compile(r"[a-zA-Z0-9]{48}")
 63 |     memory_regex = re.compile(r"0[xX][0-9a-fA-F]\S+")
 64 |     path_regex = re.compile(r".\S+(?=.[0-9a-zA-Z])(?=[/]).\S+")
 65 |     iar_regex = re.compile(r"[0-9a-fA-F]{8}")
 66 |     num_regex = re.compile(r"(\d+)")
 67 |     
 68 |     timestamp = (np.array([str(datetime.strptime(re.findall(date_time_regex, log)[0],'%Y-%m-%d-%H.%M.%S.%f'))])).item()
 69 |     tmp = re.sub(date_time_regex, " TIME ", log)
 70 |     tmp = re.sub(ip_regex, " IP ", tmp)
 71 |     tmp = re.sub(date_regex, " TIME ", tmp)
 72 |     tmp = re.sub(path_regex, " PATH ", tmp)
 73 |     tmp = re.sub(server_regex, " SERVER ", tmp)
 74 |     tmp = re.sub(server_regex2, " SERVER ", tmp)
 75 |     tmp = re.sub(ecid_regex, " ECID ", tmp)
 76 |     tmp = re.sub(serial_regex, " SERIAL ", tmp)
 77 |     tmp = re.sub(memory_regex, " MEMORY ", tmp)
 78 |     tmp = re.sub(iar_regex, " IAR ", tmp)
 79 |     tmp = re.sub(num_regex, " NUM ", tmp)
 80 |     return timestamp, tmp
 81 | 
 82 | def tb_regex(log):
 83 |     date_regex = re.compile(r"\d{2,4}\.\d{1,2}\.\d{1,2}\s")
 84 |     date_regex2 = re.compile(
 85 |         r"(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+(\d{1,2})\s+"
 86 |     )
 87 |     time_regex = re.compile(r"\d{1,2}\:\d{1,2}\:\d{1,2}")
 88 |     id_regex = re.compile(r"DATE\s.*\sDATE")
 89 | 
 90 |     account_regex = re.compile(r"(\w+[\w\.]*)@(\w+[\w\.]*)\-(\w+[\w\.]*)")
 91 |     account_regex2 = re.compile(r"(\w+[\w\.]*)@(\w+[\w\.]*)")
 92 |     account_regex3 = re.compile(r"TIME\s\S+")
 93 | 
 94 |     dir_regex = re.compile(r'[a-zA-Z0-9_\-\.\/]+\/[a-zA-Z0-9_\-\.\/]+\/[a-zA-Z0-9_\-\.\/]*') # does not start with / and has two or more levels of /
 95 |     dir_regex2 = re.compile(r'\/[a-zA-Z0-9_\-\.\/]+\/[a-zA-Z0-9_\-\.\/]*') # starts with / and has a single level of /
 96 |     iar_regex = re.compile(r"[0-9a-fA-F]{10}")
 97 |     ip_regex = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d{1,5})?")
 98 |     num_regex = re.compile(r"(\[\d+\])")
 99 | 
100 |     date_time_str=re.findall(date_regex, log)[0]+" "+re.findall(time_regex, log)[0]
101 |     timestamp = (np.array([str(datetime.strptime(date_time_str,'%Y.%m.%d %H:%M:%S'))])).item()
102 |     tmp = re.sub(date_regex, "DATE ", log)
103 |     tmp = re.sub(date_regex2, "DATE ", tmp)
104 |     tmp = re.sub(id_regex, "DATE ID DATE", tmp)
105 |     tmp = re.sub(time_regex, "TIME", tmp)
106 |     tmp = re.sub(account_regex3, "TIME ACCOUNT", tmp) ## TIME / TIME ACCOUNT
107 |     tmp = re.sub(account_regex, "ACCOUNT", tmp)
108 |     tmp = re.sub(account_regex2, "ACCOUNT", tmp)
109 |     tmp = re.sub(dir_regex, " DIR ", tmp)
110 |     tmp = re.sub(dir_regex2, " DIR ", tmp)
111 |     tmp = re.sub(ip_regex, "IP", tmp)
112 |     tmp = re.sub(iar_regex, "IAR", tmp)
113 |     tmp = re.sub(num_regex, " NUM ", tmp)
114 | 
115 |     return timestamp, tmp
116 | 
117 | def hdfs_regex(log):
118 |     id_regex = re.compile(r"blk_.\d+")
119 |     ip_regex = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d{1,5})?")
120 |     num_regex = re.compile(r"\d*\d")
121 | 
122 |     block_id = re.findall(id_regex, log)[0]
123 |     tmp = re.sub(id_regex, "BLK", log) # already parsed in dataset preprocessing, del block_id
124 |     tmp = re.sub(ip_regex, "IP", tmp)
125 |     tmp = re.sub(num_regex, "NUM", tmp)
126 |     return block_id, tmp
127 | 
128 | # for hdfs data split
129 | def concat_list_str(row):
130 |     # delete \n & concatenate
131 |     return ' '.join(list(map(lambda x: (x.replace('\n','')),row)))
132 | 
133 | def add_label_Raw_blk(row):
134 |     blk = concat_list_str(row)
135 |     blk = "- "+blk
136 |     return blk
137 | 
138 | def get_parsed_log(df_row):
139 |     blk_log=[]
140 |     for i, log in enumerate(df_row):
141 |         parsed=hdfs_regex(' '.join(log.split()[3:]))
142 |         if i ==0:
143 |             blk_log.append(parsed[0])
144 |         blk_log.append(normalizeString(parsed[1]))        
145 |     return blk_log
146 |     
147 | def get_unique_log(df_row):
148 |     return np.unique(df_row).tolist()
149 | 
150 | def label_parsed_log(data_df):
151 |     data_df['labeled_parsed_unique_concat']=data_df.apply(lambda row: add_label_Raw_blk(row['parsed_unique_log']) if (row['Label'] == 0) else concat_list_str(row['parsed_unique_log']), axis=1)
152 |     return data_df
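
To make the HDFS masking above concrete, here is a small self-contained sketch that applies the same three substitutions as `hdfs_regex` and the same cleanup as `normalizeString` to one log message. The helper names `mask_hdfs` and `normalize` and the sample line are illustrative assumptions, not part of `utils.py`.

```python
import re
import unicodedata

def mask_hdfs(log):
    # Same three patterns as hdfs_regex, written as raw strings.
    id_regex = re.compile(r"blk_.\d+")
    ip_regex = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d{1,5})?")
    num_regex = re.compile(r"\d*\d")
    block_id = id_regex.findall(log)[0]      # first block id in the message
    tmp = id_regex.sub("BLK", log)           # mask block ids
    tmp = ip_regex.sub("IP", tmp)            # mask IPs, including an optional port
    tmp = num_regex.sub("NUM", tmp)          # mask any remaining digit runs
    return block_id, tmp

def normalize(s):
    # Lowercase, strip accents, then keep only letters and angle brackets.
    s = unicodedata.normalize("NFD", s.lower().strip())
    s = "".join(c for c in s if unicodedata.category(c) != "Mn")
    s = re.sub(r"[^a-zA-Z<>]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()

# Illustrative HDFS-style message (the leading date/time/pid fields are omitted).
line = "Receiving block blk_-1608999687919862906 src: /10.250.19.102:54106 dest: /10.250.19.102:50010"
block_id, masked = mask_hdfs(line)
normalized = normalize(masked)
```

The order of the substitutions matters: block ids and IP addresses are masked before the generic number pattern, otherwise their digits would be replaced by `NUM` first.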


--------------------------------------------------------------------------------