├── .gitignore ├── README.md ├── ad_test_coreSet.py ├── docker ├── Dockerfile ├── container.sh └── image.sh ├── image └── RAPID_main.png ├── preprocess_rep.py ├── scripts └── all_at_once.sh ├── split_data.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pkl 2 | *.json 3 | *.log 4 | *_parser 5 | *.bin 6 | *.tar.gz 7 | *.gz 8 | *.npz 9 | *.csv 10 | *_0.2 11 | *_0.3 12 | *_0.5 13 | *.txt 14 | 15 | # !processed_data/*/*/results/*.pkl 16 | dataset/ 17 | processed_data/*/ 18 | 19 | # Ignore everything in logAD/dataset/LAnoBERT_split 20 | dataset/LAnoBERT_split/ 21 | 22 | # logAD/logbert/output 23 | logbert/output/ 24 | 25 | final_bert_model/ 26 | 27 | __pycache__/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # RAPID 2 | **RAPID: Training-free Retrieval-based Log Anomaly Detection with PLM considering Token-level information** 3 | 4 | [Gunho No](https://www.linkedin.com/in/%EA%B1%B4%ED%98%B8-%EB%85%B8-58b4a9298/)*, [Yukyung Lee](https://www.linkedin.com/in/yukyung-lee-149681155/)\*, [Hyeongwon Kang](https://www.linkedin.com/in/hyeongwon/) and [Pilsung Kang](https://github.com/pilsung-kang) 5 |
(*equal contribution) 6 | 7 | This repository is the official implementation of "RAPID". 8 | 9 | ![RAPID Architecture](image/RAPID_main.png) 10 | 11 | ``` 12 | RAPID/ 13 | │ 14 | ├── split_data.py # Dataset splitting and preprocessing 15 | ├── preprocess_rep.py # Log representation generation via Language Model 16 | ├── ad_test_coreSet.py # Anomaly detection algorithm 17 | ├── utils.py 18 | │ 19 | ├── scripts/ 20 | │ └── all_at_once.sh # End-to-end experiment runner 21 | │ 22 | ├── processed_data/ # Directory for processed datasets 23 | │ ├── bgl/ 24 | │ ├── tbird/ 25 | │ └── hdfs/ 26 | ``` 27 | 28 | ## Datasets 29 | RAPID is evaluated on three public datasets: 30 | 31 | * BGL (Blue Gene/L) 32 | * Thunderbird 33 | * HDFS 34 | 35 | Place the raw datasets in the dataset/ directory before running the preprocessing scripts. 36 | 37 | ## Running the Experiments 38 | ### Full Pipeline 39 | To reproduce all experiments from the paper: 40 | ``` 41 | bash scripts/all_at_once.sh 42 | ``` 43 | This script runs the entire pipeline, including data preprocessing, representation generation, and anomaly detection across multiple configurations as described in our paper. 44 | 45 | ### Step-by-step Execution 46 | 47 | 1. Data Splitting and Preprocessing: 48 | ``` 49 | python split_data.py --dataset [bgl/tbird/hdfs] --test_size 0.2 50 | ``` 51 | 52 | 2. Get Representation: 53 | ``` 54 | python preprocess_rep.py --dataset [bgl/tbird/hdfs] --plm bert-base-uncased --batch_size 8192 --max_token_len [128/512] 55 | ``` 56 | 57 | 3. 
Anomaly Detection: 58 | ``` 59 | python ad_test_coreSet.py --dataset [bgl/tbird/hdfs] --train_ratio 1 --coreSet 0.01 --only_cls False 60 | ``` 61 | 62 | ## Key Parameters 63 | 64 | * `--dataset`: Choose the dataset (bgl, tbird, hdfs) 65 | * `--sample`: Sample size for large datasets (e.g., 5000000 for Thunderbird) 66 | * `--plm`: Pre-trained language model (bert-base-uncased, roberta-base, google/electra-base-discriminator) 67 | * `--coreSet`: Core set size or ratio (0, 0.01, 0.1, etc.) 68 | * `--train_ratio`: Ratio of training data to use (1, 0.1, 0.01, etc.) 69 | * `--only_cls`: Whether to use only the CLS token representation (True/False) 70 | 71 | ## Results 72 | After running the experiments, results will be saved in the `processed_data/[dataset]/[test_size]/[plm]/results/` directory. Each experiment produces a CSV file and a JSON file with detailed performance metrics. 73 | 74 | ## Citation 75 | If you find this code useful for your research, please cite our paper: 76 | 77 | ``` 78 | @article{NO2024108613, 79 | title = {Training-free retrieval-based log anomaly detection with pre-trained language model considering token-level information}, 80 | journal = {Engineering Applications of Artificial Intelligence}, 81 | volume = {133}, 82 | pages = {108613}, 83 | year = {2024}, 84 | issn = {0952-1976}, 85 | doi = {https://doi.org/10.1016/j.engappai.2024.108613}, 86 | url = {https://www.sciencedirect.com/science/article/pii/S0952197624007711}, 87 | author = {Gunho No and Yukyung Lee and Hyeongwon Kang and Pilsung Kang} 88 | } 89 | ``` 90 | -------------------------------------------------------------------------------- /ad_test_coreSet.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | from tqdm import tqdm 5 | import os 6 | import pickle 7 | import json 8 | 9 | from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, precision_score, 
recall_score, roc_auc_score, roc_curve, auc, precision_recall_curve 10 | from sklearn.neighbors import NearestNeighbors 11 | 12 | import numpy as np 13 | 14 | import torch 15 | from torch import matmul 16 | 17 | #user warning 18 | import warnings 19 | warnings.filterwarnings("ignore") 20 | 21 | import argparse 22 | import sys 23 | 24 | from utils import str2bool, load_pickle, save_pickle, set_seed 25 | #for time check 26 | import time 27 | 28 | def get_threshold_prc(score, test_label, not_to_numpy=False): 29 | if not_to_numpy: 30 | precision, recall, thresholds = precision_recall_curve(test_label, score) 31 | else: 32 | precision, recall, thresholds = precision_recall_curve(test_label, score.to_numpy()) 33 | #get best f1score 34 | f1 = np.array([2 * (pr * re) / (pr + re + 1e-10) for pr, re in zip(precision, recall)]) 35 | 36 | ix = np.argmax(f1) 37 | best_thresh = thresholds[ix] 38 | return best_thresh 39 | 40 | def get_threshold_roc(score, test_label, not_to_numpy=False): 41 | if not_to_numpy: 42 | fpr, tpr, thresholds =roc_curve(test_label, score) 43 | else: 44 | fpr, tpr, thresholds =roc_curve(test_label, score.to_numpy()) 45 | J = tpr - fpr 46 | ix = np.argmax(J) 47 | best_thresh = thresholds[ix] 48 | print('Best Threshold=%f, sensitivity = %.3f, specificity = %.3f, J=%.3f' % (best_thresh, tpr[ix], 1-fpr[ix], J[ix])) 49 | return best_thresh 50 | 51 | def get_detection_score(label, pred, time_list,exp_name = 'current_exp', result_df=None): 52 | time_for_get_coreSet, time_for_cal_maxsim, time_for_get_adscore_for_all = time_list 53 | # get detection score 54 | print(f'confusion_matrix: \n{confusion_matrix(label, pred)}') 55 | print(f'accuracy_score: {accuracy_score(label, pred)}') 56 | print(f'f1_score: {f1_score(label, pred)}') 57 | print(f'precision_score: {precision_score(label, pred)}') 58 | print(f'recall_score: {recall_score(label, pred)}') 59 | print(f'roc_auc_score: {roc_auc_score(label, pred)}') 60 | print(classification_report(label, pred)) 61 | 62 | if 
result_df is None: 63 | #make new dataframe and make exp_name to be index 64 | result_df = pd.DataFrame() 65 | result_df['exp_name'] = [exp_name] 66 | result_df = result_df.set_index('exp_name') 67 | result_df['f1_score'] = [f1_score(label, pred)] 68 | result_df['roc_auc_score'] = [roc_auc_score(label, pred)] 69 | result_df['precision_score'] = [precision_score(label, pred)] 70 | result_df['recall_score'] = [recall_score(label, pred)] 71 | result_df['accuracy_score'] = [accuracy_score(label, pred)] 72 | result_df['coreSet_time'] = [time_for_get_coreSet] 73 | result_df['maxsim_time'] = [time_for_cal_maxsim] 74 | result_df['lookup_all_adscore_time'] = [time_for_get_adscore_for_all] 75 | else: 76 | result_df.loc[exp_name] = [f1_score(label, pred),roc_auc_score(label, pred), precision_score(label, pred), recall_score(label, pred), accuracy_score(label, pred), 77 | time_for_get_coreSet, time_for_cal_maxsim, time_for_get_adscore_for_all] 78 | return result_df 79 | 80 | def get_threshold_pred_distance(score, test_label, desc='make maxsim_ori using lookup'): 81 | # get score of all data from unique data 82 | time_adscore_all=time.time() 83 | score_ori=np.zeros((test_label.shape[0])) 84 | for i in tqdm(range(test_label.shape[0]), desc=desc): 85 | score_ori[i] = score[test_unique_lookup_table[i]] 86 | time_adscore_all=time.time()-time_adscore_all 87 | best_thresh = threshold_function(score_ori, test_label['label'], True) 88 | pred=(score_ori >= best_thresh).astype(int) 89 | return pred, time_adscore_all 90 | 91 | 92 | def get_colbert_score(a_test_rep, train_representations, maxsim_metric='cos'): #maxsim_metric: cosine, dot 93 | if maxsim_metric=='cos': 94 | test_score = torch.sum(torch.max(torch.div( 95 | matmul(a_test_rep, train_representations.transpose(1,2)), 96 | torch.mul(torch.norm(a_test_rep,dim=1).unsqueeze(0).unsqueeze(-1), 97 | torch.norm(train_representations,dim=2).unsqueeze(1)) 98 | ), dim=2).values, dim=1) 99 | 100 | maxsim_score=torch.max(test_score) 101 | 
mean_maxsim_score=torch.mean(test_score) 102 | return maxsim_score, mean_maxsim_score 103 | 104 | elif maxsim_metric=='dot': 105 | raise NotImplementedError("maxsim_metric='dot' is not implemented") 106 | 107 | def divide_cal(test_rep_chunk, train_representations, train_neighbor_index, test_idx, coreSet, maxsim_metric='cos'): 108 | test_rep_chunk_cuda = test_rep_chunk.cuda() 109 | test_scores_log_chunk = torch.Tensor([]).to(test_rep_chunk_cuda.device) 110 | test_mean_coreSet_scores_chunk = torch.Tensor([]).to(test_rep_chunk_cuda.device) 111 | train_representations=train_representations.cuda() 112 | 113 | for a_test in test_rep_chunk_cuda: 114 | if coreSet==0: 115 | maxsim_score, mean_coreSet_score =get_colbert_score(a_test, train_representations, maxsim_metric=maxsim_metric) 116 | # if test_idx==0: 117 | # print(f'compare against all train data,{train_representations.shape[0]}') 118 | else: 119 | maxsim_score, mean_coreSet_score =get_colbert_score(a_test, train_representations[train_neighbor_index[test_idx]], maxsim_metric=maxsim_metric) 120 | # if test_idx==0: 121 | # print(f'compare against coreSet only,{train_representations[train_neighbor_index[test_idx]].shape[0]}') 122 | test_scores_log_chunk = torch.cat((test_scores_log_chunk, 123 | maxsim_score.unsqueeze(0)), dim=0) 124 | test_mean_coreSet_scores_chunk = torch.cat((test_mean_coreSet_scores_chunk, 125 | mean_coreSet_score.unsqueeze(0)), dim=0) 126 | test_idx+=1 127 | 128 | test_scores_log_chunk=test_scores_log_chunk.detach().cpu().numpy() 129 | test_mean_coreSet_scores_chunk=test_mean_coreSet_scores_chunk.detach().cpu().numpy() 130 | return test_scores_log_chunk, test_mean_coreSet_scores_chunk, test_idx 131 | 132 | if __name__ == '__main__': 133 | parser = argparse.ArgumentParser() 134 | parser.add_argument('--plm', type=str, default='bert-base-uncased') 135 | parser.add_argument('--seed', type=int, default=1234, help='random seed (default: 1234)') 136 | 137 | #dataset. 
138 | parser.add_argument('--dataset', type=str, default='bgl', help='bgl, tbird, hdfs') 139 | parser.add_argument("--sample", help='sample ratio or count, e.g. 0.1, 0.05, 100000', default=1, type=lambda x: int(x) if x.isdigit() else float(x)) 140 | parser.add_argument("--test_size", help="test_size", default=0.2, type=float) 141 | 142 | # core set 143 | parser.add_argument('--coreSet', default=0, type=lambda x: int(x) if x.isdigit() else float(x), help='0:all unique, 1, 1000, 0.1') 144 | 145 | parser.add_argument('--maxsim_metric', type=str, default='cos', help='cos, dot') 146 | 147 | #extra Experiment 148 | parser.add_argument('--only_cls', default=False, type=str2bool, help='only cls colbert') 149 | parser.add_argument('--train_ratio', type=float, default=1.0, help='for using exp(train ratio)') 150 | parser.add_argument("--only_in_test", default=False, type=str2bool, help='only_in_test') 151 | parser.add_argument('--threshold_function', type=str, default='prc', help='prc, roc') 152 | 153 | args = parser.parse_args() 154 | set_seed(args.seed) 155 | 156 | # directory setting 157 | if args.sample != 1: 158 | root_data_path = os.path.join(os.getcwd(), 'processed_data', f'{args.dataset}_sample_{str(args.sample)}') 159 | processed_data_path = os.path.join(root_data_path, f'{args.test_size}', f'{args.plm}') 160 | 161 | else: 162 | root_data_path = os.path.join(os.getcwd(), 'processed_data', f'{args.dataset}') 163 | processed_data_path = os.path.join(root_data_path, f'{args.test_size}', f'{args.plm}') 164 | 165 | #save result 166 | save_path=os.path.join(processed_data_path, 'results') 167 | if not os.path.exists(save_path): 168 | os.makedirs(save_path) 169 | 170 | if args.only_in_test: 171 | save_path=os.path.join(save_path, f'only_in_test') 172 | if not os.path.exists(save_path): 173 | os.makedirs(save_path) 174 | 175 | #set experiment name start with data information 176 | exp_log_file_name = f'{args.dataset}_sample-{str(args.sample)}_trainRatio-{str(args.train_ratio)}' 177 | 178 | #exp setting 
179 | exp_log_file_name = exp_log_file_name+f'_thrSearch-{args.threshold_function}' 180 | 181 | if args.only_cls: 182 | exp_log_file_name = exp_log_file_name+f'_C-Onlycls-{args.maxsim_metric}' 183 | else: 184 | exp_log_file_name = exp_log_file_name+f'_C-Wcls-{args.maxsim_metric}' 185 | 186 | exp_log_file_name = exp_log_file_name+f'_coreSet-{str(args.coreSet)}' 187 | 188 | if args.threshold_function == 'roc': 189 | threshold_function=get_threshold_roc 190 | elif args.threshold_function == 'prc': 191 | threshold_function=get_threshold_prc 192 | 193 | if 'hdfs' in args.dataset: 194 | result_name='(session)' 195 | else: 196 | result_name='(all_each)' 197 | 198 | if os.path.exists(os.path.join(save_path, f'{exp_log_file_name}_dict.json')): 199 | # end of experiment 200 | print(f'{exp_log_file_name} is already exist') 201 | sys.exit() 202 | 203 | # load preprocessed data 204 | # train, val : all normal data 205 | train_representations = load_pickle(os.path.join(processed_data_path,'train_representations')) 206 | test_label = load_pickle(os.path.join(processed_data_path,'test_label')) 207 | test_representations = load_pickle(os.path.join(processed_data_path,'test_representations')) 208 | test_unique_lookup_table = load_pickle(os.path.join(processed_data_path,'test_unique_lookup_table')) 209 | 210 | if args.train_ratio != 1: 211 | print(f'original unique train size: {train_representations.shape}') 212 | print(f'train_ratio: {args.train_ratio}') 213 | train_unique_lookup_table=load_pickle(os.path.join(processed_data_path,'train_unique_lookup_table')) 214 | sampled_train = np.random.choice(train_unique_lookup_table.shape[0], int(train_unique_lookup_table.shape[0]*args.train_ratio), replace=False) 215 | train_representations=train_representations[np.unique(train_unique_lookup_table[sampled_train]),:,:] 216 | print(f'sampled_unique_train_size: {train_representations.shape}') 217 | 218 | if args.only_in_test: 219 | test_label['lookup_table']=test_unique_lookup_table 220 | 
test_label['label']=test_label['label'].astype(int) 221 | print('only_in_test') 222 | #compare train_representations and test_representations to find samples that appear only in test 223 | # for easier computation, compare only the CLS vectors 224 | train_representations_cls=train_representations[:,0,:].numpy() 225 | test_representations_cls=test_representations[:,0,:].numpy() 226 | 227 | # get only in test 228 | only_in_test_idx=[] 229 | for i, cls in tqdm(enumerate(test_representations_cls.tolist()), desc='only_in_test', total=len(test_representations_cls.tolist())): 230 | if cls not in train_representations_cls.tolist(): 231 | only_in_test_idx.append(i) 232 | 233 | only_in_test_idx=np.array(only_in_test_idx) 234 | new_idx_dict={} 235 | for i in range(len(only_in_test_idx)): 236 | new_idx_dict[only_in_test_idx[i]]=np.arange(0, len(only_in_test_idx))[i] 237 | 238 | only_test_test_label=test_label[test_label['lookup_table'].isin(only_in_test_idx)].reset_index(drop=True) 239 | only_test_test_label['lookup_table']=only_test_test_label['lookup_table'].map(new_idx_dict) 240 | 241 | test_unique_lookup_table=only_test_test_label['lookup_table'].values 242 | test_label=only_test_test_label[['timestamp','label']] 243 | test_representations=test_representations[only_in_test_idx] 244 | print(f'only_in_test: {test_representations.shape}') 245 | 246 | 247 | exp_log_file_name=exp_log_file_name.split('_') 248 | exp_log_file_name[2]=exp_log_file_name[2]+'-'+str(train_representations.shape[0]) 249 | 250 | exp_log_file_name='_'.join(exp_log_file_name) 251 | 252 | # if coreSet is not an integer, then it is a ratio 253 | # check whether args.coreSet is a ratio 254 | if (args.coreSet > 0) and (args.coreSet < 1): 255 | coreSet = int(train_representations.shape[0]*args.coreSet) 256 | coreSet = max(coreSet, 1) 257 | print(f'{args.coreSet} = {coreSet}') 258 | exp_log_file_name='-'.join(exp_log_file_name.split('-')[:-1]) 259 | exp_log_file_name = exp_log_file_name+f'-{args.coreSet}-{coreSet}' 260 | elif args.coreSet <= 
train_representations.shape[0]: 261 | #this branch also covers coreSet=0 262 | coreSet = int(args.coreSet) 263 | else: 264 | coreSet = int(train_representations.shape[0]) 265 | print(f'train_representations.shape[0] < coreSet: {train_representations.shape[0]} < {args.coreSet}') 266 | print('use all unique_train') 267 | 268 | if train_representations.shape[0] < coreSet: 269 | exp_log_file_name='-'.join(exp_log_file_name.split('-')[:-1]) 270 | exp_log_file_name = exp_log_file_name+f'-{train_representations.shape[0]}' 271 | 272 | if os.path.exists(os.path.join(save_path, f'{exp_log_file_name}_dict.json')): 273 | # end of experiment 274 | print(f'{exp_log_file_name} is already exist') 275 | sys.exit() 276 | 277 | with open(os.path.join(save_path, f'{exp_log_file_name}.txt'), 'w') as f: 278 | sys.stdout = f 279 | 280 | # get label 281 | # new_label from unique to all log 282 | test_label['lookup_table']=test_unique_lookup_table 283 | test_label['label']=test_label['label'].astype(int) 284 | 285 | #fit knn with unique training data 286 | time_for_get_coreSet=time.time() 287 | if coreSet==0: 288 | #use all unique_train; only the nearest neighbor is needed here 289 | knn_cuml_cls = NearestNeighbors(n_neighbors=1) 290 | else: 291 | # coreSet was already capped above to the full training set when it exceeded it 292 | knn_cuml_cls = NearestNeighbors(n_neighbors=coreSet) 293 | knn_cuml_cls.fit(train_representations[:,0,:].numpy()) 294 | 295 | knn_D, train_neighbor_index = knn_cuml_cls.kneighbors(test_representations[:,0,:].numpy()) 296 | time_for_get_coreSet=time.time()-time_for_get_coreSet 297 | 298 | del knn_cuml_cls 299 | 300 | # check the score is ordered by the score 301 | if (np.sort(knn_D[10])==knn_D[10]).all(): 302 | print('score is ordered by the score') 303 | else: 304 | print('*'*50) 305 | print('score is not ordered by the score') 306 | print('*'*50) 307 | 308 | # make D of original test data by using test_unique_lookup_table 309 | knn_D = knn_D[:,0] 310 | knn_pred, knn_time=get_threshold_pred_distance(knn_D, test_label, desc='make D_ori using 
lookup') 311 | 312 | if args.only_cls: 313 | train_representations=train_representations[:,0,:].unsqueeze(1) 314 | test_representations=test_representations[:,0,:].unsqueeze(1) 315 | print('colbert with only cls') 316 | print('*'*50) 317 | 318 | print('='*50) 319 | print('start calculating colbert score') 320 | 321 | test_scores_log = np.array([]) 322 | test_mean_coreSet_scores = np.array([]) 323 | num_chunk = 100 324 | print(f'train: {train_representations.shape}, test: {test_representations.shape}') 325 | 326 | test_chunks = torch.chunk(test_representations, num_chunk, dim=0) 327 | print(f'num_chunk: {num_chunk}, num_each_chunk: {test_chunks[0].shape[0]}') 328 | 329 | time_for_cal_maxsim=time.time() 330 | test_idx=0 331 | for i in tqdm(range(num_chunk), desc='colbert score by chunk'): 332 | if (len(test_chunks) != num_chunk) and (i >= len(test_chunks)): 333 | print(f'{i}th chunk is not exist: number of unique is less than num_chunk') 334 | break 335 | test_scores_log_chunk, test_mean_coreSet_scores_chunk, new_test_idx = divide_cal( 336 | test_rep_chunk=test_chunks[i], test_idx=test_idx, train_representations=train_representations, 337 | train_neighbor_index=train_neighbor_index, coreSet=coreSet, maxsim_metric=args.maxsim_metric) 338 | test_idx=new_test_idx 339 | 340 | test_scores_log = np.concatenate((test_scores_log, test_scores_log_chunk), axis=0) 341 | test_mean_coreSet_scores = np.concatenate((test_mean_coreSet_scores, test_mean_coreSet_scores_chunk), axis=0) 342 | 343 | del test_chunks 344 | 345 | test_scores_log=-test_scores_log+np.max(test_scores_log) 346 | test_mean_coreSet_scores=-test_mean_coreSet_scores+np.max(test_mean_coreSet_scores) 347 | time_for_cal_maxsim=time.time()-time_for_cal_maxsim 348 | 349 | save_pickle(test_scores_log, os.path.join(save_path, f'{exp_log_file_name}_test_scores_log')) 350 | save_pickle(test_mean_coreSet_scores, os.path.join(save_path, f'{exp_log_file_name}_test_mean_coreSet_scores')) 351 | 352 | # for maxsim_ori version: 
main result 353 | maxsim_pred, time_for_get_adscore_for_all = get_threshold_pred_distance(test_scores_log, test_label, desc='make maxsim_ori using lookup') 354 | 355 | # for mean_coreSet_score version 356 | mean_coreSet_maxsim_pred, mean_coreSet_time = get_threshold_pred_distance(test_mean_coreSet_scores, test_label, desc='make mean maxsim using lookup') 357 | 358 | 359 | print('='*50) 360 | print('save times') 361 | print('time_for_get_coreSet:', time_for_get_coreSet) 362 | print('time_for_cal_maxsim:', time_for_cal_maxsim) 363 | print('time_for_get_adscore_for_all:', time_for_get_adscore_for_all) 364 | print('='*50) 365 | time_list=(time_for_get_coreSet,time_for_cal_maxsim, time_for_get_adscore_for_all) 366 | 367 | print('-'*50) 368 | print('K=1') 369 | results_df=get_detection_score(test_label['label'], knn_pred,time_list, exp_name = f'K=1{result_name}', result_df=None) 370 | print('='*50) 371 | print('only ColBERT by all test') 372 | results_df=get_detection_score(test_label['label'], maxsim_pred,time_list, exp_name = f'ColBERT{result_name}', result_df=results_df) 373 | print('='*50) 374 | print('mean_coreSet_score version') 375 | results_df=get_detection_score(test_label['label'], mean_coreSet_maxsim_pred,time_list, exp_name = f'mean ColBERT{result_name}', result_df=results_df) 376 | # Restore the standard output 377 | sys.stdout = sys.__stdout__ 378 | 379 | # Close the file object 380 | f.close() 381 | 382 | #save result 383 | results_df.to_csv(os.path.join(save_path, f'{exp_log_file_name}_df.csv')) 384 | #also save result_df as dict 385 | result_dict={f'{exp_log_file_name}':results_df.to_dict()} 386 | 387 | with open(os.path.join(save_path, f'{exp_log_file_name}_dict.json'), 'w') as f: 388 | json.dump(result_dict, f, indent=4) 389 | -------------------------------------------------------------------------------- /docker/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM 
pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime 2 | 3 | RUN apt-get update && apt-get upgrade -y 4 | RUN pip install --upgrade pip 5 | 6 | RUN apt-get install -y fonts-nanum 7 | RUN rm -rf ~/.cache/matplotlib/* 8 | 9 | RUN apt-get -q -y update && DEBIAN_FRONTEND=noninteractive apt-get -q -y install git curl vim tmux locales lsb-release python3-pip ssh && apt-get clean 10 | RUN apt-get update && apt-get install -y sudo 11 | 12 | ## some basic utilities 13 | RUN pip install matplotlib seaborn scikit-learn scipy pandas numpy jupyter wandb 14 | 15 | #install all from requirements.txt 16 | #start from copying requirements.txt 17 | COPY requirements.txt . 18 | RUN pip install -r requirements.txt 19 | 20 | ## add locale: 21 | RUN locale-gen en_US.UTF-8 && /usr/sbin/update-locale LANG=en_US.UTF-8 22 | ENV LANG en_US.UTF-8 23 | ENV LANGUAGE en_US:en 24 | ENV LC_ALL en_US.UTF-8 25 | 26 | # # Copy code 27 | # ARG user_password 28 | 29 | RUN pip freeze > requirements.txt 30 | ARG UNAME 31 | ARG UID 32 | ARG GID 33 | RUN groupadd -g $GID -o $UNAME 34 | RUN useradd -m -u $UID -g $GID -o -s /bin/bash $UNAME 35 | 36 | # # grant sudo privileges 37 | # RUN usermod -aG sudo $UNAME 38 | # # set the password 39 | # RUN echo "$UNAME:$user_password" | chpasswd 40 | USER $UNAME 41 | 42 | -------------------------------------------------------------------------------- /docker/container.sh: -------------------------------------------------------------------------------- 1 | container_name=logAD 2 | image_name=bb451/log:cuml_1.13.1_11.6 3 | 4 | echo "Container_name: " $container_name 5 | echo "Image_name: " $image_name 6 | 7 | docker run -td \ 8 | -p 3775:3775 \ 9 | --ipc=host \ 10 | --name $container_name \ 11 | --gpus all \ 12 | -v /ssd2/logAD:/home/bb451/logAD \ 13 | -v /etc/passwd:/etc/passwd \ 14 | $image_name -------------------------------------------------------------------------------- /docker/image.sh: -------------------------------------------------------------------------------- 1 | 
image_name=bb451/log:cuml_1.13.1_11.6 2 | user_password=3775 3 | echo "Image_name: " $image_name 4 | echo "User_password: " $user_password 5 | 6 | docker build -t $image_name --build-arg UNAME=$(whoami) --build-arg UID=$(id -u) --build-arg GID=$(id -g) --build-arg user_password=$user_password . -------------------------------------------------------------------------------- /image/RAPID_main.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DSBA-Lab/RAPID/46b31747073548ed8489421690c8909595397a58/image/RAPID_main.png -------------------------------------------------------------------------------- /preprocess_rep.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.environ["TOKENIZERS_PARALLELISM"] = "false" 3 | import sys 4 | import pandas as pd 5 | import numpy as np 6 | import argparse 7 | 8 | import torch 9 | from torch.nn import DataParallel 10 | 11 | from transformers import ( 12 | BertModel, 13 | BertTokenizer, 14 | AutoModel, 15 | AutoTokenizer, 16 | ) 17 | 18 | import parmap 19 | import multiprocessing 20 | num_processors = multiprocessing.cpu_count() 21 | 22 | #for time check 23 | from tqdm import tqdm 24 | import time 25 | from utils import str2bool, set_seed, load_pickle, save_pickle, normalizeString, bgl_regex, tb_regex, hdfs_regex 26 | 27 | def preprocess(sentence, flag, dataset): 28 | if "bgl" in dataset: 29 | timestamp, sentence = bgl_regex(sentence) 30 | elif "tbird" in dataset: 31 | timestamp, sentence = tb_regex(sentence) 32 | elif "hdfs" in dataset: 33 | timestamp, sentence = hdfs_regex(sentence)#hdfs: timestamp=block_id 34 | 35 | if flag=='test': 36 | if sentence.split()[0] == '-': 37 | test_label=0 38 | else: 39 | test_label=1 40 | if ("bgl" in dataset) or ("tbird" in dataset): #bgl, tbird: abnormal case has error category, don't need this 41 | sentence = " ".join(sentence.split()[1:]) 42 | sentence = 
normalizeString(sentence) 43 | if ("bgl" in dataset) or ("tbird" in dataset): 44 | sentence = " ".join(sentence.split()[3:]) #remove the useless part 45 | elif "hdfs" in dataset: 46 | sentence = " ".join(sentence.split()[1:]) 47 | 48 | if flag=='test': 49 | return timestamp,sentence,test_label 50 | else: 51 | return timestamp,sentence 52 | 53 | def get_time_data(raw_data, flag, args): 54 | #multiprocessing 55 | timestamp_representation_testLabel =np.array(parmap.map(preprocess, raw_data, flag, args.dataset, pm_pbar=True, pm_processes=num_processors-2)) 56 | timestamps=timestamp_representation_testLabel[:,0] 57 | sentence_list=timestamp_representation_testLabel[:,1] 58 | if flag=='test': 59 | test_labels=timestamp_representation_testLabel[:,2] 60 | if flag=='test': 61 | return timestamps, sentence_list, test_labels 62 | else: 63 | return timestamps, sentence_list 64 | 65 | def get_representation(model, sentence_list, batch_size=20000, max_length=512, pooling_strategy='all'): 66 | all_representations=torch.tensor([], dtype=torch.float32) 67 | for batch_sentences in tqdm(batch(sentence_list, batch_size), total=len(sentence_list)//batch_size+1): 68 | tokens=tokenizer(batch_sentences, add_special_tokens=True, return_tensors='pt', padding='max_length', max_length=max_length, truncation=True).to(device) #max_length=128; without truncation=True, some sentences apparently come out at length 299 69 | 70 | with torch.no_grad(): 71 | if pooling_strategy=='all': 72 | representations=model(**tokens).last_hidden_state.detach().cpu() 73 | 74 | all_representations=torch.cat((all_representations, representations), dim=0) 75 | return all_representations 76 | 77 | def batch(iterable, n = 1): 78 | current_batch = [] 79 | for item in iterable: 80 | current_batch.append(item) 81 | if len(current_batch) == n: 82 | yield current_batch 83 | current_batch = [] 84 | if current_batch: 85 | yield current_batch 86 | 87 | def get_unique_values_table(sentence_in_list, unique_sentence_list): 88 | indices = 
np.where(unique_sentence_list == sentence_in_list)[0][0] 89 | return indices 90 | 91 | def process(model, flag, args): 92 | # flag: train, validation, test 93 | if not args.need_split: 94 | if os.path.exists(os.path.join(temp_data_path,f'{flag}_timestamps.pkl')): 95 | print(f'{flag} already exists.') 96 | timestamps = load_pickle(os.path.join(temp_data_path,f'{flag}_timestamps')) 97 | sentence_list = load_pickle(os.path.join(temp_data_path,f'{flag}_sentence_list')) 98 | else: 99 | time_data=get_time_data(globals()[f'raw_{flag}'],flag, args) 100 | timestamps=time_data[0] 101 | sentence_list=time_data[1] 102 | save_pickle(timestamps, os.path.join(temp_data_path,f'{flag}_timestamps')) 103 | save_pickle(sentence_list, os.path.join(temp_data_path,f'{flag}_sentence_list')) 104 | if flag == 'test': 105 | test_label=time_data[2] 106 | save_pickle(test_label, os.path.join(temp_data_path,f'test_label')) 107 | 108 | if flag =='test': 109 | test_label_df=pd.DataFrame({'timestamp':timestamps,'label':test_label}) 110 | save_pickle(test_label_df, os.path.join(temp_data_path,'test_label')) 111 | print('timestamps, sentence_list done') 112 | 113 | sentence_array = np.array(sentence_list) 114 | unique_sentence = np.unique(sentence_array) 115 | 116 | unique_lookup_table= np.array(parmap.map(get_unique_values_table, sentence_array, unique_sentence, pm_pbar=True, pm_processes=num_processors-2)) 117 | print('unique_lookup_table done') 118 | unique_sentence=unique_sentence.tolist() 119 | save_pickle(unique_lookup_table, os.path.join(temp_data_path,f'{flag}_unique_lookup_table')) 120 | save_pickle(unique_sentence, os.path.join(temp_data_path,f'{flag}_unique_sentence')) 121 | 122 | if os.path.exists(os.path.join(temp_data_path,f'{flag}_representations.pkl')): 123 | print(f'{flag} already exists.') 124 | else: 125 | representations=get_representation(model, unique_sentence, batch_size=args.batch_size, max_length = args.max_token_len, pooling_strategy=args.pooling_strategy) 126 | 
save_pickle(representations, os.path.join(temp_data_path,f'{flag}_representations')) 127 | print('representations done') 128 | else: 129 | def save_sentence_list(flag, args): 130 | for i in tqdm(range(args.split_num), desc=f'{flag} time_data split'): 131 | print(f'{flag} time_data: get_time_data for chunk {i}') 132 | if i == args.split_num-1: 133 | time_data=get_time_data(globals()[f'raw_{flag}'][int(len(globals()[f'raw_{flag}'])/args.split_num)*i:],flag, args) 134 | else: 135 | time_data=get_time_data(globals()[f'raw_{flag}'][int(len(globals()[f'raw_{flag}'])/args.split_num)*i:int(len(globals()[f'raw_{flag}'])/args.split_num)*(i+1)],flag, args) 136 | 137 | timestamps=time_data[0] 138 | sentence_list=time_data[1] 139 | if flag == 'test': 140 | test_label=time_data[2] 141 | save_pickle(test_label, os.path.join(temp_data_path,f'test_label_{i}')) 142 | 143 | test_label_df=pd.DataFrame({'timestamp':timestamps,'label':test_label}) 144 | save_pickle(test_label_df, os.path.join(temp_data_path,f'test_label_{i}')) 145 | 146 | save_pickle(timestamps, os.path.join(temp_data_path,f'{flag}_timestamps_{i}')) 147 | save_pickle(sentence_list, os.path.join(temp_data_path,f'{flag}_sentence_list_{i}')) 148 | print(f'{flag}_timestamps_{i}, {flag}_sentence_list_{i} done') 149 | 150 | print(f'Preprocessing in {args.split_num} chunks, then merging them and saving the representations.') 151 | 152 | save_sentence_list(flag, args) 153 | del globals()[f'raw_{flag}'] 154 | print('merging the chunks back into the full dataset') 155 | #label 156 | if flag == 'test': 157 | test_label_df=pd.DataFrame({'timestamp':[],'label':[]}) 158 | for i in range(args.split_num): 159 | test_label_df=pd.concat([test_label_df, load_pickle(os.path.join(temp_data_path,f'test_label_{i}'))], axis=0, ignore_index=True) 160 | save_pickle(test_label_df, os.path.join(temp_data_path,'test_label')) 161 | print('test_label done') 162 | 163 | sentence_list=np.array([]) 164 | for i in range(args.split_num): 165 | # concat all sentence_list 166 | sentence_list=np.append(sentence_list, 
                load_pickle(os.path.join(temp_data_path, f'{flag}_sentence_list_{i}')))

        print('unique start')
        sentence_array = np.array(sentence_list)
        unique_sentence = np.unique(sentence_array)
        print(f'{flag} unique sentence len :{unique_sentence.shape}')

        unique_lookup_table = np.array(parmap.map(get_unique_values_table, sentence_array, unique_sentence, pm_pbar=True, pm_processes=num_processors-2))
        unique_sentence = unique_sentence.tolist()

        save_pickle(unique_lookup_table, os.path.join(temp_data_path, f'{flag}_unique_lookup_table'))
        save_pickle(unique_sentence, os.path.join(temp_data_path, f'{flag}_unique_sentence'))
        print('unique_lookup_table done')

        print('get_representation start')
        representations = get_representation(model, unique_sentence, batch_size=args.batch_size, max_length=args.max_token_len, pooling_strategy=args.pooling_strategy)
        save_pickle(representations, os.path.join(temp_data_path, f'{flag}_representations'))
        print(f'{flag}_representations done')


if __name__ == '__main__':
    set_seed(1234)
    parser = argparse.ArgumentParser()
    # argparse expects help to be a string, so the example values are written out as text
    parser.add_argument("--dataset", help="hdfs, bgl, or tbird", default="tbird")
    parser.add_argument("--sample", help="sampling ratio (e.g. 0.1, 0.05) or absolute count (e.g. 100000)", default=1, type=lambda x: int(x) if x.isdigit() else float(x))
    parser.add_argument("--shuffle", help="True or False", default=True, type=str2bool)
    parser.add_argument("--test_size", help="test_size", default=0.2, type=float)
    parser.add_argument("--need_split", help="True or False", default=False, type=str2bool)
    parser.add_argument("--split_num", help="number of chunks, e.g. 5 or 10", default=10, type=int)
    parser.add_argument('--plm', type=str, default='bert-base-uncased')
    parser.add_argument("--batch_size", default=8192, type=int)
    parser.add_argument("--max_token_len", default=128, type=int, help='bgl, tbird: 128, hdfs: 512')
    parser.add_argument("--pooling_strategy", help="cls, mean, or all", default="all")

    args = parser.parse_args()

    if args.sample != 1:
        output_path = os.path.join(os.getcwd(), 'processed_data', f'{args.dataset}_sample_{str(args.sample)}')
        raw_file_path = os.path.join(os.getcwd(), 'processed_data', f'{args.dataset}_sample_{str(args.sample)}')
    else:
        output_path = os.path.join(os.getcwd(), 'processed_data', f'{args.dataset}')
        raw_file_path = os.path.join(os.getcwd(), 'processed_data', f'{args.dataset}')

    if not os.path.exists(output_path):
        os.makedirs(output_path)

    # save temp files
    if args.plm == 'pretrained_bgl':
        temp_data_path = os.path.join(output_path, f'{args.test_size}', 'temp_bgl')
    else:
        temp_data_path = os.path.join(output_path, f'{args.test_size}', f'{args.plm}')

    if not os.path.exists(temp_data_path):
        os.makedirs(temp_data_path)

    if os.path.exists(os.path.join(temp_data_path, 'rep_time.txt')):
        print('This data has already been preprocessed.')
        sys.exit()

    # device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if (device.type == 'cuda') and (torch.cuda.device_count() > 1):
        print("Let's use", torch.cuda.device_count(), "GPUs!")

    # use a generic pretrained BERT unless the BGL-pretrained model is requested
    if args.plm == 'pretrained_bgl':
        model = BertModel.from_pretrained("./final_bert_model")
    else:
        model = AutoModel.from_pretrained(args.plm)
    model = DataParallel(model)
    model.to(device)
    model.eval()
    if args.plm == 'pretrained_bgl':
        tokenizer = BertTokenizer.from_pretrained("./tokenizer/BGL_lanobert-vocab.txt")
    else:
        tokenizer = AutoTokenizer.from_pretrained(args.plm)

    # open file
    with open(os.path.join(raw_file_path, f'train_{args.test_size}'), 'r', encoding='utf-8') as f:
        raw_train = f.readlines()

    with \
        open(os.path.join(raw_file_path, f'test_{args.test_size}'), 'r', encoding='utf-8') as f:
        raw_test = f.readlines()

    # time check
    start_time = time.time()
    print('get train data & representation')
    process(model, 'train', args)

    train_end_time = time.time()
    print(f'train time: {train_end_time-start_time}')

    print('get test data & representation')
    process(model, 'test', args)
    test_end_time = time.time()
    print(f'test time: {test_end_time-train_end_time}')

    print('train_test time: ', test_end_time-start_time)

    # Write the timings straight to the file instead of redirecting sys.stdout;
    # the with-block closes the file on exit.
    with open(os.path.join(temp_data_path, 'rep_time.txt'), 'w') as f:
        print(f'train time: {train_end_time-start_time}', file=f)
        print(f'test time: {test_end_time-train_end_time}', file=f)
        print(f'train_test time: {test_end_time-start_time}', file=f)
--------------------------------------------------------------------------------
/scripts/all_at_once.sh:
--------------------------------------------------------------------------------

test_size=0.2
unsing_gpu=0

python split_data.py --dataset bgl --test_size $test_size
python split_data.py --dataset tbird --sample 5000000 --test_size $test_size
python split_data.py --dataset hdfs --test_size $test_size

python preprocess_rep.py --dataset bgl --batch_size 8192 --max_token_len 128 --test_size $test_size
python preprocess_rep.py --dataset tbird --sample 5000000 --batch_size 8192 --max_token_len 128 --need_split True --split_num 2 --test_size $test_size
python preprocess_rep.py --dataset hdfs --batch_size 8192 --max_token_len 512 --test_size $test_size

for plm in google/electra-base-discriminator roberta-base
do
    python preprocess_rep.py --dataset bgl --plm $plm --batch_size 8192 --max_token_len 128 --test_size \
        $test_size
    python preprocess_rep.py --dataset tbird --sample 5000000 --plm $plm --batch_size 8192 --max_token_len 128 --need_split True --split_num 2 --test_size $test_size
    python preprocess_rep.py --dataset hdfs --plm $plm --batch_size 8192 --max_token_len 512 --test_size $test_size
done

for train_ratio in 1
do
    for coreSet in 0.01 0
    do
        for only_cls in False
        do
            echo "=====================RQ1, train_ratio:$train_ratio, coreSet:$coreSet, only_cls:$only_cls====================="
            # BGL
            echo BGL
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset bgl --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size

            # HDFS
            echo HDFS
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset hdfs --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size

            # TBIRD
            echo TBIRD
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset tbird --sample 5000000 --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
        done
    done
done

for train_ratio in 1 0.1 0.05 0.01 0.001
do
    for coreSet in 0.01 0.1 1 0
    do
        for only_cls in False True
        do
            echo "=====================RQ6, train_ratio:$train_ratio, coreSet:$coreSet, only_cls:$only_cls====================="
            # BGL
            echo BGL
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset bgl --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size

            # HDFS
            echo HDFS
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset hdfs --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size

            # TBIRD
            echo TBIRD
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset tbird --sample 5000000 --train_ratio $train_ratio --coreSet \
                $coreSet --only_cls $only_cls --test_size $test_size
        done
    done
done

for train_ratio in 1 0.1 0.05 0.01 0.001
do
    for coreSet in 0 1 2 5 10 0.01 0.05 0.1 0.2 0.3 0.5
    do
        for only_cls in False
        do
            echo "=====================RQ2,4, train_ratio:$train_ratio, coreSet:$coreSet, only_cls:$only_cls====================="
            # BGL
            echo BGL
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset bgl --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size

            # HDFS
            echo HDFS
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset hdfs --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size

            # TBIRD
            echo TBIRD
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset tbird --sample 5000000 --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
        done
    done
done


for train_ratio in 1 0.9 0.8 0.5 0.3 0.2 0.1 0.05 0.01 0.001
do
    for coreSet in 0 0.01
    do
        for only_cls in False
        do
            echo "=====================RQ5, train_ratio:$train_ratio, coreSet:$coreSet, only_cls:$only_cls====================="
            # BGL
            echo BGL
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset bgl --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size

            # HDFS
            echo HDFS
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset hdfs --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size

            # TBIRD
            echo TBIRD
            CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --dataset tbird --sample 5000000 --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
        done
    done
done

for plm in google/electra-base-discriminator \
    roberta-base
do
    # python preprocess_rep.py --dataset bgl --plm $plm --batch_size 8192 --max_token_len 128 --test_size $test_size
    # python preprocess_rep.py --dataset tbird --sample 5000000 --plm $plm --batch_size 8192 --max_token_len 128 --need_split True --split_num 2 --test_size $test_size
    # python preprocess_rep.py --dataset hdfs --plm $plm --batch_size 8192 --max_token_len 512 --test_size $test_size

    for train_ratio in 1
    do
        for coreSet in 0.01
        do
            for only_cls in False
            do
                echo "=====================RQ3, plm:$plm, train_ratio:$train_ratio, coreSet:$coreSet, only_cls:$only_cls====================="
                # BGL
                echo BGL
                CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --plm $plm --dataset bgl --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size

                # HDFS
                echo HDFS
                CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --plm $plm --dataset hdfs --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size

                # TBIRD
                echo TBIRD
                CUDA_VISIBLE_DEVICES=$unsing_gpu python ad_test_coreSet.py --plm $plm --dataset tbird --sample 5000000 --train_ratio $train_ratio --coreSet $coreSet --only_cls $only_cls --test_size $test_size
            done
        done
    done
done

--------------------------------------------------------------------------------
/split_data.py:
--------------------------------------------------------------------------------
import sys
import os
import pandas as pd
import numpy as np
import argparse
from tqdm import tqdm
import json

import re
from collections import defaultdict

import random
from sklearn.model_selection import train_test_split

import parmap
import multiprocessing
num_processors = multiprocessing.cpu_count()

from utils import str2bool, set_seed, \
    get_parsed_log, get_unique_log, label_parsed_log

def save_processed_log(data, path, need_newline=False):
    # Write one log per entry; append a newline only when the entries lack one.
    with open(path, 'w') as f:
        for log in data:
            f.write(log)
            if need_newline:
                f.write('\n')

if __name__ == '__main__':
    set_seed(1234)
    parser = argparse.ArgumentParser()
    # argparse expects help to be a string, so the example values are written out as text
    parser.add_argument("--dataset", help="hdfs, bgl, or tbird", default="tbird")
    parser.add_argument("--shuffle", help="shuffle data", default=True, type=str2bool)
    parser.add_argument("--sample", help="sampling ratio (e.g. 0.1, 0.05) or absolute count", default=1, type=lambda x: int(x) if x.isdigit() else float(x))
    parser.add_argument("--test_size", help="test_size", default=0.2, type=float)
    args = parser.parse_args()
    current_dir = os.path.dirname(os.path.abspath(__file__))

    if args.dataset == "bgl":
        data_dir = os.path.join(current_dir, 'dataset', 'bgl')
        log_file = "BGL.log"
        output_dir = os.path.join(current_dir, 'processed_data', 'bgl')
    elif args.dataset == "tbird":
        data_dir = os.path.join(current_dir, 'dataset', 'tbird')
        log_file = "Thunderbird.log"
        if args.sample != 1:
            output_dir = os.path.join(current_dir, 'processed_data', f'tbird_sample_{str(args.sample)}')
        else:
            output_dir = os.path.join(current_dir, 'processed_data', 'tbird')

    elif args.dataset == "hdfs":
        # we don't split hdfs dataset with this code
        data_dir = os.path.join(current_dir, 'dataset', 'hdfs')
        output_dir = os.path.join(current_dir, 'processed_data', 'hdfs')

        log_file = "HDFS.log"
        blk_label_file = os.path.join(data_dir, "anomaly_label.csv")

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # load dataset and get normal & abnormal
    if os.path.exists(os.path.join(output_dir, f'train_{args.test_size}')) and os.path.exists(os.path.join(output_dir,
            f'test_{args.test_size}')):
        print("Already split dataset")
        sys.exit()


    print("Split dataset")
    if args.dataset != "hdfs":
        # open data_dir + log_file
        with open(os.path.join(data_dir, log_file), 'r', errors='ignore') as f:
            labels = []
            data = []
            normal_data = []
            abnormal_data = []
            idx = 0
            for line in tqdm(f, desc='get data'):
                # a leading "-" marks a normal line; anything else is an alert label
                labels.append(line.split()[0] != '-')
                if labels[-1]:
                    abnormal_data.append(line)
                else:
                    normal_data.append(line)
                data.append(line)
                idx += 1
    else:
        # hdfs
        if os.path.exists(os.path.join(data_dir, 'preprocessed_data_df.csv')):
            print("preprocessed hdfs:preprocessed_data_df.csv exists")
            import ast
            def str_to_list(s):
                return ast.literal_eval(s)
            data_df = pd.read_csv(os.path.join(data_dir, 'preprocessed_data_df.csv'), converters={'Raw': str_to_list, 'labeled_Raw': str_to_list, 'parsed_unique_log': str_to_list})

        else:
            print("preprocess hdfs:preprocessed_data_df.csv")

            with open(os.path.join(data_dir, log_file), 'r', errors='ignore') as f:
                data = []
                for line in tqdm(f, total=11175629, desc='get data'):
                    data.append(line)
            # list to dataframe
            df = pd.DataFrame(data, columns=['Raw'])  # raw data

            data_dict = defaultdict(list)  # preserve insertion order of items
            for idx, row in tqdm(df.iterrows(), total=df.shape[0], desc='find blk_id'):
                blkId_list = re.findall(r'(blk_-?\d+)', row['Raw'])  # find all block ids in log Content
                blkId_set = set(blkId_list)
                for blk_Id in blkId_set:
                    data_dict[blk_Id].append(row["Raw"])

            data_df = pd.DataFrame(list(data_dict.items()), columns=['BlockId', 'Raw'])
            # make dataframe:blk_df to dict:blk_label_dict
            blk_df = pd.read_csv(blk_label_file)
            blk_label_dict = dict(zip(blk_df.BlockId, blk_df.Label))
            blk_label_dict = {k: 1 if v == 'Anomaly' else 0 for k, v in blk_label_dict.items()}
            data_df["Label"] = data_df["BlockId"].apply(lambda x: blk_label_dict.get(x))  # add label to the sequence of each blockid

            parsed_unique_log = parmap.map(get_parsed_log, data_df['Raw'], pm_pbar=True, pm_processes=num_processors-2)
            parsed_unique_log = parmap.map(get_unique_log, parsed_unique_log, pm_pbar=True, pm_processes=num_processors-2)

            data_df['parsed_unique_log'] = parsed_unique_log
            data_df = label_parsed_log(data_df)
            data_df.to_csv(os.path.join(data_dir, 'preprocessed_data_df.csv'), index=False)

        normal_data = data_df[data_df['Label'] == 0]['labeled_parsed_unique_concat'].tolist()
        abnormal_data = data_df[data_df['Label'] == 1]['labeled_parsed_unique_concat'].tolist()

    # split dataset
    if args.sample != 1:
        # sample == float or int
        # sample data with max_num
        # get normal, abnormal data ratio and get # of each max
        ab_ratio = len(abnormal_data)/(len(normal_data)+len(abnormal_data))

        if isinstance(args.sample, float):
            normal_data = random.sample(normal_data, int(len(normal_data)*args.sample))
            abnormal_data = random.sample(abnormal_data, int(len(abnormal_data)*args.sample))
            normal_train_val, normal_test = train_test_split(normal_data, test_size=args.test_size, random_state=1234, shuffle=args.shuffle)

        elif isinstance(args.sample, int):
            print("sample data with specific integer num")
            normal_data = random.sample(normal_data, int(args.sample*(1-ab_ratio)))
            abnormal_data = random.sample(abnormal_data, int(args.sample*ab_ratio))
            normal_train_val, normal_test = train_test_split(normal_data, test_size=args.test_size, random_state=1234, shuffle=args.shuffle)
    else:
        normal_train_val, normal_test = train_test_split(normal_data, test_size=args.test_size, random_state=1234, shuffle=args.shuffle)

    test = normal_test + abnormal_data

    if args.dataset == "hdfs":
        need_newline = True
    else:
        need_newline = False

    save_processed_log(normal_train_val, os.path.join(output_dir, f'train_{args.test_size}'), need_newline)
    save_processed_log(test, os.path.join(output_dir, f'test_{args.test_size}'), need_newline)

    data_size = {}
    data_size['train_normal'] = len(normal_train_val)
    data_size['test_normal'] = len(normal_test)
    data_size['test_abnormal'] = len(abnormal_data)
    with open(os.path.join(output_dir, f'data_size_dict_{args.test_size}.json'), 'w') as f:
        json.dump(data_size, f, indent=4)
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
import argparse
import os
import random
import numpy as np
import torch
import pickle

import re
import unicodedata
from datetime import datetime

def str2bool(v):
    if isinstance(v, bool):
        return v
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')

def set_seed(random_seed):
    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    torch.cuda.manual_seed_all(random_seed)  # if use multi-GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(random_seed)
    random.seed(random_seed)

def load_pickle(path):
    with open(path+'.pkl', 'rb') as f:
        data = pickle.load(f)
    return data

def save_pickle(data, path):
    with open(path+'.pkl', 'wb') as f:
        pickle.dump(data, f)

def unicodeToAscii(s):
    return "".join(
        c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn"
    )

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = \
        re.sub(r"[^a-zA-Z<>]+", r" ", s)  # keep only English letters and <>; drop digits and special characters
    s = re.sub(r"\s+", r" ", s).strip()  # collapse whitespace
    return s

# for dataset preprocessing
# for bgl
def bgl_regex(log):
    # Patterns are raw strings so that sequences like \d reach the regex engine unchanged.
    date_time_regex = re.compile(
        r"\d{1,4}\-\d{1,2}\-\d{1,2}-\d{1,2}.\d{1,2}.\d{1,2}.\d{1,6}"
    )
    date_regex = re.compile(r"\d{1,4}\.\d{1,2}\.\d{1,2}")
    ip_regex = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d{1,5})?")
    server_regex = re.compile(r"\S+(?=.*[0-9])(?=.*[a-zA-Z])(?=[:]+)\S+")
    server_regex2 = re.compile(r"\S+(?=.*[0-9])(?=.*[a-zA-Z])(?=[-])\S+")
    ecid_regex = re.compile(r"[A-Z0-9]{28}")
    serial_regex = re.compile(r"[a-zA-Z0-9]{48}")
    memory_regex = re.compile(r"0[xX][0-9a-fA-F]\S+")
    path_regex = re.compile(r".\S+(?=.[0-9a-zA-Z])(?=[/]).\S+")
    iar_regex = re.compile(r"[0-9a-fA-F]{8}")
    num_regex = re.compile(r"(\d+)")

    timestamp = (np.array([str(datetime.strptime(re.findall(date_time_regex, log)[0], '%Y-%m-%d-%H.%M.%S.%f'))])).item()
    tmp = re.sub(date_time_regex, " TIME ", log)
    tmp = re.sub(ip_regex, " IP ", tmp)
    tmp = re.sub(date_regex, " TIME ", tmp)
    tmp = re.sub(path_regex, " PATH ", tmp)
    tmp = re.sub(server_regex, " SERVER ", tmp)
    tmp = re.sub(server_regex2, " SERVER ", tmp)
    tmp = re.sub(ecid_regex, " ECID ", tmp)
    tmp = re.sub(serial_regex, " SERIAL ", tmp)
    tmp = re.sub(memory_regex, " MEMORY ", tmp)
    tmp = re.sub(iar_regex, " IAR ", tmp)
    tmp = re.sub(num_regex, " NUM ", tmp)
    return timestamp, tmp

def tb_regex(log):
    date_regex = re.compile(r"\d{2,4}\.\d{1,2}\.\d{1,2}\s")
    date_regex2 = re.compile(
        r"(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+(\d{1,2})\s+"
    )
    time_regex = re.compile(r"\d{1,2}\:\d{1,2}\:\d{1,2}")
    id_regex = re.compile(r"DATE\s.*\sDATE")

    account_regex = \
        re.compile(r"(\w+[\w\.]*)@(\w+[\w\.]*)\-(\w+[\w\.]*)")
    account_regex2 = re.compile(r"(\w+[\w\.]*)@(\w+[\w\.]*)")
    account_regex3 = re.compile(r"TIME\s\S+")

    dir_regex = re.compile(r'[a-zA-Z0-9_\-\.\/]+\/[a-zA-Z0-9_\-\.\/]+\/[a-zA-Z0-9_\-\.\/]*')  # does not start with / and has two or more / levels
    dir_regex2 = re.compile(r'\/[a-zA-Z0-9_\-\.\/]+\/[a-zA-Z0-9_\-\.\/]*')  # starts with / and has a single / level
    iar_regex = re.compile(r"[0-9a-fA-F]{10}")
    ip_regex = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d{1,5})?")
    num_regex = re.compile(r"(\[\d+\])")

    date_time_str = re.findall(date_regex, log)[0]+" "+re.findall(time_regex, log)[0]
    timestamp = (np.array([str(datetime.strptime(date_time_str, '%Y.%m.%d %H:%M:%S'))])).item()
    tmp = re.sub(date_regex, "DATE ", log)
    tmp = re.sub(date_regex2, "DATE ", tmp)
    tmp = re.sub(id_regex, "DATE ID DATE", tmp)
    tmp = re.sub(time_regex, "TIME", tmp)
    tmp = re.sub(account_regex3, "TIME ACCOUNT", tmp)  ## TIME / TIME ACCOUNT
    tmp = re.sub(account_regex, "ACCOUNT", tmp)
    tmp = re.sub(account_regex2, "ACCOUNT", tmp)
    tmp = re.sub(dir_regex, " DIR ", tmp)
    tmp = re.sub(dir_regex2, " DIR ", tmp)
    tmp = re.sub(ip_regex, "IP", tmp)
    tmp = re.sub(iar_regex, "IAR", tmp)
    tmp = re.sub(num_regex, " NUM ", tmp)

    return timestamp, tmp

def hdfs_regex(log):
    id_regex = re.compile(r"blk_.\d+")
    ip_regex = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d{1,5})?")
    num_regex = re.compile(r"\d*\d")

    block_id = re.findall(id_regex, log)[0]
    tmp = re.sub(id_regex, "BLK", log)  # already parsed in dataset preprocessing, del block_id
    tmp = re.sub(ip_regex, "IP", tmp)
    tmp = re.sub(num_regex, "NUM", tmp)
    return block_id, tmp

# for hdfs data split
def concat_list_str(row):
    # delete \n & concatenate
    return ' '.join(list(map(lambda x: (x.replace('\n', '')), row)))

def add_label_Raw_blk(row):
    blk = concat_list_str(row)
    blk = "- "+blk
    return blk

def get_parsed_log(df_row):
    blk_log = []
    for i, log in enumerate(df_row):
        # drop the first three whitespace-separated fields before masking
        parsed = hdfs_regex(' '.join(log.split()[3:]))
        if i == 0:
            blk_log.append(parsed[0])
        blk_log.append(normalizeString(parsed[1]))
    return blk_log

def get_unique_log(df_row):
    return np.unique(df_row).tolist()

def label_parsed_log(data_df):
    # normal blocks (Label == 0) get a leading "- " marker; anomalous blocks are concatenated as-is
    data_df['labeled_parsed_unique_concat'] = data_df.apply(lambda row: add_label_Raw_blk(row['parsed_unique_log']) if (row['Label'] == 0) else concat_list_str(row['parsed_unique_log']), axis=1)
    return data_df
--------------------------------------------------------------------------------
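As a rough illustration of what the HDFS helpers in `utils.py` do to a single log line, the sketch below restates `unicodeToAscii`, `normalizeString`, and `hdfs_regex` so that it runs on its own, then applies them the way `get_parsed_log` does. The sample line is only an assumed example in the HDFS log format, not output taken from this repository.

```python
import re
import unicodedata


def unicodeToAscii(s):
    # Strip combining marks after NFD normalisation (same logic as utils.py).
    return "".join(
        c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn"
    )


def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"[^a-zA-Z<>]+", r" ", s)  # keep only letters and <>
    s = re.sub(r"\s+", r" ", s).strip()   # collapse whitespace
    return s


def hdfs_regex(log):
    id_regex = re.compile(r"blk_.\d+")
    ip_regex = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d{1,5})?")
    num_regex = re.compile(r"\d*\d")

    block_id = re.findall(id_regex, log)[0]
    tmp = re.sub(id_regex, "BLK", log)
    tmp = re.sub(ip_regex, "IP", tmp)
    tmp = re.sub(num_regex, "NUM", tmp)
    return block_id, tmp


# Assumed sample line in the HDFS log format (for illustration only).
sample = ("081109 203518 143 INFO dfs.DataNode$DataXceiver: Receiving block "
          "blk_-1608999687919862906 src: /10.250.19.102:54106 dest: /10.250.19.102:50010")

# get_parsed_log drops the first three fields before masking.
body = " ".join(sample.split()[3:])
block_id, masked = hdfs_regex(body)
normalized = normalizeString(masked)

print(block_id)    # blk_-1608999687919862906
print(normalized)  # info dfs datanode dataxceiver receiving block blk src ip dest ip
```

The block id is kept separately as the grouping key, while the masked and normalised text is what ends up in the per-block sequence that `label_parsed_log` later concatenates.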