├── Figures ├── LSTM-Autoencoder.PNG ├── Online.PNG └── example.PNG ├── LICENSE ├── README.md ├── Slides.pdf ├── Thesis.pdf ├── data └── power_data.txt └── src ├── DataPrepare └── dataPreparation.py ├── Initialization ├── __pycache__ │ ├── conf_init.cpython-36.pyc │ ├── dataHelper.cpython-36.pyc │ ├── encdecad.cpython-36.pyc │ ├── initTrain.cpython-36.pyc │ └── parameterHelper.cpython-36.pyc ├── conf_init.py ├── dataHelper.py ├── encdecad.py ├── initTrain.py ├── initialization.py └── parameterHelper.py └── OnlinePrediction ├── OnlinePrediction.py ├── ProcessingHelper.py ├── __pycache__ ├── ProcessingHelper.cpython-36.pyc └── conf_online.cpython-36.pyc └── conf_online.py /Figures/LSTM-Autoencoder.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/Figures/LSTM-Autoencoder.PNG -------------------------------------------------------------------------------- /Figures/Online.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/Figures/Online.PNG -------------------------------------------------------------------------------- /Figures/example.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/Figures/example.PNG -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 bli22 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Anomaly detection for streaming data using autoencoders 2 | 3 | This project is my master thesis. The main target is to maintain an adaptive autoencoder-based anomaly detection framework that is able to not only detect contextual anomalies from streaming data, but also update itself according to the latest data feature. 
4 | 
5 | ## Quick access
6 | 
7 | - [Thesis](https://github.com/binli826/LSTM-Autoencoders/blob/master/Thesis.pdf)
8 | - [Slides](https://github.com/binli826/LSTM-Autoencoders/blob/master/Slides.pdf)
9 | - [Usage](https://github.com/binli826/LSTM-Autoencoders/tree/master#usage)
10 | 
11 | ## Introduction
12 | The high-volume, high-velocity data streams generated by devices and applications in
13 | different domains grow steadily and are valuable for big data research. One of the most
14 | important topics is anomaly detection for streaming data, which has attracted attention
15 | and investigation in many areas, e.g. sensor data anomaly detection, predictive
16 | maintenance, and event detection. Such efforts can potentially avoid substantial financial
17 | costs in manufacturing. However, unlike traditional anomaly detection tasks,
18 | anomaly detection in streaming data is especially difficult, because data arrives over
19 | time with latent distribution changes, so a single stationary model does not fit the
20 | stream at all times. An anomaly can become normal as the data evolves, so it is
21 | necessary to maintain a dynamic system that adapts to the changes. In this work,
22 | we propose an LSTM-Autoencoder anomaly detection model for streaming data. It is a
23 | mini-batch-based stream processing approach. We experimented with streams
24 | containing different kinds of anomalies as well as concept drifts; the results suggest
25 | that our model detects anomalies in the data stream reliably and updates itself in time
26 | to fit the latest properties of the data.
27 | 
28 | ## Model
29 | #### LSTM-Autoencoder
30 | The LSTM-Autoencoder is based on the work of [Malhotra et al.]. There are two LSTM units, one acting as the encoder and the other as the decoder. The model is trained only on normal data, so the reconstruction of anomalies is expected to yield a higher reconstruction error.
31 | 
32 | ![LSTM-Autoencoder](https://github.com/binli826/LSTM-Autoencoders/blob/master/Figures/LSTM-Autoencoder.PNG)
33 | 
34 | > **Input/Output format**
35 | >
36 | > < Batch size, Time steps, Data dimensions >
37 | > Batch size: Number of windows contained in a single batch
38 | > Time steps: Number of instances within a window (T)
39 | > Data dimensions: Size of feature space
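
For intuition, here is a minimal NumPy sketch (with illustrative sizes, not the authors' code) of how a stream of shape `(instances, features)` is cut into such batches; it mirrors the reshape used in `OnlinePrediction.py`:

```python
import numpy as np

# Illustrative values only (roughly the SMTP setting).
batch_num, step_num, elem_num = 8, 10, 34
stream = np.random.rand(100 * step_num, elem_num)   # (1000, 34) toy stream

# Non-overlapping windows of step_num instances, batch_num windows per batch.
n_batches = stream.shape[0] // (batch_num * step_num)
batches = stream[:n_batches * batch_num * step_num].reshape(
    n_batches, batch_num, step_num, elem_num)

print(batches.shape)  # (12, 8, 10, 34): each model input is one (8, 10, 34) batch
```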
40 | 
41 | #### Online framework
42 | Once the LSTM-Autoencoder is initialized with a subset of the respective data stream, it is used for online anomaly detection. For each accumulated batch of streaming data, the model predicts each window as normal or anomalous. Afterwards, experts label the windows so that the performance can be evaluated. Hard windows are appended to the updating buffers. Once the normal buffer is full, the LSTM-Autoencoder is trained further, using only the hard windows in the buffers.
43 | 
44 | ![Online framework](https://github.com/binli826/LSTM-Autoencoders/blob/master/Figures/Online.PNG)
45 | 
46 | ## Datasets
47 | The model is evaluated on 5 datasets. The [PowerDemand](https://github.com/binli826/LSTM-Autoencoders/blob/master/data/power_data.txt) dataset records the power demand over one year; the abnormal power demand on special days (e.g. festivals, Christmas) is labeled as anomalous.
48 | SMTP and HTTP are extracted from the [KDDCup99 dataset](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). SMTP+HTTP is a direct concatenation of SMTP and HTTP, in order to simulate a concept drift in between.
49 | Network attacks are treated as anomalies here. The [FOREST](https://archive.ics.uci.edu/ml/datasets/covertype) dataset records statistics of 7 different forest cover types. We follow the same setting as [Dong et al.] and take the smallest class, Cottonwood/Willow, as the anomaly.
50 | The following table shows statistical information about each dataset. (Only numerical features are taken into consideration.)
51 | 
52 | | Dataset | Dimensionality | #Instances | Anomaly proportion (%) |
53 | | :------: | :------: | :------: | :------: |
54 | | PowerDemand | 1 | 35 040 | 2.20 |
55 | | SMTP | 34 | 96 554 | 1.22 |
56 | | HTTP | 34 | 623 091 | 0.65 |
57 | | SMTP+HTTP | 34 | 719 645 | 0.72 |
58 | | FOREST | 7 | 581 012 | 0.47 |
59 | 
60 | ## Results
61 | Here is a reconstruction example of a normal window and an anomalous window from the PowerDemand data.
62 | ![Reconstruction example](https://github.com/binli826/LSTM-Autoencoders/blob/master/Figures/example.PNG)
63 | 
64 | With AUC as the evaluation metric, we obtained the following performance for data stream anomaly detection.
65 | 
66 | | Dataset | AUC without updating | AUC with updating | #Updating |
67 | | :------: | :------: | :------: | :------: |
68 | | PowerDemand | 0.91 | 0.97 | 2 |
69 | | SMTP | 0.94 | 0.98 | 2 |
70 | | HTTP | 0.76 | 0.86 | 2 |
71 | | SMTP+HTTP | 0.64 | 0.85 | 3 |
72 | | FOREST | 0.74 | 0.82 | 8 |
73 | 
74 | 
75 | ## Usage
76 | #### Data preparation
77 | Once the datasets are available, convert the raw data into a uniform format using [dataPreparation.py].
78 | 
79 | ```sh
80 | python /src/DataPrepare/dataPreparation.py dataset inputpath outputpath --powerlabel --kddcol
81 | # Example
82 | python /src/DataPrepare/dataPreparation.py kdd /mypath/kddcup.data.corrected /mypath --kddcol /mypath/columns.txt
83 | ```
84 | #### Initialization
85 | With the processed dataset, the model initialization phase can be run with the following command, specifying the dataset to use, the data path, and a folder path for saving the trained model.
86 | ```sh
87 | python /src/Initialization/initialization.py dataset dataPath modelSavePath
88 | # Example
89 | python /src/Initialization/initialization.py smtp /mypath/smtp.csv /mypath/models/
90 | ```
91 | 
92 | #### Online prediction
93 | Once the data are prepared and the model is initialized and saved locally, the online prediction process can be executed as follows:
94 | ```sh
95 | python /src/OnlinePrediction/OnlinePrediction.py datasetname dataPath modelPath
96 | # Example
97 | python /src/OnlinePrediction/OnlinePrediction.py smtp /mypath/smtp.csv /mypath/model_smtp/
98 | ```
99 | #### About hyper-parameters
100 | Hyper-parameters are learned by grid search with respect to each dataset, and can be modified in [conf_init.py] and [conf_online.py].
101 | 
102 | 
103 | 
104 | ## Versions
105 | This project works with
106 | * Python 3.6
107 | * Tensorflow 1.4.0
108 | * Numpy 1.13.3
109 | 
110 | [Malhotra et al.]:
111 | [Dong et al.]:
112 | [dataPreparation.py]:
113 | [conf_init.py]:
114 | [conf_online.py]:
--------------------------------------------------------------------------------
/Slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/Slides.pdf
--------------------------------------------------------------------------------
/Thesis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/Thesis.pdf
--------------------------------------------------------------------------------
/src/DataPrepare/dataPreparation.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Dec 14 14:14:24 2017
4 | 
5 | @author: Bin
6 | """
7 | 
8 | import argparse
9 | import pandas as pd
10 | import numpy as np
11 | 
12 | def parseArguments():
13 | parser = argparse.ArgumentParser()
14 | # Positional mandatory arguments
15 | parser.add_argument("dataset", help="power/kdd/forest", type=str)
16 | parser.add_argument("datapath", help="input data path", type=str)
17 | parser.add_argument("savepath", help="folder to save the processed data", type=str)
18 | 
19 | # Optional arguments
20 | parser.add_argument("-pl", "--powerlabel", help="Label file of power demand dataset ", type=str)
21 | parser.add_argument("-kc", "--kddcol", help="Column file of KDD dataset ", type=str)
22 | 
23 | # Parse arguments
24 | args = parser.parse_args()
25 | 
26 | return args
27 | 
28 | 
29 | 
30 | def power(pathPowerData, pathPowerLabel, pathPowerSave):
31 | '''Process PowerDemand dataset
32 | 
33 | We downsample the PowerDemand dataset by a factor of 8, remove the
34 | first and last partial weeks, and preserve only the 51 full weeks,
35 | with each week being described by 84 instances (the slicing [::8][60:-36] below implements this).
36 | '''
37 | 
38 | powerData = pd.read_csv(pathPowerData,header=None)[::8][60:-36].reset_index(drop=True)
39 | 
40 | 
41 | # The PowerDemand dataset is manually labeled according to the special days in a year.
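# (Assumption, implied by the axis=1 concat below: the label file provides
# one label per retained instance, row-aligned with the downsampled powerData.)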
42 | 
43 | 
44 | powerLabel = pd.read_csv(pathPowerLabel,header=None)
45 | 
46 | PowerDemand = pd.concat((powerData,powerLabel),axis=1)
47 | 
48 | PowerDemand.to_csv(pathPowerSave+"/PowerDemand.csv",header=None,index=None)
49 | 
50 | 
51 | def kdd(pathKDDcol_name, pathKDD, pathKDDSave):
52 | '''Process KDD99 dataset
53 | 
54 | SMTP and HTTP are extracted from the KDD99 dataset, only with numerical features.
55 | SMTP+HTTP is a concatenation of SMTP and HTTP.
56 | '''
57 | 
58 | try:
59 | with open(pathKDDcol_name) as col_file:
60 | line = col_file.readline()
61 | except EnvironmentError:
62 | raise FileNotFoundError("Column file not found: %s" % pathKDDcol_name)
63 | 
64 | columns = line.split('.')
65 | col_names = []
66 | col_types = []
67 | for col in columns:
68 | col_names.append(col.split(': ')[0].strip())
69 | col_types.append(col.split(': ')[1])
70 | 
71 | 
72 | df = pd.read_csv(pathKDD,header=None)
73 | 
74 | continuous = df.iloc[:,np.array(pd.Series(col_types)=="continuous")]
75 | continuous = pd.concat((continuous,df.iloc[:,-1]),axis=1)
76 | SMTP = continuous[df.iloc[:,2] == "smtp"].reset_index(drop=True)
77 | HTTP = continuous[df.iloc[:,2] == "http"].reset_index(drop=True)
78 | SMTPHTTP = pd.concat((SMTP,HTTP),axis=0).reset_index(drop=True)
79 | 
80 | SMTP.to_csv(pathKDDSave+"/SMTP.csv",header=None,index=None)
81 | HTTP.to_csv(pathKDDSave+"/HTTP.csv",header=None,index=None)
82 | SMTPHTTP.to_csv(pathKDDSave+"/SMTPHTTP.csv",header=None,index=None)
83 | 
84 | def forest(pathForestData,pathForestSave):
85 | 
86 | '''Process Forest cover dataset
87 | 
88 | Take the smallest class (type 4, Cottonwood/Willow) as anomaly; the rest are normal.
89 | Only the 7 numerical features are used.
90 | '''
91 | 
92 | forest = pd.read_csv(pathForestData,header=None)
93 | numerical_col = [0,1,2,3,4,5,9]
94 | 
95 | forestData = forest.iloc[:,numerical_col]
96 | forestLabel = forest.iloc[:,-1]
97 | forestLabel[forestLabel != 4] = 'normal.'
98 | forestLabel[forestLabel == 4] = 'anomaly.'
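# Note that the two masks above must run in this order: rewriting the majority
# classes to 'normal.' first leaves only the smallest class (4, Cottonwood/Willow)
# to be tagged 'anomaly.'.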
99 | 100 | forest = pd.concat((forestData,forestLabel),axis=1) 101 | forest.to_csv(pathForestSave+"/FOREST.csv",header=None,index=None) 102 | 103 | 104 | if __name__ == '__main__': 105 | 106 | args = parseArguments() 107 | dataset = args.__dict__['dataset'] 108 | datapath = args.__dict__['datapath'] 109 | savepath= args.__dict__['savepath'] 110 | 111 | if dataset == "power": 112 | powerlabel = args.__dict__['powerlabel'] 113 | power(datapath,powerlabel,savepath) 114 | elif dataset == "kdd": 115 | kddcol = args.__dict__['kddcol'] 116 | kdd(kddcol,datapath,savepath) 117 | elif dataset == "forest": 118 | forest(datapath,savepath) 119 | else: 120 | print("Please input the dataset name: power/kdd/forest.") 121 | -------------------------------------------------------------------------------- /src/Initialization/__pycache__/conf_init.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/Initialization/__pycache__/conf_init.cpython-36.pyc -------------------------------------------------------------------------------- /src/Initialization/__pycache__/dataHelper.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/Initialization/__pycache__/dataHelper.cpython-36.pyc -------------------------------------------------------------------------------- /src/Initialization/__pycache__/encdecad.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/Initialization/__pycache__/encdecad.cpython-36.pyc -------------------------------------------------------------------------------- /src/Initialization/__pycache__/initTrain.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/Initialization/__pycache__/initTrain.cpython-36.pyc -------------------------------------------------------------------------------- /src/Initialization/__pycache__/parameterHelper.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/Initialization/__pycache__/parameterHelper.cpython-36.pyc -------------------------------------------------------------------------------- /src/Initialization/conf_init.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu Aug 16 16:44:42 2018 4 | 5 | @author: Bin 6 | """ 7 | 8 | from dataHelper import Data_Helper 9 | 10 | class Configuration(object): 11 | 12 | def __init__(self, dataset, dataPath, modelSavePath, training_data_source = "file", optimizer=None, decode_without_input=False): 13 | 14 | 15 | if dataset == "power": 16 | 17 | self.batch_num = 8 18 | self.hidden_num = 15 19 | self.step_num = 84 20 | self.training_set_size = self.step_num*12 21 | 22 | elif dataset == "smtp": 23 | 24 | self.batch_num = 8 25 | self.hidden_num = 15 26 | self.step_num = 10 27 | self.training_set_size = self.step_num*6000 28 | 29 | elif dataset == "http": 30 | 31 | self.batch_num = 8 32 | self.hidden_num = 35 
33 | self.step_num = 30
34 | self.training_set_size = self.step_num*30000
35 | 
36 | elif dataset == "smtphttp":
37 | self.batch_num = 8
38 | self.hidden_num = 15
39 | self.step_num = 10
40 | self.training_set_size = self.step_num*2500
41 | 
42 | elif dataset == "forest":
43 | self.batch_num = 8
44 | self.hidden_num = 25
45 | self.step_num = 10
46 | self.training_set_size = self.step_num*10000
47 | 
48 | else:
49 | print("Wrong dataset name input.")
50 | 
51 | 
52 | self.input_root = dataPath
53 | self.iteration = 300
54 | self.modelpath_root = modelSavePath
55 | self.modelmeta = self.modelpath_root + "_"+str(self.batch_num)+"_"+str(self.hidden_num)+"_"+str(self.step_num)+"_.ckpt.meta"
56 | self.modelpath_p = self.modelpath_root + "_"+str(self.batch_num)+"_"+str(self.hidden_num)+"_"+str(self.step_num)+"_para.ckpt"
57 | self.modelmeta_p = self.modelpath_root + "_"+str(self.batch_num)+"_"+str(self.hidden_num)+"_"+str(self.step_num)+"_para.ckpt.meta"
58 | self.decode_without_input = False
59 | 
60 | self.log_path = modelSavePath + "log.txt"
61 | 
62 | 
63 | # import dataset
64 | # The dataset is divided into 6 parts, namely training_normal, validation_1,
65 | # validation_2, test_normal, validation_anomaly, test_anomaly.
66 | 
67 | self.training_data_source = training_data_source
68 | data_helper = Data_Helper(self.input_root,self.training_set_size,self.step_num,self.batch_num,self.training_data_source,self.log_path)
69 | 
70 | self.sn_list = data_helper.sn_list
71 | self.va_list = data_helper.va_list
72 | self.vn1_list = data_helper.vn1_list
73 | self.vn2_list = data_helper.vn2_list
74 | self.tn_list = data_helper.tn_list
75 | self.ta_list = data_helper.ta_list
76 | self.data_list = [self.sn_list, self.va_list, self.vn1_list, self.vn2_list, self.tn_list, self.ta_list]
77 | 
78 | self.elem_num = data_helper.sn.shape[1]
79 | self.va_label_list = data_helper.va_label_list
80 | 
81 | 
82 | f = open(self.log_path,'a')
83 | f.write("Batch_num=%d\nHidden_num=%d\nwindow_length=%d\ntraining_used_#windows=%d\n"%(self.batch_num,self.hidden_num,self.step_num,self.training_set_size//self.step_num))
84 | f.close()
--------------------------------------------------------------------------------
/src/Initialization/dataHelper.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Aug 16 16:53:38 2018
4 | 
5 | @author: Bin
6 | """
7 | 
8 | import numpy as np
9 | import pandas as pd
10 | from sklearn.preprocessing import MinMaxScaler
11 | 
12 | 
13 | # Given an initialization dataset, split it into normal and abnormal lists of different subsets.
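# Subset abbreviations used throughout (cf. the 6-way split described in conf_init.py):
#   sn  - training normal (50% of the normal windows)
#   vn1 - validation normal 1 (30%; used to fit mu and sigma)
#   vn2 - validation normal 2 (10%; used for the threshold search)
#   tn  - test normal (10%; also serves as the validation set in initTrain.py)
#   va  - validation anomaly (half of the anomaly windows; threshold search)
#   ta  - test anomaly (the other half)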
14 | 
15 | class Data_Helper(object):
16 | 
17 | def __init__(self, path,training_set_size,step_num,batch_num,training_data_source,log_path):
18 | 
19 | self.path = path
20 | self.step_num = step_num
21 | self.batch_num = batch_num
22 | self.training_data_source = training_data_source
23 | self.training_set_size = training_set_size
24 | 
25 | 
26 | 
27 | self.df = pd.read_csv(self.path).iloc[:self.training_set_size,:]
28 | 
29 | print("Preprocessing...")
30 | 
31 | self.sn,self.vn1,self.vn2,self.tn,self.va,self.ta,self.va_labels = self.preprocessing(self.df,log_path)
32 | assert min(self.sn.size,self.vn1.size,self.vn2.size,self.tn.size,self.va.size,self.ta.size) > 0, "Not enough continuous data in file for training, ended."+str((self.sn.size,self.vn1.size,self.vn2.size,self.tn.size,self.va.size,self.ta.size))
33 | 
34 | # data serialization: cut each subset into windows of length step_num
35 | t1 = self.sn.shape[0]//step_num
36 | t2 = self.va.shape[0]//step_num
37 | t3 = self.vn1.shape[0]//step_num
38 | t4 = self.vn2.shape[0]//step_num
39 | t5 = self.tn.shape[0]//step_num
40 | t6 = self.ta.shape[0]//step_num
41 | 
42 | self.sn_list = [self.sn[step_num*i:step_num*(i+1)].as_matrix() for i in range(t1)]
43 | self.va_list = [self.va[step_num*i:step_num*(i+1)].as_matrix() for i in range(t2)]
44 | self.vn1_list = [self.vn1[step_num*i:step_num*(i+1)].as_matrix() for i in range(t3)]
45 | self.vn2_list = [self.vn2[step_num*i:step_num*(i+1)].as_matrix() for i in range(t4)]
46 | 
47 | self.tn_list = [self.tn[step_num*i:step_num*(i+1)].as_matrix() for i in range(t5)]
48 | self.ta_list = [self.ta[step_num*i:step_num*(i+1)].as_matrix() for i in range(t6)]
49 | 
50 | self.va_label_list = [self.va_labels[step_num*i:step_num*(i+1)].as_matrix() for i in range(t2)]
51 | 
52 | print("Ready for training.")
53 | 
54 | def preprocessing(self,df,log_path):
55 | 
56 | #scaling
57 | label = df.iloc[:,-1]
58 | scaler = MinMaxScaler()
59 | scaler.fit(df.iloc[:,:-1])
60 | cont = pd.DataFrame(scaler.transform(df.iloc[:,:-1]))
61 | data = pd.concat((cont,label),axis=1)
62 | 
63 | # split data according to window length
64 | # split the dataframe into segments of length L; if a window contains at least one anomaly, the whole window is an anomaly window
65 | n_list = []
66 | a_list = []
67 | 
68 | windows = [data.iloc[w*self.step_num:(w+1)*self.step_num,:] for w in range(data.index.size//self.step_num)]
69 | for win in windows:
70 | label = win.iloc[:,-1]
71 | if label[label!="normal."].size == 0:
72 | n_list += [i for i in win.index]
73 | else:
74 | a_list += [i for i in win.index]
75 | 
76 | normal = data.iloc[np.array(n_list),:-1]
77 | anomaly = data.iloc[np.array(a_list),:-1]
78 | print("Info: Initialization set contains %d normal instances and %d abnormal instances."%(normal.shape[0],anomaly.shape[0]))
79 | 
80 | a_labels = data.iloc[np.array(a_list),-1]
81 | 
82 | # split into subsets
83 | tmp = normal.index.size//self.step_num//10
84 | assert tmp > 0 ,"Too small normal set %d rows"%normal.index.size
85 | sn = normal.iloc[:tmp*5*self.step_num,:]
86 | vn1 = normal.iloc[tmp*5*self.step_num:tmp*8*self.step_num,:]
87 | vn2 = normal.iloc[tmp*8*self.step_num:tmp*9*self.step_num,:]
88 | tn = normal.iloc[tmp*9*self.step_num:,:]
89 | 
90 | tmp_a = anomaly.index.size//self.step_num//2
91 | va = anomaly.iloc[:tmp_a*self.step_num,:] if tmp_a !=0 else anomaly
92 | ta = anomaly.iloc[tmp_a*self.step_num:,:] if tmp_a !=0 else anomaly
93 | a_labels = a_labels[:va.index.size]
94 | 
95 | print("Local preprocessing finished.")
96 | print("Subsets contain windows: sn:%d,vn1:%d,vn2:%d,tn:%d,va:%d,ta:%d\n"%(sn.shape[0]/self.step_num,vn1.shape[0]/self.step_num,vn2.shape[0]/self.step_num,tn.shape[0]/self.step_num,va.shape[0]/self.step_num,ta.shape[0]/self.step_num))
97 | 
98 | f = open(log_path,'a')
99 | 
100 | f.write("Info: Initialization set contains %d normal instances and %d abnormal instances.\n"%(normal.shape[0],anomaly.shape[0]))
101 | f.write("Subsets contain windows: sn:%d,vn1:%d,vn2:%d,tn:%d,va:%d,ta:%d\n"%(sn.shape[0]/self.step_num,vn1.shape[0]/self.step_num,vn2.shape[0]/self.step_num,tn.shape[0]/self.step_num,va.shape[0]/self.step_num,ta.shape[0]/self.step_num))
102 | f.close()
103 | 
104 | return sn,vn1,vn2,tn,va,ta,a_labels
--------------------------------------------------------------------------------
/src/Initialization/encdecad.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Fri Aug 17 14:07:54 2018
4 | 
5 | @author: Bin
6 | """
7 | 
8 | import tensorflow as tf
9 | import math
10 | 
11 | class EncDecAD(object):
12 | 
13 | def __init__(self, hidden_num, inputs, is_training, optimizer=None, reverse=True, decode_without_input=False):
14 | 
15 | self.batch_num = inputs[0].get_shape().as_list()[0]
16 | self.elem_num = inputs[0].get_shape().as_list()[1]
17 | 
18 | self._enc_cell = tf.nn.rnn_cell.LSTMCell(hidden_num, use_peepholes=True)
19 | self._dec_cell = tf.nn.rnn_cell.LSTMCell(hidden_num, use_peepholes=True)
20 | if is_training == True:
21 | self._enc_cell = tf.nn.rnn_cell.DropoutWrapper(self._enc_cell, input_keep_prob=0.8, output_keep_prob=0.8)
22 | self._dec_cell = tf.nn.rnn_cell.DropoutWrapper(self._dec_cell, input_keep_prob=0.8, output_keep_prob=0.8)
23 | 
24 | self.is_training = is_training
25 | 
26 | self.input_ = tf.transpose(tf.stack(inputs), [1, 0, 2],name="input_")
27 | 
28 | with tf.variable_scope('encoder',reuse = tf.AUTO_REUSE):
29 | (self.z_codes, self.enc_state) = tf.contrib.rnn.static_rnn(self._enc_cell, inputs, dtype=tf.float32)
30 | 
31 | with tf.variable_scope('decoder',reuse =tf.AUTO_REUSE) as vs:
32 | 
33 | dec_weight_ = tf.Variable(tf.truncated_normal([hidden_num,self.elem_num], dtype=tf.float32))
34 | 
35 | dec_bias_ = tf.Variable(tf.constant(0.1,shape=[self.elem_num],dtype=tf.float32))
36 | 
37 | dec_state = self.enc_state
38 | dec_input_ = tf.ones(tf.shape(inputs[0]),dtype=tf.float32)
39 | dec_outputs = []
40 | 
41 | for step in range(len(inputs)):
42 | if step > 0:
43 | vs.reuse_variables()
44 | (dec_input_, dec_state) = self._dec_cell(dec_input_, dec_state)
45 | dec_input_ = tf.matmul(dec_input_, dec_weight_) + dec_bias_
46 | dec_outputs.append(dec_input_)
47 | # teacher forcing: feed the real input (in reverse order) as the next decoder input
48 | tmp = -(step+1)
49 | dec_input_ = inputs[tmp]
50 | 
51 | if reverse:
52 | dec_outputs = dec_outputs[::-1]
53 | 
54 | self.output_ = tf.transpose(tf.stack(dec_outputs), [1, 0, 2],name="output_")
55 | self.loss = tf.reduce_mean(tf.square(self.input_ - self.output_),name="loss")
56 | 
57 | def check_is_train(is_training):
58 | def t_ (): return tf.train.AdamOptimizer().minimize(self.loss,name="train_")
59 | def f_ (): return tf.train.AdamOptimizer(1/math.inf).minimize(self.loss) # learning rate 1/inf == 0.0, so evaluation runs a no-op "update"
60 | is_train = tf.cond(is_training, t_, f_)
61 | return is_train
62 | self.train = check_is_train(is_training)
--------------------------------------------------------------------------------
/src/Initialization/initTrain.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Fri Aug 17 14:09:13 2018
4 | 
5 | @author: Bin
6 | """
7 | 
8 | import numpy as np
9 | import matplotlib.pyplot as plt
10 | import time
11 | import tensorflow as tf
12 | from conf_init import Configuration
13 | from encdecad import EncDecAD
14 | from parameterHelper import Parameter_Helper
15 | 
16 | class Initialization_Train(object):
17 | 
18 | def __init__(self, dataset,dataPath, modelSavePath, training_data_source='file'):
19 | start_time = time.time()
20 | conf = Configuration(dataset, dataPath, modelSavePath, training_data_source=training_data_source)
21 | 
22 | 
23 | batch_num = conf.batch_num
24 | hidden_num = conf.hidden_num
25 | step_num = conf.step_num
26 | elem_num = conf.elem_num
27 | 
28 | iteration = conf.iteration
29 | modelpath_root = conf.modelpath_root
30 | modelpath = conf.modelpath_p
31 | decode_without_input = conf.decode_without_input
32 | 
33 | patience = 20
34 | patience_cnt = 0
35 | min_delta = 0.0001
36 | 
37 | 
38 | #************#
39 | # Training
40 | #************#
41 | 
42 | p_input = tf.placeholder(tf.float32, shape=(batch_num, step_num, elem_num),name = "p_input")
43 | p_inputs = [tf.squeeze(t, [1]) for t in tf.split(p_input, step_num, 1)]
44 | 
45 | p_is_training = tf.placeholder(tf.bool,name= "is_training_")
46 | 
47 | ae = EncDecAD(hidden_num, p_inputs, p_is_training , decode_without_input=False)
48 | 
49 | graph = tf.get_default_graph()
50 | gvars = graph.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
51 | assign_ops = [graph.get_operation_by_name(v.op.name + "/Assign") for v in gvars]
52 | init_values = [assign_op.inputs[1] for assign_op in assign_ops]
53 | 
54 | 
55 | print("Training start.")
56 | with tf.Session() as sess:
57 | saver = tf.train.Saver()
58 | 
59 | 
60 | sess.run(tf.global_variables_initializer())
61 | input_= tf.transpose(tf.stack(p_inputs), [1, 0, 2])
62 | output_ = graph.get_tensor_by_name("decoder/output_:0")
63 | 
64 | loss = []
65 | val_loss = []
66 | sn_list_length = len(conf.sn_list)
67 | tn_list_length = len(conf.tn_list)
68 | 
69 | for i in range(iteration):
70 | #training set
71 | snlist = conf.sn_list[:]
72 | tmp_loss = 0
73 | for t in range(sn_list_length//batch_num):
74 | data =[]
75 | for _ in range(batch_num):
76 | data.append(snlist.pop())
77 | data = np.array(data)
78 | (loss_val, _) = sess.run([ae.loss, ae.train], {p_input: data,p_is_training : True})
79 | tmp_loss += loss_val
80 | l = tmp_loss/(sn_list_length//batch_num)
81 | loss.append(l)
82 | 
83 | #validation set
84 | tnlist = conf.tn_list[:]
85 | tmp_loss_ = 0
86 | for t in range(tn_list_length//batch_num):
87 | testdata = []
88 | for _ in range(batch_num):
89 | testdata.append(tnlist.pop())
90 | testdata = np.array(testdata)
91 | (loss_val,ein,aus) = sess.run([ae.loss,input_,output_], {p_input: testdata,p_is_training :False})
92 | tmp_loss_ += loss_val
93 | tl = tmp_loss_/(tn_list_length//batch_num)
94 | val_loss.append(tl)
95 | print('Epoch %d: Loss:%.3f, Val_loss:%.3f' %(i, l,tl))
96 | 
97 | 
98 | 
99 | #Early stopping
100 | if i > 0 and val_loss[i] < np.array(val_loss[:i]).min():
101 | #save_path = saver.save(sess, conf.modelpath_p)
102 | gvars_state = sess.run(gvars)
103 | 
104 | if i > 0 and val_loss[i-1] - val_loss[i] > min_delta:
105 | patience_cnt = 0
106 | else:
107 | patience_cnt += 1
108 | 
109 | if i>0 and patience_cnt > patience:
110 | print("Early stopping at epoch %d\n"%i)
111 | feed_dict = {init_value: val for init_value, val in zip(init_values, gvars_state)}
112 | sess.run(assign_ops, feed_dict=feed_dict)
113 | 
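# Restore the best-so-far weights (gvars_state, captured whenever val_loss
# reached a new minimum) through the graph's own Assign ops, avoiding a
# checkpoint round-trip; the commented-out saver.restore below is the
# on-disk alternative.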
#saver.restore(sess,tf.train.latest_checkpoint(modelpath_root))
114 | #graph = tf.get_default_graph()
115 | break
116 | 
117 | plt.plot(loss,label="Train")
118 | plt.plot(val_loss,label="val_loss")
119 | plt.legend()
120 | plt.show()
121 | 
122 | 
123 | # mu & sigma & threshold
124 | 
125 | para = Parameter_Helper(conf)
126 | mu, sigma = para.mu_and_sigma(sess,input_, output_,p_input, p_is_training)
127 | threshold = para.get_threshold(mu,sigma,sess,input_, output_,p_input, p_is_training)
128 | 
129 | # test = EncDecAD_Test(conf)
130 | # test.test_encdecad(sess,input_,output_,p_input,p_is_training,mu,sigma,threshold,beta = 0.5)
131 | 
132 | c_mu = tf.constant(mu,dtype=tf.float32,name = "mu")
133 | c_sigma = tf.constant(sigma,dtype=tf.float32,name = "sigma")
134 | c_threshold = tf.constant(threshold,dtype=tf.float32,name = "threshold")
135 | print("Saving model to disk...")
136 | save_path = saver.save(sess, conf.modelpath_p)
137 | print("Model saved along with parameters and threshold in file: %s" % save_path)
138 | 
139 | print("--- Initialization time: %s seconds ---" % (time.time() - start_time))
140 | 
141 | f = open(conf.log_path,'a')
142 | f.write("Early stopping at epoch %d\n"%i)
143 | # f.write("mu:\n",mu,"sigma:\n",sigma,"threshold:\n",threshold)
144 | f.write("Model saved along with parameters and threshold in file: %s" % save_path)
145 | f.write("--- Initialization time: %s seconds ---" % (time.time() - start_time))
146 | f.close()
--------------------------------------------------------------------------------
/src/Initialization/initialization.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Aug 16 16:44:42 2018
4 | 
5 | @author: Bin
6 | """
7 | 
8 | import argparse
9 | from initTrain import Initialization_Train
10 | 
11 | def parseArguments():
12 | parser = argparse.ArgumentParser()
13 | # Positional mandatory arguments
14 | parser.add_argument("dataset", help="power/smtp/http/forest", type=str)
15 | parser.add_argument("dataPath", help="input data path", type=str)
16 | parser.add_argument("modelSavePath", help="folder to save the trained model", type=str)
17 | 
18 | # Parse arguments
19 | args = parser.parse_args()
20 | 
21 | return args
22 | 
23 | if __name__=="__main__":
24 | 
25 | args = parseArguments()
26 | dataset = args.__dict__['dataset']
27 | dataPath = args.__dict__['dataPath']
28 | modelSavePath= args.__dict__['modelSavePath']
29 | 
30 | Initialization_Train(dataset,dataPath, modelSavePath)
31 | 
32 | 
--------------------------------------------------------------------------------
/src/Initialization/parameterHelper.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Fri Aug 17 14:08:21 2018
4 | 
5 | @author: Bin
6 | """
7 | 
8 | import numpy as np
9 | from scipy.spatial.distance import mahalanobis
10 | 
11 | class Parameter_Helper(object):
12 | 
13 | def __init__(self, conf):
14 | self.conf = conf
15 | 
16 | 
17 | def mu_and_sigma(self,sess,input_, output_,p_input, p_is_training):
18 | 
19 | err_vec_list = []
20 | 
21 | ind = list(np.random.permutation(len(self.conf.vn1_list)))
22 | 
23 | while len(ind)>=self.conf.batch_num:
24 | data = []
25 | for _ in range(self.conf.batch_num):
26 | data += [self.conf.vn1_list[ind.pop()]]
27 | data = np.array(data,dtype=float)
28 | data = data.reshape((self.conf.batch_num,self.conf.step_num,self.conf.elem_num))
29 | 
30 | (_input_, _output_) = sess.run([input_, output_], {p_input: data, p_is_training: False})
31 | abs_err = abs(_input_ - _output_)
32 | err_vec_list += [abs_err[i] for i in range(abs_err.shape[0])]
33 | 
34 | 
35 | # new metric
36 | err_vec_array = np.array(err_vec_list).reshape(-1,self.conf.elem_num)
37 | 
38 | # for univariate data, the anomaly score is the squared Euclidean distance
39 | # for multivariate data, the anomaly score is the squared Mahalanobis distance
40 | 
41 | mu = np.mean(err_vec_array.ravel()) if self.conf.elem_num == 1 else np.mean(err_vec_array,axis=0)
42 | sigma = np.var(err_vec_array.ravel()) if self.conf.elem_num == 1 else np.cov(err_vec_array.T)
43 | 
44 | print("Got parameters mu and sigma.")
45 | 
46 | return mu, sigma
47 | 
48 | 
49 | 
50 | def get_threshold(self,mu,sigma,sess,input_, output_,p_input, p_is_training):
51 | 
52 | normal_score = []
53 | for count in range(len(self.conf.vn2_list)//self.conf.batch_num):
54 | normal_sub = np.array(self.conf.vn2_list[count*self.conf.batch_num:(count+1)*self.conf.batch_num])
55 | (input_n, output_n) = sess.run([input_, output_], {p_input: normal_sub,p_is_training : False})
56 | 
57 | err_n = abs(input_n-output_n).reshape(-1,self.conf.step_num,self.conf.elem_num)
58 | for window in range(self.conf.batch_num):
59 | for t in range(self.conf.step_num):
60 | s = mahalanobis(err_n[window,t],mu,sigma)
61 | normal_score.append(s)
62 | 
63 | 
64 | abnormal_score = []
65 | '''
66 | If there is enough anomaly data, calculate the anomaly scores and take the
67 | threshold that achieves the best F-score as the decision boundary;
68 | otherwise estimate the threshold from the normal scores.
69 | '''
70 | print(len(self.conf.va_list))
71 | 
72 | if len(self.conf.va_list) < self.conf.batch_num: # not enough anomaly data for a single batch
73 | threshold = max(normal_score) * 2
74 | print("The va set is too small; estimating the threshold from normal scores only.")
75 | 
76 | else:
77 | 
78 | for count in range(len(self.conf.va_list)//self.conf.batch_num):
79 | abnormal_sub = np.array(self.conf.va_list[count*self.conf.batch_num:(count+1)*self.conf.batch_num])
80 | va_lable_list = np.array(self.conf.va_label_list[count*self.conf.batch_num:(count+1)*self.conf.batch_num])
81 | va_lable_list = va_lable_list.reshape(self.conf.batch_num,self.conf.step_num)
82 | 
83 | (input_a, output_a) = sess.run([input_, output_], {p_input: abnormal_sub,p_is_training : False})
84 | err_a = abs(input_a-output_a).reshape(-1,self.conf.step_num,self.conf.elem_num)
85 | for window in range(self.conf.batch_num):
86 | for t in range(self.conf.step_num):
87 | # temp = np.dot((err_a[window,t,:] - mu[t,:] ) , sigma[t])
88 | # s = np.dot(temp,(err_a[window,t,:] - mu[t,:] ).T)
89 | s = mahalanobis(err_a[window,t],mu,sigma)
90 | if va_lable_list[window,t] == "normal.": # labels carry a trailing period, cf. dataHelper.preprocessing
91 | normal_score.append(s)
92 | else:
93 | abnormal_score.append(s)
94 | 
95 | upper = np.median(np.array(abnormal_score))
96 | lower = np.median(np.array(normal_score))
97 | scala = 20 # divide the range (lower, upper) into 20 parts and search for the optimal threshold
98 | delta = (upper-lower) / scala
99 | candidate = lower
100 | threshold = 0
101 | result = 0
102 | 
103 | def evaluate(threshold,normal_score,abnormal_score):
104 | 
105 | beta = 0.5
106 | tp = np.array(abnormal_score)[np.array(abnormal_score)>threshold].size
107 | fp = len(abnormal_score)-tp
108 | fn = np.array(normal_score)[np.array(normal_score)>threshold].size
109 | tn = len(normal_score)- fn
110 | 
111 | if tp == 0: return 0
112 | 
113 | P = tp/(tp+fp)
114 | R = tp/(tp+fn)
115 | fbeta= (1+beta*beta)*P*R/(beta*beta*P+R)
116 | return fbeta
117 | 
118 | for _ in range(scala):
119 | r = evaluate(candidate,normal_score,abnormal_score)
120 | if r > result:
121 | result = r
122 | threshold = candidate
123 | candidate += delta
124 | 
125 | print("Threshold: ",threshold)
126 | 
127 | return threshold
128 | 
--------------------------------------------------------------------------------
/src/OnlinePrediction/OnlinePrediction.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Aug 21 15:50:07 2018
4 | 
5 | @author: Bin
6 | """
7 | import argparse
8 | 
9 | import pandas as pd
10 | import numpy as np
11 | import tensorflow as tf
12 | import time
13 | from sklearn import metrics
14 | 
15 | from conf_online import Conf
16 | from ProcessingHelper import processingHelper
17 | 
18 | 
19 | def parseArguments():
20 | parser = argparse.ArgumentParser()
21 | # Positional mandatory arguments
22 | parser.add_argument("datasetname", help="power/smtp/http/forest", type=str)
23 | parser.add_argument("dataPath", help="input data path", type=str)
24 | parser.add_argument("modelPath", help="folder of trained model", type=str)
25 | 
26 | # Parse arguments
27 | args = parser.parse_args()
28 | 
29 | return args
30 | 
31 | if __name__ == "__main__":
32 | 
33 | args = parseArguments()
34 | datasetname = args.__dict__['datasetname']
35 | datapath = args.__dict__['dataPath']
36 | modelpath_root = args.__dict__['modelPath']
37 | 
38 | 
39 | 
40 | 
41 | # get dataset parameters
42 | conf = Conf(datasetname)
43 | batch_num = conf.batch_num
44 | hidden_num = conf.hidden_num
45 | step_num = conf.step_num
46 | elem_num = conf.elem_num
47 | init_wins = conf.training_set_size
48 | 
49 | # load dataset and divide into batches and windows
50 | names = [str(x) for x in range(elem_num)] +["label"]
51 | df = pd.read_csv(datapath,header=None,names=names,skiprows=init_wins)
52 | 
53 | batches = df.shape[0]//step_num//batch_num
54 | 
55 | test_set = df.iloc[:batches*batch_num*step_num,:-1]
56 | labels =df.iloc[:batches*batch_num*step_num,-1]
57 | ts = test_set.as_matrix().reshape(batches,batch_num,step_num,elem_num)
58 | test_set_list = [ts[a] for a in range(batches)]
59 | 
60 | wins = batches * batch_num
61 | # figure out anomaly windows
62 | buffer = [labels[i*step_num:(i+1)*step_num] for i in range(0,labels.size//step_num)]
63 | anomaly_index = []
64 | count = 0
65 | 
66 | for buf in buffer:
67 | if "anomaly" in buf.tolist():
68 | anomaly_index.append(count)
69 | else:
70 | pass
71 | count +=1
72 | 
73 | expert = ["normal"]*wins
74 | for x in anomaly_index:
75 | expert[x] = "anomaly"
76 | 
77 | 
78 | # load model
79 | 
80 | modelmeta_p = modelpath_root + "_"+str(batch_num)+"_"+str(hidden_num)+"_"+str(step_num)+"_para.ckpt.meta"
81 | 
82 | sess = tf.Session()
83 | saver = tf.train.import_meta_graph(modelmeta_p) # load the trained graph, but without the trained parameters
84 | saver.restore(sess,tf.train.latest_checkpoint(modelpath_root))
85 | graph = tf.get_default_graph()
86 | 
87 | p_input = graph.get_tensor_by_name("p_input:0")
88 | p_inputs = [tf.squeeze(t, [1]) for t in tf.split(p_input, step_num, 1)]
89 | p_is_training = graph.get_tensor_by_name("is_training_:0")
90 | 
91 | input_= tf.transpose(tf.stack(p_inputs), [1, 0, 2])
92 | output_ = graph.get_tensor_by_name("decoder/output_:0")
93 | 
94 | tensor_mu = graph.get_tensor_by_name("mu:0")
95 | tensor_sigma = 
graph.get_tensor_by_name("sigma:0") 96 | tensor_threshold = graph.get_tensor_by_name("threshold:0") 97 | 98 | loss_ = graph.get_tensor_by_name("decoder/loss:0") 99 | train_ = graph.get_operation_by_name("cond/train_") 100 | 101 | mu = sess.run(tensor_mu) 102 | sigma = sess.run(tensor_sigma) 103 | threshold = sess.run(tensor_threshold) 104 | 105 | 106 | # online phase 107 | 108 | count = 0 109 | n_buf = [] 110 | a_buf = [] 111 | 112 | y = [] 113 | output=[] 114 | err_nbuf = [] 115 | err_abuf = [] 116 | all_scores = [] 117 | 118 | start_time = time.time() 119 | helper = processingHelper 120 | 121 | 122 | for l in labels: 123 | if l == "normal": 124 | y +=[0] 125 | else: 126 | y +=[1] 127 | for ids in range(len(test_set_list)): 128 | 129 | data = test_set_list[ids] 130 | if count % 100 == 0: 131 | print(count,"batches processed.") 132 | prediction = [] 133 | df = helper.local_preprocessing(data) 134 | (input_n, output_n) = sess.run([input_, output_], {p_input: df, p_is_training: False}) 135 | err = abs(input_n-output_n).reshape(-1,elem_num) 136 | scores = helper.scoring(err,mu,sigma) 137 | all_scores.append(scores) 138 | output += [ss.max() for ss in np.array(scores).reshape(batch_num,step_num)] 139 | pred = [scores[b*step_num:(b+1)*step_num] for b in range(batch_num)] 140 | label = [expert[count*batch_num+b] for b in range(batch_num)] 141 | e = err 142 | 143 | for i in range(pd.DataFrame(pred).shape[0]):#loop batch_num 144 | index = i 145 | value=pd.DataFrame(pred).iloc[i,:] 146 | 147 | if value[value>threshold].size>=conf.HardCriterion: 148 | if label[index] == "anomaly": 149 | a_buf += [df[index,x,:] for x in range(step_num)] 150 | err_abuf = np.concatenate((err_abuf , e[index*step_num:(index+1)*step_num]),axis=0) if len(err_abuf) != 0 else e[index*step_num:(index+1)*step_num] 151 | else: 152 | err_nbuf =np.concatenate((err_nbuf , e[index*step_num:(index+1)*step_num]),axis=0) if len(err_nbuf) != 0 else e[index*step_num:(index+1)*step_num] 153 | n_buf += [df[index,x,:] for x in range(step_num)] 154 | else: 155 | if label[index] == "anomaly": 156 | a_buf += [df[index,x,:] for x in range(step_num)] 157 | err_abuf = np.concatenate((err_abuf , e[index*step_num:(index+1)*step_num]),axis=0) if len(err_abuf) != 0 else e[index*step_num:(index+1)*step_num] 158 | else: 159 | err_nbuf = np.concatenate((err_nbuf , e[index*step_num:(index+1)*step_num]),axis=0) if len(err_nbuf) != 0 else e[index*step_num:(index+1)*step_num] 160 | count +=1 161 | 162 | #Check update 163 | if len(n_buf)>=batch_num*step_num*conf.buffersize and len(a_buf) !=0: 164 | while (len(a_buf) < batch_num*step_num): 165 | a_buf += a_buf 166 | 167 | B = len(n_buf) //(batch_num*step_num) 168 | n_buf = n_buf[:batch_num*step_num*B] 169 | A = len(a_buf)//(batch_num*step_num) 170 | a_buf = a_buf[:batch_num*step_num*A] 171 | 172 | print("retrain at %d batch"%count) 173 | loss_list_all=[] 174 | 175 | datalist = np.array(n_buf[:batch_num*step_num*(B-1)]).reshape(-1,batch_num,step_num,elem_num) 176 | validation_list_n = np.array(n_buf[batch_num*step_num*(B-1):]).reshape(-1,batch_num,step_num,elem_num) 177 | validation_list_a = np.array(a_buf).reshape(-1,batch_num,step_num,elem_num) 178 | 179 | patience = 10 180 | min_delta = 0.005 181 | lastLoss = np.float('inf') 182 | patience_cnt = 0 183 | 184 | for i in range(300): 185 | loss_list=[] 186 | for data in datalist: 187 | (loss, _) = sess.run([loss_, train_], {p_input: data,p_is_training : True}) 188 | loss_list.append(loss) 189 | loss_list_all.append( np.array(loss_list).mean()) 190 | 191 | if i 
> 0 and lastLoss - loss >min_delta: 192 | patience_cnt =0 193 | else: 194 | patience_cnt += 1 195 | lastLoss = loss 196 | if patience_cnt > patience: 197 | print("Early stopping...") 198 | break 199 | 200 | 201 | err_nbuf_tmp = np.array(err_nbuf).reshape(-1,elem_num) 202 | mu,sigma = helper.get_musigma(err_nbuf_tmp,mu,sigma) 203 | 204 | print("Parameters updated!") 205 | pd.Series(loss_list_all).plot(title="Loss") 206 | n_buf = [] 207 | a_buf = [] 208 | err_buf = [] 209 | err_nbuf = [] 210 | err_abuf = [] 211 | 212 | fpr, tpr, thresholds = metrics.roc_curve(expert, output, pos_label="anomaly") 213 | auc = metrics.auc(fpr, tpr) 214 | helper.plot_roc(fpr,tpr,auc) 215 | print("--- Used time: %s seconds ---" % (time.time() - start_time)) -------------------------------------------------------------------------------- /src/OnlinePrediction/ProcessingHelper.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Aug 21 16:07:31 2018 4 | 5 | @author: Bin 6 | """ 7 | 8 | import pandas as pd 9 | from sklearn.preprocessing import MinMaxScaler 10 | import matplotlib.pyplot as plt 11 | import numpy as np 12 | from scipy.spatial.distance import mahalanobis,euclidean 13 | 14 | 15 | class processingHelper(object): 16 | 17 | def local_preprocessing(batchdata): 18 | # input batchdata with shape : [batch_num, step_num, elem_num] 19 | # minmax scaler on window level 20 | df = pd.DataFrame() 21 | 22 | for window in batchdata: 23 | 24 | scaler = MinMaxScaler() 25 | scaler.fit(window) 26 | new_win = scaler.transform(window) 27 | df = pd.concat((df, pd.DataFrame(new_win)),axis=0) if df.size!=0 else pd.DataFrame(new_win) 28 | return df.as_matrix().reshape(batchdata.shape) 29 | 30 | def scoring(err,mu,sigma): 31 | 32 | scores = [] 33 | for e in err: 34 | scores.append(mahalanobis(e,mu,sigma)) 35 | 36 | return scores 37 | 38 | def get_musigma(err_nbuf,mu,sigma): 39 | 40 | err_vec_array = np.array(err_nbuf) 41 | # for multivariate data, cov, for univariate data, var 42 | mu = np.mean(err_vec_array.ravel()) if err_nbuf.shape[1] == 1 else np.mean(err_vec_array,axis=0) 43 | sigma = np.var(err_vec_array.ravel()) if err_nbuf.shape[1] == 1 else np.cov(err_vec_array.T) 44 | 45 | return mu, sigma 46 | 47 | def get_threshold(normal_score, abnormal_score): 48 | upper = np.median(np.array(abnormal_score)) 49 | lower = np.median(np.array(normal_score)) 50 | scala = 20 51 | delta = (upper-lower) / scala 52 | candidate = lower 53 | threshold = 0 54 | result = 0 55 | 56 | def evaluate(threshold,normal_score,abnormal_score): 57 | 58 | beta = 0.5 59 | tp = np.array(abnormal_score)[np.array(abnormal_score)>threshold].size 60 | fp = len(abnormal_score)-tp 61 | fn = np.array(normal_score)[np.array(normal_score)>threshold].size 62 | tn = len(normal_score)- fn 63 | 64 | if tp == 0: return 0 65 | 66 | P = tp/(tp+fp) 67 | R = tp/(tp+fn) 68 | fbeta= (1+beta*beta)*P*R/(beta*beta*P+R) 69 | return fbeta 70 | 71 | for _ in range(scala): 72 | r = evaluate(candidate,normal_score,abnormal_score) 73 | if r > result: 74 | result = r 75 | threshold = candidate 76 | candidate += delta 77 | return threshold 78 | 79 | def plot_roc(fpr,tpr,auc): 80 | plt.figure() 81 | lw = 2 82 | plt.plot(fpr, tpr, color='darkorange', 83 | lw=lw, label='ROC curve (area = %0.2f)' %auc) 84 | plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--') 85 | plt.xlim([0.0, 1.0]) 86 | plt.ylim([0.0, 1.05]) 87 | plt.xlabel('False Positive Rate') 88 | plt.ylabel('True Positive Rate') 89 | 
plt.title('Receiver operating characteristic example') 90 | plt.legend(loc="lower right") 91 | plt.show() -------------------------------------------------------------------------------- /src/OnlinePrediction/__pycache__/ProcessingHelper.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/OnlinePrediction/__pycache__/ProcessingHelper.cpython-36.pyc -------------------------------------------------------------------------------- /src/OnlinePrediction/__pycache__/conf_online.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/OnlinePrediction/__pycache__/conf_online.cpython-36.pyc -------------------------------------------------------------------------------- /src/OnlinePrediction/conf_online.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Aug 21 15:53:12 2018 4 | 5 | @author: Bin 6 | """ 7 | 8 | class Conf(object): 9 | 10 | def __init__(self, dataset): 11 | 12 | 13 | if dataset == "power": 14 | 15 | self.batch_num = 8 16 | self.hidden_num = 15 17 | self.step_num = 84 18 | self.elem_num = 1 19 | self.training_set_size = self.step_num*12 20 | self.HardCriterion = 5 21 | self.buffersize = 9# number of batches 22 | elif dataset == "smtp": 23 | 24 | self.batch_num = 8 25 | self.hidden_num = 15 26 | self.step_num = 10 27 | self.elem_num = 34 28 | self.training_set_size = self.step_num*6000 29 | self.HardCriterion = 5 30 | self.buffersize = 50# number of batches 31 | elif dataset == "http": 32 | 33 | self.batch_num = 8 34 | self.hidden_num = 35 35 | self.step_num = 30 36 | self.elem_num = 34 37 | self.training_set_size = self.step_num*30000 38 | self.HardCriterion = 5 39 | self.buffersize = 1000 # number of batches 40 | 41 | elif dataset == "smtphttp": 42 | self.batch_num = 8 43 | self.hidden_num = 15 44 | self.step_num = 10 45 | self.elem_num = 34 46 | self.training_set_size = self.step_num*2500 47 | self.HardCriterion = 5 48 | self.buffersize = 1500# number of batches 49 | 50 | elif dataset == "forest": 51 | self.batch_num = 8 52 | self.hidden_num = 25 53 | self.step_num = 10 54 | self.elem_num = 7 55 | self.training_set_size = self.step_num*10000 56 | self.HardCriterion = 4 57 | self.buffersize = 400 # number of batches 58 | 59 | else: 60 | print("Wrong dataset name input.") 61 | 62 | --------------------------------------------------------------------------------