├── Figures
│   ├── LSTM-Autoencoder.PNG
│   ├── Online.PNG
│   └── example.PNG
├── LICENSE
├── README.md
├── Slides.pdf
├── Thesis.pdf
├── data
│   └── power_data.txt
└── src
    ├── DataPrepare
    │   └── dataPreparation.py
    ├── Initialization
    │   ├── __pycache__
    │   │   ├── conf_init.cpython-36.pyc
    │   │   ├── dataHelper.cpython-36.pyc
    │   │   ├── encdecad.cpython-36.pyc
    │   │   ├── initTrain.cpython-36.pyc
    │   │   └── parameterHelper.cpython-36.pyc
    │   ├── conf_init.py
    │   ├── dataHelper.py
    │   ├── encdecad.py
    │   ├── initTrain.py
    │   ├── initialization.py
    │   └── parameterHelper.py
    └── OnlinePrediction
        ├── OnlinePrediction.py
        ├── ProcessingHelper.py
        ├── __pycache__
        │   ├── ProcessingHelper.cpython-36.pyc
        │   └── conf_online.cpython-36.pyc
        └── conf_online.py
/Figures/LSTM-Autoencoder.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/Figures/LSTM-Autoencoder.PNG
--------------------------------------------------------------------------------
/Figures/Online.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/Figures/Online.PNG
--------------------------------------------------------------------------------
/Figures/example.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/Figures/example.PNG
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 bli22
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Anomaly detection for streaming data using autoencoders
2 |
3 | This project is my master's thesis. The main goal is to maintain an adaptive autoencoder-based anomaly detection framework that not only detects contextual anomalies in streaming data, but also updates itself according to the latest data characteristics.
4 |
5 | ## Quick access
6 |
7 | - [Thesis](https://github.com/binli826/LSTM-Autoencoders/blob/master/Thesis.pdf)
8 | - [Slides](https://github.com/binli826/LSTM-Autoencoders/blob/master/Slides.pdf)
9 | - [Usage](https://github.com/binli826/LSTM-Autoencoders/tree/master#usage)
10 |
11 | ## Introduction
12 | The high-volume and high-velocity data streams generated by devices and applications in
13 | different domains grow steadily and are valuable for big data research. One of the most
14 | important topics is anomaly detection for streaming data, which has attracted attention
15 | and investigation in plenty of areas, e.g., sensor data anomaly detection, predictive
16 | maintenance, and event detection. Such efforts could potentially avoid a large amount of
17 | financial cost in manufacturing. However, unlike traditional anomaly detection tasks,
18 | anomaly detection in streaming data is especially difficult because data arrives over
19 | time with latent distribution changes, so a single stationary model does not fit the
20 | stream all the time. An anomaly could become normal as the data evolves, so it is
21 | necessary to maintain a dynamic system that adapts to the changes. In this work,
22 | we propose an LSTM-Autoencoder anomaly detection model for streaming data. It is a
23 | mini-batch based stream processing approach. We experimented with streaming data
24 | containing different kinds of anomalies as well as concept drifts; the results suggest
25 | that our model can reliably detect anomalies in a data stream and update itself in time
26 | to fit the latest data properties.
27 |
28 | ## Model
29 | #### LSTM-Autoencoder
30 | The LSTM-Autoencoder is based on the work of [Malhotra et al.]. There are two LSTM units, one as the encoder and the other as the decoder. The model is trained only on normal data, so the reconstruction of anomalies is expected to lead to a higher reconstruction error.
31 |
32 | 
33 |
34 | > **Input/Output format**
35 | >
36 | > < Batch size, Time steps, Data dimensions >
37 | > Batch size: Number of windows contained in a single batch
38 | > Time steps: Number of instances within a window (T)
39 | > Data dimensions: Size of feature space
40 |
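To make this format concrete, here is a minimal NumPy sketch (with made-up data and the SMTP-like setting of T = 10 and 34 features) of how a stream is cut into windows and grouped into batches of this shape, mirroring the reshaping done in `dataHelper.py` and `OnlinePrediction.py`:

```python
import numpy as np

# Hypothetical stream: 1000 time steps, 34 features (illustration only).
stream = np.random.rand(1000, 34)

step_num = 10   # time steps per window (T)
batch_num = 8   # windows per batch

# Cut the stream into non-overlapping windows of length T.
n_windows = stream.shape[0] // step_num
windows = stream[:n_windows * step_num].reshape(n_windows, step_num, stream.shape[1])

# Group the windows into batches of shape <batch size, time steps, data dimensions>.
n_batches = n_windows // batch_num
batches = windows[:n_batches * batch_num].reshape(n_batches, batch_num, step_num, stream.shape[1])

print(batches[0].shape)   # (8, 10, 34) -> one batch fed to the model
```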
41 | #### Online framework
42 | Once the LSTM-Autoencoder has been initialized with a subset of the respective data stream, it is used for online anomaly detection. For each accumulated batch of streaming data, the model predicts each window as normal or anomalous. Afterwards, we introduce experts to label the windows and evaluate the performance. Hard windows are appended to the updating buffers. Once the normal buffer is full, the LSTM-Autoencoder is trained further with only the hard windows in the buffers (a schematic sketch follows the figure below).
43 |
44 | 
45 |
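The update logic can be summarized with the following toy simulation. It is a simplified sketch, not the actual implementation (which lives in `src/OnlinePrediction/OnlinePrediction.py`): `predict_window` and `expert_label` are stand-ins for the LSTM-Autoencoder scoring and the expert feedback, and all data are randomly generated.

```python
import numpy as np

np.random.seed(0)

batch_num, step_num, elem_num = 8, 10, 34   # SMTP-like setting
buffer_size = 5                             # normal-buffer capacity, in batches
threshold = 0.45                            # pretend anomaly-score threshold

def predict_window(window):
    """Toy scorer: flag the window if its largest deviation from the column
    means exceeds the threshold (stand-in for the reconstruction error)."""
    return "anomaly" if np.abs(window - window.mean(axis=0)).max() > threshold else "normal"

def expert_label(window):
    """Toy expert: random labels here; in the framework, experts label the windows."""
    return "anomaly" if np.random.rand() < 0.05 else "normal"

normal_buffer, anomaly_buffer, updates = [], [], 0

for _ in range(200):                                    # 200 accumulated mini-batches
    batch = np.random.rand(batch_num, step_num, elem_num)
    for window in batch:
        pred, label = predict_window(window), expert_label(window)
        if label == "anomaly":
            anomaly_buffer.append(window)               # anomalous hard window
        elif pred == "anomaly":
            normal_buffer.append(window)                # false alarm -> hard normal window
    if len(normal_buffer) >= buffer_size * batch_num and anomaly_buffer:
        # Here the LSTM-Autoencoder would be trained further on the buffered hard
        # windows and mu / sigma / threshold re-estimated (omitted in this sketch).
        updates += 1
        normal_buffer, anomaly_buffer = [], []

print("model updates triggered:", updates)
```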
46 | ## Datasets
47 | The model is evaluated on 5 datasets. The [PowerDemand](https://github.com/binli826/LSTM-Autoencoders/blob/master/data/power_data.txt) dataset records the power demand over one year; the abnormal power demand on special days (e.g. festivals, Christmas, etc.) is labeled as anomalous.
48 | SMTP and HTTP are extracted from the [KDDCup99 dataset](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). SMTP+HTTP is a direct concatenation of SMTP and HTTP, in order to simulate a concept drift in between.
49 | Here the network attacks are treated as anomalies. The [FOREST](https://archive.ics.uci.edu/ml/datasets/covertype) dataset records statistics of 7 different forest cover types. We follow the same setting as [Dong et al.] and take the smallest class, Cottonwood/Willow, as the anomaly.
50 | The following table shows statistical information for each dataset. (Only numerical features are taken into consideration.)
51 |
52 | | Dataset | Dimensionality | #Instances | Anomaly proportion (%) |
53 | | :------: | :------: | :------: | :------: |
54 | | PowerDemand | 1 | 35,040 | 2.20 |
55 | | SMTP | 34 | 96,554 | 1.22 |
56 | | HTTP | 34 | 623,091 | 0.65 |
57 | | SMTP+HTTP | 34 | 719,645 | 0.72 |
58 | | FOREST | 7 | 581,012 | 0.47 |
59 |
60 | ## Results
61 | Here is a reconstruction example of a normal window and an anomaly window from the PowerDemand data.
62 | 
63 | 
64 | With AUC as the evaluation metric, we obtained the following performance for data stream anomaly detection (a minimal scoring sketch follows the table).
65 |
66 | | Dataset | AUC without updating | AUC with updating | #Updating |
67 | | :------: | :------: | :------: | :------: |
68 | | PowerDemand | 0.91 | 0.97 | 2 |
69 | | SMTP | 0.94 | 0.98 | 2 |
70 | | HTTP | 0.76 | 0.86 | 2 |
71 | | SMTP+HTTP | 0.64 | 0.85 | 3 |
72 | | FOREST | 0.74 | 0.82 | 8 |
73 |
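For reference, these AUC values are computed from the per-window anomaly scores against the expert labels with `sklearn.metrics.roc_curve`, as in `OnlinePrediction.py`. A minimal self-contained sketch (the scores and labels below are made up for illustration):

```python
from sklearn import metrics

# Made-up per-window anomaly scores and expert labels.
scores = [0.2, 0.4, 3.1, 0.3, 2.8, 0.5]
labels = ["normal", "normal", "anomaly", "normal", "anomaly", "normal"]

fpr, tpr, thresholds = metrics.roc_curve(labels, scores, pos_label="anomaly")
print("AUC:", metrics.auc(fpr, tpr))   # 1.0 for this toy example
```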
74 |
75 | ## Usage
76 | #### Data preparation
77 | Once the datasets are available, convert the raw data into a uniform format using [dataPreparation.py].
78 |
79 | ```sh
80 | python /src/DataPrepare/dataPreparation.py dataset inputpath outputpath --powerlabel --kddcol
81 | # Example
82 | python /src/DataPrepare/dataPreparation.py kdd /mypath/kddcup.data.corrected /mypath --kddcol /mypath/columns.txt
83 | ```
84 | #### Initialization
85 | With the processed dataset, the model initialization phase can be run with the following command, specifying the dataset to use, the data path, and a folder path to save the trained model.
86 | ```sh
87 | python /src/Initialization/initialization.py dataset dataPath modelSavePath
88 | # Example
89 | python /src/Initialization/initialization.py smtp /mypath/smtp.csv /mypath/models/
90 | ```
91 |
92 | #### Online prediction
93 | Once the data are prepared and the model is initialized and saved locally, the online prediction process can be executed as follows:
94 | ```sh
95 | python /src/OnlinePrediction/OnlinePrediction.py datasetname dataPath modelPath
96 | # Example
97 | python /src/OnlinePrediction/OnlinePrediction.py smtp /mypath/smtp.csv /mypath/model_smtp/
98 | ```
99 | #### About hyper-parameters
100 | Hyper-parameters are learned by grid search for each dataset, and can be modified in [conf_init.py] and [conf_online.py].
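For example, the values used by the online phase can be inspected directly; a small sketch, assuming it is run from `src/OnlinePrediction/`:

```python
from conf_online import Conf

conf = Conf("smtp")
print(conf.batch_num, conf.hidden_num, conf.step_num, conf.elem_num)   # 8 15 10 34
print(conf.HardCriterion, conf.buffersize)                             # 5 50
```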
101 |
102 |
103 |
104 | ## Versions
105 | This project works with
106 | * Python 3.6
107 | * Tensorflow 1.4.0
108 | * Numpy 1.13.3
109 |
110 | [Malhotra et al.]:
111 | [Dong et al.]:
112 | [dataPreparation.py]: https://github.com/binli826/LSTM-Autoencoders/blob/master/src/DataPrepare/dataPreparation.py
113 | [conf_init.py]: https://github.com/binli826/LSTM-Autoencoders/blob/master/src/Initialization/conf_init.py
114 | [conf_online.py]: https://github.com/binli826/LSTM-Autoencoders/blob/master/src/OnlinePrediction/conf_online.py
--------------------------------------------------------------------------------
/Slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/Slides.pdf
--------------------------------------------------------------------------------
/Thesis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/Thesis.pdf
--------------------------------------------------------------------------------
/src/DataPrepare/dataPreparation.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Dec 14 14:14:24 2017
4 |
5 | @author: Bin
6 | """
7 |
8 | import argparse
9 | import pandas as pd
10 | import numpy as np
11 |
12 | def parseArguments():
13 | parser = argparse.ArgumentParser()
14 | # Positional mandatory arguments
15 | parser.add_argument("dataset", help="power/kdd/forest", type=str)
16 | parser.add_argument("datapath", help="input data path", type=str)
17 | parser.add_argument("savepath", help="folder to save the processed data", type=str)
18 |
19 | # Optional arguments
20 | parser.add_argument("-pl", "--powerlabel", help="Label file of power demand dataset ", type=str)
21 | parser.add_argument("-kc", "--kddcol", help="Column file of KDD dataset ", type=str)
22 |
23 | # Parse arguments
24 | args = parser.parse_args()
25 |
26 | return args
27 |
28 |
29 |
30 | def power(pathPowerData, pathPowerLabel, pathPowerSave):
31 | '''Process PowerDemand dataset
32 |
33 |     We downsample the PowerDemand dataset with rate 8,
34 |     and remove the first and last half week, preserving only the 51 full weeks,
35 |     with each week being described by 84 instances.
36 | '''
37 |
38 | powerData = pd.read_csv(pathPowerData,header=None)[::8][60:-36].reset_index(drop=True)
39 |
40 |
41 |     # The PowerDemand dataset is manually labeled according to the special days in a year.
42 |
43 |
44 | powerLabel = pd.read_csv(pathPowerLabel,header=None)
45 |
46 | PowerDemand = pd.concat((powerData,powerLabel),axis=1)
47 |
48 | PowerDemand.to_csv(pathPowerSave+"/PowerDemand.csv",header=None,index=None)
49 |
50 |
51 | def kdd(pathKDDcol_name, pathKDD, pathKDDSave):
52 | '''Process KDD99 dataset
53 |
54 | SMTP and HTTP are extracted from the KDD99 dataset, only with numerical features
55 | SMTP+HTTP is a connection of SMTP and HTTP
56 | '''
57 |
58 | try:
59 | with open(pathKDDcol_name) as col_file:
60 | line = col_file.readline()
61 | except EnvironmentError:
62 |         raise SystemExit('Column file not found: %s' % pathKDDcol_name)
63 |
64 | columns = line.split('.')
65 | col_names = []
66 | col_types = []
67 | for col in columns:
68 | col_names.append(col.split(': ')[0].strip())
69 | col_types.append(col.split(': ')[1])
70 |
71 |
72 | df = pd.read_csv(pathKDD,header=None)
73 |
74 | continuous = df.iloc[:,np.array(pd.Series(col_types)=="continuous")]
75 | continuous = pd.concat((continuous,df.iloc[:,-1]),axis=1)
76 | SMTP = continuous[df.iloc[:,2] == "smtp"].reset_index(drop=True)
77 | HTTP = continuous[df.iloc[:,2] == "http"].reset_index(drop=True)
78 | SMTPHTTP = pd.concat((SMTP,HTTP),axis=0).reset_index(drop=True)
79 |
80 | SMTP.to_csv(pathKDDSave+"/SMTP.csv",header=None,index=None)
81 | HTTP.to_csv(pathKDDSave+"/HTTP.csv",header=None,index=None)
82 | SMTPHTTP.to_csv(pathKDDSave+"/SMTPHTTP.csv",header=None,index=None)
83 |
84 | def forest(pathForestData,pathForestSave):
85 |
86 | '''Process Forest cover dataset
87 |
88 |     take the smallest class (cover type 4, Cottonwood/Willow) as anomaly, the rest are normal
89 | only use the 7 numerical features
90 | '''
91 |
92 | forest = pd.read_csv(pathForestData,header=None)
93 | numerical_col = [0,1,2,3,4,5,9]
94 |
95 | forestData = forest.iloc[:,numerical_col]
96 | forestLabel = forest.iloc[:,-1]
97 | forestLabel[forestLabel != 4] = 'normal.'
98 | forestLabel[forestLabel == 4] = 'anomaly.'
99 |
100 | forest = pd.concat((forestData,forestLabel),axis=1)
101 | forest.to_csv(pathForestSave+"/FOREST.csv",header=None,index=None)
102 |
103 |
104 | if __name__ == '__main__':
105 |
106 | args = parseArguments()
107 | dataset = args.__dict__['dataset']
108 | datapath = args.__dict__['datapath']
109 | savepath= args.__dict__['savepath']
110 |
111 | if dataset == "power":
112 | powerlabel = args.__dict__['powerlabel']
113 | power(datapath,powerlabel,savepath)
114 | elif dataset == "kdd":
115 | kddcol = args.__dict__['kddcol']
116 | kdd(kddcol,datapath,savepath)
117 | elif dataset == "forest":
118 | forest(datapath,savepath)
119 | else:
120 | print("Please input the dataset name: power/kdd/forest.")
121 |
--------------------------------------------------------------------------------
/src/Initialization/__pycache__/conf_init.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/Initialization/__pycache__/conf_init.cpython-36.pyc
--------------------------------------------------------------------------------
/src/Initialization/__pycache__/dataHelper.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/Initialization/__pycache__/dataHelper.cpython-36.pyc
--------------------------------------------------------------------------------
/src/Initialization/__pycache__/encdecad.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/Initialization/__pycache__/encdecad.cpython-36.pyc
--------------------------------------------------------------------------------
/src/Initialization/__pycache__/initTrain.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/Initialization/__pycache__/initTrain.cpython-36.pyc
--------------------------------------------------------------------------------
/src/Initialization/__pycache__/parameterHelper.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/Initialization/__pycache__/parameterHelper.cpython-36.pyc
--------------------------------------------------------------------------------
/src/Initialization/conf_init.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Aug 16 16:44:42 2018
4 |
5 | @author: Bin
6 | """
7 |
8 | from dataHelper import Data_Helper
9 |
10 | class Configuration(object):
11 |
12 | def __init__(self, dataset, dataPath, modelSavePath, training_data_source = "file", optimizer=None, decode_without_input=False):
13 |
14 |
15 | if dataset == "power":
16 |
17 | self.batch_num = 8
18 | self.hidden_num = 15
19 | self.step_num = 84
20 | self.training_set_size = self.step_num*12
21 |
22 | elif dataset == "smtp":
23 |
24 | self.batch_num = 8
25 | self.hidden_num = 15
26 | self.step_num = 10
27 | self.training_set_size = self.step_num*6000
28 |
29 | elif dataset == "http":
30 |
31 | self.batch_num = 8
32 | self.hidden_num = 35
33 | self.step_num = 30
34 | self.training_set_size = self.step_num*30000
35 |
36 | elif dataset == "smtphttp":
37 | self.batch_num = 8
38 | self.hidden_num = 15
39 | self.step_num = 10
40 | self.training_set_size = self.step_num*2500
41 |
42 | elif dataset == "forest":
43 | self.batch_num = 8
44 | self.hidden_num = 25
45 | self.step_num = 10
46 | self.training_set_size = self.step_num*10000
47 |
48 | else:
49 | print("Wrong dataset name input.")
50 |
51 |
52 | self.input_root =dataPath
53 | self.iteration = 300
54 | self.modelpath_root = modelSavePath
55 | self.modelmeta = self.modelpath_root + "_"+str(self.batch_num)+"_"+str(self.hidden_num)+"_"+str(self.step_num)+"_.ckpt.meta"
56 | self.modelpath_p = self.modelpath_root + "_"+str(self.batch_num)+"_"+str(self.hidden_num)+"_"+str(self.step_num)+"_para.ckpt"
57 | self.modelmeta_p = self.modelpath_root + "_"+str(self.batch_num)+"_"+str(self.hidden_num)+"_"+str(self.step_num)+"_para.ckpt.meta"
58 | self.decode_without_input = False
59 |
60 | self.log_path = modelSavePath + "log.txt"
61 |
62 |
63 | # import dataset
64 | # The dataset is divided into 6 parts, namely training_normal, validation_1,
65 | # validation_2, test_normal, validation_anomaly, test_anomaly.
66 |
67 | self.training_data_source = training_data_source
68 | data_helper = Data_Helper(self.input_root,self.training_set_size,self.step_num,self.batch_num,self.training_data_source,self.log_path)
69 |
70 | self.sn_list = data_helper.sn_list
71 | self.va_list = data_helper.va_list
72 | self.vn1_list = data_helper.vn1_list
73 | self.vn2_list = data_helper.vn2_list
74 | self.tn_list = data_helper.tn_list
75 | self.ta_list = data_helper.ta_list
76 | self.data_list = [self.sn_list, self.va_list, self.vn1_list, self.vn2_list, self.tn_list, self.ta_list]
77 |
78 | self.elem_num = data_helper.sn.shape[1]
79 | self.va_label_list = data_helper.va_label_list
80 |
81 |
82 | f = open(self.log_path,'a')
83 | f.write("Batch_num=%d\nHidden_num=%d\nwindow_length=%d\ntraining_used_#windows=%d\n"%(self.batch_num,self.hidden_num,self.step_num,self.training_set_size//self.step_num))
84 | f.close()
--------------------------------------------------------------------------------
/src/Initialization/dataHelper.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Aug 16 16:53:38 2018
4 |
5 | @author: Bin
6 | """
7 |
8 | import numpy as np
9 | import pandas as pd
10 | from sklearn.preprocessing import MinMaxScaler
11 |
12 |
13 | # Given an initialization dataset, split it into normal and abnormal lists for the different subsets.
14 |
15 | class Data_Helper(object):
16 |
17 | def __init__(self, path,training_set_size,step_num,batch_num,training_data_source,log_path):
18 |
19 | self.path = path
20 | self.step_num = step_num
21 | self.batch_num = batch_num
22 | self.training_data_source = training_data_source
23 | self.training_set_size = training_set_size
24 |
25 |
26 |
27 | self.df = pd.read_csv(self.path).iloc[:self.training_set_size,:]
28 |
29 | print("Preprocessing...")
30 |
31 | self.sn,self.vn1,self.vn2,self.tn,self.va,self.ta,self.va_labels = self.preprocessing(self.df,log_path)
32 | assert min(self.sn.size,self.vn1.size,self.vn2.size,self.tn.size,self.va.size,self.ta.size) > 0, "Not enough continuous data in file for training, ended."+str((self.sn.size,self.vn1.size,self.vn2.size,self.tn.size,self.va.size,self.ta.size))
33 |
34 |         # data serialization: cut each subset into windows of length step_num
35 | t1 = self.sn.shape[0]//step_num
36 | t2 = self.va.shape[0]//step_num
37 | t3 = self.vn1.shape[0]//step_num
38 | t4 = self.vn2.shape[0]//step_num
39 | t5 = self.tn.shape[0]//step_num
40 | t6 = self.ta.shape[0]//step_num
41 |
42 | self.sn_list = [self.sn[step_num*i:step_num*(i+1)].as_matrix() for i in range(t1)]
43 | self.va_list = [self.va[step_num*i:step_num*(i+1)].as_matrix() for i in range(t2)]
44 | self.vn1_list = [self.vn1[step_num*i:step_num*(i+1)].as_matrix() for i in range(t3)]
45 | self.vn2_list = [self.vn2[step_num*i:step_num*(i+1)].as_matrix() for i in range(t4)]
46 |
47 | self.tn_list = [self.tn[step_num*i:step_num*(i+1)].as_matrix() for i in range(t5)]
48 | self.ta_list = [self.ta[step_num*i:step_num*(i+1)].as_matrix() for i in range(t6)]
49 |
50 | self.va_label_list = [self.va_labels[step_num*i:step_num*(i+1)].as_matrix() for i in range(t2)]
51 |
52 | print("Ready for training.")
53 |
54 | def preprocessing(self,df,log_path):
55 |
56 | #scaling
57 | label = df.iloc[:,-1]
58 | scaler = MinMaxScaler()
59 | scaler.fit(df.iloc[:,:-1])
60 | cont = pd.DataFrame(scaler.transform(df.iloc[:,:-1]))
61 | data = pd.concat((cont,label),axis=1)
62 |
63 |         # split data according to window length:
64 |         # split the dataframe into segments of length L; if a window contains at least one anomaly, the whole window is an anomaly window
65 | n_list = []
66 | a_list = []
67 |
68 | windows = [data.iloc[w*self.step_num:(w+1)*self.step_num,:] for w in range(data.index.size//self.step_num)]
69 | for win in windows:
70 | label = win.iloc[:,-1]
71 | if label[label!="normal."].size == 0:
72 | n_list += [i for i in win.index]
73 | else:
74 | a_list += [i for i in win.index]
75 |
76 | normal = data.iloc[np.array(n_list),:-1]
77 | anomaly = data.iloc[np.array(a_list),:-1]
78 | print("Info: Initialization set contains %d normal windows and %d abnormal windows."%(normal.shape[0],anomaly.shape[0]))
79 |
80 | a_labels = data.iloc[np.array(a_list),-1]
81 |
82 | # split into subsets
83 | tmp = normal.index.size//self.step_num//10
84 | assert tmp > 0 ,"Too small normal set %d rows"%normal.index.size
85 | sn = normal.iloc[:tmp*5*self.step_num,:]
86 | vn1 = normal.iloc[tmp*5*self.step_num:tmp*8*self.step_num,:]
87 | vn2 = normal.iloc[tmp*8*self.step_num:tmp*9*self.step_num,:]
88 | tn = normal.iloc[tmp*9*self.step_num:,:]
89 |
90 | tmp_a = anomaly.index.size//self.step_num//2
91 | va = anomaly.iloc[:tmp_a*self.step_num,:] if tmp_a !=0 else anomaly
92 | ta = anomaly.iloc[tmp_a*self.step_num:,:] if tmp_a !=0 else anomaly
93 | a_labels = a_labels[:va.index.size]
94 |
95 | print("Local preprocessing finished.")
96 | print("Subsets contain windows: sn:%d,vn1:%d,vn2:%d,tn:%d,va:%d,ta:%d\n"%(sn.shape[0]/self.step_num,vn1.shape[0]/self.step_num,vn2.shape[0]/self.step_num,tn.shape[0]/self.step_num,va.shape[0]/self.step_num,ta.shape[0]/self.step_num))
97 |
98 | f = open(log_path,'a')
99 |
100 | f.write("Info: Initialization set contains %d normal windows and %d abnormal windows.\n"%(normal.shape[0],anomaly.shape[0]))
101 | f.write("Subsets contain windows: sn:%d,vn1:%d,vn2:%d,tn:%d,va:%d,ta:%d\n"%(sn.shape[0]/self.step_num,vn1.shape[0]/self.step_num,vn2.shape[0]/self.step_num,tn.shape[0]/self.step_num,va.shape[0]/self.step_num,ta.shape[0]/self.step_num))
102 | f.close()
103 |
104 | return sn,vn1,vn2,tn,va,ta,a_labels
--------------------------------------------------------------------------------
/src/Initialization/encdecad.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Fri Aug 17 14:07:54 2018
4 |
5 | @author: Bin
6 | """
7 |
8 | import tensorflow as tf
9 | import math
10 |
11 | class EncDecAD(object):
12 |
13 | def __init__(self, hidden_num, inputs, is_training, optimizer=None, reverse=True, decode_without_input=False):
14 |
15 | self.batch_num = inputs[0].get_shape().as_list()[0]
16 | self.elem_num = inputs[0].get_shape().as_list()[1]
17 |
18 | self._enc_cell = tf.nn.rnn_cell.LSTMCell(hidden_num, use_peepholes=True)
19 | self._dec_cell = tf.nn.rnn_cell.LSTMCell(hidden_num, use_peepholes=True)
20 | if is_training == True:
21 | self._enc_cell = tf.nn.rnn_cell.DropoutWrapper(self._enc_cell, input_keep_prob=0.8, output_keep_prob=0.8)
22 | self._dec_cell = tf.nn.rnn_cell.DropoutWrapper(self._dec_cell, input_keep_prob=0.8, output_keep_prob=0.8)
23 |
24 | self.is_training = is_training
25 |
26 | self.input_ = tf.transpose(tf.stack(inputs), [1, 0, 2],name="input_")
27 |
28 | with tf.variable_scope('encoder',reuse = tf.AUTO_REUSE):
29 | (self.z_codes, self.enc_state) = tf.contrib.rnn.static_rnn(self._enc_cell, inputs, dtype=tf.float32)
30 |
31 | with tf.variable_scope('decoder',reuse =tf.AUTO_REUSE) as vs:
32 |
33 | dec_weight_ = tf.Variable(tf.truncated_normal([hidden_num,self.elem_num], dtype=tf.float32))
34 |
35 | dec_bias_ = tf.Variable(tf.constant(0.1,shape=[self.elem_num],dtype=tf.float32))
36 |
37 | dec_state = self.enc_state
38 | dec_input_ = tf.ones(tf.shape(inputs[0]),dtype=tf.float32)
39 | dec_outputs = []
40 |
41 | for step in range(len(inputs)):
42 | if step > 0:
43 | vs.reuse_variables()
44 | (dec_input_, dec_state) =self._dec_cell(dec_input_, dec_state)
45 | dec_input_ = tf.matmul(dec_input_, dec_weight_) + dec_bias_
46 | dec_outputs.append(dec_input_)
47 |                 # feed the real input (in reverse order) as the next decoder input
48 | tmp = -(step+1)
49 | dec_input_ = inputs[tmp]
50 |
51 | if reverse:
52 | dec_outputs = dec_outputs[::-1]
53 |
54 | self.output_ = tf.transpose(tf.stack(dec_outputs), [1, 0, 2],name="output_")
55 | self.loss = tf.reduce_mean(tf.square(self.input_ - self.output_),name="loss")
56 |
57 | def check_is_train(is_training):
58 | def t_ (): return tf.train.AdamOptimizer().minimize(self.loss,name="train_")
59 |             def f_ (): return tf.train.AdamOptimizer(1/math.inf).minimize(self.loss)  # learning rate 1/inf = 0: no parameter update at inference time
60 | is_train = tf.cond(is_training, t_, f_)
61 | return is_train
62 | self.train = check_is_train(is_training)
--------------------------------------------------------------------------------
/src/Initialization/initTrain.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Fri Aug 17 14:09:13 2018
4 |
5 | @author: Bin
6 | """
7 |
8 | import numpy as np
9 | import matplotlib.pyplot as plt
10 | import time
11 | import tensorflow as tf
12 | from conf_init import Configuration
13 | from encdecad import EncDecAD
14 | from parameterHelper import Parameter_Helper
15 |
16 | class Initialization_Train(object):
17 |
18 | def __init__(self, dataset,dataPath, modelSavePath, training_data_source='file'):
19 | start_time = time.time()
20 | conf = Configuration(dataset, dataPath, modelSavePath, training_data_source=training_data_source)
21 |
22 |
23 | batch_num = conf.batch_num
24 | hidden_num = conf.hidden_num
25 | step_num = conf.step_num
26 | elem_num = conf.elem_num
27 |
28 | iteration = conf.iteration
29 | modelpath_root = conf.modelpath_root
30 | modelpath = conf.modelpath_p
31 | decode_without_input = conf.decode_without_input
32 |
33 | patience = 20
34 | patience_cnt = 0
35 | min_delta = 0.0001
36 |
37 |
38 | #************#
39 | # Training
40 | #************#
41 |
42 | p_input = tf.placeholder(tf.float32, shape=(batch_num, step_num, elem_num),name = "p_input")
43 | p_inputs = [tf.squeeze(t, [1]) for t in tf.split(p_input, step_num, 1)]
44 |
45 | p_is_training = tf.placeholder(tf.bool,name= "is_training_")
46 |
47 | ae = EncDecAD(hidden_num, p_inputs, p_is_training , decode_without_input=False)
48 |
49 | graph = tf.get_default_graph()
50 | gvars = graph.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
51 | assign_ops = [graph.get_operation_by_name(v.op.name + "/Assign") for v in gvars]
52 | init_values = [assign_op.inputs[1] for assign_op in assign_ops]
53 |
54 |
55 | print("Training start.")
56 | with tf.Session() as sess:
57 | saver = tf.train.Saver()
58 |
59 |
60 | sess.run(tf.global_variables_initializer())
61 | input_= tf.transpose(tf.stack(p_inputs), [1, 0, 2])
62 | output_ = graph.get_tensor_by_name("decoder/output_:0")
63 |
64 | loss = []
65 | val_loss = []
66 | sn_list_length = len(conf.sn_list)
67 | tn_list_length = len(conf.tn_list)
68 |
69 | for i in range(iteration):
70 | #training set
71 | snlist = conf.sn_list[:]
72 | tmp_loss = 0
73 | for t in range(sn_list_length//batch_num):
74 | data =[]
75 | for _ in range(batch_num):
76 | data.append(snlist.pop())
77 | data = np.array(data)
78 | (loss_val, _) = sess.run([ae.loss, ae.train], {p_input: data,p_is_training : True})
79 | tmp_loss += loss_val
80 | l = tmp_loss/(sn_list_length//batch_num)
81 | loss.append(l)
82 |
83 | #validation set
84 | tnlist = conf.tn_list[:]
85 | tmp_loss_ = 0
86 | for t in range(tn_list_length//batch_num):
87 | testdata = []
88 | for _ in range(batch_num):
89 | testdata.append(tnlist.pop())
90 | testdata = np.array(testdata)
91 | (loss_val,ein,aus) = sess.run([ae.loss,input_,output_], {p_input: testdata,p_is_training :False})
92 | tmp_loss_ += loss_val
93 | tl = tmp_loss_/(tn_list_length//batch_num)
94 | val_loss.append(tl)
95 | print('Epoch %d: Loss:%.3f, Val_loss:%.3f' %(i, l,tl))
96 |
97 | if i == 5:
98 | break
99 | #Early stopping
100 | if i > 0 and val_loss[i] < np.array(val_loss[:i]).min():
101 | #save_path = saver.save(sess, conf.modelpath_p)
102 | gvars_state = sess.run(gvars)
103 |
104 | if i > 0 and val_loss[i-1] - val_loss[i] > min_delta:
105 | patience_cnt = 0
106 | else:
107 | patience_cnt += 1
108 |
109 | if i>0 and patience_cnt > patience:
110 | print("Early stopping at epoch %d\n"%i)
111 | feed_dict = {init_value: val for init_value, val in zip(init_values, gvars_state)}
112 | sess.run(assign_ops, feed_dict=feed_dict)
113 | #saver.restore(sess,tf.train.latest_checkpoint(modelpath_root))
114 | #graph = tf.get_default_graph()
115 | break
116 |
117 | plt.plot(loss,label="Train")
118 | plt.plot(val_loss,label="val_loss")
119 | plt.legend()
120 | plt.show()
121 |
122 |
123 | # mu & sigma & threshold
124 |
125 | para = Parameter_Helper(conf)
126 | mu, sigma = para.mu_and_sigma(sess,input_, output_,p_input, p_is_training)
127 | threshold = para.get_threshold(mu,sigma,sess,input_, output_,p_input, p_is_training)
128 |
129 | # test = EncDecAD_Test(conf)
130 | # test.test_encdecad(sess,input_,output_,p_input,p_is_training,mu,sigma,threshold,beta = 0.5)
131 |
132 | c_mu = tf.constant(mu,dtype=tf.float32,name = "mu")
133 | c_sigma = tf.constant(sigma,dtype=tf.float32,name = "sigma")
134 | c_threshold = tf.constant(threshold,dtype=tf.float32,name = "threshold")
135 | print("Saving model to disk...")
136 | save_path = saver.save(sess, conf.modelpath_p)
137 | print("Model saved accompany with parameters and threshold in file: %s" % save_path)
138 |
139 | print("--- Initialization time: %s seconds ---" % (time.time() - start_time))
140 |
141 | f = open(conf.log_path,'a')
142 | f.write("Early stopping at epoch %d\n"%i)
143 | # f.write("mu:\n",mu,"sigma:\n",sigma,"threshold:\n",threshold)
144 | f.write("Model saved accompany with parameters and threshold in file: %s" % save_path)
145 | f.write("--- Initialization time: %s seconds ---" % (time.time() - start_time))
146 | f.close()
--------------------------------------------------------------------------------
/src/Initialization/initialization.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Aug 16 16:44:42 2018
4 |
5 | @author: Bin
6 | """
7 |
8 | import argparse
9 | from initTrain import Initialization_Train
10 |
11 | def parseArguments():
12 | parser = argparse.ArgumentParser()
13 | # Positional mandatory arguments
14 | parser.add_argument("dataset", help="power/smtp/http/forest", type=str)
15 | parser.add_argument("dataPath", help="input data path", type=str)
16 | parser.add_argument("modelSavePath", help="folder to save the trained model", type=str)
17 |
18 | # Parse arguments
19 | args = parser.parse_args()
20 |
21 | return args
22 |
23 | if __name__=="__main__":
24 |
25 | args = parseArguments()
26 | dataset = args.__dict__['dataset']
27 | dataPath = args.__dict__['dataPath']
28 | modelSavePath= args.__dict__['modelSavePath']
29 |
30 | Initialization_Train(dataset,dataPath, modelSavePath)
31 |
32 |
33 |
--------------------------------------------------------------------------------
/src/Initialization/parameterHelper.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Fri Aug 17 14:08:21 2018
4 |
5 | @author: Bin
6 | """
7 |
8 | import numpy as np
9 | from scipy.spatial.distance import mahalanobis
10 |
11 | class Parameter_Helper(object):
12 |
13 | def __init__(self, conf):
14 | self.conf = conf
15 |
16 |
17 | def mu_and_sigma(self,sess,input_, output_,p_input, p_is_training):
18 |
19 | err_vec_list = []
20 |
21 | ind = list(np.random.permutation(len(self.conf.vn1_list)))
22 |
23 | while len(ind)>=self.conf.batch_num:
24 | data = []
25 | for _ in range(self.conf.batch_num):
26 | data += [self.conf.vn1_list[ind.pop()]]
27 | data = np.array(data,dtype=float)
28 | data = data.reshape((self.conf.batch_num,self.conf.step_num,self.conf.elem_num))
29 |
30 | (_input_, _output_) = sess.run([input_, output_], {p_input: data, p_is_training: False})
31 | abs_err = abs(_input_ - _output_)
32 | err_vec_list += [abs_err[i] for i in range(abs_err.shape[0])]
33 |
34 |
35 | # new metric
36 | err_vec_array = np.array(err_vec_list).reshape(-1,self.conf.elem_num)
37 |
38 | # for univariate data, anomaly score is squared euclidean distance
39 | # for multivariate data, anomaly score is squared mahalanobis distance
40 |
41 | mu = np.mean(err_vec_array.ravel()) if self.conf.elem_num == 1 else np.mean(err_vec_array,axis=0)
42 | sigma = np.var(err_vec_array.ravel()) if self.conf.elem_num == 1 else np.cov(err_vec_array.T)
43 |
44 | print("Got parameters mu and sigma.")
45 |
46 | return mu, sigma
47 |
48 |
49 |
50 | def get_threshold(self,mu,sigma,sess,input_, output_,p_input, p_is_training):
51 |
52 | normal_score = []
53 | for count in range(len(self.conf.vn2_list)//self.conf.batch_num):
54 | normal_sub = np.array(self.conf.vn2_list[count*self.conf.batch_num:(count+1)*self.conf.batch_num])
55 | (input_n, output_n) = sess.run([input_, output_], {p_input: normal_sub,p_is_training : False})
56 |
57 | err_n = abs(input_n-output_n).reshape(-1,self.conf.step_num,self.conf.elem_num)
58 | for window in range(self.conf.batch_num):
59 | for t in range(self.conf.step_num):
60 | s = mahalanobis(err_n[window,t],mu,sigma)
61 | normal_score.append(s)
62 |
63 |
64 | abnormal_score = []
65 | '''
66 |         If there is enough anomaly data, calculate anomaly scores and pick the
67 |         threshold that achieves the best F-beta score as the decision boundary;
68 |         otherwise estimate the threshold from the normal scores alone.
69 | '''
70 | print(len(self.conf.va_list))
71 |
72 | if len(self.conf.va_list) < self.conf.batch_num: # not enough anomaly data for a single batch
73 | threshold = max(normal_score) * 2
74 | print("Not enough large va set.")
75 |
76 | else:
77 |
78 | for count in range(len(self.conf.va_list)//self.conf.batch_num):
79 | abnormal_sub = np.array(self.conf.va_list[count*self.conf.batch_num:(count+1)*self.conf.batch_num])
80 | va_lable_list = np.array(self.conf.va_label_list[count*self.conf.batch_num:(count+1)*self.conf.batch_num])
81 | va_lable_list = va_lable_list.reshape(self.conf.batch_num,self.conf.step_num)
82 |
83 | (input_a, output_a) = sess.run([input_, output_], {p_input: abnormal_sub,p_is_training : False})
84 | err_a = abs(input_a-output_a).reshape(-1,self.conf.step_num,self.conf.elem_num)
85 | for window in range(self.conf.batch_num):
86 | for t in range(self.conf.step_num):
87 | # temp = np.dot((err_a[window,t,:] - mu[t,:] ) , sigma[t])
88 | # s = np.dot(temp,(err_a[window,t,:] - mu[t,:] ).T)
89 | s = mahalanobis(err_a[window,t],mu,sigma)
90 | if va_lable_list[window,t] == "normal":
91 | normal_score.append(s)
92 | else:
93 | abnormal_score.append(s)
94 |
95 | upper = np.median(np.array(abnormal_score))
96 | lower = np.median(np.array(normal_score))
97 | scala = 20 # divide the range(min,max) into 20 parts, find the optimal threshold
98 | delta = (upper-lower) / scala
99 | candidate = lower
100 | threshold = 0
101 | result = 0
102 |
103 | def evaluate(threshold,normal_score,abnormal_score):
104 |
105 | beta = 0.5
106 | tp = np.array(abnormal_score)[np.array(abnormal_score)>threshold].size
107 | fp = len(abnormal_score)-tp
108 | fn = np.array(normal_score)[np.array(normal_score)>threshold].size
109 | tn = len(normal_score)- fn
110 |
111 | if tp == 0: return 0
112 |
113 | P = tp/(tp+fp)
114 | R = tp/(tp+fn)
115 | fbeta= (1+beta*beta)*P*R/(beta*beta*P+R)
116 | return fbeta
117 |
118 | for _ in range(scala):
119 | r = evaluate(candidate,normal_score,abnormal_score)
120 | if r > result:
121 | result = r
122 | threshold = candidate
123 | candidate += delta
124 |
125 | print("Threshold: ",threshold)
126 |
127 | return threshold
128 |
--------------------------------------------------------------------------------
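The comments in `parameterHelper.py` describe the anomaly score as a squared Euclidean distance of the reconstruction error for univariate data and a squared Mahalanobis distance for multivariate data. A self-contained sketch of that formula with made-up values (note that `scipy.spatial.distance.mahalanobis` returns the non-squared distance and expects the inverse covariance matrix as its third argument):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Illustrative values; in the project, mu and sigma are estimated from validation errors.
mu = np.array([0.10, 0.20, 0.15])
sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.05, 0.01],
                  [0.00, 0.01, 0.03]])
e = np.array([0.50, 0.10, 0.40])   # reconstruction-error vector of one time step

diff = e - mu
squared_score = diff @ np.linalg.inv(sigma) @ diff        # squared Mahalanobis distance

# scipy's mahalanobis takes the inverse covariance (VI) and returns the distance itself.
assert np.isclose(mahalanobis(e, mu, np.linalg.inv(sigma)) ** 2, squared_score)
print(squared_score)
```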
/src/OnlinePrediction/OnlinePrediction.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Aug 21 15:50:07 2018
4 |
5 | @author: Bin
6 | """
7 | import argparse
8 |
9 | import pandas as pd
10 | import numpy as np
11 | import tensorflow as tf
12 | import time
13 | from sklearn import metrics
14 |
15 | from conf_online import Conf
16 | from ProcessingHelper import processingHelper
17 |
18 |
19 | def parseArguments():
20 | parser = argparse.ArgumentParser()
21 | # Positional mandatory arguments
22 | parser.add_argument("datasetname", help="power/smtp/http/forest", type=str)
23 | parser.add_argument("dataPath", help="input data path", type=str)
24 | parser.add_argument("modelPath", help="folder of trained model", type=str)
25 |
26 | # Parse arguments
27 | args = parser.parse_args()
28 |
29 | return args
30 |
31 | if __name__ == "__main__":
32 |
33 |     args = parseArguments()
34 |     datasetname = args.__dict__['datasetname']
35 |     datapath = args.__dict__['dataPath']
36 |     modelpath_root = args.__dict__['modelPath']
37 |     # Local debugging values (kept for reference): datasetname = "smtp"
38 |     # datapath = "C:/Users/Bin/Desktop/Thesis/codeBearbeitung/originalData/SMTP.csv"
39 |     # modelpath_root = "C:/Users/Bin/Desktop/Thesis/codeBearbeitung/ModelS/"
40 |
41 | # get dataset parameters
42 | conf = Conf(datasetname)
43 | batch_num = conf.batch_num
44 | hidden_num = conf.hidden_num
45 | step_num = conf.step_num
46 | elem_num = conf.elem_num
47 | init_wins = conf.training_set_size
48 |
49 |     # load dataset and divide it into batches and windows
50 | names = [str(x) for x in range(elem_num)] +["label"]
51 | df = pd.read_csv(datapath,header=None,names=names,skiprows=init_wins)
52 |
53 | batches = df.shape[0]//step_num//batch_num
54 |
55 | test_set = df.iloc[:batches*batch_num*step_num,:-1]
56 | labels =df.iloc[:batches*batch_num*step_num,-1]
57 | ts = test_set.as_matrix().reshape(batches,batch_num,step_num,elem_num)
58 | test_set_list = [ts[a] for a in range(batches)]
59 |
60 | wins = batches * batch_num
61 | # figure out anomaly windows
62 | buffer = [labels[i*step_num:(i+1)*step_num] for i in range(0,labels.size//step_num)]
63 | anomaly_index = []
64 | count = 0
65 |
66 | for buf in buffer:
67 | if "anomaly" in buf.tolist():
68 | anomaly_index.append(count)
69 | else:
70 | pass
71 | count +=1
72 |
73 | expert = ["normal"]*wins
74 | for x in anomaly_index:
75 | expert[x] = "anomaly"
76 |
77 |
78 | # load model
79 |
80 | modelmeta_p = modelpath_root + "_"+str(batch_num)+"_"+str(hidden_num)+"_"+str(step_num)+"_para.ckpt.meta"
81 |
82 | sess = tf.Session()
83 |     saver = tf.train.import_meta_graph(modelmeta_p) # load the trained graph, but without the trained parameters
84 | saver.restore(sess,tf.train.latest_checkpoint(modelpath_root))
85 | graph = tf.get_default_graph()
86 |
87 | p_input = graph.get_tensor_by_name("p_input:0")
88 | p_inputs = [tf.squeeze(t, [1]) for t in tf.split(p_input, step_num, 1)]
89 | p_is_training = graph.get_tensor_by_name("is_training_:0")
90 |
91 | input_= tf.transpose(tf.stack(p_inputs), [1, 0, 2])
92 | output_ = graph.get_tensor_by_name("decoder/output_:0")
93 |
94 | tensor_mu = graph.get_tensor_by_name("mu:0")
95 | tensor_sigma = graph.get_tensor_by_name("sigma:0")
96 | tensor_threshold = graph.get_tensor_by_name("threshold:0")
97 |
98 | loss_ = graph.get_tensor_by_name("decoder/loss:0")
99 | train_ = graph.get_operation_by_name("cond/train_")
100 |
101 | mu = sess.run(tensor_mu)
102 | sigma = sess.run(tensor_sigma)
103 | threshold = sess.run(tensor_threshold)
104 |
105 |
106 | # online phase
107 |
108 | count = 0
109 | n_buf = []
110 | a_buf = []
111 |
112 | y = []
113 | output=[]
114 | err_nbuf = []
115 | err_abuf = []
116 | all_scores = []
117 |
118 | start_time = time.time()
119 | helper = processingHelper
120 |
121 |
122 | for l in labels:
123 | if l == "normal":
124 | y +=[0]
125 | else:
126 | y +=[1]
127 | for ids in range(len(test_set_list)):
128 |
129 | data = test_set_list[ids]
130 | if count % 100 == 0:
131 | print(count,"batches processed.")
132 | prediction = []
133 | df = helper.local_preprocessing(data)
134 | (input_n, output_n) = sess.run([input_, output_], {p_input: df, p_is_training: False})
135 | err = abs(input_n-output_n).reshape(-1,elem_num)
136 | scores = helper.scoring(err,mu,sigma)
137 | all_scores.append(scores)
138 | output += [ss.max() for ss in np.array(scores).reshape(batch_num,step_num)]
139 | pred = [scores[b*step_num:(b+1)*step_num] for b in range(batch_num)]
140 | label = [expert[count*batch_num+b] for b in range(batch_num)]
141 | e = err
142 |
143 | for i in range(pd.DataFrame(pred).shape[0]):#loop batch_num
144 | index = i
145 | value=pd.DataFrame(pred).iloc[i,:]
146 |
147 | if value[value>threshold].size>=conf.HardCriterion:
148 | if label[index] == "anomaly":
149 | a_buf += [df[index,x,:] for x in range(step_num)]
150 | err_abuf = np.concatenate((err_abuf , e[index*step_num:(index+1)*step_num]),axis=0) if len(err_abuf) != 0 else e[index*step_num:(index+1)*step_num]
151 | else:
152 | err_nbuf =np.concatenate((err_nbuf , e[index*step_num:(index+1)*step_num]),axis=0) if len(err_nbuf) != 0 else e[index*step_num:(index+1)*step_num]
153 | n_buf += [df[index,x,:] for x in range(step_num)]
154 | else:
155 | if label[index] == "anomaly":
156 | a_buf += [df[index,x,:] for x in range(step_num)]
157 | err_abuf = np.concatenate((err_abuf , e[index*step_num:(index+1)*step_num]),axis=0) if len(err_abuf) != 0 else e[index*step_num:(index+1)*step_num]
158 | else:
159 | err_nbuf = np.concatenate((err_nbuf , e[index*step_num:(index+1)*step_num]),axis=0) if len(err_nbuf) != 0 else e[index*step_num:(index+1)*step_num]
160 | count +=1
161 |
162 | #Check update
163 | if len(n_buf)>=batch_num*step_num*conf.buffersize and len(a_buf) !=0:
164 | while (len(a_buf) < batch_num*step_num):
165 | a_buf += a_buf
166 |
167 | B = len(n_buf) //(batch_num*step_num)
168 | n_buf = n_buf[:batch_num*step_num*B]
169 | A = len(a_buf)//(batch_num*step_num)
170 | a_buf = a_buf[:batch_num*step_num*A]
171 |
172 | print("retrain at %d batch"%count)
173 | loss_list_all=[]
174 |
175 | datalist = np.array(n_buf[:batch_num*step_num*(B-1)]).reshape(-1,batch_num,step_num,elem_num)
176 | validation_list_n = np.array(n_buf[batch_num*step_num*(B-1):]).reshape(-1,batch_num,step_num,elem_num)
177 | validation_list_a = np.array(a_buf).reshape(-1,batch_num,step_num,elem_num)
178 |
179 | patience = 10
180 | min_delta = 0.005
181 | lastLoss = np.float('inf')
182 | patience_cnt = 0
183 |
184 | for i in range(300):
185 | loss_list=[]
186 | for data in datalist:
187 | (loss, _) = sess.run([loss_, train_], {p_input: data,p_is_training : True})
188 | loss_list.append(loss)
189 | loss_list_all.append( np.array(loss_list).mean())
190 |
191 | if i > 0 and lastLoss - loss >min_delta:
192 | patience_cnt =0
193 | else:
194 | patience_cnt += 1
195 | lastLoss = loss
196 | if patience_cnt > patience:
197 | print("Early stopping...")
198 | break
199 |
200 |
201 | err_nbuf_tmp = np.array(err_nbuf).reshape(-1,elem_num)
202 | mu,sigma = helper.get_musigma(err_nbuf_tmp,mu,sigma)
203 |
204 | print("Parameters updated!")
205 | pd.Series(loss_list_all).plot(title="Loss")
206 | n_buf = []
207 | a_buf = []
208 | err_buf = []
209 | err_nbuf = []
210 | err_abuf = []
211 |
212 | fpr, tpr, thresholds = metrics.roc_curve(expert, output, pos_label="anomaly")
213 | auc = metrics.auc(fpr, tpr)
214 | helper.plot_roc(fpr,tpr,auc)
215 | print("--- Used time: %s seconds ---" % (time.time() - start_time))
--------------------------------------------------------------------------------
/src/OnlinePrediction/ProcessingHelper.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Aug 21 16:07:31 2018
4 |
5 | @author: Bin
6 | """
7 |
8 | import pandas as pd
9 | from sklearn.preprocessing import MinMaxScaler
10 | import matplotlib.pyplot as plt
11 | import numpy as np
12 | from scipy.spatial.distance import mahalanobis,euclidean
13 |
14 |
15 | class processingHelper(object):
16 |
17 | def local_preprocessing(batchdata):
18 | # input batchdata with shape : [batch_num, step_num, elem_num]
19 | # minmax scaler on window level
20 | df = pd.DataFrame()
21 |
22 | for window in batchdata:
23 |
24 | scaler = MinMaxScaler()
25 | scaler.fit(window)
26 | new_win = scaler.transform(window)
27 | df = pd.concat((df, pd.DataFrame(new_win)),axis=0) if df.size!=0 else pd.DataFrame(new_win)
28 | return df.as_matrix().reshape(batchdata.shape)
29 |
30 | def scoring(err,mu,sigma):
31 |
32 | scores = []
33 | for e in err:
34 | scores.append(mahalanobis(e,mu,sigma))
35 |
36 | return scores
37 |
38 | def get_musigma(err_nbuf,mu,sigma):
39 |
40 | err_vec_array = np.array(err_nbuf)
41 | # for multivariate data, cov, for univariate data, var
42 | mu = np.mean(err_vec_array.ravel()) if err_nbuf.shape[1] == 1 else np.mean(err_vec_array,axis=0)
43 | sigma = np.var(err_vec_array.ravel()) if err_nbuf.shape[1] == 1 else np.cov(err_vec_array.T)
44 |
45 | return mu, sigma
46 |
47 | def get_threshold(normal_score, abnormal_score):
48 | upper = np.median(np.array(abnormal_score))
49 | lower = np.median(np.array(normal_score))
50 | scala = 20
51 | delta = (upper-lower) / scala
52 | candidate = lower
53 | threshold = 0
54 | result = 0
55 |
56 | def evaluate(threshold,normal_score,abnormal_score):
57 |
58 | beta = 0.5
59 | tp = np.array(abnormal_score)[np.array(abnormal_score)>threshold].size
60 | fp = len(abnormal_score)-tp
61 | fn = np.array(normal_score)[np.array(normal_score)>threshold].size
62 | tn = len(normal_score)- fn
63 |
64 | if tp == 0: return 0
65 |
66 | P = tp/(tp+fp)
67 | R = tp/(tp+fn)
68 | fbeta= (1+beta*beta)*P*R/(beta*beta*P+R)
69 | return fbeta
70 |
71 | for _ in range(scala):
72 | r = evaluate(candidate,normal_score,abnormal_score)
73 | if r > result:
74 | result = r
75 | threshold = candidate
76 | candidate += delta
77 | return threshold
78 |
79 | def plot_roc(fpr,tpr,auc):
80 | plt.figure()
81 | lw = 2
82 | plt.plot(fpr, tpr, color='darkorange',
83 | lw=lw, label='ROC curve (area = %0.2f)' %auc)
84 | plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
85 | plt.xlim([0.0, 1.0])
86 | plt.ylim([0.0, 1.05])
87 | plt.xlabel('False Positive Rate')
88 | plt.ylabel('True Positive Rate')
89 | plt.title('Receiver operating characteristic example')
90 | plt.legend(loc="lower right")
91 | plt.show()
--------------------------------------------------------------------------------
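A small usage sketch for `processingHelper.local_preprocessing`, which MinMax-scales each window of a batch independently. It assumes the snippet is run from `src/OnlinePrediction/` with the package versions listed in the README (the helper still relies on the old `DataFrame.as_matrix` API), and the batch values are random, for illustration only:

```python
import numpy as np
from ProcessingHelper import processingHelper

# One batch of raw windows in the <batch size, time steps, data dimensions> layout.
batch = np.random.rand(8, 10, 34)

scaled = processingHelper.local_preprocessing(batch)   # MinMax scaling per window
print(scaled.shape)                 # (8, 10, 34)
print(scaled.min(), scaled.max())   # each window is scaled into [0, 1]
```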
/src/OnlinePrediction/__pycache__/ProcessingHelper.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/OnlinePrediction/__pycache__/ProcessingHelper.cpython-36.pyc
--------------------------------------------------------------------------------
/src/OnlinePrediction/__pycache__/conf_online.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/binli826/LSTM-Autoencoders/8b52c946bb96994ca54a7b43273c8d1d48fcb8cb/src/OnlinePrediction/__pycache__/conf_online.cpython-36.pyc
--------------------------------------------------------------------------------
/src/OnlinePrediction/conf_online.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Aug 21 15:53:12 2018
4 |
5 | @author: Bin
6 | """
7 |
8 | class Conf(object):
9 |
10 | def __init__(self, dataset):
11 |
12 |
13 | if dataset == "power":
14 |
15 | self.batch_num = 8
16 | self.hidden_num = 15
17 | self.step_num = 84
18 | self.elem_num = 1
19 | self.training_set_size = self.step_num*12
20 | self.HardCriterion = 5
21 | self.buffersize = 9# number of batches
22 | elif dataset == "smtp":
23 |
24 | self.batch_num = 8
25 | self.hidden_num = 15
26 | self.step_num = 10
27 | self.elem_num = 34
28 | self.training_set_size = self.step_num*6000
29 | self.HardCriterion = 5
30 | self.buffersize = 50# number of batches
31 | elif dataset == "http":
32 |
33 | self.batch_num = 8
34 | self.hidden_num = 35
35 | self.step_num = 30
36 | self.elem_num = 34
37 | self.training_set_size = self.step_num*30000
38 | self.HardCriterion = 5
39 | self.buffersize = 1000 # number of batches
40 |
41 | elif dataset == "smtphttp":
42 | self.batch_num = 8
43 | self.hidden_num = 15
44 | self.step_num = 10
45 | self.elem_num = 34
46 | self.training_set_size = self.step_num*2500
47 | self.HardCriterion = 5
48 | self.buffersize = 1500# number of batches
49 |
50 | elif dataset == "forest":
51 | self.batch_num = 8
52 | self.hidden_num = 25
53 | self.step_num = 10
54 | self.elem_num = 7
55 | self.training_set_size = self.step_num*10000
56 | self.HardCriterion = 4
57 | self.buffersize = 400 # number of batches
58 |
59 | else:
60 | print("Wrong dataset name input.")
61 |
62 |
--------------------------------------------------------------------------------