├── .gitignore ├── README.md ├── dataPreprocessing.py ├── dl_models ├── __init__.py ├── c_lstm.py ├── c_lstm_ae.py ├── cmc_lstm.py ├── cnn.py ├── imc_lstm.py ├── ms_cnn.py └── smc_lstm.py ├── optUtils ├── __init__.py ├── dataUtil.py ├── logUtil.py ├── metricsUtil.py ├── modelUtil.py ├── pytorchModel.py └── trainUtil.py ├── param.yaml ├── requirements.txt ├── train.py └── train_windows.py /.gitignore: -------------------------------------------------------------------------------- 1 | # python 2 | *.py[cod] 3 | venv 4 | .idea 5 | datasets/* -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # mc-lstm-time-series 2 | 本项目是论文《[Anomaly Detection Using Multiscale C-LSTM for Univariate Time-Series](https://www.hindawi.com/journals/scn/2023/6597623/)》的实验代码,实现了多种时间序列异常检测模型。
3 | This project contains the experimental code for the paper "*[Anomaly Detection Using Multiscale C-LSTM for Univariate Time-Series](https://www.hindawi.com/journals/scn/2023/6597623/)*" and implements a variety of time-series anomaly detection models. 4 | 5 | ## 目录 Table of Contents 6 | 7 | - [项目目录 Project Directory](#项目目录-project-directory) 8 | - [使用方法 Getting Started](#使用方法-getting-started) 9 | - [项目声明 Project Statement](#项目声明-project-statement) 10 | 11 |

## 项目目录 Project Directory

12 | 13 | ├─ datasets (数据集目录 Dataset directory)
14 |  ├─ Numenta Anomaly Benchmark (NAB数据集目录 NAB dataset directory)
15 |  ├─ Yahoo! Webscope S5 (雅虎数据集目录 Yahoo dataset directory)
16 | ├─ dl_models (模型目录 Model directory)
17 |  ├─ cnn.py (CNN模型 CNN model)
18 |  ├─ ms_cnn.py (多尺度CNN模型 Multi-scale CNN model)
19 |  ├─ c_lstm.py (C-LSTM模型 C-LSTM model)
20 |  ├─ c_lstm_ae.py (C-LSTM-AE模型 C-LSTM-AE model)
21 |  ├─ imc_lstm.py (IMC-LSTM模型 IMC-LSTM model)
22 |  ├─ cmc_lstm.py (CMC-LSTM模型 CMC-LSTM model)
23 |  ├─ smc_lstm.py (SMC-LSTM模型 SMC-LSTM model)
24 | ├─ dataPreprocessing.py (数据预处理 Data preprocessing)
25 | ├─ train.py (训练代码 Training code)
26 | ├─ train_windows.py (不同滑动窗口大小的训练代码 Training code for different sliding window sizes)
27 | ├─ requirements.txt (项目依赖 Project dependencies)
28 | 29 | > 以上列出了模型文件及主要的训练代码文件,其余未列出的文件均为项目基础文件,无需重点关注。
30 | > The model files and the main training scripts are listed above; the remaining files are basic project infrastructure and need no special attention.
31 | > 本项目使用的数据集是网上公开的数据集,并非私有。因此,为了维护数据集的版权,我们并未将数据集一并上传。数据集的原链接如下:
32 | > The datasets used in this project are publicly available online rather than private. To respect the datasets' copyright, we do not include them in this repository. The original links are as follows:
33 | > NAB: https://github.com/numenta/NAB
34 | > Yahoo: https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70 35 | 36 |
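> For reference, the loaders in dataPreprocessing.py read from the paths below, so the downloaded data should be laid out roughly as follows (a sketch based on the paths hard-coded in that file):

    datasets/
    ├─ Numenta Anomaly Benchmark/
    │   ├─ data/                          (NAB subsets, e.g. realTraffic/)
    │   └─ labels/combined_labels.json
    └─ Yahoo! Webscope S5/
        └─ A1Benchmark/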

## 使用方法 Getting Started

37 | 38 | 首先,拉取本项目到本地。
39 | First, clone the project to your local machine. 40 | 41 | $ git clone git@github.com:lyx199504/mc-lstm-time-series.git 42 | 43 | 接着,进入到项目中并安装本项目的依赖。但要注意,pytorch可能需要采取其他方式安装,安装完毕pytorch后可直接用如下代码安装其他依赖。
44 | Next, enter the project directory and install its dependencies. Note, however, that PyTorch may need to be installed separately (for example, with a build matching your CUDA version); once PyTorch is installed, the remaining dependencies can be installed as follows. 45 | 46 | $ cd mc-lstm-time-series/ 47 | $ pip install -r requirements.txt 48 | 49 | 然后,分别将NAB和雅虎数据集下载到项目的NAB数据集目录和雅虎数据集目录中。
50 | Then, download the NAB and Yahoo datasets to the project's NAB dataset directory and Yahoo dataset directory, respectively. 51 | 52 | 最后,执行train.py或train_windows.py即可训练模型。
53 | Finally, run train.py or train_windows.py to train the models. 54 | 55 |
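For example, to train all models on all configured datasets (run from the project root once the datasets are in place):

    $ python train.py

To run the sliding-window-size experiments instead:

    $ python train_windows.py

Note that both scripts set model.device = 'cuda', so a CUDA-capable GPU is expected by default.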

## 项目声明 Project Statement

56 | 57 | 本项目的作者及单位:
58 | The authors and affiliation of this project: 59 | 60 | 项目名称(Project Name):mc-lstm-time-series 61 | 项目作者(Author):Yixiang Lu, Yudan Cheng, Jianbin Mai, Hongliang Sun, Juli Yin, Guoxuan Zhong 62 | 作者单位(Affiliation):暨南大学网络空间安全学院(College of Cyber Security, Jinan University) 63 | 64 | 本实验代码基于param-opt训练工具,原项目作者及出处如下:
65 | The experimental code is based on the param-opt training tool. The author and source of the original project are as follows:
66 | **Author: Yixiang Lu**
67 | **Project: [param-opt](https://github.com/lyx199504/param-opt)** 68 | 69 | 若要引用本论文,可按照如下latex引用格式:
70 | If you want to cite this paper, you could use the following latex citation format: 71 | 72 | @article{lu2023anomaly, 73 | title={Anomaly Detection Using Multiscale C-LSTM for Univariate Time-Series}, 74 | author={Lu, Yi-Xiang and Jin, Xiao-Bo and Liu, Dong-Jie and Zhang, Xin-Chang and Geng, Guang-Gang and others}, 75 | journal={Security and Communication Networks}, 76 | volume={2023}, 77 | year={2023}, 78 | publisher={Hindawi} 79 | } 80 | 81 | -------------------------------------------------------------------------------- /dataPreprocessing.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/9/6 12:28 4 | # @Author : LYX-夜光 5 | from pathlib import Path 6 | 7 | import numpy as np 8 | import pandas as pd 9 | 10 | from optUtils import read_json 11 | 12 | # 标准化 13 | def standard(value): 14 | mean_value, std_value = value.mean(), value.std() 15 | return (value - mean_value) / std_value 16 | 17 | # 归一化 18 | def norm(value): 19 | max_value, min_value = value.max(), value.min() 20 | diff_value = max_value - min_value 21 | return (value - min_value) / diff_value 22 | 23 | def reset_feat(value): 24 | grow_value = [] 25 | value_median = np.median(value) 26 | for i in range(0, len(value), 2): 27 | if i+1 < len(value): 28 | abs_value1 = abs(value[i] - value_median) 29 | abs_value2 = abs(value[i+1] - value_median) 30 | grow_value.append(value[i] if abs_value1 > abs_value2 else value[i+1]) 31 | return grow_value 32 | 33 | 34 | # 获取NAB数据集,将异常数据定为正例 35 | def getNabDataset(dataset_name, seq_len=50, move_pace=1, pre_list=(standard,), deal_list=()): 36 | """ 37 | :param dataset_name: 数据集名称 38 | :param seq_len: 滑动窗口序列长度 39 | :param move_pace: 滑动窗口步长 40 | :param pre_list: 编码成滑动窗口之前的数据预处理 41 | :param deal_list: 编码成滑动窗口之后的数据处理 42 | :return: X, y 43 | """ 44 | X, y = [], [] 45 | r = [] 46 | labels = read_json("./datasets/Numenta Anomaly Benchmark/labels/combined_labels.json") 47 | # windows = read_json("./datasets/NAB/labels/combined_windows.json") 48 | path_list = list(Path("./datasets/Numenta Anomaly Benchmark/data/%s" % dataset_name).glob("*.csv")) 49 | for i, path in enumerate(path_list): 50 | raw_data = pd.read_csv(path, index_col='timestamp') 51 | raw_value = raw_data.value 52 | for pre in pre_list: 53 | raw_value = pre(raw_value) 54 | raw_data['label'] = 0 55 | # for window in windows[dataset_name + "/%s" % path.name]: 56 | # start, end = window[0].split('.')[0], window[1].split('.')[0] 57 | # raw_data.loc[start: end, 'label'] = 1 58 | for timestamp in labels[dataset_name + "/%s" % path.name]: 59 | raw_data.loc[timestamp, 'label'] = 1 60 | 61 | # 构造时间序列特征 62 | s_point, e_point = 0, len(raw_value) - seq_len + 1 63 | X_ = np.array([raw_value[ix: ix + seq_len] for ix in range(s_point, e_point, move_pace)]) 64 | for deal in deal_list: 65 | X_ = np.array([deal(value) for value in X_]) 66 | X.append(X_) 67 | 68 | # 构造时间序列标签及其异常位置 69 | raw_label = raw_data.label 70 | # y_ = np.array([np.where(raw_label[ix: ix + seq_len] == 1)[0][0] + 1 if sum(raw_label[ix: ix + seq_len]) > 0 else 0 for ix in range(s_point, e_point, move_pace)]) 71 | y_ = np.array([1 if sum(raw_label[ix: ix + seq_len]) > 0 else 0 for ix in range(s_point, e_point, move_pace)]) 72 | y.append(y_) 73 | 74 | r_ = np.array([1 if sum(raw_label[ix: ix + seq_len]) > 0 else 0 for ix in range(s_point, e_point, move_pace)]) 75 | r_ = r_ * 1000 + i 76 | r.append(r_) 77 | 78 | X, y, r = np.vstack(X).astype('float32'), np.hstack(y).astype('int32'), 
np.hstack(r).astype('int32') 79 | 80 | print("构造数据集完毕,数据集大小为:%s..." % str(X.shape)) 81 | return X, y, r 82 | 83 | 84 | # 获取雅虎数据集,将异常数据定为正例 85 | def getWebscopeS5Dataset(dataset_name, seq_len=50, move_pace=1, pre_list=(standard,), deal_list=()): 86 | X, y = [], [] 87 | r = [] 88 | path_list = list(Path("./datasets/Yahoo! Webscope S5/%s" % dataset_name).glob("*.csv")) 89 | for i, path in enumerate(path_list): 90 | raw_data = pd.read_csv(path) 91 | try: 92 | raw_value = raw_data.value.values 93 | for pre in pre_list: 94 | raw_value = pre(raw_value) 95 | except: 96 | continue 97 | # 构造时间序列特征 98 | s_point, e_point = 0, len(raw_value) - seq_len + 1 99 | X_ = np.array([raw_value[ix: ix + seq_len] for ix in range(s_point, e_point, move_pace)]) 100 | for deal in deal_list: 101 | X_ = np.array([deal(value) for value in X_]) 102 | X.append(X_) 103 | 104 | # 构造时间序列标签 105 | try: 106 | raw_label = raw_data.is_anomaly 107 | except: 108 | raw_label = raw_data.anomaly 109 | # y_ = np.array([np.where(raw_label[ix: ix + seq_len] == 1)[0][0] + 1 if sum(raw_label[ix: ix + seq_len]) > 0 else 0 for ix in range(s_point, e_point, move_pace)]) 110 | y_ = np.array([1 if sum(raw_label[ix: ix + seq_len]) > 0 else 0 for ix in range(s_point, e_point, move_pace)]) 111 | y.append(y_) 112 | 113 | r_ = np.array([1 if sum(raw_label[ix: ix + seq_len]) > 0 else 0 for ix in range(s_point, e_point, move_pace)]) 114 | r_ = r_ * 1000 + i 115 | r.append(r_) 116 | 117 | X, y, r = np.vstack(X).astype('float32'), np.hstack(y).astype('int32'), np.hstack(r).astype('int32') 118 | 119 | print("构造数据集完毕,数据集大小为:%s..." % str(X.shape)) 120 | return X, y, r 121 | 122 | # 根据数据集名称获取数据集 123 | def getDataset(dataset_name, seq_len=50, move_pace=1, pre_list=(standard,), deal_list=()): 124 | if dataset_name.startswith('A'): 125 | return getWebscopeS5Dataset(dataset_name, seq_len, move_pace, pre_list, deal_list) 126 | else: 127 | return getNabDataset(dataset_name, seq_len, move_pace, pre_list, deal_list) -------------------------------------------------------------------------------- /dl_models/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2022/4/19 15:33 4 | # @Author : LYX-夜光 -------------------------------------------------------------------------------- /dl_models/c_lstm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/9/6 16:02 4 | # @Author : LYX-夜光 5 | import joblib 6 | from torch import nn 7 | import numpy as np 8 | 9 | from optUtils import yaml_config 10 | from optUtils.logUtil import get_rank_param, logging_config 11 | from optUtils.pytorchModel import DeepLearningClassifier 12 | 13 | 14 | class C_LSTM(DeepLearningClassifier): 15 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu', seq_len=60): 16 | super(C_LSTM, self).__init__(learning_rate, epochs, batch_size, random_state, device) 17 | self.model_name = "c_lstm" 18 | self.label_num = 2 # 二分类 19 | self.seq_len = seq_len 20 | 21 | def create_model(self): 22 | self.conv1 = nn.Sequential( 23 | nn.Conv1d(1, 64, kernel_size=5, stride=1, padding=2), 24 | nn.Tanh(), 25 | nn.MaxPool1d(kernel_size=2, stride=2), 26 | ) 27 | self.conv2 = nn.Sequential( 28 | nn.Conv1d(64, 64, kernel_size=5, stride=1, padding=2), 29 | nn.Tanh(), 30 | nn.MaxPool1d(kernel_size=2, stride=2), 31 | ) 32 | self.lstm = nn.LSTM(input_size=64, hidden_size=64) 33 
| self.dnn = nn.Sequential( 34 | nn.Linear(in_features=64, out_features=32), 35 | nn.Tanh(), 36 | nn.Linear(in_features=32, out_features=self.label_num), 37 | ) 38 | 39 | def forward(self, X): 40 | X = X.unsqueeze(1) 41 | X = self.conv1(X) 42 | X = self.conv2(X) 43 | X = X.permute(2, 0, 1) 44 | _, (h, _) = self.lstm(X) 45 | y = self.dnn(h.squeeze(0)) 46 | return y 47 | 48 | # # 拟合步骤 49 | # def fit_step(self, X, y=None, train=True): 50 | # y_ = y.clone() 51 | # y_[y_ > 0] = 1 52 | # return super().fit_step(X, y_, train) 53 | 54 | # 测试数据,将验证集中前k个最高分数的模型用于测试,最终分数为k个测试分数的均值 55 | def test_score(self, X, y, k=1): 56 | model_param_list = get_rank_param( 57 | self.model_name, 58 | key_list=['best_score_', 'train_score', 'epoch'], 59 | reverse_list=[True, True, True], 60 | )[:k] 61 | log_dir = yaml_config['dir']['log_dir'] 62 | logger = logging_config(self.model_name, log_dir + '/%s.log' % self.model_name) 63 | logger.info({ 64 | "===================== Test scores of the top %d validation models =====================" % k 65 | }) 66 | test_score_lists = [] 67 | for model_param in model_param_list: 68 | mdl = joblib.load(model_param['model_path']) 69 | mdl.device = self.device 70 | test_score = mdl.score(X, y, batch_size=512) 71 | test_score_list = mdl.score_list(X, y, batch_size=512) 72 | test_score_lists.append([test_score] + test_score_list) 73 | test_score_dict = {self.metrics.__name__: test_score} 74 | for i, metrics in enumerate(self.metrics_list): 75 | test_score_dict.update({metrics.__name__: test_score_list[i]}) 76 | 77 | logger.info({ 78 | "select_epoch": model_param['epoch'], 79 | "test_score_dict": test_score_dict, 80 | }) 81 | logger.info({ 82 | "===================== Mean test score =====================" 83 | }) 84 | mean_score_list = np.mean(test_score_lists, axis=0) 85 | mean_score_dict = {self.metrics.__name__: mean_score_list[0]} 86 | for i, metrics in enumerate(self.metrics_list): 87 | mean_score_dict.update({metrics.__name__: mean_score_list[i+1]}) 88 | logger.info({ 89 | "mean_score_dict": mean_score_dict, 90 | }) 91 | 92 | -------------------------------------------------------------------------------- /dl_models/c_lstm_ae.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/11/25 11:49 4 | # @Author : LYX-夜光 5 | from torch import nn 6 | 7 | from dl_models.c_lstm import C_LSTM 8 | 9 | 10 | class C_LSTM_AE(C_LSTM): 11 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu', seq_len=60): 12 | super().__init__(learning_rate, epochs, batch_size, random_state, device, seq_len) 13 | self.model_name = "c_lstm_ae" 14 | 15 | def create_model(self): 16 | self.conv1 = nn.Sequential( 17 | nn.Conv2d(1, 64, kernel_size=(1, 5), stride=(1, 1), padding=(0, 2)), 18 | nn.ReLU(), 19 | nn.MaxPool2d(kernel_size=(1, 2), stride=(1, 2)), 20 | ) 21 | self.conv2 = nn.Sequential( 22 | nn.Conv2d(64, 32, kernel_size=(1, 5), stride=(1, 1), padding=(0, 2)), 23 | nn.ReLU(), 24 | nn.MaxPool2d(kernel_size=(1, 2), stride=(1, 2)), 25 | ) 26 | self.lstm_encoder = nn.LSTM(input_size=64, hidden_size=64) 27 | self.lstm_decoder = nn.LSTM(input_size=64, hidden_size=128) 28 | self.dnn = nn.Sequential( 29 | nn.Linear(in_features=128, out_features=510), 30 | nn.ReLU(), 31 | nn.Linear(in_features=510, out_features=self.label_num), 32 | ) 33 | 34 | def forward(self, X): 35 | X = X.unsqueeze(1) 36 | X = self.conv1(X) 37 | X = self.conv2(X) 38 | X = X.permute(2, 0, 1, 3).flatten(2) 
39 | H, _ = self.lstm_encoder(X) 40 | _, (h, _) = self.lstm_decoder(H) 41 | y = self.dnn(h.squeeze(0)) 42 | return y 43 | 44 | -------------------------------------------------------------------------------- /dl_models/cmc_lstm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2022/3/4 16:25 4 | # @Author : LYX-夜光 5 | 6 | import torch 7 | from torch import nn 8 | 9 | from dl_models.c_lstm import C_LSTM 10 | 11 | 12 | class CMC_LSTM(C_LSTM): 13 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu', seq_len=60): 14 | super().__init__(learning_rate, epochs, batch_size, random_state, device, seq_len) 15 | self.model_name = "cmc_lstm" 16 | 17 | def create_model(self): 18 | conv_type, conv_num = 4, 32 19 | self.cnn_list = nn.ModuleList([]) 20 | for i in range(conv_type): 21 | kernel_size = i*2 + 1 22 | padding = i 23 | pooling_size = 3 24 | self.cnn_list.append( 25 | nn.Sequential( 26 | nn.Conv1d(1, conv_num, kernel_size=kernel_size, stride=1, padding=padding), 27 | nn.Tanh(), 28 | nn.MaxPool1d(kernel_size=pooling_size, stride=pooling_size), 29 | ) 30 | ) 31 | self.lstm = nn.LSTM(input_size=conv_type*conv_num, hidden_size=conv_type*conv_num) 32 | self.dnn = nn.Sequential( 33 | nn.Dropout(p=0.2), 34 | nn.Linear(in_features=conv_type*conv_num, out_features=conv_num), 35 | nn.Tanh(), 36 | nn.Linear(in_features=conv_num, out_features=self.label_num), 37 | ) 38 | 39 | def forward(self, X): 40 | X = X.unsqueeze(1) 41 | Z = [] 42 | for cnn in self.cnn_list: 43 | Z.append(cnn(X).permute(2, 0, 1)) 44 | Z = torch.cat(Z, dim=-1) 45 | _, (h, _) = self.lstm(Z) 46 | y = self.dnn(h.squeeze(0)) 47 | return y 48 | -------------------------------------------------------------------------------- /dl_models/cnn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2022/2/11 20:03 4 | # @Author : LYX-夜光 5 | 6 | from torch import nn 7 | 8 | from dl_models.c_lstm import C_LSTM 9 | 10 | 11 | class CNN(C_LSTM): 12 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu', seq_len=60): 13 | super().__init__(learning_rate, epochs, batch_size, random_state, device, seq_len) 14 | self.model_name = "cnn" 15 | 16 | def create_model(self): 17 | self.conv = nn.Sequential( 18 | nn.Conv1d(1, 16, kernel_size=5, stride=1, padding=2), 19 | nn.Tanh(), 20 | nn.MaxPool1d(kernel_size=3, stride=3), 21 | ) 22 | self.dnn = nn.Sequential( 23 | nn.Linear(in_features=int(self.seq_len/3)*16, out_features=64), 24 | nn.Tanh(), 25 | nn.Linear(in_features=64, out_features=self.label_num), 26 | ) 27 | 28 | def forward(self, X): 29 | X = X.unsqueeze(1) 30 | X = self.conv(X) 31 | X = X.flatten(1) 32 | y = self.dnn(X) 33 | return y 34 | -------------------------------------------------------------------------------- /dl_models/imc_lstm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/12/3 23:23 4 | # @Author : LYX-夜光 5 | import torch 6 | from torch import nn 7 | 8 | from dl_models.c_lstm import C_LSTM 9 | 10 | 11 | class IMC_LSTM(C_LSTM): 12 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu', seq_len=60): 13 | super().__init__(learning_rate, epochs, batch_size, random_state, device, seq_len) 14 | self.model_name = 
"imc_lstm" 15 | 16 | def create_model(self): 17 | conv_type, conv_num = 4, 64 18 | self.cnn_list, self.lstm_list = nn.ModuleList([]), nn.ModuleList([]) 19 | for i in range(conv_type): 20 | kernel_size = i*2 + 1 21 | padding = i 22 | pooling_size = 3 23 | self.cnn_list.append( 24 | nn.Sequential( 25 | nn.Conv1d(1, conv_num, kernel_size=kernel_size, stride=1, padding=padding), 26 | nn.Tanh(), 27 | nn.MaxPool1d(kernel_size=pooling_size, stride=pooling_size), 28 | ) 29 | ) 30 | self.lstm_list.append( 31 | nn.LSTM(input_size=conv_num, hidden_size=conv_num), 32 | ) 33 | self.dnn = nn.Sequential( 34 | nn.Dropout(p=0.2), 35 | nn.Linear(in_features=conv_type*conv_num, out_features=conv_num), 36 | nn.Tanh(), 37 | nn.Linear(in_features=conv_num, out_features=self.label_num), 38 | ) 39 | 40 | def forward(self, X): 41 | X = X.unsqueeze(1) 42 | H = [] 43 | for cnn, lstm in zip(self.cnn_list, self.lstm_list): 44 | X_ = cnn(X).permute(2, 0, 1) 45 | _, (h, _) = lstm(X_) 46 | H.append(h.squeeze(0)) 47 | H = torch.cat(H, dim=1) 48 | y = self.dnn(H) 49 | return y 50 | -------------------------------------------------------------------------------- /dl_models/ms_cnn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2022/2/11 20:18 4 | # @Author : LYX-夜光 5 | import torch 6 | from torch import nn 7 | 8 | from dl_models.c_lstm import C_LSTM 9 | 10 | 11 | class MS_CNN(C_LSTM): 12 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu', seq_len=60): 13 | super().__init__(learning_rate, epochs, batch_size, random_state, device, seq_len) 14 | self.model_name = "ms_cnn" 15 | 16 | def create_model(self): 17 | c_lstm_num, c_lstm_hidden_size = 3, 16 18 | pooling_size = 3 19 | self.cnn_list = nn.ModuleList([]) 20 | for i in range(c_lstm_num): 21 | kernel_size = i * 2 + 1 22 | padding = i 23 | self.cnn_list.append( 24 | nn.Sequential( 25 | nn.Conv1d(1, c_lstm_hidden_size, kernel_size=kernel_size, stride=1, padding=padding), 26 | nn.Tanh(), 27 | nn.MaxPool1d(kernel_size=pooling_size, stride=pooling_size), 28 | ) 29 | ) 30 | self.dnn = nn.Sequential( 31 | nn.Linear(in_features=c_lstm_num*c_lstm_hidden_size*int(self.seq_len/pooling_size), out_features=128), 32 | nn.Tanh(), 33 | nn.Linear(in_features=128, out_features=self.label_num), 34 | ) 35 | 36 | def forward(self, X): 37 | X = X.unsqueeze(1) 38 | H = [] 39 | for cnn in self.cnn_list: 40 | X_ = cnn(X) 41 | H.append(X_.flatten(1)) 42 | H = torch.cat(H, dim=1) 43 | y = self.dnn(H) 44 | return y 45 | -------------------------------------------------------------------------------- /dl_models/smc_lstm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/12/5 18:03 4 | # @Author : LYX-夜光 5 | 6 | import torch 7 | from torch import nn 8 | 9 | from dl_models.c_lstm import C_LSTM 10 | 11 | 12 | class SMC_LSTM(C_LSTM): 13 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu', seq_len=60): 14 | super().__init__(learning_rate, epochs, batch_size, random_state, device, seq_len) 15 | self.model_name = "smc_lstm" 16 | 17 | def create_model(self): 18 | conv_type, conv_num = 3, 64 19 | self.cnn_list = nn.ModuleList([]) 20 | for i in range(conv_type): 21 | kernel_size = i*2 + 1 22 | padding = i 23 | pooling_size = 3 24 | self.cnn_list.append( 25 | nn.Sequential( 26 | nn.Conv1d(1, conv_num, 
kernel_size=kernel_size, stride=1, padding=padding), 27 | nn.Tanh(), 28 | nn.MaxPool1d(kernel_size=pooling_size, stride=pooling_size), 29 | ) 30 | ) 31 | self.lstm = nn.LSTM(input_size=conv_num, hidden_size=conv_num) 32 | self.dnn = nn.Sequential( 33 | nn.Dropout(p=0.2), 34 | nn.Linear(in_features=conv_type*conv_num, out_features=conv_num), 35 | nn.Tanh(), 36 | nn.Linear(in_features=conv_num, out_features=self.label_num), 37 | ) 38 | 39 | def forward(self, X): 40 | X = X.unsqueeze(1) 41 | H = [] 42 | for cnn in self.cnn_list: 43 | X_ = cnn(X).permute(2, 0, 1) 44 | _, (h, _) = self.lstm(X_) 45 | H.append(h.squeeze(0)) 46 | H = torch.cat(H, dim=1) 47 | y = self.dnn(H) 48 | return y 49 | -------------------------------------------------------------------------------- /optUtils/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/6/3 16:25 4 | # @Author : LYX-夜光 5 | 6 | import codecs 7 | import json 8 | import os 9 | import random 10 | import yaml 11 | import numpy as np 12 | 13 | import warnings 14 | warnings.filterwarnings("ignore") 15 | 16 | 17 | # 创建文件夹 18 | def make_dirs(path): 19 | if not os.path.isdir(path): 20 | os.makedirs(path) 21 | 22 | # 设置随机种子 23 | def set_seed(seed): 24 | random.seed(seed) 25 | np.random.seed(seed) 26 | 27 | # 写入json文件 28 | def write_json(jsonPath, value): 29 | with open(jsonPath, "w", encoding="utf-8") as f: 30 | f.write(json.dumps(value)) 31 | 32 | # 获取json文件 33 | def read_json(jsonPath): 34 | with open(jsonPath, "r", encoding="utf-8") as f: 35 | value = json.loads(f.read()) 36 | return value 37 | 38 | # 读取yaml配置文件 39 | def read_yaml(yamlFile): 40 | with codecs.open(yamlFile, 'r', encoding='utf-8') as f: 41 | config = yaml.load(f, Loader=yaml.FullLoader) 42 | return config 43 | 44 | yaml_config = read_yaml(os.path.join(os.path.dirname(__file__), '../param.yaml')) 45 | -------------------------------------------------------------------------------- /optUtils/dataUtil.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/6/4 18:09 4 | # @Author : LYX-夜光 5 | 6 | import numpy as np 7 | 8 | from optUtils import set_seed 9 | 10 | 11 | # 分层打乱下标 12 | def stratified_shuffle_index(y, n_splits, random_state=0): 13 | if random_state: 14 | set_seed(random_state) 15 | # 按标签取出样本,再打乱样本下标 16 | indexList_label = {} 17 | for label in set(y): 18 | indexList_label[label] = np.where(y == label)[0] 19 | np.random.shuffle(indexList_label[label]) 20 | # 将每个标签的样本分为n_splits份,再按顺序排列每份样本,即按照如下排列: 21 | # 第1个标签的第1份样本,第2个标签的第1份样本,...,第n个标签的第1份样本,第1个标签的第2份样本,... 
22 | # 每个标签的第1份样本排列后称为第1折,第2份样本排列后称为第2折,...,同时打乱每一折的样本 23 | indexList = [] 24 | for n in range(n_splits, 0, -1): 25 | indexList_n = [] 26 | for label in indexList_label: 27 | split_point = int(len(indexList_label[label]) / n + 0.5) 28 | indexList_n.append(indexList_label[label][:split_point]) 29 | indexList_label[label] = indexList_label[label][split_point:] 30 | indexList_n = np.hstack(indexList_n) 31 | np.random.shuffle(indexList_n) 32 | indexList.append(indexList_n) 33 | indexList = np.hstack(indexList) 34 | return indexList 35 | 36 | # 分层打乱样本 37 | def stratified_shuffle_samples(X, y, n_splits, random_state=0): 38 | indexList = stratified_shuffle_index(y, n_splits, random_state) 39 | X = [x[indexList] for x in X] if type(X) == list else X[indexList] 40 | y = y[indexList] 41 | return X, y 42 | -------------------------------------------------------------------------------- /optUtils/logUtil.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/6/2 14:21 4 | # @Author : LYX-夜光 5 | import logging 6 | import re 7 | 8 | from optUtils import yaml_config 9 | 10 | # 日志配置 11 | def logging_config(logName, fileName): 12 | logger = logging.getLogger(logName) 13 | 14 | if not logger.handlers: 15 | logger.setLevel('DEBUG') 16 | BASIC_FORMAT = "%(asctime)s:%(levelname)s:%(message)s" 17 | DATE_FORMAT = '%Y-%m-%d %H:%M:%S' 18 | formatter = logging.Formatter(BASIC_FORMAT, DATE_FORMAT) 19 | 20 | # 输出到控制台的handler 21 | chlr = logging.StreamHandler() 22 | chlr.setFormatter(formatter) 23 | logger.addHandler(chlr) 24 | 25 | # 输出到文件的handler 26 | fhlr = logging.FileHandler(fileName, encoding='utf-8') 27 | fhlr.setFormatter(formatter) 28 | fhlr.setLevel('INFO') 29 | logger.addHandler(fhlr) 30 | 31 | return logger 32 | 33 | # 读取日志列表 34 | def read_log(logFile): 35 | with open(logFile, 'r') as log: 36 | logList = log.readlines() 37 | return list(map(lambda x: eval(re.findall(r"(?<=INFO:).*$", x)[0]), logList)) 38 | 39 | # 获取日志多行 40 | def get_lines_from_log(model_name, lines=None): 41 | paramList = read_log(yaml_config['dir']['log_dir'] + "/" + model_name + ".log") 42 | if type(lines) == int: 43 | return paramList[lines] 44 | elif type(lines) == list or type(lines) == tuple: 45 | return paramList[lines[0]: lines[1]] 46 | else: 47 | return paramList 48 | 49 | # 按照模型关键字获取对应超参数 50 | def get_param_from_log(model_name, model_key): 51 | paramList = read_log(yaml_config['dir']['log_dir'] + "/" + model_name + ".log") 52 | paramList = list(filter(lambda x: model_key in x['model_path'], paramList)) 53 | return paramList[0] if paramList else None 54 | 55 | # 根据模型分数排序获取对应参数列表 56 | def get_rank_param(model_name, key_list=('best_score_',), reverse_list=(True,)): 57 | paramList = read_log(yaml_config['dir']['log_dir'] + "/%s.log" % model_name) 58 | for key in key_list: 59 | paramList = filter(lambda x: key in x, paramList) 60 | for key, reverse in zip(key_list[::-1], reverse_list[::-1]): 61 | paramList = sorted(paramList, key=lambda x: x[key], reverse=reverse) 62 | return paramList 63 | 64 | # 获取分数最高的参数 65 | def get_best_param(model_name): 66 | paramList = read_log(yaml_config['dir']['log_dir'] + "/%s.log" % model_name) 67 | param = max(paramList, key=lambda x: x['best_score_']) 68 | return param -------------------------------------------------------------------------------- /optUtils/metricsUtil.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 
-*- 3 | # @Time : 2021/10/2 12:04 4 | # @Author : LYX-夜光 5 | from sklearn.metrics import f1_score, average_precision_score 6 | 7 | 8 | def f1_micro_score(y_true, y_pred): 9 | return f1_score(y_true, y_pred, average='micro') 10 | 11 | def f1_macro_score(y_true, y_pred): 12 | return f1_score(y_true, y_pred, average='macro') 13 | 14 | def f1_weighted_score(y_true, y_pred): 15 | return f1_score(y_true, y_pred, average='weighted') 16 | 17 | def pr_auc_score(y_true, y_prob): 18 | return average_precision_score(y_true, y_prob) 19 | -------------------------------------------------------------------------------- /optUtils/modelUtil.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/6/2 15:39 4 | # @Author : LYX-夜光 5 | 6 | from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, VotingClassifier 7 | from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor 8 | 9 | from sklearn.svm import SVC, SVR 10 | from sklearn.linear_model import LogisticRegression, LinearRegression 11 | from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor 12 | 13 | from optUtils.pytorchModel import DeepLearningClassifier, DeepLearningRegressor, AutoEncoder, SupervisedAutoEncoder, \ 14 | VariationalAutoEncoder, SupervisedVariationalAutoEncoder 15 | 16 | __model_dict = { 17 | 'knn_clf': KNeighborsClassifier, 18 | 'knn_reg': KNeighborsRegressor, 19 | 'svm_clf': SVC, 20 | 'svm_reg': SVR, 21 | 'lr_clf': LogisticRegression, 22 | 'lr_reg': LinearRegression, 23 | 'dt_clf': DecisionTreeClassifier, 24 | 'dt_reg': DecisionTreeRegressor, 25 | 'rf_clf': RandomForestClassifier, 26 | 'rf_reg': RandomForestRegressor, 27 | 'voting': VotingClassifier, 28 | 'dl_clf': DeepLearningClassifier, 29 | 'dl_reg': DeepLearningRegressor, 30 | 'ae': AutoEncoder, 31 | 'sae': SupervisedAutoEncoder, 32 | 'vae': VariationalAutoEncoder, 33 | 'svae': SupervisedVariationalAutoEncoder, 34 | } 35 | 36 | # 选择模型 37 | def model_selection(model_name, **params): 38 | model = __model_dict[model_name](**params) 39 | return model 40 | 41 | # 注册模型 42 | def model_registration(**model): 43 | if type(model) != dict: 44 | print("请输入dict类型...") 45 | __model_dict.update(model) 46 | -------------------------------------------------------------------------------- /optUtils/pytorchModel.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/6/4 19:08 4 | # @Author : LYX-夜光 5 | import sys 6 | import time 7 | 8 | import joblib 9 | import numpy as np 10 | import torch 11 | from torch import nn 12 | import torch.nn.functional as F 13 | 14 | from sklearn.base import BaseEstimator 15 | from sklearn.metrics import accuracy_score, mean_squared_error 16 | from tqdm import tqdm 17 | 18 | from optUtils import set_seed, make_dirs, yaml_config 19 | from optUtils.logUtil import logging_config 20 | 21 | # pytorch随机种子 22 | def pytorch_set_seed(seed): 23 | if seed: 24 | set_seed(seed) 25 | torch.manual_seed(seed) # cpu 26 | torch.cuda.manual_seed(seed) # gpu 27 | torch.cuda.manual_seed_all(seed) # 并行gpu 28 | torch.backends.cudnn.deterministic = True # cpu/gpu结果一致 29 | 30 | # pytorch深度学习模型 31 | class PytorchModel(nn.Module, BaseEstimator): 32 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu'): 33 | super().__init__() 34 | self.model_name = "base_dl" 35 | self.param_search = True # 默认开启搜索参数功能 36 | 
self.save_model = False # 常规训练中,默认关闭保存模型功能 37 | self.only_save_last_epoch = False # 常规训练只保存最后一个epoch 38 | 39 | self.learning_rate = learning_rate 40 | self.epochs = epochs 41 | self.batch_size = batch_size 42 | self.random_state = random_state 43 | self.device = device 44 | 45 | # 优化器、评价指标 46 | self.optim = torch.optim.Adam 47 | self.metrics = None 48 | self.metrics_list = [] # 多个评价指标 49 | 50 | # numpy or list => tensor 51 | def to_tensor(self, data): 52 | if type(data) == torch.Tensor or type(data) == list and type(data[0]) == torch.Tensor: 53 | return data 54 | if type(data) == list: 55 | tensor_data = [] 56 | for sub_data in data: 57 | dataType = torch.float32 if 'float' in str(sub_data.dtype) else torch.int64 58 | tensor_data.append(torch.tensor(sub_data, dtype=dataType)) 59 | else: 60 | dataType = torch.float32 if 'float' in str(data.dtype) else torch.int64 61 | tensor_data = torch.tensor(data, dtype=dataType) 62 | return tensor_data 63 | 64 | # 训练 65 | def fit(self, X, y, X_val=None, y_val=None): 66 | # 设置随机种子 67 | pytorch_set_seed(self.random_state) 68 | # 构建模型 69 | self.create_model() 70 | # 初始化优化器 71 | self.optimizer = self.optim(params=self.parameters(), lr=self.learning_rate) 72 | # 初始化训练集 73 | X, y = self.to_tensor(X), self.to_tensor(y) 74 | # 若不进行超参数搜索,则初始化验证集 75 | if not self.param_search and X_val is not None and y_val is not None: 76 | X_val, y_val = self.to_tensor(X_val), self.to_tensor(y_val) 77 | 78 | # 训练每个epoch 79 | pbar = tqdm(range(self.epochs), file=sys.stdout, desc=self.model_name) 80 | for epoch in pbar: 81 | self.to(self.device) 82 | train_loss, train_score, train_score_list = self.fit_epoch(X, y, train=True) 83 | train_score_dict = {self.metrics.__name__: train_score} 84 | for i, metrics in enumerate(self.metrics_list): 85 | train_score_dict.update({metrics.__name__: train_score_list[i]}) 86 | massage_dict = {"train_loss": "%.6f" % train_loss, "train_score": "%.6f" % train_score} 87 | 88 | # 有输入验证集,则计算val_loss和val_score等 89 | val_loss, val_score, val_score_dict = 0, 0, {} 90 | if not self.param_search and X_val is not None and y_val is not None: 91 | val_loss, val_score, val_score_list = self.fit_epoch(X_val, y_val, train=False) 92 | val_score_dict = {self.metrics.__name__: val_score} 93 | for i, metrics in enumerate(self.metrics_list): 94 | val_score_dict.update({metrics.__name__: val_score_list[i]}) 95 | massage_dict.update({"val_loss": "%.6f" % val_loss, "val_score": "%.6f" % val_score}) 96 | 97 | pbar.set_postfix(massage_dict) 98 | 99 | # 不进行超参数搜索,则存储每个epoch的模型和日志 100 | if not self.param_search: 101 | if not self.only_save_last_epoch or self.only_save_last_epoch and epoch + 1 == self.epochs: 102 | # 存储模型 103 | model_path = None 104 | if self.save_model: 105 | model_dir = yaml_config['dir']['model_dir'] 106 | make_dirs(model_dir) 107 | model_path = model_dir + '/%s-%03d-%s.model' % (self.model_name, epoch + 1, int(time.time())) 108 | # 存储模型时,model及其属性device必须保持相同cpu 109 | device = self.device 110 | self.device = 'cpu' 111 | self.to(self.device) 112 | joblib.dump(self, model_path) 113 | self.device = device 114 | self.to(self.device) 115 | # 存储日志 116 | log_dir = yaml_config['dir']['log_dir'] 117 | make_dirs(log_dir) 118 | logger = logging_config(self.model_name, log_dir + '/%s.log' % self.model_name) 119 | logger.info({ 120 | "epoch": epoch + 1, 121 | "best_param_": self.get_params(), 122 | "train_loss": train_loss, 123 | "val_loss": val_loss, 124 | "train_score": train_score, 125 | "best_score_": val_score, 126 | "train_score_dict": train_score_dict, 127 | 
"val_score_dict": val_score_dict, 128 | "model_path": model_path, 129 | }) 130 | 131 | # 每轮拟合 132 | def fit_epoch(self, X, y, train): 133 | mean_loss, y_hat = self.fit_step(X, y, train) 134 | 135 | y_numpy = y.cpu().detach().numpy() 136 | y_hat_numpy = np.hstack(y_hat) if len(y_hat[0].shape) == 1 else np.vstack(y_hat) 137 | score = self.score(X, y_numpy, y_hat_numpy) 138 | score_list = self.score_list(X, y_numpy, y_hat_numpy) 139 | 140 | return mean_loss, score, score_list 141 | 142 | # 拟合步骤 143 | def fit_step(self, X, y=None, train=True): 144 | self.train() if train else self.eval() 145 | 146 | total_loss, y_hat = 0, [] 147 | indexList = range(0, len(X[0]) if type(X) == list else len(X), self.batch_size) 148 | for i in indexList: 149 | if type(X) == list: 150 | X_batch = [x[i: i + self.batch_size].to(self.device) for x in X] 151 | else: 152 | X_batch = X[i: i + self.batch_size].to(self.device) 153 | y_batch = None if y is None else y[i:i + self.batch_size].to(self.device) 154 | output = self.forward(X_batch) 155 | 156 | loss = self.loss_fn(output, y_batch, X_batch) 157 | total_loss += loss.item() 158 | 159 | y_hat_batch = output[0] if type(output) == tuple else output 160 | y_hat.append(y_hat_batch.cpu().detach().numpy()) 161 | 162 | if train: 163 | loss.backward() # 梯度计算 164 | self.optimizer.step() # 优化更新权值 165 | self.optimizer.zero_grad() # 求解梯度前需要清空之前的梯度结果(因为model会累加梯度) 166 | 167 | mean_loss = total_loss / len(indexList) 168 | 169 | return mean_loss, y_hat 170 | 171 | # 预测结果 172 | def predict_output(self, X, batch_size, output_all_value=False): 173 | self.eval() # 求值模式 174 | self.to(self.device) 175 | X = self.to_tensor(X) 176 | y_hat = [] 177 | for i in range(0, len(X[0]) if type(X) == list else len(X), batch_size): 178 | if type(X) == list: 179 | X_batch = [x[i: i + batch_size].to(self.device) for x in X] 180 | else: 181 | X_batch = X[i:i + batch_size].to(self.device) 182 | output = self.forward(X_batch) 183 | if output_all_value: 184 | output_list = output if type(output) == tuple else [output] 185 | y_hat.append([out.cpu().detach().numpy() for out in output_list]) 186 | else: 187 | y_hat_batch = output[0] if type(output) == tuple else output 188 | y_hat.append(y_hat_batch.cpu().detach().numpy()) 189 | if output_all_value: 190 | y_hat_list = [] 191 | for i in range(len(y_hat[0])): 192 | output_list = [out[i] for out in y_hat] 193 | output_stack = np.hstack(output_list) if len(y_hat[0][i].shape) == 1 else np.vstack(output_list) 194 | y_hat_list.append(output_stack) 195 | y_hat = y_hat_list 196 | else: 197 | y_hat = np.hstack(y_hat) if len(y_hat[0].shape) == 1 else np.vstack(y_hat) 198 | return y_hat 199 | 200 | # 深度学习分类器 201 | class DeepLearningClassifier(PytorchModel): 202 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu'): 203 | super().__init__(learning_rate, epochs, batch_size, random_state, device) 204 | self.model_name = "dl_clf" 205 | self._estimator_type = "classifier" 206 | self.label_num = 0 207 | 208 | self.metrics = accuracy_score 209 | 210 | # 组网 211 | def create_model(self): 212 | self.fc1 = nn.Linear(in_features=4, out_features=3) 213 | self.fc2 = nn.Linear(in_features=3, out_features=self.label_num) 214 | self.relu = nn.ReLU() 215 | 216 | # 损失函数 217 | def loss_fn(self, output, y_true, X_true): 218 | y_hat = output[0] if type(output) == tuple else output 219 | return F.cross_entropy(y_hat, y_true) 220 | 221 | # 前向推理 222 | def forward(self, X): 223 | y = self.fc1(X) 224 | y = self.relu(y) 225 | y = self.fc2(y) 226 | return 
y 227 | 228 | def fit(self, X, y, X_val=None, y_val=None): 229 | self.label_num = len(set(y)) if self.label_num == 0 else self.label_num 230 | super().fit(X, y, X_val, y_val) 231 | 232 | # 预测分类概率 233 | def predict_proba(self, X, batch_size=10000): 234 | y_prob = super().predict_output(X, batch_size) 235 | return y_prob 236 | 237 | # 预测分类标签 238 | def predict(self, X, y_prob=None, batch_size=10000): 239 | if y_prob is None: 240 | y_prob = self.predict_proba(X, batch_size) 241 | return y_prob.argmax(axis=1) 242 | 243 | # 评价指标 244 | def score(self, X, y, y_prob=None, batch_size=10000): 245 | if y_prob is None: 246 | y_prob = self.predict_proba(X, batch_size) 247 | y_pred = self.predict(X, y_prob, batch_size) 248 | if self.label_num == 2 and 'auc' in self.metrics.__name__: 249 | return self.metrics(y, y_prob[:, 1]) if len(y_prob.shape) > 1 else self.metrics(y, y_prob) 250 | return self.metrics(y, y_pred) 251 | 252 | # 评价指标列表 253 | def score_list(self, X, y, y_prob=None, batch_size=10000): 254 | score_list = [] 255 | if y_prob is None: 256 | y_prob = self.predict_proba(X, batch_size) 257 | y_pred = self.predict(X, y_prob, batch_size) 258 | for metrics in self.metrics_list: 259 | if self.label_num == 2 and 'auc' in metrics.__name__: 260 | score = metrics(y, y_prob[:, 1]) if len(y_prob.shape) > 1 else metrics(y, y_prob) 261 | else: 262 | score = metrics(y, y_pred) 263 | score_list.append(score) 264 | return score_list 265 | 266 | # 深度学习回归器 267 | class DeepLearningRegressor(PytorchModel): 268 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu'): 269 | super().__init__(learning_rate, epochs, batch_size, random_state, device) 270 | self.model_name = "dl_reg" 271 | self._estimator_type = "regressor" 272 | 273 | self.metrics = mean_squared_error 274 | 275 | # 组网 276 | def create_model(self): 277 | self.fc1 = nn.Linear(in_features=4, out_features=2) 278 | self.fc2 = nn.Linear(in_features=2, out_features=1) 279 | self.relu = nn.ReLU() 280 | self.sigmoid = nn.Sigmoid() 281 | 282 | # 损失函数 283 | def loss_fn(self, output, y_true, X_true): 284 | y_hat = output[0] if type(output) == tuple else output 285 | return F.mse_loss(y_hat, y_true) 286 | 287 | # 前向推理 288 | def forward(self, X): 289 | y = self.fc1(X) 290 | y = self.relu(y) 291 | y = self.fc2(y) 292 | y = self.sigmoid(y) 293 | y = y.squeeze(-1) 294 | return y 295 | 296 | # 预测标签 297 | def predict(self, X, batch_size=10000): 298 | y_pred = super().predict_output(X, batch_size) 299 | return y_pred 300 | 301 | # 评价指标 302 | def score(self, X, y, y_pred=None, batch_size=10000): 303 | if y_pred is None: 304 | y_pred = self.predict(X, batch_size) 305 | return self.metrics(y, y_pred) 306 | 307 | # 评价指标列表 308 | def score_list(self, X, y, y_pred=None, batch_size=10000): 309 | score_list = [] 310 | if y_pred is None: 311 | y_pred = self.predict(X, batch_size) 312 | for metrics in self.metrics_list: 313 | score = metrics(y, y_pred) 314 | score_list.append(score) 315 | return score_list 316 | 317 | # 自编码器 318 | class AutoEncoder(DeepLearningClassifier): 319 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu'): 320 | super().__init__(learning_rate, epochs, batch_size, random_state, device) 321 | self.model_name = "ae" 322 | self.threshold = 0.5 # 正异常阈值 323 | self.normal = 0 # 正常数据的类别 324 | 325 | # 组网 326 | def create_model(self): 327 | self.encoder = nn.Sequential( 328 | nn.Linear(in_features=4, out_features=1), 329 | ) 330 | self.decoder = nn.Sequential( 331 | 
nn.Linear(in_features=1, out_features=4) 332 | ) 333 | 334 | # 损失函数 335 | def loss_fn(self, output, y_true, X_true): 336 | X_hat = output[0] if type(output) == tuple else output 337 | return F.mse_loss(X_hat, X_true) 338 | 339 | # 前向推理 340 | def forward(self, X): 341 | Z = self.encoder(X) 342 | X_hat = self.decoder(Z) 343 | return X_hat 344 | 345 | # 每轮拟合 346 | def fit_epoch(self, X, y, train): 347 | mean_loss, X_hat = self.fit_step(X, train=train) 348 | X_numpy, X_hat_numpy = X.cpu().detach().numpy(), np.vstack(X_hat) 349 | if train: 350 | # 用训练数据获取阈值范围 351 | self.threshold = self.get_threshold(X_numpy, X_hat_numpy) 352 | 353 | y_prob = self.get_proba_score(X_numpy, X_hat_numpy) # y_pred取1的概率 354 | y_numpy = y.cpu().detach().numpy() 355 | 356 | score = self.score(X, y_numpy, y_prob) 357 | score_list = self.score_list(X, y_numpy, y_prob) 358 | 359 | return mean_loss, score, score_list 360 | 361 | # 预测得分 362 | def get_proba_score(self, X, X_hat): 363 | # 二范数,同np.sqrt(np.sum((X - X_hat) ** 2, axis=1)) 364 | errors = np.linalg.norm(X - X_hat, axis=1, ord=2) 365 | # 根据误差计算得分,将分数控制在0-1内 366 | scores = errors / X.shape[1] if self.normal == 0 else 1 / (errors + 1) 367 | return scores 368 | 369 | # 计算阈值 370 | def get_threshold(self, X, X_hat): 371 | return 0.5 372 | 373 | # 预测概率 374 | def predict_proba(self, X, batch_size=10000): 375 | X_hat = super().predict_proba(X, batch_size) 376 | if type(X) == torch.Tensor: 377 | X = X.cpu().detach().numpy() 378 | y_prob = self.get_proba_score(X, X_hat) 379 | return y_prob 380 | 381 | # 预测标签 382 | def predict(self, X, y_prob=None, batch_size=10000): 383 | if y_prob is None: 384 | y_prob = self.predict_proba(X, batch_size) 385 | if self.normal == 0: 386 | y_pred = np.array([self.normal if score <= self.threshold else 1 - self.normal for score in y_prob]) 387 | else: 388 | y_pred = np.array([self.normal if score >= self.threshold else 1 - self.normal for score in y_prob]) 389 | return y_pred 390 | 391 | # 监督自编码 392 | class SupervisedAutoEncoder(AutoEncoder): 393 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu'): 394 | super().__init__(learning_rate, epochs, batch_size, random_state, device) 395 | self.model_name = "sae" 396 | 397 | # 每轮拟合 398 | def fit_epoch(self, X, y, train): 399 | y_numpy = y.cpu().detach().numpy() 400 | normal_index = y_numpy == self.normal 401 | if train: 402 | self.fit_step(X[normal_index], train=True) # 只训练正常数据 403 | 404 | mean_loss, X_hat = self.fit_step(X, train=False) # 不进行训练 405 | X_numpy, X_hat_numpy = X.cpu().detach().numpy(), np.vstack(X_hat) 406 | if train: 407 | # 用正常数据获取正常阈值范围 408 | self.threshold = self.get_threshold(X_numpy[normal_index], X_hat_numpy[normal_index]) 409 | 410 | y_prob = self.get_proba_score(X_numpy, X_hat_numpy) # y_pred取1的概率 411 | 412 | score = self.score(X, y_numpy, y_prob) 413 | score_list = self.score_list(X, y_numpy, y_prob) 414 | 415 | return mean_loss, score, score_list 416 | 417 | # 计算阈值 418 | def get_threshold(self, X, X_hat): 419 | scores = self.get_proba_score(X, X_hat) 420 | return max(scores) if self.normal == 0 else min(scores) 421 | 422 | # 变分自编码 423 | class VariationalAutoEncoder(AutoEncoder): 424 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu'): 425 | super().__init__(learning_rate, epochs, batch_size, random_state, device) 426 | self.model_name = "vae" 427 | 428 | # 组网 429 | def create_model(self): 430 | self.encoder = nn.Sequential( 431 | nn.Linear(in_features=4, out_features=2), 432 | ) 433 | 
self.decoder = nn.Sequential( 434 | nn.Linear(in_features=1, out_features=4), 435 | nn.Sigmoid(), 436 | ) 437 | 438 | # 损失函数 439 | def loss_fn(self, output, y_true, X_true): 440 | X_hat, mu, log_sigma = output 441 | BCE = F.binary_cross_entropy(X_hat, X_true, reduction='sum') 442 | D_KL = 0.5 * torch.sum(torch.exp(log_sigma) + torch.pow(mu, 2) - 1. - log_sigma) 443 | loss = BCE + D_KL 444 | return loss 445 | 446 | # 前向推理 447 | def forward(self, X): 448 | H = self.encoder(X) 449 | mu, log_sigma = H.chunk(2, dim=-1) 450 | Z = self.reparameterize(mu, log_sigma) 451 | X_hat = self.decoder(Z) 452 | return X_hat, mu, log_sigma 453 | 454 | # 重构Z层:均值+随机采样*标准差 455 | def reparameterize(self, mu, log_sigma): 456 | std = torch.exp(log_sigma * 0.5) 457 | esp = torch.randn(std.size()) 458 | z = mu + esp * std 459 | return z 460 | 461 | # 监督变分自编码 462 | class SupervisedVariationalAutoEncoder(SupervisedAutoEncoder, VariationalAutoEncoder): 463 | def __init__(self, learning_rate=0.001, epochs=100, batch_size=50, random_state=0, device='cpu'): 464 | super().__init__(learning_rate, epochs, batch_size, random_state, device) 465 | self.model_name = "svae" 466 | -------------------------------------------------------------------------------- /optUtils/trainUtil.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2021/6/3 22:01 4 | # @Author : LYX-夜光 5 | import time 6 | 7 | import numpy as np 8 | 9 | import joblib 10 | from joblib import Parallel, delayed 11 | from sklearn.model_selection import KFold 12 | from skopt import BayesSearchCV 13 | 14 | from optUtils import make_dirs, yaml_config 15 | from optUtils.logUtil import logging_config 16 | from optUtils.modelUtil import model_selection 17 | 18 | 19 | # 机器学习常规训练 20 | def ml_train(X, y, X_test, y_test, model_name, model_param={}, metrics_list=(), model=None): 21 | """ 22 | :param X: 训练集的特征 23 | :param y: 训练集的标签 24 | :param X_test: 测试集的特征 25 | :param y_test: 测试集的标签 26 | :param model_name: 模型名称 27 | :param model_param: 模型参数,可缺省 28 | :param metrics_list: 多个评价指标,可缺省,默认使用模型自带的评价指标 29 | :return: 30 | """ 31 | 32 | log_dir = yaml_config['dir']['log_dir'] 33 | cus_param = yaml_config['cus_param'] 34 | 35 | if model is None: 36 | model = model_selection(model_name, **model_param) 37 | 38 | start_time = time.time() 39 | 40 | model.fit(X, y) 41 | 42 | # 获取评价指标 43 | def get_score(mdl, X, y): 44 | score_list = [] 45 | if metrics_list: 46 | y_pred = mdl.predict(X) 47 | for metrics in metrics_list: 48 | score_list.append(metrics(y, y_pred)) 49 | return score_list 50 | score_list.append(mdl.score(X, y)) 51 | return score_list 52 | 53 | train_score_list = get_score(model, X, y) 54 | train_score_dict = {metrics.__name__: train_score for metrics, train_score in zip(metrics_list, train_score_list)} 55 | 56 | test_score_list = get_score(model, X_test, y_test) 57 | test_score_dict = {metrics.__name__: val_score for metrics, val_score in zip(metrics_list, test_score_list)} 58 | 59 | run_time = int(time.time() - start_time) 60 | 61 | print("model: %s - train score: %.6f - test score: %.6f - time: %ds" % ( 62 | model_name, train_score_list[0], test_score_list[0], run_time)) 63 | 64 | # 配置日志文件 65 | make_dirs(log_dir) 66 | logger = logging_config(model_name, log_dir + '/%s.log' % model_name) 67 | log_message = { 68 | "cus_param": cus_param, 69 | "best_param_": model_param, 70 | "best_score_": test_score_list[0], 71 | "train_score": train_score_list[0], 72 | "train_score_dict": 
train_score_dict, 73 | "test_score_dict": test_score_dict, 74 | } 75 | logger.info(log_message) 76 | 77 | 78 | # 交叉验证 79 | def cv_train(X, y, model_name, model_param={}, metrics_list=(), model=None): 80 | """ 81 | :param X: 训练集的特征 82 | :param y: 训练集的标签 83 | :param model_name: 模型名称 84 | :param model_param: 模型参数,可缺省 85 | :param metrics_list: 多个评价指标,可缺省,默认使用模型自带的评价指标 86 | :param model: 机器学习或深度学习模型,可缺省,默认根据模型名称获取模型 87 | :return: 88 | """ 89 | 90 | log_dir = yaml_config['dir']['log_dir'] 91 | cus_param, cv_param = yaml_config['cus_param'], yaml_config['cv_param'] 92 | 93 | if model is None: 94 | model = model_selection(model_name, **model_param) 95 | if metrics_list: 96 | model.metrics = metrics_list[0] 97 | 98 | # 计算每一折的评价指标 99 | def cv_score(mdl, X, y): 100 | score_list = [] 101 | if metrics_list: 102 | y_pred = mdl.predict(X) 103 | for metrics in metrics_list: 104 | score_list.append(metrics(y, y_pred)) 105 | return score_list 106 | score_list.append(mdl.score(X, y)) 107 | return score_list 108 | 109 | # 获取每一折的训练和验证分数 110 | def get_score(mdl, train_index, val_index): 111 | start_time = time.time() 112 | mdl.fit(X[train_index], y[train_index]) 113 | train_score_list = cv_score(mdl, X[train_index], y[train_index]) 114 | val_score_list = cv_score(mdl, X[val_index], y[val_index]) 115 | run_time = int(time.time() - start_time) 116 | print("train score: %.6f - val score: %.6f - time: %ds" % (train_score_list[0], val_score_list[0], run_time)) 117 | return train_score_list, val_score_list 118 | 119 | print("参数设置:%s" % model_param) 120 | parallel = Parallel(n_jobs=cv_param['workers'], verbose=4) 121 | k_fold = KFold(n_splits=cv_param['fold']) 122 | score_lists = parallel( 123 | delayed(get_score)(model, train, val) for train, val in k_fold.split(X, y)) 124 | 125 | train_score_lists = list(map(lambda x: x[0], score_lists)) 126 | val_score_lists = list(map(lambda x: x[1], score_lists)) 127 | 128 | train_score_list = np.mean(train_score_lists, axis=0) 129 | val_score_list = np.mean(val_score_lists, axis=0) 130 | 131 | train_score_dict = {metrics.__name__: train_score for metrics, train_score in zip(metrics_list, train_score_list)} 132 | val_score_dict = {metrics.__name__: val_score for metrics, val_score in zip(metrics_list, val_score_list)} 133 | 134 | # 配置日志文件 135 | make_dirs(log_dir) 136 | logger = logging_config(model_name, log_dir + '/%s.log' % model_name) 137 | log_message = { 138 | "cus_param": cus_param, 139 | "cv_param": cv_param, 140 | "best_param_": model_param, 141 | "best_score_": val_score_list[0], 142 | "train_score": train_score_list[0], 143 | "train_score_dict": train_score_dict, 144 | "val_score_dict": val_score_dict, 145 | } 146 | logger.info(log_message) 147 | 148 | 149 | # 贝叶斯搜索 150 | def bayes_search_train(X, y, model_name, model_param, model=None, X_test=None, y_test=None): 151 | """ 152 | :param X: 训练集的特征 153 | :param y: 训练集的标签 154 | :param model_name: 模型名称 155 | :param model_param: 模型参数 156 | :param model: 机器学习或深度学习模型,可缺省,默认根据模型名称获取模型 157 | :param X_test: 测试集的特征,可缺省 158 | :param y_test: 测试集的标签,可缺省 159 | :return: 无,输出模型文件和结果日志 160 | """ 161 | 162 | model_dir, log_dir = yaml_config['dir']['model_dir'], yaml_config['dir']['log_dir'] 163 | cus_param, bys_param = yaml_config['cus_param'], yaml_config['bys_param'] 164 | 165 | if not model: 166 | model = model_selection(model_name) 167 | 168 | # 将训练集分为cv折,进行cv次训练得到交叉验证分数均值,最后再训练整个训练集 169 | bys = BayesSearchCV( 170 | model, 171 | model_param, 172 | n_iter=bys_param['n_iter'], 173 | cv=bys_param['fold'], 174 | verbose=4, 175 | 
n_jobs=bys_param['workers'], 176 | random_state=cus_param['seed'], 177 | ) 178 | 179 | bys.fit(X, y) 180 | 181 | make_dirs(model_dir) 182 | model_path = model_dir + '/%s-%s.model' % (model_name, int(time.time())) 183 | if 'device' in bys.best_estimator_.get_params(): 184 | bys.best_estimator_.cpu() 185 | bys.best_estimator_.device = 'cpu' 186 | model = bys.best_estimator_ 187 | joblib.dump(model, model_path) 188 | 189 | # 配置日志文件 190 | make_dirs(log_dir) 191 | logger = logging_config(model_name, log_dir + '/%s.log' % model_name) 192 | log_message = { 193 | "cus_param": cus_param, 194 | "bys_param": bys_param, 195 | "best_param_": dict(bys.best_params_), 196 | "best_score_": bys.best_score_, 197 | "train_score": bys.score(X, y), 198 | "model_path": model_path, 199 | } 200 | 201 | # 如果有测试集,则计算测试集分数 202 | if X_test is not None and y_test is not None: 203 | log_message.update({"test_score": bys.score(X_test, y_test)}) 204 | logger.info(log_message) 205 | 206 | return model 207 | -------------------------------------------------------------------------------- /param.yaml: -------------------------------------------------------------------------------- 1 | # 文件夹 2 | dir: 3 | model_dir: ./model # 模型文件夹 4 | log_dir: ./log # 日志文件夹 5 | 6 | # 自定义超参数 7 | cus_param: 8 | seed: 1 # 随机种子 9 | 10 | # 贝叶斯搜索超参数 11 | bys_param: 12 | n_iter: 10 # 迭代次数 13 | fold: 3 # 交叉验证折数 14 | workers: 3 # 进程个数 15 | 16 | # 交叉验证参数 17 | cv_param: 18 | fold: 10 19 | workers: 1 20 | 21 | # 模型超参数 22 | model: 23 | - [lr, { 24 | max_iter: !!python/tuple [50, 200], 25 | C: !!python/tuple [0.8, 1.2, 'uniform'], 26 | random_state: !!python/tuple [1, 500], 27 | }] 28 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.21.2 2 | pandas==1.3.4 3 | scikit-learn 4 | scikit-optimize 5 | pyaml 6 | torch==1.7.1; python_version=="3.7" 7 | tqdm -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2022/2/11 19:02 4 | # @Author : LYX-夜光 5 | import numpy as np 6 | from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score 7 | 8 | from dataPreprocessing import getDataset, standard 9 | 10 | from optUtils import yaml_config 11 | from optUtils.dataUtil import stratified_shuffle_index 12 | 13 | from dl_models.cnn import CNN 14 | from dl_models.ms_cnn import MS_CNN 15 | from dl_models.c_lstm import C_LSTM 16 | from dl_models.c_lstm_ae import C_LSTM_AE 17 | from dl_models.imc_lstm import IMC_LSTM 18 | from dl_models.smc_lstm import SMC_LSTM 19 | from dl_models.cmc_lstm import CMC_LSTM 20 | 21 | if __name__ == "__main__": 22 | seq_len = 60 23 | model_list = [CNN, MS_CNN, C_LSTM, IMC_LSTM, CMC_LSTM, SMC_LSTM, C_LSTM_AE] 24 | dataset_list = ['realAdExchange', 'realTraffic', 'realKnownCause', 'realAWSCloudwatch', 'A1Benchmark', 'realTweets'] 25 | 26 | for model_clf in model_list: 27 | for dataset_name in dataset_list: 28 | X, y, r = getDataset(dataset_name, seq_len=seq_len, pre_list=[standard]) 29 | 30 | seed, fold = yaml_config['cus_param']['seed'], yaml_config['cv_param']['fold'] 31 | # 根据r的取值数量分层抽样 32 | shuffle_index = stratified_shuffle_index(r, n_splits=fold, random_state=seed) 33 | X, y = X[shuffle_index], y[shuffle_index] 34 | 35 | if model_clf == C_LSTM_AE: 36 | length = 10 37 | X = np.array([[x[i: i + length] for i in 
range(len(x) - length + 1)] for x in X]) 38 | 39 | P, total = sum(y > 0), len(y) 40 | print("+: %d (%.2f%%)" % (P, P / total * 100), "-: %d (%.2f%%)" % (total - P, (1 - P / total) * 100)) 41 | 42 | train_point, val_point = int(len(X) * 0.6), int(len(X) * 0.8) 43 | 44 | model = model_clf(learning_rate=0.001, batch_size=512, epochs=500, random_state=1, seq_len=seq_len) 45 | 46 | # model.create_model() 47 | # print(sum([param.nelement() for param in model.parameters()])) 48 | # exit() 49 | 50 | model.model_name += "_%s" % dataset_name 51 | model.param_search = False 52 | model.save_model = True 53 | model.device = 'cuda' 54 | model.metrics = f1_score 55 | model.metrics_list = [recall_score, precision_score, accuracy_score] 56 | model.fit(X[:train_point], y[:train_point], X[train_point:val_point], y[train_point:val_point]) 57 | model.test_score(X[val_point:], y[val_point:]) 58 | -------------------------------------------------------------------------------- /train_windows.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2022/2/20 17:41 4 | # @Author : LYX-夜光 5 | from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score 6 | 7 | from dl_models.c_lstm import C_LSTM 8 | from dataPreprocessing import getDataset, standard 9 | from optUtils import yaml_config 10 | from optUtils.dataUtil import stratified_shuffle_index 11 | from dl_models.imc_lstm import IMC_LSTM 12 | from dl_models.cmc_lstm import CMC_LSTM 13 | from dl_models.smc_lstm import SMC_LSTM 14 | 15 | if __name__ == "__main__": 16 | seq_len_list = [20, 40, 80, 100] 17 | dataset_list = ['realAdExchange', 'realTraffic', 'realKnownCause', 'realAWSCloudwatch', 'A1Benchmark', 'realTweets'] 18 | model_list = [C_LSTM, IMC_LSTM, CMC_LSTM, SMC_LSTM] 19 | 20 | for seq_len in seq_len_list: 21 | for dataset_name in dataset_list: 22 | for model_clf in model_list: 23 | X, y, r = getDataset(dataset_name, seq_len=seq_len, pre_list=[standard]) 24 | 25 | seed, fold = yaml_config['cus_param']['seed'], yaml_config['cv_param']['fold'] 26 | # 根据r的取值数量分层抽样 27 | shuffle_index = stratified_shuffle_index(r, n_splits=fold, random_state=seed) 28 | X, y = X[shuffle_index], y[shuffle_index] 29 | 30 | P, total = sum(y > 0), len(y) 31 | print("+: %d (%.2f%%)" % (P, P / total * 100), "-: %d (%.2f%%)" % (total - P, (1 - P / total) * 100)) 32 | train_point, val_point = int(len(X) * 0.6), int(len(X) * 0.8) 33 | 34 | model = model_clf(learning_rate=0.001, batch_size=512, epochs=500, random_state=1, seq_len=seq_len) 35 | model.model_name += "_%s_%s" % (seq_len, dataset_name) 36 | model.param_search = False 37 | model.save_model = True 38 | model.device = 'cuda' 39 | model.metrics = f1_score 40 | model.metrics_list = [recall_score, precision_score, accuracy_score] 41 | model.fit(X[:train_point], y[:train_point], X[train_point:val_point], y[train_point:val_point]) 42 | model.test_score(X[val_point:], y[val_point:]) 43 | --------------------------------------------------------------------------------