├── README.md
├── config.json
├── data_generator.py
├── df.py
├── evaluate.py
├── preprocess_data.py
├── requirements.txt
├── run_model.py
├── var_cnn.py
└── wang_to_varcnn.py
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # Var-CNN in PyTorch
3 | 
4 | This repository contains the PyTorch implementation of the following paper:
5 | 
6 | [Var-CNN: A Data-Efficient Website Fingerprinting Attack Based on Deep
7 | Learning](https://arxiv.org/abs/1802.10215) (PETS 2019,
8 | [link to presentation](https://docs.google.com/presentation/d/1Pry0nfHXvDhWYbHjf2QK75Rquc51s9rxK0g2XSNVeIk/edit?usp=sharing))
9 | 
10 | **Authors**: Sanjit Bhat, David Lu, Albert Kwon, and Srini Devadas
11 | 
12 | ## Overview
13 | This repository is a PyTorch implementation of Var-CNN, a data-efficient approach for performing website fingerprinting attacks using deep learning techniques. The original implementation was based on TensorFlow and Keras. This project aims to provide the same functionality and performance in a more flexible and extensible PyTorch framework.
14 | 
15 | ## Dependencies
16 | 1. Use a machine with an NVIDIA GPU if possible; training runs considerably faster on a GPU than on a CPU.
17 | 2. Make sure you have the PyTorch deep learning framework installed. Follow the [installation guide](https://pytorch.org/get-started/locally/) for your specific environment.
18 | 3. To install all required Python packages, simply issue the following command:
19 | ```bash
20 | pip install -r requirements.txt
21 | ```
22 | 
23 | ## Control Flow
24 | The first step in running our model is to place a sufficient number
25 | of raw packet sequences in the ```data_dir``` folder. **Each** monitored
26 | website needs to have **at least** ```num_mon_inst_train``` + ```num_mon_inst_test```
27 | instances, and there needs to be **at least**
28 | ```num_unmon_sites_train``` + ```num_unmon_sites_test``` unmonitored sites.
29 | 
30 | If you use the Wang et al. data format (i.e., each line representing a
31 | new packet with the relative time and direction separated by a space),
32 | then we have that supported in ```wang_to_varcnn.py```. Otherwise, you will need
33 | to modify ```wang_to_varcnn.py```, or you can write your own glue code to move
34 | to the **Wang et al. format**.
35 | 
36 | After setting up the data and specifying the parameters in ```config.json```,
37 | you can run all parts of our code just by issuing a ```python run_model.py```
38 | command. After that, our programs will be called in the following sequence:
39 | 
40 | 1. ```wang_to_varcnn.py```: This parses the ```data_dir``` folder; extracts direction,
41 | time, metadata, and labels; and stores all the monitored and unmonitored
42 | traces in ```all_closed_world.npz``` and ```all_open_world.npz```,
43 | respectively, in the ```data_dir``` folder.
44 | 2. ```preprocess_data.py```: This uses the data in
45 | ```all_closed_world.npz``` to pick a random
46 | ```num_mon_inst_train``` and ```num_mon_inst_test``` instances of each
47 | of the ```num_mon_sites``` monitored sites for the training and test sets, respectively.
48 | It also performs a similar random split for the unmonitored sites (using the
49 | ```all_open_world.npz``` file) and preprocesses
50 | all of these traces to scale the metadata, change to inter-packet timing, etc.
51 | Finally, it saves the direction data, time data, metadata, and labels
52 | to ```.h5``` files to conserve RAM during the training process. The resulting file layout is sketched just below.
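For reference, with the default ```config.json``` the preprocessed file is named ```data/100_100_9000_1000.h5``` (the pattern is ```<num_mon_sites>_<num_mon_inst>_<num_unmon_sites_train>_<num_unmon_sites_test>.h5```). The following minimal sketch, which assumes preprocessing has already finished, shows how to inspect its layout with ```h5py```:

```python
import h5py

# Each split holds direction sequences, timing sequences, metadata, and one-hot labels.
with h5py.File('data/100_100_9000_1000.h5', 'r') as f:
    for split in ('training_data', 'validation_data', 'test_data'):
        for name in ('dir_seq', 'time_seq', 'metadata', 'labels'):
            print(split, name, f[f'{split}/{name}'].shape)
```

53 | 3.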
```run_model.py```: This is the main file that first calls the prior two files. 54 | Next, it loads the model architectures from either ```var_cnn.py``` or 55 | ```df.py```, trains the models, saves their predictions, and 56 | calls ```evaluate.py``` for evaluation. 57 | 4. During training, ```data_generator.py``` generates new batches of data in 58 | parallel. Since large datasets can contain hundreds of thousands 59 | of traces, ```data_generator.py``` uses ```.h5``` files to 60 | access the traces for one batch without loading the entire dataset into memory. 61 | 5. ```evaluate.py```: This first calculates metrics for each of the 62 | in-training combinations specified in ```mixture```. Then, 63 | it averages each of their predictions together and reports metrics for the 64 | overall out-of-training ensemble. It saves all metrics to the ```job_result.json``` 65 | file. 66 | 67 | 68 | ## Parameters 69 | ```config.json``` provides the configuration settings to all the 70 | other programs. We describe its parameters in further detail below: 71 | 72 | 1. ```data_dir```: This relative path provides the location of the "raw" 73 | packet sequences (e.g., the "0", "1", "0-0", "0-1" files in Wang et al.'s 74 | dataset). Also, it later stores the ```all_closed_world.npz``` and 75 | ```all_open_world.npz``` files generated by ```wang_to_varcnn.py``` and the 76 | ```.h5``` data files generated by ```preprocess_data.py```. 77 | 2. ```predictions_dir```: After training the model, ```run_model.py``` 78 | generates predictions for the test set and stores them in this directory. 79 | ```evaluate.py``` later uses them to calculate test metrics. 80 | 3. ```num_mon_sites```: The number of monitored websites. 81 | Each of the ```num_mon_sites``` sites in ```data_dir``` must have at least 82 | ```num_mon_inst_train``` + ```num_mon_inst_test``` instances. 83 | 4. ```num_mon_inst_train```: The number of monitored instances used for training. 84 | 5. ```num_mon_inst_test```: The number of monitored instances used for testing. 85 | 6. ```num_unmon_sites_train```: The number of unmonitored sites used for training. 86 | Each site has one instance. 87 | 7. ```num_unmon_sites_test```: The number of unmonitored sites used for testing. 88 | Each site has one instance, and these unmonitored websites are different 89 | from those used for training. 90 | 8. ```model_name```: The model name. Either "var-cnn" or "df". 91 | 9. ```batch_size```: The batch size used during training. For Var-CNN, 92 | we found that a batch size of 50 works well. The recommended batch size 93 | for DF is 128. 94 | 10. ```mixture```: The mixture of ensembles used during training and 95 | evaluation. Each of the inner arrays represent models combined in-training. 96 | ```run_model``` will save the predictions for every such in-training combination. 97 | Subsequently, ```evaluate_ensemble``` will report metrics for these individual 98 | models as well as the overall out-of-training ensemble (i.e., the average 99 | of the individual predictions). Note: this functionality only works 100 | with Var-CNN (in fact, deep fingerprinting will automatically default 101 | to using ```[["dir"]]```). Also, do not use two in-training combinations 102 | with the same components as their prediction files will be overwritten. 103 | Default: ```[["dir", "metadata"], ["time", "metadata"]]``` for Var-CNN. 104 | 11. ```seq_length```: The length of the input sequence fed into the CNN 105 | (default: 5000). 
106 | We use this parameter right from the start when scraping the raw data. 107 | 12. ```df_epochs```: The number of epochs used to train DF (default: 30). 108 | 13. ```var_cnn_max_epochs```: The maximum number of epochs used to train 109 | Var-CNN (default: 150). The ```EarlyStopping``` callback often cuts off training 110 | much sooner -- whenever validation accuracy fails to increase. 111 | 14. ```var_cnn_base_patience```: The "patience" (i.e., number of epochs 112 | of no validation accuracy improvement) until we decrease the learning 113 | rate of Var-CNN and stop training (default: 5). We implement this functionality 114 | in the ```ReduceLROnPlateau``` and ```EarlyStopping``` callbacks 115 | inside ```var_cnn.py```. 116 | 15. ```dir_dilations```: Whether to use dilations with the 117 | direction ResNet (default: true). 118 | 16. ```time_dilations```: Whether to use dilations with the 119 | time ResNet (default: true). 120 | 17. ```inter_time```: Whether to use the inter-packet time 121 | (i.e., time between two packets) or the relative time (i.e., 122 | time from the first packet) for timing data (default: true, i.e., we do use 123 | inter-packet time). 124 | 18. ```scale_metadata```. Whether to scale metadata to zero mean 125 | and unit variance (default: true). 126 | 127 | 128 | ## Citation 129 | If you find Var-CNN useful in your research, please consider citing: 130 | 131 | @article{bhat19, 132 | title={{Var-CNN: A Data-Efficient Website Fingerprinting Attack Based on Deep Learning}}, 133 | author={Bhat, Sanjit and Lu, David and Kwon, Albert and Devadas, Srinivas}, 134 | journal={Proceedings on Privacy Enhancing Technologies}, 135 | volume={4}, 136 | pages={292--310}, 137 | year={2019} 138 | } 139 | 140 | -------------------------------------------------------------------------------- /config.json: -------------------------------------------------------------------------------- 1 | { 2 | "data_dir": "data/", 3 | "predictions_dir": "predictions/", 4 | 5 | "_comment1": "====== World Size Params ======", 6 | "num_mon_sites": 100, 7 | "num_mon_inst_train": 90, 8 | "num_mon_inst_test": 10, 9 | "num_unmon_sites_train": 9000, 10 | "num_unmon_sites_test": 1000, 11 | 12 | "_comment2": "====== Attack Params (use batch size 50 for Var-CNN) ======", 13 | "model_name": "var-cnn", 14 | "batch_size": 50, 15 | 16 | "_comment3": "====== Var-CNN Architecture Params ======", 17 | "mixture": [["dir", "metadata"], ["time", "metadata"]], 18 | 19 | "_comment4": "====== Other Params =======", 20 | "seq_length": 5000, 21 | "df_epochs": 30, 22 | "var_cnn_max_epochs": 150, 23 | "var_cnn_base_patience": 5, 24 | "dir_dilations": true, 25 | "time_dilations": true, 26 | "inter_time": true, 27 | "scale_metadata": true 28 | } -------------------------------------------------------------------------------- /data_generator.py: -------------------------------------------------------------------------------- 1 | import h5py 2 | import torch 3 | from torch.utils.data import Dataset, DataLoader 4 | 5 | 6 | class H5Dataset(Dataset): 7 | def __init__(self, config, data_type, mixture_num): 8 | """ 9 | Initializes the dataset by loading the HDF5 file and storing the necessary data. 10 | 11 | Args: 12 | config (dict): Configuration dictionary. 13 | data_type (str): Either 'training_data', 'validation_data', or 'test_data'. 14 | mixture_num (int): Index of the mixture to use. 
15 | """ 16 | self.num_mon_sites = config['num_mon_sites'] 17 | self.num_mon_inst_train = config['num_mon_inst_train'] 18 | self.num_mon_inst_test = config['num_mon_inst_test'] 19 | self.num_mon_inst = self.num_mon_inst_train + self.num_mon_inst_test 20 | self.num_unmon_sites_train = config['num_unmon_sites_train'] 21 | self.num_unmon_sites_test = config['num_unmon_sites_test'] 22 | 23 | self.data_dir = config['data_dir'] 24 | self.batch_size = config['batch_size'] 25 | self.mixture = config['mixture'] 26 | self.use_dir = 'dir' in self.mixture[mixture_num] 27 | self.use_time = 'time' in self.mixture[mixture_num] 28 | self.use_metadata = 'metadata' in self.mixture[mixture_num] 29 | 30 | # Open the HDF5 file and load the necessary datasets 31 | self.filepath = f"{self.data_dir}{self.num_mon_sites}_{self.num_mon_inst}_{self.num_unmon_sites_train}_{self.num_unmon_sites_test}.h5" 32 | self.data_type = data_type 33 | 34 | with h5py.File(self.filepath, 'r') as f: 35 | self.dir_seq = f[f'{self.data_type}/dir_seq'] if self.use_dir else None 36 | self.time_seq = f[f'{self.data_type}/time_seq'] if self.use_time else None 37 | self.metadata = f[f'{self.data_type}/metadata'] if self.use_metadata else None 38 | self.labels = f[f'{self.data_type}/labels'] 39 | 40 | def __len__(self): 41 | return len(self.labels) 42 | 43 | def __getitem__(self, idx): 44 | """ 45 | Fetches a single data item. 46 | 47 | Args: 48 | idx (int): Index of the data item. 49 | 50 | Returns: 51 | tuple: Input dictionary and output dictionary (labels). 52 | """ 53 | batch_data = {} 54 | 55 | if self.use_dir: 56 | batch_data['dir_input'] = self.dir_seq[idx] 57 | if self.use_time: 58 | batch_data['time_input'] = self.time_seq[idx] 59 | if self.use_metadata: 60 | batch_data['metadata_input'] = self.metadata[idx] 61 | 62 | labels = self.labels[idx] 63 | return batch_data, labels 64 | 65 | 66 | def create_dataloader(config, data_type, mixture_num, batch_size, shuffle=True, num_workers=4): 67 | """ 68 | Creates a PyTorch DataLoader. 69 | 70 | Args: 71 | config (dict): Configuration dictionary. 72 | data_type (str): Either 'training_data', 'validation_data', or 'test_data'. 73 | mixture_num (int): Index of the mixture to use. 74 | batch_size (int): Batch size. 75 | shuffle (bool): Whether to shuffle the data. 76 | num_workers (int): Number of worker threads to use for data loading. 77 | 78 | Returns: 79 | DataLoader: PyTorch DataLoader object. 
80 | """ 81 | dataset = H5Dataset(config, data_type, mixture_num) 82 | dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers) 83 | return dataloader 84 | -------------------------------------------------------------------------------- /df.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.optim as optim 4 | import torch.nn.functional as F 5 | 6 | class DeepFingerprintingModel(nn.Module): 7 | def __init__(self, config): 8 | super(DeepFingerprintingModel, self).__init__() 9 | 10 | num_mon_sites = config['num_mon_sites'] 11 | num_unmon_sites_test = config['num_unmon_sites_test'] 12 | num_unmon_sites_train = config['num_unmon_sites_train'] 13 | num_unmon_sites = num_unmon_sites_test + num_unmon_sites_train 14 | seq_length = config['seq_length'] 15 | 16 | # Block 1 17 | self.conv1 = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=8, stride=1, padding=4) 18 | self.bn1 = nn.BatchNorm1d(32) 19 | self.conv2 = nn.Conv1d(32, 32, 8, stride=1, padding=4) 20 | self.bn2 = nn.BatchNorm1d(32) 21 | self.pool1 = nn.MaxPool1d(kernel_size=8, stride=4) 22 | self.drop1 = nn.Dropout(0.1) 23 | 24 | # Block 2 25 | self.conv3 = nn.Conv1d(32, 64, 8, stride=1, padding=4) 26 | self.bn3 = nn.BatchNorm1d(64) 27 | self.conv4 = nn.Conv1d(64, 64, 8, stride=1, padding=4) 28 | self.bn4 = nn.BatchNorm1d(64) 29 | self.pool2 = nn.MaxPool1d(kernel_size=8, stride=4) 30 | self.drop2 = nn.Dropout(0.1) 31 | 32 | # Block 3 33 | self.conv5 = nn.Conv1d(64, 128, 8, stride=1, padding=4) 34 | self.bn5 = nn.BatchNorm1d(128) 35 | self.conv6 = nn.Conv1d(128, 128, 8, stride=1, padding=4) 36 | self.bn6 = nn.BatchNorm1d(128) 37 | self.pool3 = nn.MaxPool1d(kernel_size=8, stride=4) 38 | self.drop3 = nn.Dropout(0.1) 39 | 40 | # Block 4 41 | self.conv7 = nn.Conv1d(128, 256, 8, stride=1, padding=4) 42 | self.bn7 = nn.BatchNorm1d(256) 43 | self.conv8 = nn.Conv1d(256, 256, 8, stride=1, padding=4) 44 | self.bn8 = nn.BatchNorm1d(256) 45 | self.pool4 = nn.MaxPool1d(kernel_size=8, stride=4) 46 | self.drop4 = nn.Dropout(0.1) 47 | 48 | # Fully connected layers 49 | self.fc1 = nn.Linear(256 * (seq_length // (4 ** 4)), 512) # Adjust size according to input sequence length 50 | self.bn_fc1 = nn.BatchNorm1d(512) 51 | self.drop_fc1 = nn.Dropout(0.7) 52 | self.fc2 = nn.Linear(512, 512) 53 | self.bn_fc2 = nn.BatchNorm1d(512) 54 | self.drop_fc2 = nn.Dropout(0.5) 55 | 56 | output_classes = num_mon_sites if num_unmon_sites == 0 else num_mon_sites + 1 57 | self.fc3 = nn.Linear(512, output_classes) 58 | 59 | def forward(self, x): 60 | # Block 1 61 | x = F.elu(self.bn1(self.conv1(x))) 62 | x = F.elu(self.bn2(self.conv2(x))) 63 | x = self.pool1(x) 64 | x = self.drop1(x) 65 | 66 | # Block 2 67 | x = F.relu(self.bn3(self.conv3(x))) 68 | x = F.relu(self.bn4(self.conv4(x))) 69 | x = self.pool2(x) 70 | x = self.drop2(x) 71 | 72 | # Block 3 73 | x = F.relu(self.bn5(self.conv5(x))) 74 | x = F.relu(self.bn6(self.conv6(x))) 75 | x = self.pool3(x) 76 | x = self.drop3(x) 77 | 78 | # Block 4 79 | x = F.relu(self.bn7(self.conv7(x))) 80 | x = F.relu(self.bn8(self.conv8(x))) 81 | x = self.pool4(x) 82 | x = self.drop4(x) 83 | 84 | # Flatten 85 | x = x.view(x.size(0), -1) 86 | 87 | # Fully connected layers 88 | x = F.relu(self.bn_fc1(self.fc1(x))) 89 | x = self.drop_fc1(x) 90 | x = F.relu(self.bn_fc2(self.fc2(x))) 91 | x = self.drop_fc2(x) 92 | 93 | # Output layer 94 | x = F.softmax(self.fc3(x), dim=1) 95 | return x 96 | 97 | # Configuration 
example 98 | config = { 99 | 'num_mon_sites': 100, 100 | 'num_mon_inst_test': 50, 101 | 'num_mon_inst_train': 100, 102 | 'num_unmon_sites_test': 50, 103 | 'num_unmon_sites_train': 100, 104 | 'seq_length': 5000 105 | } 106 | 107 | # Instantiate model 108 | model = DeepFingerprintingModel(config) 109 | 110 | # Define optimizer and loss function 111 | optimizer = optim.Adamax(model.parameters(), lr=0.002) 112 | criterion = nn.CrossEntropyLoss() 113 | 114 | # Example input: batch of size 16, sequence length from config['seq_length'], 1 channel 115 | input_data = torch.randn(16, 1, config['seq_length']) 116 | output = model(input_data) 117 | 118 | # Print model output 119 | print(output) 120 | -------------------------------------------------------------------------------- /evaluate.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | import json 4 | import h5py 5 | 6 | def find_accuracy(model_predictions, conf_thresh, actual_labels=None, 7 | num_mon_sites=None, num_mon_inst_test=None, 8 | num_unmon_sites_test=None, num_unmon_sites=None): 9 | """Compute TPR and FPR based on softmax output predictions.""" 10 | 11 | # Calculates output classes (classes with the highest probability) 12 | actual_labels = np.argmax(actual_labels, axis=1) 13 | 14 | # Changes predictions according to confidence threshold 15 | thresh_model_labels = np.zeros(len(model_predictions)) 16 | for inst_num, softmax in enumerate(model_predictions): 17 | predicted_class = np.argmax(softmax) 18 | if predicted_class < num_mon_sites and \ 19 | softmax[predicted_class] < conf_thresh: 20 | thresh_model_labels[inst_num] = num_mon_sites 21 | else: 22 | thresh_model_labels[inst_num] = predicted_class 23 | 24 | # Computes TPR and FPR 25 | two_class_true_pos = 0 # Mon correctly classified as any mon site 26 | multi_class_true_pos = 0 # Mon correctly classified as specific mon site 27 | false_pos = 0 # Unmon incorrectly classified as mon site 28 | 29 | for inst_num, inst_label in enumerate(actual_labels): 30 | if inst_label == num_mon_sites: # Supposed to be unmon site 31 | if thresh_model_labels[inst_num] < num_mon_sites: 32 | false_pos += 1 33 | else: # Supposed to be mon site 34 | if thresh_model_labels[inst_num] < num_mon_sites: 35 | two_class_true_pos += 1 36 | if thresh_model_labels[inst_num] == inst_label: 37 | multi_class_true_pos += 1 38 | 39 | two_class_tpr = two_class_true_pos / \ 40 | (num_mon_sites * num_mon_inst_test) * 100 41 | two_class_tpr = '%.2f' % two_class_tpr + '%' 42 | multi_class_tpr = multi_class_true_pos / \ 43 | (num_mon_sites * num_mon_inst_test) * 100 44 | multi_class_tpr = '%.2f' % multi_class_tpr + '%' 45 | 46 | if num_unmon_sites == 0: # closed-world 47 | fpr = '0.00%' 48 | else: 49 | fpr = false_pos / num_unmon_sites_test * 100 50 | fpr = '%.2f' % fpr + '%' 51 | 52 | return two_class_tpr, multi_class_tpr, fpr 53 | 54 | 55 | def log_cw(results, sub_model_name, softmax, **parameters): 56 | print('%s model:' % sub_model_name) 57 | two_class_tpr, multi_class_tpr, fpr = find_accuracy( 58 | softmax, 0., **parameters) 59 | print('\t accuracy: %s' % multi_class_tpr) 60 | results['%s_acc' % sub_model_name] = multi_class_tpr 61 | 62 | 63 | def log_ow(results, sub_model_name, softmax, **parameters): 64 | print('%s model:' % sub_model_name) 65 | for conf_thresh in np.arange(0, 1.01, 0.1): 66 | two_class_tpr, multi_class_tpr, fpr = find_accuracy( 67 | softmax, conf_thresh, **parameters) 68 | print('\t conf: %f' % conf_thresh) 69 | print('\t \t 
two-class TPR: %s' % two_class_tpr) 70 | print('\t \t multi-class TPR: %s' % multi_class_tpr) 71 | print('\t \t FPR: %s' % fpr) 72 | 73 | prefix = '%s_%f' % (sub_model_name, conf_thresh) 74 | results['%s_two_TPR' % prefix] = two_class_tpr 75 | results['%s_multi_TPR' % prefix] = multi_class_tpr 76 | results['%s_FPR' % prefix] = fpr 77 | 78 | 79 | def log_setting(setting, predictions, results, **parameters): 80 | print(setting + '-world results') 81 | for sub_model_name, softmax in predictions.items(): 82 | if setting == 'closed': 83 | log_cw(results, sub_model_name, softmax, **parameters) 84 | elif setting == 'open': 85 | log_ow(results, sub_model_name, softmax, **parameters) 86 | 87 | 88 | def main(config): 89 | num_mon_sites = config['num_mon_sites'] 90 | num_mon_inst_test = config['num_mon_inst_test'] 91 | num_mon_inst_train = config['num_mon_inst_train'] 92 | num_mon_inst = num_mon_inst_test + num_mon_inst_train 93 | num_unmon_sites_test = config['num_unmon_sites_test'] 94 | num_unmon_sites_train = config['num_unmon_sites_train'] 95 | num_unmon_sites = num_unmon_sites_test + num_unmon_sites_train 96 | 97 | data_dir = config['data_dir'] 98 | predictions_dir = config['predictions_dir'] 99 | mixture = config['mixture'] 100 | 101 | with h5py.File('%s%d_%d_%d_%d.h5' % (data_dir, num_mon_sites, 102 | num_mon_inst, num_unmon_sites_train, 103 | num_unmon_sites_test), 'r') as f: 104 | test_labels = f['test_data/labels'][:] 105 | 106 | # Aggregates predictions from mixture models 107 | predictions = {} 108 | ensemble_softmax = None 109 | for inner_comb in mixture: 110 | sub_model_name = '_'.join(inner_comb) 111 | softmax = np.load('%s%s_model.npy' % (predictions_dir, sub_model_name)) 112 | if ensemble_softmax is None: 113 | ensemble_softmax = np.zeros_like(softmax) 114 | predictions[sub_model_name] = softmax 115 | 116 | parameters = {'actual_labels': test_labels, 117 | 'num_mon_sites': num_mon_sites, 118 | 'num_mon_inst_test': num_mon_inst_test, 119 | 'num_unmon_sites_test': num_unmon_sites_test, 120 | 'num_unmon_sites': num_unmon_sites} 121 | 122 | # Performs simple average to get ensemble predictions 123 | for softmax in predictions.values(): 124 | ensemble_softmax += softmax 125 | ensemble_softmax /= len(predictions) 126 | if len(predictions) > 1: 127 | predictions['ensemble'] = ensemble_softmax 128 | 129 | results = {} 130 | if num_unmon_sites == 0: # Closed-world 131 | log_setting('closed', predictions, results, **parameters) 132 | else: # Open-world 133 | log_setting('open', predictions, results, **parameters) 134 | 135 | with open('job_result.json', 'w') as f: 136 | json.dump(results, f, sort_keys=True, indent=4) 137 | 138 | 139 | if __name__ == '__main__': 140 | with open('config.json') as config_file: 141 | config = json.load(config_file) 142 | 143 | main(config) 144 | -------------------------------------------------------------------------------- /preprocess_data.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import, division, print_function 2 | 3 | import numpy as np 4 | import time 5 | import random 6 | import h5py 7 | import json 8 | import os 9 | from tqdm import tqdm 10 | import torch 11 | from torch.utils.data import Dataset, DataLoader 12 | import wang_to_varcnn 13 | from sklearn.preprocessing import StandardScaler 14 | from torch.nn.functional import one_hot 15 | 16 | class CustomDataset(Dataset): 17 | def __init__(self, dir_seq, time_seq, metadata, labels): 18 | self.dir_seq = torch.tensor(dir_seq, 
dtype=torch.float32) 19 | self.time_seq = torch.tensor(time_seq, dtype=torch.float32) 20 | self.metadata = torch.tensor(metadata, dtype=torch.float32) 21 | self.labels = torch.tensor(labels, dtype=torch.long) 22 | 23 | def __len__(self): 24 | return len(self.dir_seq) 25 | 26 | def __getitem__(self, idx): 27 | return (self.dir_seq[idx], self.time_seq[idx], self.metadata[idx], self.labels[idx]) 28 | 29 | def main(config): 30 | """Preprocesses data and creates .h5 data files.""" 31 | 32 | num_mon_sites = config['num_mon_sites'] 33 | num_mon_inst_test = config['num_mon_inst_test'] 34 | num_mon_inst_train = config['num_mon_inst_train'] 35 | num_mon_inst = num_mon_inst_test + num_mon_inst_train 36 | num_unmon_sites_test = config['num_unmon_sites_test'] 37 | num_unmon_sites_train = config['num_unmon_sites_train'] 38 | num_unmon_sites = num_unmon_sites_test + num_unmon_sites_train 39 | 40 | inter_time = config['inter_time'] 41 | scale_metadata = config['scale_metadata'] 42 | data_dir = config['data_dir'] 43 | mon_data_loc = data_dir + 'all_closed_world.npz' 44 | unmon_data_loc = data_dir + 'all_open_world.npz' 45 | if not os.path.exists(mon_data_loc) or not os.path.exists(unmon_data_loc): 46 | wang_to_varcnn.main(config) 47 | 48 | print(f'Starting {num_mon_sites}_{num_mon_inst}_{num_unmon_sites_train}_{num_unmon_sites_test}.h5') 49 | start = time.time() 50 | 51 | train_seq_and_labels = [] 52 | test_seq_and_labels = [] 53 | 54 | print('reading monitored data') 55 | mon_dataset = np.load(mon_data_loc) 56 | mon_dir_seq = mon_dataset['dir_seq'] 57 | mon_time_seq = mon_dataset['time_seq'] 58 | mon_metadata = mon_dataset['metadata'] 59 | mon_labels = mon_dataset['labels'] 60 | 61 | mon_site_data = {} 62 | mon_site_labels = {} 63 | print('getting enough monitored websites') 64 | for dir_seq, time_seq, metadata, site_name in tqdm(zip(mon_dir_seq, mon_time_seq, mon_metadata, mon_labels)): 65 | if site_name not in mon_site_data: 66 | if len(mon_site_data) >= num_mon_sites: 67 | continue 68 | else: 69 | mon_site_data[site_name] = [] 70 | mon_site_labels[site_name] = len(mon_site_labels) 71 | 72 | mon_site_data[site_name].append([dir_seq, time_seq, metadata, mon_site_labels[site_name]]) 73 | 74 | print('randomly choosing instances for training and test sets') 75 | assert len(mon_site_data) == num_mon_sites 76 | for instances in tqdm(mon_site_data.values()): 77 | random.shuffle(instances) 78 | assert len(instances) >= num_mon_inst 79 | for inst_num, all_data in enumerate(instances): 80 | if inst_num < num_mon_inst_train: 81 | train_seq_and_labels.append(all_data) 82 | elif inst_num < num_mon_inst: 83 | test_seq_and_labels.append(all_data) 84 | else: 85 | break 86 | 87 | del mon_dataset, mon_dir_seq, mon_time_seq, mon_metadata, mon_labels, mon_site_data, mon_site_labels 88 | 89 | print('reading unmonitored data') 90 | 91 | unmon_dataset = np.load(unmon_data_loc) 92 | unmon_dir_seq = unmon_dataset['dir_seq'] 93 | unmon_time_seq = unmon_dataset['time_seq'] 94 | unmon_metadata = unmon_dataset['metadata'] 95 | 96 | unmon_site_data = [[dir_seq, time_seq, metadata, num_mon_sites] for dir_seq, time_seq, metadata in 97 | zip(unmon_dir_seq, unmon_time_seq, unmon_metadata)] 98 | 99 | print('randomly choosing unmonitored instances for training and test sets') 100 | random.shuffle(unmon_site_data) 101 | assert len(unmon_site_data) >= num_unmon_sites 102 | for inst_num, all_data in tqdm(enumerate(unmon_site_data)): 103 | if inst_num < num_unmon_sites_train: 104 | train_seq_and_labels.append(all_data) 105 | elif inst_num < 
num_unmon_sites: 106 | test_seq_and_labels.append(all_data) 107 | else: 108 | break 109 | 110 | del unmon_dataset, unmon_dir_seq, unmon_time_seq, unmon_metadata, unmon_site_data 111 | 112 | print('processing data') 113 | 114 | # Shuffle training and testing data 115 | random.shuffle(train_seq_and_labels) 116 | random.shuffle(test_seq_and_labels) 117 | 118 | train_dir, train_time, train_metadata, train_labels = zip(*train_seq_and_labels) 119 | test_dir, test_time, test_metadata, test_labels = zip(*test_seq_and_labels) 120 | 121 | train_dir = np.array(train_dir) 122 | train_time = np.array(train_time) 123 | train_metadata = np.array(train_metadata) 124 | 125 | test_dir = np.array(test_dir) 126 | test_time = np.array(test_time) 127 | test_metadata = np.array(test_metadata) 128 | 129 | # Converts from absolute times to inter-packet times. 130 | if inter_time: 131 | train_time[:, 1:] = train_time[:, 1:] - train_time[:, :-1] 132 | test_time[:, 1:] = test_time[:, 1:] - test_time[:, :-1] 133 | 134 | # Reshape to add 3rd dim for CNN input 135 | train_dir = np.expand_dims(train_dir, axis=-1) 136 | test_dir = np.expand_dims(test_dir, axis=-1) 137 | 138 | train_time = np.expand_dims(train_time, axis=-1) 139 | test_time = np.expand_dims(test_time, axis=-1) 140 | 141 | if scale_metadata: 142 | metadata_scaler = StandardScaler() 143 | train_metadata = metadata_scaler.fit_transform(train_metadata) 144 | test_metadata = metadata_scaler.transform(test_metadata) 145 | 146 | # One-hot encoding of labels 147 | num_classes = num_mon_sites if num_unmon_sites == 0 else num_mon_sites + 1 148 | train_labels = one_hot(torch.tensor(train_labels), num_classes=num_classes) 149 | test_labels = one_hot(torch.tensor(test_labels), num_classes=num_classes) 150 | 151 | print('training data stats:') 152 | print(train_dir.shape) 153 | print(train_time.shape) 154 | print(train_metadata.shape) 155 | print(train_labels.shape) 156 | 157 | print('testing data stats:') 158 | print(test_dir.shape) 159 | print(test_time.shape) 160 | print(test_metadata.shape) 161 | print(test_labels.shape) 162 | 163 | print('saving data') 164 | with h5py.File(f'{data_dir}{num_mon_sites}_{num_mon_inst}_{num_unmon_sites_train}_{num_unmon_sites_test}.h5', 'w') as f: 165 | f.create_group('training_data') 166 | f.create_group('validation_data') 167 | f.create_group('test_data') 168 | 169 | for ds_name, arr in [['dir_seq', train_dir], 170 | ['time_seq', train_time], 171 | ['metadata', train_metadata], 172 | ['labels', train_labels.numpy()]]: 173 | f.create_dataset('training_data/' + ds_name, data=arr[:int(0.95 * len(arr))]) 174 | 175 | for ds_name, arr in [['dir_seq', train_dir], 176 | ['time_seq', train_time], 177 | ['metadata', train_metadata], 178 | ['labels', train_labels.numpy()]]: 179 | f.create_dataset('validation_data/' + ds_name, data=arr[int(0.95 * len(arr)):]) 180 | 181 | for ds_name, arr in [['dir_seq', test_dir], 182 | ['time_seq', test_time], 183 | ['metadata', test_metadata], 184 | ['labels', test_labels.numpy()]]: 185 | f.create_dataset('test_data/' + ds_name, data=arr) 186 | 187 | end = time.time() 188 | print(f'Finished {num_mon_sites}_{num_mon_inst}_{num_unmon_sites_train}_{num_unmon_sites_test}.h5 in {end - start:.2f} seconds') 189 | 190 | if __name__ == '__main__': 191 | with open('config.json') as config_file: 192 | config = json.load(config_file) 193 | 194 | main(config) 195 | -------------------------------------------------------------------------------- /requirements.txt: 
--------------------------------------------------------------------------------
1 | ###### Requirements without Version Specifiers ######
2 | h5py
3 | tqdm
4 | scikit-learn
5 | numpy
6 | 
7 | ###### Requirements with Version Specifiers ######
8 | 
9 | torch == 1.13.0
10 | torchvision == 0.14.0
--------------------------------------------------------------------------------
/run_model.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function, absolute_import, division
2 | 
3 | import json
4 | import os
5 | import time
6 | import torch
7 | import numpy as np
8 | 
9 | import var_cnn
10 | import df
11 | import evaluate
12 | import preprocess_data
13 | import data_generator
14 | 
15 | 
16 | def update_config(config, updates):
17 | """Updates config dict and config file with updates dict."""
18 | config.update(updates)
19 | with open('config.json', 'w') as f:
20 | json.dump(config, f, indent=4)
21 | 
22 | 
23 | def is_valid_mixture(mixture):
24 | """Check if mixture is a 2D array with strings representing the models."""
25 | assert isinstance(mixture, list) and len(mixture) > 0
26 | for inner_comb in mixture:
27 | assert isinstance(inner_comb, list) and len(inner_comb) > 0
28 | for model in inner_comb:
29 | assert model in ['dir', 'time', 'metadata']
30 | 
31 | 
32 | def train_and_val(config, model, optimizer, criterion, mixture_num, sub_model_name):
33 | """Train and validate model."""
34 | print(f'training {config["model_name"]} {sub_model_name} model')
35 | 
36 | train_size = int((num_mon_sites * num_mon_inst_train + num_unmon_sites_train) * 0.95)
37 | train_steps = train_size // batch_size
38 | val_size = int((num_mon_sites * num_mon_inst_train + num_unmon_sites_train) * 0.05)
39 | val_steps = val_size // batch_size
40 | 
41 | train_time_start = time.time()
42 | 
43 | for epoch in range(epochs):
44 | model.train()
45 | for i, data in enumerate(data_generator.create_dataloader(config, 'training_data', mixture_num, batch_size)):
46 | inputs, labels = data
47 | optimizer.zero_grad()
48 | outputs = model(inputs)
49 | loss = criterion(outputs, labels)
50 | loss.backward()
51 | optimizer.step()
52 | if i >= train_steps:
53 | break
54 | 
55 | model.eval()
56 | with torch.no_grad():
57 | for i, data in enumerate(data_generator.create_dataloader(config, 'validation_data', mixture_num, batch_size, shuffle=False)):
58 | inputs, labels = data
59 | outputs = model(inputs)
60 | val_loss = criterion(outputs, labels)
61 | if i >= val_steps:
62 | break
63 | 
64 | train_time_end = time.time()
65 | print(f'Total training time: {train_time_end - train_time_start}')
66 | 
67 | 
68 | def predict(config, model, mixture_num, sub_model_name):
69 | """Compute and save final predictions on test set."""
70 | print(f'generating predictions for {config["model_name"]} {sub_model_name} model')
71 | 
72 | if config["model_name"] == 'var-cnn':
73 | model.load_state_dict(torch.load('model_weights.pth'))
74 | 
75 | test_size = num_mon_sites * num_mon_inst_test + num_unmon_sites_test
76 | test_steps = test_size // batch_size
77 | 
78 | test_time_start = time.time()
79 | 
80 | model.eval()
81 | predictions = []
82 | with torch.no_grad():
83 | for i, data in enumerate(data_generator.create_dataloader(config, 'test_data', mixture_num, batch_size, shuffle=False)):  # keep test order fixed so predictions line up with the stored labels
84 | inputs, _ = data
85 | outputs = model(inputs)
86 | predictions.append(outputs.cpu().numpy())
87 | if i >= test_steps:
88 | break
89 | 
90 | test_time_end = time.time()
91 | 
92 | if not os.path.exists(predictions_dir):
93 | os.makedirs(predictions_dir)
94 | 
95 | 
np.save(f'{predictions_dir}{sub_model_name}_model', np.concatenate(predictions, axis=0))
96 | 
97 | print(f'Total test time: {test_time_end - test_time_start}')
98 | 
99 | 
100 | with open('config.json') as config_file:
101 | config = json.load(config_file)
102 | if config['model_name'] == 'df':
103 | update_config(config, {'mixture': [['dir']], 'batch_size': 128})
104 | 
105 | num_mon_sites = config['num_mon_sites']
106 | num_mon_inst_test = config['num_mon_inst_test']
107 | num_mon_inst_train = config['num_mon_inst_train']
108 | num_mon_inst = num_mon_inst_test + num_mon_inst_train
109 | num_unmon_sites_test = config['num_unmon_sites_test']
110 | num_unmon_sites_train = config['num_unmon_sites_train']
111 | num_unmon_sites = num_unmon_sites_test + num_unmon_sites_train
112 | 
113 | data_dir = config['data_dir']
114 | model_name = config['model_name']
115 | mixture = config['mixture']
116 | batch_size = config['batch_size']
117 | predictions_dir = config['predictions_dir']
118 | epochs = config['var_cnn_max_epochs'] if model_name == 'var-cnn' else config['df_epochs']
119 | is_valid_mixture(mixture)
120 | 
121 | if not os.path.exists(f'{data_dir}{num_mon_sites}_{num_mon_inst}_{num_unmon_sites_train}_{num_unmon_sites_test}.h5'):
122 | preprocess_data.main(config)
123 | 
124 | for mixture_num, inner_comb in enumerate(mixture):
125 | model, optimizer, criterion, callbacks = var_cnn.get_model(config, mixture_num) if model_name == 'var-cnn' else df.get_model(config)
126 | 
127 | sub_model_name = '_'.join(inner_comb)
128 | train_and_val(config, model, optimizer, criterion, mixture_num, sub_model_name)
129 | predict(config, model, mixture_num, sub_model_name)
130 | 
131 | print('evaluating mixture on test data...')
132 | evaluate.main(config)
133 | 
--------------------------------------------------------------------------------
/var_cnn.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import time
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.optim as optim
8 | from torch.utils.data import DataLoader, Dataset
9 | from torch.nn import functional as F
10 | 
11 | # Path used to save and load the model weights
12 | MODEL_PATH = 'model_weights.pth'
13 | 
14 | # Custom dataset class
15 | class CustomDataset(Dataset):
16 | def __init__(self, config, mode):
17 | # Load data according to mode ('training_data', 'validation_data', 'test_data')
18 | # Implement the data-loading logic here
19 | pass
20 | 
21 | def __len__(self):
22 | # Return the size of the dataset
23 | pass
24 | 
25 | def __getitem__(self, idx):
26 | # Return a single data sample by index
27 | pass
28 | 
29 | # Basic residual block
30 | class BasicBlock(nn.Module):
31 | def __init__(self, in_channels, out_channels, stride=1):
32 | super(BasicBlock, self).__init__()
33 | self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
34 | self.bn1 = nn.BatchNorm1d(out_channels)
35 | self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1, bias=False)
36 | self.bn2 = nn.BatchNorm1d(out_channels)
37 | 
38 | self.shortcut = nn.Conv1d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False) if stride != 1 else None
39 | 
40 | def forward(self, x):
41 | out = F.relu(self.bn1(self.conv1(x)))
42 | out = self.bn2(self.conv2(out))
43 | if self.shortcut is not None:
44 | x = self.shortcut(x)
45 | out += x
46 | return F.relu(out)
47 | 
48 | # ResNet model definition
49 | class ResNet(nn.Module):
50 | def __init__(self, block, num_blocks, num_classes):
51 | super(ResNet, self).__init__()
52 | self.in_channels = 64
53 | self.conv1 = nn.Conv1d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
54 | self.bn1 = nn.BatchNorm1d(64)
55 | self.maxpool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
56 | 
57 | self.layer1 = self._make_layer(block, 64, num_blocks[0])
58 | self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
59 | self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
60 | self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
61 | 
62 | self.avgpool = nn.AdaptiveAvgPool1d(1)
63 | self.fc = nn.Linear(512, num_classes)
64 | 
65 | def _make_layer(self, block, out_channels, blocks, stride=1):
66 | layers = []
67 | layers.append(block(self.in_channels, out_channels, stride))
68 | self.in_channels = out_channels
69 | for _ in range(1, blocks):
70 | layers.append(block(out_channels, out_channels))
71 | return nn.Sequential(*layers)
72 | 
73 | def forward(self, x):
74 | x = F.relu(self.bn1(self.conv1(x)))
75 | x = self.maxpool(x)
76 | 
77 | x = self.layer1(x)
78 | x = self.layer2(x)
79 | x = self.layer3(x)
80 | x = self.layer4(x)
81 | 
82 | x = self.avgpool(x)
83 | x = x.view(x.size(0), -1)
84 | x = self.fc(x)
85 | return x
86 | 
87 | # Training and validation loop
88 | def train_and_val(model, criterion, optimizer, train_loader, val_loader, num_epochs):
89 | for epoch in range(num_epochs):
90 | model.train()
91 | for inputs, labels in train_loader:
92 | optimizer.zero_grad()
93 | outputs = model(inputs)
94 | loss = criterion(outputs, labels)
95 | loss.backward()
96 | optimizer.step()
97 | 
98 | # Validation phase
99 | model.eval()
100 | with torch.no_grad():
101 | val_loss = 0
102 | correct = 0
103 | total = 0
104 | for inputs, labels in val_loader:
105 | outputs = model(inputs)
106 | val_loss += criterion(outputs, labels).item()
107 | _, predicted = torch.max(outputs.data, 1)
108 | total += labels.size(0)
109 | correct += (predicted == labels).sum().item()
110 | 
111 | print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, Validation Loss: {val_loss:.4f}, Accuracy: {100 * correct / total:.2f}%')
112 | 
113 | # Prediction function
114 | def predict(model, test_loader):
115 | model.eval()
116 | predictions = []
117 | with torch.no_grad():
118 | for inputs in test_loader:
119 | outputs = model(inputs)
120 | _, predicted = torch.max(outputs.data, 1)
121 | predictions.append(predicted.numpy())
122 | return np.concatenate(predictions)
123 | 
124 | # Main program
125 | def main():
126 | # Read the configuration
127 | with open('config.json') as config_file:
128 | config = json.load(config_file)
129 | 
130 | # Derive model parameters from the config
131 | num_classes = config['num_mon_sites'] if config['num_unmon_sites_train'] == 0 else config['num_mon_sites'] + 1
132 | num_epochs = config['var_cnn_max_epochs'] if config['model_name'] == 'var-cnn' else config['df_epochs']
133 | batch_size = config['batch_size']
134 | 
135 | # Prepare the datasets
136 | train_dataset = CustomDataset(config, mode='training_data')
137 | val_dataset = CustomDataset(config, mode='validation_data')
138 | train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
139 | val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
140 | 
141 | # Initialize the model, loss function, and optimizer
142 | model = ResNet(BasicBlock, [2, 2, 2, 2], num_classes)
143 | criterion = nn.CrossEntropyLoss()
144 | optimizer = optim.Adam(model.parameters(), lr=0.001)
145 | 
146 | # Train and validate
147 | train_and_val(model, criterion, optimizer, train_loader, val_loader, num_epochs)
148 | 
149 | # Predict on the test set
150 | test_dataset = CustomDataset(config, mode='test_data')
151 | test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
152 | predictions =
predict(model, test_loader) 153 | 154 | # 保存模型 155 | torch.save(model.state_dict(), MODEL_PATH) 156 | 157 | if __name__ == "__main__": 158 | main() 159 | -------------------------------------------------------------------------------- /wang_to_varcnn.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import os 6 | import numpy as np 7 | import random 8 | import json 9 | import torch 10 | 11 | random.seed(4286794567481) 12 | 13 | def process_trace(args): 14 | """Extract dir seq, time seq, metadata, and label for particular trace.""" 15 | dir_name = args[0] 16 | trace_path = args[1] 17 | 18 | with open(os.path.join(dir_name, trace_path), 'r') as f: 19 | lines = f.readlines() 20 | 21 | dir_seq = torch.zeros(5000, dtype=torch.int8) 22 | time_seq = torch.zeros(5000, dtype=torch.float32) 23 | label = 0 if '-' not in trace_path else int(trace_path.split('-')[0]) + 1 24 | total_time = float(lines[-1].split('\t')[0]) 25 | total_incoming = 0 26 | total_outgoing = 0 27 | 28 | for packet_num, line in enumerate(lines): 29 | line = line.split('\t') 30 | curr_time = float(line[0]) 31 | curr_dir = torch.sign(torch.tensor(int(line[1]))).item() 32 | 33 | if packet_num < 5000: 34 | dir_seq[packet_num] = curr_dir 35 | time_seq[packet_num] = curr_time 36 | 37 | if curr_dir == 1: 38 | total_outgoing += 1 39 | elif curr_dir == -1: 40 | total_incoming += 1 41 | 42 | total_packets = total_incoming + total_outgoing 43 | if total_packets == 0: 44 | metadata = torch.zeros(7, dtype=torch.float32) 45 | else: 46 | metadata = torch.tensor([total_packets, total_incoming, total_outgoing, 47 | total_incoming / total_packets, 48 | total_outgoing / total_packets, 49 | total_time, total_time / total_packets], 50 | dtype=torch.float32) 51 | return dir_seq, time_seq, metadata, label 52 | 53 | def main(config): 54 | 55 | num_mon_sites = config['num_mon_sites'] 56 | num_mon_inst_test = config['num_mon_inst_test'] 57 | num_mon_inst_train = config['num_mon_inst_train'] 58 | num_mon_inst = num_mon_inst_test + num_mon_inst_train 59 | num_unmon_sites_test = config['num_unmon_sites_test'] 60 | num_unmon_sites_train = config['num_unmon_sites_train'] 61 | num_unmon_sites = num_unmon_sites_test + num_unmon_sites_train 62 | data_dir = config['data_dir'] 63 | 64 | arg_list = [] 65 | for trace_path in os.listdir(data_dir + 'batch_wang'): 66 | arg_list.append([data_dir + 'batch_wang', trace_path]) 67 | 68 | # set up the output 69 | mon_idx = 0 70 | dir_seq_mon = [None] * (num_mon_sites * num_mon_inst) 71 | time_seq_mon = [None] * (num_mon_sites * num_mon_inst) 72 | metadata_mon = [None] * (num_mon_sites * num_mon_inst) 73 | labels_mon = [None] * (num_mon_sites * num_mon_inst) 74 | 75 | unmon_idx = 0 76 | dir_seq_unmon = [None] * num_unmon_sites 77 | time_seq_unmon = [None] * num_unmon_sites 78 | metadata_unmon = [None] * num_unmon_sites 79 | labels_unmon = [None] * num_unmon_sites 80 | 81 | print('size of total list: %d' % len(arg_list)) 82 | for i in range(len(arg_list)): 83 | dir_seq, time_seq, metadata, label = process_trace(arg_list[i]) 84 | if i % 5000 == 0: 85 | print("processed", i) 86 | 87 | if label == 0: # unmon site 88 | dir_seq_unmon[unmon_idx] = dir_seq 89 | time_seq_unmon[unmon_idx] = time_seq 90 | metadata_unmon[unmon_idx] = metadata 91 | labels_unmon[unmon_idx] = label 92 | unmon_idx += 1 93 | else: 94 | dir_seq_mon[mon_idx] = dir_seq 95 | time_seq_mon[mon_idx] = 
time_seq 96 | metadata_mon[mon_idx] = metadata 97 | labels_mon[mon_idx] = label 98 | mon_idx += 1 99 | 100 | # save monitored traces 101 | dir_seq_mon = torch.stack(dir_seq_mon).cpu().numpy() 102 | time_seq_mon = torch.stack(time_seq_mon).cpu().numpy() 103 | metadata_mon = torch.stack(metadata_mon).cpu().numpy() 104 | 105 | print('number of monitored traces: %d' % len(labels_mon)) 106 | np.savez_compressed(data_dir + 'all_closed_world.npz', dir_seq=dir_seq_mon, 107 | time_seq=time_seq_mon, metadata=metadata_mon, 108 | labels=labels_mon) 109 | 110 | # save unmonitored traces 111 | dir_seq_unmon = torch.stack(dir_seq_unmon).cpu().numpy() 112 | time_seq_unmon = torch.stack(time_seq_unmon).cpu().numpy() 113 | metadata_unmon = torch.stack(metadata_unmon).cpu().numpy() 114 | 115 | print('number of unmonitored traces: %d' % len(labels_unmon)) 116 | np.savez_compressed(data_dir + 'all_open_world.npz', dir_seq=dir_seq_unmon, 117 | time_seq=time_seq_unmon, metadata=metadata_unmon, 118 | labels=labels_unmon) 119 | 120 | if __name__ == '__main__': 121 | with open('config.json') as config_file: 122 | config = json.load(config_file) 123 | 124 | main(config) 125 | --------------------------------------------------------------------------------
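As a quick sanity check of ```wang_to_varcnn.py```, the sketch below writes one synthetic monitored trace in the tab-separated ```time<TAB>direction``` format that ```process_trace``` parses (the file name ```0-0``` marks site 0, instance 0, and ```data/batch_wang``` is where ```main``` looks for traces; swap the separator if your captures use spaces instead) and prints the shapes of the returned tensors:

```python
import os
from wang_to_varcnn import process_trace

os.makedirs('data/batch_wang', exist_ok=True)
with open('data/batch_wang/0-0', 'w') as f:
    for i in range(10):
        # relative time in seconds, then direction (+1 outgoing, -1 incoming)
        f.write(f'{0.01 * i}\t{1 if i % 2 == 0 else -1}\n')

dir_seq, time_seq, metadata, label = process_trace(['data/batch_wang', '0-0'])
print(dir_seq.shape, time_seq.shape, metadata.shape, label)
# Expected: torch.Size([5000]) torch.Size([5000]) torch.Size([7]) 1
```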