├── README.md
├── config.json
├── data_generator.py
├── df.py
├── evaluate.py
├── preprocess_data.py
├── requirements.txt
├── run_model.py
├── var_cnn.py
└── wang_to_varcnn.py
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # Var-CNN in PyTorch
3 | 
4 | This repository contains the PyTorch implementation of the following paper:
5 | 
6 | [Var-CNN: A Data-Efficient Website Fingerprinting Attack Based on Deep
7 | Learning](https://arxiv.org/abs/1802.10215) (PETS 2019,
8 | [link to presentation](https://docs.google.com/presentation/d/1Pry0nfHXvDhWYbHjf2QK75Rquc51s9rxK0g2XSNVeIk/edit?usp=sharing))
9 | 
10 | **Authors**: Sanjit Bhat, David Lu, Albert Kwon, and Srini Devadas
11 | 
12 | ## Overview
13 | This repository is a PyTorch implementation of Var-CNN, a data-efficient approach for performing website fingerprinting attacks using deep learning techniques. The original implementation was based on TensorFlow and Keras. This project aims to provide the same functionality and performance in a more flexible and extensible PyTorch framework.
14 | 
15 | ## Dependencies
16 | 1. Use a machine with an NVIDIA GPU if possible; training runs considerably faster on a GPU than on a CPU.
17 | 2. Make sure you have the PyTorch deep learning framework installed. Follow the [installation guide](https://pytorch.org/get-started/locally/) for your specific environment.
18 | 3. To install all required Python packages, simply issue the following command:
19 | ```bash
20 | pip install -r requirements.txt
21 | ```
22 | 
23 | ## Control Flow
24 | The first step in running our model is to place a sufficient number
25 | of raw packet sequences in the ```data_dir``` folder. **Each** monitored
26 | website needs to have **at least** ```num_mon_inst_train``` + ```num_mon_inst_test```
27 | instances, and there needs to be **at least**
28 | ```num_unmon_sites_train``` + ```num_unmon_sites_test``` unmonitored sites.
29 | 
30 | If you use the Wang et al. data format (i.e., each line representing a
31 | new packet with the relative time and direction separated by a space),
32 | then we have that supported in ```wang_to_varcnn.py```. Otherwise, you will need
33 | to modify ```wang_to_varcnn.py```, or you can write your own glue code to move
34 | to the **Wang et al. format**.
35 | 
36 | After setting up the data and specifying the parameters in ```config.json```,
37 | you can run all parts of our code just by issuing a ```python run_model.py```
38 | command. After that, our programs will be called in the following sequence:
39 | 
40 | 1. ```wang_to_varcnn.py```: This parses the ```data_dir``` folder; extracts direction,
41 | time, metadata, and labels; and stores all the monitored and unmonitored
42 | traces in ```all_closed_world.npz``` and ```all_open_world.npz```,
43 | respectively, in the ```data_dir``` folder.
44 | 2. ```preprocess_data.py```: This uses the data in
45 | ```all_closed_world.npz``` to pick a random
46 | ```num_mon_inst_train``` and ```num_mon_inst_test``` instances of each
47 | of the ```num_mon_sites``` monitored sites for the training and test sets, respectively.
48 | It also performs a similar random split for the unmonitored sites (using the
49 | ```all_open_world.npz``` file) and preprocesses
50 | all of these traces to scale the metadata, change to inter-packet timing, etc.
51 | Finally, it saves the direction data, time data, metadata, and labels
52 | to ```.h5``` files to conserve RAM during the training process. The resulting file layout is sketched just below.
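For reference, with the default ```config.json``` the preprocessed file is named ```data/100_100_9000_1000.h5``` (the pattern is ```<num_mon_sites>_<num_mon_inst>_<num_unmon_sites_train>_<num_unmon_sites_test>.h5```). The following minimal sketch, which assumes preprocessing has already finished, shows how to inspect its layout with ```h5py```:

```python
import h5py

# Each split holds direction sequences, timing sequences, metadata, and one-hot labels.
with h5py.File('data/100_100_9000_1000.h5', 'r') as f:
    for split in ('training_data', 'validation_data', 'test_data'):
        for name in ('dir_seq', 'time_seq', 'metadata', 'labels'):
            print(split, name, f[f'{split}/{name}'].shape)
```

53 | 3.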
```run_model.py```: This is the main file that first calls the prior two files. 54 | Next, it loads the model architectures from either ```var_cnn.py``` or 55 | ```df.py```, trains the models, saves their predictions, and 56 | calls ```evaluate.py``` for evaluation. 57 | 4. During training, ```data_generator.py``` generates new batches of data in 58 | parallel. Since large datasets can contain hundreds of thousands 59 | of traces, ```data_generator.py``` uses ```.h5``` files to 60 | access the traces for one batch without loading the entire dataset into memory. 61 | 5. ```evaluate.py```: This first calculates metrics for each of the 62 | in-training combinations specified in ```mixture```. Then, 63 | it averages each of their predictions together and reports metrics for the 64 | overall out-of-training ensemble. It saves all metrics to the ```job_result.json``` 65 | file. 66 | 67 | 68 | ## Parameters 69 | ```config.json``` provides the configuration settings to all the 70 | other programs. We describe its parameters in further detail below: 71 | 72 | 1. ```data_dir```: This relative path provides the location of the "raw" 73 | packet sequences (e.g., the "0", "1", "0-0", "0-1" files in Wang et al.'s 74 | dataset). Also, it later stores the ```all_closed_world.npz``` and 75 | ```all_open_world.npz``` files generated by ```wang_to_varcnn.py``` and the 76 | ```.h5``` data files generated by ```preprocess_data.py```. 77 | 2. ```predictions_dir```: After training the model, ```run_model.py``` 78 | generates predictions for the test set and stores them in this directory. 79 | ```evaluate.py``` later uses them to calculate test metrics. 80 | 3. ```num_mon_sites```: The number of monitored websites. 81 | Each of the ```num_mon_sites``` sites in ```data_dir``` must have at least 82 | ```num_mon_inst_train``` + ```num_mon_inst_test``` instances. 83 | 4. ```num_mon_inst_train```: The number of monitored instances used for training. 84 | 5. ```num_mon_inst_test```: The number of monitored instances used for testing. 85 | 6. ```num_unmon_sites_train```: The number of unmonitored sites used for training. 86 | Each site has one instance. 87 | 7. ```num_unmon_sites_test```: The number of unmonitored sites used for testing. 88 | Each site has one instance, and these unmonitored websites are different 89 | from those used for training. 90 | 8. ```model_name```: The model name. Either "var-cnn" or "df". 91 | 9. ```batch_size```: The batch size used during training. For Var-CNN, 92 | we found that a batch size of 50 works well. The recommended batch size 93 | for DF is 128. 94 | 10. ```mixture```: The mixture of ensembles used during training and 95 | evaluation. Each of the inner arrays represent models combined in-training. 96 | ```run_model``` will save the predictions for every such in-training combination. 97 | Subsequently, ```evaluate_ensemble``` will report metrics for these individual 98 | models as well as the overall out-of-training ensemble (i.e., the average 99 | of the individual predictions). Note: this functionality only works 100 | with Var-CNN (in fact, deep fingerprinting will automatically default 101 | to using ```[["dir"]]```). Also, do not use two in-training combinations 102 | with the same components as their prediction files will be overwritten. 103 | Default: ```[["dir", "metadata"], ["time", "metadata"]]``` for Var-CNN. 104 | 11. ```seq_length```: The length of the input sequence fed into the CNN 105 | (default: 5000). 
106 | We use this parameter right from the start when scraping the raw data. 107 | 12. ```df_epochs```: The number of epochs used to train DF (default: 30). 108 | 13. ```var_cnn_max_epochs```: The maximum number of epochs used to train 109 | Var-CNN (default: 150). The ```EarlyStopping``` callback often cuts off training 110 | much sooner -- whenever validation accuracy fails to increase. 111 | 14. ```var_cnn_base_patience```: The "patience" (i.e., number of epochs 112 | of no validation accuracy improvement) until we decrease the learning 113 | rate of Var-CNN and stop training (default: 5). We implement this functionality 114 | in the ```ReduceLROnPlateau``` and ```EarlyStopping``` callbacks 115 | inside ```var_cnn.py```. 116 | 15. ```dir_dilations```: Whether to use dilations with the 117 | direction ResNet (default: true). 118 | 16. ```time_dilations```: Whether to use dilations with the 119 | time ResNet (default: true). 120 | 17. ```inter_time```: Whether to use the inter-packet time 121 | (i.e., time between two packets) or the relative time (i.e., 122 | time from the first packet) for timing data (default: true, i.e., we do use 123 | inter-packet time). 124 | 18. ```scale_metadata```. Whether to scale metadata to zero mean 125 | and unit variance (default: true). 126 | 127 | 128 | ## Citation 129 | If you find Var-CNN useful in your research, please consider citing: 130 | 131 | @article{bhat19, 132 | title={{Var-CNN: A Data-Efficient Website Fingerprinting Attack Based on Deep Learning}}, 133 | author={Bhat, Sanjit and Lu, David and Kwon, Albert and Devadas, Srinivas}, 134 | journal={Proceedings on Privacy Enhancing Technologies}, 135 | volume={4}, 136 | pages={292--310}, 137 | year={2019} 138 | } 139 | 140 | -------------------------------------------------------------------------------- /config.json: -------------------------------------------------------------------------------- 1 | { 2 | "data_dir": "data/", 3 | "predictions_dir": "predictions/", 4 | 5 | "_comment1": "====== World Size Params ======", 6 | "num_mon_sites": 100, 7 | "num_mon_inst_train": 90, 8 | "num_mon_inst_test": 10, 9 | "num_unmon_sites_train": 9000, 10 | "num_unmon_sites_test": 1000, 11 | 12 | "_comment2": "====== Attack Params (use batch size 50 for Var-CNN) ======", 13 | "model_name": "var-cnn", 14 | "batch_size": 50, 15 | 16 | "_comment3": "====== Var-CNN Architecture Params ======", 17 | "mixture": [["dir", "metadata"], ["time", "metadata"]], 18 | 19 | "_comment4": "====== Other Params =======", 20 | "seq_length": 5000, 21 | "df_epochs": 30, 22 | "var_cnn_max_epochs": 150, 23 | "var_cnn_base_patience": 5, 24 | "dir_dilations": true, 25 | "time_dilations": true, 26 | "inter_time": true, 27 | "scale_metadata": true 28 | } -------------------------------------------------------------------------------- /data_generator.py: -------------------------------------------------------------------------------- 1 | import h5py 2 | import torch 3 | from torch.utils.data import Dataset, DataLoader 4 | 5 | 6 | class H5Dataset(Dataset): 7 | def __init__(self, config, data_type, mixture_num): 8 | """ 9 | Initializes the dataset by loading the HDF5 file and storing the necessary data. 10 | 11 | Args: 12 | config (dict): Configuration dictionary. 13 | data_type (str): Either 'training_data', 'validation_data', or 'test_data'. 14 | mixture_num (int): Index of the mixture to use. 
15 | """ 16 | self.num_mon_sites = config['num_mon_sites'] 17 | self.num_mon_inst_train = config['num_mon_inst_train'] 18 | self.num_mon_inst_test = config['num_mon_inst_test'] 19 | self.num_mon_inst = self.num_mon_inst_train + self.num_mon_inst_test 20 | self.num_unmon_sites_train = config['num_unmon_sites_train'] 21 | self.num_unmon_sites_test = config['num_unmon_sites_test'] 22 | 23 | self.data_dir = config['data_dir'] 24 | self.batch_size = config['batch_size'] 25 | self.mixture = config['mixture'] 26 | self.use_dir = 'dir' in self.mixture[mixture_num] 27 | self.use_time = 'time' in self.mixture[mixture_num] 28 | self.use_metadata = 'metadata' in self.mixture[mixture_num] 29 | 30 | # Open the HDF5 file and load the necessary datasets 31 | self.filepath = f"{self.data_dir}{self.num_mon_sites}_{self.num_mon_inst}_{self.num_unmon_sites_train}_{self.num_unmon_sites_test}.h5" 32 | self.data_type = data_type 33 | 34 | with h5py.File(self.filepath, 'r') as f: 35 | self.dir_seq = f[f'{self.data_type}/dir_seq'] if self.use_dir else None 36 | self.time_seq = f[f'{self.data_type}/time_seq'] if self.use_time else None 37 | self.metadata = f[f'{self.data_type}/metadata'] if self.use_metadata else None 38 | self.labels = f[f'{self.data_type}/labels'] 39 | 40 | def __len__(self): 41 | return len(self.labels) 42 | 43 | def __getitem__(self, idx): 44 | """ 45 | Fetches a single data item. 46 | 47 | Args: 48 | idx (int): Index of the data item. 49 | 50 | Returns: 51 | tuple: Input dictionary and output dictionary (labels). 52 | """ 53 | batch_data = {} 54 | 55 | if self.use_dir: 56 | batch_data['dir_input'] = self.dir_seq[idx] 57 | if self.use_time: 58 | batch_data['time_input'] = self.time_seq[idx] 59 | if self.use_metadata: 60 | batch_data['metadata_input'] = self.metadata[idx] 61 | 62 | labels = self.labels[idx] 63 | return batch_data, labels 64 | 65 | 66 | def create_dataloader(config, data_type, mixture_num, batch_size, shuffle=True, num_workers=4): 67 | """ 68 | Creates a PyTorch DataLoader. 69 | 70 | Args: 71 | config (dict): Configuration dictionary. 72 | data_type (str): Either 'training_data', 'validation_data', or 'test_data'. 73 | mixture_num (int): Index of the mixture to use. 74 | batch_size (int): Batch size. 75 | shuffle (bool): Whether to shuffle the data. 76 | num_workers (int): Number of worker threads to use for data loading. 77 | 78 | Returns: 79 | DataLoader: PyTorch DataLoader object. 
80 | """ 81 | dataset = H5Dataset(config, data_type, mixture_num) 82 | dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers) 83 | return dataloader 84 | -------------------------------------------------------------------------------- /df.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.optim as optim 4 | import torch.nn.functional as F 5 | 6 | class DeepFingerprintingModel(nn.Module): 7 | def __init__(self, config): 8 | super(DeepFingerprintingModel, self).__init__() 9 | 10 | num_mon_sites = config['num_mon_sites'] 11 | num_unmon_sites_test = config['num_unmon_sites_test'] 12 | num_unmon_sites_train = config['num_unmon_sites_train'] 13 | num_unmon_sites = num_unmon_sites_test + num_unmon_sites_train 14 | seq_length = config['seq_length'] 15 | 16 | # Block 1 17 | self.conv1 = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=8, stride=1, padding=4) 18 | self.bn1 = nn.BatchNorm1d(32) 19 | self.conv2 = nn.Conv1d(32, 32, 8, stride=1, padding=4) 20 | self.bn2 = nn.BatchNorm1d(32) 21 | self.pool1 = nn.MaxPool1d(kernel_size=8, stride=4) 22 | self.drop1 = nn.Dropout(0.1) 23 | 24 | # Block 2 25 | self.conv3 = nn.Conv1d(32, 64, 8, stride=1, padding=4) 26 | self.bn3 = nn.BatchNorm1d(64) 27 | self.conv4 = nn.Conv1d(64, 64, 8, stride=1, padding=4) 28 | self.bn4 = nn.BatchNorm1d(64) 29 | self.pool2 = nn.MaxPool1d(kernel_size=8, stride=4) 30 | self.drop2 = nn.Dropout(0.1) 31 | 32 | # Block 3 33 | self.conv5 = nn.Conv1d(64, 128, 8, stride=1, padding=4) 34 | self.bn5 = nn.BatchNorm1d(128) 35 | self.conv6 = nn.Conv1d(128, 128, 8, stride=1, padding=4) 36 | self.bn6 = nn.BatchNorm1d(128) 37 | self.pool3 = nn.MaxPool1d(kernel_size=8, stride=4) 38 | self.drop3 = nn.Dropout(0.1) 39 | 40 | # Block 4 41 | self.conv7 = nn.Conv1d(128, 256, 8, stride=1, padding=4) 42 | self.bn7 = nn.BatchNorm1d(256) 43 | self.conv8 = nn.Conv1d(256, 256, 8, stride=1, padding=4) 44 | self.bn8 = nn.BatchNorm1d(256) 45 | self.pool4 = nn.MaxPool1d(kernel_size=8, stride=4) 46 | self.drop4 = nn.Dropout(0.1) 47 | 48 | # Fully connected layers 49 | self.fc1 = nn.Linear(256 * (seq_length // (4 ** 4)), 512) # Adjust size according to input sequence length 50 | self.bn_fc1 = nn.BatchNorm1d(512) 51 | self.drop_fc1 = nn.Dropout(0.7) 52 | self.fc2 = nn.Linear(512, 512) 53 | self.bn_fc2 = nn.BatchNorm1d(512) 54 | self.drop_fc2 = nn.Dropout(0.5) 55 | 56 | output_classes = num_mon_sites if num_unmon_sites == 0 else num_mon_sites + 1 57 | self.fc3 = nn.Linear(512, output_classes) 58 | 59 | def forward(self, x): 60 | # Block 1 61 | x = F.elu(self.bn1(self.conv1(x))) 62 | x = F.elu(self.bn2(self.conv2(x))) 63 | x = self.pool1(x) 64 | x = self.drop1(x) 65 | 66 | # Block 2 67 | x = F.relu(self.bn3(self.conv3(x))) 68 | x = F.relu(self.bn4(self.conv4(x))) 69 | x = self.pool2(x) 70 | x = self.drop2(x) 71 | 72 | # Block 3 73 | x = F.relu(self.bn5(self.conv5(x))) 74 | x = F.relu(self.bn6(self.conv6(x))) 75 | x = self.pool3(x) 76 | x = self.drop3(x) 77 | 78 | # Block 4 79 | x = F.relu(self.bn7(self.conv7(x))) 80 | x = F.relu(self.bn8(self.conv8(x))) 81 | x = self.pool4(x) 82 | x = self.drop4(x) 83 | 84 | # Flatten 85 | x = x.view(x.size(0), -1) 86 | 87 | # Fully connected layers 88 | x = F.relu(self.bn_fc1(self.fc1(x))) 89 | x = self.drop_fc1(x) 90 | x = F.relu(self.bn_fc2(self.fc2(x))) 91 | x = self.drop_fc2(x) 92 | 93 | # Output layer 94 | x = F.softmax(self.fc3(x), dim=1) 95 | return x 96 | 97 | # Configuration 
example 98 | config = { 99 | 'num_mon_sites': 100, 100 | 'num_mon_inst_test': 50, 101 | 'num_mon_inst_train': 100, 102 | 'num_unmon_sites_test': 50, 103 | 'num_unmon_sites_train': 100, 104 | 'seq_length': 5000 105 | } 106 | 107 | # Instantiate model 108 | model = DeepFingerprintingModel(config) 109 | 110 | # Define optimizer and loss function 111 | optimizer = optim.Adamax(model.parameters(), lr=0.002) 112 | criterion = nn.CrossEntropyLoss() 113 | 114 | # Example input: batch of size 16, sequence length from config['seq_length'], 1 channel 115 | input_data = torch.randn(16, 1, config['seq_length']) 116 | output = model(input_data) 117 | 118 | # Print model output 119 | print(output) 120 | -------------------------------------------------------------------------------- /evaluate.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | import json 4 | import h5py 5 | 6 | def find_accuracy(model_predictions, conf_thresh, actual_labels=None, 7 | num_mon_sites=None, num_mon_inst_test=None, 8 | num_unmon_sites_test=None, num_unmon_sites=None): 9 | """Compute TPR and FPR based on softmax output predictions.""" 10 | 11 | # Calculates output classes (classes with the highest probability) 12 | actual_labels = np.argmax(actual_labels, axis=1) 13 | 14 | # Changes predictions according to confidence threshold 15 | thresh_model_labels = np.zeros(len(model_predictions)) 16 | for inst_num, softmax in enumerate(model_predictions): 17 | predicted_class = np.argmax(softmax) 18 | if predicted_class < num_mon_sites and \ 19 | softmax[predicted_class] < conf_thresh: 20 | thresh_model_labels[inst_num] = num_mon_sites 21 | else: 22 | thresh_model_labels[inst_num] = predicted_class 23 | 24 | # Computes TPR and FPR 25 | two_class_true_pos = 0 # Mon correctly classified as any mon site 26 | multi_class_true_pos = 0 # Mon correctly classified as specific mon site 27 | false_pos = 0 # Unmon incorrectly classified as mon site 28 | 29 | for inst_num, inst_label in enumerate(actual_labels): 30 | if inst_label == num_mon_sites: # Supposed to be unmon site 31 | if thresh_model_labels[inst_num] < num_mon_sites: 32 | false_pos += 1 33 | else: # Supposed to be mon site 34 | if thresh_model_labels[inst_num] < num_mon_sites: 35 | two_class_true_pos += 1 36 | if thresh_model_labels[inst_num] == inst_label: 37 | multi_class_true_pos += 1 38 | 39 | two_class_tpr = two_class_true_pos / \ 40 | (num_mon_sites * num_mon_inst_test) * 100 41 | two_class_tpr = '%.2f' % two_class_tpr + '%' 42 | multi_class_tpr = multi_class_true_pos / \ 43 | (num_mon_sites * num_mon_inst_test) * 100 44 | multi_class_tpr = '%.2f' % multi_class_tpr + '%' 45 | 46 | if num_unmon_sites == 0: # closed-world 47 | fpr = '0.00%' 48 | else: 49 | fpr = false_pos / num_unmon_sites_test * 100 50 | fpr = '%.2f' % fpr + '%' 51 | 52 | return two_class_tpr, multi_class_tpr, fpr 53 | 54 | 55 | def log_cw(results, sub_model_name, softmax, **parameters): 56 | print('%s model:' % sub_model_name) 57 | two_class_tpr, multi_class_tpr, fpr = find_accuracy( 58 | softmax, 0., **parameters) 59 | print('\t accuracy: %s' % multi_class_tpr) 60 | results['%s_acc' % sub_model_name] = multi_class_tpr 61 | 62 | 63 | def log_ow(results, sub_model_name, softmax, **parameters): 64 | print('%s model:' % sub_model_name) 65 | for conf_thresh in np.arange(0, 1.01, 0.1): 66 | two_class_tpr, multi_class_tpr, fpr = find_accuracy( 67 | softmax, conf_thresh, **parameters) 68 | print('\t conf: %f' % conf_thresh) 69 | print('\t \t 
two-class TPR: %s' % two_class_tpr) 70 | print('\t \t multi-class TPR: %s' % multi_class_tpr) 71 | print('\t \t FPR: %s' % fpr) 72 | 73 | prefix = '%s_%f' % (sub_model_name, conf_thresh) 74 | results['%s_two_TPR' % prefix] = two_class_tpr 75 | results['%s_multi_TPR' % prefix] = multi_class_tpr 76 | results['%s_FPR' % prefix] = fpr 77 | 78 | 79 | def log_setting(setting, predictions, results, **parameters): 80 | print(setting + '-world results') 81 | for sub_model_name, softmax in predictions.items(): 82 | if setting == 'closed': 83 | log_cw(results, sub_model_name, softmax, **parameters) 84 | elif setting == 'open': 85 | log_ow(results, sub_model_name, softmax, **parameters) 86 | 87 | 88 | def main(config): 89 | num_mon_sites = config['num_mon_sites'] 90 | num_mon_inst_test = config['num_mon_inst_test'] 91 | num_mon_inst_train = config['num_mon_inst_train'] 92 | num_mon_inst = num_mon_inst_test + num_mon_inst_train 93 | num_unmon_sites_test = config['num_unmon_sites_test'] 94 | num_unmon_sites_train = config['num_unmon_sites_train'] 95 | num_unmon_sites = num_unmon_sites_test + num_unmon_sites_train 96 | 97 | data_dir = config['data_dir'] 98 | predictions_dir = config['predictions_dir'] 99 | mixture = config['mixture'] 100 | 101 | with h5py.File('%s%d_%d_%d_%d.h5' % (data_dir, num_mon_sites, 102 | num_mon_inst, num_unmon_sites_train, 103 | num_unmon_sites_test), 'r') as f: 104 | test_labels = f['test_data/labels'][:] 105 | 106 | # Aggregates predictions from mixture models 107 | predictions = {} 108 | ensemble_softmax = None 109 | for inner_comb in mixture: 110 | sub_model_name = '_'.join(inner_comb) 111 | softmax = np.load('%s%s_model.npy' % (predictions_dir, sub_model_name)) 112 | if ensemble_softmax is None: 113 | ensemble_softmax = np.zeros_like(softmax) 114 | predictions[sub_model_name] = softmax 115 | 116 | parameters = {'actual_labels': test_labels, 117 | 'num_mon_sites': num_mon_sites, 118 | 'num_mon_inst_test': num_mon_inst_test, 119 | 'num_unmon_sites_test': num_unmon_sites_test, 120 | 'num_unmon_sites': num_unmon_sites} 121 | 122 | # Performs simple average to get ensemble predictions 123 | for softmax in predictions.values(): 124 | ensemble_softmax += softmax 125 | ensemble_softmax /= len(predictions) 126 | if len(predictions) > 1: 127 | predictions['ensemble'] = ensemble_softmax 128 | 129 | results = {} 130 | if num_unmon_sites == 0: # Closed-world 131 | log_setting('closed', predictions, results, **parameters) 132 | else: # Open-world 133 | log_setting('open', predictions, results, **parameters) 134 | 135 | with open('job_result.json', 'w') as f: 136 | json.dump(results, f, sort_keys=True, indent=4) 137 | 138 | 139 | if __name__ == '__main__': 140 | with open('config.json') as config_file: 141 | config = json.load(config_file) 142 | 143 | main(config) 144 | -------------------------------------------------------------------------------- /preprocess_data.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import, division, print_function 2 | 3 | import numpy as np 4 | import time 5 | import random 6 | import h5py 7 | import json 8 | import os 9 | from tqdm import tqdm 10 | import torch 11 | from torch.utils.data import Dataset, DataLoader 12 | import wang_to_varcnn 13 | from sklearn.preprocessing import StandardScaler 14 | from torch.nn.functional import one_hot 15 | 16 | class CustomDataset(Dataset): 17 | def __init__(self, dir_seq, time_seq, metadata, labels): 18 | self.dir_seq = torch.tensor(dir_seq, 
dtype=torch.float32) 19 | self.time_seq = torch.tensor(time_seq, dtype=torch.float32) 20 | self.metadata = torch.tensor(metadata, dtype=torch.float32) 21 | self.labels = torch.tensor(labels, dtype=torch.long) 22 | 23 | def __len__(self): 24 | return len(self.dir_seq) 25 | 26 | def __getitem__(self, idx): 27 | return (self.dir_seq[idx], self.time_seq[idx], self.metadata[idx], self.labels[idx]) 28 | 29 | def main(config): 30 | """Preprocesses data and creates .h5 data files.""" 31 | 32 | num_mon_sites = config['num_mon_sites'] 33 | num_mon_inst_test = config['num_mon_inst_test'] 34 | num_mon_inst_train = config['num_mon_inst_train'] 35 | num_mon_inst = num_mon_inst_test + num_mon_inst_train 36 | num_unmon_sites_test = config['num_unmon_sites_test'] 37 | num_unmon_sites_train = config['num_unmon_sites_train'] 38 | num_unmon_sites = num_unmon_sites_test + num_unmon_sites_train 39 | 40 | inter_time = config['inter_time'] 41 | scale_metadata = config['scale_metadata'] 42 | data_dir = config['data_dir'] 43 | mon_data_loc = data_dir + 'all_closed_world.npz' 44 | unmon_data_loc = data_dir + 'all_open_world.npz' 45 | if not os.path.exists(mon_data_loc) or not os.path.exists(unmon_data_loc): 46 | wang_to_varcnn.main(config) 47 | 48 | print(f'Starting {num_mon_sites}_{num_mon_inst}_{num_unmon_sites_train}_{num_unmon_sites_test}.h5') 49 | start = time.time() 50 | 51 | train_seq_and_labels = [] 52 | test_seq_and_labels = [] 53 | 54 | print('reading monitored data') 55 | mon_dataset = np.load(mon_data_loc) 56 | mon_dir_seq = mon_dataset['dir_seq'] 57 | mon_time_seq = mon_dataset['time_seq'] 58 | mon_metadata = mon_dataset['metadata'] 59 | mon_labels = mon_dataset['labels'] 60 | 61 | mon_site_data = {} 62 | mon_site_labels = {} 63 | print('getting enough monitored websites') 64 | for dir_seq, time_seq, metadata, site_name in tqdm(zip(mon_dir_seq, mon_time_seq, mon_metadata, mon_labels)): 65 | if site_name not in mon_site_data: 66 | if len(mon_site_data) >= num_mon_sites: 67 | continue 68 | else: 69 | mon_site_data[site_name] = [] 70 | mon_site_labels[site_name] = len(mon_site_labels) 71 | 72 | mon_site_data[site_name].append([dir_seq, time_seq, metadata, mon_site_labels[site_name]]) 73 | 74 | print('randomly choosing instances for training and test sets') 75 | assert len(mon_site_data) == num_mon_sites 76 | for instances in tqdm(mon_site_data.values()): 77 | random.shuffle(instances) 78 | assert len(instances) >= num_mon_inst 79 | for inst_num, all_data in enumerate(instances): 80 | if inst_num < num_mon_inst_train: 81 | train_seq_and_labels.append(all_data) 82 | elif inst_num < num_mon_inst: 83 | test_seq_and_labels.append(all_data) 84 | else: 85 | break 86 | 87 | del mon_dataset, mon_dir_seq, mon_time_seq, mon_metadata, mon_labels, mon_site_data, mon_site_labels 88 | 89 | print('reading unmonitored data') 90 | 91 | unmon_dataset = np.load(unmon_data_loc) 92 | unmon_dir_seq = unmon_dataset['dir_seq'] 93 | unmon_time_seq = unmon_dataset['time_seq'] 94 | unmon_metadata = unmon_dataset['metadata'] 95 | 96 | unmon_site_data = [[dir_seq, time_seq, metadata, num_mon_sites] for dir_seq, time_seq, metadata in 97 | zip(unmon_dir_seq, unmon_time_seq, unmon_metadata)] 98 | 99 | print('randomly choosing unmonitored instances for training and test sets') 100 | random.shuffle(unmon_site_data) 101 | assert len(unmon_site_data) >= num_unmon_sites 102 | for inst_num, all_data in tqdm(enumerate(unmon_site_data)): 103 | if inst_num < num_unmon_sites_train: 104 | train_seq_and_labels.append(all_data) 105 | elif inst_num < 
num_unmon_sites: 106 | test_seq_and_labels.append(all_data) 107 | else: 108 | break 109 | 110 | del unmon_dataset, unmon_dir_seq, unmon_time_seq, unmon_metadata, unmon_site_data 111 | 112 | print('processing data') 113 | 114 | # Shuffle training and testing data 115 | random.shuffle(train_seq_and_labels) 116 | random.shuffle(test_seq_and_labels) 117 | 118 | train_dir, train_time, train_metadata, train_labels = zip(*train_seq_and_labels) 119 | test_dir, test_time, test_metadata, test_labels = zip(*test_seq_and_labels) 120 | 121 | train_dir = np.array(train_dir) 122 | train_time = np.array(train_time) 123 | train_metadata = np.array(train_metadata) 124 | 125 | test_dir = np.array(test_dir) 126 | test_time = np.array(test_time) 127 | test_metadata = np.array(test_metadata) 128 | 129 | # Converts from absolute times to inter-packet times. 130 | if inter_time: 131 | train_time[:, 1:] = train_time[:, 1:] - train_time[:, :-1] 132 | test_time[:, 1:] = test_time[:, 1:] - test_time[:, :-1] 133 | 134 | # Reshape to add 3rd dim for CNN input 135 | train_dir = np.expand_dims(train_dir, axis=-1) 136 | test_dir = np.expand_dims(test_dir, axis=-1) 137 | 138 | train_time = np.expand_dims(train_time, axis=-1) 139 | test_time = np.expand_dims(test_time, axis=-1) 140 | 141 | if scale_metadata: 142 | metadata_scaler = StandardScaler() 143 | train_metadata = metadata_scaler.fit_transform(train_metadata) 144 | test_metadata = metadata_scaler.transform(test_metadata) 145 | 146 | # One-hot encoding of labels 147 | num_classes = num_mon_sites if num_unmon_sites == 0 else num_mon_sites + 1 148 | train_labels = one_hot(torch.tensor(train_labels), num_classes=num_classes) 149 | test_labels = one_hot(torch.tensor(test_labels), num_classes=num_classes) 150 | 151 | print('training data stats:') 152 | print(train_dir.shape) 153 | print(train_time.shape) 154 | print(train_metadata.shape) 155 | print(train_labels.shape) 156 | 157 | print('testing data stats:') 158 | print(test_dir.shape) 159 | print(test_time.shape) 160 | print(test_metadata.shape) 161 | print(test_labels.shape) 162 | 163 | print('saving data') 164 | with h5py.File(f'{data_dir}{num_mon_sites}_{num_mon_inst}_{num_unmon_sites_train}_{num_unmon_sites_test}.h5', 'w') as f: 165 | f.create_group('training_data') 166 | f.create_group('validation_data') 167 | f.create_group('test_data') 168 | 169 | for ds_name, arr in [['dir_seq', train_dir], 170 | ['time_seq', train_time], 171 | ['metadata', train_metadata], 172 | ['labels', train_labels.numpy()]]: 173 | f.create_dataset('training_data/' + ds_name, data=arr[:int(0.95 * len(arr))]) 174 | 175 | for ds_name, arr in [['dir_seq', train_dir], 176 | ['time_seq', train_time], 177 | ['metadata', train_metadata], 178 | ['labels', train_labels.numpy()]]: 179 | f.create_dataset('validation_data/' + ds_name, data=arr[int(0.95 * len(arr)):]) 180 | 181 | for ds_name, arr in [['dir_seq', test_dir], 182 | ['time_seq', test_time], 183 | ['metadata', test_metadata], 184 | ['labels', test_labels.numpy()]]: 185 | f.create_dataset('test_data/' + ds_name, data=arr) 186 | 187 | end = time.time() 188 | print(f'Finished {num_mon_sites}_{num_mon_inst}_{num_unmon_sites_train}_{num_unmon_sites_test}.h5 in {end - start:.2f} seconds') 189 | 190 | if __name__ == '__main__': 191 | with open('config.json') as config_file: 192 | config = json.load(config_file) 193 | 194 | main(config) 195 | -------------------------------------------------------------------------------- /requirements.txt: 
--------------------------------------------------------------------------------
1 | ###### Requirements without Version Specifiers ######
2 | h5py
3 | tqdm
4 | scikit-learn
5 | numpy
6 | 
7 | ###### Requirements with Version Specifiers ######
8 | 
9 | torch == 1.13.0
10 | torchvision == 0.14.0
--------------------------------------------------------------------------------
/run_model.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function, absolute_import, division
2 | 
3 | import json
4 | import os
5 | import time
6 | import torch
7 | import numpy as np
8 | 
9 | import var_cnn
10 | import df
11 | import evaluate
12 | import preprocess_data
13 | import data_generator
14 | 
15 | 
16 | def update_config(config, updates):
17 | """Updates config dict and config file with updates dict."""
18 | config.update(updates)
19 | with open('config.json', 'w') as f:
20 | json.dump(config, f, indent=4)
21 | 
22 | 
23 | def is_valid_mixture(mixture):
24 | """Check if mixture is a 2D array with strings representing the models."""
25 | assert isinstance(mixture, list) and len(mixture) > 0
26 | for inner_comb in mixture:
27 | assert isinstance(inner_comb, list) and len(inner_comb) > 0
28 | for model in inner_comb:
29 | assert model in ['dir', 'time', 'metadata']
30 | 
31 | 
32 | def train_and_val(config, model, optimizer, criterion, mixture_num, sub_model_name):
33 | """Train and validate model."""
34 | print(f'training {config["model_name"]} {sub_model_name} model')
35 | 
36 | train_size = int((num_mon_sites * num_mon_inst_train + num_unmon_sites_train) * 0.95)
37 | train_steps = train_size // batch_size
38 | val_size = int((num_mon_sites * num_mon_inst_train + num_unmon_sites_train) * 0.05)
39 | val_steps = val_size // batch_size
40 | 
41 | train_time_start = time.time()
42 | 
43 | for epoch in range(epochs):
44 | model.train()
45 | for i, data in enumerate(data_generator.create_dataloader(config, 'training_data', mixture_num, batch_size)):
46 | inputs, labels = data
47 | optimizer.zero_grad()
48 | outputs = model(inputs)
49 | loss = criterion(outputs, labels)
50 | loss.backward()
51 | optimizer.step()
52 | if i >= train_steps:
53 | break
54 | 
55 | model.eval()
56 | with torch.no_grad():
57 | for i, data in enumerate(data_generator.create_dataloader(config, 'validation_data', mixture_num, batch_size, shuffle=False)):
58 | inputs, labels = data
59 | outputs = model(inputs)
60 | val_loss = criterion(outputs, labels)
61 | if i >= val_steps:
62 | break
63 | 
64 | train_time_end = time.time()
65 | print(f'Total training time: {train_time_end - train_time_start}')
66 | 
67 | 
68 | def predict(config, model, mixture_num, sub_model_name):
69 | """Compute and save final predictions on test set."""
70 | print(f'generating predictions for {config["model_name"]} {sub_model_name} model')
71 | 
72 | if config["model_name"] == 'var-cnn':
73 | model.load_state_dict(torch.load('model_weights.pth'))
74 | 
75 | test_size = num_mon_sites * num_mon_inst_test + num_unmon_sites_test
76 | test_steps = test_size // batch_size
77 | 
78 | test_time_start = time.time()
79 | 
80 | model.eval()
81 | predictions = []
82 | with torch.no_grad():
83 | for i, data in enumerate(data_generator.create_dataloader(config, 'test_data', mixture_num, batch_size, shuffle=False)):  # keep test order fixed so predictions line up with the stored labels
84 | inputs, _ = data
85 | outputs = model(inputs)
86 | predictions.append(outputs.cpu().numpy())
87 | if i >= test_steps:
88 | break
89 | 
90 | test_time_end = time.time()
91 | 
92 | if not os.path.exists(predictions_dir):
93 | os.makedirs(predictions_dir)
94 | 
95 | 
np.save(f'{predictions_dir}{sub_model_name}_model', np.concatenate(predictions, axis=0))
96 | 
97 | print(f'Total test time: {test_time_end - test_time_start}')
98 | 
99 | 
100 | with open('config.json') as config_file:
101 | config = json.load(config_file)
102 | if config['model_name'] == 'df':
103 | update_config(config, {'mixture': [['dir']], 'batch_size': 128})
104 | 
105 | num_mon_sites = config['num_mon_sites']
106 | num_mon_inst_test = config['num_mon_inst_test']
107 | num_mon_inst_train = config['num_mon_inst_train']
108 | num_mon_inst = num_mon_inst_test + num_mon_inst_train
109 | num_unmon_sites_test = config['num_unmon_sites_test']
110 | num_unmon_sites_train = config['num_unmon_sites_train']
111 | num_unmon_sites = num_unmon_sites_test + num_unmon_sites_train
112 | 
113 | data_dir = config['data_dir']
114 | model_name = config['model_name']
115 | mixture = config['mixture']
116 | batch_size = config['batch_size']
117 | predictions_dir = config['predictions_dir']
118 | epochs = config['var_cnn_max_epochs'] if model_name == 'var-cnn' else config['df_epochs']
119 | is_valid_mixture(mixture)
120 | 
121 | if not os.path.exists(f'{data_dir}{num_mon_sites}_{num_mon_inst}_{num_unmon_sites_train}_{num_unmon_sites_test}.h5'):
122 | preprocess_data.main(config)
123 | 
124 | for mixture_num, inner_comb in enumerate(mixture):
125 | model, optimizer, criterion, callbacks = var_cnn.get_model(config, mixture_num) if model_name == 'var-cnn' else df.get_model(config)
126 | 
127 | sub_model_name = '_'.join(inner_comb)
128 | train_and_val(config, model, optimizer, criterion, mixture_num, sub_model_name)
129 | predict(config, model, mixture_num, sub_model_name)
130 | 
131 | print('evaluating mixture on test data...')
132 | evaluate.main(config)
133 | 
--------------------------------------------------------------------------------
/var_cnn.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import time
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.optim as optim
8 | from torch.utils.data import DataLoader, Dataset
9 | from torch.nn import functional as F
10 | 
11 | # Path used to save and load the model weights
12 | MODEL_PATH = 'model_weights.pth'
13 | 
14 | # Custom dataset class
15 | class CustomDataset(Dataset):
16 | def __init__(self, config, mode):
17 | # Load data according to mode ('training_data', 'validation_data', 'test_data')
18 | # Implement the data-loading logic here
19 | pass
20 | 
21 | def __len__(self):
22 | # Return the size of the dataset
23 | pass
24 | 
25 | def __getitem__(self, idx):
26 | # Return a single data sample by index
27 | pass
28 | 
29 | # Basic residual block
30 | class BasicBlock(nn.Module):
31 | def __init__(self, in_channels, out_channels, stride=1):
32 | super(BasicBlock, self).__init__()
33 | self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
34 | self.bn1 = nn.BatchNorm1d(out_channels)
35 | self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1, bias=False)
36 | self.bn2 = nn.BatchNorm1d(out_channels)
37 | 
38 | self.shortcut = nn.Conv1d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False) if stride != 1 else None
39 | 
40 | def forward(self, x):
41 | out = F.relu(self.bn1(self.conv1(x)))
42 | out = self.bn2(self.conv2(out))
43 | if self.shortcut is not None:
44 | x = self.shortcut(x)
45 | out += x
46 | return F.relu(out)
47 | 
48 | # ResNet model definition
49 | class ResNet(nn.Module):
50 | def __init__(self, block, num_blocks, num_classes):
51 | super(ResNet, self).__init__()
52 | self.in_channels = 64
53 | self.conv1 = nn.Conv1d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
54 | self.bn1 = nn.BatchNorm1d(64)
55 | self.maxpool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
56 | 
57 | self.layer1 = self._make_layer(block, 64, num_blocks[0])
58 | self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
59 | self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
60 | self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
61 | 
62 | self.avgpool = nn.AdaptiveAvgPool1d(1)
63 | self.fc = nn.Linear(512, num_classes)
64 | 
65 | def _make_layer(self, block, out_channels, blocks, stride=1):
66 | layers = []
67 | layers.append(block(self.in_channels, out_channels, stride))
68 | self.in_channels = out_channels
69 | for _ in range(1, blocks):
70 | layers.append(block(out_channels, out_channels))
71 | return nn.Sequential(*layers)
72 | 
73 | def forward(self, x):
74 | x = F.relu(self.bn1(self.conv1(x)))
75 | x = self.maxpool(x)
76 | 
77 | x = self.layer1(x)
78 | x = self.layer2(x)
79 | x = self.layer3(x)
80 | x = self.layer4(x)
81 | 
82 | x = self.avgpool(x)
83 | x = x.view(x.size(0), -1)
84 | x = self.fc(x)
85 | return x
86 | 
87 | # Training and validation loop
88 | def train_and_val(model, criterion, optimizer, train_loader, val_loader, num_epochs):
89 | for epoch in range(num_epochs):
90 | model.train()
91 | for inputs, labels in train_loader:
92 | optimizer.zero_grad()
93 | outputs = model(inputs)
94 | loss = criterion(outputs, labels)
95 | loss.backward()
96 | optimizer.step()
97 | 
98 | # Validation phase
99 | model.eval()
100 | with torch.no_grad():
101 | val_loss = 0
102 | correct = 0
103 | total = 0
104 | for inputs, labels in val_loader:
105 | outputs = model(inputs)
106 | val_loss += criterion(outputs, labels).item()
107 | _, predicted = torch.max(outputs.data, 1)
108 | total += labels.size(0)
109 | correct += (predicted == labels).sum().item()
110 | 
111 | print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, Validation Loss: {val_loss:.4f}, Accuracy: {100 * correct / total:.2f}%')
112 | 
113 | # Prediction function
114 | def predict(model, test_loader):
115 | model.eval()
116 | predictions = []
117 | with torch.no_grad():
118 | for inputs in test_loader:
119 | outputs = model(inputs)
120 | _, predicted = torch.max(outputs.data, 1)
121 | predictions.append(predicted.numpy())
122 | return np.concatenate(predictions)
123 | 
124 | # Main program
125 | def main():
126 | # Read the configuration
127 | with open('config.json') as config_file:
128 | config = json.load(config_file)
129 | 
130 | # Derive model parameters from the config
131 | num_classes = config['num_mon_sites'] if config['num_unmon_sites_train'] == 0 else config['num_mon_sites'] + 1
132 | num_epochs = config['var_cnn_max_epochs'] if config['model_name'] == 'var-cnn' else config['df_epochs']
133 | batch_size = config['batch_size']
134 | 
135 | # Prepare the datasets
136 | train_dataset = CustomDataset(config, mode='training_data')
137 | val_dataset = CustomDataset(config, mode='validation_data')
138 | train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
139 | val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
140 | 
141 | # Initialize the model, loss function, and optimizer
142 | model = ResNet(BasicBlock, [2, 2, 2, 2], num_classes)
143 | criterion = nn.CrossEntropyLoss()
144 | optimizer = optim.Adam(model.parameters(), lr=0.001)
145 | 
146 | # Train and validate
147 | train_and_val(model, criterion, optimizer, train_loader, val_loader, num_epochs)
148 | 
149 | # Predict on the test set
150 | test_dataset = CustomDataset(config, mode='test_data')
151 | test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
152 | predictions =
predict(model, test_loader) 153 | 154 | # 保存模型 155 | torch.save(model.state_dict(), MODEL_PATH) 156 | 157 | if __name__ == "__main__": 158 | main() 159 | -------------------------------------------------------------------------------- /wang_to_varcnn.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import os 6 | import numpy as np 7 | import random 8 | import json 9 | import torch 10 | 11 | random.seed(4286794567481) 12 | 13 | def process_trace(args): 14 | """Extract dir seq, time seq, metadata, and label for particular trace.""" 15 | dir_name = args[0] 16 | trace_path = args[1] 17 | 18 | with open(os.path.join(dir_name, trace_path), 'r') as f: 19 | lines = f.readlines() 20 | 21 | dir_seq = torch.zeros(5000, dtype=torch.int8) 22 | time_seq = torch.zeros(5000, dtype=torch.float32) 23 | label = 0 if '-' not in trace_path else int(trace_path.split('-')[0]) + 1 24 | total_time = float(lines[-1].split('\t')[0]) 25 | total_incoming = 0 26 | total_outgoing = 0 27 | 28 | for packet_num, line in enumerate(lines): 29 | line = line.split('\t') 30 | curr_time = float(line[0]) 31 | curr_dir = torch.sign(torch.tensor(int(line[1]))).item() 32 | 33 | if packet_num < 5000: 34 | dir_seq[packet_num] = curr_dir 35 | time_seq[packet_num] = curr_time 36 | 37 | if curr_dir == 1: 38 | total_outgoing += 1 39 | elif curr_dir == -1: 40 | total_incoming += 1 41 | 42 | total_packets = total_incoming + total_outgoing 43 | if total_packets == 0: 44 | metadata = torch.zeros(7, dtype=torch.float32) 45 | else: 46 | metadata = torch.tensor([total_packets, total_incoming, total_outgoing, 47 | total_incoming / total_packets, 48 | total_outgoing / total_packets, 49 | total_time, total_time / total_packets], 50 | dtype=torch.float32) 51 | return dir_seq, time_seq, metadata, label 52 | 53 | def main(config): 54 | 55 | num_mon_sites = config['num_mon_sites'] 56 | num_mon_inst_test = config['num_mon_inst_test'] 57 | num_mon_inst_train = config['num_mon_inst_train'] 58 | num_mon_inst = num_mon_inst_test + num_mon_inst_train 59 | num_unmon_sites_test = config['num_unmon_sites_test'] 60 | num_unmon_sites_train = config['num_unmon_sites_train'] 61 | num_unmon_sites = num_unmon_sites_test + num_unmon_sites_train 62 | data_dir = config['data_dir'] 63 | 64 | arg_list = [] 65 | for trace_path in os.listdir(data_dir + 'batch_wang'): 66 | arg_list.append([data_dir + 'batch_wang', trace_path]) 67 | 68 | # set up the output 69 | mon_idx = 0 70 | dir_seq_mon = [None] * (num_mon_sites * num_mon_inst) 71 | time_seq_mon = [None] * (num_mon_sites * num_mon_inst) 72 | metadata_mon = [None] * (num_mon_sites * num_mon_inst) 73 | labels_mon = [None] * (num_mon_sites * num_mon_inst) 74 | 75 | unmon_idx = 0 76 | dir_seq_unmon = [None] * num_unmon_sites 77 | time_seq_unmon = [None] * num_unmon_sites 78 | metadata_unmon = [None] * num_unmon_sites 79 | labels_unmon = [None] * num_unmon_sites 80 | 81 | print('size of total list: %d' % len(arg_list)) 82 | for i in range(len(arg_list)): 83 | dir_seq, time_seq, metadata, label = process_trace(arg_list[i]) 84 | if i % 5000 == 0: 85 | print("processed", i) 86 | 87 | if label == 0: # unmon site 88 | dir_seq_unmon[unmon_idx] = dir_seq 89 | time_seq_unmon[unmon_idx] = time_seq 90 | metadata_unmon[unmon_idx] = metadata 91 | labels_unmon[unmon_idx] = label 92 | unmon_idx += 1 93 | else: 94 | dir_seq_mon[mon_idx] = dir_seq 95 | time_seq_mon[mon_idx] = 
time_seq 96 | metadata_mon[mon_idx] = metadata 97 | labels_mon[mon_idx] = label 98 | mon_idx += 1 99 | 100 | # save monitored traces 101 | dir_seq_mon = torch.stack(dir_seq_mon).cpu().numpy() 102 | time_seq_mon = torch.stack(time_seq_mon).cpu().numpy() 103 | metadata_mon = torch.stack(metadata_mon).cpu().numpy() 104 | 105 | print('number of monitored traces: %d' % len(labels_mon)) 106 | np.savez_compressed(data_dir + 'all_closed_world.npz', dir_seq=dir_seq_mon, 107 | time_seq=time_seq_mon, metadata=metadata_mon, 108 | labels=labels_mon) 109 | 110 | # save unmonitored traces 111 | dir_seq_unmon = torch.stack(dir_seq_unmon).cpu().numpy() 112 | time_seq_unmon = torch.stack(time_seq_unmon).cpu().numpy() 113 | metadata_unmon = torch.stack(metadata_unmon).cpu().numpy() 114 | 115 | print('number of unmonitored traces: %d' % len(labels_unmon)) 116 | np.savez_compressed(data_dir + 'all_open_world.npz', dir_seq=dir_seq_unmon, 117 | time_seq=time_seq_unmon, metadata=metadata_unmon, 118 | labels=labels_unmon) 119 | 120 | if __name__ == '__main__': 121 | with open('config.json') as config_file: 122 | config = json.load(config_file) 123 | 124 | main(config) 125 | --------------------------------------------------------------------------------
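As a quick sanity check of ```wang_to_varcnn.py```, the sketch below writes one synthetic monitored trace in the tab-separated ```time<TAB>direction``` format that ```process_trace``` parses (the file name ```0-0``` marks site 0, instance 0, and ```data/batch_wang``` is where ```main``` looks for traces; swap the separator if your captures use spaces instead) and prints the shapes of the returned tensors:

```python
import os
from wang_to_varcnn import process_trace

os.makedirs('data/batch_wang', exist_ok=True)
with open('data/batch_wang/0-0', 'w') as f:
    for i in range(10):
        # relative time in seconds, then direction (+1 outgoing, -1 incoming)
        f.write(f'{0.01 * i}\t{1 if i % 2 == 0 else -1}\n')

dir_seq, time_seq, metadata, label = process_trace(['data/batch_wang', '0-0'])
print(dir_seq.shape, time_seq.shape, metadata.shape, label)
# Expected: torch.Size([5000]) torch.Size([5000]) torch.Size([7]) 1
```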