├── .gitignore ├── LICENSE ├── README.md ├── data ├── trial_ben.npy └── trial_mal.npy ├── main.py ├── malgan ├── __init__.py ├── _export_results.py ├── _log_tools.py ├── detector.py ├── discriminator.py └── generator.py ├── malware_gan_poster.pdf ├── malware_gan_report.pdf └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # Miscellaneous Files 2 | .idea/ 3 | tags 4 | .DS_Store 5 | *.swp 6 | 7 | # Byte-compiled / optimized / DLL files 8 | __pycache__/ 9 | *.py[cod] 10 | *$py.class 11 | 12 | # C extensions 13 | *.so 14 | 15 | # Distribution / packaging 16 | .Python 17 | build/ 18 | develop-eggs/ 19 | dist/ 20 | downloads/ 21 | eggs/ 22 | .eggs/ 23 | lib/ 24 | lib64/ 25 | parts/ 26 | sdist/ 27 | var/ 28 | wheels/ 29 | *.egg-info/ 30 | .installed.cfg 31 | *.egg 32 | MANIFEST 33 | 34 | # PyInstaller 35 | # Usually these files are written by a python script from a template 36 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
37 | *.manifest 38 | *.spec 39 | 40 | # Installer logs 41 | pip-log.txt 42 | pip-delete-this-directory.txt 43 | 44 | # Unit test / coverage reports 45 | htmlcov/ 46 | .tox/ 47 | .coverage 48 | .coverage.* 49 | .cache 50 | nosetests.xml 51 | coverage.xml 52 | *.cover 53 | .hypothesis/ 54 | .pytest_cache/ 55 | 56 | # Translations 57 | *.mo 58 | *.pot 59 | 60 | # Django stuff: 61 | *.log 62 | local_settings.py 63 | db.sqlite3 64 | 65 | # Flask stuff: 66 | instance/ 67 | .webassets-cache 68 | 69 | # Scrapy stuff: 70 | .scrapy 71 | 72 | # Sphinx documentation 73 | docs/_build/ 74 | 75 | # PyBuilder 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # pyenv 82 | .python-version 83 | 84 | # celery beat schedule file 85 | celerybeat-schedule 86 | 87 | # SageMath parsed files 88 | *.sage.py 89 | 90 | # Environments 91 | .env 92 | .venv 93 | env/ 94 | venv/ 95 | ENV/ 96 | env.bak/ 97 | venv.bak/ 98 | 99 | # Spyder project settings 100 | .spyderproject 101 | .spyproject 102 | 103 | # Rope project settings 104 | .ropeproject 105 | 106 | # mkdocs documentation 107 | /site 108 | 109 | # mypy 110 | .mypy_cache/ 111 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Zayd Hammoudeh 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Adversarial Malware Generation Using GANs 2 | 3 | [![docs](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/ZaydH/MalwareGAN/blob/master/LICENSE) 4 | 5 | Implementation of a Generative Adversarial Network (GAN) that can create adversarial malware examples. The work is inspired by **MalGAN** in the paper "[*Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN*](https://arxiv.org/abs/1702.05983)" by Weiwei Hu and Ying Tan. 6 | 7 | Framework written in [PyTorch](https://pytorch.org/) and supports CUDA. 8 | 9 | ## Running the Script 10 | 11 | The malware GAN is provided as a package in the folder `malgan`. A driver script is provided in `main.py`, which processes input arguments via `argparse`. The basic interface is: 12 | 13 | python main.py Z BATCH_SIZE NUM_EPOCHS MALWARE_FILE BENIGN_FILE 14 | 15 | * `Z` -- Dimension of the latent vector. Must be a positive integer. 16 | * `BATCH_SIZE` -- Batch size for *malicious* examples. The benign batch size is proportional to `BATCH_SIZE` and the fraction of total training samples that are benign. 17 | * `NUM_EPOCHS` -- Maximum number of training epochs 18 | * `MALWARE_FILE` -- Path to a serialized `numpy` or `torch` matrix where the rows represent a single **malware** file's binary feature vector. 
19 | * `BENIGN_FILE` -- Path to a serialized `numpy` or `torch` matrix where the rows represent a single **benign** file's binary feature vector. 20 | 21 | For checkout purposes, we recommend calling: 22 | 23 | python main.py 10 32 100 data/trial_mal.npy data/trial_ben.npy 24 | 25 | ## Dataset 26 | 27 | A trial dataset is included with this implementation in the `data` folder. The data was published in the repository [yanminglai/Malware-GAN](https://github.com/yanminglai/Malware-GAN). This dataset should only be used for proof of concept and initial trials. 28 | 29 | We recommend the SLEIPNIR dataset, published by Al-Dujaili et al. The authors requested that the dataset not be shared publicly, and we respect that request. However, researchers and students may request access directly from the authors as described on their [GitHub repository](https://github.com/ALFA-group/robust-adv-malware-detection). Look for the link to the Google form. 30 | 31 | ## CUDA Support 32 | 33 | The implementation supports both CPU and CUDA (i.e., GPU) execution. If CUDA is detected on the system, the implementation defaults to running on the GPU. 34 | 35 | ## Requirements 36 | 37 | This program was tested with Python 3.6.5 on macOS and on Debian Linux. `requirements.txt` enumerates the exact packages used. A summary of the key requirements is below: 38 | 39 | * PyTorch (`torch`) -- Ver. 1.2.0 40 | * Scikit-Learn (`sklearn`) -- Ver. 0.20.2 41 | * NumPy (`numpy`) 42 | * TensorboardX -- If runtime profiling is not required, this can be removed. 
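
## Example: Benign Batch Size

As noted under "Running the Script", only the malware batch size is given on the command line. In `malgan/__init__.py`, the benign batch size is the malware batch size scaled by the ratio of benign to malware samples and truncated to an integer. The following is a minimal sketch of that arithmetic; the function name `benign_batch_size` and the sample counts are illustrative assumptions and not part of the package.

```python
def benign_batch_size(malware_batch_size: int, num_malware: int, num_benign: int) -> int:
    """Hypothetical helper mirroring the scaling done in the MalGAN constructor."""
    if malware_batch_size <= 0 or num_malware <= 0 or num_benign <= 0:
        raise ValueError("All arguments must be positive integers")
    # Ratio of benign to malware samples
    ben_bs_frac = num_benign / num_malware
    # Truncate toward zero, as int() does
    return int(ben_bs_frac * malware_batch_size)


# Assumed dataset sizes, chosen only for illustration
print(benign_batch_size(32, num_malware=1000, num_benign=3000))  # 96
print(benign_batch_size(64, num_malware=500, num_benign=250))    # 32
```

With far fewer benign than malware samples the result can truncate to zero, so it may be worth checking the value before starting a long training run.
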
43 | -------------------------------------------------------------------------------- /data/trial_ben.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ZaydH/MalwareGAN/ea3f4e5139e6343c26273db0299a4b9d96d814af/data/trial_ben.npy -------------------------------------------------------------------------------- /data/trial_mal.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ZaydH/MalwareGAN/ea3f4e5139e6343c26273db0299a4b9d96d814af/data/trial_mal.npy -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | r""" 3 | Main module for testing and debugging the MalGAN implementation. 4 | 5 | :version: 0.1.0 6 | :copyright: (c) 2019 by Zayd Hammoudeh. 7 | :license: MIT, see LICENSE for more details. 8 | """ 9 | 10 | import argparse 11 | import pickle 12 | import sys 13 | from typing import Union 14 | from pathlib import Path 15 | 16 | import numpy as np 17 | from malgan import MalGAN, MalwareDataset, BlackBoxDetector, setup_logger 18 | 19 | import torch 20 | from torch import nn 21 | 22 | 23 | 24 | def parse_args() -> argparse.Namespace: 25 | r""" 26 | Parse the command line arguments 27 | 28 | :return: Parsed argument structure 29 | """ 30 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 31 | 32 | parser.add_argument("Z", help="Dimension of the latent vector", type=int, default=10) 33 | parser.add_argument("batch_size", help="Batch size", type=int, default=32) 34 | parser.add_argument("num_epoch", help="Number of training epochs", type=int, default=100) 35 | 36 | msg = "Data file containing the %s feature vectors" 37 | for x in ["malware", "benign"]: 38 | parser.add_argument(x[:3] + "_file", help=msg % x, type=str, default="data/%s.npy" % x) 39 | 40 
| parser.add_argument("-q", help="Quiet mode", action='store_true', default=False) 41 | 42 | help_msg = " ".join(["Dimension of the hidden layer(s) in the GENERATOR.", 43 | "Multiple layers should be space separated"]) 44 | parser.add_argument("--gen-hidden-sizes", help=help_msg, type=int, 45 | default=[256, 256], nargs="+") 46 | 47 | help_msg = " ".join(["Dimension of the hidden layer(s) in the DISCRIMINATOR.", 48 | "Multiple layers should be space separated"]) 49 | parser.add_argument("--discrim-hidden-sizes", help=help_msg, type=int, 50 | default=[256, 256], nargs="+") 51 | 52 | help_msg = " ".join(["Activation function for the generator and discriminator hidden", 53 | "layer(s). Valid choices (case insensitive) are: \"ReLU\", \"ELU\",", 54 | "\"LeakyReLU\", \"tanh\" and \"sigmoid\"."]) 55 | parser.add_argument("--activation", help=help_msg, type=str, default="LeakyReLU") 56 | 57 | help_msg = ["Learner algorithm used in the black box detector. Valid choices (case ", 58 | "insensitive) include:"] 59 | names = BlackBoxDetector.Type.names() 60 | for i, type_name in enumerate(names): 61 | if i > 0 and len(names) > 2: # Need three options for a comma to make sense 62 | help_msg.append(",") 63 | if len(names) > 1 and i == len(names) - 1: # And only makes sense if at least two options 64 | help_msg.append(" and") 65 | help_msg.extend([" \"", type_name, "\""]) 66 | help_msg.append(".") 67 | parser.add_argument("--detector", help="".join(help_msg), type=str, 68 | default=BlackBoxDetector.Type.RandomForest.name) 69 | 70 | help_msg = "Print the results to the console. 
Intended for slurm results analysis" 71 | parser.add_argument("--print-results", help=help_msg, action="store_true", default=False) 72 | 73 | args = parser.parse_args() 74 | # noinspection PyTypeChecker 75 | args.activation = _configure_activation_function(args.activation) 76 | args.detector = BlackBoxDetector.Type.get_from_name(args.detector) 77 | 78 | # Check the malware and benign files exist 79 | args.mal_file = Path(args.mal_file) 80 | args.ben_file = Path(args.ben_file) 81 | for (name, path) in (("malware", args.mal_file), ("benign", args.ben_file)): 82 | if path.exists(): continue 83 | print("Unknown %s file \"%s\"" % (name, str(path))) 84 | sys.exit(1) 85 | return args 86 | 87 | 88 | def _configure_activation_function(act_func_name: str) -> nn.Module: 89 | r""" 90 | Parse the activation function from a string and return the corresponding activation function 91 | PyTorch module. If the activation function cannot be found, a \p ValueError is raised. 92 | 93 | **Note**: Activation function check is case insensitive. 94 | 95 | :param act_func_name: Name of the activation function to look up 96 | :return: Activation function module associated with the passed name. 97 | """ 98 | act_func_name = act_func_name.lower() # Make case insensitive 99 | # Supported activation functions 100 | act_funcs = [("relu", nn.ReLU), ("elu", nn.ELU), ("leakyrelu", nn.LeakyReLU), ("tanh", nn.Tanh), 101 | ("sigmoid", nn.Sigmoid)] 102 | for func_name, module in act_funcs: 103 | if act_func_name == func_name.lower(): 104 | return module 105 | raise ValueError("Unknown activation function: \"%s\"" % act_func_name) 106 | 107 | 108 | def load_dataset(file_path: Union[str, Path], y: int) -> MalwareDataset: 109 | r""" 110 | Extracts the input data from disk and packages it into the format expected by \p MalGAN. Supports 111 | loading files from numpy, torch, and pickle. Other formats (based on the file extension) will 112 | result in a \p ValueError. 
113 | 114 | :param file_path: Path to a NumPy data file containing tensors for the benign and malware 115 | data. 116 | :param y: Y value for dataset 117 | :return: \p MalwareDataset objects for the malware and benign files respectively. 118 | """ 119 | file_ext = Path(file_path).suffix 120 | if file_ext in {".npy", ".npz"}: 121 | data = np.load(file_path) 122 | elif file_ext in {".pt", ".pth"}: 123 | data = torch.load(str(file_path)) 124 | elif file_ext == ".pk": 125 | with open(str(file_path), "rb") as f_in: 126 | data = pickle.load(f_in) 127 | else: 128 | raise ValueError("Unknown file extension. Cannot determine how to import") 129 | return MalwareDataset(x=data, y=y) 130 | 131 | 132 | def main(): 133 | args = parse_args() 134 | setup_logger(args.q) 135 | 136 | MalGAN.MALWARE_BATCH_SIZE = args.batch_size 137 | 138 | malgan = MalGAN(load_dataset(args.mal_file, MalGAN.Label.Malware.value), 139 | load_dataset(args.ben_file, MalGAN.Label.Benign.value), 140 | Z=args.Z, 141 | h_gen=args.gen_hidden_sizes, 142 | h_discrim=args.discrim_hidden_sizes, 143 | g_hidden=args.activation, 144 | detector_type=args.detector) 145 | malgan.fit(args.num_epoch, quiet_mode=args.q) 146 | results = malgan.measure_and_export_results() 147 | if args.print_results: 148 | print(results) 149 | 150 | 151 | if __name__ == "__main__": 152 | main() 153 | -------------------------------------------------------------------------------- /malgan/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | r""" 3 | malgan.__init__ 4 | ~~~~~~~~~~~~~~~ 5 | 6 | MalGAN complete architecture. 7 | 8 | Based on the paper: "Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN" 9 | By Weiwei Hu and Ying Tan. 10 | 11 | :version: 0.1.0 12 | :copyright: (c) 2019 by Zayd Hammoudeh. 13 | :license: MIT, see LICENSE for more details. 
14 | """ 15 | import logging 16 | import os 17 | from enum import Enum 18 | from typing import List, Tuple, Union 19 | from pathlib import Path 20 | 21 | import numpy as np 22 | 23 | import tensorboardX 24 | import torch 25 | from torch import Tensor 26 | import torch.nn as nn 27 | import torch.optim as optim 28 | from torch.optim.optimizer import Optimizer 29 | import torch.utils.data 30 | from torch.utils.data import Dataset, DataLoader, Subset 31 | 32 | from ._export_results import _export_results 33 | from ._log_tools import setup_logger, TrainingLogger 34 | from .detector import BlackBoxDetector 35 | from .discriminator import Discriminator 36 | from .generator import Generator 37 | 38 | ListOrInt = Union[List[int], int] 39 | PathOrStr = Union[str, Path] 40 | TensorTuple = Tuple[Tensor, Tensor] 41 | 42 | 43 | IS_CUDA = torch.cuda.is_available() 44 | if IS_CUDA: 45 | device = torch.device('cuda:0') 46 | # noinspection PyUnresolvedReferences 47 | torch.set_default_tensor_type(torch.cuda.FloatTensor) 48 | 49 | 50 | class MalwareDataset(Dataset): 51 | r""" 52 | Encapsulates a malware dataset. All elements in the dataset will be either malware or benign 53 | """ 54 | def __init__(self, x: Union[np.ndarray, Tensor], y): 55 | super().__init__() 56 | 57 | if isinstance(x, np.ndarray): 58 | x = torch.from_numpy(x).float() 59 | self.x = x 60 | self.y = y 61 | 62 | def __getitem__(self, index): 63 | return self.x[index], self.y 64 | 65 | def __len__(self): 66 | return self.x.shape[0] 67 | 68 | @property 69 | def num_features(self): 70 | r""" Number of features in the dataset """ 71 | return self.x.shape[1] 72 | 73 | 74 | class _DataGroup: # pylint: disable=too-few-public-methods 75 | r""" 76 | Encapsulates either PyTorch DataLoaders or Datasets. This class is intended only for internal 77 | use by MalGAN. 
78 | """ 79 | def __init__(self, train: MalwareDataset, valid: MalwareDataset, test: MalwareDataset): 80 | self.train = train 81 | self.valid = valid 82 | self.test = test 83 | self.is_loaders = False 84 | 85 | def build_loader(self, batch_size: int = 0): 86 | r""" 87 | Constructs loaders from the datasets 88 | 89 | :param batch_size: Batch size for training 90 | """ 91 | self.train = DataLoader(self.train, batch_size=batch_size, shuffle=True, pin_memory=True) 92 | if self.valid: 93 | self.valid = DataLoader(self.valid, batch_size=batch_size, pin_memory=True) 94 | self.test = DataLoader(self.test, batch_size=batch_size, pin_memory=True) 95 | self.is_loaders = True 96 | 97 | 98 | # noinspection PyPep8Naming 99 | class MalGAN(nn.Module): 100 | r""" Malware Generative Adversarial Network based on the work of Hu & Tan. """ 101 | 102 | MALWARE_BATCH_SIZE = 32 103 | 104 | SAVED_MODEL_DIR = Path("saved_models") 105 | 106 | VALIDATION_SPLIT = 0.2 107 | 108 | tensorboard = None 109 | 110 | class Label(Enum): 111 | r""" Label value assigned to malware and benign examples """ 112 | Malware = 1 113 | Benign = 0 114 | 115 | # noinspection PyPep8Naming 116 | def __init__(self, mal_data: MalwareDataset, ben_data: MalwareDataset, Z: int, 117 | h_gen: ListOrInt, h_discrim: ListOrInt, 118 | test_split: float = 0.2, 119 | g_hidden: nn.Module = nn.LeakyReLU, 120 | detector_type: BlackBoxDetector.Type = BlackBoxDetector.Type.LogisticRegression): 121 | r""" 122 | Malware Generative Adversarial Network Constructor 123 | 124 | :param mal_data: Malware training dataset. 125 | :param ben_data: Benign training dataset. 126 | :param Z: Dimension of the noise vector \p z 127 | :param test_split: Fraction of input data to be used for testing 128 | :param h_gen: Width of the hidden layer(s) in the GENERATOR. If only a single hidden 129 | layer is desired, then this can be only an integer. 130 | :param h_discrim: Width of the hidden layer(s) in the DISCRIMINATOR. 
If only a single 131 | hidden layer is desired, then this can be only an integer. 132 | :param detector_type: Learning algorithm to be used by the black-box detector 133 | """ 134 | super().__init__() 135 | 136 | if mal_data.num_features != ben_data.num_features: 137 | raise ValueError("Mismatch in the number of features between malware and benign data") 138 | if Z <= 0: 139 | raise ValueError("Z must be a positive integer") 140 | if test_split <= 0. or test_split >= 1.: 141 | raise ValueError("test_split must be in the range (0,1)") 142 | self._M, self._Z = mal_data.num_features, Z # pylint: disable=invalid-name 143 | 144 | # Format the hidden layer sizes and make sure all are valid values 145 | if isinstance(h_gen, int): 146 | h_gen = [h_gen] 147 | if isinstance(h_discrim, int): 148 | h_discrim = [h_discrim] 149 | self.d_discrim, self.d_gen = h_discrim, h_gen 150 | for h_size in [self.d_discrim, self.d_gen]: 151 | for w in h_size: 152 | if w <= 0: 153 | raise ValueError("All hidden layer widths must be positive integers.") 154 | 155 | if not isinstance(g_hidden, nn.Module): 156 | g_hidden = g_hidden() 157 | self._g = g_hidden 158 | 159 | self._is_cuda = IS_CUDA 160 | 161 | logging.debug("Constructing new MalGAN") 162 | logging.debug("Malware Dimension (M): %d", self.M) 163 | logging.debug("Latent Dimension (Z): %d", self.Z) 164 | logging.debug("Test Split Ratio: %.3f", test_split) 165 | logging.debug("Generator Hidden Layer Sizes: %s", h_gen) 166 | logging.debug("Discriminator Hidden Layer Sizes: %s", h_discrim) 167 | logging.debug("Blackbox Detector Type: %s", detector_type.name) 168 | logging.debug("Activation Type: %s", self._g.__class__.__name__) 169 | 170 | self._bb = BlackBoxDetector(detector_type) 171 | self._gen = Generator(M=self.M, Z=self.Z, hidden_size=h_gen, g=self._g) 172 | self._discrim = Discriminator(M=self.M, hidden_size=h_discrim, g=self._g) 173 | 174 | def split_train_valid_test(dataset: Dataset, is_benign: bool): 175 | """Helper function to 
partition into test, train, and validation subsets""" 176 | valid_len = 0 if is_benign else int(MalGAN.VALIDATION_SPLIT * len(dataset)) 177 | test_len = int(test_split * len(dataset)) 178 | 179 | # Order must be train, validation, test 180 | lengths = [len(dataset) - valid_len - test_len, valid_len, test_len] 181 | return _DataGroup(*torch.utils.data.random_split(dataset, lengths)) 182 | 183 | # Split between train, test, and validation then construct the loaders 184 | self._mal_data = split_train_valid_test(mal_data, is_benign=False) 185 | self._ben_data = split_train_valid_test(ben_data, is_benign=True) 186 | # noinspection PyTypeChecker 187 | self._fit_blackbox(self._mal_data.train, self._ben_data.train) 188 | 189 | self._mal_data.build_loader(MalGAN.MALWARE_BATCH_SIZE) 190 | ben_bs_frac = len(ben_data) / len(mal_data) 191 | self._ben_data.build_loader(int(ben_bs_frac * MalGAN.MALWARE_BATCH_SIZE)) 192 | # Set CUDA last to ensure all parameters defined 193 | if self._is_cuda: self.cuda() 194 | 195 | @property 196 | def M(self) -> int: 197 | r"""Width of the malware feature vector""" 198 | return self._M 199 | 200 | @property 201 | def Z(self) -> int: 202 | r"""Width of the generator latent noise vector""" 203 | return self._Z 204 | 205 | def _fit_blackbox(self, mal_train: Subset, ben_train: Subset) -> None: 206 | r""" 207 | Fits the blackbox detector using the specified malware and benign training sets. 
208 | 209 | :param mal_train: Malware training dataset 210 | :param ben_train: Benign training dataset 211 | """ 212 | def extract_x(ds: Subset) -> Tensor: 213 | # noinspection PyUnresolvedReferences 214 | x = ds.dataset.x[ds.indices] 215 | return x.cpu() if self._is_cuda else x 216 | 217 | mal_x = extract_x(mal_train) 218 | ben_x = extract_x(ben_train) 219 | merged_x = torch.cat((mal_x, ben_x)) 220 | 221 | merged_y = torch.cat((torch.full((len(mal_train),), MalGAN.Label.Malware.value), 222 | torch.full((len(ben_train),), MalGAN.Label.Benign.value))) 223 | logging.debug("Starting training of blackbox detector of type \"%s\"", self._bb.type.name) 224 | self._bb.fit(merged_x, merged_y) 225 | logging.debug("COMPLETED training of blackbox detector of type \"%s\"", self._bb.type.name) 226 | 227 | def fit(self, cyc_len: int, quiet_mode: bool = False) -> None: 228 | r""" 229 | Trains the model for the specified number of epochs. The epoch with the best validation 230 | loss is used as the final model. 231 | 232 | :param cyc_len: Number of cycles (epochs) to train the model. 
233 | :param quiet_mode: True if no printing to console should occur in this function 234 | """ 235 | if cyc_len <= 0: 236 | raise ValueError("At least a single training cycle is required.") 237 | 238 | MalGAN.tensorboard = tensorboardX.SummaryWriter() 239 | 240 | d_optimizer = optim.Adam(self._discrim.parameters(), lr=1e-5) 241 | g_optimizer = optim.Adam(self._gen.parameters(), lr=1e-4) 242 | 243 | if not quiet_mode: 244 | names = ["Gen Train Loss", "Gen Valid Loss", "Discrim Train Loss", "Best?"] 245 | log = TrainingLogger(names, [20, 20, 20, 7]) 246 | 247 | best_epoch, best_loss = None, np.inf 248 | for epoch_cnt in range(1, cyc_len + 1): 249 | train_l_g, train_l_d = self._fit_epoch(g_optimizer, d_optimizer) 250 | for block, loss in [("Generator", train_l_g), ("Discriminator", train_l_d)]: 251 | MalGAN.tensorboard.add_scalar('Train_%s_Loss' % block, loss, epoch_cnt) 252 | 253 | # noinspection PyTypeChecker 254 | valid_l_g = self._meas_loader_gen_loss(self._mal_data.valid) 255 | MalGAN.tensorboard.add_scalar('Validation_Generator_Loss', valid_l_g, epoch_cnt) 256 | flds = [train_l_g, valid_l_g, train_l_d, valid_l_g < best_loss] 257 | if flds[-1]: 258 | self._save(self._build_export_name(is_final=False)) 259 | best_loss = valid_l_g 260 | if not quiet_mode: log.log(epoch_cnt, flds) 261 | MalGAN.tensorboard.close() 262 | 263 | self.load(self._build_export_name(is_final=False)) 264 | self._save(self._build_export_name(is_final=True)) 265 | self._delete_old_backup(is_final=False) 266 | 267 | def _build_export_name(self, is_final: bool = True) -> str: 268 | r""" 269 | Builds the name that will be used when exporting the model. 
270 | 271 | :param is_final: If \p True, then file name is for final (i.e., not training) model 272 | :return: Model name built from the model's parameters 273 | """ 274 | name = ["malgan", "z=%d" % self.Z, 275 | "d-gen=%s" % str(self.d_gen).replace(" ", "_"), 276 | "d-disc=%s" % str(self.d_discrim).replace(" ", "_"), 277 | "bs=%d" % MalGAN.MALWARE_BATCH_SIZE, 278 | "bb=%s" % self._bb.type.name, "g=%s" % self._g.__class__.__name__, 279 | "final" if is_final else "tmp"] 280 | 281 | # Either add an epoch name or 282 | return MalGAN.SAVED_MODEL_DIR / "".join(["_".join(name).lower(), ".pth"]) 283 | 284 | def _delete_old_backup(self, is_final: bool = True) -> None: 285 | r""" 286 | Helper function to delete old backed up models 287 | 288 | :param is_final: If \p True, then file name is for final (i.e., not training) model 289 | """ 290 | backup_name = self._build_export_name(is_final) 291 | try: 292 | os.remove(backup_name) 293 | except OSError: 294 | logging.warning("Error trying to delete model: %s", backup_name) 295 | 296 | def _fit_epoch(self, g_optim: Optimizer, d_optim: Optimizer) -> TensorTuple: 297 | r""" 298 | Trains a single entire epoch 299 | 300 | :param g_optim: Generator optimizer 301 | :param d_optim: Discriminator optimizer 302 | :return: Average training loss 303 | """ 304 | tot_l_g = tot_l_d = 0 305 | num_batch = min(len(self._mal_data.train), len(self._ben_data.train)) 306 | 307 | for (m, _), (b, _) in zip(self._mal_data.train, self._ben_data.train): 308 | if self._is_cuda: m, b = m.cuda(), b.cuda() 309 | m_prime, g_theta = self._gen.forward(m) 310 | l_g = self._calc_gen_loss(g_theta) 311 | g_optim.zero_grad() 312 | l_g.backward() 313 | # torch.nn.utils.clip_grad_value_(l_g, 1) 314 | g_optim.step() 315 | tot_l_g += l_g 316 | 317 | # Update the discriminator 318 | for x in [m_prime, b]: 319 | l_d = self._calc_discrim_loss(x) 320 | d_optim.zero_grad() 321 | l_d.backward() 322 | # torch.nn.utils.clip_grad_value_(l_d, 1) 323 | d_optim.step() 324 | tot_l_d 
+= l_d 325 | # noinspection PyUnresolvedReferences 326 | return (tot_l_g / num_batch).item(), (tot_l_d / num_batch).item() 327 | 328 | def _meas_loader_gen_loss(self, loader: DataLoader) -> float: 329 | r""" Calculate the generator loss on the malware dataset """ 330 | loss = 0 331 | for m, _ in loader: 332 | if self._is_cuda: m = m.cuda() 333 | _, g_theta = self._gen.forward(m) 334 | loss += self._calc_gen_loss(g_theta) 335 | # noinspection PyUnresolvedReferences 336 | return (loss / len(loader)).item() 337 | 338 | def _calc_gen_loss(self, g_theta: Tensor) -> Tensor: 339 | r""" 340 | Calculates the loss :math:`L_{G}` as defined in Eq. (3) of Hu & Tan's paper. 341 | 342 | :param g_theta: :math:`G_{\theta_g}(m,z)` in Eq. (1) of Hu & Tan's paper 343 | :return: Loss for the generator smoothed output. 344 | """ 345 | d_theta = self._discrim.forward(g_theta) 346 | return d_theta.log().mean() 347 | 348 | def _calc_discrim_loss(self, X: Tensor) -> Tensor: 349 | r""" 350 | Calculates the loss :math:`L_{D}` as defined in Eq. (2) of Hu & Tan's paper. 351 | 352 | :param X: Examples to calculate the loss over. May be a mix of benign and malware samples. 
353 | """ 354 | d_theta = self._discrim.forward(X) 355 | 356 | y_hat = self._bb.predict(X) 357 | d = torch.where(y_hat == MalGAN.Label.Malware.value, d_theta, 1 - d_theta) 358 | return -d.log().mean() 359 | 360 | def measure_and_export_results(self) -> str: 361 | r""" 362 | Measure the test accuracy and provide results information 363 | 364 | :return: Results information as a comma separated string 365 | """ 366 | # noinspection PyTypeChecker 367 | valid_loss = self._meas_loader_gen_loss(self._mal_data.valid) 368 | # noinspection PyTypeChecker 369 | test_loss = self._meas_loader_gen_loss(self._mal_data.test) 370 | logging.debug("Final Validation Loss: %.6f", valid_loss) 371 | logging.debug("Final Test Loss: %.6f", test_loss) 372 | 373 | num_mal_test = 0 374 | y_mal_orig, m_prime_arr, bits_changed = [], [], [] 375 | for m, _ in self._mal_data.test: 376 | y_mal_orig.append(self._bb.predict(m.cpu())) 377 | if self._is_cuda: 378 | m = m.cuda() 379 | num_mal_test += m.shape[0] 380 | 381 | m_prime, _ = self._gen.forward(m) 382 | m_prime_arr.append(m_prime.cpu() if self._is_cuda else m_prime) 383 | 384 | m_diff = m_prime - m 385 | bits_changed.append(torch.sum(m_diff.cpu(), dim=1)) 386 | 387 | # Sanity check no bits flipped 1 -> 0 388 | msg = "Malware signature changed to 0 which is not allowed" 389 | assert torch.sum(m_diff < -0.1) == 0, msg 390 | avg_changed_bits = torch.cat(bits_changed).mean() 391 | logging.debug("Avg. 
Malware Bits Changed: %2f", avg_changed_bits) 392 | 393 | # BB prediction of the malware before the generator 394 | y_mal_orig = torch.cat(y_mal_orig) 395 | 396 | # Build an X tensor for prediction using the detector 397 | ben_test_arr = [x.cpu() if self._is_cuda else x for x, _ in self._ben_data.test] 398 | x = torch.cat(m_prime_arr + ben_test_arr) 399 | y_actual = torch.cat((torch.full((num_mal_test,), MalGAN.Label.Malware.value), 400 | torch.full((len(x) - num_mal_test,), MalGAN.Label.Benign.value))) 401 | 402 | y_hat_post = self._bb.predict(x) 403 | if self._is_cuda: 404 | y_mal_orig, y_hat_post, y_actual = y_mal_orig.cpu(), y_hat_post.cpu(), y_actual.cpu() 405 | # noinspection PyProtectedMember 406 | y_prob = self._bb._model.predict_proba(x) # pylint: disable=protected-access 407 | y_prob = y_prob[:, MalGAN.Label.Malware.value] 408 | return _export_results(self, valid_loss, test_loss, avg_changed_bits, y_actual, 409 | y_mal_orig, y_prob, y_hat_post) 410 | 411 | def _save(self, file_path: PathOrStr) -> None: 412 | r""" 413 | Export the specified model to disk. The function creates any directories needed on the path. 414 | All exported models will be relative to \p EXPORT_DIR class object. 415 | 416 | :param file_path: Path to export the model. 417 | """ 418 | if isinstance(file_path, str): 419 | file_path = Path(file_path) 420 | 421 | file_path.parent.mkdir(parents=True, exist_ok=True) 422 | torch.save(self.state_dict(), str(file_path)) 423 | 424 | def forward(self, x: Tensor) -> TensorTuple: # pylint: disable=arguments-differ 425 | r""" 426 | Passes a malware tensor through the generator, which augments it to make it harder to detect 427 | 428 | :param x: Malware binary tensor 429 | :return: :math:`m'` and :math:`g_{\theta}` respectively 430 | """ 431 | return self._gen.forward(x) 432 | 433 | def load(self, filename: PathOrStr) -> None: 434 | r""" 435 | Load a MalGAN object from disk. MalGAN's \p EXPORT_DIR is prepended to the specified 436 | filename. 
437 | 438 | :param filename: Path to the exported torch file 439 | """ 440 | if isinstance(filename, Path): 441 | filename = str(filename) 442 | self.load_state_dict(torch.load(filename)) 443 | self.eval() 444 | # Based on the recommendation of Soumith Chintala et al. in GAN Hacks that enabling dropout 445 | # in evaluation improves performance. Source code based on: 446 | # https://discuss.pytorch.org/t/using-dropout-in-evaluation-mode/27721 447 | for m in self._gen.modules(): 448 | if m.__class__.__name__.startswith('Dropout'): 449 | m.train() 450 | 451 | @staticmethod 452 | def _print_memory_usage() -> None: 453 | """ 454 | Helper function to print the allocated tensor memory. This is used to debug out of memory 455 | GPU errors. 456 | """ 457 | import gc 458 | import operator as op 459 | from functools import reduce 460 | for obj in gc.get_objects(): 461 | # noinspection PyBroadException 462 | try: 463 | if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)): 464 | if len(obj.size()) > 0: # pylint: disable=len-as-condition 465 | obj_tot_size = reduce(op.mul, obj.size()) 466 | else: 467 | obj_tot_size = "NA" 468 | print(obj_tot_size, type(obj), obj.size()) 469 | except: # pylint: disable=bare-except # NOQA E722 470 | pass 471 | -------------------------------------------------------------------------------- /malgan/_export_results.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | from pathlib import Path 3 | from typing import Union 4 | 5 | import numpy as np 6 | 7 | import torch 8 | from sklearn.metrics import confusion_matrix, roc_auc_score 9 | 10 | TensorOrFloat = Union[torch.Tensor, float] 11 | TorchOrNumpy = Union[torch.Tensor, np.ndarray] 12 | 13 | 14 | # noinspection PyProtectedMember,PyUnresolvedReferences 15 | def _export_results(model: 'MalGAN', valid_loss: TensorOrFloat, test_loss: TensorOrFloat, 16 | avg_num_bits_changed: TensorOrFloat, y_actual: np.ndarray, 17 | 
y_mal_orig: TorchOrNumpy, y_prob: TorchOrNumpy, y_hat: np.ndarray) -> str: 18 | r""" 19 | Exports MalGAN results. 20 | 21 | :param model: MalGAN model 22 | :param valid_loss: Average loss on the malware validation set 23 | :param test_loss: Average loss on the malware test set 24 | :param avg_num_bits_changed: Average number of bits changed per malware example 25 | :param y_actual: Actual labels 26 | :param y_mal_orig: Predicted value on the original (unmodified) malware 27 | :param y_prob: Probability of malware 28 | :param y_hat: Predicted labels 29 | :return: Results string 30 | """ 31 | if isinstance(y_prob, torch.Tensor): 32 | y_prob = y_prob.numpy() 33 | if isinstance(y_mal_orig, torch.Tensor): 34 | y_mal_orig = y_mal_orig.numpy() 35 | 36 | results_file = Path("results.csv") 37 | exists = results_file.exists() 38 | with open(results_file, "a+") as f_out: 39 | header = ",".join(["time_completed,M,Z,batch_size,test_set_size,detector_type,activation", 40 | "gen_hidden_dim,discim_hidden_dim", 41 | "avg_validation_loss,avg_test_loss,avg_num_bits_changed", 42 | "auc,orig_mal_detect_rate,mod_mal_detect_rate,ben_mal_detect_rate"]) 43 | if not exists: 44 | f_out.write(header) 45 | 46 | results = ["\n%s" % datetime.datetime.now(), 47 | "%d,%d,%d" % (model.M, model.Z, model.__class__.MALWARE_BATCH_SIZE), 48 | "%d,%s,%s" % (len(y_actual), model._bb.type.name, model._g.__class__.__name__), 49 | "\"%s\",\"%s\"" % (str(model.d_gen), str(model.d_discrim)), 50 | "%.15f,%.15f,%.3f" % (valid_loss, test_loss, avg_num_bits_changed)] 51 | 52 | auc = roc_auc_score(y_actual, y_prob) 53 | results.append("%.8f" % auc) 54 | 55 | # Calculate the detection rate on unmodified malware 56 | results.append("%.8f" % y_mal_orig.mean()) 57 | 58 | # Write the TPR and FPR information 59 | tn, fp, fn, tp = confusion_matrix(y_actual, y_hat).ravel() 60 | tpr, fpr = tp / (tp + fn), fp / (tn + fp) 61 | for rate in [tpr, fpr]: 62 | results.append("%.8f" % rate) 63 | results = ",".join(results) 64 | f_out.write(results) 65 | 66 | return "".join([header,
results]) 67 | -------------------------------------------------------------------------------- /malgan/_log_tools.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import logging 3 | import sys 4 | from decimal import Decimal 5 | from datetime import datetime 6 | from pathlib import Path 7 | from typing import List, Optional, Union, Any 8 | 9 | import torch 10 | from torch import Tensor 11 | 12 | ListOrInt = Union[int, List[int]] 13 | 14 | LOG_DIR = Path(".") 15 | IS_CUDA = torch.cuda.is_available() 16 | 17 | 18 | def setup_logger(quiet_mode: bool, log_level: int = logging.DEBUG, 19 | job_id: Optional[ListOrInt] = None) -> None: 20 | r""" 21 | Logger Configurator 22 | 23 | Configures the test logger. 24 | 25 | :param quiet_mode: True if quiet mode (i.e., disable logging to stdout) is used 26 | :param job_id: Identification number for the job 27 | :param log_level: Level to log 28 | """ 29 | date_format = '%m/%d/%Y %I:%M:%S %p' # Example Time Format - 12/12/2010 11:46:36 AM 30 | format_str = '%(asctime)s -- %(levelname)s -- %(message)s' 31 | 32 | LOG_DIR.mkdir(parents=True, exist_ok=True) 33 | flds = ["logs"] 34 | if job_id is not None: 35 | if isinstance(job_id, int): 36 | job_id = [job_id] 37 | flds += ["_j=", "-".join("%05d" % x for x in job_id)] 38 | flds += ["_", str(datetime.now()).replace(" ", "-"), ".log"] 39 | 40 | filename = LOG_DIR / "".join(flds) 41 | logging.basicConfig(filename=filename, level=log_level, format=format_str, datefmt=date_format) 42 | 43 | # Also print to stdout 44 | if not quiet_mode: 45 | handler = logging.StreamHandler(sys.stdout) 46 | handler.setLevel(log_level) 47 | formatter = logging.Formatter(format_str, datefmt=date_format) 48 | handler.setFormatter(formatter) 49 | logging.getLogger().addHandler(handler) 50 | 51 | logging.info("******************* New Run Beginning *****************") 52 | logging.debug("CUDA: %s", "ENABLED" if IS_CUDA else "Disabled") 53 | logging.info(" ".join(sys.argv)) 54 |
55 | 56 | class TrainingLogger: 57 | r""" Helper class used for standardizing logging """ 58 | FIELD_SEP = " " 59 | DEFAULT_WIDTH = 12 60 | EPOCH_WIDTH = 5 61 | 62 | DEFAULT_FIELD = None 63 | 64 | LOG = logging.info 65 | 66 | def __init__(self, fld_names: List[str], fld_widths: Optional[List[int]] = None): 67 | if fld_widths is None: fld_widths = len(fld_names) * [TrainingLogger.DEFAULT_WIDTH] 68 | if len(fld_widths) != len(fld_names): 69 | raise ValueError("Mismatch in the length of field names and widths") 70 | 71 | self._log = TrainingLogger.LOG # Function used for logging 72 | self._fld_widths = fld_widths 73 | 74 | # Print the column headers 75 | combined_names = ["Epoch"] + fld_names 76 | combined_widths = [TrainingLogger.EPOCH_WIDTH] + fld_widths 77 | fmt_str = TrainingLogger.FIELD_SEP.join(["{:^%d}" % _d for _d in combined_widths]) 78 | self._log(fmt_str.format(*combined_names)) 79 | # Line of separators under the headers (default value is hyphen) 80 | sep_line = TrainingLogger.FIELD_SEP.join(["{:-^%d}" % _w for _w in combined_widths]) 81 | logging.info(sep_line.format(*(len(combined_widths) * [""]))) 82 | 83 | @property 84 | def num_fields(self) -> int: 85 | r""" Number of fields to log """ 86 | return len(self._fld_widths) 87 | 88 | def log(self, epoch: int, values: List[Any]) -> None: 89 | r""" Log the list of values """ 90 | values = self._clean_values_list(values) 91 | format_str = self._build_values_format_str(values) 92 | self._log(format_str.format(epoch, *values)) 93 | 94 | def _build_values_format_str(self, values: List[Any]) -> str: 95 | r""" Constructs a format string based on the values """ 96 | def _get_fmt_str(_w: int, fmt: str) -> str: 97 | return "{:^%d%s}" % (_w, fmt) 98 | 99 | frmt = [_get_fmt_str(self.EPOCH_WIDTH, "d")] 100 | for width, v in zip(self._fld_widths, values): 101 | if isinstance(v, str): fmt_str = "s" 102 | elif isinstance(v, Decimal): fmt_str = ".3E" 103 | elif isinstance(v, int): fmt_str = "d" 104 | elif isinstance(v, 
float): fmt_str = ".4f" 105 | else: raise ValueError("Unknown value type") 106 | 107 | frmt.append(_get_fmt_str(width, fmt_str)) 108 | return TrainingLogger.FIELD_SEP.join(frmt) 109 | 110 | def _clean_values_list(self, values: List[Any]) -> List[Any]: 111 | r""" Modifies values in the \p values list to make them straightforward to log """ 112 | values = copy.deepcopy(values) 113 | # Populate any missing fields 114 | while len(values) < self.num_fields: 115 | values.append(TrainingLogger.DEFAULT_FIELD) 116 | 117 | new_vals = [] 118 | for v in values: 119 | if isinstance(v, bool): v = "+" if v else "" 120 | elif v is None: v = "N/A" 121 | elif isinstance(v, Tensor): v = v.item() 122 | 123 | # Must be separate since v can be a float due to a Tensor 124 | if isinstance(v, float) and (v <= 1E-3 or v >= 1E4): v = Decimal(v) 125 | new_vals.append(v) 126 | return new_vals 127 | -------------------------------------------------------------------------------- /malgan/detector.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | r""" 3 | malgan.detector 4 | ~~~~~~~~~~~~ 5 | 6 | Black box malware detector. 7 | 8 | Based on the paper: "Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN" 9 | By Weiwei Hu and Ying Tan. 10 | 11 | :version: 0.1.0 12 | :copyright: (c) 2019 by Zayd Hammoudeh. 13 | :license: MIT, see LICENSE for more details. 
14 | """ 15 | from enum import Enum 16 | from typing import Union 17 | 18 | import numpy as np 19 | import sklearn 20 | from sklearn.tree import DecisionTreeClassifier 21 | from sklearn.ensemble import RandomForestClassifier 22 | from sklearn.linear_model import LogisticRegression 23 | from sklearn.neural_network import MLPClassifier 24 | from sklearn.svm import SVC 25 | 26 | import torch 27 | from torch import Tensor 28 | 29 | TorchOrNumpy = Union[np.ndarray, Tensor] 30 | 31 | 32 | # noinspection PyPep8Naming 33 | class BlackBoxDetector: 34 | r""" 35 | Black box detector that intends to mimic an antivirus/anti-Malware program that detects whether 36 | a specific program is either malware or benign. 37 | """ 38 | class Type(Enum): 39 | r""" Learner algorithm to be used by the black-box detector """ 40 | DecisionTree = DecisionTreeClassifier() 41 | LogisticRegression = LogisticRegression(solver='lbfgs', max_iter=int(1e6)) 42 | MultiLayerPerceptron = MLPClassifier() 43 | RandomForest = RandomForestClassifier(n_estimators=100) 44 | SVM = SVC(gamma="auto", probability=True) 45 | 46 | @classmethod 47 | def names(cls): 48 | r""" Builds the list of all enum names """ 49 | return [c.name for c in cls] 50 | 51 | @classmethod 52 | def get_from_name(cls, name): 53 | r""" 54 | Gets the enum item from the specified name 55 | 56 | :param name: Name of the enum object 57 | :return: Enum item associated with the specified name 58 | """ 59 | for c in BlackBoxDetector.Type: 60 | if c.name == name: 61 | return c 62 | raise ValueError("Unknown enum \"%s\" for class \"%s\"" % (name, cls.__name__)) 63 | 64 | def __init__(self, learner_type: 'BlackBoxDetector.Type'): 65 | self.type = learner_type 66 | # noinspection PyCallingNonCallable 67 | self._model = sklearn.clone(self.type.value) 68 | self.training = True 69 | 70 | def fit(self, X: TorchOrNumpy, y: TorchOrNumpy): 71 | r""" 72 | Fits the learner. Supports NumPy and PyTorch arrays as input. The fitted model is stored 73 | internally; nothing is returned.
74 | 75 | :param X: Examples upon which to train 76 | :param y: Labels for the examples 77 | """ 78 | if isinstance(X, Tensor): 79 | X = X.cpu().numpy() 80 | if isinstance(y, Tensor): 81 | y = y.cpu().numpy() 82 | self._model.fit(X, y) 83 | self.training = False 84 | 85 | def predict(self, X: TorchOrNumpy) -> Tensor: 86 | r""" 87 | Predict the labels for \p X 88 | 89 | :param X: Set of examples for which labels should be predicted 90 | :return: Predicted labels for \p X 91 | """ 92 | if self.training: 93 | raise ValueError("Detector does not appear to be trained but trying to predict") 94 | # Move tensors to host memory; NumPy arrays are passed through unchanged 95 | if isinstance(X, Tensor): 96 | X = X.cpu().numpy() 97 | 98 | y = torch.from_numpy(self._model.predict(X)).float() 99 | return y.cuda() if torch.cuda.is_available() else y 100 | -------------------------------------------------------------------------------- /malgan/discriminator.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | r""" 3 | malgan.discriminator 4 | ~~~~~~~~~~~~~~~~~ 5 | 6 | Discriminator (i.e., substitute detector) block for MalGAN. 7 | 8 | Based on the paper: "Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN" 9 | By Weiwei Hu and Ying Tan. 10 | 11 | :version: 0.1.0 12 | :copyright: (c) 2019 by Zayd Hammoudeh. 13 | :license: MIT, see LICENSE for more details. 14 | """ 15 | from typing import List 16 | 17 | import torch 18 | from torch import Tensor 19 | import torch.nn as nn 20 | 21 | 22 | # noinspection PyPep8Naming 23 | class Discriminator(nn.Module): 24 | r""" MalGAN discriminator (substitute detector). Simple feed forward network. """ 25 | EPS = 1e-7 26 | 27 | def __init__(self, M: int, hidden_size: List[int], g: nn.Module): 28 | r"""Discriminator Constructor 29 | 30 | Builds the discriminator block. 31 | 32 | :param M: Width of the malware feature vector 33 | :param hidden_size: Width of the hidden layer(s).
34 | :param g: Activation function 35 | """ 36 | super().__init__() 37 | 38 | # Build the feed forward layers. 39 | self._layers = nn.Sequential() 40 | for i, (in_w, out_w) in enumerate(zip([M] + hidden_size[:-1], hidden_size)): 41 | layer = nn.Sequential(nn.Linear(in_w, out_w), g) 42 | self._layers.add_module("FF%02d" % i, layer) 43 | 44 | layer = nn.Sequential(nn.Linear(hidden_size[-1], 1), nn.Sigmoid()) 45 | self._layers.add_module("FF%02d" % len(hidden_size), layer) 46 | 47 | def forward(self, X: Tensor) -> Tensor: 48 | r""" 49 | Forward path through the discriminator. 50 | 51 | :param X: Input example tensor 52 | :return: :math:`D_{sigma}(x)` -- Value predicted by the discriminator. 53 | """ 54 | d_theta = self._layers(X) 55 | return torch.clamp(d_theta, self.EPS, 1. - self.EPS).view(-1) 56 | -------------------------------------------------------------------------------- /malgan/generator.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | r""" 3 | malgan.generator 4 | ~~~~~~~~~~~~~ 5 | 6 | Generator block for MalGAN. 7 | 8 | Based on the paper: "Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN" 9 | By Weiwei Hu and Ying Tan. 10 | 11 | :version: 0.1.0 12 | :copyright: (c) 2019 by Zayd Hammoudeh. 13 | :license: MIT, see LICENSE for more details. 
14 | """ 15 | from typing import List, Tuple 16 | 17 | import torch 18 | from torch import Tensor 19 | import torch.nn as nn 20 | 21 | TensorTuple = Tuple[Tensor, Tensor] 22 | 23 | 24 | class Generator(nn.Module): 25 | r""" MalGAN generator block """ 26 | 27 | # noinspection PyPep8Naming 28 | def __init__(self, M: int, Z: int, hidden_size: List[int], g: nn.Module): 29 | r"""Generator Constructor 30 | 31 | :param M: Dimension of the feature vector \p m 32 | :param Z: Dimension of the noise vector \p z 33 | :param hidden_size: Width of the hidden layer(s) 34 | :param g: Activation function 35 | """ 36 | super().__init__() 37 | 38 | self._Z = Z 39 | 40 | # Build the feed forward net 41 | self._layers, dim = nn.Sequential(), [M + self._Z] + hidden_size 42 | for i, (d_in, d_out) in enumerate(zip(dim[:-1], dim[1:])): 43 | self._layers.add_module("FF%02d" % i, nn.Sequential(nn.Linear(d_in, d_out), g)) 44 | 45 | # Last layer is always sigmoid 46 | layer = nn.Sequential(nn.Linear(dim[-1], M), nn.Sigmoid()) 47 | self._layers.add_module("FF%02d" % len(dim), layer) 48 | 49 | # noinspection PyUnresolvedReferences 50 | def forward(self, m: torch.Tensor, 51 | z: torch.Tensor = None) -> TensorTuple: # pylint: disable=arguments-differ 52 | r""" 53 | Forward pass through the generator. Automatically generates the noise vector \p z that 54 | is coupled with \p m. 55 | 56 | :param m: Input vector :math:`m` 57 | :param z: Noise vector :math:`z`. If no random vector is specified, the random vector is 58 | generated within this function call via a call to \p torch.rand 59 | :return: Tuple of (:math:`m'`, :math:`G_{\theta_{g}}`), i.e., the output tensor with the 60 | feature predictions as well as the smoothed prediction that can be used for 61 | back-propagation. 
62 | """ 63 | if z is None: 64 | num_ele = m.shape[0] 65 | z = torch.rand((num_ele, self._Z), device=m.device) 66 | 67 | # Concatenation of m and z 68 | o = torch.cat((m, z), dim=1) 69 | o = self._layers(o) 70 | g_theta = torch.max(m, o) # Element-wise max: bits set in m are never cleared 71 | 72 | m_prime = (g_theta > 0.5).float() 73 | return m_prime, g_theta 74 | -------------------------------------------------------------------------------- /malware_gan_poster.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ZaydH/MalwareGAN/ea3f4e5139e6343c26273db0299a4b9d96d814af/malware_gan_poster.pdf -------------------------------------------------------------------------------- /malware_gan_report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ZaydH/MalwareGAN/ea3f4e5139e6343c26273db0299a4b9d96d814af/malware_gan_report.pdf -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.16.0 2 | torch==1.2.0 3 | typing>=3.6.6 4 | scikit_learn>=0.20.2 5 | tensorboardX==1.6 6 | --------------------------------------------------------------------------------
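The last two steps of `Generator.forward` in malgan/generator.py (an element-wise max of the original malware bits and the network output, followed by a 0.5 threshold) can be illustrated without PyTorch. The following is a minimal sketch on plain Python lists, not the repository's implementation; the helper name `binarize_with_preserved_bits` and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def binarize_with_preserved_bits(m: List[float], o: List[float],
                                 threshold: float = 0.5) -> Tuple[List[float], List[float]]:
    """Hypothetical helper mirroring the max-then-threshold step on plain lists."""
    if len(m) != len(o):
        raise ValueError("m and o must have the same length")
    # Element-wise max: a feature already set in the original malware (1.0) stays at 1.0
    g_theta = [max(m_i, o_i) for m_i, o_i in zip(m, o)]
    # Hard threshold yields the binary adversarial example m'
    m_prime = [1.0 if g > threshold else 0.0 for g in g_theta]
    return m_prime, g_theta


m = [1.0, 0.0, 0.0, 1.0]  # original binary malware features (illustrative values)
o = [0.2, 0.9, 0.4, 0.7]  # sigmoid outputs of the generator network (illustrative values)
m_prime, g_theta = binarize_with_preserved_bits(m, o)
print(m_prime)  # [1.0, 1.0, 0.0, 1.0]
print(g_theta)  # [1.0, 0.9, 0.4, 1.0]
```

Because the max is taken against `m`, the sketch can only turn features on, never off, which matches the intent stated in the generator's inline comment; the smoothed `g_theta` is what remains differentiable for back-propagation in the real model.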