├── .gitignore
├── LICENSE
├── README.md
├── data
    ├── trial_ben.npy
    └── trial_mal.npy
├── main.py
├── malgan
    ├── __init__.py
    ├── _export_results.py
    ├── _log_tools.py
    ├── detector.py
    ├── discriminator.py
    └── generator.py
├── malware_gan_poster.pdf
├── malware_gan_report.pdf
└── requirements.txt


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Miscellaneous Files
  2 | .idea/
  3 | tags
  4 | .DS_Store
  5 | *.swp
  6 | 
  7 | # Byte-compiled / optimized / DLL files
  8 | __pycache__/
  9 | *.py[cod]
 10 | *$py.class
 11 | 
 12 | # C extensions
 13 | *.so
 14 | 
 15 | # Distribution / packaging
 16 | .Python
 17 | build/
 18 | develop-eggs/
 19 | dist/
 20 | downloads/
 21 | eggs/
 22 | .eggs/
 23 | lib/
 24 | lib64/
 25 | parts/
 26 | sdist/
 27 | var/
 28 | wheels/
 29 | *.egg-info/
 30 | .installed.cfg
 31 | *.egg
 32 | MANIFEST
 33 | 
 34 | # PyInstaller
 35 | #  Usually these files are written by a python script from a template
 36 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 37 | *.manifest
 38 | *.spec
 39 | 
 40 | # Installer logs
 41 | pip-log.txt
 42 | pip-delete-this-directory.txt
 43 | 
 44 | # Unit test / coverage reports
 45 | htmlcov/
 46 | .tox/
 47 | .coverage
 48 | .coverage.*
 49 | .cache
 50 | nosetests.xml
 51 | coverage.xml
 52 | *.cover
 53 | .hypothesis/
 54 | .pytest_cache/
 55 | 
 56 | # Translations
 57 | *.mo
 58 | *.pot
 59 | 
 60 | # Django stuff:
 61 | *.log
 62 | local_settings.py
 63 | db.sqlite3
 64 | 
 65 | # Flask stuff:
 66 | instance/
 67 | .webassets-cache
 68 | 
 69 | # Scrapy stuff:
 70 | .scrapy
 71 | 
 72 | # Sphinx documentation
 73 | docs/_build/
 74 | 
 75 | # PyBuilder
 76 | target/
 77 | 
 78 | # Jupyter Notebook
 79 | .ipynb_checkpoints
 80 | 
 81 | # pyenv
 82 | .python-version
 83 | 
 84 | # celery beat schedule file
 85 | celerybeat-schedule
 86 | 
 87 | # SageMath parsed files
 88 | *.sage.py
 89 | 
 90 | # Environments
 91 | .env
 92 | .venv
 93 | env/
 94 | venv/
 95 | ENV/
 96 | env.bak/
 97 | venv.bak/
 98 | 
 99 | # Spyder project settings
100 | .spyderproject
101 | .spyproject
102 | 
103 | # Rope project settings
104 | .ropeproject
105 | 
106 | # mkdocs documentation
107 | /site
108 | 
109 | # mypy
110 | .mypy_cache/
111 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2019 Zayd Hammoudeh
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Adversarial Malware Generation Using GANs
 2 | 
 3 | [![docs](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/ZaydH/MalwareGAN/blob/master/LICENSE)
 4 | 
 5 | Implementation of a Generative Adversarial Network (GAN) that can create adversarial malware examples.  The work is inspired by **MalGAN** in the paper "[*Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN*](https://arxiv.org/abs/1702.05983)" by Weiwei Hu and Ying Tan.
 6 | 
 7 | Framework written in [PyTorch](https://pytorch.org/) and supports CUDA.
 8 | 
 9 | ## Running the Script
10 | 
11 | The malware GAN is provided as a package in the folder `malgan`.  A driver script is provided in `main.py`, which processes input arguments via `argparse`.  The basic interface is:
12 | 
13 |     python main.py Z BATCH_SIZE NUM_EPOCHS MALWARE_FILE BENIGN_FILE
14 | 
15 | * `Z` -- Dimension of the latent vector.  Must be a positive integer.
16 | * `BATCH_SIZE` -- Batch size for *malicious* examples.  The benign batch size is proportional to `BATCH_SIZE` and the fraction of total training samples that are benign.
 17 | * `NUM_EPOCHS` -- Maximum number of training epochs.
 18 | * `MALWARE_FILE` -- Path to a serialized `numpy` or `torch` matrix where each row represents a single **malware** file's binary feature vector.
 19 | * `BENIGN_FILE` -- Path to a serialized `numpy` or `torch` matrix where each row represents a single **benign** file's binary feature vector.
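
As a rough illustration of the `BATCH_SIZE` bullet above: the constructor in `malgan/__init__.py` scales the malware batch size by the ratio of benign to malware samples and truncates the result to an integer. The helper name `benign_batch_size` and the sample counts below are hypothetical and exist only for this sketch:

```python
def benign_batch_size(malware_batch_size: int, num_malware: int, num_benign: int) -> int:
    """Mirror of the batch-size arithmetic in MalGAN.__init__ (sketch only)."""
    # Ratio of benign to malware examples in the full input datasets
    ben_bs_frac = num_benign / num_malware
    # int() truncates toward zero, exactly as in the constructor
    return int(ben_bs_frac * malware_batch_size)


# Hypothetical dataset sizes, chosen only to show the scaling
larger_benign_set = benign_batch_size(32, num_malware=1000, num_benign=3000)
smaller_benign_set = benign_batch_size(32, num_malware=1000, num_benign=500)
```

Because the result is truncated, a benign set that is much smaller than the malware set can yield a very small benign batch size, so it is worth checking this value before a long run.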
20 | 
21 | For checkout purposes, we recommend calling:
22 | 
23 |     python main.py 10 32 100 data/trial_mal.npy data/trial_ben.npy 
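
For trying the script on your own data, the sketch below shows one way to write a `.npy` file in the layout described above: one row per file and one binary feature per column. It is only an assumed example. The helper `make_toy_feature_file`, the file names, and the random contents are hypothetical and are not part of this repository.

```python
import tempfile
from pathlib import Path

import numpy as np


def make_toy_feature_file(path: Path, num_files: int, num_features: int,
                          seed: int = 0) -> np.ndarray:
    """Write a random 0/1 feature matrix to ``path`` and return it (sketch only)."""
    rng = np.random.RandomState(seed)
    # Each row is one file's binary feature vector
    x = rng.randint(0, 2, size=(num_files, num_features)).astype(np.float32)
    np.save(str(path), x)
    return x


tmp_dir = Path(tempfile.mkdtemp())
mal = make_toy_feature_file(tmp_dir / "toy_mal.npy", num_files=50, num_features=128, seed=1)
ben = make_toy_feature_file(tmp_dir / "toy_ben.npy", num_files=80, num_features=128, seed=2)

# Reload to confirm the on-disk layout matches what was written
reloaded = np.load(str(tmp_dir / "toy_mal.npy"))
```

The malware and benign matrices must have the same number of columns, since the `MalGAN` constructor raises a `ValueError` when the feature counts differ.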
24 | 
25 | ## Dataset
26 | 
 27 | A trial dataset is included with this implementation in the `data` folder.  The data was published in the repository: [yanminglai/Malware-GAN](https://github.com/yanminglai/Malware-GAN).  This dataset should only be used for proof of concept and initial trials.
28 | 
 29 | We recommend the SLEIPNIR dataset.  It was published by Al-Dujaili et al.  The authors requested that the dataset not be shared publicly, and we respect that request.  However, researchers and students may request access directly from the authors as described on their [GitHub repository](https://github.com/ALFA-group/robust-adv-malware-detection).  Look for the link to the Google form.
30 | 
31 | ## CUDA Support
32 | 
33 | The implementation supports both CPU and CUDA (i.e., GPU) execution.  If CUDA is detected on the system, the implementation defaults to CUDA support.
34 | 
35 | ## Requirements
36 | 
 37 | This program was tested with Python 3.6.5 on macOS and on Debian Linux.  `requirements.txt` enumerates the exact packages used. A summary of the key requirements is below:
38 | 
39 | * PyTorch (`torch`) -- Ver. 1.2.0
40 | * Scikit-Learn (`sklearn`) -- Ver. 0.20.2
41 | * NumPy (`numpy`)
42 | * TensorboardX -- If runtime profiling is not required, this can be removed.
43 | 


--------------------------------------------------------------------------------
/data/trial_ben.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ZaydH/MalwareGAN/ea3f4e5139e6343c26273db0299a4b9d96d814af/data/trial_ben.npy


--------------------------------------------------------------------------------
/data/trial_mal.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ZaydH/MalwareGAN/ea3f4e5139e6343c26273db0299a4b9d96d814af/data/trial_mal.npy


--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | r"""
  3 |     Main module for testing and debugging the MalGAN implementation.
  4 | 
  5 |     :version: 0.1.0
  6 |     :copyright: (c) 2019 by Zayd Hammoudeh.
  7 |     :license: MIT, see LICENSE for more details.
  8 | """
  9 | 
 10 | import argparse
 11 | import pickle
 12 | import sys
 13 | from typing import Union
 14 | from pathlib import Path
 15 | 
 16 | import numpy as np
 17 | from malgan import MalGAN, MalwareDataset, BlackBoxDetector, setup_logger
 18 | 
 19 | import torch
 20 | from torch import nn
 21 | 
 22 | 
 23 | 
 24 | def parse_args() -> argparse.Namespace:
 25 |     r"""
 26 |     Parse the command line arguments
 27 | 
 28 |     :return: Parsed argument structure
 29 |     """
 30 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 31 | 
 32 |     parser.add_argument("Z", help="Dimension of the latent vector", type=int, default=10)
 33 |     parser.add_argument("batch_size", help="Batch size", type=int, default=32)
 34 |     parser.add_argument("num_epoch", help="Number of training epochs", type=int, default=100)
 35 | 
  36 |     msg = "Data file containing the %s feature vectors"
 37 |     for x in ["malware", "benign"]:
 38 |         parser.add_argument(x[:3] + "_file", help=msg % x, type=str, default="data/%s.npy" % x)
 39 | 
 40 |     parser.add_argument("-q", help="Quiet mode", action='store_true', default=False)
 41 | 
  42 |     help_msg = " ".join(["Dimension of the hidden layer(s) in the GENERATOR.",
  43 |                          "Multiple layers should be space separated"])
 44 |     parser.add_argument("--gen-hidden-sizes", help=help_msg, type=int,
 45 |                         default=[256, 256], nargs="+")
 46 | 
  47 |     help_msg = " ".join(["Dimension of the hidden layer(s) in the DISCRIMINATOR.",
  48 |                          "Multiple layers should be space separated"])
 49 |     parser.add_argument("--discrim-hidden-sizes", help=help_msg, type=int,
 50 |                         default=[256, 256], nargs="+")
 51 | 
 52 |     help_msg = " ".join(["Activation function for the generator and discriminator hidden",
 53 |                          "layer(s). Valid choices (case insensitive) are: \"ReLU\", \"ELU\",",
 54 |                          "\"LeakyReLU\", \"tanh\" and \"sigmoid\"."])
 55 |     parser.add_argument("--activation", help=help_msg, type=str, default="LeakyReLU")
 56 | 
 57 |     help_msg = ["Learner algorithm used in the black box detector. Valid choices (case ",
 58 |                 "insensitive) include:"]
 59 |     names = BlackBoxDetector.Type.names()
 60 |     for i, type_name in enumerate(names):
 61 |         if i > 0 and len(names) > 2:  # Need three options for a comma to make sense
 62 |             help_msg.append(",")
 63 |         if len(names) > 1 and i == len(names) - 1:  # And only makes sense if at least two options
 64 |             help_msg.append(" and")
 65 |         help_msg.extend([" \"", type_name, "\""])
 66 |     help_msg.append(".")
 67 |     parser.add_argument("--detector", help="".join(help_msg), type=str,
 68 |                         default=BlackBoxDetector.Type.RandomForest.name)
 69 | 
 70 |     help_msg = "Print the results to the console. Intended for slurm results analysis"
 71 |     parser.add_argument("--print-results", help=help_msg, action="store_true", default=False)
 72 | 
 73 |     args = parser.parse_args()
 74 |     # noinspection PyTypeChecker
 75 |     args.activation = _configure_activation_function(args.activation)
 76 |     args.detector = BlackBoxDetector.Type.get_from_name(args.detector)
 77 | 
  78 |     # Check the malware and benign files exist
 79 |     args.mal_file = Path(args.mal_file)
 80 |     args.ben_file = Path(args.ben_file)
 81 |     for (name, path) in (("malware", args.mal_file), ("benign", args.ben_file)):
 82 |         if path.exists(): continue
  83 |         print(f"Unknown {name} file \"{path}\"")
 84 |         sys.exit(1)
 85 |     return args
 86 | 
 87 | 
 88 | def _configure_activation_function(act_func_name: str) -> nn.Module:
 89 |     r"""
 90 |     Parse the activation function from a string and return the corresponding activation function
  91 |     PyTorch module.  If the activation function cannot be found, a \p ValueError is raised.
  92 | 
  93 |     **Note**: Activation function check is case insensitive.
  94 | 
  95 |     :param act_func_name: Name of the activation function to look up
 96 |     :return: Activation function module associated with the passed name.
 97 |     """
 98 |     act_func_name = act_func_name.lower()  # Make case insensitive
 99 |     # Supported activation functions
100 |     act_funcs = [("relu", nn.ReLU), ("elu", nn.ELU), ("leakyrelu", nn.LeakyReLU), ("tanh", nn.Tanh),
101 |                  ("sigmoid", nn.Sigmoid)]
102 |     for func_name, module in act_funcs:
103 |         if act_func_name == func_name.lower():
104 |             return module
105 |     raise ValueError("Unknown activation function: \"%s\"" % act_func_name)
106 | 
107 | 
108 | def load_dataset(file_path: Union[str, Path], y: int) -> MalwareDataset:
109 |     r"""
 110 |     Extracts the input data from disk and packages it into the format expected by \p MalGAN.
 111 |     Supports loading files from numpy, torch, and pickle.  Other formats (based on the file
 112 |     extension) will result in a \p ValueError.
 113 | 
 114 |     :param file_path: Path to a data file containing the feature matrix for either the benign
 115 |                       or the malware data.
 116 |     :param y: Y value for dataset
 117 |     :return: \p MalwareDataset object wrapping the loaded data with label \p y.
118 |     """
119 |     file_ext = Path(file_path).suffix
120 |     if file_ext in {".npy", ".npz"}:
121 |         data = np.load(file_path)
122 |     elif file_ext in {".pt", ".pth"}:
123 |         data = torch.load(str(file_path))
124 |     elif file_ext == ".pk":
125 |         with open(str(file_path), "rb") as f_in:
126 |             data = pickle.load(f_in)
127 |     else:
128 |         raise ValueError("Unknown file extension.  Cannot determine how to import")
129 |     return MalwareDataset(x=data, y=y)
130 | 
131 | 
132 | def main():
133 |     args = parse_args()
134 |     setup_logger(args.q)
135 | 
136 |     MalGAN.MALWARE_BATCH_SIZE = args.batch_size
137 | 
138 |     malgan = MalGAN(load_dataset(args.mal_file, MalGAN.Label.Malware.value),
139 |                     load_dataset(args.ben_file, MalGAN.Label.Benign.value),
140 |                     Z=args.Z,
141 |                     h_gen=args.gen_hidden_sizes,
142 |                     h_discrim=args.discrim_hidden_sizes,
143 |                     g_hidden=args.activation,
144 |                     detector_type=args.detector)
145 |     malgan.fit(args.num_epoch, quiet_mode=args.q)
146 |     results = malgan.measure_and_export_results()
147 |     if args.print_results:
148 |         print(results)
149 | 
150 | 
151 | if __name__ == "__main__":
152 |     main()
153 | 
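
The loop in `parse_args` that assembles the `--detector` help text can be read in isolation as a small list-formatting routine. The sketch below restates that logic as a standalone function. The name `format_choice_list` and the placeholder names passed to it are hypothetical, and the function is not part of `main.py`.

```python
from typing import List


def format_choice_list(names: List[str]) -> str:
    """Join quoted names as in the --detector help text (sketch only)."""
    parts = []
    for i, name in enumerate(names):
        # A comma only makes sense when there are three or more options
        if i > 0 and len(names) > 2:
            parts.append(",")
        # "and" precedes the last option when there are at least two
        if len(names) > 1 and i == len(names) - 1:
            parts.append(" and")
        parts.extend([' "', name, '"'])
    parts.append(".")
    return "".join(parts)


# Placeholder option names, not the actual detector types
one_choice = format_choice_list(["A"])
two_choices = format_choice_list(["A", "B"])
three_choices = format_choice_list(["A", "B", "C"])
```

Each entry carries its own leading space, so the result can be appended directly to a sentence ending in "include:".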


--------------------------------------------------------------------------------
/malgan/__init__.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | r"""
  3 |     malgan.__init__
  4 |     ~~~~~~~~~~~~~~~
  5 | 
  6 |     MalGAN complete architecture.
  7 | 
  8 |     Based on the paper: "Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN"
  9 |     By Weiwei Hu and Ying Tan.
 10 | 
 11 |     :version: 0.1.0
 12 |     :copyright: (c) 2019 by Zayd Hammoudeh.
 13 |     :license: MIT, see LICENSE for more details.
 14 | """
 15 | import logging
 16 | import os
 17 | from enum import Enum
 18 | from typing import List, Tuple, Union
 19 | from pathlib import Path
 20 | 
 21 | import numpy as np
 22 | 
 23 | import tensorboardX
 24 | import torch
 25 | from torch import Tensor
 26 | import torch.nn as nn
 27 | import torch.optim as optim
 28 | from torch.optim.optimizer import Optimizer
 29 | import torch.utils.data
 30 | from torch.utils.data import Dataset, DataLoader, Subset
 31 | 
 32 | from ._export_results import _export_results
 33 | from ._log_tools import setup_logger, TrainingLogger
 34 | from .detector import BlackBoxDetector
 35 | from .discriminator import Discriminator
 36 | from .generator import Generator
 37 | 
 38 | ListOrInt = Union[List[int], int]
 39 | PathOrStr = Union[str, Path]
 40 | TensorTuple = Tuple[Tensor, Tensor]
 41 | 
 42 | 
 43 | IS_CUDA = torch.cuda.is_available()
 44 | if IS_CUDA:
 45 |     device = torch.device('cuda:0')
 46 |     # noinspection PyUnresolvedReferences
 47 |     torch.set_default_tensor_type(torch.cuda.FloatTensor)
 48 | 
 49 | 
 50 | class MalwareDataset(Dataset):
 51 |     r"""
 52 |     Encapsulates a malware dataset.  All elements in the dataset will be either malware or benign
 53 |     """
 54 |     def __init__(self, x: Union[np.ndarray, Tensor], y):
 55 |         super().__init__()
 56 | 
 57 |         if isinstance(x, np.ndarray):
 58 |             x = torch.from_numpy(x).float()
 59 |         self.x = x
 60 |         self.y = y
 61 | 
 62 |     def __getitem__(self, index):
 63 |         return self.x[index], self.y
 64 | 
 65 |     def __len__(self):
 66 |         return self.x.shape[0]
 67 | 
 68 |     @property
 69 |     def num_features(self):
 70 |         r""" Number of features in the dataset """
 71 |         return self.x.shape[1]
 72 | 
 73 | 
 74 | class _DataGroup:  # pylint: disable=too-few-public-methods
 75 |     r"""
 76 |     Encapsulates either PyTorch DataLoaders or Datasets.  This class is intended only for internal
 77 |     use by MalGAN.
 78 |     """
 79 |     def __init__(self, train: MalwareDataset, valid: MalwareDataset, test: MalwareDataset):
 80 |         self.train = train
 81 |         self.valid = valid
 82 |         self.test = test
 83 |         self.is_loaders = False
 84 | 
 85 |     def build_loader(self, batch_size: int = 0):
 86 |         r"""
 87 |         Constructs loaders from the datasets
 88 | 
 89 |         :param batch_size: Batch size for training
 90 |         """
 91 |         self.train = DataLoader(self.train, batch_size=batch_size, shuffle=True, pin_memory=True)
 92 |         if self.valid:
 93 |             self.valid = DataLoader(self.valid, batch_size=batch_size, pin_memory=True)
 94 |         self.test = DataLoader(self.test, batch_size=batch_size, pin_memory=True)
 95 |         self.is_loaders = True
 96 | 
 97 | 
 98 | # noinspection PyPep8Naming
 99 | class MalGAN(nn.Module):
100 |     r""" Malware Generative Adversarial Network based on the work of Hu & Tan. """
101 | 
102 |     MALWARE_BATCH_SIZE = 32
103 | 
104 |     SAVED_MODEL_DIR = Path("saved_models")
105 | 
106 |     VALIDATION_SPLIT = 0.2
107 | 
108 |     tensorboard = None
109 | 
110 |     class Label(Enum):
111 |         r""" Label value assigned to malware and benign examples """
112 |         Malware = 1
113 |         Benign = 0
114 | 
115 |     # noinspection PyPep8Naming
116 |     def __init__(self, mal_data: MalwareDataset, ben_data: MalwareDataset, Z: int,
117 |                  h_gen: ListOrInt, h_discrim: ListOrInt,
118 |                  test_split: float = 0.2,
119 |                  g_hidden: nn.Module = nn.LeakyReLU,
120 |                  detector_type: BlackBoxDetector.Type = BlackBoxDetector.Type.LogisticRegression):
121 |         r"""
122 |         Malware Generative Adversarial Network Constructor
123 | 
124 |         :param mal_data: Malware training dataset.
125 |         :param ben_data: Benign training dataset.
126 |         :param Z: Dimension of the noise vector \p z
127 |         :param test_split: Fraction of input data to be used for testing
128 |         :param h_gen: Width of the hidden layer(s) in the GENERATOR.  If only a single hidden
129 |                       layer is desired, then this can be only an integer.
130 |         :param h_discrim: Width of the hidden layer(s) in the DISCRIMINATOR.  If only a single
131 |                           hidden layer is desired, then this can be only an integer.
132 |         :param detector_type: Learning algorithm to be used by the black-box detector
133 |         """
134 |         super().__init__()
135 | 
136 |         if mal_data.num_features != ben_data.num_features:
137 |             raise ValueError("Mismatch in the number of features between malware and benign data")
138 |         if Z <= 0:
 139 |             raise ValueError("Z must be a positive integer")
140 |         if test_split <= 0. or test_split >= 1.:
141 |             raise ValueError("test_split must be in the range (0,1)")
142 |         self._M, self._Z = mal_data.num_features, Z  # pylint: disable=invalid-name
143 | 
144 |         # Format the hidden layer sizes and make sure all are valid values
145 |         if isinstance(h_gen, int):
146 |             h_gen = [h_gen]
147 |         if isinstance(h_discrim, int):
148 |             h_discrim = [h_discrim]
149 |         self.d_discrim, self.d_gen = h_discrim, h_gen
150 |         for h_size in [self.d_discrim, self.d_gen]:
151 |             for w in h_size:
152 |                 if w <= 0:
153 |                     raise ValueError("All hidden layer widths must be positive integers.")
154 | 
155 |         if not isinstance(g_hidden, nn.Module):
156 |             g_hidden = g_hidden()
157 |         self._g = g_hidden
158 | 
159 |         self._is_cuda = IS_CUDA
160 | 
161 |         logging.debug("Constructing new MalGAN")
162 |         logging.debug("Malware Dimension (M): %d", self.M)
163 |         logging.debug("Latent Dimension (Z): %d", self.Z)
164 |         logging.debug("Test Split Ratio: %.3f", test_split)
165 |         logging.debug("Generator Hidden Layer Sizes: %s", h_gen)
166 |         logging.debug("Discriminator Hidden Layer Sizes: %s", h_discrim)
167 |         logging.debug("Blackbox Detector Type: %s", detector_type.name)
168 |         logging.debug("Activation Type: %s", self._g.__class__.__name__)
169 | 
170 |         self._bb = BlackBoxDetector(detector_type)
171 |         self._gen = Generator(M=self.M, Z=self.Z, hidden_size=h_gen, g=self._g)
172 |         self._discrim = Discriminator(M=self.M, hidden_size=h_discrim, g=self._g)
173 | 
174 |         def split_train_valid_test(dataset: Dataset, is_benign: bool):
175 |             """Helper function to partition into test, train, and validation subsets"""
176 |             valid_len = 0 if is_benign else int(MalGAN.VALIDATION_SPLIT * len(dataset))
177 |             test_len = int(test_split * len(dataset))
178 | 
179 |             # Order must be train, validation, test
180 |             lengths = [len(dataset) - valid_len - test_len, valid_len, test_len]
181 |             return _DataGroup(*torch.utils.data.random_split(dataset, lengths))
182 | 
183 |         # Split between train, test, and validation then construct the loaders
184 |         self._mal_data = split_train_valid_test(mal_data, is_benign=False)
185 |         self._ben_data = split_train_valid_test(ben_data, is_benign=True)
186 |         # noinspection PyTypeChecker
187 |         self._fit_blackbox(self._mal_data.train, self._ben_data.train)
188 | 
189 |         self._mal_data.build_loader(MalGAN.MALWARE_BATCH_SIZE)
190 |         ben_bs_frac = len(ben_data) / len(mal_data)
191 |         self._ben_data.build_loader(int(ben_bs_frac * MalGAN.MALWARE_BATCH_SIZE))
192 |         # Set CUDA last to ensure all parameters defined
193 |         if self._is_cuda: self.cuda()
194 | 
195 |     @property
196 |     def M(self) -> int:
197 |         r"""Width of the malware feature vector"""
198 |         return self._M
199 | 
200 |     @property
201 |     def Z(self) -> int:
202 |         r"""Width of the generator latent noise vector"""
203 |         return self._Z
204 | 
205 |     def _fit_blackbox(self, mal_train: Subset, ben_train: Subset) -> None:
206 |         r"""
 207 |         Fits the blackbox detector using the specified malware and benign training sets.
208 | 
209 |         :param mal_train: Malware training dataset
210 |         :param ben_train: Benign training dataset
211 |         """
212 |         def extract_x(ds: Subset) -> Tensor:
213 |             # noinspection PyUnresolvedReferences
214 |             x = ds.dataset.x[ds.indices]
215 |             return x.cpu() if self._is_cuda else x
216 | 
217 |         mal_x = extract_x(mal_train)
218 |         ben_x = extract_x(ben_train)
219 |         merged_x = torch.cat((mal_x, ben_x))
220 | 
221 |         merged_y = torch.cat((torch.full((len(mal_train),), MalGAN.Label.Malware.value),
222 |                               torch.full((len(ben_train),), MalGAN.Label.Benign.value)))
223 |         logging.debug("Starting training of blackbox detector of type \"%s\"", self._bb.type.name)
224 |         self._bb.fit(merged_x, merged_y)
225 |         logging.debug("COMPLETED training of blackbox detector of type \"%s\"", self._bb.type.name)
226 | 
227 |     def fit(self, cyc_len: int, quiet_mode: bool = False) -> None:
228 |         r"""
229 |         Trains the model for the specified number of epochs.  The epoch with the best validation
230 |         loss is used as the final model.
231 | 
232 |         :param cyc_len: Number of cycles (epochs) to train the model.
233 |         :param quiet_mode: True if no printing to console should occur in this function
234 |         """
235 |         if cyc_len <= 0:
236 |             raise ValueError("At least a single training cycle is required.")
237 | 
238 |         MalGAN.tensorboard = tensorboardX.SummaryWriter()
239 | 
240 |         d_optimizer = optim.Adam(self._discrim.parameters(), lr=1e-5)
241 |         g_optimizer = optim.Adam(self._gen.parameters(), lr=1e-4)
242 | 
243 |         if not quiet_mode:
244 |             names = ["Gen Train Loss", "Gen Valid Loss", "Discrim Train Loss", "Best?"]
245 |             log = TrainingLogger(names, [20, 20, 20, 7])
246 | 
247 |         best_epoch, best_loss = None, np.inf
248 |         for epoch_cnt in range(1, cyc_len + 1):
249 |             train_l_g, train_l_d = self._fit_epoch(g_optimizer, d_optimizer)
250 |             for block, loss in [("Generator", train_l_g), ("Discriminator", train_l_d)]:
251 |                 MalGAN.tensorboard.add_scalar('Train_%s_Loss' % block, loss, epoch_cnt)
252 | 
253 |             # noinspection PyTypeChecker
254 |             valid_l_g = self._meas_loader_gen_loss(self._mal_data.valid)
255 |             MalGAN.tensorboard.add_scalar('Validation_Generator_Loss', valid_l_g, epoch_cnt)
256 |             flds = [train_l_g, valid_l_g, train_l_d, valid_l_g < best_loss]
257 |             if flds[-1]:
258 |                 self._save(self._build_export_name(is_final=False))
259 |                 best_loss = valid_l_g
260 |             if not quiet_mode: log.log(epoch_cnt, flds)
261 |         MalGAN.tensorboard.close()
262 | 
263 |         self.load(self._build_export_name(is_final=False))
264 |         self._save(self._build_export_name(is_final=True))
265 |         self._delete_old_backup(is_final=False)
266 | 
 267 |     def _build_export_name(self, is_final: bool = True) -> Path:
268 |         r"""
269 |         Builds the name that will be used when exporting the model.
270 | 
271 |         :param is_final: If \p True, then file name is for final (i.e., not training) model
272 |         :return: Model name built from the model's parameters
273 |         """
274 |         name = ["malgan", "z=%d" % self.Z,
275 |                 "d-gen=%s" % str(self.d_gen).replace(" ", "_"),
276 |                 "d-disc=%s" % str(self.d_discrim).replace(" ", "_"),
277 |                 "bs=%d" % MalGAN.MALWARE_BATCH_SIZE,
278 |                 "bb=%s" % self._bb.type.name, "g=%s" % self._g.__class__.__name__,
279 |                 "final" if is_final else "tmp"]
280 | 
281 |         # Either add an epoch name or
282 |         return MalGAN.SAVED_MODEL_DIR / "".join(["_".join(name).lower(), ".pth"])
283 | 
 284 |     def _delete_old_backup(self, is_final: bool = True) -> None:
285 |         r"""
286 |         Helper function to delete old backed up models
287 | 
288 |         :param is_final: If \p True, then file name is for final (i.e., not training) model
289 |         """
290 |         backup_name = self._build_export_name(is_final)
291 |         try:
292 |             os.remove(backup_name)
293 |         except OSError:
294 |             logging.warning("Error trying to delete model: %s", backup_name)
295 | 
 296 |     def _fit_epoch(self, g_optim: Optimizer, d_optim: Optimizer) -> Tuple[float, float]:
 297 |         r"""
 298 |         Trains a single entire epoch
 299 | 
 300 |         :param g_optim: Generator optimizer
 301 |         :param d_optim: Discriminator optimizer
 302 |         :return: Average generator and discriminator training losses
303 |         """
304 |         tot_l_g = tot_l_d = 0
305 |         num_batch = min(len(self._mal_data.train), len(self._ben_data.train))
306 | 
307 |         for (m, _), (b, _) in zip(self._mal_data.train, self._ben_data.train):
308 |             if self._is_cuda: m, b = m.cuda(), b.cuda()
309 |             m_prime, g_theta = self._gen.forward(m)
310 |             l_g = self._calc_gen_loss(g_theta)
311 |             g_optim.zero_grad()
312 |             l_g.backward()
313 |             # torch.nn.utils.clip_grad_value_(l_g, 1)
314 |             g_optim.step()
315 |             tot_l_g += l_g
316 | 
317 |             # Update the discriminator
318 |             for x in [m_prime, b]:
319 |                 l_d = self._calc_discrim_loss(x)
320 |                 d_optim.zero_grad()
321 |                 l_d.backward()
322 |                 # torch.nn.utils.clip_grad_value_(l_d, 1)
323 |                 d_optim.step()
324 |                 tot_l_d += l_d
325 |         # noinspection PyUnresolvedReferences
326 |         return (tot_l_g / num_batch).item(), (tot_l_d / num_batch).item()
327 | 
328 |     def _meas_loader_gen_loss(self, loader: DataLoader) -> float:
329 |         r""" Calculate the generator loss on malware dataset """
330 |         loss = 0
331 |         for m, _ in loader:
332 |             if self._is_cuda: m = m.cuda()
333 |             _, g_theta = self._gen.forward(m)
334 |             loss += self._calc_gen_loss(g_theta)
335 |         # noinspection PyUnresolvedReferences
336 |         return (loss / len(loader)).item()
337 | 
338 |     def _calc_gen_loss(self, g_theta: Tensor) -> Tensor:
339 |         r"""
340 |         Calculates the parameter :math:`L_{G}` as defined in Eq. (3) of Hu & Tan's paper.
341 | 
 342 |         :param g_theta: :math:`G_{\theta_g}(m,z)` in Eq. (1) of Hu & Tan's paper
343 |         :return: Loss for the generator smoothed output.
344 |         """
345 |         d_theta = self._discrim.forward(g_theta)
346 |         return d_theta.log().mean()
347 | 
348 |     def _calc_discrim_loss(self, X: Tensor) -> Tensor:
349 |         r"""
350 |         Calculates the parameter :math:`L_{D}` as defined in Eq. (2) of Hu & Tan's paper.
351 | 
352 |         :param X: Examples to calculate the loss over.  May be a mix of benign and malware samples.
353 |         """
354 |         d_theta = self._discrim.forward(X)
355 | 
356 |         y_hat = self._bb.predict(X)
357 |         d = torch.where(y_hat == MalGAN.Label.Malware.value, d_theta, 1 - d_theta)
358 |         return -d.log().mean()
359 | 
360 |     def measure_and_export_results(self) -> str:
361 |         r"""
362 |         Measure the test accuracy and provide results information
363 | 
364 |         :return: Results information as a comma separated string
365 |         """
366 |         # noinspection PyTypeChecker
367 |         valid_loss = self._meas_loader_gen_loss(self._mal_data.valid)
368 |         # noinspection PyTypeChecker
369 |         test_loss = self._meas_loader_gen_loss(self._mal_data.test)
370 |         logging.debug("Final Validation Loss: %.6f", valid_loss)
371 |         logging.debug("Final Test Loss: %.6f", test_loss)
372 | 
373 |         num_mal_test = 0
374 |         y_mal_orig, m_prime_arr, bits_changed = [], [], []
375 |         for m, _ in self._mal_data.test:
376 |             y_mal_orig.append(self._bb.predict(m.cpu()))
377 |             if self._is_cuda:
378 |                 m = m.cuda()
379 |             num_mal_test += m.shape[0]
380 | 
381 |             m_prime, _ = self._gen.forward(m)
382 |             m_prime_arr.append(m_prime.cpu() if self._is_cuda else m_prime)
383 | 
384 |             m_diff = m_prime - m
385 |             bits_changed.append(torch.sum(m_diff.cpu(), dim=1))
386 | 
387 |             # Sanity check no bits flipped 1 -> 0
388 |             msg = "Malware signature changed to 0 which is not allowed"
389 |             assert torch.sum(m_diff < -0.1) == 0, msg
390 |         avg_changed_bits = torch.cat(bits_changed).mean()
391 |         logging.debug("Avg. Malware Bits Changed: %.2f", avg_changed_bits)
392 | 
393 |         # BB prediction of the malware before the generator
394 |         y_mal_orig = torch.cat(y_mal_orig)
395 | 
396 |         # Build an X tensor for prediction using the detector
397 |         ben_test_arr = [x.cpu() if self._is_cuda else x for x, _ in self._ben_data.test]
398 |         x = torch.cat(m_prime_arr + ben_test_arr)
399 |         y_actual = torch.cat((torch.full((num_mal_test,), MalGAN.Label.Malware.value),
400 |                              torch.full((len(x) - num_mal_test,), MalGAN.Label.Benign.value)))
401 | 
402 |         y_hat_post = self._bb.predict(x)
403 |         if self._is_cuda:
404 |             y_mal_orig, y_hat_post, y_actual = y_mal_orig.cpu(), y_hat_post.cpu(), y_actual.cpu()
405 |         # noinspection PyProtectedMember
406 |         y_prob = self._bb._model.predict_proba(x)  # pylint: disable=protected-access
407 |         y_prob = y_prob[:, MalGAN.Label.Malware.value]
408 |         return _export_results(self, valid_loss, test_loss, avg_changed_bits, y_actual,
409 |                                y_mal_orig, y_prob, y_hat_post)
410 | 
411 |     def _save(self, file_path: PathOrStr) -> None:
412 |         r"""
413 |         Export the specified model to disk.  The function creates any files needed on the path.
414 |         All exported models will be relative to \p EXPORT_DIR class object.
415 | 
416 |         :param file_path: Path to export the model.
417 |         """
418 |         if isinstance(file_path, str):
419 |             file_path = Path(file_path)
420 | 
421 |         file_path.parent.mkdir(parents=True, exist_ok=True)
422 |         torch.save(self.state_dict(), str(file_path))
423 | 
424 |     def forward(self, x: Tensor) -> TensorTuple:  # pylint: disable=arguments-differ
425 |         r"""
426 |         Passes a malware tensor through the generator, which augments it to evade detection.
427 | 
428 |         :param x: Malware binary tensor
429 |         :return: :math:`m'` and :math:`g_{\theta}` respectively
430 |         """
431 |         return self._gen.forward(x)
432 | 
433 |     def load(self, filename: PathOrStr) -> None:
434 |         r"""
435 |         Load a MalGAN object from disk.  The specified filename is passed to \p torch.load
436 |         as given.
437 | 
438 |         :param filename: Path to the exported torch file
439 |         """
440 |         if isinstance(filename, Path):
441 |             filename = str(filename)
442 |         self.load_state_dict(torch.load(filename))
443 |         self.eval()
444 |         # Soumith Chintala et al.'s "GAN Hacks" recommend keeping dropout enabled in the
445 |         # generator at evaluation time to improve performance. Source code based on:
446 |         # https://discuss.pytorch.org/t/using-dropout-in-evaluation-mode/27721
447 |         for m in self._gen.modules():
448 |             if m.__class__.__name__.startswith('Dropout'):
449 |                 m.train()
450 | 
451 |     @staticmethod
452 |     def _print_memory_usage() -> None:
453 |         """
454 |         Helper function to print the allocated tensor memory.  This is used to debug out of memory
455 |         GPU errors.
456 |         """
457 |         import gc
458 |         import operator as op
459 |         from functools import reduce
460 |         for obj in gc.get_objects():
461 |             # noinspection PyBroadException
462 |             try:
463 |                 if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
464 |                     if len(obj.size()) > 0:  # pylint: disable=len-as-condition
465 |                         obj_tot_size = reduce(op.mul, obj.size())
466 |                     else:
467 |                         obj_tot_size = "NA"
468 |                     print(obj_tot_size, type(obj), obj.size())
469 |             except:  # pylint: disable=bare-except  # NOQA E722
470 |                 pass
471 | 
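
The two loss methods above (`_calc_gen_loss` and `_calc_discrim_loss`) reduce to averages of log-probabilities. The following is a minimal, dependency-free sketch of that arithmetic, not the repository's implementation: the helper names `gen_loss` and `discrim_loss` are hypothetical, the label encoding (1 for malware, 0 for benign) is an assumption, and the inputs are assumed to be discriminator outputs already clamped strictly inside (0, 1).

```python
import math
from typing import List

MALWARE, BENIGN = 1, 0  # Assumed label encoding for this illustration only


def gen_loss(d_out: List[float]) -> float:
    """Mean of log D(G(m, z)); minimizing it drives D's malware score toward 0."""
    return sum(math.log(d) for d in d_out) / len(d_out)


def discrim_loss(d_out: List[float], bb_labels: List[int]) -> float:
    """Negative mean log-likelihood of the black-box labels under D."""
    total = 0.0
    for d, y in zip(d_out, bb_labels):
        # D outputs the probability of "malware"; use 1 - d for benign labels
        total += math.log(d if y == MALWARE else 1.0 - d)
    return -total / len(d_out)
```

When the discriminator agrees with the black-box labels, `discrim_loss` is small; when it disagrees, the loss grows, which is what lets the discriminator act as a differentiable substitute for the detector.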


--------------------------------------------------------------------------------
/malgan/_export_results.py:
--------------------------------------------------------------------------------
 1 | import datetime
 2 | from pathlib import Path
 3 | from typing import Union
 4 | 
 5 | import numpy as np
 6 | 
 7 | import torch
 8 | from sklearn.metrics import confusion_matrix, roc_auc_score
 9 | 
10 | TensorOrFloat = Union[torch.Tensor, float]
11 | TorchOrNumpy = Union[torch.Tensor, np.ndarray]
12 | 
13 | 
14 | # noinspection PyProtectedMember,PyUnresolvedReferences
15 | def _export_results(model: 'MalGAN', valid_loss: TensorOrFloat, test_loss: TensorOrFloat,
16 |                     avg_num_bits_changed: TensorOrFloat, y_actual: np.ndarray,
17 |                     y_mal_orig: TorchOrNumpy, y_prob: TorchOrNumpy, y_hat: np.ndarray) -> str:
18 |     r"""
19 |     Exports MalGAN results.
20 | 
21 |     :param model: MalGAN model
22 |     :param valid_loss: Average loss on the malware validation set
23 |     :param test_loss: Average loss on the malware test set
 24 |     :param avg_num_bits_changed: Average number of bits changed per malware sample
 25 |     :param y_actual: Actual labels
 26 |     :param y_mal_orig: Predicted value on the original (unmodified) malware
 27 |     :param y_prob: Probability of malware
 28 |     :param y_hat: Predicted labels
29 |     :return: Results string
30 |     """
31 |     if isinstance(y_prob, torch.Tensor):
32 |         y_prob = y_prob.numpy()
33 |     if isinstance(y_mal_orig, torch.Tensor):
34 |         y_mal_orig = y_mal_orig.numpy()
35 | 
36 |     results_file = Path("results.csv")
37 |     exists = results_file.exists()
38 |     with open(results_file, "a+") as f_out:
39 |         header = ",".join(["time_completed,M,Z,batch_size,test_set_size,detector_type,activation",
 40 |                            "gen_hidden_dim,discrim_hidden_dim",
41 |                            "avg_validation_loss,avg_test_loss,avg_num_bits_changed",
42 |                            "auc,orig_mal_detect_rate,mod_mal_detect_rate,ben_mal_detect_rate"])
43 |         if not exists:
44 |             f_out.write(header)
45 | 
46 |         results = ["\n%s" % datetime.datetime.now(),
47 |                    "%d,%d,%d" % (model.M, model.Z, model.__class__.MALWARE_BATCH_SIZE),
48 |                    "%d,%s,%s" % (len(y_actual), model._bb.type.name, model._g.__class__.__name__),
49 |                    "\"%s\",\"%s\"" % (str(model.d_gen), str(model.d_discrim)),
50 |                    "%.15f,%.15f,%.3f" % (valid_loss, test_loss, avg_num_bits_changed)]
51 | 
52 |         auc = roc_auc_score(y_actual, y_prob)
53 |         results.append("%.8f" % auc)
54 | 
55 |         # Calculate the detection rate on unmodified malware
56 |         results.append("%.8f" % y_mal_orig.mean())
57 | 
 58 |         # Write the TPR and FPR information
59 |         tn, fp, fn, tp = confusion_matrix(y_actual, y_hat).ravel()
60 |         tpr, fpr = tp / (tp + fn), fp / (tn + fp)
61 |         for rate in [tpr, fpr]:
62 |             results.append("%.8f" % rate)
63 |         results = ",".join(results)
64 |         f_out.write(results)
65 | 
66 |         return "".join([header, results])
67 | 
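
The last two columns written by `_export_results` are the true positive rate on the modified malware and the false positive rate on the benign samples. As a rough illustration of that calculation without scikit-learn, the sketch below counts the confusion-matrix cells by hand. The function name `detection_rates` is hypothetical, and treating label 1 as malware is an assumption.

```python
from typing import Sequence, Tuple


def detection_rates(y_actual: Sequence[int], y_hat: Sequence[int]) -> Tuple[float, float]:
    """Return (TPR, FPR), treating label 1 as malware (assumed encoding)."""
    tp = sum(1 for a, p in zip(y_actual, y_hat) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(y_actual, y_hat) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(y_actual, y_hat) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(y_actual, y_hat) if a == 0 and p == 0)
    return tp / (tp + fn), fp / (tn + fp)


# Four malware samples (two still detected) and four benign (one false alarm)
tpr, fpr = detection_rates([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 1, 0, 0, 0])
print(tpr, fpr)  # → 0.5 0.25
```

A lower TPR after the generator is applied means more adversarial malware slips past the detector.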


--------------------------------------------------------------------------------
/malgan/_log_tools.py:
--------------------------------------------------------------------------------
  1 | import copy
  2 | import logging
  3 | import sys
  4 | from decimal import Decimal
  5 | from datetime import datetime
  6 | from pathlib import Path
  7 | from typing import List, Optional, Union, Any
  8 | 
  9 | import torch
 10 | from torch import Tensor
 11 | 
 12 | ListOrInt = Union[int, List[int]]
 13 | 
 14 | LOG_DIR = Path(".")
 15 | IS_CUDA = torch.cuda.is_available()
 16 | 
 17 | 
 18 | def setup_logger(quiet_mode: bool, log_level: int = logging.DEBUG,
 19 |                  job_id: Optional[ListOrInt] = None) -> None:
 20 |     r"""
 21 |     Logger Configurator
 22 | 
 23 |     Configures the test logger.
 24 | 
 25 |     :param quiet_mode: True if quiet mode (i.e., disable logging to stdout) is used
 26 |     :param job_id: Identification number for the job
 27 |     :param log_level: Level to log
 28 |     """
 29 |     date_format = '%m/%d/%Y %I:%M:%S %p'  # Example Time Format - 12/12/2010 11:46:36 AM
 30 |     format_str = '%(asctime)s -- %(levelname)s -- %(message)s'
 31 | 
 32 |     LOG_DIR.mkdir(parents=True, exist_ok=True)
 33 |     flds = ["logs"]
 34 |     if job_id is not None:
 35 |         if isinstance(job_id, int):
 36 |             job_id = [job_id]
 37 |         flds += ["_j=", "-".join("%05d" % x for x in job_id)]
 38 |     flds += ["_", str(datetime.now()).replace(" ", "-"), ".log"]
 39 | 
 40 |     filename = LOG_DIR / "".join(flds)
 41 |     logging.basicConfig(filename=filename, level=log_level, format=format_str, datefmt=date_format)
 42 | 
 43 |     # Also print to stdout
 44 |     if not quiet_mode:
 45 |         handler = logging.StreamHandler(sys.stdout)
 46 |         handler.setLevel(log_level)
 47 |         formatter = logging.Formatter(format_str)
 48 |         handler.setFormatter(formatter)
 49 |         logging.getLogger().addHandler(handler)
 50 | 
 51 |     logging.info("******************* New Run Beginning *****************")
 52 |     logging.debug("CUDA: %s", "ENABLED" if IS_CUDA else "Disabled")
 53 |     logging.info(" ".join(sys.argv))
 54 | 
 55 | 
 56 | class TrainingLogger:
 57 |     r""" Helper class used for standardizing logging """
 58 |     FIELD_SEP = " "
 59 |     DEFAULT_WIDTH = 12
 60 |     EPOCH_WIDTH = 5
 61 | 
 62 |     DEFAULT_FIELD = None
 63 | 
 64 |     LOG = logging.info
 65 | 
 66 |     def __init__(self, fld_names: List[str], fld_widths: Optional[List[int]] = None):
 67 |         if fld_widths is None: fld_widths = len(fld_names) * [TrainingLogger.DEFAULT_WIDTH]
 68 |         if len(fld_widths) != len(fld_names):
 69 |             raise ValueError("Mismatch in the length of field names and widths")
 70 | 
 71 |         self._log = TrainingLogger.LOG  # Function used for logging
 72 |         self._fld_widths = fld_widths
 73 | 
 74 |         # Print the column headers
 75 |         combined_names = ["Epoch"] + fld_names
 76 |         combined_widths = [TrainingLogger.EPOCH_WIDTH] + fld_widths
 77 |         fmt_str = TrainingLogger.FIELD_SEP.join(["{:^%d}" % _d for _d in combined_widths])
 78 |         self._log(fmt_str.format(*combined_names))
 79 |         # Line of separators under the headers (default value is hyphen)
 80 |         sep_line = TrainingLogger.FIELD_SEP.join(["{:-^%d}" % _w for _w in combined_widths])
 81 |         self._log(sep_line.format(*(len(combined_widths) * [""])))
 82 | 
 83 |     @property
 84 |     def num_fields(self) -> int:
 85 |         r""" Number of fields to log """
 86 |         return len(self._fld_widths)
 87 | 
 88 |     def log(self, epoch: int, values: List[Any]) -> None:
 89 |         r""" Log the list of values """
 90 |         values = self._clean_values_list(values)
 91 |         format_str = self._build_values_format_str(values)
 92 |         self._log(format_str.format(epoch, *values))
 93 | 
 94 |     def _build_values_format_str(self, values: List[Any]) -> str:
 95 |         r""" Constructs a format string based on the values """
 96 |         def _get_fmt_str(_w: int, fmt: str) -> str:
 97 |             return "{:^%d%s}" % (_w, fmt)
 98 | 
 99 |         frmt = [_get_fmt_str(self.EPOCH_WIDTH, "d")]
100 |         for width, v in zip(self._fld_widths, values):
101 |             if isinstance(v, str): fmt_str = "s"
102 |             elif isinstance(v, Decimal): fmt_str = ".3E"
103 |             elif isinstance(v, int): fmt_str = "d"
104 |             elif isinstance(v, float): fmt_str = ".4f"
105 |             else: raise ValueError("Unknown value type")
106 | 
107 |             frmt.append(_get_fmt_str(width, fmt_str))
108 |         return TrainingLogger.FIELD_SEP.join(frmt)
109 | 
110 |     def _clean_values_list(self, values: List[Any]) -> List[Any]:
111 |         r""" Modifies values in the \p values list to make them straightforward to log """
112 |         values = copy.deepcopy(values)
113 |         # Populate any missing fields
114 |         while len(values) < self.num_fields:
115 |             values.append(TrainingLogger.DEFAULT_FIELD)
116 | 
117 |         new_vals = []
118 |         for v in values:
119 |             if isinstance(v, bool): v = "+" if v else ""
120 |             elif v is None: v = "N/A"
121 |             elif isinstance(v, Tensor): v = v.item()
122 | 
123 |             # Must be separate since v can be a float due to a Tensor
124 |             if isinstance(v, float) and (v <= 1E-3 or v >= 1E4): v = Decimal(v)
125 |             new_vals.append(v)
126 |         return new_vals
127 | 
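
`TrainingLogger` builds each row by generating one centered format field per column and joining the fields with a separator. The sketch below isolates that idea. The function name `build_row_format` is hypothetical and the column widths are chosen only for illustration.

```python
from typing import List


def build_row_format(widths: List[int], type_codes: List[str]) -> str:
    """Join one centered format field per column, e.g. '{:^5d} {:^12.4f}'."""
    fields = ["{:^%d%s}" % (w, code) for w, code in zip(widths, type_codes)]
    return " ".join(fields)


fmt = build_row_format([5, 12], ["d", ".4f"])
row = fmt.format(3, 0.5)  # Epoch 3 with a loss of 0.5
print(repr(row))
```

Because every field has a fixed width, rows line up under the column headers regardless of the values logged.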


--------------------------------------------------------------------------------
/malgan/detector.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | r"""
  3 |     malgan.detector
  4 |     ~~~~~~~~~~~~
  5 | 
  6 |     Black box malware detector.
  7 | 
  8 |     Based on the paper: "Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN"
  9 |     By Weiwei Hu and Ying Tan.
 10 | 
 11 |     :version: 0.1.0
 12 |     :copyright: (c) 2019 by Zayd Hammoudeh.
 13 |     :license: MIT, see LICENSE for more details.
 14 | """
 15 | from enum import Enum
 16 | from typing import Union
 17 | 
 18 | import numpy as np
 19 | import sklearn
 20 | from sklearn.tree import DecisionTreeClassifier
 21 | from sklearn.ensemble import RandomForestClassifier
 22 | from sklearn.linear_model import LogisticRegression
 23 | from sklearn.neural_network import MLPClassifier
 24 | from sklearn.svm import SVC
 25 | 
 26 | import torch
 27 | from torch import Tensor
 28 | 
 29 | TorchOrNumpy = Union[np.ndarray, Tensor]
 30 | 
 31 | 
 32 | # noinspection PyPep8Naming
 33 | class BlackBoxDetector:
 34 |     r"""
 35 |     Black box detector that intends to mimic an antivirus/anti-Malware program that detects whether
 36 |     a specific program is either malware or benign.
 37 |     """
 38 |     class Type(Enum):
 39 |         r""" Learner algorithm to be used by the black-box detector """
 40 |         DecisionTree = DecisionTreeClassifier()
 41 |         LogisticRegression = LogisticRegression(solver='lbfgs', max_iter=int(1e6))
 42 |         MultiLayerPerceptron = MLPClassifier()
 43 |         RandomForest = RandomForestClassifier(n_estimators=100)
 44 |         SVM = SVC(gamma="auto")
 45 | 
 46 |         @classmethod
 47 |         def names(cls):
 48 |             r""" Builds the list of all enum names """
 49 |             return [c.name for c in cls]
 50 | 
 51 |         @classmethod
 52 |         def get_from_name(cls, name):
 53 |             r"""
 54 |             Gets the enum item from the specified name
 55 | 
 56 |             :param name: Name of the enum object
 57 |             :return: Enum item associated with the specified name
 58 |             """
 59 |             for c in BlackBoxDetector.Type:
 60 |                 if c.name == name:
 61 |                     return c
 62 |             raise ValueError("Unknown enum \"%s\" for class \"%s\"" % (name, cls.__name__))
 63 | 
 64 |     def __init__(self, learner_type: 'BlackBoxDetector.Type'):
 65 |         self.type = learner_type
 66 |         # noinspection PyCallingNonCallable
 67 |         self._model = sklearn.clone(self.type.value)
 68 |         self.training = True
 69 | 
 70 |     def fit(self, X: TorchOrNumpy, y: TorchOrNumpy):
 71 |         r"""
 72 |         Fits the learner.  Supports NumPy arrays and PyTorch tensors as input.  Marks the
 73 |         detector as trained so that \p predict may be called.
 74 | 
 75 |         :param X: Examples upon which to train
 76 |         :param y: Labels for the examples
 77 |         """
 78 |         if isinstance(X, Tensor):
 79 |             X = X.numpy()
 80 |         if isinstance(y, Tensor):
 81 |             y = y.numpy()
 82 |         self._model.fit(X, y)
 83 |         self.training = False
 84 | 
 85 |     def predict(self, X: TorchOrNumpy) -> Tensor:
 86 |         r"""
 87 |         Predict the labels for \p X
 88 | 
 89 |         :param X: Set of examples for which labels should be predicted
 90 |         :return: Predicted labels for \p X as a float tensor
 91 |         """
 92 |         if self.training:
 93 |             raise ValueError("Detector does not appear to be trained but trying to predict")
 94 |         # Move tensors to the CPU (a no-op for CPU tensors) before converting;
 95 |         # NumPy inputs are used as-is
 96 |         if isinstance(X, Tensor):
 97 |             X = X.cpu().numpy()
 98 |         y = torch.from_numpy(self._model.predict(X)).float()
 99 |         return y.cuda() if torch.cuda.is_available() else y
100 | 
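
`get_from_name` looks up an enum member by its name. Python's `Enum` classes already support this through subscripting, so one possible simplification is sketched below. `DetectorType` is a hypothetical stand-in for `BlackBoxDetector.Type`, with placeholder string values instead of scikit-learn estimators.

```python
from enum import Enum


class DetectorType(Enum):
    """Hypothetical stand-in for BlackBoxDetector.Type with placeholder values."""
    DecisionTree = "dt"
    RandomForest = "rf"

    @classmethod
    def get_from_name(cls, name: str) -> "DetectorType":
        try:
            return cls[name]  # Enum classes support lookup by member name
        except KeyError:
            raise ValueError('Unknown enum "%s" for class "%s"' % (name, cls.__name__))
```

Raising `ValueError` with an already-formatted message keeps the error readable when it reaches the command line.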


--------------------------------------------------------------------------------
/malgan/discriminator.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | r"""
 3 |     malgan.discriminator
 4 |     ~~~~~~~~~~~~~~~~~
 5 | 
 6 |     Discriminator (i.e., substitute detector) block for MalGAN.
 7 | 
 8 |     Based on the paper: "Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN"
 9 |     By Weiwei Hu and Ying Tan.
10 | 
11 |     :version: 0.1.0
12 |     :copyright: (c) 2019 by Zayd Hammoudeh.
13 |     :license: MIT, see LICENSE for more details.
14 | """
15 | from typing import List
16 | 
17 | import torch
18 | from torch import Tensor
19 | import torch.nn as nn
20 | 
21 | 
22 | # noinspection PyPep8Naming
23 | class Discriminator(nn.Module):
24 |     r""" MalGAN discriminator (substitute detector).  Simple feed forward network. """
25 |     EPS = 1e-7
26 | 
27 |     def __init__(self, M: int, hidden_size: List[int], g: nn.Module):
28 |         r"""Discriminator Constructor
29 | 
30 |         Builds the discriminator block.
31 | 
32 |         :param M: Width of the malware feature vector
33 |         :param hidden_size: Width of the hidden layer(s).
34 |         :param g: Activation function
35 |         """
36 |         super().__init__()
37 | 
38 |         # Build the feed forward layers.
39 |         self._layers = nn.Sequential()
40 |         for i, (in_w, out_w) in enumerate(zip([M] + hidden_size[:-1], hidden_size)):
41 |             layer = nn.Sequential(nn.Linear(in_w, out_w), g)
42 |             self._layers.add_module("FF%02d" % i, layer)
43 | 
44 |         layer = nn.Sequential(nn.Linear(hidden_size[-1], 1), nn.Sigmoid())
45 |         self._layers.add_module("FF%02d" % len(hidden_size), layer)
46 | 
47 |     def forward(self, X: Tensor) -> Tensor:
48 |         r"""
49 |         Forward path through the discriminator.
50 | 
51 |         :param X: Input example tensor
 52 |         :return: :math:`D_{\theta_d}(x)` -- Value predicted by the discriminator.
53 |         """
54 |         d_theta = self._layers(X)
55 |         return torch.clamp(d_theta, self.EPS, 1. - self.EPS).view(-1)
56 | 
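
The discriminator constructor pairs each layer's input width with its output width by zipping two offset lists. The sketch below shows only that pairing, without PyTorch. The function name `layer_shapes` is hypothetical and the widths are example values.

```python
from typing import List, Tuple


def layer_shapes(M: int, hidden_size: List[int]) -> List[Tuple[int, int]]:
    """List (in_width, out_width) per Linear layer, ending with the 1-unit output."""
    shapes = list(zip([M] + hidden_size[:-1], hidden_size))
    shapes.append((hidden_size[-1], 1))  # Final sigmoid layer emits one score
    return shapes


print(layer_shapes(128, [256, 64]))  # → [(128, 256), (256, 64), (64, 1)]
```

Note that, like the constructor above, this sketch assumes `hidden_size` contains at least one entry.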


--------------------------------------------------------------------------------
/malgan/generator.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | r"""
 3 |     malgan.generator
 4 |     ~~~~~~~~~~~~~
 5 | 
 6 |     Generator block for MalGAN.
 7 | 
 8 |     Based on the paper: "Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN"
 9 |     By Weiwei Hu and Ying Tan.
10 | 
11 |     :version: 0.1.0
12 |     :copyright: (c) 2019 by Zayd Hammoudeh.
13 |     :license: MIT, see LICENSE for more details.
14 | """
15 | from typing import List, Tuple
16 | 
17 | import torch
18 | from torch import Tensor
19 | import torch.nn as nn
20 | 
21 | TensorTuple = Tuple[Tensor, Tensor]
22 | 
23 | 
24 | class Generator(nn.Module):
25 |     r""" MalGAN generator block """
26 | 
27 |     # noinspection PyPep8Naming
28 |     def __init__(self, M: int, Z: int, hidden_size: List[int], g: nn.Module):
29 |         r"""Generator Constructor
30 | 
31 |         :param M: Dimension of the feature vector \p m
32 |         :param Z: Dimension of the noise vector \p z
33 |         :param hidden_size: Width of the hidden layer(s)
34 |         :param g: Activation function
35 |         """
36 |         super().__init__()
37 | 
38 |         self._Z = Z
39 | 
40 |         # Build the feed forward net
41 |         self._layers, dim = nn.Sequential(), [M + self._Z] + hidden_size
42 |         for i, (d_in, d_out) in enumerate(zip(dim[:-1], dim[1:])):
43 |             self._layers.add_module("FF%02d" % i, nn.Sequential(nn.Linear(d_in, d_out), g))
44 | 
45 |         # Last layer is always sigmoid
46 |         layer = nn.Sequential(nn.Linear(dim[-1], M), nn.Sigmoid())
47 |         self._layers.add_module("FF%02d" % len(dim), layer)
48 | 
49 |     # noinspection PyUnresolvedReferences
50 |     def forward(self, m: torch.Tensor,
51 |                 z: torch.Tensor = None) -> TensorTuple:  # pylint: disable=arguments-differ
52 |         r"""
53 |         Forward pass through the generator.  Automatically generates the noise vector \p z that
54 |         is coupled with \p m.
55 | 
56 |         :param m: Input vector :math:`m`
57 |         :param z: Noise vector :math:`z`.  If no random vector is specified, the random vector is
58 |                   generated within this function call via a call to \p torch.rand
59 |         :return: Tuple of (:math:`m'`, :math:`G_{\theta_{g}}`), i.e., the output tensor with the
60 |                  feature predictions as well as the smoothed prediction that can be used for
61 |                  back-propagation.
62 |         """
63 |         if z is None:
64 |             num_ele = m.shape[0]
 65 |             z = torch.rand((num_ele, self._Z), device=m.device)
66 | 
67 |         # Concatenation of m and z
68 |         o = torch.cat((m, z), dim=1)
69 |         o = self._layers.forward(o)
 70 |         g_theta = torch.max(m, o)  # Ensure bits set in m are never cleared
71 | 
72 |         m_prime = (g_theta > 0.5).float()
73 |         return m_prime, g_theta
74 | 
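
The final two steps of `Generator.forward` take the element-wise maximum of the input and the network output, then threshold at 0.5. The sketch below mimics those steps on plain Python lists to show why a feature that is set in the input can never be cleared. The function name `binarize` and the sample values are hypothetical.

```python
from typing import List, Tuple


def binarize(m: List[float], o: List[float]) -> Tuple[List[float], List[float]]:
    """Mimic torch.max(m, o) followed by thresholding at 0.5, element-wise."""
    g_theta = [max(mi, oi) for mi, oi in zip(m, o)]
    m_prime = [1.0 if g > 0.5 else 0.0 for g in g_theta]
    return m_prime, g_theta


m = [1.0, 0.0, 0.0, 1.0]  # Original binary malware features
o = [0.2, 0.9, 0.3, 0.7]  # Hypothetical sigmoid outputs of the network
m_prime, g_theta = binarize(m, o)
print(m_prime)  # → [1.0, 1.0, 0.0, 1.0]
```

Every position where `m` is 1 stays 1 in `m_prime`, so the generator can only add features. This is the property that the sanity check in `measure_and_export_results` asserts.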


--------------------------------------------------------------------------------
/malware_gan_poster.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ZaydH/MalwareGAN/ea3f4e5139e6343c26273db0299a4b9d96d814af/malware_gan_poster.pdf


--------------------------------------------------------------------------------
/malware_gan_report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ZaydH/MalwareGAN/ea3f4e5139e6343c26273db0299a4b9d96d814af/malware_gan_report.pdf


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy>=1.16.0
2 | torch==1.2.0
3 | typing>=3.6.6
4 | scikit_learn>=0.20.2
5 | tensorboardX==1.6
6 | 


--------------------------------------------------------------------------------