├── .gitignore
├── README.md
├── config.py
├── data_preprocess.py
├── data_utils
    ├── __init__.py
    ├── bin_to_img.py
    ├── data_loaders.py
    ├── extract_data.py
    ├── extract_opcode.py
    ├── extract_pe_features.py
    └── misc.py
├── detect_malware.py
├── models
    ├── AnnModels.py
    ├── CnnModels.py
    ├── RnnModels.py
    ├── Shallow_ML_models.py
    ├── TransferLearnModels.py
    ├── __init__.py
    ├── model_trainers_testers.py
    └── models_utils.py
└── utils.py


/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | models/__pycache__
3 | data_utils/__pycache__
4 | data/
5 | logs/
6 | .idea/
7 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Malware Classification using classical Machine Learning and Deep Learning
  2 | 
  3 | This repository is the official implementation of the research mentioned in the chapter **"An Empirical Analysis of Image-Based Learning Techniques for Malware Classification"** of the Book **"Malware Analysis Using Artificial Intelligence and Deep Learning"**
  4 | 
  5 | The book or chapters can be purchased from: https://link.springer.com/chapter/10.1007/978-3-030-62582-5_16
  6 | 
  7 | The arXiv eprint is at: https://arxiv.org/abs/2103.13827
  8 | 
  9 | ![alt text](https://media.springernature.com/w306/springer-static/cover-hires/book/978-3-030-62582-5)
 10 | 
 11 | 
 12 | ### Abstract
 13 | 
 14 | In this chapter, we consider malware classification using deep learning techniques and image-based features. We employ a wide variety of deep learning techniques, including multilayer perceptrons (MLP), convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU). Among our CNN experiments, transfer learning plays a prominent role—specifically, we test the VGG-19 and ResNet152 models. As compared to previous work, the results presented in this chapter are based on a larger and more diverse malware dataset, we consider a wider array of features, and we experiment with a much greater variety of learning techniques. Consequently, our results are the most comprehensive and complete that have yet been published.
 15 | 
 16 | ### Quick Notes:
 17 | 
 18 | * Classic ML-based approaches tried : K-NN, Random Forest, and XGBoost
 19 | * Deep Learning-based approaches tried: ANN, CNN, LSTM, and GRU
 20 | * The implementation uses scikit-learn, NumPy, pandas, and PyTorch.
 21 | * MS Windows executable binary files are used as data.
 22 | * Features
 23 |   * Classic ML-based approaches: PE file features are extracted and used
 24 |   * Deep Learning-based approaches: (1) Opcodes (2) Converted executables into gray-scale images
 25 | * This project is an extension of https://github.com/pratikpv/malware_classification
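To make the opcode feature above concrete, here is a minimal sketch of how a mnemonic can be pulled out of one line of `objdump` disassembly, which is tab-separated as `address`, `raw bytes`, `instruction`. The function name `opcode_from_objdump_line` and the sample line are assumptions for illustration only. The repository's own `generate_opcode` in `data_utils/extract_opcode.py` follows the same idea but differs in details, such as splitting on a single space.

```python
def opcode_from_objdump_line(line):
    """Return the mnemonic of one objdump disassembly line, or None.

    Hypothetical helper for illustration; not part of the repository.
    """
    fields = line.split("\t")
    if len(fields) < 3:
        # Headers, labels and blank lines have no instruction column.
        return None
    tokens = fields[2].split()
    return tokens[0] if tokens else None


if __name__ == "__main__":
    sample = "  401000:\t55                   \tpush   %ebp"
    print(opcode_from_objdump_line(sample))
    print(opcode_from_objdump_line("Disassembly of section .text:"))
```

Lines without an instruction column yield `None`, so a caller can simply skip them when building the opcode sequence for a sample.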
 26 | 
 27 | ### Steps to reproduce
 28 | 
 29 | # Package requirements
 30 | 
 31 | * Install the pefile Python package, e.g. `conda install pefile`
 32 | * Install PyTorch and other libraries, e.g. `conda install -c pytorch torchtext`. All other common dependencies should be covered by the Anaconda distribution.
 33 | * Install `objdump` (part of GNU binutils) on Ubuntu. This code was developed and tested in an Ubuntu-based development environment.
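One way to confirm the requirements above before running anything is a small check script. This is only a sketch: `check_requirements` is a hypothetical helper that does not exist in the repository, and the default module list is an assumption based on the imports visible in the source files.

```python
import importlib.util
import shutil


def check_requirements(modules=("pefile", "torch", "torchtext"), tools=("objdump",)):
    """Return {name: bool} saying whether each module or tool can be found.

    Hypothetical helper for illustration; not part of the repository.
    """
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH.
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

The check only looks packages and tools up; it does not import the packages or verify their versions.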
 34 | 
 35 | # Malware samples
 36 | 
 37 |  * Copy the malware samples to `<project_dir>/data/exec_files/exec_files`. You can reach out to me for the samples used in this research. The overall directory structure should look like this:
 38 | 
 39 | ```
 40 | ├── config.py
 41 | ├── data
 42 | │   ├── exec_files
 43 | │   │   └── exec_files
 44 | │   │       ├── adload
 45 | │   │       ├── agent
 46 | │   │       ├── alureon
 47 | │   │       ├── bho
 48 | │   │       ├── ceeinject
 49 | │   │       ├── cycbot
 50 | │   │       ├── delfinject
 51 | │   │       └── fakerean
 52 | ├── data_preprocess.py
 53 | ├── data_utils
 54 | .
 55 | .
 56 | ```
 57 | 
 58 | # Data preprocessing
 59 | 
 60 | Execute `data_preprocess.py` with the options below to preprocess the data.
 61 | 
 62 | `python data_preprocess.py --extract_pe_features`
 63 | 
 64 | `python data_preprocess.py --bin_to_img`
 65 | 
 66 | `python data_preprocess.py --extract_opcodes`
 67 | 
 68 | `python data_preprocess.py --split_opcodes`
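As a rough illustration of what the `--bin_to_img` step computes, the sketch below reshapes a file's raw bytes into a 2-D array of gray-scale pixels of a given width, dropping trailing bytes that do not fill a complete row. The name `bytes_to_gray_image` is hypothetical. The repository's `generate_and_save_image` in `data_utils/bin_to_img.py` works on files and writes a PNG via `imageio`, which this sketch leaves out.

```python
import numpy as np


def bytes_to_gray_image(data, width):
    """Reshape raw bytes into a (rows, width) uint8 array, or None if empty.

    Hypothetical helper for illustration; not part of the repository.
    A width of 0 means a single row holding the whole input.
    """
    if len(data) == 0:
        return None
    if width == 0:
        width = len(data)
    rows = len(data) // width
    if rows == 0:
        # Fewer bytes than one row: nothing to reshape.
        return None
    # Keep only the bytes that fill complete rows.
    pixels = np.frombuffer(data, dtype=np.uint8)[: rows * width]
    return pixels.reshape(rows, width)


if __name__ == "__main__":
    image = bytes_to_gray_image(bytes(range(10)), width=4)
    print(image.shape)
    print(image)
```

Each byte becomes one pixel with an intensity between 0 and 255, so the image height depends on the file size while the width is fixed.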
 69 | 
 70 | 
 71 | # Train and test models
 72 | Execute `detect_malware.py` with the appropriate command-line arguments to train and test models, e.g.:
 73 | 
 74 | `python detect_malware.py --deep_feedforward`
 75 | 
 76 | `python detect_malware.py --deep_rnn`
 77 | 
 78 | `python detect_malware.py --shallow_ml`
 79 | 
 80 | `python detect_malware.py --transfer_conv_ml`
 81 | 
 82 | # Dataset
 83 | 
 84 | Apply for access here: https://forms.gle/65SNHJpQ7U4TYkCU7
 85 | 
 86 | ### If you find our work useful for your research, please cite this chapter/paper as:
 87 | ```
 88 | Prajapati P., Stamp M. (2021) An Empirical Analysis of Image-Based Learning Techniques for Malware Classification. In: Stamp M., Alazab M., Shalaginov A. (eds) Malware Analysis Using Artificial Intelligence and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-62582-5_16
 89 | ```
 90 | or
 91 | ```
 92 | @Inbook{
 93 |     Prajapati2021,
 94 |     author={Prajapati, Pratikkumar and Stamp, Mark},
 95 |     editor={Stamp, Mark and Alazab, Mamoun  and Shalaginov, Andrii},
 96 |     title={An Empirical Analysis of Image-Based Learning Techniques for Malware Classification},
 97 |     bookTitle={Malware Analysis Using Artificial Intelligence and Deep Learning},
 98 |     year={2021},
 99 |     publisher={Springer International Publishing},
100 |     address={Cham},
101 |     pages={411-435},
102 |     abstract={In this chapter, we consider malware classification using deep learning techniques and image-based features. We employ a wide variety of deep learning techniques, including multilayer perceptrons (MLP), convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU). Among our CNN experiments, transfer learning plays a prominent role---specifically, we test the VGG-19 and ResNet152 models. As compared to previous work, the results presented in this chapter are based on a larger and more diverse malware dataset, we consider a wider array of features, and we experiment with a much greater variety of learning techniques. Consequently, our results are the most comprehensive and complete that have yet been published.},
103 |     isbn={978-3-030-62582-5},
104 |     doi={10.1007/978-3-030-62582-5_16},
105 |     url={https://doi.org/10.1007/978-3-030-62582-5_16}
106 | }
107 | ```
108 | 


--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import torch
 3 | import multiprocessing
 4 | 
 5 | # original dataset i.e. binary executable files
 6 | ORG_DATASET_ROOT_PATH = os.path.join('data', 'exec_files')
 7 | ORG_DATASET_DIR_NAME = 'org_dataset'
 8 | ORG_DATASET_PATH = os.path.join('data', 'exec_files', ORG_DATASET_DIR_NAME)
 9 | supported_image_dims = [0, 1, 64, 128, 256, 512, 1024]
10 | 
11 | # opcodes from binary executables
12 | # -1: the entire opcode len of the binary, else the specific opcode len
13 | supported_opcode_lens = [-1, 10, 20, 50, 100, 500, 1000, 2000, 5000]
14 | ORG_DATASET_OPCODES_PATH = os.path.join('data', 'exec_files', 'org_dataset_opcodes')
15 | # PE features from binary executables
16 | ORG_DATASET_PE_FEATURES_CSV = os.path.join('data', 'org_malware_dataset_pe_features.csv')
17 | 
18 | # count files for ORG_DATASET_PATH
19 | ORG_DATASET_COUNT_CSV = os.path.join('data', 'org_malware_dataset_count.csv')
20 | # count files for ORG_DATASET_PATH_IMAGE*
21 | ORG_DATASET_COUNT_IMAGES_CSV = os.path.join('data', 'org_malware_dataset_count_images.csv')
22 | # count files for ORG_DATASET_PE_FEATURES_CSV
23 | ORG_DATASET_COUNT_PE_FEATURES_CSV = os.path.join('data', 'org_malware_dataset_count_pe_features.csv')
24 | # count files for ORG_DATASET_OPCODES_PATH
25 | ORG_DATASET_COUNT_OPCODES_PATH = os.path.join('data', 'org_malware_dataset_count_opcodes.csv')
26 | 
27 | use_cuda = torch.cuda.is_available()
28 | device = torch.device("cuda" if use_cuda else "cpu")
29 | LINE_LEN = 80
30 | LOG_MASTER_DIR = 'logs'
31 | MODEL_INFO_LOG = 'model_info_and_results.log'
32 | MODEL_META_INFO_LOG = 'models_meta_info.log'
33 | MODEL_LOSS_INFO_LOG = 'model_losses.log'
34 | MODEL_ACC_INFO_LOG = 'model_acc.log'
35 | MODEL_CONF_MATRIX_CSV = 'confusion_matrix.csv'
36 | MODEL_CONF_MATRIX_PNG = 'confusion_matrix.png'
37 | MODEL_CONF_MATRIX_NORMALIZED_CSV = 'confusion_matrix_normalized.csv'
38 | MODEL_CONF_MATRIX_NORMALIZED_PNG = 'confusion_matrix_normalized.png'
39 | MODEL_ACCURACY_PNG = 'model_accuracy.png'
40 | MODEL_LOSS_PNG = 'model_loss.png'
41 | EXPERIMENT_RESULTS = 'experiments_results.csv'
42 | GRID_CV_EXPERIMENT_RESULTS = 'grid_cv_experiments_results.csv'
43 | CPU_COUNT = multiprocessing.cpu_count()
44 | 
45 | DEEP_FF = 'deep_feed_forward'
46 | DEEP_RNN = 'rnn'
47 | SHALLOW_ML = 'shallow_ml'
48 | FEATURE_TYPE_IMAGE = 'image'
49 | FEATURE_TYPE_OPCODE = 'opcode'
50 | 


--------------------------------------------------------------------------------
/data_preprocess.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | import sys
 3 | from config import *
 4 | from data_utils.extract_pe_features import *
 5 | from data_utils.bin_to_img import *
 6 | from data_utils.extract_opcode import *
 7 | from data_utils.misc import *
 8 | from data_utils.data_loaders import *
 9 | 
10 | 
11 | def main():
12 |     max_files = 0  # set 0 to process all files or set a specific number
13 | 
14 |     if args.extract_pe_features:
15 |         extract_pe_features(ORG_DATASET_PE_FEATURES_CSV, ORG_DATASET_COUNT_PE_FEATURES_CSV, ORG_DATASET_PATH,
16 |                             max_files=max_files)
17 | 
18 |     if args.bin_to_img:
19 |         list_of_widths = [0, 1, 64, 128, 256, 512, 1024]
20 |         for width in list_of_widths:
21 |             convert_bin_to_img(ORG_DATASET_PATH, width, max_files=max_files)
22 | 
23 |     if args.extract_opcodes:
24 |         process_opcodes_bulk(ORG_DATASET_PATH, max_files=max_files)
25 | 
26 |     if args.count_samples:
27 |         count_dataset(ORG_DATASET_PATH, ORG_DATASET_COUNT_CSV)
28 |         count_dataset(ORG_DATASET_OPCODES_PATH, ORG_DATASET_COUNT_OPCODES_PATH)
29 |         count_dataset(get_image_datapath(image_dim=256), ORG_DATASET_COUNT_IMAGES_CSV)
30 | 
31 |     if args.split_opcodes:
32 |         list_of_opcode_lens = [10, 20, 50, 100, 500, 1000, 2000, 5000]
33 |         for opcode_len in list_of_opcode_lens:
34 |             process_split_opcodes(ORG_DATASET_OPCODES_PATH, opcode_len=opcode_len)
35 | 
36 |     if args.latex_format:
37 |         # tuple -> log_date_dir , experiment
38 |         data_list = [("25-May-2020_22_44_37", "experiment_14"),
39 |                      ("13-Jun-2020_16_49_09", "experiment_29"),
40 |                      ("09-Jun-2020_20_42_39", "conv1d_experiment_65"),
41 |                      ("14-Jun-2020_09_03_12", "experiment_18"),
42 |                      ("06-Jun-2020_22_13_17", "rnn_experiment_22"),
43 |                      ("06-Jun-2020_22_13_17", "rnn_experiment_46"),
44 |                      ("13-Jun-2020_21_04_18", "tl_experiment_1"),
45 |                      ("12-Jun-2020_21_54_44", "XGB_experiment_1"),
46 |                      ("12-Jun-2020_21_54_44", "Knn_experiment_1"),
47 |                      ("12-Jun-2020_21_54_44", "RandomForest_experiment_1")
48 |                      ]
49 |         process_cf_for_latex(data_list)
50 | 
51 | 
52 | if __name__ == "__main__":
53 |     parser = argparse.ArgumentParser(description='Process the Malware data')
54 | 
55 |     parser.add_argument('--extract_pe_features', action='store_true', help='Extract features from PE format',
56 |                         default=False)
57 |     parser.add_argument('--bin_to_img', action='store_true', help='Generate image files from malware binaries',
58 |                         default=False)
59 |     parser.add_argument('--extract_opcodes', action='store_true', help='Extract opcodes from malware binaries',
60 |                         default=False)
61 |     parser.add_argument('--count_samples', action='store_true', help='Count all sample files for all experiments',
62 |                         default=False)
63 |     parser.add_argument('--split_opcodes', action='store_true', help='split opcodes into train-test for TorchText',
64 |                         default=False)
65 | 
66 |     parser.add_argument('--latex_format', action='store_true', help='Normalize Conf. matrix and save for latex',
67 |                         default=False)
68 | 
69 |     args = parser.parse_args()
70 | 
71 |     if len(sys.argv) < 2:
72 |         parser.print_usage()
73 |         sys.exit(1)
74 | 
75 |     main()
76 | 


--------------------------------------------------------------------------------
/data_utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pratikpv/malware_detect2/12f4390f1c4a7a3e8b6bc355cc06f08f4dd3126d/data_utils/__init__.py


--------------------------------------------------------------------------------
/data_utils/bin_to_img.py:
--------------------------------------------------------------------------------
 1 | import multiprocessing
 2 | from tqdm import tqdm
 3 | import numpy as np
 4 | import imageio
 5 | import array
 6 | import os
 7 | 
 8 | 
 9 | def generate_and_save_image(input_filename, output_filename, width):
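10 |     # Read the file as raw bytes and lay them out as rows of `width` pixels.
11 |     ln = os.path.getsize(input_filename)  # length of file in bytes
12 |     if ln == 0:
13 |         return  # nothing to convert for an empty file
14 |     if width == 0:
15 |         width = ln  # width 0 means one row holding the whole file
16 |     rem = ln % width
17 |     a = array.array("B")  # uint8 array
18 |     with open(input_filename, 'rb') as f:  # the file is closed even on error
19 |         a.fromfile(f, ln - rem)  # drop trailing bytes that do not fill a row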
20 |     g = np.reshape(a, (len(a) // width, width))
21 |     g = np.uint8(g)
22 |     imageio.imwrite(output_filename, g)  # save the image
23 | 
24 | 
25 | def convert_bin_to_img(input_dir, width, max_files=0):
26 |     output_dir = input_dir + '_width_' + str(width)
27 |     if not os.path.isdir(output_dir):
28 |         os.mkdir(output_dir)
29 | 
30 |     list_dirs = os.listdir(input_dir)
31 |     with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
32 | 
33 |         jobs = []
34 |         results = []
35 |         total_count = 0
36 | 
37 |         for dirname in list_dirs:
38 |             list_files = os.listdir(os.path.join(input_dir, dirname))
39 |             count = 0
40 |             for filename in list_files:
41 |                 input_filename = os.path.join(input_dir, dirname, filename)
42 |                 try:
43 |                     output_filename = os.path.splitext(os.path.basename(input_filename))[0] + '.png'
44 |                     output_class_dir = os.path.join(output_dir, dirname)
45 |                     if not os.path.isdir(output_class_dir):
46 |                         os.mkdir(output_class_dir)
47 |                     output_filename = os.path.join(output_dir, dirname, output_filename)
48 | 
49 |                     jobs.append(
50 |                         pool.apply_async(generate_and_save_image, (input_filename, output_filename, width)))
51 |                     count += 1
52 |                     if max_files > 0 and max_files == count:
53 |                         break
54 |                 except Exception:
55 |                     print('Ignoring ', filename)
56 | 
57 |             total_count += count
58 |         tqdm_desc = 'Converting Malware bins to images for width ' + str(width)
59 |         for job in tqdm(jobs, desc=tqdm_desc):
60 |             results.append(job.get())
61 | 


--------------------------------------------------------------------------------
/data_utils/data_loaders.py:
--------------------------------------------------------------------------------
  1 | import random
  2 | from torch.utils.data import DataLoader
  3 | import torchvision
  4 | import torchtext
  5 | from config import *
  6 | import numpy as np
  7 | 
  8 | 
  9 | def get_image_datapath(image_dim, check_exist=True):
 10 |     if image_dim not in supported_image_dims:
 11 |         raise Exception("Unknown Image dim given")
 12 | 
 13 |     dir_name = os.path.join(ORG_DATASET_ROOT_PATH, ORG_DATASET_DIR_NAME + '_width_' + str(image_dim))
 14 | 
 15 |     if not check_exist:
 16 |         return dir_name
 17 | 
 18 |     if os.path.isdir(dir_name):
 19 |         return dir_name
 20 |     print(f'dir_name = {dir_name}')
 21 |     raise Exception("Data dir for Image dim {image_dim} missing".format(image_dim=image_dim))
 22 | 
 23 | 
 24 | def get_opcode_datapath(opcode_len, check_exist=True):
 25 |     if check_exist:
 26 |         if opcode_len not in supported_opcode_lens:
 27 |             raise Exception("Unknown opcode len given")
 28 | 
 29 |     opcode_len_str = '_' + str(opcode_len)
 30 |     if opcode_len == -1:
 31 |         opcode_len_str = ''
 32 | 
 33 |     train_split_json = os.path.join('org_dataset_opcodes_train' + opcode_len_str + '.json')
 34 |     test_split_json = os.path.join('org_dataset_opcodes_test' + opcode_len_str + '.json')
 35 |     combined_path_for_json = 'org_dataset_opcodes_split' + opcode_len_str
 36 |     exist = False
 37 | 
 38 |     if os.path.isfile(os.path.join(ORG_DATASET_ROOT_PATH, train_split_json)) and \
 39 |             os.path.isfile(os.path.join(ORG_DATASET_ROOT_PATH, test_split_json)) and \
 40 |             os.path.isdir(os.path.join(ORG_DATASET_ROOT_PATH, combined_path_for_json)):
 41 |         exist = True
 42 | 
 43 |     data_path = dict()
 44 |     data_path['train_split_json'] = train_split_json
 45 |     data_path['test_split_json'] = test_split_json
 46 |     data_path['combined_path_for_json'] = combined_path_for_json
 47 | 
 48 |     if not check_exist:
 49 |         return data_path
 50 |     if check_exist and exist:
 51 |         return data_path
 52 | 
 53 |     print(f'train_split_json : {train_split_json}')
 54 |     print(f'test_split_json : {test_split_json}')
 55 |     raise Exception("Data dir for opcode len {opcode_len} missing".format(opcode_len=opcode_len))
 56 | 
 57 | 
 58 | def get_image_data_loaders(data_path=None, image_dim=64, train_split=0.8, batch_size=256,
 59 |                            convert_to_rgb=False, pretrained_image_dim=64, conv1d_image_dim_w=1024):
 60 |     workers_count = min(int(CPU_COUNT * 0.80), batch_size)
 61 | 
 62 |     image_dim_h = image_dim
 63 |     image_dim_w = image_dim
 64 | 
 65 |     if image_dim == 0:
 66 |         # Conv1D case
 67 |         image_dim_h = 1
 68 |         image_dim_w = conv1d_image_dim_w
 69 | 
 70 |     transform = None
 71 |     if convert_to_rgb:
 72 |         transform = torchvision.transforms.Compose([
 73 |             torchvision.transforms.Resize((pretrained_image_dim, pretrained_image_dim)),
 74 |             torchvision.transforms.Lambda(lambda image: image.convert('RGB')),
 75 |             torchvision.transforms.ToTensor()
 76 |         ])
 77 |     else:
 78 |         transform = torchvision.transforms.Compose([
 79 |             torchvision.transforms.Grayscale(num_output_channels=1),
 80 |             torchvision.transforms.Resize((image_dim_h, image_dim_w)),
 81 |             torchvision.transforms.ToTensor()
 82 |         ])
 83 | 
 84 |     dataset = torchvision.datasets.ImageFolder(data_path, transform=transform)
 85 |     dataset_len = len(dataset)
 86 |     # dataset_len = 1000
 87 |     indices = list(range(dataset_len))
 88 | 
 89 |     random.shuffle(indices)
 90 |     split = int(np.floor(train_split * dataset_len))
 91 | 
 92 |     train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
 93 |                                                sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]),
 94 |                                                num_workers=workers_count)
 95 | 
 96 |     val_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
 97 |                                              sampler=torch.utils.data.sampler.SubsetRandomSampler(
 98 |                                                  indices[split:dataset_len]),
 99 |                                              num_workers=workers_count)
100 | 
101 |     train_set_len = len(train_loader) * batch_size
102 |     val_set_len = len(val_loader) * batch_size
103 |     class_names = dataset.classes
104 |     num_of_classes = len(dataset.classes)
105 | 
106 |     return train_loader, val_loader, dataset_len, class_names
107 | 
108 | 
109 | def get_opcode_data_loaders(data_path=None, opcode_len=500, batch_size=512):
110 |     TEXT = torchtext.legacy.data.Field()
111 |     LABEL = torchtext.legacy.data.LabelField()
112 | 
113 |     fields = {'text': ('text', TEXT), 'label': ('label', LABEL)}
114 | 
115 |     train_data, test_data = torchtext.legacy.data.TabularDataset.splits(
116 |         path=ORG_DATASET_ROOT_PATH,
117 |         train=data_path['train_split_json'],
118 |         test=data_path['test_split_json'],
119 |         format='json',
120 |         fields=fields
121 |     )
122 |     TEXT.build_vocab(train_data)
123 |     LABEL.build_vocab(train_data)
124 | 
125 |     train_iterator, test_iterator = torchtext.legacy.data.BucketIterator.splits((train_data, test_data),
126 |                                                                          batch_size=batch_size,
127 |                                                                          sort=False,
128 |                                                                          shuffle=True,
129 |                                                                          repeat=False,
130 |                                                                          device='cpu')
131 | 
132 |     dataset_len = len(train_data) + len(test_data)
133 |     class_names = list(LABEL.vocab.stoi.keys())
134 |     text_vocab_len = len(TEXT.vocab)
135 |     label_vocab_len = len(LABEL.vocab)
136 | 
137 |     pad_idx = TEXT.vocab.stoi[TEXT.pad_token]
138 | 
139 |     return train_iterator, test_iterator, dataset_len, class_names, text_vocab_len, label_vocab_len, pad_idx
140 | 


--------------------------------------------------------------------------------
/data_utils/extract_data.py:
--------------------------------------------------------------------------------
 1 | import pandas as pd
 2 | import sqlalchemy
 3 | 
 4 | DATA_CSV_FILENAME = 'malware_data_raw.csv'
 5 | 
 6 | 
 7 | def save_tables_to_csv():
 8 |     engine = sqlalchemy.create_engine('mysql+pymysql://admin:Pratik%40123@localhost:3306/new_schema')  # '@' in the password must be URL-encoded
 9 |     df = pd.read_sql_table('FILES', engine)
10 |     df.to_csv(DATA_CSV_FILENAME)
11 |     return df
12 | 
13 | 
14 | def main():
15 |     # save_tables_to_csv()
16 |     df = pd.read_csv(DATA_CSV_FILENAME)
17 |     print(df.keys())
18 |     df = df[['SHA1', 'FAMILY']]
19 |     print(df.groupby('FAMILY').count())
20 | 
21 | 
22 | if __name__ == "__main__":
23 |     main()
24 | 


--------------------------------------------------------------------------------
/data_utils/extract_opcode.py:
--------------------------------------------------------------------------------
  1 | from tqdm import tqdm
  2 | import multiprocessing
  3 | from config import *
  4 | from subprocess import Popen, PIPE
  5 | import random
  6 | import json
  7 | from data_utils.data_loaders import *
  8 | 
  9 | 
 10 | def generate_opcode(bin_filename, text_filename, debug=False):
 11 |     list_of_cmd_args = [
 12 |         ['objdump', '-j', '.text', '-D', bin_filename],  # disassemble the .text section
 13 |         ['objdump', '-j', 'CODE', '-D', bin_filename],  # fall back to a section named CODE
 14 |         ['objdump', '-d', bin_filename]  # last resort: all executable sections
 15 |     ]
 16 | 
 17 |     got_opcode = False
 18 |     asm_code = []
 19 | 
 20 |     for cmd_num, cmd_args in enumerate(list_of_cmd_args):
 21 |         try:
 22 |             if debug:
 23 |                 print(f'cmd_num = {cmd_num}')
 24 |                 print(f'cmd_args = {" ".join(cmd_args)}')
 25 |             process = Popen(cmd_args, stdout=PIPE, stderr=PIPE)
 26 |             p_out, p_err = process.communicate()
 27 |             if debug:
 28 |                 print(p_out)
 29 |             asm_code = p_out.decode(errors='replace').split('\n')
 30 |         except (OSError, ValueError):
 31 |             got_opcode = False
 32 |         else:
 33 |             if len(asm_code) > 5:
 34 |                 got_opcode = True
 35 | 
 36 |         if got_opcode:
 37 |             break
 38 | 
 39 |     if got_opcode:
 40 |         with open(text_filename, 'w+') as f:
 41 |             for line in asm_code:
 42 |                 line = line.split('\t')  # address, raw bytes, instruction
 43 |                 if len(line) > 2:
 44 |                     opcode_line = line[2]
 45 |                     opcode_line = opcode_line.split(' ')
 46 |                     if len(opcode_line) > 0:
 47 |                         f.write(opcode_line[0])
 48 |                         f.write('\n')
 49 | 
 50 |     else:
 51 |         # TODO some files are empty. check generate_opcode
 52 |         if debug:
 53 |             print(f'Giving up on {bin_filename}')
 54 | 
 55 | 
 56 | def process_opcodes_bulk(input_dir, output_dir=ORG_DATASET_OPCODES_PATH, max_files=0):
 57 |     if not os.path.isdir(output_dir):
 58 |         os.mkdir(output_dir)
 59 | 
 60 |     list_dirs = os.listdir(input_dir)
 61 |     with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
 62 | 
 63 |         jobs = []
 64 |         results = []
 65 |         total_count = 0
 66 | 
 67 |         for dirname in list_dirs:
 68 |             list_files = os.listdir(os.path.join(input_dir, dirname))
 69 |             count = 0
 70 |             for filename in list_files:
 71 |                 input_filename = os.path.join(input_dir, dirname, filename)
 72 |                 try:
 73 |                     output_filename = os.path.splitext(os.path.basename(input_filename))[0] + '.txt'
 74 |                     output_class_dir = os.path.join(output_dir, dirname)
 75 |                     if not os.path.isdir(output_class_dir):
 76 |                         os.mkdir(output_class_dir)
 77 |                     output_filename = os.path.join(output_dir, dirname, output_filename)
 78 | 
 79 |                     jobs.append(
 80 |                         pool.apply_async(generate_opcode, (input_filename, output_filename)))
 81 |                     count += 1
 82 |                     if max_files > 0 and max_files == count:
 83 |                         break
 84 |                 except Exception:
 85 |                     print('Ignoring ', filename)
 86 | 
 87 |             total_count += count
 88 |         tqdm_desc = 'Extracting opcodes from Malware bins'
 89 |         for job in tqdm(jobs, desc=tqdm_desc):
 90 |             results.append(job.get())
 91 | 
 92 | 
 93 | def get_jason_filename(combined_path_for_json, key, class_name):
 94 |     return os.path.join(combined_path_for_json, class_name, key + '_' + class_name + '.json')
 95 | 
 96 | 
 97 | def split_opcodes(input_class_dir, opcode_len=-1, train_split=0.8):
 98 |     """
 99 |     input_class_dir contains one opcode text file per sample of a class.
100 |     Write per-class train and test JSON-lines files from those samples.
101 |     """
102 |     opcode_path = get_opcode_datapath(opcode_len, check_exist=False)
103 |     combined_path_for_json = opcode_path['combined_path_for_json']
104 | 
105 |     list_files = os.listdir(input_class_dir)
106 |     class_name = os.path.basename(input_class_dir)
107 |     random.shuffle(list_files)
108 |     total_samples = len(list_files)
109 |     train_size = int(total_samples * train_split)
110 |     test_size = total_samples - train_size
111 | 
112 |     train_files = list_files[0:train_size]
113 |     test_files = list_files[train_size:]
114 | 
115 |     samples = {'train': train_files, 'test': test_files}
116 | 
117 |     for key in samples.keys():
118 |         sample_json = []
119 |         all_filenames = samples[key]
120 |         for filename in all_filenames:
121 |             input_filename = os.path.join(input_class_dir, filename)
122 |             opcodes = []
123 |             with open(input_filename, 'r') as f:
124 |                 opcodes = f.read().splitlines()
125 | 
126 |             # print(opcodes)
127 |             if opcode_len != -1:
128 |                 opcodes = opcodes[0:opcode_len]
129 |             if len(opcodes) < 1:
130 |                 continue
131 | 
132 |             dir_name = os.path.join(ORG_DATASET_ROOT_PATH, combined_path_for_json)
133 |             os.makedirs(os.path.join(dir_name, class_name), exist_ok=True)
134 |             out_json_file = get_jason_filename(dir_name, key, class_name)
135 |             with open(out_json_file, 'a') as outfile:
136 |                 file_dict = {'text': opcodes, 'label': class_name}
137 |                 json.dump(file_dict, outfile)
138 |                 outfile.write('\n')
139 | 
140 | 
141 | def process_split_opcodes(root_datapath, opcode_len=-1):
142 |     opcode_path = get_opcode_datapath(opcode_len, check_exist=False)
143 |     train_split_json = os.path.join(ORG_DATASET_ROOT_PATH, opcode_path['train_split_json'])
144 |     test_split_json = os.path.join(ORG_DATASET_ROOT_PATH, opcode_path['test_split_json'])
145 |     combined_path_for_json = opcode_path['combined_path_for_json']
146 | 
147 |     classes = os.listdir(root_datapath)
148 |     #
149 |     with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
150 |         jobs = []
151 |         results = []
152 |         for class_name in classes:
153 |             class_dir = os.path.join(root_datapath, class_name)
154 |             jobs.append(pool.apply_async(split_opcodes, (class_dir, opcode_len,)))
155 | 
156 |         for job in tqdm(jobs, desc="Generating opcodes with json and splitting for opcode len = {opcode_len}".format(
157 |                 opcode_len=opcode_len)):
158 |             results.append(job.get())
159 | 
160 |     print(f'Merging all training and testing classes')
161 |     samples = {'train': train_split_json,
162 |                'test': test_split_json}
163 | 
164 |     for key in samples.keys():
165 |         out_json_file = samples[key]
166 |         with open(out_json_file, 'a') as outfile:
167 |             for class_name in classes:
168 |                 json_file = os.path.join(ORG_DATASET_ROOT_PATH,
169 |                                          get_jason_filename(combined_path_for_json, key, class_name))
170 |                 print(f'loading {json_file}')
171 |                 count = 0
172 |                 with open(json_file, 'r') as fp:
173 |                     while True:
174 |                         line = fp.readline()
175 |                         if not line:
176 |                             break
177 |                         count += 1
178 |                         try:
179 |                             # write only well-formed JSON lines
180 |                             json.loads(line)
181 |                         except json.JSONDecodeError:
182 |                             print(f'Error in {json_file} at count {count}')
183 |                         else:
184 |                             outfile.write(line)
185 | 
186 |     print(f'Merged all training and testing classes {samples}')
187 | 


--------------------------------------------------------------------------------
/data_utils/extract_pe_features.py:
--------------------------------------------------------------------------------
  1 | import multiprocessing
  2 | import pefile
  3 | import os
  4 | import hashlib
  5 | import array
  6 | import math
  7 | from tqdm import tqdm
  8 | import numpy as np
  9 | import imageio
 10 | 
 11 | 
 12 | def get_md5(fname):
 13 |     hash_md5 = hashlib.md5()
 14 |     with open(fname, "rb") as f:
 15 |         for chunk in iter(lambda: f.read(4096), b""):
 16 |             hash_md5.update(chunk)
 17 |     return hash_md5.hexdigest()
 18 | 
 19 | 
 20 | def get_entropy(data):
 21 |     if len(data) == 0:
 22 |         return 0.0
 23 |     occurences = array.array('L', [0] * 256)
 24 |     for x in data:
 25 |         occurences[x if isinstance(x, int) else ord(x)] += 1
 26 | 
 27 |     entropy = 0
 28 |     for x in occurences:
 29 |         if x:
 30 |             p_x = float(x) / len(data)
 31 |             entropy -= p_x * math.log(p_x, 2)
 32 | 
 33 |     return entropy
 34 | 
 35 | 
 36 | def get_resources(pe):
 37 |     """Extract resources :
 38 |     [entropy, size]"""
 39 |     resources = []
 40 |     if hasattr(pe, 'DIRECTORY_ENTRY_RESOURCE'):
 41 |         try:
 42 |             for resource_type in pe.DIRECTORY_ENTRY_RESOURCE.entries:
 43 |                 if hasattr(resource_type, 'directory'):
 44 |                     for resource_id in resource_type.directory.entries:
 45 |                         if hasattr(resource_id, 'directory'):
 46 |                             for resource_lang in resource_id.directory.entries:
 47 |                                 data = pe.get_data(resource_lang.data.struct.OffsetToData,
 48 |                                                    resource_lang.data.struct.Size)
 49 |                                 size = resource_lang.data.struct.Size
 50 |                                 entropy = get_entropy(data)
 51 | 
 52 |                                 resources.append([entropy, size])
 53 |         except Exception:  # malformed resource directory: return what was collected so far
 54 |             return resources
 55 |     return resources
 56 | 
 57 | 
 58 | def get_version_info(pe):
 59 |     """Return version infos"""
 60 |     res = {}
 61 |     for fileinfo in pe.FileInfo:
 62 |         if fileinfo.Key == 'StringFileInfo':
 63 |             for st in fileinfo.StringTable:
 64 |                 for entry in st.entries.items():
 65 |                     res[entry[0]] = entry[1]
 66 |         if fileinfo.Key == 'VarFileInfo':
 67 |             for var in fileinfo.Var:
 68 |                 res.update(list(var.entry.items())[:1])  # dict views are not subscriptable in Python 3
 69 |     if hasattr(pe, 'VS_FIXEDFILEINFO'):
 70 |         res['flags'] = pe.VS_FIXEDFILEINFO.FileFlags
 71 |         res['os'] = pe.VS_FIXEDFILEINFO.FileOS
 72 |         res['type'] = pe.VS_FIXEDFILEINFO.FileType
 73 |         res['file_version'] = pe.VS_FIXEDFILEINFO.FileVersionLS
 74 |         res['product_version'] = pe.VS_FIXEDFILEINFO.ProductVersionLS
 75 |         res['signature'] = pe.VS_FIXEDFILEINFO.Signature
 76 |         res['struct_version'] = pe.VS_FIXEDFILEINFO.StrucVersion
 77 |     return res
 78 | 
 79 | 
 80 | def extract_infos(fpath):
 81 |     res = []
 82 |     res.append(os.path.basename(fpath))
 83 |     res.append(get_md5(fpath))
 84 |     pe = pefile.PE(fpath)
 85 |     res.append(pe.FILE_HEADER.Machine)
 86 |     res.append(pe.FILE_HEADER.SizeOfOptionalHeader)
 87 |     res.append(pe.FILE_HEADER.Characteristics)
 88 |     res.append(pe.OPTIONAL_HEADER.MajorLinkerVersion)
 89 |     res.append(pe.OPTIONAL_HEADER.MinorLinkerVersion)
 90 |     res.append(pe.OPTIONAL_HEADER.SizeOfCode)
 91 |     res.append(pe.OPTIONAL_HEADER.SizeOfInitializedData)
 92 |     res.append(pe.OPTIONAL_HEADER.SizeOfUninitializedData)
 93 |     res.append(pe.OPTIONAL_HEADER.AddressOfEntryPoint)
 94 |     res.append(pe.OPTIONAL_HEADER.BaseOfCode)
 95 |     try:
 96 |         res.append(pe.OPTIONAL_HEADER.BaseOfData)
 97 |     except AttributeError:
 98 |         res.append(0)
 99 |     res.append(pe.OPTIONAL_HEADER.ImageBase)
100 |     res.append(pe.OPTIONAL_HEADER.SectionAlignment)
101 |     res.append(pe.OPTIONAL_HEADER.FileAlignment)
102 |     res.append(pe.OPTIONAL_HEADER.MajorOperatingSystemVersion)
103 |     res.append(pe.OPTIONAL_HEADER.MinorOperatingSystemVersion)
104 |     res.append(pe.OPTIONAL_HEADER.MajorImageVersion)
105 |     res.append(pe.OPTIONAL_HEADER.MinorImageVersion)
106 |     res.append(pe.OPTIONAL_HEADER.MajorSubsystemVersion)
107 |     res.append(pe.OPTIONAL_HEADER.MinorSubsystemVersion)
108 |     res.append(pe.OPTIONAL_HEADER.SizeOfImage)
109 |     res.append(pe.OPTIONAL_HEADER.SizeOfHeaders)
110 |     res.append(pe.OPTIONAL_HEADER.CheckSum)
111 |     res.append(pe.OPTIONAL_HEADER.Subsystem)
112 |     res.append(pe.OPTIONAL_HEADER.DllCharacteristics)
113 |     res.append(pe.OPTIONAL_HEADER.SizeOfStackReserve)
114 |     res.append(pe.OPTIONAL_HEADER.SizeOfStackCommit)
115 |     res.append(pe.OPTIONAL_HEADER.SizeOfHeapReserve)
116 |     res.append(pe.OPTIONAL_HEADER.SizeOfHeapCommit)
117 |     res.append(pe.OPTIONAL_HEADER.LoaderFlags)
118 |     res.append(pe.OPTIONAL_HEADER.NumberOfRvaAndSizes)
119 |     res.append(len(pe.sections))
120 |     entropy = list(map(lambda x: x.get_entropy(), pe.sections))
121 |     if len(entropy) > 0:
122 |         res.append(sum(entropy) / float(len(entropy)))
123 |         res.append(min(entropy))
124 |         res.append(max(entropy))
125 |     else:
126 |         res.append(0)
127 |         res.append(0)
128 |         res.append(0)
129 | 
130 |     raw_sizes = list(map(lambda x: x.SizeOfRawData, pe.sections))
131 |     if len(raw_sizes) > 0:
132 |         res.append(sum(raw_sizes) / float(len(raw_sizes)))
133 |         res.append(min(raw_sizes))
134 |         res.append(max(raw_sizes))
135 |     else:
136 |         res.append(0)
137 |         res.append(0)
138 |         res.append(0)
139 | 
140 |     virtual_sizes = list(map(lambda x: x.Misc_VirtualSize, pe.sections))
141 |     if len(virtual_sizes) > 0:
142 |         res.append(sum(virtual_sizes) / float(len(virtual_sizes)))
143 |         res.append(min(virtual_sizes))
144 |         res.append(max(virtual_sizes))
145 |     else:
146 |         res.append(0)
147 |         res.append(0)
148 |         res.append(0)
149 |     # Imports
150 |     try:
151 |         res.append(len(pe.DIRECTORY_ENTRY_IMPORT))
152 |         imports = sum([x.imports for x in pe.DIRECTORY_ENTRY_IMPORT], [])
153 |         res.append(len(imports))
154 |         res.append(len(list(filter(lambda x: x.name is None, imports))))
155 |     except AttributeError:
156 |         res.append(0)
157 |         res.append(0)
158 |         res.append(0)
159 |     # Exports
160 |     try:
161 |         res.append(len(pe.DIRECTORY_ENTRY_EXPORT.symbols))
162 |     except AttributeError:
163 |         # No export
164 |         res.append(0)
165 |     # Resources
166 |     resources = get_resources(pe)
167 |     res.append(len(resources))
168 |     if len(resources) > 0:
169 |         entropy = list(map(lambda x: x[0], resources))
170 |         res.append(sum(entropy) / float(len(entropy)))
171 |         res.append(min(entropy))
172 |         res.append(max(entropy))
173 |         sizes = list(map(lambda x: x[1], resources))
174 |         res.append(sum(sizes) / float(len(sizes)))
175 |         res.append(min(sizes))
176 |         res.append(max(sizes))
177 |     else:
178 |         res.append(0)
179 |         res.append(0)
180 |         res.append(0)
181 |         res.append(0)
182 |         res.append(0)
183 |         res.append(0)
184 | 
185 |     # Load configuration size
186 |     try:
187 |         res.append(pe.DIRECTORY_ENTRY_LOAD_CONFIG.struct.Size)
188 |     except AttributeError:
189 |         res.append(0)
190 | 
191 |     # Version configuration size
192 |     try:
193 |         version_infos = get_version_info(pe)
194 |         res.append(len(version_infos.keys()))
195 |     except AttributeError:
196 |         res.append(0)
197 |     return res
198 | 
199 | 
200 | def collect_features(path, class_id, class_name, max_files=0, max_size=0):
201 |     count = 0
202 |     list_of_res = []
203 |     for ffile in os.listdir(path):
204 |         full_name = os.path.join(path, ffile)
205 |         # print(full_name)
206 |         statinfo = os.stat(full_name)
207 |         if max_size != 0 and statinfo.st_size > max_size:
208 |             continue
209 |         try:
210 |             res = extract_infos(full_name)
211 |         except pefile.PEFormatError:
212 |             # skip files that are not valid PE images
213 |             continue
214 |         res.append(class_id)
215 |         res.append(class_name)
216 |         list_of_res.append(res)
217 |         count += 1
218 |         if max_files > 0 and count >= max_files:
219 |             break
220 |     return count, list_of_res
221 | 
222 | 
223 | def extract_pe_features(features_cvs_filename, count_cvs_filename, path_to_samples, max_files=15000, max_size=5242880):
224 |     features_columns = [
225 |         "Name",
226 |         "md5",
227 |         "Machine",
228 |         "SizeOfOptionalHeader",
229 |         "Characteristics",
230 |         "MajorLinkerVersion",
231 |         "MinorLinkerVersion",
232 |         "SizeOfCode",
233 |         "SizeOfInitializedData",
234 |         "SizeOfUninitializedData",
235 |         "AddressOfEntryPoint",
236 |         "BaseOfCode",
237 |         "BaseOfData",
238 |         "ImageBase",
239 |         "SectionAlignment",
240 |         "FileAlignment",
241 |         "MajorOperatingSystemVersion",
242 |         "MinorOperatingSystemVersion",
243 |         "MajorImageVersion",
244 |         "MinorImageVersion",
245 |         "MajorSubsystemVersion",
246 |         "MinorSubsystemVersion",
247 |         "SizeOfImage",
248 |         "SizeOfHeaders",
249 |         "CheckSum",
250 |         "Subsystem",
251 |         "DllCharacteristics",
252 |         "SizeOfStackReserve",
253 |         "SizeOfStackCommit",
254 |         "SizeOfHeapReserve",
255 |         "SizeOfHeapCommit",
256 |         "LoaderFlags",
257 |         "NumberOfRvaAndSizes",
258 |         "SectionsNb",
259 |         "SectionsMeanEntropy",
260 |         "SectionsMinEntropy",
261 |         "SectionsMaxEntropy",
262 |         "SectionsMeanRawsize",
263 |         "SectionsMinRawsize",
264 |         "SectionMaxRawsize",
265 |         "SectionsMeanVirtualsize",
266 |         "SectionsMinVirtualsize",
267 |         "SectionMaxVirtualsize",
268 |         "ImportsNbDLL",
269 |         "ImportsNb",
270 |         "ImportsNbOrdinal",
271 |         "ExportNb",
272 |         "ResourcesNb",
273 |         "ResourcesMeanEntropy",
274 |         "ResourcesMinEntropy",
275 |         "ResourcesMaxEntropy",
276 |         "ResourcesMeanSize",
277 |         "ResourcesMinSize",
278 |         "ResourcesMaxSize",
279 |         "LoadConfigurationSize",
280 |         "VersionInformationSize",
281 |         "Malware_ClassID",
282 |         "Malware_ClassName"
283 |     ]
284 | 
285 |     csv_delimiter = ','
286 |     features_cvs_file = open(features_cvs_filename, "w")
287 |     features_cvs_file.write(csv_delimiter.join(features_columns) + "\n")
288 | 
289 |     count_cvs_file = open(count_cvs_filename, "w")
290 |     count_columns = ['class_name', 'total_samples']
291 |     count_cvs_file.write(csv_delimiter.join(count_columns) + "\n")
292 | 
293 |     class_names = os.listdir(path_to_samples)
294 |     class_count = len(class_names)
295 | 
296 |     for class_id in tqdm(range(class_count), desc='Extracting features from Malware samples'):
297 |         sub_dir = os.path.join(path_to_samples, class_names[class_id])
298 |         total_samples, list_of_res = collect_features(sub_dir, class_id, class_names[class_id], max_files, max_size)
299 |         for res in list_of_res:
300 |             features_cvs_file.write(csv_delimiter.join(map(lambda x: str(x), res)) + "\n")
301 |         count_cvs_file.write(class_names[class_id] + csv_delimiter + str(total_samples) + "\n")
302 |         features_cvs_file.flush()
303 |         count_cvs_file.flush()
304 | 
305 |     features_cvs_file.close()
306 |     count_cvs_file.close()
307 | 
308 | 
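
`get_entropy` in this file computes the Shannon entropy of a byte sequence in bits per byte: 0 for empty or constant data, up to 8 for uniformly distributed bytes. The sketch below is an equivalent standalone formulation built on `collections.Counter`. The name `shannon_entropy` is hypothetical, and unlike `get_entropy` it accepts only `bytes`-like input, not Python 2 style strings.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    entropy = 0.0
    # iterating over bytes yields ints, so Counter maps byte value -> count
    for count in Counter(data).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy
```

Packed or encrypted sections tend to score close to 8, which is presumably why the mean, minimum, and maximum section entropy are kept as features.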


--------------------------------------------------------------------------------
/data_utils/misc.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | from tqdm import tqdm
 3 | import pandas as pd
 4 | from utils import *
 5 | import sys
 6 | import traceback
 7 | 
 8 | 
 9 | def count_dataset(root_data_dir, csv_filename):
10 |     class_names = os.listdir(root_data_dir)
11 |     class_count = len(class_names)
12 | 
13 |     csv_delimiter = ','
14 |     count_columns = ['class_name', 'total_samples']
15 |     count_csv_file = open(csv_filename, "w")
16 | 
17 |     count_csv_file.write(csv_delimiter.join(count_columns) + "\n")
18 | 
19 |     for class_id in tqdm(range(class_count), desc='Counting dataset'):
20 |         files = os.listdir(os.path.join(root_data_dir, class_names[class_id]))
21 |         count_csv_file.write(class_names[class_id] + csv_delimiter + str(len(files)) + "\n")
22 | 
23 |     count_csv_file.close()
24 | 
25 | 
26 | def process_cf_for_latex(data_list):
27 |     data_path = os.path.join('logs', 'logs_preserve', 'logs_only')
28 |     cf_file = 'confusion_matrix.csv'
29 |     cf_file_latex = 'confusion_matrix_latex.txt'
30 | 
31 |     for date_dir, expr_name in data_list:
32 |         cf_filepath = os.path.join(data_path, date_dir, expr_name, cf_file)
33 |         print_line()
34 |         print(f'Processing {cf_filepath}', end='')
35 |         try:
36 |             df = pd.read_csv(cf_filepath, index_col=0)
37 |             df = df.apply(lambda x: x / x.sum(), axis=1)  # normalize by row
38 |             df_s = len(df.columns)
39 | 
40 |             cf_file_latexpath = os.path.join(data_path, date_dir, expr_name, cf_file_latex)
41 |             with open(cf_file_latexpath, 'w+') as f:
42 |                 for r in range(df_s):
43 |                     for c in range(df_s):
44 |                         cell_val = round(df.iloc[r, c], 4)
45 |                         msg = '{c} {r} {cell_val}\n'.format(c=c, r=r, cell_val=cell_val)
46 |                         f.write(msg)
47 |         except Exception:
48 |             traceback.print_exc()
49 |             print_line()
50 |             print(sys.exc_info()[0])
51 |             print_line()
52 |             print(f'\t--> failed')
53 |         else:
54 |             print(f'\t--> success')
55 |             print(cf_file_latexpath)
56 |         print_line()
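
`process_cf_for_latex` normalizes each confusion-matrix row so it sums to 1 and writes one `column row value` triple per cell. A dependency-free sketch of that transformation on a nested list follows. The names `normalized_triples` and `to_text` are hypothetical. Returning 0.0 for an all-zero row is an assumption of this sketch; the pandas version would produce NaN there.

```python
def normalized_triples(matrix, ndigits=4):
    """Row-normalize a square matrix and return (col, row, value) triples."""
    triples = []
    for r, row in enumerate(matrix):
        total = sum(row)
        for c, value in enumerate(row):
            # assumption: an empty class (all-zero row) maps to 0.0
            share = round(value / total, ndigits) if total else 0.0
            triples.append((c, r, share))
    return triples


def to_text(triples):
    """Format triples as 'col row value' lines, one cell per line."""
    return ''.join(f'{c} {r} {v}\n' for c, r, v in triples)
```

For example, `normalized_triples([[8, 2], [1, 3]])` turns the first row into shares 0.8 and 0.2 and the second into 0.25 and 0.75.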


--------------------------------------------------------------------------------
/detect_malware.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | from datetime import datetime
  3 | from models.models_utils import *
  4 | from utils import *
  5 | from sklearn.model_selection import train_test_split
  6 | from sklearn.preprocessing import StandardScaler
  7 | from models.Shallow_ML_models import *
  8 | import sklearn.metrics
  9 | import sys
 10 | import traceback
 11 | 
 12 | 
 13 | def setup():
 14 |     current_time_str = str(datetime.now().strftime("%d-%b-%Y_%H_%M_%S"))
 15 |     LOG_DIR = os.path.join(LOG_MASTER_DIR, current_time_str)
 16 |     os.makedirs(LOG_DIR)
 17 |     return LOG_DIR
 18 | 
 19 | 
 20 | def execute_deep_feedforward_model(model_params, LOG_DIR):
 21 |     print(f'Model params: {model_params}')
 22 | 
 23 |     batch_size = model_params['batch_size']
 24 |     feature_type = model_params['feature_type']
 25 | 
 26 |     if feature_type == FEATURE_TYPE_IMAGE:
 27 |         image_dim = model_params['image_dim']
 28 |         conv1d_image_dim_w = -1
 29 |         data_path = get_image_datapath(image_dim)
 30 |         if image_dim == 0:
 31 |             # conv1d models
 32 |             conv1d_image_dim_w = model_params['conv1d_image_dim_w']
 33 | 
 34 |         print(f'Loading image data')
 35 |         train_loader, val_loader, dataset_len, class_names = get_image_data_loaders(data_path=data_path,
 36 |                                                                                     image_dim=image_dim,
 37 |                                                                                     batch_size=batch_size,
 38 |                                                                                     conv1d_image_dim_w=conv1d_image_dim_w)
 39 | 
 40 |     else:
 41 |         print(f'Loading opcode data')
 42 |         opcode_len = model_params['opcode_len']
 43 |         data_path = get_opcode_datapath(opcode_len)
 44 |         train_loader, val_loader, dataset_len, class_names, \
 45 |         text_vocal_len, label_vocab_len, pad_idx = get_opcode_data_loaders(data_path=data_path,
 46 |                                                                            opcode_len=opcode_len,
 47 |                                                                            batch_size=batch_size)
 48 |         model_params['input_dim'] = text_vocal_len
 49 |         model_params['output_dim'] = label_vocab_len
 50 | 
 51 |     train_set_len = len(train_loader) * batch_size
 52 |     val_set_len = len(val_loader) * batch_size
 53 |     num_of_classes = len(class_names)
 54 | 
 55 |     model_params['num_of_classes'] = num_of_classes
 56 |     model_params['class_names'] = class_names
 57 | 
 58 |     if feature_type == FEATURE_TYPE_IMAGE:
 59 |         model = create_deep_image_model(model_params).to(device)
 60 |     else:
 61 |         model = create_deep_opcode_model(model_params).to(device)
 62 | 
 63 |     criterion = nn.CrossEntropyLoss().to(device)
 64 | 
 65 |     model, train_losses, train_accuracy = train_ann_model(model=model, model_params=model_params, criterion=criterion,
 66 |                                                           train_loader=train_loader, log_dir=LOG_DIR)
 67 |     test_accuracy, predicted, ground_truth = test_ann_model(model=model, model_params=model_params, criterion=criterion,
 68 |                                                             val_loader=val_loader)
 69 | 
 70 |     model_params['train_accuracy'] = np.mean(train_accuracy)
 71 |     model_params['test_accuracy'] = np.mean(test_accuracy)
 72 | 
 73 |     print(f"Average Train accuracy: {model_params['train_accuracy']:7.4f}%")
 74 |     print(f"Average Test accuracy : {model_params['test_accuracy']:7.4f}%")
 75 | 
 76 |     save_model_results_to_log(model=model, model_params=model_params,
 77 |                               train_losses=train_losses, train_accuracy=train_accuracy,
 78 |                               predicted=predicted, ground_truth=ground_truth,
 79 |                               log_dir=LOG_DIR)
 80 | 
 81 | 
 82 | def execute_deep_rnn_model(model_params, LOG_DIR):
 83 |     print(f'Model params: {model_params}')
 84 | 
 85 |     batch_size = model_params['batch_size']
 86 |     opcode_len = model_params['opcode_len']
 87 |     data_path = get_opcode_datapath(opcode_len)
 88 | 
 89 |     print_line()
 90 |     print(f'Loading Opcode data')
 91 | 
 92 |     train_iterator, test_iterator, dataset_len, class_names, \
 93 |     text_vocal_len, label_vocab_len, pad_idx = get_opcode_data_loaders(data_path=data_path,
 94 |                                                                        opcode_len=opcode_len,
 95 |                                                                        batch_size=batch_size)
 96 |     num_of_classes = len(class_names)
 97 | 
 98 |     print(f'Total opcode samples available: {dataset_len}')
 99 |     print(f'Number of classes: {num_of_classes}')
100 |     print_line()
101 | 
102 |     model_params['num_of_classes'] = num_of_classes
103 |     model_params['class_names'] = class_names
104 |     model_params['input_dim'] = text_vocal_len
105 |     model_params['output_dim'] = label_vocab_len
106 | 
107 |     model = create_deep_opcode_model(model_params).to(device)
108 |     criterion = nn.CrossEntropyLoss().to(device)
109 | 
110 |     model, train_losses, train_accuracy = train_rnn_model(model=model, model_params=model_params, criterion=criterion,
111 |                                                           train_loader=train_iterator, log_dir=LOG_DIR)
112 |     test_accuracy, predicted, ground_truth = test_rnn_model(model=model, model_params=model_params, criterion=criterion,
113 |                                                             val_loader=test_iterator)
114 | 
115 |     model_params['train_accuracy'] = np.mean(train_accuracy)
116 |     model_params['test_accuracy'] = np.mean(test_accuracy)
117 | 
118 |     print(f"Average Train accuracy: {model_params['train_accuracy']:7.4f}%")
119 |     print(f"Average Test accuracy : {model_params['test_accuracy']:7.4f}%")
120 | 
121 |     save_model_results_to_log(model=model, model_params=model_params,
122 |                               train_losses=train_losses, train_accuracy=train_accuracy,
123 |                               predicted=predicted, ground_truth=ground_truth,
124 |                               log_dir=LOG_DIR)
125 | 
126 | 
127 | def execute_conv_tl_model(model_params, LOG_DIR):
128 |     print(f'Model params: {model_params}')
129 | 
130 |     batch_size = model_params['batch_size']
131 |     image_dim = model_params['image_dim']
132 |     data_path = get_image_datapath(image_dim)
133 |     # dataloader transforms input images of image_dim to what pre-trained model expects
134 |     # pretrained_image_dim is what pre-trained model expects
135 |     pretrained_image_dim = get_pretrained_image_dim(model_params['model_name'])
136 |     train_loader, val_loader, dataset_len, class_names = get_image_data_loaders(data_path=data_path,
137 |                                                                                 image_dim=image_dim,
138 |                                                                                 batch_size=batch_size,
139 |                                                                                 convert_to_rgb=True,
140 |                                                                                 pretrained_image_dim=pretrained_image_dim)
141 | 
142 |     train_set_len = len(train_loader) * batch_size
143 |     val_set_len = len(val_loader) * batch_size
144 |     num_of_classes = len(class_names)
145 | 
146 |     model_params['num_of_classes'] = num_of_classes
147 |     model_params['class_names'] = class_names
148 | 
149 |     model = create_conv_tl_model(model_params).to(device)
150 |     criterion = nn.CrossEntropyLoss().to(device)
151 | 
152 |     model, train_losses, train_accuracy = train_ann_model(model=model, model_params=model_params, criterion=criterion,
153 |                                                           train_loader=train_loader, log_dir=LOG_DIR)
154 |     test_accuracy, predicted, ground_truth = test_ann_model(model=model, model_params=model_params, criterion=criterion,
155 |                                                             val_loader=val_loader)
156 | 
157 |     model_params['train_accuracy'] = np.mean(train_accuracy)
158 |     model_params['test_accuracy'] = np.mean(test_accuracy)
159 | 
160 |     print(f"Average Train accuracy: {model_params['train_accuracy']:7.4f}%")
161 |     print(f"Average Test accuracy : {model_params['test_accuracy']:7.4f}%")
162 | 
163 |     save_model_results_to_log(model=model, model_params=model_params,
164 |                               train_losses=train_losses, train_accuracy=train_accuracy,
165 |                               predicted=predicted, ground_truth=ground_truth,
166 |                               log_dir=LOG_DIR)
167 | 
168 | 
169 | def process_deep_learning(experiment_types, LOG_DIR):
170 |     for expr_type in experiment_types:
171 |         malware_expr_list = get_malware_experiments_list(expr_type)
172 |         print(malware_expr_list)
173 |         total_expr = len(malware_expr_list)
174 | 
175 |         final_results = []
176 |         for num, ml in enumerate(malware_expr_list):
177 |             if 'num_layers' in ml.keys():
178 |                 num_layers = ml['num_layers']
179 |                 if num_layers == 1:
180 |                     ml['dropout'] = 0
181 | 
182 |             print_line()
183 |             print(f'Executing : {ml["experiment_name"]} ({num + 1}/{total_expr})')
184 |             print_line()
185 |             try:
186 |                 if expr_type == DEEP_FF:
187 |                     execute_deep_feedforward_model(ml, LOG_DIR)
188 |                 elif expr_type == DEEP_RNN:
189 |                     execute_deep_rnn_model(ml, LOG_DIR)
190 |             except Exception:
191 |                 temp_dict = {'experiment_name': ml['experiment_name'],
192 |                              'train_accuracy': 'failed',
193 |                              'test_accuracy': 'failed'}
194 |                 print_line()
195 |                 print("FAILED")
196 |                 traceback.print_exc()
197 |                 print_line()
198 |                 print(sys.exc_info()[0])
199 |                 print_line()
200 |             else:
201 |                 temp_dict = {'experiment_name': ml['experiment_name'],
202 |                              'train_accuracy': ml['train_accuracy'],
203 |                              'test_accuracy': ml['test_accuracy']}
204 | 
205 |             final_results.append(temp_dict)
206 | 
207 |         exp_results_filename = os.path.join(LOG_DIR, expr_type + '_' + EXPERIMENT_RESULTS)
208 |         df = pd.DataFrame(final_results)
209 |         expr_name = df['experiment_name']
210 |         df.drop(['experiment_name'], axis=1, inplace=True)
211 |         df.set_index(expr_name, drop=True, inplace=True)
212 |         df.to_csv(exp_results_filename)
213 |         save_models_metadata_to_log(malware_expr_list, LOG_DIR)
214 | 
215 | 
216 | def prepare_shallow_model(model_params, LOG_DIR):
217 |     print(f'Model params: {model_params}')
218 |     df = pd.read_csv(ORG_DATASET_PE_FEATURES_CSV)
219 | 
220 |     # sort class names and re-assign the new class IDs w.r.t. the sorted classes.
221 |     df.sort_values(by=['Malware_ClassName'], inplace=True)
222 |     malware_classes = df['Malware_ClassName'].values
223 |     malware_classes = sorted(list(set(list(malware_classes))))
224 |     new_class_ids = df.apply(lambda row: malware_classes.index(row['Malware_ClassName']), axis=1)
225 |     df['Malware_ClassID'] = new_class_ids
226 | 
227 |     data = df.drop(['Name', 'md5', 'Malware_ClassName'], axis=1)
228 |     x = data.drop(['Malware_ClassID'], axis=1)
229 |     y = data['Malware_ClassID']
230 |     x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
231 | 
232 |     model_params['num_of_classes'] = len(malware_classes)
233 |     model_params['class_names'] = malware_classes
234 | 
235 |     sc = StandardScaler()
236 |     x_train = sc.fit_transform(x_train)
237 |     x_test = sc.transform(x_test)
238 | 
239 |     model, gsc_model = create_shallow_model(model_params=model_params)
240 | 
241 |     x_pred, y_pred, best_estimator, best_params = execute_shallow_model(model=gsc_model, x_train=x_train,
242 |                                                                         y_train=y_train, x_test=x_test,
243 |                                                                         model_params=model_params)
244 | 
245 |     test_accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
246 | 
247 |     model_params['train_accuracy'] = gsc_model.cv_results_['mean_train_score'][gsc_model.best_index_]
248 |     model_params['test_accuracy'] = test_accuracy
249 |     print(f"Train accuracy: {model_params['train_accuracy']:7.4f}%")
250 |     print(f"Test accuracy: {model_params['test_accuracy']:7.4f}%")
251 | 
252 |     save_model_results_to_log(model=gsc_model, model_params=model_params,
253 |                               predicted=y_pred, ground_truth=y_test, best_params=best_params,
254 |                               log_dir=LOG_DIR)
255 | 
256 | 
257 | def process_shallow_learning(LOG_DIR):
258 |     shallow_expr_list = get_shallow_expr_list()
259 |     total_expr = len(shallow_expr_list)
260 |     final_results = []
261 |     for num, ml in enumerate(shallow_expr_list):
262 |         print_line()
263 |         print(f'Executing : {ml["experiment_name"]} ({num + 1}/{total_expr})')
264 |         print_line()
265 |         prepare_shallow_model(ml, LOG_DIR)
266 |         temp_dict = {'experiment_name': ml['experiment_name'],
267 |                      'train_accuracy': ml['train_accuracy'],
268 |                      'test_accuracy': ml['test_accuracy']}
269 |         final_results.append(temp_dict)
270 | 
271 |     exp_results_filename = os.path.join(LOG_DIR, 'shallow_' + EXPERIMENT_RESULTS)
272 |     df = pd.DataFrame(final_results)
273 |     expr_name = df['experiment_name']
274 |     df.drop(['experiment_name'], axis=1, inplace=True)
275 |     df.set_index(expr_name, drop=True, inplace=True)
276 |     df.to_csv(exp_results_filename)
277 |     save_models_metadata_to_log(shallow_expr_list, LOG_DIR)
278 | 
279 | 
280 | def process_conv_transfer_learning(LOG_DIR):
281 |     tl_expr_list = get_conv_transfer_learning_expr_list()
282 |     total_expr = len(tl_expr_list)
283 |     final_results = []
284 |     for num, ml in enumerate(tl_expr_list):
285 |         print_line()
286 |         print(f'Executing : {ml["experiment_name"]} ({num + 1}/{total_expr})')
287 |         print_line()
288 |         try:
289 |             execute_conv_tl_model(ml, LOG_DIR)
290 |         except Exception:
291 |             temp_dict = {'experiment_name': ml['experiment_name'],
292 |                          'train_accuracy': 'failed',
293 |                          'test_accuracy': 'failed'}
294 |             print_line()
295 |             print("FAILED")
296 |             traceback.print_exc()
297 |             print_line()
298 |             print(sys.exc_info()[0])
299 |             print_line()
300 |         else:
301 |             temp_dict = {'experiment_name': ml['experiment_name'],
302 |                          'train_accuracy': ml['train_accuracy'],
303 |                          'test_accuracy': ml['test_accuracy']}
304 |         final_results.append(temp_dict)
305 | 
306 |     exp_results_filename = os.path.join(LOG_DIR, 'conv_tl_' + EXPERIMENT_RESULTS)
307 |     df = pd.DataFrame(final_results)
308 |     expr_name = df['experiment_name']
309 |     df.drop(['experiment_name'], axis=1, inplace=True)
310 |     df.set_index(expr_name, drop=True, inplace=True)
311 |     df.to_csv(exp_results_filename)
312 |     save_models_metadata_to_log(tl_expr_list, LOG_DIR)
313 | 
314 | 
315 | def main(args, LOG_DIR):
316 |     deep_learning_models = []
317 |     if args.deep_feedforward:
318 |         deep_learning_models.append(DEEP_FF)
319 |     if args.deep_rnn:
320 |         deep_learning_models.append(DEEP_RNN)
321 | 
322 |     if len(deep_learning_models) > 0:
323 |         print_line()
324 |         print('Starting Deep Learning Experiments to detect malware')
325 |         print_line()
326 |         process_deep_learning(deep_learning_models, LOG_DIR)
327 |         print_line()
328 | 
329 |     if args.shallow_ml:
330 |         print('Starting shallow Machine Learning Experiments to detect malware')
331 |         print_line()
332 |         process_shallow_learning(LOG_DIR)
333 |         print_line()
334 | 
335 |     if args.transfer_conv_ml:
336 |         print('Starting conv-based Transfer Learning Experiments to detect malware')
337 |         print_line()
338 |         process_conv_transfer_learning(LOG_DIR)
339 |         print_line()
340 | 
341 | 
342 | def print_banner(LOG_DIR):
343 |     print_line()
344 |     if use_cuda:
345 |         print('Using GPU:', torch.cuda.get_device_name(torch.cuda.current_device()))
346 |     else:
347 |         print('Running on :', device)
348 | 
349 |     print(f'LOG_DIR = {LOG_DIR}')
350 |     print_line()
351 | 
352 | 
353 | if __name__ == '__main__':
354 |     parser = argparse.ArgumentParser(description='Train Machine Learning models to detect and classify Malware')
355 | 
356 |     parser.add_argument('--deep_feedforward', action='store_true', help='Execute deep feedforward models',
357 |                         default=False)
358 |     parser.add_argument('--deep_rnn', action='store_true', help='Execute deep rnn models',
359 |                         default=False)
360 |     parser.add_argument('--shallow_ml', action='store_true', help='Execute shallow machine learning models',
361 |                         default=False)
362 |     parser.add_argument('--transfer_conv_ml', action='store_true', help='Transfer learning using conv-based models',
363 |                         default=False)
364 |     args = parser.parse_args()
365 | 
366 |     if len(sys.argv) < 2:
367 |         parser.print_usage()
368 |         sys.exit(1)
369 | 
370 |     LOG_DIR = setup()
371 |     print_banner(LOG_DIR)
372 |     main(args, LOG_DIR)
373 |     print_banner(LOG_DIR)
374 | 


--------------------------------------------------------------------------------
/models/AnnModels.py:
--------------------------------------------------------------------------------
 1 | import random
 2 | import torch
 3 | import torch.nn as nn
 4 | import torch.nn.functional as F
 5 | from torch.utils.data import DataLoader
 6 | from torchvision import datasets, transforms, models
 7 | from torchvision.utils import make_grid
 8 | from config import *
 9 | import numpy as np
10 | 
11 | 
12 | class ANNMalware_Model1(nn.Module):
13 |     def __init__(self, image_dim=32, num_of_classes=20):
14 |         super().__init__()
15 | 
16 |         self.image_dim = image_dim
17 |         self.num_of_classes = num_of_classes
18 | 
19 |         self.linear1_in_features = int(self.image_dim * self.image_dim)
20 |         # reduce the neurons by 20% i.e. take 80% in_features
21 |         self.linear1_out_features = int(self.linear1_in_features * 0.80)
22 |         # reduce the neurons by 40%
23 |         self.linear2_out_features = int(self.linear1_out_features * 0.60)
24 | 
25 |         self.classifier = nn.Sequential(
26 |             nn.Linear(self.linear1_in_features, self.linear1_out_features),
27 |             nn.ReLU(inplace=True),
28 |             nn.Linear(self.linear1_out_features, self.linear2_out_features),
29 |             nn.ReLU(inplace=True),
30 |             nn.Linear(self.linear2_out_features, self.num_of_classes),
31 |         )
32 | 
33 |     def forward(self, x):
34 |         x = torch.flatten(x, 1)
35 |         x = self.classifier(x)
36 |         return F.log_softmax(x, dim=1)
37 | 
38 | 
39 | class ANNMalware_Model2(nn.Module):
40 |     def __init__(self, image_dim=32, num_of_classes=20):
41 |         super().__init__()
42 | 
43 |         self.image_dim = image_dim
44 |         self.num_of_classes = num_of_classes
45 | 
46 |         self.linear1_in_features = int(self.image_dim * self.image_dim)
47 |         # reduce the neurons by 20% i.e. take 80% in_features
48 |         self.linear1_out_features = int(self.linear1_in_features * 0.80)
49 |         # reduce the neurons by 40%
50 |         self.linear2_out_features = int(self.linear1_out_features * 0.60)
51 |         # keep 40% of linear1_out_features
52 |         self.linear3_out_features = int(self.linear1_out_features * 0.40)
53 |         # keep 20% of linear1_out_features
54 |         self.linear4_out_features = int(self.linear1_out_features * 0.20)
55 | 
56 |         self.classifier = nn.Sequential(
57 |             nn.Linear(self.linear1_in_features, self.linear1_out_features),
58 |             nn.ReLU(inplace=True),
59 |             nn.Linear(self.linear1_out_features, self.linear2_out_features),
60 |             nn.ReLU(inplace=True),
61 |             nn.Linear(self.linear2_out_features, self.linear3_out_features),
62 |             nn.ReLU(inplace=True),
63 |             nn.Linear(self.linear3_out_features, self.linear4_out_features),
64 |             nn.ReLU(inplace=True),
65 |             nn.Linear(self.linear4_out_features, self.num_of_classes)
66 |         )
67 | 
68 |     def forward(self, x):
69 |         x = torch.flatten(x, 1)
70 |         x = self.classifier(x)
71 |         return F.log_softmax(x, dim=1)
72 | 
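The layer widths used by `ANNMalware_Model1` and `ANNMalware_Model2` follow from `image_dim` alone. A minimal sketch of that arithmetic in plain Python (no PyTorch required) is given here; the helper name `ann_layer_widths` is hypothetical and not part of this repository.

```python
def ann_layer_widths(image_dim=32, deep=False):
    """Mirror the width arithmetic of the ANN models (illustrative helper only)."""
    in_features = int(image_dim * image_dim)
    l1 = int(in_features * 0.80)  # 80% of the flattened image size
    widths = [in_features, l1, int(l1 * 0.60)]
    if deep:
        # ANNMalware_Model2 adds two more layers, both sized relative to l1
        widths += [int(l1 * 0.40), int(l1 * 0.20)]
    return widths


print(ann_layer_widths(32))             # [1024, 819, 491]
print(ann_layer_widths(32, deep=True))  # [1024, 819, 491, 327, 163]
```

Each list gives the input size followed by the output size of every hidden `nn.Linear` layer; the final layer maps the last width to `num_of_classes`.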


--------------------------------------------------------------------------------
/models/CnnModels.py:
--------------------------------------------------------------------------------
  1 | import random
  2 | import torch
  3 | import torch.nn as nn
  4 | import torch.nn.functional as F
  5 | from torch.utils.data import DataLoader
  6 | from torchvision import datasets, transforms, models
  7 | from torchvision.utils import make_grid
  8 | from config import *
  9 | import numpy as np
 10 | 
 11 | 
 12 | class CNNMalware_Model1(nn.Module):
 13 |     def __init__(self, image_dim=32, num_of_classes=20):
 14 |         super().__init__()
 15 | 
 16 |         self.image_dim = image_dim
 17 |         self.num_of_classes = num_of_classes
 18 | 
 19 |         self.conv1_out_channel = 12
 20 |         self.conv1_kernel_size = 3
 21 | 
 22 |         self.conv2_out_channel = 16
 23 |         self.conv2_kernel_size = 3
 24 | 
 25 |         self.linear1_out_features = 120
 26 |         self.linear2_out_features = 90
 27 | 
 28 |         self.conv1 = nn.Conv2d(1, self.conv1_out_channel, self.conv1_kernel_size, stride=1,
 29 |                                padding=(2, 2))
 30 | 
 31 |         self.conv2 = nn.Conv2d(self.conv1_out_channel, self.conv2_out_channel, self.conv2_kernel_size, stride=1,
 32 |                                padding=(2, 2))
 33 | 
 34 |         self.temp = int((((self.image_dim + 2) / 2) + 2) / 2)
 35 | 
 36 |         self.fc1 = nn.Linear(self.temp * self.temp * self.conv2_out_channel, self.linear1_out_features)
 37 |         self.fc2 = nn.Linear(self.linear1_out_features, self.linear2_out_features)
 38 |         self.fc3 = nn.Linear(self.linear2_out_features, self.num_of_classes)
 39 | 
 40 |     def forward(self, X):
 41 |         X = F.relu(self.conv1(X))
 42 |         X = F.max_pool2d(X, 2, 2)
 43 |         X = F.relu(self.conv2(X))
 44 |         X = F.max_pool2d(X, 2, 2)
 45 |         X = X.view(-1, self.temp * self.temp * self.conv2_out_channel)
 46 |         X = F.relu(self.fc1(X))
 47 |         X = F.relu(self.fc2(X))
 48 |         X = self.fc3(X)
 49 |         return F.log_softmax(X, dim=1)
 50 | 
 51 | 
 52 | class CNNMalware_Model2(nn.Module):
 53 |     def __init__(self, image_dim=32, num_of_classes=20):
 54 |         super().__init__()
 55 | 
 56 |         self.image_dim = image_dim
 57 |         self.num_of_classes = num_of_classes
 58 |         self.padding = 2
 59 |         self.conv1_out_channel = 15
 60 |         self.conv1_kernel_size = 15
 61 |         self.stride = 1
 62 |         self.conv2_out_channel = 16
 63 |         self.conv2_kernel_size = 3
 64 | 
 65 |         conv1_neurons = int((self.image_dim - self.conv1_kernel_size + 2 * self.padding) / self.stride + 1)
 66 |         maxpool2d_1_neurons = int(conv1_neurons / 2)
 67 |         conv2_neurons = int((maxpool2d_1_neurons - self.conv2_kernel_size + 2 * self.padding) / self.stride + 1)
 68 |         maxpool2d_2_neurons = int(conv2_neurons / 2)
 69 | 
 70 |         self.linear1_in_features = int(maxpool2d_2_neurons * maxpool2d_2_neurons * self.conv2_out_channel)
 71 | 
 72 |         # reduce the neurons by 20% i.e. take 80% in_features
 73 |         self.linear1_out_features = int(self.linear1_in_features * 0.80)
 74 |         # reduce the neurons by 40%
 75 |         self.linear2_out_features = int(self.linear1_out_features * 0.60)
 76 | 
 77 |         self.features = nn.Sequential(
 78 |             nn.Conv2d(1, self.conv1_out_channel, self.conv1_kernel_size,
 79 |                       stride=self.stride, padding=(self.padding, self.padding)),
 80 |             nn.ReLU(inplace=True),
 81 |             nn.MaxPool2d(kernel_size=2),
 82 |             nn.Conv2d(self.conv1_out_channel, self.conv2_out_channel, self.conv2_kernel_size,
 83 |                       stride=self.stride, padding=(self.padding, self.padding)),
 84 |             nn.ReLU(inplace=True),
 85 |             nn.MaxPool2d(kernel_size=2),
 86 |         )
 87 | 
 88 |         self.classifier = nn.Sequential(
 89 |             nn.Linear(self.linear1_in_features, self.linear1_out_features),
 90 |             nn.ReLU(inplace=True),
 91 |             nn.Linear(self.linear1_out_features, self.linear2_out_features),
 92 |             nn.ReLU(inplace=True),
 93 |             nn.Linear(self.linear2_out_features, self.num_of_classes),
 94 |         )
 95 | 
 96 |     def forward(self, x):
 97 |         x = self.features(x)
 98 |         x = torch.flatten(x, 1)
 99 |         x = self.classifier(x)
100 |         return F.log_softmax(x, dim=1)
101 | 
102 | 
103 | class CNNMalware_Model3(nn.Module):
104 |     def __init__(self, image_dim_h=1, image_dim_w=1024, num_of_classes=20):
105 |         super().__init__()
106 | 
107 |         self.image_dim_h = image_dim_h
108 |         self.image_dim_w = image_dim_w
109 |         self.num_of_classes = num_of_classes
110 | 
111 |         self.conv1_in_channel = 1
112 |         self.conv1_out_channel = 28
113 |         self.conv1_kernel_size = 3
114 | 
115 |         self.conv2_out_channel = 16
116 |         self.conv2_kernel_size = 3
117 | 
118 |         self.linear1_out_features = 120
119 |         self.linear2_out_features = 90
120 | 
121 |         self.conv1 = nn.Conv1d(self.conv1_in_channel, self.conv1_out_channel, self.conv1_kernel_size, stride=1,
122 |                                padding=(0))
123 | 
124 |         self.conv2 = nn.Conv1d(self.conv1_out_channel, self.conv2_out_channel, self.conv2_kernel_size, stride=1,
125 |                                padding=(2))
126 | 
127 |         self.fc1_size = self.conv2_out_channel * self.image_dim_w
128 | 
129 |         self.fc1 = nn.Linear(self.fc1_size, self.linear1_out_features)
130 |         self.fc2 = nn.Linear(self.linear1_out_features, self.linear2_out_features)
131 |         self.fc3 = nn.Linear(self.linear2_out_features, self.num_of_classes)
132 | 
133 |     def forward(self, X):
134 |         X = X.squeeze(dim=1)
135 |         X = F.relu(self.conv1(X))
136 |         X = F.relu(self.conv2(X))
137 |         X = X.view(-1, self.fc1_size)
138 |         X = F.relu(self.fc1(X))
139 |         X = F.relu(self.fc2(X))
140 |         X = self.fc3(X)
141 |         return F.log_softmax(X, dim=1)
142 | 
143 | 
144 | class CNNMalware_Model4(nn.Module):
145 |     def __init__(self, image_dim_h=1, image_dim_w=1024, num_of_classes=20,
146 |                  c1_out=32, c1_kernel=16, c1_padding=2, c1_stride=2,
147 |                  c2_out=32, c2_kernel=8, c2_padding=2, c2_stride=2,
148 |                  ):
149 |         super().__init__()
150 | 
151 |         self.image_dim_h = image_dim_h
152 |         self.image_dim_w = image_dim_w
153 |         self.num_of_classes = num_of_classes
154 |         self.dilation = 1
155 | 
156 |         self.conv1_in_channel = 1
157 |         self.conv1_out_channel = c1_out
158 |         self.conv1_kernel_size = c1_kernel
159 |         self.conv1_padding = c1_padding
160 |         self.conv1_stride = c1_stride
161 | 
162 |         self.conv2_out_channel = c2_out
163 |         self.conv2_kernel_size = c2_kernel
164 |         self.conv2_padding = c2_padding
165 |         self.conv2_stride = c2_stride
166 | 
167 |         self.linear1_out_features = 128 * 4
168 |         self.linear2_out_features = 128
169 | 
170 |         self.conv1 = nn.Conv1d(self.conv1_in_channel, self.conv1_out_channel, self.conv1_kernel_size,
171 |                                stride=self.conv1_stride,
172 |                                padding=(self.conv1_padding), dilation=self.dilation)
173 | 
174 |         self.conv1_out_dim = self.calc_conv1d_out(self.image_dim_w, self.conv1_padding,
175 |                                                   self.dilation, self.conv1_kernel_size, self.conv1_stride)
176 | 
177 |         self.conv2 = nn.Conv1d(self.conv1_out_channel, self.conv2_out_channel, self.conv2_kernel_size,
178 |                                stride=self.conv2_stride,
179 |                                padding=(self.conv2_padding))
180 | 
181 |         self.conv2_out_dim = self.calc_conv1d_out(self.conv1_out_dim, self.conv2_padding,
182 |                                                   self.dilation, self.conv2_kernel_size, self.conv2_stride)
183 | 
184 |         self.fc1 = nn.Linear(self.conv2_out_dim * self.conv2_out_channel, self.linear1_out_features)
185 |         self.fc2 = nn.Linear(self.linear1_out_features, self.linear2_out_features)
186 |         self.fc3 = nn.Linear(self.linear2_out_features, self.num_of_classes)
187 | 
188 |     def calc_conv1d_out(self, l_in, padding, dilation, kernel_size, stride):
189 |         return int(((l_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride) + 1)
190 | 
191 |     def forward(self, X):
192 |         X = X.squeeze(dim=1)
193 |         X = F.relu(self.conv1(X))
194 |         X = F.relu(self.conv2(X))
195 |         X = X.view(X.shape[0], -1)
196 |         X = F.relu(self.fc1(X))
197 |         X = F.relu(self.fc2(X))
198 |         X = self.fc3(X)
199 |         return F.log_softmax(X, dim=1)
200 | 
201 | 
202 | class CNNMalware_Model5(nn.Module):
203 |     def __init__(self, input_dim=1024, embedding_dim=128, n_filters=3, filter_sizes=(3, 6), output_dim=128, dropout=0.3):
204 |         super().__init__()
205 | 
206 |         self.embedding = nn.Embedding(input_dim, embedding_dim)
207 | 
208 |         self.convs = nn.ModuleList([
209 |             nn.Conv2d(in_channels=1,
210 |                       out_channels=n_filters,
211 |                       kernel_size=(fs, embedding_dim))
212 |             for fs in filter_sizes
213 |         ])
214 | 
215 |         self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
216 | 
217 |         self.dropout = nn.Dropout(dropout)
218 | 
219 |     def forward(self, text):
220 |         text = text.permute(1, 0)
221 |         embedded = self.embedding(text)
222 |         embedded = embedded.unsqueeze(1)
223 |         conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
224 |         pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
225 |         cat = self.dropout(torch.cat(pooled, dim=1))
226 |         cat2 = self.fc(cat)
227 |         return F.log_softmax(cat2, dim=1)
228 | 
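The input sizes of the first fully connected layers in the CNN models depend on the convolution and pooling output-size arithmetic. As a rough sketch in plain Python, the standard formula floor((L + 2p - d(k - 1) - 1) / s) + 1 can be checked against the default arguments of `CNNMalware_Model2` and `CNNMalware_Model4`. The helper names `conv_out` and `pool_out` are hypothetical and do not exist in this repository.

```python
def conv_out(l_in, kernel, padding=0, stride=1, dilation=1):
    """Output length of a convolution along one dimension."""
    return (l_in + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def pool_out(l_in, kernel=2):
    """Output length of max pooling whose stride equals its kernel size."""
    return l_in // kernel


# CNNMalware_Model2 defaults: 32x32 input, padding 2, stride 1
side = pool_out(conv_out(32, kernel=15, padding=2))    # 22 -> 11
side = pool_out(conv_out(side, kernel=3, padding=2))   # 13 -> 6
model2_fc_in = side * side * 16                        # 16 = conv2_out_channel

# CNNMalware_Model4 defaults: width 1024, padding 2, stride 2
width = conv_out(1024, kernel=16, padding=2, stride=2)  # 507
width = conv_out(width, kernel=8, padding=2, stride=2)  # 252
model4_fc_in = width * 32                               # 32 = c2_out

print(model2_fc_in, model4_fc_in)  # 576 8064
```

Running a check like this before changing kernel size, padding, or stride helps catch a mismatch between the flattened feature map and the first `nn.Linear` layer.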


--------------------------------------------------------------------------------
/models/RnnModels.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | import torch.nn as nn
  3 | import torch.nn.functional as F
  4 | 
  5 | 
  6 | class RNNMalware_Model1(torch.nn.Module):
  7 |     def __init__(self, input_dim=0, embedding_dim=100, hidden_dim=100, output_dim=20,
  8 |                  batch_size=256, num_layers=1, bidirectional=False, dropout=0):
  9 |         super().__init__()
 10 |         self.input_dim = input_dim
 11 |         self.embedding_dim = embedding_dim
 12 |         self.hidden_dim = hidden_dim
 13 |         self.output_dim = output_dim
 14 |         self.batch_size = batch_size
 15 |         self.num_layers = num_layers
 16 |         self.bidirectional = bidirectional
 17 |         self.dropout = dropout
 18 |         self.fc_hidden_dim = self.hidden_dim
 19 | 
 20 |         if self.bidirectional:
 21 |             self.fc_hidden_dim = self.hidden_dim * 2
 22 | 
 23 |         self.embedding = nn.Embedding(self.input_dim, self.embedding_dim)
 24 |         self.rnn = nn.RNN(input_size=self.embedding_dim, hidden_size=self.hidden_dim, num_layers=self.num_layers,
 25 |                           nonlinearity='relu', bidirectional=self.bidirectional, dropout=self.dropout)
 26 |         self.fc = nn.Linear(self.fc_hidden_dim, self.output_dim)
 27 | 
 28 |     def forward(self, opcode):
 29 |         embedded = self.embedding(opcode)
 30 |         output, hidden = self.rnn(embedded)
 31 |         return self.fc(output[-1])
 32 | 
 33 | 
 34 | class LSTMMalware_Model1(torch.nn.Module):
 35 |     def __init__(self, input_dim=0, embedding_dim=100, hidden_dim=100, output_dim=20,
 36 |                  batch_size=256, num_layers=1, bidirectional=False, dropout=0):
 37 |         super().__init__()
 38 |         self.input_dim = input_dim
 39 |         self.embedding_dim = embedding_dim
 40 |         self.hidden_dim = hidden_dim
 41 |         self.output_dim = output_dim
 42 |         self.batch_size = batch_size
 43 |         self.num_layers = num_layers
 44 |         self.bidirectional = bidirectional
 45 |         self.dropout = dropout
 46 |         self.fc_hidden_dim = self.hidden_dim
 47 | 
 48 |         if self.bidirectional:
 49 |             self.fc_hidden_dim = self.hidden_dim * 2
 50 | 
 51 |         self.embedding = nn.Embedding(self.input_dim, self.embedding_dim)
 52 | 
 53 |         self.lstm = nn.LSTM(input_size=self.embedding_dim, hidden_size=self.hidden_dim, num_layers=self.num_layers,
 54 |                             bidirectional=self.bidirectional, dropout=self.dropout)
 55 | 
 56 |         self.dropout_layer = nn.Dropout(self.dropout)
 57 | 
 58 |         self.fc = nn.Linear(self.fc_hidden_dim, self.output_dim)
 59 | 
 60 |     def forward(self, opcode):
 61 |         embedded = self.embedding(opcode)
 62 |         output, (hidden, cell) = self.lstm(embedded)
 63 | 
 64 |         if self.bidirectional:
 65 |             hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
 66 |         else:
 67 |             hidden = hidden[-1::].squeeze(0)
 68 | 
 69 |         fc_in = self.dropout_layer(hidden)
 70 |         return self.fc(fc_in)
 71 | 
 72 | 
 73 | class GRUMalware_Model1(torch.nn.Module):
 74 |     def __init__(self, input_dim=0, embedding_dim=100, hidden_dim=100, output_dim=20,
 75 |                  batch_size=256, num_layers=1, bidirectional=False, dropout=0):
 76 |         super().__init__()
 77 |         self.input_dim = input_dim
 78 |         self.embedding_dim = embedding_dim
 79 |         self.hidden_dim = hidden_dim
 80 |         self.output_dim = output_dim
 81 |         self.batch_size = batch_size
 82 |         self.num_layers = num_layers
 83 |         self.bidirectional = bidirectional
 84 |         self.dropout = dropout
 85 |         self.fc_hidden_dim = self.hidden_dim
 86 | 
 87 |         if self.bidirectional:
 88 |             self.fc_hidden_dim = self.hidden_dim * 2
 89 | 
 90 |         self.embedding = nn.Embedding(self.input_dim, self.embedding_dim)
 91 | 
 92 |         self.gru = nn.GRU(input_size=self.embedding_dim, hidden_size=self.hidden_dim, num_layers=self.num_layers,
 93 |                           bidirectional=self.bidirectional, dropout=self.dropout)
 94 | 
 95 |         self.dropout_layer = nn.Dropout(self.dropout)
 96 | 
 97 |         self.fc = nn.Linear(self.fc_hidden_dim, self.output_dim)
 98 | 
 99 |     def forward(self, opcode):
100 |         embedded = self.embedding(opcode)
101 |         output, hidden = self.gru(embedded)
102 | 
103 |         if self.bidirectional:
104 |             hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
105 |         else:
106 |             hidden = hidden[-1::].squeeze(0)
107 | 
108 |         fc_in = self.dropout_layer(hidden)
109 |         return self.fc(fc_in)
110 | 
111 | 
112 | class StackedMalware_Model1(torch.nn.Module):
113 |     def __init__(self, input_dim=0, embedding_dim=100, hidden_dim=100, output_dim=20,
114 |                  batch_size=256, num_layers=1, bidirectional=False, dropout=0, LG=True):
115 |         super().__init__()
116 |         self.input_dim = input_dim
117 |         self.embedding_dim = embedding_dim
118 |         self.hidden_dim = hidden_dim
119 |         self.output_dim = output_dim
120 |         self.batch_size = batch_size
121 |         self.num_layers = num_layers
122 |         self.fc_hidden_dim = self.hidden_dim
123 |         self.bidirectional = bidirectional
124 |         self.num_of_direction = 1
125 |         if self.bidirectional:
126 |             self.num_of_direction = 2
127 |             self.fc_hidden_dim = self.hidden_dim * 2
128 |         self.dropout = dropout
129 |         self.LG = LG
130 |         self.embedding_dim_gru = self.embedding_dim
131 |         self.embedding_dim_lstm = self.embedding_dim
132 | 
133 |         if self.LG:
134 |             self.embedding_dim_gru = self.hidden_dim
135 |         else:
136 |             self.embedding_dim_lstm = self.hidden_dim
137 | 
138 |         self.embedding = nn.Embedding(self.input_dim, self.embedding_dim)
139 | 
140 |         self.lstm = nn.LSTM(input_size=self.embedding_dim_lstm, hidden_size=self.hidden_dim, num_layers=self.num_layers,
141 |                             bidirectional=self.bidirectional, dropout=self.dropout)
142 | 
143 |         self.gru = nn.GRU(input_size=self.embedding_dim_gru, hidden_size=self.hidden_dim, num_layers=self.num_layers,
144 |                           bidirectional=self.bidirectional, dropout=self.dropout)
145 | 
146 |         self.dropout_layer = nn.Dropout(self.dropout)
147 | 
148 |         self.fc = nn.Linear(self.fc_hidden_dim, self.output_dim)
149 | 
150 |     def forward(self, opcode):
151 |         embedded = self.embedding(opcode)
152 | 
153 |         if self.LG:
154 |             output, (hidden, cell) = self.lstm(embedded)
155 |             output, hidden = self.gru(hidden)
156 |         else:
157 |             output, hidden = self.gru(embedded)
158 |             output, (hidden, cell) = self.lstm(hidden)
159 | 
160 |         if self.bidirectional:
161 |             hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
162 |         else:
163 |             hidden = hidden[-1, :, :]  # already (batch, hidden_dim); squeeze(0) would drop the batch dim when batch size is 1
164 | 
165 |         fc_in = self.dropout_layer(hidden)
166 |         return self.fc(fc_in)
167 | 


--------------------------------------------------------------------------------
/models/Shallow_ML_models.py:
--------------------------------------------------------------------------------
 1 | import xgboost as xgb
 2 | 
 3 | def execute_shallow_model(model=None, x_train=None, y_train=None, x_test=None, model_params=None):
 4 |     if model_params['model_name'] == 'XGB':
 5 |         model.fit(x_train, y_train, eval_metric='error')
 6 |     else:
 7 |         model.fit(x_train, y_train)
 8 | 
 9 |     best_params = model.best_params_
10 |     best_estimator = model.best_estimator_
11 | 
12 |     y_pred = best_estimator.predict(x_test)  # test data
13 |     x_pred = best_estimator.predict(x_train)  # training data
14 | 
15 |     return x_pred, y_pred, best_estimator, best_params
16 | 


--------------------------------------------------------------------------------
/models/TransferLearnModels.py:
--------------------------------------------------------------------------------
 1 | import torch.nn as nn
 2 | from torchvision import models
 3 | import torch.nn.functional as F
 4 | 
 5 | 
 6 | class Resnet152_wrapper(nn.Module):
 7 |     def __init__(self, model_params):
 8 |         super(Resnet152_wrapper, self).__init__()
 9 |         self.model_params = model_params
10 |         self.num_of_classes = model_params['num_of_classes']
11 |         self.resnet152 = models.resnet152(pretrained=True)
12 |         self.params_to_update = []
13 | 
14 |         for param in self.resnet152.parameters():
15 |             param.requires_grad = False
16 | 
17 |         for param in self.resnet152.layer4.parameters():
18 |             self.params_to_update.append(param)
19 |             param.requires_grad = True
20 | 
21 |         self.fc_malware1 = nn.Linear(1000, 500)
22 |         self.fc_malware2 = nn.Linear(500, self.num_of_classes)
23 | 
24 |         for param in self.fc_malware1.parameters():
25 |             self.params_to_update.append(param)
26 |             param.requires_grad = True
27 | 
28 |         for param in self.fc_malware2.parameters():
29 |             self.params_to_update.append(param)
30 |             param.requires_grad = True
31 | 
32 |     def parameters(self, recurse=True):  # same signature as nn.Module.parameters; returns only the trainable subset
33 |         return self.params_to_update
34 | 
35 |     def forward(self, x):
36 |         x = self.resnet152(x)
37 |         x = self.fc_malware1(x)
38 |         x = self.fc_malware2(x)
39 |         return F.log_softmax(x, dim=1)
40 | 
41 | 
42 | class VGG19_wrapper(nn.Module):
43 |     def __init__(self, model_params):
44 |         super(VGG19_wrapper, self).__init__()
45 |         self.model_params = model_params
46 |         self.num_of_classes = model_params['num_of_classes']
47 |         self.vgg19 = models.vgg19(pretrained=True)
48 |         self.params_to_update = []
49 | 
50 |         for param in self.vgg19.parameters():
51 |             param.requires_grad = False
52 | 
53 |         list_of_features_layers = [34, 35, 36]
54 |         for f in list_of_features_layers:
55 |             for param in self.vgg19.features[f].parameters():
56 |                 self.params_to_update.append(param)
57 |                 param.requires_grad = True
58 | 
59 |         for param in self.vgg19.classifier.parameters():
60 |             self.params_to_update.append(param)
61 |             param.requires_grad = True
62 | 
63 |         self.fc_malware1 = nn.Linear(1000, 500)
64 |         self.fc_malware2 = nn.Linear(500, self.num_of_classes)
65 | 
66 |         for param in self.fc_malware1.parameters():
67 |             self.params_to_update.append(param)
68 |             param.requires_grad = True
69 | 
70 |         for param in self.fc_malware2.parameters():
71 |             self.params_to_update.append(param)
72 |             param.requires_grad = True
73 | 
74 |     def parameters(self, recurse=True):  # same signature as nn.Module.parameters; returns only the trainable subset
75 |         return self.params_to_update
76 | 
77 |     def forward(self, x):
78 |         x = self.vgg19(x)
79 |         x = self.fc_malware1(x)
80 |         x = self.fc_malware2(x)
81 |         return F.log_softmax(x, dim=1)
82 | 


--------------------------------------------------------------------------------
/models/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pratikpv/malware_detect2/12f4390f1c4a7a3e8b6bc355cc06f08f4dd3126d/models/__init__.py


--------------------------------------------------------------------------------
/models/model_trainers_testers.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | import torch.nn as nn
  3 | import torch.nn.functional as F
  4 | from torch.utils.data import DataLoader
  5 | from torchvision import datasets, transforms, models
  6 | from torchvision.utils import make_grid
  7 | from config import *
  8 | import numpy as np
  9 | from models.CnnModels import *
 10 | from data_utils.data_loaders import *
 11 | from utils import *
 12 | from tqdm import tqdm
 13 | 
 14 | 
 15 | def train_ann_model(model=None, model_params=None, criterion=None,
 16 |                     train_loader=None, log_dir=None):
 17 |     epochs = model_params['epochs']
 18 |     lr = model_params['lr']
 19 | 
 20 |     optimizer = torch.optim.Adam(model.parameters(), lr=lr)
 21 | 
 22 |     train_losses = []
 23 |     train_accuracy = []
 24 | 
 25 |     tqdm_train_descr_format = "Training Feed-Forward model: Epoch Accuracy = {:02.4f}%, Loss = {:.8f}"
 26 |     tqdm_train_descr = tqdm_train_descr_format.format(0, float('inf'))
 27 |     tqdm_train_obj = tqdm(range(epochs), desc=tqdm_train_descr)
 28 | 
 29 |     model.train(True)
 30 | 
 31 |     for i in tqdm_train_obj:
 32 | 
 33 |         epoch_corr = 0
 34 |         epoch_loss = 0
 35 |         total_samples = 0
 36 | 
 37 |         for b, (X_train, y_train) in enumerate(train_loader):
 38 |             X_train = X_train.to(device)
 39 |             y_train = y_train.to(device)
 40 | 
 41 |             y_pred = model(X_train)
 42 |             loss = criterion(y_pred, y_train)
 43 | 
 44 |             predicted = torch.max(y_pred.data, 1)[1]
 45 |             batch_corr = (predicted == y_train).sum()
 46 |             epoch_corr += batch_corr.item()
 47 |             epoch_loss += loss.item()
 48 |             total_samples += y_pred.shape[0]
 49 | 
 50 |             # Update parameters
 51 |             optimizer.zero_grad()
 52 |             loss.backward()
 53 |             optimizer.step()
 54 | 
 55 |         epoch_accuracy = epoch_corr * 100 / total_samples
 56 |         epoch_loss = epoch_loss / total_samples
 57 | 
 58 |         train_losses.append(epoch_loss)
 59 |         train_accuracy.append(epoch_accuracy)
 60 | 
 61 |         tqdm_descr = tqdm_train_descr_format.format(epoch_accuracy, epoch_loss)
 62 |         tqdm_train_obj.set_description(tqdm_descr)
 63 | 
 64 |     return model, train_losses, train_accuracy
 65 | 
 66 | 
 67 | def test_ann_model(model=None, model_params=None, criterion=None,
 68 |                    val_loader=None):
 69 |     tqdm_test_descr_format = "Testing Feed-Forward model: Batch Accuracy = {:02.4f}%"
 70 |     tqdm_test_descr = tqdm_test_descr_format.format(0)
 71 |     tqdm_test_obj = tqdm(val_loader, desc=tqdm_test_descr)
 72 |     num_of_batches = len(val_loader)
 73 | 
 74 |     model.eval()
 75 | 
 76 |     total_test_loss = 0
 77 |     total_test_acc = 0
 78 |     predicted_all = torch.tensor([], dtype=torch.long, device=device)
 79 |     ground_truth_all = torch.tensor([], dtype=torch.long, device=device)
 80 | 
 81 |     with torch.no_grad():
 82 |         for b, (X_test, y_test) in enumerate(tqdm_test_obj):
 83 |             X_test = X_test.to(device)
 84 |             y_test = y_test.to(device)
 85 | 
 86 |             predictions = model(X_test)
 87 |             loss = criterion(predictions, y_test)
 88 | 
 89 |             predicted = torch.max(predictions.data, 1)[1]
 90 |             batch_corr = (predicted == y_test).sum()
 91 |             batch_acc = batch_corr.item() * 100 / predictions.shape[0]
 92 |             total_test_acc += batch_acc
 93 |             total_test_loss += loss.item()
 94 | 
 95 |             predicted_all = torch.cat((predicted_all, predicted), 0)
 96 |             ground_truth_all = torch.cat((ground_truth_all, y_test), 0)
 97 | 
 98 |             tqdm_test_descr = tqdm_test_descr_format.format(batch_acc)
 99 |             tqdm_test_obj.set_description(tqdm_test_descr)
100 | 
101 |     predicted_all = predicted_all.cpu().numpy()
102 |     ground_truth_all = ground_truth_all.cpu().numpy()
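103 |     total_test_acc = (predicted_all == ground_truth_all).mean() * 100  # overall accuracy; averaging per-batch accuracies is skewed by a smaller last batch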
104 | 
105 |     return total_test_acc, predicted_all, ground_truth_all
106 | 
107 | 
108 | def train_rnn_model(model=None, model_params=None, criterion=None,
109 |                     train_loader=None, log_dir=None):
110 |     epochs = model_params['epochs']
111 |     lr = model_params['lr']
112 | 
113 |     optimizer = torch.optim.Adam(model.parameters(), lr=lr)
114 | 
115 |     train_losses = []
116 |     train_accuracy = []
117 | 
118 |     tqdm_train_descr_format = "Training RNN model: Epoch Accuracy = {:02.4f}%, Loss = {:.8f}"
119 |     tqdm_train_descr = tqdm_train_descr_format.format(0, float('inf'))
120 |     tqdm_train_obj = tqdm(range(epochs), desc=tqdm_train_descr)
121 | 
122 |     model.train(True)
123 | 
124 |     for i in tqdm_train_obj:
125 | 
126 |         epoch_corr = 0
127 |         epoch_loss = 0
128 |         total_samples = 0
129 |         for b, batch in enumerate(train_loader):
130 |             batch.text = batch.text.to(device)
131 |             batch.label = batch.label.to(device)
132 | 
133 |             predictions = model(batch.text)
134 |             loss = criterion(predictions, batch.label)
135 | 
136 |             predicted = torch.max(predictions.data, 1)[1]
137 |             batch_corr = (predicted == batch.label).sum()
138 |             epoch_corr += batch_corr.item()
139 |             epoch_loss += loss.item()
140 |             total_samples += predictions.shape[0]
141 | 
142 |             optimizer.zero_grad()
143 |             loss.backward()
144 |             optimizer.step()
145 | 
146 |         epoch_accuracy = epoch_corr * 100 / total_samples
147 |         epoch_loss = epoch_loss / total_samples
148 | 
149 |         train_losses.append(epoch_loss)
150 |         train_accuracy.append(epoch_accuracy)
151 | 
152 |         tqdm_descr = tqdm_train_descr_format.format(epoch_accuracy, epoch_loss)
153 |         tqdm_train_obj.set_description(tqdm_descr)
154 | 
155 |     return model, train_losses, train_accuracy
156 | 
157 | 
158 | def test_rnn_model(model=None, model_params=None, criterion=None,
159 |                    val_loader=None):
160 |     tqdm_test_descr_format = "Testing RNN model: Batch Accuracy = {:02.4f}%"
161 |     tqdm_test_descr = tqdm_test_descr_format.format(0)
162 |     tqdm_test_obj = tqdm(val_loader, desc=tqdm_test_descr)
163 |     num_of_batches = len(val_loader)
164 | 
165 |     model.eval()
166 | 
167 |     total_test_loss = 0
168 |     total_test_acc = 0
169 |     predicted_all = torch.tensor([], dtype=torch.long, device=device)
170 |     ground_truth_all = torch.tensor([], dtype=torch.long, device=device)
171 | 
172 |     with torch.no_grad():
173 |         for b, batch in enumerate(tqdm_test_obj):
174 |             batch.text = batch.text.to(device)
175 |             batch.label = batch.label.to(device)
176 | 
177 |             predictions = model(batch.text)
178 |             loss = criterion(predictions, batch.label)
179 | 
180 |             predicted = torch.max(predictions.data, 1)[1]
181 |             batch_corr = (predicted == batch.label).sum()
182 |             batch_acc = batch_corr.item() * 100 / predictions.shape[0]
183 |             total_test_acc += batch_acc
184 |             total_test_loss += loss.item()
185 | 
186 |             predicted_all = torch.cat((predicted_all, predicted), 0)
187 |             ground_truth_all = torch.cat((ground_truth_all, batch.label), 0)
188 | 
189 |             tqdm_test_descr = tqdm_test_descr_format.format(batch_acc)
190 |             tqdm_test_obj.set_description(tqdm_test_descr)
191 | 
192 |     predicted_all = predicted_all.cpu().numpy()
193 |     ground_truth_all = ground_truth_all.cpu().numpy()
194 |     total_test_acc = total_test_acc / num_of_batches
195 | 
196 |     return total_test_acc, predicted_all, ground_truth_all
197 | 
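Note on the test loops above: `test_rnn_model` reports the mean of the per-batch accuracies, so a short final batch counts as much as a full one. The following is a minimal sketch of the difference between that figure and accuracy computed over all samples. It uses plain Python instead of torch, and the helper names and the example counts are hypothetical, not part of this repository.

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracy (%), as the test loops above do."""
    accuracies = [100 * correct / size for correct, size in batches]
    return sum(accuracies) / len(accuracies)


def sample_weighted_accuracy(batches):
    """Accuracy (%) over all samples, independent of how they were batched."""
    total_correct = sum(correct for correct, _ in batches)
    total_samples = sum(size for _, size in batches)
    return 100 * total_correct / total_samples


# (correct predictions, batch size): two full batches and a short last batch
batches = [(90, 100), (90, 100), (1, 10)]
print(mean_of_batch_accuracies(batches))
print(sample_weighted_accuracy(batches))
```

The two values agree whenever every batch has the same size; they diverge only when the last batch is smaller than the others.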


--------------------------------------------------------------------------------
/models/models_utils.py:
--------------------------------------------------------------------------------
  1 | from models.CnnModels import *
  2 | from models.AnnModels import *
  3 | from models.RnnModels import *
  4 | from models.TransferLearnModels import *
  5 | from data_utils.data_loaders import *
  6 | from models.model_trainers_testers import *
  7 | from sklearn.model_selection import GridSearchCV
  8 | import xgboost as xgb
  9 | from sklearn.ensemble import RandomForestClassifier
 10 | from sklearn.neighbors import KNeighborsClassifier
 11 | import itertools
 12 | 
 13 | 
 14 | def get_deep_rnn_expr_list(print_grid=True, simple_list=True):
 15 |     """
 16 |     sample template:
 17 |             {
 18 |             'experiment_name': 'rnn_experiment_1',
 19 |             'model_name': 'RNNMalware_Model1',
 20 |             'batch_size': 512,
 21 |             'embedding_dim': 256,
 22 |             'hidden_dim': 128,
 23 |             'epochs': 50,
 24 |             'lr': 0.0001,
 25 |             'num_layers': 1,
 26 |             'bidirectional': False,
 27 |             'dropout': 0,
 28 |             'LG': False
 29 |         }
 30 | 
 31 |     supported models so far :
 32 |     (1) RNNMalware_Model1 :
 33 |     (2) LSTMMalware_Model1 :
 34 |     (3) GRUMalware_Model1 :
 35 |     (4) StackedMalware_Model1 : Stack of LSTM and GRU layers.
 36 | 
 37 |             if LG=True => LSTM at bottom and GRU on top, e.g.
 38 |                     +---------+
 39 |                     |  GRUs   |
 40 |                     +---------+
 41 |                     |  LSTMs  |
 42 |                     +---------+
 43 |             else:
 44 |                     +---------+
 45 |                     |  LSTMs  |
 46 |                     +---------+
 47 |                     |  GRUs   |
 48 |                     +---------+
 49 | 
 50 |     ###########################################################
 51 |     Grid Template for RNNMalware_Model1, LSTMMalware_Model1, GRUMalware_Model1
 52 |         {    'model_name': ['LSTMMalware_Model1', 'GRUMalware_Model1', 'RNNMalware_Model1'],
 53 |             'batch_size': [128],
 54 |             'embedding_dim': [256],
 55 |             'hidden_dim': [256],
 56 |             'epochs': [2],
 57 |             'lr': [0.001],
 58 |             'num_layers': [1, 3],
 59 |             'bidirectional': [True, False],
 60 |             'dropout': [0.3],
 61 |             'opcode_len': [10]
 62 |         }
 63 |     ###########################################################
 64 |     Simple Template for StackedMalware_Model1
 65 |         {
 66 |             'model_name': 'StackedMalware_Model1',
 67 |             'experiment_name': 'rnn_experiment_1',
 68 |             'batch_size': 128,
 69 |             'embedding_dim': 8,
 70 |             'hidden_dim': 8,
 71 |             'epochs': 2,
 72 |             'lr': 0.001,
 73 |             'num_layers': 3,
 74 |             'bidirectional': True,
 75 |             'dropout': 0.3,
 76 |             'LG': True,
 77 |             'opcode_len': 500
 78 |         }
 79 |     ###########################################################
 80 |     """
 81 | 
 82 |     get_deep_rnn_expr_grid = {
 83 |         'model_name': ['StackedMalware_Model1'],
 84 |         'batch_size': [128],
 85 |         'embedding_dim': [256, 1024],
 86 |         'hidden_dim': [256, 1024],
 87 |         'epochs': [20],
 88 |         'lr': [0.001],
 89 |         'num_layers': [1, 3],
 90 |         'bidirectional': [True, False],
 91 |         'dropout': [0.3],
 92 |         'opcode_len': [500],
 93 |         'LG': [False]
 94 |     }
 95 | 
 96 |     simple_list = False  # NOTE: overrides the simple_list argument, so the grid below is always used
 97 |     if not simple_list and print_grid:
 98 |         print_line()
 99 |         print('Experiments Grid')
100 |         print(get_deep_rnn_expr_grid)
101 |         print_line()
102 | 
103 |     if simple_list:
104 |         get_deep_rnn_expr_list = [
105 |             {
106 |                 'model_name': 'StackedMalware_Model1',
107 |                 'experiment_name': 'rnn_experiment_1',
108 |                 'batch_size': 128,
109 |                 'embedding_dim': 8,
110 |                 'hidden_dim': 8,
111 |                 'epochs': 2,
112 |                 'lr': 0.001,
113 |                 'num_layers': 3,
114 |                 'bidirectional': True,
115 |                 'dropout': 0.3,
116 |                 'LG': True,
117 |                 'opcode_len': 10
118 |             }
119 |         ]
120 |         return get_deep_rnn_expr_list
121 |     else:
122 |         keys, values = zip(*get_deep_rnn_expr_grid.items())
123 |         permutations_dicts = []
124 |         count = 1
125 |         for v in itertools.product(*values):
126 |             temp_dict = dict(zip(keys, v))
127 |             temp_exp = 'rnn_experiment_' + str(count)
128 |             temp_dict['experiment_name'] = temp_exp
129 |             permutations_dicts.append(temp_dict)
130 |             count += 1
131 | 
132 |         return permutations_dicts
133 | 
134 | 
135 | def get_deep_feedforward_expr_list(print_grid=True, simple_list=True):
136 |     """
137 |     ###########################################################
138 |     Template for ANNMalware_Model1 (same for ANNMalware_Model2, but change model_name)
139 |             {
140 |                 'model_name': 'ANNMalware_Model1',
141 |                 'experiment_name': 'ann_experiment_1',
142 |                 'batch_size': 128,
143 |                 'image_dim': 256,
144 |                 'epochs': 10,
145 |                 'lr': 0.001,
146 |                 'feature_type': FEATURE_TYPE_IMAGE
147 |             }
148 |     ###########################################################
149 |     Template for CNNMalware_Model1
150 |             {
151 |                 'model_name': 'CNNMalware_Model1',
152 |                 'experiment_name': 'cnn_experiment_1',
153 |                 'batch_size': 256,
154 |                 'image_dim': 256,
155 |                 'epochs': 10,
156 |                 'lr': 0.001,
157 |                 'feature_type': FEATURE_TYPE_IMAGE
158 |             }
159 |     ###########################################################
160 |     Template for CNNMalware_Model2
161 |             {
162 |                 'model_name': 'CNNMalware_Model2',
163 |                 'experiment_name': 'cnn_experiment_1',
164 |                 'batch_size': 512,
165 |                 'image_dim': 0,
166 |                 'epochs': 10,
167 |                 'lr': 0.001,
168 |                 'feature_type': FEATURE_TYPE_IMAGE
169 |             }
170 |     ###########################################################
171 |     Template for CNNMalware_Model3
172 |             {
173 |                 'model_name': 'CNNMalware_Model3',
174 |                 'experiment_name': 'cnn_experiment_1',
175 |                 'batch_size': 512,
176 |                 'image_dim': 0,
177 |                 'epochs': 1,
178 |                 'lr': 0.001,
179 |                 'conv1d_image_dim_w': 1024 * 4,
180 |                 'feature_type': FEATURE_TYPE_IMAGE
181 |             }
182 |     ###########################################################
183 |     Template for CNNMalware_Model4
184 |         {
185 |             'model_name': 'CNNMalware_Model4',
186 |             'experiment_name': 'cnn_experiment_1',
187 |             'batch_size': 512,
188 |             'image_dim': 0,
189 |             'epochs': 15,
190 |             'lr': 0.001,
191 |             'conv1d_image_dim_w': 1024 * 4,
192 |             'c1_out': 64,
193 |             'c1_kernel': 32,
194 |             'c1_padding': 2,
195 |             'c1_stride': 2,
196 |             'c2_out': 32,
197 |             'c2_kernel': 8,
198 |             'c2_padding': 2,
199 |             'c2_stride': 2,
200 |             'feature_type': FEATURE_TYPE_IMAGE
201 |         }
202 |     ###########################################################
203 |     Template for CNNMalware_Model5
204 |         {
205 |             'model_name': 'CNNMalware_Model5',
206 |             'experiment_name': 'cnn_experiment_1',
207 |             'batch_size': 512,
208 |             'image_dim': 256,
209 |             'epochs': 15,
210 |             'lr': 0.001,
211 |             'opcode_len': 500,
212 |             'embedding_dim': 128,
213 |             'hidden_dim': 128,
214 |             'n_filters': 3,
215 |             'filter_sizes': [3, 6],
216 |             'dropout': 0,
217 |             'feature_type': FEATURE_TYPE_OPCODE
218 |         }
219 |     ###########################################################
220 |     """
221 |     simple_list = False  # NOTE: overrides the simple_list argument, so the grid below is always used
222 |     get_deep_feedforward_expr_grid = {
223 |         'model_name': ['CNNMalware_Model5'],
224 |         'batch_size': [256],
225 |         'epochs': [20],
226 |         'lr': [0.001],
227 |         'opcode_len': [5000],
228 |         'embedding_dim': [512],
229 |         'n_filters': [16, 32],
230 |         'filter_sizes': [[16, 24, 32], [32, 64, 128]],
231 |         'dropout': [0.3],
232 |         'feature_type': [FEATURE_TYPE_OPCODE]
233 |     }
234 | 
235 |     if not simple_list and print_grid:
236 |         print_line()
237 |         print('Experiments Grid')
238 |         print(get_deep_feedforward_expr_grid)
239 |         print_line()
240 | 
241 |     if simple_list:
242 |         get_deep_feedforward_expr_list = [
243 |             {'model_name': 'CNNMalware_Model2', 'batch_size': 256, 'image_dim': 256, 'epochs': 50, 'lr': 0.001,
244 |              'experiment_name': 'experiment_35', 'feature_type': FEATURE_TYPE_IMAGE},
245 |             {'model_name': 'CNNMalware_Model2', 'batch_size': 256, 'image_dim': 256, 'epochs': 50, 'lr': 0.0001,
246 |              'experiment_name': 'experiment_36', 'feature_type': FEATURE_TYPE_IMAGE},
247 |             {'model_name': 'CNNMalware_Model2', 'batch_size': 256, 'image_dim': 512, 'epochs': 50, 'lr': 0.001,
248 |              'experiment_name': 'experiment_37', 'feature_type': FEATURE_TYPE_IMAGE},
249 |             {'model_name': 'CNNMalware_Model2', 'batch_size': 256, 'image_dim': 512, 'epochs': 50, 'lr': 0.0001,
250 |              'experiment_name': 'experiment_38', 'feature_type': FEATURE_TYPE_IMAGE},
251 |             {'model_name': 'CNNMalware_Model2', 'batch_size': 64, 'image_dim': 1024, 'epochs': 20, 'lr': 0.001,
252 |              'experiment_name': 'experiment_39', 'feature_type': FEATURE_TYPE_IMAGE},
253 |             {'model_name': 'CNNMalware_Model2', 'batch_size': 64, 'image_dim': 1024, 'epochs': 20, 'lr': 0.0001,
254 |              'experiment_name': 'experiment_40', 'feature_type': FEATURE_TYPE_IMAGE}
255 |         ]
256 |         return get_deep_feedforward_expr_list
257 |     else:
258 |         keys, values = zip(*get_deep_feedforward_expr_grid.items())
259 |         permutations_dicts = []
260 |         count = 1
261 |         for v in itertools.product(*values):
262 |             temp_dict = dict(zip(keys, v))
263 |             temp_exp = 'experiment_' + str(count)
264 |             temp_dict['experiment_name'] = temp_exp
265 |             permutations_dicts.append(temp_dict)
266 |             count += 1
267 | 
268 |         return permutations_dicts
269 | 
270 | 
271 | def get_shallow_expr_list():
272 |     """list_shallow_expr = [
273 |         {
274 |             'experiment_name': 'XGB_experiment_1',
275 |             'model_name': 'XGB',
276 |             'param_grid': {
277 |                 'max_depth': [5, 15, 20, 50, 100],
278 |                 'learning_rate': [0.1, 0.01, 0.001],
279 |                 'n_estimators': list(range(10, 500, 50)),
280 |             }
281 |         },
282 |         {
283 |             'experiment_name': 'RandomForest_experiment_1',
284 |             'model_name': 'RandomForest',
285 |             'param_grid': {
286 |                 'criterion': ['gini', 'entropy'],
287 |                 'n_estimators': [10, 40, 100, 500],
288 |             }
289 |         },
290 |         {
291 |             'experiment_name': 'Knn_experiment_1',
292 |             'model_name': 'Knn',
293 |             'param_grid': {
294 |                 'n_neighbors': list(range(130, 190, 10)),
295 |                 'p': [1, 2],
296 |             }
297 |         },
298 |         {
299 |             'experiment_name': 'Knn_experiment_2',
300 |             'model_name': 'Knn',
301 |             'param_grid': {
302 |                 'n_neighbors': list(range(130, 190, 10)),
303 |                 'p': [2],
304 |             }
305 |         }
306 | 
307 |         {
308 |             'experiment_name': 'Knn_experiment_1',
309 |             'model_name': 'Knn',
310 |             'param_grid': {
311 |                 'n_neighbors': list(range(50, 160, 30)),
312 |                 'p': [1, 2, 3],
313 |                 'weights': ['uniform', 'distance'],
314 |                 'algorithm': ['ball_tree', 'kd_tree', 'brute']
315 | 
316 |             }
317 |         }
318 | 
319 |     ]"""
320 | 
321 |     list_shallow_expr = [
322 |         {
323 |             'experiment_name': 'XGB_experiment_1',
324 |             'model_name': 'XGB',
325 |             'param_grid': {
326 |                 'max_depth': [5, 15, 20, 50, 100],
327 |                 'learning_rate': [0.1, 0.01, 0.001],
328 |                 'n_estimators': list(range(10, 500, 50)),
329 |             }
330 |         },
331 |         {
332 |             'experiment_name': 'RandomForest_experiment_1',
333 |             'model_name': 'RandomForest',
334 |             'param_grid': {
335 |                 'criterion': ['gini', 'entropy'],
336 |                 'n_estimators': [10, 20, 30, 40, 100, 500],
337 |                 'max_features': ['sqrt', 'log2'],
338 |             }
339 |         },
340 |         {
341 |             'experiment_name': 'Knn_experiment_1',
342 |             'model_name': 'Knn',
343 |             'param_grid': {
344 |                 'n_neighbors': list(range(1, 170, 10)),
345 |                 'p': [1, 2, 3],
346 |                 'weights': ['uniform', 'distance'],
347 |                 'algorithm': ['ball_tree', 'kd_tree', 'brute']
348 |             }
349 |         }
350 |     ]
351 | 
352 |     return list_shallow_expr
353 | 
354 | 
355 | def get_conv_transfer_learning_expr_list():
356 |     list_tl_expr = [
357 |         {
358 |             'experiment_name': 'tl_experiment_1',
359 |             'model_name': 'vgg19',
360 |             'batch_size': 256,
361 |             'image_dim': 256,
362 |             'epochs': 20,
363 |             'lr': 0.0001
364 |         }
365 |     ]
366 | 
367 |     return list_tl_expr
368 | 
369 | 
370 | def get_malware_experiments_list(expr_type):
371 |     expr_list = None
372 |     if expr_type == DEEP_FF:
373 |         expr_list = get_deep_feedforward_expr_list()
374 |     if expr_type == DEEP_RNN:
375 |         expr_list = get_deep_rnn_expr_list()
376 |     if expr_type == SHALLOW_ML:
377 |         expr_list = get_shallow_expr_list()
378 | 
379 |     if expr_list is None:
380 |         raise Exception('Unknown experiment type')
381 |     else:
382 |         return expr_list
383 | 
384 | 
385 | def create_deep_image_model(model_params):
386 |     image_dim = model_params['image_dim']
387 |     num_of_classes = model_params['num_of_classes']
388 |     model_name = model_params['model_name']
389 | 
390 |     model = None
391 |     if model_name == 'CNNMalware_Model1':
392 |         if image_dim == 0:
393 |             raise Exception("CNNMalware_Model1 needs image_dim != 0")
394 |         model = CNNMalware_Model1(image_dim=image_dim, num_of_classes=num_of_classes).to(device)
395 |     if model_name == 'CNNMalware_Model2':
396 |         if image_dim == 0:
397 |             raise Exception("CNNMalware_Model2 needs image_dim != 0")
398 |         model = CNNMalware_Model2(image_dim=image_dim, num_of_classes=num_of_classes).to(device)
399 | 
400 |     if model_name == 'CNNMalware_Model3':
401 |         conv1d_image_dim_w = model_params['conv1d_image_dim_w']
402 |         if image_dim != 0:
403 |             raise Exception("CNNMalware_Model3 needs image_dim = 0")
404 |         model = CNNMalware_Model3(image_dim_w=conv1d_image_dim_w, num_of_classes=num_of_classes).to(device)
405 | 
406 |     if model_name == 'CNNMalware_Model4':
407 |         if image_dim != 0:
408 |             raise Exception("CNNMalware_Model4 needs image_dim = 0")
409 | 
410 |         conv1d_image_dim_w = model_params['conv1d_image_dim_w']
411 |         c1_out = model_params['c1_out']
412 |         c1_kernel = model_params['c1_kernel']
413 |         c1_padding = model_params['c1_padding']
414 |         c1_stride = model_params['c1_stride']
415 | 
416 |         c2_out = model_params['c2_out']
417 |         c2_kernel = model_params['c2_kernel']
418 |         c2_padding = model_params['c2_padding']
419 |         c2_stride = model_params['c2_stride']
420 | 
421 |         model = CNNMalware_Model4(image_dim_w=conv1d_image_dim_w, num_of_classes=num_of_classes,
422 |                                   c1_out=c1_out, c1_kernel=c1_kernel, c1_padding=c1_padding, c1_stride=c1_stride,
423 |                                   c2_out=c2_out, c2_kernel=c2_kernel, c2_padding=c2_padding, c2_stride=c2_stride,
424 |                                   ).to(device)
425 | 
426 |     if model_name == 'ANNMalware_Model1':
427 |         model = ANNMalware_Model1(image_dim=image_dim, num_of_classes=num_of_classes).to(device)
428 |     if model_name == 'ANNMalware_Model2':
429 |         model = ANNMalware_Model2(image_dim=image_dim, num_of_classes=num_of_classes).to(device)
430 | 
431 |     if model is None:
432 |         raise Exception("Unknown Image-based model name given")
433 |     return model
434 | 
435 | 
436 | def get_pretrained_image_dim(model_name):
437 |     if model_name == 'resnet152':
438 |         return 224
439 |     if model_name == 'vgg19':
440 |         return 224
441 |     raise Exception("Unknown model name given for tl")
442 | 
443 | 
444 | def create_deep_opcode_model(model_params):
445 |     input_dim = model_params['input_dim']
446 |     output_dim = model_params['output_dim']
447 |     embedding_dim = model_params['embedding_dim']
448 |     batch_size = model_params['batch_size']
449 |     num_of_classes = model_params['num_of_classes']
450 |     model_name = model_params['model_name']
451 | 
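452 |     # Optional hyper-parameters. dict.get() supplies a default so that a
453 |     # missing key cannot leave a name undefined (NameError) further down.
454 |     # The defaults are assumptions that mirror the sample template values.
455 |     hidden_dim = model_params.get('hidden_dim', 0)
456 |     bidirectional = model_params.get('bidirectional', False)
457 |     dropout = model_params.get('dropout', 0)
458 |     num_layers = model_params.get('num_layers', 1)
459 | 
460 |     # The recurrent models (RNN, LSTM, GRU, Stacked) use hidden_dim,
461 |     # num_layers, bidirectional and dropout. CNNMalware_Model5 ignores the
462 |     # recurrent ones and re-reads its own keys (n_filters, filter_sizes,
463 |     # dropout) inside its branch below.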
464 | 
465 |     model = None
466 |     if model_name == 'RNNMalware_Model1':
467 |         model = RNNMalware_Model1(input_dim=input_dim, embedding_dim=embedding_dim, hidden_dim=hidden_dim,
468 |                                   output_dim=output_dim, batch_size=batch_size,
469 |                                   num_layers=num_layers, bidirectional=bidirectional, dropout=dropout)
470 | 
471 |     if model_name == 'LSTMMalware_Model1':
472 |         model = LSTMMalware_Model1(input_dim=input_dim, embedding_dim=embedding_dim, hidden_dim=hidden_dim,
473 |                                    output_dim=output_dim, batch_size=batch_size,
474 |                                    num_layers=num_layers, bidirectional=bidirectional, dropout=dropout)
475 | 
476 |     if model_name == 'GRUMalware_Model1':
477 |         model = GRUMalware_Model1(input_dim=input_dim, embedding_dim=embedding_dim, hidden_dim=hidden_dim,
478 |                                   output_dim=output_dim, batch_size=batch_size,
479 |                                   num_layers=num_layers, bidirectional=bidirectional, dropout=dropout)
480 | 
481 |     if model_name == 'StackedMalware_Model1':
482 |         LG = model_params['LG']
483 |         model = StackedMalware_Model1(input_dim=input_dim, embedding_dim=embedding_dim, hidden_dim=hidden_dim,
484 |                                       output_dim=output_dim, batch_size=batch_size,
485 |                                       num_layers=num_layers, bidirectional=bidirectional, dropout=dropout, LG=LG)
486 | 
487 |     if model_name == 'CNNMalware_Model5':
488 |         feature_type = model_params['feature_type']
489 |         if feature_type != FEATURE_TYPE_OPCODE:
490 |             raise Exception("CNNMalware_Model5 needs Opcode features")
491 | 
492 |         input_dim = model_params['input_dim']
493 |         output_dim = model_params['output_dim']
494 |         embedding_dim = model_params['embedding_dim']
495 |         n_filters = model_params['n_filters']
496 |         filter_sizes = model_params['filter_sizes']
497 |         dropout = model_params['dropout']
498 | 
499 |         model = CNNMalware_Model5(input_dim=input_dim, embedding_dim=embedding_dim,
500 |                                   n_filters=n_filters, filter_sizes=filter_sizes,
501 |                                   output_dim=output_dim, dropout=dropout
502 |                                   ).to(device)
503 |     if model is None:
504 |         raise Exception("Unknown Opcode-based model name given")
505 | 
506 |     return model
507 | 
508 | 
509 | def create_conv_tl_model(model_params):
510 |     model = None
511 |     model_name = model_params['model_name']
512 |     print(f'Creating model {model_name}')
513 |     if model_name == 'resnet152':
514 |         model = Resnet152_wrapper(model_params)
515 |         print(model)
516 |     if model_name == 'vgg19':
517 |         model = VGG19_wrapper(model_params)
518 |         print(model)
519 |     if model is None:
520 |         raise Exception("Unknown Model name given")
521 |     return model
522 | 
523 | 
524 | def create_shallow_model(model_params=None):
525 |     model_name = model_params['model_name']
526 |     param_grid = model_params['param_grid']
527 |     model = None
528 |     gsc = None
529 | 
530 |     if model_name == 'XGB':
531 |         model = xgb.XGBClassifier(use_label_encoder=False)
532 |     if model_name == 'RandomForest':
533 |         model = RandomForestClassifier()
534 |     if model_name == 'Knn':
535 |         model = KNeighborsClassifier()
536 | 
537 |     if model is None:
538 |         raise Exception("Unknown Model name given")
539 |     gsc = GridSearchCV(estimator=model, param_grid=param_grid,
540 |                        cv=5, scoring='accuracy', verbose=1, n_jobs=-1, refit=True,
541 |                        return_train_score=True)
542 | 
543 |     return model, gsc
544 | 
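Note on the experiment grids above: both `get_deep_rnn_expr_list` and `get_deep_feedforward_expr_list` turn a dict of value lists into one parameter dict per combination using `itertools.product`. The following is a minimal, self-contained sketch of that expansion. The helper name `expand_grid` and the reduced grid are hypothetical and only illustrate the pattern.

```python
import itertools


def expand_grid(grid, name_prefix='experiment_'):
    """Return one params dict per combination of the grid's value lists."""
    keys, values = zip(*grid.items())
    experiments = []
    for count, combo in enumerate(itertools.product(*values), start=1):
        params = dict(zip(keys, combo))
        # number the experiments in the order the product yields them
        params['experiment_name'] = name_prefix + str(count)
        experiments.append(params)
    return experiments


grid = {
    'model_name': ['StackedMalware_Model1'],
    'hidden_dim': [256, 1024],
    'num_layers': [1, 3],
}
experiments = expand_grid(grid, name_prefix='rnn_experiment_')
print(len(experiments))  # 1 * 2 * 2 combinations
print(experiments[0])
```

The number of experiments is the product of the list lengths, so each extra value added to a key multiplies the total training time accordingly.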


--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
  1 | from config import *
  2 | import os
  3 | import pickle
  4 | import matplotlib.pyplot as plt
  5 | import sklearn.metrics as metrics
  6 | import seaborn as sns
  7 | import pandas as pd
  8 | import numpy as np
  9 | import torch.nn as nn
 10 | import sys
 11 | from sklearn.model_selection import GridSearchCV
 12 | import json
 13 | 
 14 | def save_model_results_to_log(model=None, model_params=None,
 15 |                               train_losses=None, train_accuracy=None,
 16 |                               predicted=None, ground_truth=None, best_params=None,
 17 |                               misc_data=None, log_dir=None):
 18 |     print('Saving model results', end='')
 19 |     experiment_name = model_params['experiment_name']
 20 |     model_name = model_params['model_name']
 21 |     num_of_classes = model_params['num_of_classes']
 22 |     class_names = model_params['class_names']
 23 | 
 24 |     model_log_dir = os.path.join(log_dir, experiment_name)
 25 |     os.makedirs(model_log_dir, exist_ok=True)
 26 |     model_log_file = os.path.join(model_log_dir, MODEL_INFO_LOG)
 27 |     model_train_losses_log_file = os.path.join(model_log_dir, MODEL_LOSS_INFO_LOG)
 28 |     model_train_accuracy_log_file = os.path.join(model_log_dir, MODEL_ACC_INFO_LOG)
 29 |     model_save_path = os.path.join(model_log_dir, model_name + '.pt')
 30 |     model_conf_mat_csv = os.path.join(model_log_dir, MODEL_CONF_MATRIX_CSV)
 31 |     model_conf_mat_png = os.path.join(model_log_dir, MODEL_CONF_MATRIX_PNG)
 32 |     model_conf_mat_normalized_csv = os.path.join(model_log_dir, MODEL_CONF_MATRIX_NORMALIZED_CSV)
 33 |     model_conf_mat_normalized_png = os.path.join(model_log_dir, MODEL_CONF_MATRIX_NORMALIZED_PNG)
 34 | 
 35 |     model_loss_png = os.path.join(model_log_dir, MODEL_LOSS_PNG)
 36 |     model_accuracy_png = os.path.join(model_log_dir, MODEL_ACCURACY_PNG)
 37 | 
 38 |     grid_cv_filepath = os.path.join(model_log_dir,GRID_CV_EXPERIMENT_RESULTS)
 39 |     print('.', end='')
 40 |     # generate and save confusion matrix
 41 |     plot_x_label = "Predictions"
 42 |     plot_y_label = "Actual"
 43 |     cmap = plt.cm.Blues
 44 |     # label set = classes seen in ground truth or predictions, matching sklearn's default labels
 45 |     pred_class_indexes = sorted(np.union1d(ground_truth, predicted))
 46 |     target_class_names = [class_names[i] for i in pred_class_indexes]
 47 | 
 48 |     cm = metrics.confusion_matrix(ground_truth, predicted)
 49 | 
 50 |     print('.', end='')
 51 |     df_confusion = pd.DataFrame(cm)
 52 |     df_confusion.index = target_class_names
 53 |     df_confusion.columns = target_class_names
 54 |     df_confusion = df_confusion.round(2)
 55 |     df_confusion.to_csv(model_conf_mat_csv)
 56 |     fig = plt.figure(figsize=(20, 20))
 57 |     sns.heatmap(df_confusion, annot=True, cmap=cmap)
 58 |     plt.xlabel(plot_x_label)
 59 |     plt.ylabel(plot_y_label)
 60 |     plt.title('Confusion Matrix')
 61 |     plt.savefig(model_conf_mat_png)
 62 |     plt.close(fig)
 63 | 
 64 |     print('.', end='')
 65 |     cm = metrics.confusion_matrix(ground_truth, predicted, normalize='all')
 66 |     df_confusion = pd.DataFrame(cm)
 67 |     df_confusion.index = target_class_names
 68 |     df_confusion.columns = target_class_names
 69 |     df_confusion = df_confusion.round(2)
 70 |     df_confusion.to_csv(model_conf_mat_normalized_csv)
 71 |     fig = plt.figure(figsize=(20, 20))
 72 |     sns.heatmap(df_confusion, annot=True, cmap=cmap)
 73 |     plt.xlabel(plot_x_label)
 74 |     plt.ylabel(plot_y_label)
 75 |     plt.title('Normalized Confusion Matrix')
 76 |     plt.savefig(model_conf_mat_normalized_png)
 77 |     plt.close(fig)
 78 | 
 79 |     if train_losses is not None:
 80 |         print('.', end='')
 81 |         fig = plt.figure(figsize=(8, 8))
 82 |         plt.plot(train_losses, label='Loss')
 83 |         plt.xlabel('Epoch')
 84 |         plt.ylabel('Loss')
 85 |         plt.title('Training Loss')
 86 |         plt.legend()
 87 |         plt.savefig(model_loss_png)
 88 |         plt.close(fig)
 89 | 
 90 |         print('.', end='')
 91 |         # save model training stats
 92 |         with open(model_train_losses_log_file, 'wb') as file:
 93 |             pickle.dump(train_losses, file)
 94 |             file.flush()
 95 | 
 96 |     if train_accuracy is not None:
 97 |         print('.', end='')
 98 |         fig = plt.figure(figsize=(8, 8))
 99 |         plt.plot(train_accuracy, label='Accuracy')
100 |         plt.xlabel('Epoch')
101 |         plt.ylabel('Accuracy')
102 |         plt.title('Training Accuracy')
103 |         plt.legend()
104 |         plt.savefig(model_accuracy_png)
105 |         plt.close(fig)
106 | 
107 |         print('.', end='')
108 |         with open(model_train_accuracy_log_file, 'wb') as file:
109 |             pickle.dump(train_accuracy, file)
110 |             file.flush()
111 | 
112 |     print('.', end='')
113 |     report = metrics.classification_report(ground_truth, predicted, target_names=list(target_class_names))
114 | 
115 |     if isinstance(model, GridSearchCV):
116 |         cv_df = pd.DataFrame(model.cv_results_)
117 |         cv_df.to_csv(grid_cv_filepath)
118 | 
119 |     # save model arch and params
120 |     with open(model_log_file, 'a') as file:
121 |         file.write('-' * LINE_LEN + '\n')
122 |         file.write('model architecture' + '\n')
123 |         file.write('-' * LINE_LEN + '\n')
124 |         file.write(str(model) + '\n')
125 |         file.write('-' * LINE_LEN + '\n')
126 |         file.write('model params' + '\n')
127 |         file.write('-' * LINE_LEN + '\n')
128 |         file.write(str(model_params) + '\n')
129 |         file.write('-' * LINE_LEN + '\n')
130 |         if not isinstance(model, nn.Module):
131 |             file.write('GridSearchCV results' + '\n')
132 |             if isinstance(model, GridSearchCV):
133 |                 file.write(str(model.cv_results_) + '\n')
134 |         file.write('-' * LINE_LEN + '\n')
135 | 
136 |         if misc_data:
137 |             file.write('misc data: ' + str(misc_data) + '\n')
138 |             file.write('-' * LINE_LEN + '\n')
139 | 
140 |         if best_params is not None:
141 |             file.write('best params of the grid search' + '\n')
142 |             file.write('-' * LINE_LEN + '\n')
143 |             file.write(str(best_params) + '\n')
144 |             file.write('-' * LINE_LEN + '\n')
145 | 
146 |         file.write('classification report' + '\n')
147 |         file.write('-' * LINE_LEN + '\n')
148 |         file.write(report + '\n')
149 |         file.write('-' * LINE_LEN + '\n')
150 |         file.flush()
151 | 
152 |     print('.', end='')
153 |     # save model as pytorch state dict
154 |     if isinstance(model, nn.Module):
155 |         torch.save(model.state_dict(), model_save_path)
156 |     else:
157 |         with open(model_save_path, "wb") as file:  # pickle non-PyTorch models
158 |             pickle.dump(model, file)
159 | 
160 |     print('Done')
161 |     sys.stdout.flush()
162 | 
163 | 
164 | def save_models_metadata_to_log(list_of_model_params, LOG_DIR, logfile=MODEL_META_INFO_LOG):
165 |     logfile = os.path.join(LOG_DIR, logfile)
166 |     with open(logfile, 'a') as file:
167 |         file.write('-' * LINE_LEN + '\n')
168 |         for i in list_of_model_params:
169 |             file.write(str(i) + '\n')
170 |         file.write('-' * LINE_LEN + '\n')
171 |         file.flush()
172 | 
173 | 
174 | def print_line(print_len=LINE_LEN):
175 |     print('-' * print_len)
176 | 
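Note on the confusion matrix in `save_model_results_to_log`: the matrix has one row and one column for every label that occurs in either the ground truth or the predictions, so the list of class names used as its index has to cover that union. The following is a minimal sketch of the idea in plain Python, used here as a stand-in for `sklearn.metrics.confusion_matrix`. The helper name and the family names are hypothetical.

```python
def confusion_matrix_with_names(ground_truth, predicted, class_names):
    """Build a confusion matrix over the union of observed labels."""
    labels = sorted(set(ground_truth) | set(predicted))
    position = {label: pos for pos, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, pred in zip(ground_truth, predicted):
        # rows are actual classes, columns are predicted classes
        matrix[position[actual]][position[pred]] += 1
    names = [class_names[label] for label in labels]
    return names, matrix


class_names = ['family_a', 'family_b', 'family_c']
ground_truth = [0, 1, 2, 2]
predicted = [0, 1, 1, 1]  # class 2 is never predicted
names, matrix = confusion_matrix_with_names(ground_truth, predicted, class_names)
print(names)
for row in matrix:
    print(row)
```

Here `family_c` still gets a row and a column even though the model never predicts it, which keeps the name list and the matrix the same size.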


--------------------------------------------------------------------------------