├── .gitignore ├── README.md ├── config.py ├── data_preprocess.py ├── data_utils ├── __init__.py ├── bin_to_img.py ├── data_loaders.py ├── extract_data.py ├── extract_opcode.py ├── extract_pe_features.py └── misc.py ├── detect_malware.py ├── models ├── AnnModels.py ├── CnnModels.py ├── RnnModels.py ├── Shallow_ML_models.py ├── TransferLearnModels.py ├── __init__.py ├── model_trainers_testers.py └── models_utils.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | models/__pycache__ 3 | data_utils/__pycache__ 4 | data/ 5 | logs/ 6 | .idea/ 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Malware Classification using classical Machine Learning and Deep Learning 2 | 3 | This repository is the official implementation of the research mentioned in the chapter **"An Empirical Analysis of Image-Based Learning Techniques for Malware Classification"** of the Book **"Malware Analysis Using Artificial Intelligence and Deep Learning"** 4 | 5 | The book or chapters can be purchased from: https://link.springer.com/chapter/10.1007/978-3-030-62582-5_16 6 | 7 | The arXiv eprint is at: https://arxiv.org/abs/2103.13827 8 | 9 | ![alt text](https://media.springernature.com/w306/springer-static/cover-hires/book/978-3-030-62582-5) 10 | 11 | 12 | ### Abstract 13 | 14 | In this chapter, we consider malware classification using deep learning techniques and image-based features. We employ a wide variety of deep learning techniques, including multilayer perceptrons (MLP), convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU). Among our CNN experiments, transfer learning plays a prominent role—specifically, we test the VGG-19 and ResNet152 models. 
As compared to previous work, the results presented in this chapter are based on a larger and more diverse malware dataset, we consider a wider array of features, and we experiment with a much greater variety of learning techniques. Consequently, our results are the most comprehensive and complete that have yet been published. 15 | 16 | ### Quick Notes: 17 | 18 | * Classic ML-based approaches tried: K-NN, Random Forest, and XGBoost 19 | * Deep Learning-based approaches tried: ANN, CNN, LSTM, and GRU 20 | * The implementation uses sklearn, numpy, pandas, and PyTorch. 21 | * MS Windows executable binary files are used as data. 22 | * Features 23 |   * Classic ML-based approaches: PE file features are extracted and used 24 |   * Deep Learning-based approaches: (1) Opcodes (2) Executables converted into gray-scale images 25 | * This project is an extension of https://github.com/pratikpv/malware_classification 26 | 27 | ### Steps to reproduce 28 | 29 | # Package requirements 30 | 31 | * Install the `pefile` Python package, e.g. `conda install pefile` 32 | * Install PyTorch and other libs, e.g. `conda install -c pytorch torchtext`. All other common dependencies should be covered by the Anaconda distribution. 33 | * `objdump` on Ubuntu. (This code was developed and tested in an Ubuntu-based development environment.) 34 | 35 | # Malware samples 36 | 37 |  * Copy the malware samples to /data/exec_files/exec_files. You can reach out to me for samples used in this research. 
The overall directory structure should look like this: 38 | 39 | ``` 40 | ├── config.py 41 | ├── data 42 | │             ├── exec_files 43 | │             │             └── exec_files 44 | │             │                 ├── adload 45 | │             │                 ├── agent 46 | │             │                 ├── alureon 47 | │             │                 ├── bho 48 | │             │                 ├── ceeinject 49 | │             │                 ├── cycbot 50 | │             │                 ├── delfinject 51 | │             │                 └── fakerean 52 | ├── data_preprocess.py 53 | ├── data_utils 54 | . 55 | . 56 | ``` 57 | 58 | # Data preprocessing 59 | 60 | Execute `data_preprocess.py` with the options below to preprocess the data. 61 | 62 | `python data_preprocess.py --extract_pe_features` 63 | 64 | `python data_preprocess.py --bin_to_img` 65 | 66 | `python data_preprocess.py --extract_opcodes` 67 | 68 | `python data_preprocess.py --split_opcodes` 69 | 70 | 71 | # Train and test models 72 | Execute `detect_malware.py` with the appropriate command-line args to train and test models, e.g. 73 | 74 | `python detect_malware.py --deep_feedforward` 75 | 76 | `python detect_malware.py --deep_rnn` 77 | 78 | `python detect_malware.py --shallow_ml` 79 | 80 | `python detect_malware.py --transfer_conv_ml` 81 | 82 | # Dataset 83 | 84 | Apply for access here: https://forms.gle/65SNHJpQ7U4TYkCU7 85 | 86 | ### If you like our work and it is useful for your research, please cite this chapter/paper as: 87 | ``` 88 | Prajapati P., Stamp M. (2021) An Empirical Analysis of Image-Based Learning Techniques for Malware Classification. In: Stamp M., Alazab M., Shalaginov A. (eds) Malware Analysis Using Artificial Intelligence and Deep Learning. Springer, Cham. 
https://doi.org/10.1007/978-3-030-62582-5_16 89 | ``` 90 | or 91 | ``` 92 | @Inbook{ 93 |     Prajapati2021, 94 |     author={Prajapati, Pratikkumar and Stamp, Mark}, 95 |     editor={Stamp, Mark and Alazab, Mamoun  and Shalaginov, Andrii}, 96 |     title={An Empirical Analysis of Image-Based Learning Techniques for Malware Classification}, 97 |     bookTitle={Malware Analysis Using Artificial Intelligence and Deep Learning}, 98 |     year={2021}, 99 |     publisher={Springer International Publishing}, 100 |     address={Cham}, 101 |     pages={411-435}, 102 |     abstract={In this chapter, we consider malware classification using deep learning techniques and image-based features. We employ a wide variety of deep learning techniques, including multilayer perceptrons (MLP), convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU). Among our CNN experiments, transfer learning plays a prominent role---specifically, we test the VGG-19 and ResNet152 models. As compared to previous work, the results presented in this chapter are based on a larger and more diverse malware dataset, we consider a wider array of features, and we experiment with a much greater variety of learning techniques. Consequently, our results are the most comprehensive and complete that have yet been published.}, 103 |     isbn={978-3-030-62582-5}, 104 |     doi={10.1007/978-3-030-62582-5_16}, 105 |     url={https://doi.org/10.1007/978-3-030-62582-5_16} 106 | } 107 | ``` 108 | -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import multiprocessing 4 | 5 | # original dataset i.e. 
binary executable files 6 | ORG_DATASET_ROOT_PATH = os.path.join('data', 'exec_files') 7 | ORG_DATASET_DIR_NAME = 'org_dataset' 8 | ORG_DATASET_PATH = os.path.join('data', 'exec_files', ORG_DATASET_DIR_NAME) 9 | supported_image_dims = [0, 1, 64, 128, 256, 512, 1024] 10 | 11 | # opcodes from binary executables 12 | # -1: the entire opcode len of the binary, else the specific opcode len 13 | supported_opcode_lens = [-1, 10, 20, 50, 100, 500, 1000, 2000, 5000] 14 | ORG_DATASET_OPCODES_PATH = os.path.join('data', 'exec_files', 'org_dataset_opcodes') 15 | # PE features from binary executables 16 | ORG_DATASET_PE_FEATURES_CSV = os.path.join('data', 'org_malware_dataset_pe_features.csv') 17 | 18 | # count files for ORG_DATASET_PATH 19 | ORG_DATASET_COUNT_CSV = os.path.join('data', 'org_malware_dataset_count.csv') 20 | # count files for ORG_DATASET_PATH_IMAGE* 21 | ORG_DATASET_COUNT_IMAGES_CSV = os.path.join('data', 'org_malware_dataset_count_images.csv') 22 | # count files for ORG_DATASET_PE_FEATURES_CSV 23 | ORG_DATASET_COUNT_PE_FEATURES_CSV = os.path.join('data', 'org_malware_dataset_count_pe_features.csv') 24 | # count files for ORG_DATASET_OPCODES_PATH 25 | ORG_DATASET_COUNT_OPCODES_PATH = os.path.join('data', 'org_malware_dataset_count_opcodes.csv') 26 | 27 | use_cuda = torch.cuda.is_available() 28 | device = torch.device("cuda" if use_cuda else "cpu") 29 | LINE_LEN = 80 30 | LOG_MASTER_DIR = 'logs' 31 | MODEL_INFO_LOG = 'model_info_and_results.log' 32 | MODEL_META_INFO_LOG = 'models_meta_info.log' 33 | MODEL_LOSS_INFO_LOG = 'model_losses.log' 34 | MODEL_ACC_INFO_LOG = 'model_acc.log' 35 | MODEL_CONF_MATRIX_CSV = 'confusion_matrix.csv' 36 | MODEL_CONF_MATRIX_PNG = 'confusion_matrix.png' 37 | MODEL_CONF_MATRIX_NORMALIZED_CSV = 'confusion_matrix_normalized.csv' 38 | MODEL_CONF_MATRIX_NORMALIZED_PNG = 'confusion_matrix_normalized.png' 39 | MODEL_ACCURACY_PNG = 'model_accuracy.png' 40 | MODEL_LOSS_PNG = 'model_loss.png' 41 | EXPERIMENT_RESULTS = 'experiments_results.csv' 
42 | GRID_CV_EXPERIMENT_RESULTS = 'grid_cv_experiments_results.csv' 43 | CPU_COUNT = multiprocessing.cpu_count() 44 | 45 | DEEP_FF = 'deep_feed_forward' 46 | DEEP_RNN = 'rnn' 47 | SHALLOW_ML = 'shallow_ml' 48 | FEATURE_TYPE_IMAGE = 'image' 49 | FEATURE_TYPE_OPCODE = 'opcode' 50 | -------------------------------------------------------------------------------- /data_preprocess.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | from config import * 4 | from data_utils.extract_pe_features import * 5 | from data_utils.bin_to_img import * 6 | from data_utils.extract_opcode import * 7 | from data_utils.misc import * 8 | from data_utils.data_loaders import * 9 | 10 | 11 | def main(): 12 | max_files = 0 # set 0 to process all files or set a specific number 13 | 14 | if args.extract_pe_features: 15 | extract_pe_features(ORG_DATASET_PE_FEATURES_CSV, ORG_DATASET_COUNT_PE_FEATURES_CSV, ORG_DATASET_PATH, 16 | max_files=max_files) 17 | 18 | if args.bin_to_img: 19 | list_of_widths = [0, 1, 64, 128, 256, 512, 1024] 20 | for width in list_of_widths: 21 | convert_bin_to_img(ORG_DATASET_PATH, width, max_files=max_files) 22 | 23 | if args.extract_opcodes: 24 | process_opcodes_bulk(ORG_DATASET_PATH, max_files=max_files) 25 | 26 | if args.count_samples: 27 | count_dataset(ORG_DATASET_PATH, ORG_DATASET_COUNT_CSV) 28 | count_dataset(ORG_DATASET_OPCODES_PATH, ORG_DATASET_COUNT_OPCODES_PATH) 29 | count_dataset(get_image_datapath(image_dim=256), ORG_DATASET_COUNT_IMAGES_CSV) 30 | 31 | if args.split_opcodes: 32 | list_of_opcode_lens = [10, 20, 50, 100, 500, 1000, 2000, 5000] 33 | for opcode_len in list_of_opcode_lens: 34 | process_split_opcodes(ORG_DATASET_OPCODES_PATH, opcode_len=opcode_len) 35 | 36 | if args.latex_format: 37 | # tuple -> log_date_dir , experiment 38 | data_list = [("25-May-2020_22_44_37", "experiment_14"), 39 | ("13-Jun-2020_16_49_09", "experiment_29"), 40 | ("09-Jun-2020_20_42_39", 
"conv1d_experiment_65"), 41 | ("14-Jun-2020_09_03_12", "experiment_18"), 42 | ("06-Jun-2020_22_13_17", "rnn_experiment_22"), 43 | ("06-Jun-2020_22_13_17", "rnn_experiment_46"), 44 | ("13-Jun-2020_21_04_18", "tl_experiment_1"), 45 | ("12-Jun-2020_21_54_44", "XGB_experiment_1"), 46 | ("12-Jun-2020_21_54_44", "Knn_experiment_1"), 47 | ("12-Jun-2020_21_54_44", "RandomForest_experiment_1") 48 | ] 49 | process_cf_for_latex(data_list) 50 | 51 | 52 | if __name__ == "__main__": 53 | parser = argparse.ArgumentParser(description='Process the Malware data') 54 | 55 | parser.add_argument('--extract_pe_features', action='store_true', help='Extract features from PE format', 56 | default=False) 57 | parser.add_argument('--bin_to_img', action='store_true', help='Generate image files from malware binaries', 58 | default=False) 59 | parser.add_argument('--extract_opcodes', action='store_true', help='Extract opcodes from malware binaries', 60 | default=False) 61 | parser.add_argument('--count_samples', action='store_true', help='Count all sample files for all experiments', 62 | default=False) 63 | parser.add_argument('--split_opcodes', action='store_true', help='split opcodes into train-test for TorchText', 64 | default=False) 65 | 66 | parser.add_argument('--latex_format', action='store_true', help='Normalize Conf. 
matrix and save for latex', 67 | default=False) 68 | 69 | args = parser.parse_args() 70 | 71 | if len(sys.argv) < 2: 72 | parser.print_usage() 73 | sys.exit(1) 74 | 75 | main() 76 | -------------------------------------------------------------------------------- /data_utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pratikpv/malware_detect2/12f4390f1c4a7a3e8b6bc355cc06f08f4dd3126d/data_utils/__init__.py -------------------------------------------------------------------------------- /data_utils/bin_to_img.py: -------------------------------------------------------------------------------- 1 | import multiprocessing 2 | from tqdm import tqdm 3 | import numpy as np 4 | import imageio 5 | import array 6 | import os 7 | 8 | 9 | def generate_and_save_image(input_filename, output_filename, width): 10 | f = open(input_filename, 'rb') 11 | ln = os.path.getsize(input_filename) # length of file in bytes 12 | if width == 0: 13 | width = ln 14 | if ln == 0: 15 | return 16 | rem = ln % width 17 | a = array.array("B") # uint8 array 18 | a.fromfile(f, ln - rem) 19 | f.close() 20 | g = np.reshape(a, (len(a) // width, width)) 21 | g = np.uint8(g) 22 | imageio.imwrite(output_filename, g) # save the image 23 | 24 | 25 | def convert_bin_to_img(input_dir, width, max_files=0): 26 | output_dir = input_dir + '_width_' + str(width) 27 | if not os.path.isdir(output_dir): 28 | os.mkdir(output_dir) 29 | 30 | list_dirs = os.listdir(input_dir) 31 | with multiprocessing.Pool(multiprocessing.cpu_count()) as pool: 32 | 33 | jobs = [] 34 | results = [] 35 | total_count = 0 36 | 37 | for dirname in list_dirs: 38 | list_files = os.listdir(os.path.join(input_dir, dirname)) 39 | count = 0 40 | for filename in list_files: 41 | input_filename = os.path.join(input_dir, dirname, filename) 42 | try: 43 | output_filename = os.path.splitext(os.path.basename(input_filename))[0] + '.png' 44 | output_class_dir = 
os.path.join(output_dir, dirname) 45 | if not os.path.isdir(output_class_dir): 46 | os.mkdir(output_class_dir) 47 | output_filename = os.path.join(output_dir, dirname, output_filename) 48 | 49 | jobs.append( 50 | pool.apply_async(generate_and_save_image, (input_filename, output_filename, width))) 51 | count += 1 52 | if max_files > 0 and max_files == count: 53 | break 54 | except: 55 | print('Ignoring ', filename) 56 | 57 | total_count += count 58 | tqdm_desc = 'Converting Malware bins to images for width ' + str(width) 59 | for job in tqdm(jobs, desc=tqdm_desc): 60 | results.append(job.get()) 61 | -------------------------------------------------------------------------------- /data_utils/data_loaders.py: -------------------------------------------------------------------------------- 1 | import random 2 | from torch.utils.data import DataLoader 3 | import torchvision 4 | import torchtext 5 | from config import * 6 | import numpy as np 7 | 8 | 9 | def get_image_datapath(image_dim, check_exist=True): 10 | if image_dim not in supported_image_dims: 11 | raise Exception("Unknown Image dim given") 12 | 13 | dir_name = os.path.join(ORG_DATASET_ROOT_PATH, ORG_DATASET_DIR_NAME + '_width_' + str(image_dim)) 14 | 15 | if not check_exist: 16 | return dir_name 17 | 18 | if os.path.isdir(dir_name): 19 | return dir_name 20 | print(f'dir_name = {dir_name}') 21 | raise Exception("Data dir for Image dim {image_dim} missing".format(image_dim=image_dim)) 22 | 23 | 24 | def get_opcode_datapath(opcode_len, check_exist=True): 25 | if check_exist: 26 | if opcode_len not in supported_opcode_lens: 27 | raise Exception("Unknown opcode len given") 28 | 29 | opcode_len_str = '_' + str(opcode_len) 30 | if opcode_len == -1: 31 | opcode_len_str = '' 32 | 33 | train_split_json = os.path.join('org_dataset_opcodes_train' + opcode_len_str + '.json') 34 | test_split_json = os.path.join('org_dataset_opcodes_test' + opcode_len_str + '.json') 35 | combined_path_for_json = 'org_dataset_opcodes_split' + 
opcode_len_str 36 | exist = False 37 | 38 | if os.path.isfile(os.path.join(ORG_DATASET_ROOT_PATH, train_split_json)) and \ 39 | os.path.isfile(os.path.join(ORG_DATASET_ROOT_PATH, test_split_json)) and \ 40 | os.path.isdir(os.path.join(ORG_DATASET_ROOT_PATH, combined_path_for_json)): 41 | exist = True 42 | 43 | data_path = dict() 44 | data_path['train_split_json'] = train_split_json 45 | data_path['test_split_json'] = test_split_json 46 | data_path['combined_path_for_json'] = combined_path_for_json 47 | 48 | if not check_exist: 49 | return data_path 50 | if check_exist and exist: 51 | return data_path 52 | 53 | print(f'train_split_json : {train_split_json}') 54 | print(f'test_split_json : {test_split_json}') 55 | raise Exception("Data dir for opcode len {opcode_len} missing".format(opcode_len=opcode_len)) 56 | 57 | 58 | def get_image_data_loaders(data_path=None, image_dim=64, train_split=0.8, batch_size=256, 59 | convert_to_rgb=False, pretrained_image_dim=64, conv1d_image_dim_w=1024): 60 | workers_count = min(int(CPU_COUNT * 0.80), batch_size) 61 | 62 | image_dim_h = image_dim 63 | image_dim_w = image_dim 64 | 65 | if image_dim == 0: 66 | # Conv1D case 67 | image_dim_h = 1 68 | image_dim_w = conv1d_image_dim_w 69 | 70 | transform = None 71 | if convert_to_rgb: 72 | transform = torchvision.transforms.Compose([ 73 | torchvision.transforms.Resize((pretrained_image_dim, pretrained_image_dim)), 74 | torchvision.transforms.Lambda(lambda image: image.convert('RGB')), 75 | torchvision.transforms.ToTensor() 76 | ]) 77 | else: 78 | transform = torchvision.transforms.Compose([ 79 | torchvision.transforms.Grayscale(num_output_channels=1), 80 | torchvision.transforms.Resize((image_dim_h, image_dim_w)), 81 | torchvision.transforms.ToTensor() 82 | ]) 83 | 84 | dataset = torchvision.datasets.ImageFolder(data_path, transform=transform) 85 | dataset_len = len(dataset) 86 | # dataset_len = 1000 87 | indices = list(range(dataset_len)) 88 | 89 | random.shuffle(indices) 90 | split = 
int(np.floor(train_split * dataset_len)) 91 | 92 | train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, 93 | sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]), 94 | num_workers=workers_count) 95 | 96 | val_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, 97 | sampler=torch.utils.data.sampler.SubsetRandomSampler( 98 | indices[split:dataset_len]), 99 | num_workers=workers_count) 100 | 101 | train_set_len = len(train_loader) * batch_size 102 | val_set_len = len(val_loader) * batch_size 103 | class_names = dataset.classes 104 | num_of_classes = len(dataset.classes) 105 | 106 | return train_loader, val_loader, dataset_len, class_names 107 | 108 | 109 | def get_opcode_data_loaders(data_path=None, opcode_len=500, batch_size=512): 110 | TEXT = torchtext.legacy.data.Field() 111 | LABEL = torchtext.legacy.data.LabelField() 112 | 113 | fields = {'text': ('text', TEXT), 'label': ('label', LABEL)} 114 | 115 | train_data, test_data = torchtext.legacy.data.TabularDataset.splits( 116 | path=ORG_DATASET_ROOT_PATH, 117 | train=data_path['train_split_json'], 118 | test=data_path['test_split_json'], 119 | format='json', 120 | fields=fields 121 | ) 122 | TEXT.build_vocab(train_data) 123 | LABEL.build_vocab(train_data) 124 | 125 | train_iterator, test_iterator = torchtext.legacy.data.BucketIterator.splits((train_data, test_data), 126 | batch_size=batch_size, 127 | sort=False, 128 | shuffle=True, 129 | repeat=False, 130 | device='cpu') 131 | 132 | dataset_len = len(train_data) + len(test_data) 133 | class_names = list(LABEL.vocab.stoi.keys()) 134 | text_vocal_len = len(TEXT.vocab) 135 | label_vocab_len = len(LABEL.vocab) 136 | 137 | pad_idx = TEXT.vocab.stoi[TEXT.pad_token] 138 | 139 | return train_iterator, test_iterator, dataset_len, class_names, text_vocal_len, label_vocab_len, pad_idx 140 | -------------------------------------------------------------------------------- /data_utils/extract_data.py: 
-------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import sqlalchemy 3 | 4 | DATA_CSV_FILENAME = 'malware_data_raw.csv' 5 | 6 | 7 | def save_tables_to_csv(): 8 | engine = sqlalchemy.create_engine('mysql+pymysql://admin:Pratik@123@localhost:3306/new_schema') 9 | df = pd.read_sql_table('FILES', engine) 10 | df.to_csv(DATA_CSV_FILENAME) 11 | return df 12 | 13 | 14 | def main(): 15 | # save_tables_to_csv() 16 | df = pd.read_csv(DATA_CSV_FILENAME) 17 | print(df.keys()) 18 | df = df[['SHA1', 'FAMILY']] 19 | print(df.groupby('FAMILY').count()) 20 | 21 | 22 | if __name__ == "__main__": 23 | main() 24 | -------------------------------------------------------------------------------- /data_utils/extract_opcode.py: -------------------------------------------------------------------------------- 1 | from tqdm import tqdm 2 | import multiprocessing 3 | from config import * 4 | from subprocess import Popen, PIPE 5 | import random 6 | import json 7 | from data_utils.data_loaders import * 8 | 9 | 10 | def generate_opcode(bin_filename, text_filename, debug=False): 11 | list_of_cmd_args = [ 12 | ['objdump', '-j', '.text', '-D', bin_filename], 13 | ['objdump', '-j', 'CODE', '-D', bin_filename], 14 | ['objdump', '-d', bin_filename] 15 | ] 16 | 17 | got_opcode = False 18 | asm_code = '' 19 | 20 | for cmd_num, cmd_args in enumerate(list_of_cmd_args): 21 | try: 22 | if debug: 23 | print(f'cmd_num = {cmd_num}') 24 | print(f'cmd_args = {" ".join(cmd_args)}') 25 | process = Popen(cmd_args, stdout=PIPE, stderr=PIPE) 26 | p_out, p_err = process.communicate() 27 | if debug: 28 | print(p_out) 29 | asm_code = str(p_out).split('\\n') 30 | except ValueError: 31 | got_opcode = False 32 | else: 33 | if len(asm_code) > 5: 34 | got_opcode = True 35 | 36 | if got_opcode: 37 | break 38 | 39 | if got_opcode: 40 | with open(text_filename, 'w+') as f: 41 | for line in asm_code: 42 | line = line.split('\\t') 43 | if len(line) > 2: 44 | opcode_line = 
line[2] 45 | opcode_line = opcode_line.split(' ') 46 | if len(opcode_line) > 0: 47 | f.write(opcode_line[0]) 48 | f.write('\n') 49 | 50 | else: 51 | # TODO some files are empty. check generate_opcode 52 | if debug: 53 | print(f'Giving up on {bin_filename}') 54 | 55 | 56 | def process_opcodes_bulk(input_dir, output_dir=ORG_DATASET_OPCODES_PATH, max_files=0): 57 | if not os.path.isdir(output_dir): 58 | os.mkdir(output_dir) 59 | 60 | list_dirs = os.listdir(input_dir) 61 | with multiprocessing.Pool(multiprocessing.cpu_count()) as pool: 62 | 63 | jobs = [] 64 | results = [] 65 | total_count = 0 66 | 67 | for dirname in list_dirs: 68 | list_files = os.listdir(os.path.join(input_dir, dirname)) 69 | count = 0 70 | for filename in list_files: 71 | input_filename = os.path.join(input_dir, dirname, filename) 72 | try: 73 | output_filename = os.path.splitext(os.path.basename(input_filename))[0] + '.txt' 74 | output_class_dir = os.path.join(output_dir, dirname) 75 | if not os.path.isdir(output_class_dir): 76 | os.mkdir(output_class_dir) 77 | output_filename = os.path.join(output_dir, dirname, output_filename) 78 | 79 | jobs.append( 80 | pool.apply_async(generate_opcode, (input_filename, output_filename))) 81 | count += 1 82 | if max_files > 0 and max_files == count: 83 | break 84 | except: 85 | print('Ignoring ', filename) 86 | 87 | total_count += count 88 | tqdm_desc = 'Extracting opcodes from Malware bins' 89 | for job in tqdm(jobs, desc=tqdm_desc): 90 | results.append(job.get()) 91 | 92 | 93 | def get_jason_filename(combined_path_for_json, key, class_name): 94 | return os.path.join(combined_path_for_json, class_name, key + '_' + class_name + '.json') 95 | 96 | 97 | def split_opcodes(input_class_dir, opcode_len=-1, train_split=0.8): 98 | """ 99 | input_class_dir contains opcode json file. 
100 | Create train and test split for this json file 101 | """ 102 | opcode_path = get_opcode_datapath(opcode_len, check_exist=False) 103 | combined_path_for_json = opcode_path['combined_path_for_json'] 104 | 105 | list_files = os.listdir(input_class_dir) 106 | class_name = os.path.basename(input_class_dir) 107 | random.shuffle(list_files) 108 | total_samples = len(list_files) 109 | train_size = int(total_samples * train_split) 110 | test_size = total_samples - train_size 111 | 112 | train_files = list_files[0:train_size] 113 | test_files = list_files[train_size:] 114 | 115 | samples = {'train': train_files, 'test': test_files} 116 | 117 | for key in samples.keys(): 118 | sample_json = [] 119 | all_filenames = samples[key] 120 | for filename in all_filenames: 121 | input_filename = os.path.join(input_class_dir, filename) 122 | opcodes = [] 123 | with open(input_filename, 'r') as f: 124 | opcodes = f.read().splitlines() 125 | 126 | # print(opcodes) 127 | if opcode_len != -1: 128 | opcodes = opcodes[0:opcode_len] 129 | if len(opcodes) < 1: 130 | continue 131 | 132 | dir_name = os.path.join(ORG_DATASET_ROOT_PATH, combined_path_for_json) 133 | os.makedirs(os.path.join(dir_name, class_name), exist_ok=True) 134 | out_json_file = get_jason_filename(dir_name, key, class_name) 135 | with open(out_json_file, 'a') as outfile: 136 | file_dict = {'text': opcodes, 'label': class_name} 137 | json.dump(file_dict, outfile) 138 | outfile.write('\n') 139 | 140 | 141 | def process_split_opcodes(root_datapath, opcode_len=-1): 142 | opcode_path = get_opcode_datapath(opcode_len, check_exist=False) 143 | train_split_json = os.path.join(ORG_DATASET_ROOT_PATH, opcode_path['train_split_json']) 144 | test_split_json = os.path.join(ORG_DATASET_ROOT_PATH, opcode_path['test_split_json']) 145 | combined_path_for_json = opcode_path['combined_path_for_json'] 146 | 147 | classes = os.listdir(root_datapath) 148 | # 149 | with multiprocessing.Pool(multiprocessing.cpu_count()) as pool: 150 | jobs = [] 
151 | results = [] 152 | for class_name in classes: 153 | class_dir = os.path.join(root_datapath, class_name) 154 | jobs.append(pool.apply_async(split_opcodes, (class_dir, opcode_len,))) 155 | 156 | for job in tqdm(jobs, desc="Generating opcodes with json and splitting for opcode len = {opcode_len}".format( 157 | opcode_len=opcode_len)): 158 | results.append(job.get()) 159 | 160 | print(f'Merging all training and testing classes') 161 | samples = {'train': train_split_json, 162 | 'test': test_split_json} 163 | 164 | for key in samples.keys(): 165 | out_json_file = samples[key] 166 | with open(out_json_file, 'a') as outfile: 167 | for class_name in classes: 168 | json_file = os.path.join(ORG_DATASET_ROOT_PATH, 169 | get_jason_filename(combined_path_for_json, key, class_name)) 170 | print(f'loading {json_file}') 171 | count = 0 172 | with open(json_file, 'r') as fp: 173 | while True: 174 | line = fp.readline() 175 | if not line: 176 | break 177 | count += 1 178 | try: 179 | # write only right formated lines 180 | json.loads(line) 181 | except: 182 | print(f'Error in {json_file} at count {count}') 183 | else: 184 | outfile.write(line) 185 | 186 | print(f'Merged all training and testing classes {samples}') 187 | -------------------------------------------------------------------------------- /data_utils/extract_pe_features.py: -------------------------------------------------------------------------------- 1 | import multiprocessing 2 | import pefile 3 | import os 4 | import hashlib 5 | import array 6 | import math 7 | from tqdm import tqdm 8 | import numpy as np 9 | import imageio 10 | 11 | 12 | def get_md5(fname): 13 | hash_md5 = hashlib.md5() 14 | with open(fname, "rb") as f: 15 | for chunk in iter(lambda: f.read(4096), b""): 16 | hash_md5.update(chunk) 17 | return hash_md5.hexdigest() 18 | 19 | 20 | def get_entropy(data): 21 | if len(data) == 0: 22 | return 0.0 23 | occurences = array.array('L', [0] * 256) 24 | for x in data: 25 | occurences[x if isinstance(x, int) 
else ord(x)] += 1 26 | 27 | entropy = 0 28 | for x in occurences: 29 | if x: 30 | p_x = float(x) / len(data) 31 | entropy -= p_x * math.log(p_x, 2) 32 | 33 | return entropy 34 | 35 | 36 | def get_resources(pe): 37 | """Extract resources : 38 | [entropy, size]""" 39 | resources = [] 40 | if hasattr(pe, 'DIRECTORY_ENTRY_RESOURCE'): 41 | try: 42 | for resource_type in pe.DIRECTORY_ENTRY_RESOURCE.entries: 43 | if hasattr(resource_type, 'directory'): 44 | for resource_id in resource_type.directory.entries: 45 | if hasattr(resource_id, 'directory'): 46 | for resource_lang in resource_id.directory.entries: 47 | data = pe.get_data(resource_lang.data.struct.OffsetToData, 48 | resource_lang.data.struct.Size) 49 | size = resource_lang.data.struct.Size 50 | entropy = get_entropy(data) 51 | 52 | resources.append([entropy, size]) 53 | except Exception as e: 54 | return resources 55 | return resources 56 | 57 | 58 | def get_version_info(pe): 59 | """Return version infos""" 60 | res = {} 61 | for fileinfo in pe.FileInfo: 62 | if fileinfo.Key == 'StringFileInfo': 63 | for st in fileinfo.StringTable: 64 | for entry in st.entries.items(): 65 | res[entry[0]] = entry[1] 66 | if fileinfo.Key == 'VarFileInfo': 67 | for var in fileinfo.Var: 68 | res[var.entry.items()[0][0]] = var.entry.items()[0][1] 69 | if hasattr(pe, 'VS_FIXEDFILEINFO'): 70 | res['flags'] = pe.VS_FIXEDFILEINFO.FileFlags 71 | res['os'] = pe.VS_FIXEDFILEINFO.FileOS 72 | res['type'] = pe.VS_FIXEDFILEINFO.FileType 73 | res['file_version'] = pe.VS_FIXEDFILEINFO.FileVersionLS 74 | res['product_version'] = pe.VS_FIXEDFILEINFO.ProductVersionLS 75 | res['signature'] = pe.VS_FIXEDFILEINFO.Signature 76 | res['struct_version'] = pe.VS_FIXEDFILEINFO.StrucVersion 77 | return res 78 | 79 | 80 | def extract_infos(fpath): 81 | res = [] 82 | res.append(os.path.basename(fpath)) 83 | res.append(get_md5(fpath)) 84 | pe = pefile.PE(fpath) 85 | res.append(pe.FILE_HEADER.Machine) 86 | res.append(pe.FILE_HEADER.SizeOfOptionalHeader) 87 | 
res.append(pe.FILE_HEADER.Characteristics) 88 | res.append(pe.OPTIONAL_HEADER.MajorLinkerVersion) 89 | res.append(pe.OPTIONAL_HEADER.MinorLinkerVersion) 90 | res.append(pe.OPTIONAL_HEADER.SizeOfCode) 91 | res.append(pe.OPTIONAL_HEADER.SizeOfInitializedData) 92 | res.append(pe.OPTIONAL_HEADER.SizeOfUninitializedData) 93 | res.append(pe.OPTIONAL_HEADER.AddressOfEntryPoint) 94 | res.append(pe.OPTIONAL_HEADER.BaseOfCode) 95 | try: 96 | res.append(pe.OPTIONAL_HEADER.BaseOfData) 97 | except AttributeError: 98 | res.append(0) 99 | res.append(pe.OPTIONAL_HEADER.ImageBase) 100 | res.append(pe.OPTIONAL_HEADER.SectionAlignment) 101 | res.append(pe.OPTIONAL_HEADER.FileAlignment) 102 | res.append(pe.OPTIONAL_HEADER.MajorOperatingSystemVersion) 103 | res.append(pe.OPTIONAL_HEADER.MinorOperatingSystemVersion) 104 | res.append(pe.OPTIONAL_HEADER.MajorImageVersion) 105 | res.append(pe.OPTIONAL_HEADER.MinorImageVersion) 106 | res.append(pe.OPTIONAL_HEADER.MajorSubsystemVersion) 107 | res.append(pe.OPTIONAL_HEADER.MinorSubsystemVersion) 108 | res.append(pe.OPTIONAL_HEADER.SizeOfImage) 109 | res.append(pe.OPTIONAL_HEADER.SizeOfHeaders) 110 | res.append(pe.OPTIONAL_HEADER.CheckSum) 111 | res.append(pe.OPTIONAL_HEADER.Subsystem) 112 | res.append(pe.OPTIONAL_HEADER.DllCharacteristics) 113 | res.append(pe.OPTIONAL_HEADER.SizeOfStackReserve) 114 | res.append(pe.OPTIONAL_HEADER.SizeOfStackCommit) 115 | res.append(pe.OPTIONAL_HEADER.SizeOfHeapReserve) 116 | res.append(pe.OPTIONAL_HEADER.SizeOfHeapCommit) 117 | res.append(pe.OPTIONAL_HEADER.LoaderFlags) 118 | res.append(pe.OPTIONAL_HEADER.NumberOfRvaAndSizes) 119 | res.append(len(pe.sections)) 120 | entropy = list(map(lambda x: x.get_entropy(), pe.sections)) 121 | if len(entropy) > 0: 122 | res.append(sum(entropy) / float(len(entropy))) 123 | res.append(min(entropy)) 124 | res.append(max(entropy)) 125 | else: 126 | res.append(0) 127 | res.append(0) 128 | res.append(0) 129 | 130 | raw_sizes = list(map(lambda x: x.SizeOfRawData, pe.sections)) 
131 | if len(raw_sizes) > 0: 132 | res.append(sum(raw_sizes) / float(len(raw_sizes))) 133 | res.append(min(raw_sizes)) 134 | res.append(max(raw_sizes)) 135 | else: 136 | res.append(0) 137 | res.append(0) 138 | res.append(0) 139 | 140 | virtual_sizes = list(map(lambda x: x.Misc_VirtualSize, pe.sections)) 141 | if len(virtual_sizes) > 0: 142 | res.append(sum(virtual_sizes) / float(len(virtual_sizes))) 143 | res.append(min(virtual_sizes)) 144 | res.append(max(virtual_sizes)) 145 | else: 146 | res.append(0) 147 | res.append(0) 148 | res.append(0) 149 | # Imports 150 | try: 151 | res.append(len(pe.DIRECTORY_ENTRY_IMPORT)) 152 | imports = sum([x.imports for x in pe.DIRECTORY_ENTRY_IMPORT], []) 153 | res.append(len(imports)) 154 | res.append(len(list(filter(lambda x: x.name is None, imports)))) 155 | except AttributeError: 156 | res.append(0) 157 | res.append(0) 158 | res.append(0) 159 | # Exports 160 | try: 161 | res.append(len(pe.DIRECTORY_ENTRY_EXPORT.symbols)) 162 | except AttributeError: 163 | # No export 164 | res.append(0) 165 | # Resources 166 | resources = get_resources(pe) 167 | res.append(len(resources)) 168 | if len(resources) > 0: 169 | entropy = list(map(lambda x: x[0], resources)) 170 | res.append(sum(entropy) / float(len(entropy))) 171 | res.append(min(entropy)) 172 | res.append(max(entropy)) 173 | sizes = list(map(lambda x: x[1], resources)) 174 | res.append(sum(sizes) / float(len(sizes))) 175 | res.append(min(sizes)) 176 | res.append(max(sizes)) 177 | else: 178 | res.append(0) 179 | res.append(0) 180 | res.append(0) 181 | res.append(0) 182 | res.append(0) 183 | res.append(0) 184 | 185 | # Load configuration size 186 | try: 187 | res.append(pe.DIRECTORY_ENTRY_LOAD_CONFIG.struct.Size) 188 | except AttributeError: 189 | res.append(0) 190 | 191 | # Version configuration size 192 | try: 193 | version_infos = get_version_info(pe) 194 | res.append(len(version_infos.keys())) 195 | except AttributeError: 196 | res.append(0) 197 | return res 198 | 199 | 200 | def 
collect_features(path, class_id, class_name, max_files=0, max_size=0): 201 | count = 0 202 | list_of_res = [] 203 | for ffile in os.listdir(path): 204 | full_name = os.path.join(path, ffile) 205 | # print(full_name) 206 | statinfo = os.stat(full_name) 207 | if max_size != 0 and statinfo.st_size > max_size: 208 | continue 209 | try: 210 | res = extract_infos(full_name) 211 | res.append(class_id) 212 | res.append(class_name) 213 | # record the sample only after it parsed successfully 214 | list_of_res.append(res) 215 | count += 1 216 | if max_files > 0 and count >= max_files: 217 | break 218 | except pefile.PEFormatError: 219 | continue # skip files that are not valid PE images 220 | return count, list_of_res 221 | 222 | 223 | def extract_pe_features(features_cvs_filename, count_cvs_filename, path_to_samples, max_files=15000, max_size=5242880): 224 | features_columns = [ 225 | "Name", 226 | "md5", 227 | "Machine", 228 | "SizeOfOptionalHeader", 229 | "Characteristics", 230 | "MajorLinkerVersion", 231 | "MinorLinkerVersion", 232 | "SizeOfCode", 233 | "SizeOfInitializedData", 234 | "SizeOfUninitializedData", 235 | "AddressOfEntryPoint", 236 | "BaseOfCode", 237 | "BaseOfData", 238 | "ImageBase", 239 | "SectionAlignment", 240 | "FileAlignment", 241 | "MajorOperatingSystemVersion", 242 | "MinorOperatingSystemVersion", 243 | "MajorImageVersion", 244 | "MinorImageVersion", 245 | "MajorSubsystemVersion", 246 | "MinorSubsystemVersion", 247 | "SizeOfImage", 248 | "SizeOfHeaders", 249 | "CheckSum", 250 | "Subsystem", 251 | "DllCharacteristics", 252 | "SizeOfStackReserve", 253 | "SizeOfStackCommit", 254 | "SizeOfHeapReserve", 255 | "SizeOfHeapCommit", 256 | "LoaderFlags", 257 | "NumberOfRvaAndSizes", 258 | "SectionsNb", 259 | "SectionsMeanEntropy", 260 | "SectionsMinEntropy", 261 | "SectionsMaxEntropy", 262 | "SectionsMeanRawsize", 263 | "SectionsMinRawsize", 264 | "SectionMaxRawsize", 265 | "SectionsMeanVirtualsize", 266 | "SectionsMinVirtualsize", 267 | "SectionMaxVirtualsize", 268 | "ImportsNbDLL", 269 | "ImportsNb", 270 |
"ImportsNbOrdinal", 271 | "ExportNb", 272 | "ResourcesNb", 273 | "ResourcesMeanEntropy", 274 | "ResourcesMinEntropy", 275 | "ResourcesMaxEntropy", 276 | "ResourcesMeanSize", 277 | "ResourcesMinSize", 278 | "ResourcesMaxSize", 279 | "LoadConfigurationSize", 280 | "VersionInformationSize", 281 | "Malware_ClassID", 282 | "Malware_ClassName" 283 | ] 284 | 285 | csv_delimiter = ',' 286 | features_cvs_file = open(features_cvs_filename, "w") 287 | features_cvs_file.write(csv_delimiter.join(features_columns) + "\n") 288 | 289 | count_cvs_file = open(count_cvs_filename, "w") 290 | count_columns = ['class_name', 'total_samples'] 291 | count_cvs_file.write(csv_delimiter.join(count_columns) + "\n") 292 | 293 | class_names = os.listdir(path_to_samples) 294 | class_count = len(class_names) 295 | 296 | for class_id in tqdm(range(class_count), desc='Extracting features from Malware samples'): 297 | sub_dir = os.path.join(path_to_samples, class_names[class_id]) 298 | total_samples, list_of_res = collect_features(sub_dir, class_id, class_names[class_id], max_files, max_size) 299 | for res in list_of_res: 300 | features_cvs_file.write(csv_delimiter.join(map(lambda x: str(x), res)) + "\n") 301 | count_cvs_file.write(class_names[class_id] + csv_delimiter + str(total_samples) + "\n") 302 | features_cvs_file.flush() 303 | count_cvs_file.flush() 304 | 305 | features_cvs_file.close() 306 | count_cvs_file.close() 307 | 308 | -------------------------------------------------------------------------------- /data_utils/misc.py: -------------------------------------------------------------------------------- 1 | import os 2 | from tqdm import tqdm 3 | import pandas as pd 4 | from utils import * 5 | import sys 6 | import traceback 7 | 8 | 9 | def count_dataset(root_data_dir, csv_filename): 10 | class_names = os.listdir(root_data_dir) 11 | class_count = len(class_names) 12 | 13 | csv_delimiter = ',' 14 | count_columns = ['class_name', 'total_samples'] 15 | count_csv_file = open(csv_filename, "w") 
16 | 17 | count_csv_file.write(csv_delimiter.join(count_columns) + "\n") 18 | 19 | for class_id in tqdm(range(class_count), desc='Counting dataset'): 20 | files = os.listdir(os.path.join(root_data_dir, class_names[class_id])) 21 | count_csv_file.write(class_names[class_id] + csv_delimiter + str(len(files)) + "\n") 22 | 23 | count_csv_file.close() 24 | 25 | 26 | def process_cf_for_latex(data_list): 27 | data_path = os.path.join('logs', 'logs_preserve', 'logs_only') 28 | cf_file = 'confusion_matrix.csv' 29 | cf_file_latex = 'confusion_matrix_latex.txt' 30 | 31 | for date_dir, expr_name in data_list: 32 | cf_filepath = os.path.join(data_path, date_dir, expr_name, cf_file) 33 | print_line() 34 | print(f'Processing {cf_filepath}', end='') 35 | try: 36 | df = pd.read_csv(cf_filepath, index_col=0) 37 | df = df.apply(lambda x: x / x.sum(), axis=1) # normalize by row 38 | df_s = len(df.columns) 39 | 40 | cf_file_latexpath = os.path.join(data_path, date_dir, expr_name, cf_file_latex) 41 | with open(cf_file_latexpath, 'w+') as f: 42 | for r in range(df_s): 43 | for c in range(df_s): 44 | cell_val = round(df.iloc[r, c], 4) 45 | msg = '{c} {r} {cell_val}\n'.format(c=c, r=r, cell_val=cell_val) 46 | f.write(msg) 47 | except Exception: 48 | traceback.print_exc() # print_exc() prints the traceback itself and returns None 49 | print_line() 50 | print(sys.exc_info()[0]) 51 | print_line() 52 | print('\t--> failed') 53 | else: 54 | print('\t--> success') 55 | print(cf_file_latexpath) 56 | print_line() -------------------------------------------------------------------------------- /detect_malware.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from datetime import datetime 3 | from models.models_utils import * 4 | from utils import * 5 | from sklearn.model_selection import train_test_split 6 | from sklearn.preprocessing import StandardScaler 7 | from models.Shallow_ML_models import * 8 | import sklearn.metrics 9 | import sys 10 | import traceback 11 | 12 | 13 | def setup(): 14 |
current_time_str = str(datetime.now().strftime("%d-%b-%Y_%H_%M_%S")) 15 | LOG_DIR = os.path.join(LOG_MASTER_DIR, current_time_str) 16 | os.makedirs(LOG_DIR) 17 | return LOG_DIR 18 | 19 | 20 | def execute_deep_feedforward_model(model_params, LOG_DIR): 21 | print(f'Model params: {model_params}') 22 | 23 | batch_size = model_params['batch_size'] 24 | feature_type = model_params['feature_type'] 25 | 26 | if feature_type == FEATURE_TYPE_IMAGE: 27 | image_dim = model_params['image_dim'] 28 | conv1d_image_dim_w = -1 29 | data_path = get_image_datapath(image_dim) 30 | if image_dim == 0: 31 | # conv1d models 32 | conv1d_image_dim_w = model_params['conv1d_image_dim_w'] 33 | 34 | print(f'Loading image data') 35 | train_loader, val_loader, dataset_len, class_names = get_image_data_loaders(data_path=data_path, 36 | image_dim=image_dim, 37 | batch_size=batch_size, 38 | conv1d_image_dim_w=conv1d_image_dim_w) 39 | 40 | else: 41 | print(f'Loading opcode data') 42 | opcode_len = model_params['opcode_len'] 43 | data_path = get_opcode_datapath(opcode_len) 44 | train_loader, val_loader, dataset_len, class_names, \ 45 | text_vocal_len, label_vocab_len, pad_idx = get_opcode_data_loaders(data_path=data_path, 46 | opcode_len=opcode_len, 47 | batch_size=batch_size) 48 | model_params['input_dim'] = text_vocal_len 49 | model_params['output_dim'] = label_vocab_len 50 | 51 | train_set_len = len(train_loader) * batch_size 52 | val_set_len = len(val_loader) * batch_size 53 | num_of_classes = len(class_names) 54 | 55 | model_params['num_of_classes'] = num_of_classes 56 | model_params['class_names'] = class_names 57 | 58 | if feature_type == FEATURE_TYPE_IMAGE: 59 | model = create_deep_image_model(model_params).to(device) 60 | else: 61 | model = create_deep_opcode_model(model_params).to(device) 62 | 63 | criterion = nn.CrossEntropyLoss().to(device) 64 | 65 | model, train_losses, train_accuracy = train_ann_model(model=model, model_params=model_params, criterion=criterion, 66 | 
train_loader=train_loader, log_dir=LOG_DIR) 67 | test_accuracy, predicted, ground_truth = test_ann_model(model=model, model_params=model_params, criterion=criterion, 68 | val_loader=val_loader) 69 | 70 | model_params['train_accuracy'] = np.mean(train_accuracy) 71 | model_params['test_accuracy'] = np.mean(test_accuracy) 72 | 73 | print(f"Average Train accuracy: {model_params['train_accuracy']:7.4f}%") 74 | print(f"Average Test accuracy : {model_params['test_accuracy']:7.4f}%") 75 | 76 | save_model_results_to_log(model=model, model_params=model_params, 77 | train_losses=train_losses, train_accuracy=train_accuracy, 78 | predicted=predicted, ground_truth=ground_truth, 79 | log_dir=LOG_DIR) 80 | 81 | 82 | def execute_deep_rnn_model(model_params, LOG_DIR): 83 | print(f'Model params: {model_params}') 84 | 85 | batch_size = model_params['batch_size'] 86 | opcode_len = model_params['opcode_len'] 87 | data_path = get_opcode_datapath(opcode_len) 88 | 89 | print_line() 90 | print('Loading Opcode data') 91 | 92 | train_iterator, test_iterator, dataset_len, class_names, \ 93 | text_vocal_len, label_vocab_len, pad_idx = get_opcode_data_loaders(data_path=data_path, 94 | opcode_len=opcode_len, 95 | batch_size=batch_size) 96 | num_of_classes = len(class_names) 97 | 98 | print(f'Total samples available: {dataset_len}') 99 | print(f'Number of classes: {num_of_classes}') 100 | print_line() 101 | 102 | model_params['num_of_classes'] = num_of_classes 103 | model_params['class_names'] = class_names 104 | model_params['input_dim'] = text_vocal_len 105 | model_params['output_dim'] = label_vocab_len 106 | 107 | model = create_deep_opcode_model(model_params).to(device) 108 | criterion = nn.CrossEntropyLoss().to(device) 109 | 110 | model, train_losses, train_accuracy = train_rnn_model(model=model, model_params=model_params, criterion=criterion, 111 | train_loader=train_iterator, log_dir=LOG_DIR) 112 | test_accuracy, predicted, ground_truth = test_rnn_model(model=model,
model_params=model_params, criterion=criterion, 113 | val_loader=test_iterator) 114 | 115 | model_params['train_accuracy'] = np.mean(train_accuracy) 116 | model_params['test_accuracy'] = np.mean(test_accuracy) 117 | 118 | print(f"Average Train accuracy: {model_params['train_accuracy']:7.4f}%") 119 | print(f"Average Test accuracy : {model_params['test_accuracy']:7.4f}%") 120 | 121 | save_model_results_to_log(model=model, model_params=model_params, 122 | train_losses=train_losses, train_accuracy=train_accuracy, 123 | predicted=predicted, ground_truth=ground_truth, 124 | log_dir=LOG_DIR) 125 | 126 | 127 | def execute_conv_tl_model(model_params, LOG_DIR): 128 | print(f'Model params: {model_params}') 129 | 130 | batch_size = model_params['batch_size'] 131 | image_dim = model_params['image_dim'] 132 | data_path = get_image_datapath(image_dim) 133 | # dataloader transforms input images of image_dim to what pre-trained model expects 134 | # pretrained_image_dim is what pre-trained model expects 135 | pretrained_image_dim = get_pretrained_image_dim(model_params['model_name']) 136 | train_loader, val_loader, dataset_len, class_names = get_image_data_loaders(data_path=data_path, 137 | image_dim=image_dim, 138 | batch_size=batch_size, 139 | convert_to_rgb=True, 140 | pretrained_image_dim=pretrained_image_dim) 141 | 142 | train_set_len = len(train_loader) * batch_size 143 | val_set_len = len(val_loader) * batch_size 144 | num_of_classes = len(class_names) 145 | 146 | model_params['num_of_classes'] = num_of_classes 147 | model_params['class_names'] = class_names 148 | 149 | model = create_conv_tl_model(model_params).to(device) 150 | criterion = nn.CrossEntropyLoss().to(device) 151 | 152 | model, train_losses, train_accuracy = train_ann_model(model=model, model_params=model_params, criterion=criterion, 153 | train_loader=train_loader, log_dir=LOG_DIR) 154 | test_accuracy, predicted, ground_truth = test_ann_model(model=model, model_params=model_params, criterion=criterion, 155 | 
val_loader=val_loader) 156 | 157 | model_params['train_accuracy'] = np.mean(train_accuracy) 158 | model_params['test_accuracy'] = np.mean(test_accuracy) 159 | 160 | print(f"Average Train accuracy: {model_params['train_accuracy']:7.4f}%") 161 | print(f"Average Test accuracy : {model_params['test_accuracy']:7.4f}%") 162 | 163 | save_model_results_to_log(model=model, model_params=model_params, 164 | train_losses=train_losses, train_accuracy=train_accuracy, 165 | predicted=predicted, ground_truth=ground_truth, 166 | log_dir=LOG_DIR) 167 | 168 | 169 | def process_deep_learning(experiment_types, LOG_DIR): 170 | for expr_type in experiment_types: 171 | malware_expr_list = get_malware_experiments_list(expr_type) 172 | print(malware_expr_list) 173 | total_expr = len(malware_expr_list) 174 | 175 | final_results = [] 176 | for num, ml in enumerate(malware_expr_list): 177 | if 'num_layers' in ml.keys(): 178 | num_layers = ml['num_layers'] 179 | if num_layers == 1: 180 | ml['dropout'] = 0 # recurrent dropout only applies between stacked layers 181 | 182 | print_line() 183 | print(f'Executing : {ml["experiment_name"]} ({num + 1}/{total_expr})') 184 | print_line() 185 | try: 186 | if expr_type == DEEP_FF: 187 | execute_deep_feedforward_model(ml, LOG_DIR) 188 | elif expr_type == DEEP_RNN: 189 | execute_deep_rnn_model(ml, LOG_DIR) 190 | except Exception: 191 | temp_dict = {'experiment_name': ml['experiment_name'], 192 | 'train_accuracy': 'failed', 193 | 'test_accuracy': 'failed'} 194 | print_line() 195 | print("FAILED") 196 | traceback.print_exc() # print_exc() prints the traceback itself and returns None 197 | print_line() 198 | print(sys.exc_info()[0]) 199 | print_line() 200 | else: 201 | temp_dict = {'experiment_name': ml['experiment_name'], 202 | 'train_accuracy': ml['train_accuracy'], 203 | 'test_accuracy': ml['test_accuracy']} 204 | 205 | final_results.append(temp_dict) 206 | 207 | exp_results_filename = os.path.join(LOG_DIR, expr_type + '_' + EXPERIMENT_RESULTS) 208 | df = pd.DataFrame(final_results) 209 | expr_name = df['experiment_name'] 210 | df.drop(['experiment_name'], axis=1,
inplace=True) 211 | df.set_index(expr_name, drop=True, inplace=True) 212 | df.to_csv(exp_results_filename) 213 | save_models_metadata_to_log(malware_expr_list, LOG_DIR) 214 | 215 | 216 | def prepare_shallow_model(model_params, LOG_DIR): 217 | print(f'Model params: {model_params}') 218 | df = pd.read_csv(ORG_DATASET_PE_FEATURES_CSV) 219 | 220 | # sort class names and re-assign the new class IDs w.r.t. sorted classes. 221 | df.sort_values(by=['Malware_ClassName'], inplace=True) 222 | malware_classes = df['Malware_ClassName'].values 223 | malware_classes = sorted(list(set(list(malware_classes)))) 224 | new_class_ids = df.apply(lambda row: malware_classes.index(row['Malware_ClassName']), axis=1) 225 | df['Malware_ClassID'] = new_class_ids 226 | 227 | data = df.drop(['Name', 'md5', 'Malware_ClassName'], axis=1) 228 | x = data.drop(['Malware_ClassID'], axis=1) 229 | y = data['Malware_ClassID'] 230 | x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20) 231 | 232 | model_params['num_of_classes'] = len(malware_classes) 233 | model_params['class_names'] = malware_classes 234 | 235 | sc = StandardScaler() 236 | x_train = sc.fit_transform(x_train) 237 | x_test = sc.transform(x_test) 238 | 239 | model, gsc_model = create_shallow_model(model_params=model_params) 240 | 241 | x_pred, y_pred, best_estimator, best_params = execute_shallow_model(model=gsc_model, x_train=x_train, 242 | y_train=y_train, x_test=x_test, 243 | model_params=model_params) 244 | 245 | test_accuracy = sklearn.metrics.accuracy_score(y_test, y_pred) 246 | 247 | model_params['train_accuracy'] = gsc_model.cv_results_['mean_train_score'][gsc_model.best_index_] 248 | model_params['test_accuracy'] = test_accuracy 249 | print(f"Train accuracy: {model_params['train_accuracy']:7.4f}%") 250 | print(f"Test accuracy: {model_params['test_accuracy']:7.4f}%") 251 | 252 | save_model_results_to_log(model=gsc_model, model_params=model_params, 253 | predicted=y_pred, ground_truth=y_test,
best_params=best_params, 254 | log_dir=LOG_DIR) 255 | 256 | 257 | def process_shallow_learning(LOG_DIR): 258 | shallow_expr_list = get_shallow_expr_list() 259 | total_expr = len(shallow_expr_list) 260 | final_results = [] 261 | for num, ml in enumerate(shallow_expr_list): 262 | print_line() 263 | print(f'Executing : {ml["experiment_name"]} ({num + 1}/{total_expr})') 264 | print_line() 265 | prepare_shallow_model(ml, LOG_DIR) 266 | temp_dict = {'experiment_name': ml['experiment_name'], 267 | 'train_accuracy': ml['train_accuracy'], 268 | 'test_accuracy': ml['test_accuracy']} 269 | final_results.append(temp_dict) 270 | 271 | exp_results_filename = os.path.join(LOG_DIR, 'shallow_' + EXPERIMENT_RESULTS) 272 | df = pd.DataFrame(final_results) 273 | expr_name = df['experiment_name'] 274 | df.drop(['experiment_name'], axis=1, inplace=True) 275 | df.set_index(expr_name, drop=True, inplace=True) 276 | df.to_csv(exp_results_filename) 277 | save_models_metadata_to_log(shallow_expr_list, LOG_DIR) 278 | 279 | 280 | def process_conv_transfer_learning(LOG_DIR): 281 | tl_expr_list = get_conv_transfer_learning_expr_list() 282 | total_expr = len(tl_expr_list) 283 | final_results = [] 284 | for num, ml in enumerate(tl_expr_list): 285 | print_line() 286 | print(f'Executing : {ml["experiment_name"]} ({num + 1}/{total_expr})') 287 | print_line() 288 | try: 289 | execute_conv_tl_model(ml, LOG_DIR) 290 | except Exception: 291 | temp_dict = {'experiment_name': ml['experiment_name'], 292 | 'train_accuracy': 'failed', 293 | 'test_accuracy': 'failed'} 294 | print_line() 295 | print("FAILED") 296 | traceback.print_exc() # print_exc() prints the traceback itself and returns None 297 | print_line() 298 | print(sys.exc_info()[0]) 299 | print_line() 300 | else: 301 | temp_dict = {'experiment_name': ml['experiment_name'], 302 | 'train_accuracy': ml['train_accuracy'], 303 | 'test_accuracy': ml['test_accuracy']} 304 | final_results.append(temp_dict) 305 | 306 | exp_results_filename = os.path.join(LOG_DIR, 'conv_tl_' + EXPERIMENT_RESULTS) 307 | df =
pd.DataFrame(final_results) 308 | expr_name = df['experiment_name'] 309 | df.drop(['experiment_name'], axis=1, inplace=True) 310 | df.set_index(expr_name, drop=True, inplace=True) 311 | df.to_csv(exp_results_filename) 312 | save_models_metadata_to_log(tl_expr_list, LOG_DIR) 313 | 314 | 315 | def main(args, LOG_DIR): 316 | deep_learning_models = [] 317 | if args.deep_feedforward: 318 | deep_learning_models.append(DEEP_FF) 319 | if args.deep_rnn: 320 | deep_learning_models.append(DEEP_RNN) 321 | 322 | if len(deep_learning_models) > 0: 323 | print_line() 324 | print(f'Starting Deep Learning Experiments to detect Malwares') 325 | print_line() 326 | process_deep_learning(deep_learning_models, LOG_DIR) 327 | print_line() 328 | 329 | if args.shallow_ml: 330 | print(f'Starting shallow Machine Learning Experiments to detect Malwares') 331 | print_line() 332 | process_shallow_learning(LOG_DIR) 333 | print_line() 334 | 335 | if args.transfer_conv_ml: 336 | print(f'Starting conv-based Transfer Learning Experiments to detect Malwares') 337 | print_line() 338 | process_conv_transfer_learning(LOG_DIR) 339 | print_line() 340 | 341 | 342 | def print_banner(LOG_DIR): 343 | print_line() 344 | if use_cuda: 345 | print('Using GPU:', torch.cuda.get_device_name(torch.cuda.current_device())) 346 | else: 347 | print('Running on :', device) 348 | 349 | print(f'LOG_DIR = {LOG_DIR}') 350 | print_line() 351 | 352 | 353 | if __name__ == '__main__': 354 | parser = argparse.ArgumentParser(description='Train Machine Learning models to detect and classify Malware') 355 | 356 | parser.add_argument('--deep_feedforward', action='store_true', help='Execute deep feedforward models', 357 | default=False) 358 | parser.add_argument('--deep_rnn', action='store_true', help='Execute deep rnn models', 359 | default=False) 360 | parser.add_argument('--shallow_ml', action='store_true', help='Execute shallow machine learning models', 361 | default=False) 362 | parser.add_argument('--transfer_conv_ml', 
action='store_true', help='Transfer learning using conv-based models', 363 | default=False) 364 | args = parser.parse_args() 365 | 366 | if len(sys.argv) < 2: 367 | parser.print_usage() 368 | sys.exit(1) 369 | 370 | LOG_DIR = setup() 371 | print_banner(LOG_DIR) 372 | main(args, LOG_DIR) 373 | print_banner(LOG_DIR) 374 | -------------------------------------------------------------------------------- /models/AnnModels.py: -------------------------------------------------------------------------------- 1 | import random 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | from torch.utils.data import DataLoader 6 | from torchvision import datasets, transforms, models 7 | from torchvision.utils import make_grid 8 | from config import * 9 | import numpy as np 10 | 11 | 12 | class ANNMalware_Model1(nn.Module): 13 | def __init__(self, image_dim=32, num_of_classes=20): 14 | super().__init__() 15 | 16 | self.image_dim = image_dim 17 | self.num_of_classes = num_of_classes 18 | 19 | self.linear1_in_features = int(self.image_dim * self.image_dim) 20 | # reduce the neurons by 20% i.e. 
take 80% in_features 21 | self.linear1_out_features = int(self.linear1_in_features * 0.80) 22 | # reduce the neurons by 40% i.e. take 60% of linear1_out_features 23 | self.linear2_out_features = int(self.linear1_out_features * 0.60) 24 | 25 | self.classifier = nn.Sequential( 26 | nn.Linear(self.linear1_in_features, self.linear1_out_features), 27 | nn.ReLU(inplace=True), 28 | nn.Linear(self.linear1_out_features, self.linear2_out_features), 29 | nn.ReLU(inplace=True), 30 | nn.Linear(self.linear2_out_features, self.num_of_classes), 31 | ) 32 | 33 | def forward(self, x): 34 | x = torch.flatten(x, 1) 35 | x = self.classifier(x) 36 | return F.log_softmax(x, dim=1) 37 | 38 | 39 | class ANNMalware_Model2(nn.Module): 40 | def __init__(self, image_dim=32, num_of_classes=20): 41 | super().__init__() 42 | 43 | self.image_dim = image_dim 44 | self.num_of_classes = num_of_classes 45 | 46 | self.linear1_in_features = int(self.image_dim * self.image_dim) 47 | # reduce the neurons by 20% i.e. take 80% in_features 48 | self.linear1_out_features = int(self.linear1_in_features * 0.80) 49 | # reduce the neurons by 40% i.e. take 60% of linear1_out_features 50 | self.linear2_out_features = int(self.linear1_out_features * 0.60) 51 | # take 40% of linear1_out_features (relative to layer 1, not to the previous layer) 52 | self.linear3_out_features = int(self.linear1_out_features * 0.40) 53 | # take 20% of linear1_out_features (relative to layer 1, not to the previous layer) 54 | self.linear4_out_features = int(self.linear1_out_features * 0.20) 55 | 56 | self.classifier = nn.Sequential( 57 | nn.Linear(self.linear1_in_features, self.linear1_out_features), 58 | nn.ReLU(inplace=True), 59 | nn.Linear(self.linear1_out_features, self.linear2_out_features), 60 | nn.ReLU(inplace=True), 61 | nn.Linear(self.linear2_out_features, self.linear3_out_features), 62 | nn.ReLU(inplace=True), 63 | nn.Linear(self.linear3_out_features, self.linear4_out_features), 64 | nn.ReLU(inplace=True), 65 | nn.Linear(self.linear4_out_features, self.num_of_classes) 66 | ) 67 | 68 | def forward(self, x): 69 | x = torch.flatten(x, 1) 70 | x = self.classifier(x) 71 | return F.log_softmax(x, dim=1) 72 |
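The layer widths in `ANNMalware_Model2` are easier to follow with concrete numbers. The sketch below repeats the same arithmetic in plain Python, without PyTorch. The helper name `ann_model2_layer_widths` is hypothetical and not part of the repository; it only mirrors the expressions in `__init__`, where layers 2 to 4 are all sized relative to `linear1_out_features`.

```python
# Hypothetical helper (not in the repository): mirrors the width arithmetic
# used in ANNMalware_Model2.__init__ so the layer sizes can be inspected.
def ann_model2_layer_widths(image_dim=32):
    """Return (in_features, linear1, linear2, linear3, linear4) widths."""
    in_features = int(image_dim * image_dim)  # flattened gray-scale image
    linear1 = int(in_features * 0.80)         # 80% of the input width
    linear2 = int(linear1 * 0.60)             # 60% of linear1
    linear3 = int(linear1 * 0.40)             # 40% of linear1
    linear4 = int(linear1 * 0.20)             # 20% of linear1
    return in_features, linear1, linear2, linear3, linear4


if __name__ == "__main__":
    # Print the widths for the default 32x32 image size.
    print(ann_model2_layer_widths(32))
```

With the default `image_dim=32`, the classifier narrows from 1024 inputs through 819, 491, 327, and 163 units before the final `num_of_classes` outputs.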
-------------------------------------------------------------------------------- /models/CnnModels.py: -------------------------------------------------------------------------------- 1 | import random 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | from torch.utils.data import DataLoader 6 | from torchvision import datasets, transforms, models 7 | from torchvision.utils import make_grid 8 | from config import * 9 | import numpy as np 10 | 11 | 12 | class CNNMalware_Model1(nn.Module): 13 | def __init__(self, image_dim=32, num_of_classes=20): 14 | super().__init__() 15 | 16 | self.image_dim = image_dim 17 | self.num_of_classes = num_of_classes 18 | 19 | self.conv1_out_channel = 12 20 | self.conv1_kernel_size = 3 21 | 22 | self.conv2_out_channel = 16 23 | self.conv2_kernel_size = 3 24 | 25 | self.linear1_out_features = 120 26 | self.linear2_out_features = 90 27 | 28 | self.conv1 = nn.Conv2d(1, self.conv1_out_channel, self.conv1_kernel_size, stride=1, 29 | padding=(2, 2)) 30 | 31 | self.conv2 = nn.Conv2d(self.conv1_out_channel, self.conv2_out_channel, self.conv2_kernel_size, stride=1, 32 | padding=(2, 2)) 33 | 34 | self.temp = int((((self.image_dim + 2) / 2) + 2) / 2) 35 | 36 | self.fc1 = nn.Linear(self.temp * self.temp * self.conv2_out_channel, self.linear1_out_features) 37 | self.fc2 = nn.Linear(self.linear1_out_features, self.linear2_out_features) 38 | self.fc3 = nn.Linear(self.linear2_out_features, self.num_of_classes) 39 | 40 | def forward(self, X): 41 | X = F.relu(self.conv1(X)) 42 | X = F.max_pool2d(X, 2, 2) 43 | X = F.relu(self.conv2(X)) 44 | X = F.max_pool2d(X, 2, 2) 45 | X = X.view(-1, self.temp * self.temp * self.conv2_out_channel) 46 | X = F.relu(self.fc1(X)) 47 | X = F.relu(self.fc2(X)) 48 | X = self.fc3(X) 49 | return F.log_softmax(X, dim=1) 50 | 51 | 52 | class CNNMalware_Model2(nn.Module): 53 | def __init__(self, image_dim=32, num_of_classes=20): 54 | super().__init__() 55 | 56 | self.image_dim = image_dim 57 | 
self.num_of_classes = num_of_classes 58 | self.padding = 2 59 | self.conv1_out_channel = 15 60 | self.conv1_kernel_size = 15 61 | self.stride = 1 62 | self.conv2_out_channel = 16 63 | self.conv2_kernel_size = 3 64 | 65 | conv1_nurons = int((self.image_dim - self.conv1_kernel_size + 2 * self.padding) / self.stride + 1) 66 | maxpool2d_1_nurons = int(conv1_nurons / 2) 67 | conv2_nurons = ((maxpool2d_1_nurons - self.conv2_kernel_size + 2 * self.padding) / self.stride + 1) 68 | maxpool2d_2_nurons = int(conv2_nurons / 2) 69 | 70 | self.linear1_in_features = int(maxpool2d_2_nurons * maxpool2d_2_nurons * self.conv2_out_channel) 71 | 72 | # reduce the neurons by 20% i.e. take 80% in_features 73 | self.linear1_out_features = int(self.linear1_in_features * 0.80) 74 | # reduce the neurons by 40% 75 | self.linear2_out_features = int(self.linear1_out_features * 0.60) 76 | 77 | self.features = nn.Sequential( 78 | nn.Conv2d(1, self.conv1_out_channel, self.conv1_kernel_size, 79 | stride=self.stride, padding=(self.padding, self.padding)), 80 | nn.ReLU(inplace=True), 81 | nn.MaxPool2d(kernel_size=2), 82 | nn.Conv2d(self.conv1_out_channel, self.conv2_out_channel, self.conv2_kernel_size, 83 | stride=self.stride, padding=(self.padding, self.padding)), 84 | nn.ReLU(inplace=True), 85 | nn.MaxPool2d(kernel_size=2), 86 | ) 87 | 88 | self.classifier = nn.Sequential( 89 | nn.Linear(self.linear1_in_features, self.linear1_out_features), 90 | nn.ReLU(inplace=True), 91 | nn.Linear(self.linear1_out_features, self.linear2_out_features), 92 | nn.ReLU(inplace=True), 93 | nn.Linear(self.linear2_out_features, self.num_of_classes), 94 | ) 95 | 96 | def forward(self, x): 97 | x = self.features(x) 98 | x = torch.flatten(x, 1) 99 | x = self.classifier(x) 100 | return F.log_softmax(x, dim=1) 101 | 102 | 103 | class CNNMalware_Model3(nn.Module): 104 | def __init__(self, image_dim_h=1, image_dim_w=1024, num_of_classes=20): 105 | super().__init__() 106 | 107 | self.image_dim_h = image_dim_h 108 | 
self.image_dim_w = image_dim_w 109 | self.num_of_classes = num_of_classes 110 | 111 | self.conv1_in_channel = 1 112 | self.conv1_out_channel = 28 113 | self.conv1_kernel_size = 3 114 | 115 | self.conv2_out_channel = 16 116 | self.conv2_kernel_size = 3 117 | 118 | self.linear1_out_features = 120 119 | self.linear2_out_features = 90 120 | 121 | self.conv1 = nn.Conv1d(self.conv1_in_channel, self.conv1_out_channel, self.conv1_kernel_size, stride=1, 122 | padding=(0)) 123 | 124 | self.conv2 = nn.Conv1d(self.conv1_out_channel, self.conv2_out_channel, self.conv2_kernel_size, stride=1, 125 | padding=(2)) 126 | 127 | self.fc1_size = self.conv2_out_channel * self.image_dim_w 128 | 129 | self.fc1 = nn.Linear(self.fc1_size, self.linear1_out_features) 130 | self.fc2 = nn.Linear(self.linear1_out_features, self.linear2_out_features) 131 | self.fc3 = nn.Linear(self.linear2_out_features, self.num_of_classes) 132 | 133 | def forward(self, X): 134 | X = X.squeeze(dim=1) 135 | X = F.relu(self.conv1(X)) 136 | X = F.relu(self.conv2(X)) 137 | X = X.view(-1, self.fc1_size) 138 | X = F.relu(self.fc1(X)) 139 | X = F.relu(self.fc2(X)) 140 | X = self.fc3(X) 141 | return F.log_softmax(X, dim=1) 142 | 143 | 144 | class CNNMalware_Model4(nn.Module): 145 | def __init__(self, image_dim_h=1, image_dim_w=1024, num_of_classes=20, 146 | c1_out=32, c1_kernel=16, c1_padding=2, c1_stride=2, 147 | c2_out=32, c2_kernel=8, c2_padding=2, c2_stride=2, 148 | ): 149 | super().__init__() 150 | 151 | self.image_dim_h = image_dim_h 152 | self.image_dim_w = image_dim_w 153 | self.num_of_classes = num_of_classes 154 | self.dilation = 1 155 | 156 | self.conv1_in_channel = 1 157 | self.conv1_out_channel = c1_out 158 | self.conv1_kernel_size = c1_kernel 159 | self.conv1_padding = c1_padding 160 | self.conv1_stride = c1_stride 161 | 162 | self.conv2_out_channel = c2_out 163 | self.conv2_kernel_size = c2_kernel 164 | self.conv2_padding = c2_padding 165 | self.conv2_stride = c2_stride 166 | 167 | self.linear1_out_features 
= 128 * 4 168 | self.linear2_out_features = 128 169 | 170 | self.conv1 = nn.Conv1d(self.conv1_in_channel, self.conv1_out_channel, self.conv1_kernel_size, 171 | stride=self.conv1_stride, 172 | padding=(self.conv1_padding), dilation=self.dilation) 173 | 174 | self.conv1_out_dim = self.calc_conv1d_out(self.image_dim_w, self.conv1_padding, 175 | self.dilation, self.conv1_kernel_size, self.conv1_stride) 176 | 177 | self.conv2 = nn.Conv1d(self.conv1_out_channel, self.conv2_out_channel, self.conv2_kernel_size, 178 | stride=self.conv2_stride, 179 | padding=(self.conv2_padding)) 180 | 181 | self.conv2_out_dim = self.calc_conv1d_out(self.conv1_out_dim, self.conv2_padding, 182 | self.dilation, self.conv2_kernel_size, self.conv2_stride) 183 | 184 | self.fc1 = nn.Linear(self.conv2_out_dim * self.conv2_out_channel, self.linear1_out_features) 185 | self.fc2 = nn.Linear(self.linear1_out_features, self.linear2_out_features) 186 | self.fc3 = nn.Linear(self.linear2_out_features, self.num_of_classes) 187 | 188 | def calc_conv1d_out(self, l_in, padding, dilation, kernel_size, stride): 189 | return int(((l_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride) + 1) 190 | 191 | def forward(self, X): 192 | X = X.squeeze(dim=1) 193 | X = F.relu(self.conv1(X)) 194 | X = F.relu(self.conv2(X)) 195 | X = X.view(X.shape[0], -1) 196 | X = F.relu(self.fc1(X)) 197 | X = F.relu(self.fc2(X)) 198 | X = self.fc3(X) 199 | return F.log_softmax(X, dim=1) 200 | 201 | 202 | class CNNMalware_Model5(nn.Module): 203 | def __init__(self, input_dim=1024, embedding_dim=128, n_filters=3, filter_sizes=[3,6], output_dim=128, dropout=0.3): 204 | super().__init__() 205 | 206 | self.embedding = nn.Embedding(input_dim, embedding_dim) 207 | 208 | self.convs = nn.ModuleList([ 209 | nn.Conv2d(in_channels=1, 210 | out_channels=n_filters, 211 | kernel_size=(fs, embedding_dim)) 212 | for fs in filter_sizes 213 | ]) 214 | 215 | self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim) 216 | 217 | self.dropout = 
nn.Dropout(dropout) 218 | 219 | def forward(self, text): 220 | text = text.permute(1, 0) 221 | embedded = self.embedding(text) 222 | embedded = embedded.unsqueeze(1) 223 | conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs] 224 | pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved] 225 | cat = self.dropout(torch.cat(pooled, dim=1)) 226 | cat2 = self.fc(cat) 227 | return F.log_softmax(cat2, dim=1) 228 | -------------------------------------------------------------------------------- /models/RnnModels.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | 6 | class RNNMalware_Model1(torch.nn.Module): 7 | def __init__(self, input_dim=0, embedding_dim=100, hidden_dim=100, output_dim=20, 8 | batch_size=256, num_layers=1, bidirectional=False, dropout=0): 9 | super().__init__() 10 | self.input_dim = input_dim 11 | self.embedding_dim = embedding_dim 12 | self.hidden_dim = hidden_dim 13 | self.output_dim = output_dim 14 | self.batch_size = batch_size 15 | self.num_layers = num_layers 16 | self.bidirectional = bidirectional 17 | self.dropout = dropout 18 | self.fc_hidden_dim = self.hidden_dim 19 | 20 | if self.bidirectional: 21 | self.fc_hidden_dim = self.hidden_dim * 2 22 | 23 | self.embedding = nn.Embedding(self.input_dim, self.embedding_dim) 24 | self.rnn = nn.RNN(input_size=self.embedding_dim, hidden_size=self.hidden_dim, num_layers=self.num_layers, 25 | nonlinearity='relu', bidirectional=self.bidirectional, dropout=self.dropout) 26 | self.fc = nn.Linear(self.fc_hidden_dim, self.output_dim) 27 | 28 | def forward(self, opcode): 29 | embedded = self.embedding(opcode) 30 | output, hidden = self.rnn(embedded) 31 | return self.fc(output[-1]) 32 | 33 | 34 | class LSTMMalware_Model1(torch.nn.Module): 35 | def __init__(self, input_dim=0, embedding_dim=100, hidden_dim=100, output_dim=20, 36 | batch_size=256, num_layers=1, 
bidirectional=False, dropout=0): 37 | super().__init__() 38 | self.input_dim = input_dim 39 | self.embedding_dim = embedding_dim 40 | self.hidden_dim = hidden_dim 41 | self.output_dim = output_dim 42 | self.batch_size = batch_size 43 | self.num_layers = num_layers 44 | self.bidirectional = bidirectional 45 | self.dropout = dropout 46 | self.fc_hidden_dim = self.hidden_dim 47 | 48 | if self.bidirectional: 49 | self.fc_hidden_dim = self.hidden_dim * 2 50 | 51 | self.embedding = nn.Embedding(self.input_dim, self.embedding_dim) 52 | 53 | self.lstm = nn.LSTM(input_size=self.embedding_dim, hidden_size=self.hidden_dim, num_layers=self.num_layers, 54 | bidirectional=self.bidirectional, dropout=self.dropout) 55 | 56 | self.dropout_layer = nn.Dropout(self.dropout) 57 | 58 | self.fc = nn.Linear(self.fc_hidden_dim, self.output_dim) 59 | 60 | def forward(self, opcode): 61 | embedded = self.embedding(opcode) 62 | output, (hidden, cell) = self.lstm(embedded) 63 | 64 | if self.bidirectional: 65 | hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1) 66 | else: 67 | hidden = hidden[-1::].squeeze(0) 68 | 69 | fc_in = self.dropout_layer(hidden) 70 | return self.fc(fc_in) 71 | 72 | 73 | class GRUMalware_Model1(torch.nn.Module): 74 | def __init__(self, input_dim=0, embedding_dim=100, hidden_dim=100, output_dim=20, 75 | batch_size=256, num_layers=1, bidirectional=False, dropout=0): 76 | super().__init__() 77 | self.input_dim = input_dim 78 | self.embedding_dim = embedding_dim 79 | self.hidden_dim = hidden_dim 80 | self.output_dim = output_dim 81 | self.batch_size = batch_size 82 | self.num_layers = num_layers 83 | self.bidirectional = bidirectional 84 | self.dropout = dropout 85 | self.fc_hidden_dim = self.hidden_dim 86 | 87 | if self.bidirectional: 88 | self.fc_hidden_dim = self.hidden_dim * 2 89 | 90 | self.embedding = nn.Embedding(self.input_dim, self.embedding_dim) 91 | 92 | self.gru = nn.GRU(input_size=self.embedding_dim, hidden_size=self.hidden_dim, 
num_layers=self.num_layers, 93 | bidirectional=self.bidirectional, dropout=self.dropout) 94 | 95 | self.dropout_layer = nn.Dropout(self.dropout) 96 | 97 | self.fc = nn.Linear(self.fc_hidden_dim, self.output_dim) 98 | 99 | def forward(self, opcode): 100 | embedded = self.embedding(opcode) 101 | output, hidden = self.gru(embedded) 102 | 103 | if self.bidirectional: 104 | hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1) 105 | else: 106 | hidden = hidden[-1::].squeeze(0) 107 | 108 | fc_in = self.dropout_layer(hidden) 109 | return self.fc(fc_in) 110 | 111 | 112 | class StackedMalware_Model1(torch.nn.Module): 113 | def __init__(self, input_dim=0, embedding_dim=100, hidden_dim=100, output_dim=20, 114 | batch_size=256, num_layers=1, bidirectional=False, dropout=0, LG=True): 115 | super().__init__() 116 | self.input_dim = input_dim 117 | self.embedding_dim = embedding_dim 118 | self.hidden_dim = hidden_dim 119 | self.output_dim = output_dim 120 | self.batch_size = batch_size 121 | self.num_layers = num_layers 122 | self.fc_hidden_dim = self.hidden_dim 123 | self.bidirectional = bidirectional 124 | self.num_of_direction = 1 125 | if self.bidirectional: 126 | self.num_of_direction = 2 127 | self.fc_hidden_dim = self.hidden_dim * 2 128 | self.dropout = dropout 129 | self.LG = LG 130 | self.embedding_dim_gru = self.embedding_dim 131 | self.embedding_dim_lstm = self.embedding_dim 132 | 133 | if self.LG: 134 | self.embedding_dim_gru = self.hidden_dim 135 | else: 136 | self.embedding_dim_lstm = self.hidden_dim 137 | 138 | self.embedding = nn.Embedding(self.input_dim, self.embedding_dim) 139 | 140 | self.lstm = nn.LSTM(input_size=self.embedding_dim_lstm, hidden_size=self.hidden_dim, num_layers=self.num_layers, 141 | bidirectional=self.bidirectional, dropout=self.dropout) 142 | 143 | self.gru = nn.GRU(input_size=self.embedding_dim_gru, hidden_size=self.hidden_dim, num_layers=self.num_layers, 144 | bidirectional=self.bidirectional, dropout=self.dropout) 145 | 146 | 
self.dropout_layer = nn.Dropout(self.dropout) 147 | 148 | self.fc = nn.Linear(self.fc_hidden_dim, self.output_dim) 149 | 150 | def forward(self, opcode): 151 | embedded = self.embedding(opcode) 152 | 153 | if self.LG: 154 | output, (hidden, cell) = self.lstm(embedded) 155 | output, hidden = self.gru(hidden) 156 | else: 157 | output, hidden = self.gru(embedded) 158 | output, (hidden, cell) = self.lstm(hidden) 159 | 160 | if self.bidirectional: 161 | hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1) 162 | else: 163 | hidden = hidden[-1, :, :] # already (batch, hidden_dim); squeezing dim 0 would drop a batch of size 1 164 | 165 | fc_in = self.dropout_layer(hidden) 166 | return self.fc(fc_in) 167 | -------------------------------------------------------------------------------- /models/Shallow_ML_models.py: -------------------------------------------------------------------------------- 1 | import xgboost as xgb 2 | 3 | def execute_shallow_model(model=None, x_train=None, y_train=None, x_test=None, model_params=None): 4 | if model_params['model_name'] == 'XGB': 5 | model.fit(x_train, y_train, eval_metric='error') 6 | else: 7 | model.fit(x_train, y_train) 8 | 9 | best_params = model.best_params_ 10 | best_estimator = model.best_estimator_ 11 | 12 | y_pred = best_estimator.predict(x_test) # test data 13 | x_pred = best_estimator.predict(x_train) # training data 14 | 15 | return x_pred, y_pred, best_estimator, best_params 16 | -------------------------------------------------------------------------------- /models/TransferLearnModels.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | from torchvision import models 3 | import torch.nn.functional as F 4 | 5 | 6 | class Resnet152_wrapper(nn.Module): 7 | def __init__(self, model_params): 8 | super(Resnet152_wrapper, self).__init__() 9 | self.model_params = model_params 10 | self.num_of_classes = model_params['num_of_classes'] 11 | self.resnet152 = models.resnet152(pretrained=True) 12 | self.params_to_update = [] 13
| 14 | for param in self.resnet152.parameters(): 15 | param.requires_grad = False 16 | 17 | for param in self.resnet152.layer4.parameters(): 18 | self.params_to_update.append(param) 19 | param.requires_grad = True 20 | 21 | self.fc_malware1 = nn.Linear(1000, 500) 22 | self.fc_malware2 = nn.Linear(500, self.num_of_classes) 23 | 24 | for param in self.fc_malware1.parameters(): 25 | self.params_to_update.append(param) 26 | param.requires_grad = True 27 | 28 | for param in self.fc_malware2.parameters(): 29 | self.params_to_update.append(param) 30 | param.requires_grad = True 31 | 32 | def parameters(self): 33 | return self.params_to_update 34 | 35 | def forward(self, x): 36 | x = self.resnet152(x) 37 | x = self.fc_malware1(x) 38 | x = self.fc_malware2(x) 39 | return F.log_softmax(x, dim=1) 40 | 41 | 42 | class VGG19_wrapper(nn.Module): 43 | def __init__(self, model_params): 44 | super(VGG19_wrapper, self).__init__() 45 | self.model_params = model_params 46 | self.num_of_classes = model_params['num_of_classes'] 47 | self.vgg19 = models.vgg19(pretrained=True) 48 | self.params_to_update = [] 49 | 50 | for param in self.vgg19.parameters(): 51 | param.requires_grad = False 52 | 53 | list_of_features_layers = [34, 35, 36] 54 | for f in list_of_features_layers: 55 | for param in self.vgg19.features[f].parameters(): 56 | self.params_to_update.append(param) 57 | param.requires_grad = True 58 | 59 | for param in self.vgg19.classifier.parameters(): 60 | self.params_to_update.append(param) 61 | param.requires_grad = True 62 | 63 | self.fc_malware1 = nn.Linear(1000, 500) 64 | self.fc_malware2 = nn.Linear(500, self.num_of_classes) 65 | 66 | for param in self.fc_malware1.parameters(): 67 | self.params_to_update.append(param) 68 | param.requires_grad = True 69 | 70 | for param in self.fc_malware2.parameters(): 71 | self.params_to_update.append(param) 72 | param.requires_grad = True 73 | 74 | def parameters(self): 75 | return self.params_to_update 76 | 77 | def forward(self, x): 78 | x 
= self.vgg19(x) 79 | x = self.fc_malware1(x) 80 | x = self.fc_malware2(x) 81 | return F.log_softmax(x, dim=1) 82 | -------------------------------------------------------------------------------- /models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pratikpv/malware_detect2/12f4390f1c4a7a3e8b6bc355cc06f08f4dd3126d/models/__init__.py -------------------------------------------------------------------------------- /models/model_trainers_testers.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from torch.utils.data import DataLoader 5 | from torchvision import datasets, transforms, models 6 | from torchvision.utils import make_grid 7 | from config import * 8 | import numpy as np 9 | from models.CnnModels import * 10 | from data_utils.data_loaders import * 11 | from utils import * 12 | from tqdm import tqdm 13 | 14 | 15 | def train_ann_model(model=None, model_params=None, criterion=None, 16 | train_loader=None, log_dir=None): 17 | epochs = model_params['epochs'] 18 | lr = model_params['lr'] 19 | 20 | optimizer = torch.optim.Adam(model.parameters(), lr=lr) 21 | 22 | train_losses = [] 23 | train_accuracy = [] 24 | 25 | tqdm_train_descr_format = "Training Feed-Forward model: Epoch Accuracy = {:02.4f}%, Loss = {:.8f}" 26 | tqdm_train_descr = tqdm_train_descr_format.format(0, float('inf')) 27 | tqdm_train_obj = tqdm(range(epochs), desc=tqdm_train_descr) 28 | 29 | model.train(True) 30 | 31 | for i in tqdm_train_obj: 32 | 33 | epoch_corr = 0 34 | epoch_loss = 0 35 | total_samples = 0 36 | 37 | for b, (X_train, y_train) in enumerate(train_loader): 38 | X_train = X_train.to(device) 39 | y_train = y_train.to(device) 40 | 41 | y_pred = model(X_train) 42 | loss = criterion(y_pred, y_train) 43 | 44 | predicted = torch.max(y_pred.data, 1)[1] 45 | batch_corr = (predicted == 
y_train).sum() 46 | epoch_corr += batch_corr.item() 47 | epoch_loss += loss.item() 48 | total_samples += y_pred.shape[0] 49 | 50 | # Update parameters 51 | optimizer.zero_grad() 52 | loss.backward() 53 | optimizer.step() 54 | 55 | epoch_accuracy = epoch_corr * 100 / total_samples 56 | epoch_loss = epoch_loss / total_samples 57 | 58 | train_losses.append(epoch_loss) 59 | train_accuracy.append(epoch_accuracy) 60 | 61 | tqdm_descr = tqdm_train_descr_format.format(epoch_accuracy, epoch_loss) 62 | tqdm_train_obj.set_description(tqdm_descr) 63 | 64 | return model, train_losses, train_accuracy 65 | 66 | 67 | def test_ann_model(model=None, model_params=None, criterion=None, 68 | val_loader=None): 69 | tqdm_test_descr_format = "Testing Feed-Forward model: Batch Accuracy = {:02.4f}%" 70 | tqdm_test_descr = tqdm_test_descr_format.format(0) 71 | tqdm_test_obj = tqdm(val_loader, desc=tqdm_test_descr) 72 | num_of_batches = len(val_loader) 73 | 74 | model.eval() 75 | 76 | total_test_loss = 0 77 | total_test_acc = 0 78 | predicted_all = torch.tensor([], dtype=torch.long, device=device) 79 | ground_truth_all = torch.tensor([], dtype=torch.long, device=device) 80 | 81 | with torch.no_grad(): 82 | for b, (X_test, y_test) in enumerate(tqdm_test_obj): 83 | X_test = X_test.to(device) 84 | y_test = y_test.to(device) 85 | 86 | predictions = model(X_test) 87 | loss = criterion(predictions, y_test) 88 | 89 | predicted = torch.max(predictions.data, 1)[1] 90 | batch_corr = (predicted == y_test).sum() 91 | batch_acc = batch_corr.item() * 100 / predictions.shape[0] 92 | total_test_acc += batch_acc 93 | total_test_loss += loss.item() 94 | 95 | predicted_all = torch.cat((predicted_all, predicted), 0) 96 | ground_truth_all = torch.cat((ground_truth_all, y_test), 0) 97 | 98 | tqdm_test_descr = tqdm_test_descr_format.format(batch_acc) 99 | tqdm_test_obj.set_description(tqdm_test_descr) 100 | 101 | predicted_all = predicted_all.cpu().numpy() 102 | ground_truth_all = ground_truth_all.cpu().numpy() 103 
| total_test_acc = total_test_acc / num_of_batches 104 | 105 | return total_test_acc, predicted_all, ground_truth_all 106 | 107 | 108 | def train_rnn_model(model=None, model_params=None, criterion=None, 109 | train_loader=None, log_dir=None): 110 | epochs = model_params['epochs'] 111 | lr = model_params['lr'] 112 | 113 | optimizer = torch.optim.Adam(model.parameters(), lr=lr) 114 | 115 | train_losses = [] 116 | train_accuracy = [] 117 | 118 | tqdm_train_descr_format = "Training RNN model: Epoch Accuracy = {:02.4f}%, Loss = {:.8f}" 119 | tqdm_train_descr = tqdm_train_descr_format.format(0, float('inf')) 120 | tqdm_train_obj = tqdm(range(epochs), desc=tqdm_train_descr) 121 | 122 | model.train(True) 123 | 124 | for i in tqdm_train_obj: 125 | 126 | epoch_corr = 0 127 | epoch_loss = 0 128 | total_samples = 0 129 | for b, batch in enumerate(train_loader): 130 | batch.text = batch.text.to(device) 131 | batch.label = batch.label.to(device) 132 | 133 | predictions = model(batch.text) 134 | loss = criterion(predictions, batch.label) 135 | 136 | predicted = torch.max(predictions.data, 1)[1] 137 | batch_corr = (predicted == batch.label).sum() 138 | epoch_corr += batch_corr.item() 139 | epoch_loss += loss.item() 140 | total_samples += predictions.shape[0] 141 | 142 | optimizer.zero_grad() 143 | loss.backward() 144 | optimizer.step() 145 | 146 | epoch_accuracy = epoch_corr * 100 / total_samples 147 | epoch_loss = epoch_loss / total_samples 148 | 149 | train_losses.append(epoch_loss) 150 | train_accuracy.append(epoch_accuracy) 151 | 152 | tqdm_descr = tqdm_train_descr_format.format(epoch_accuracy, epoch_loss) 153 | tqdm_train_obj.set_description(tqdm_descr) 154 | 155 | return model, train_losses, train_accuracy 156 | 157 | 158 | def test_rnn_model(model=None, model_params=None, criterion=None, 159 | val_loader=None): 160 | tqdm_test_descr_format = "Testing RNN model: Batch Accuracy = {:02.4f}%" 161 | tqdm_test_descr = tqdm_test_descr_format.format(0) 162 | tqdm_test_obj = 
tqdm(val_loader, desc=tqdm_test_descr) 163 | num_of_batches = len(val_loader) 164 | 165 | model.eval() 166 | 167 | total_test_loss = 0 168 | total_test_acc = 0 169 | predicted_all = torch.tensor([], dtype=torch.long, device=device) 170 | ground_truth_all = torch.tensor([], dtype=torch.long, device=device) 171 | 172 | with torch.no_grad(): 173 | for b, batch in enumerate(tqdm_test_obj): 174 | batch.text = batch.text.to(device) 175 | batch.label = batch.label.to(device) 176 | 177 | predictions = model(batch.text) 178 | loss = criterion(predictions, batch.label) 179 | 180 | predicted = torch.max(predictions.data, 1)[1] 181 | batch_corr = (predicted == batch.label).sum() 182 | batch_acc = batch_corr.item() * 100 / predictions.shape[0] 183 | total_test_acc += batch_acc 184 | total_test_loss += loss.item() 185 | 186 | predicted_all = torch.cat((predicted_all, predicted), 0) 187 | ground_truth_all = torch.cat((ground_truth_all, batch.label), 0) 188 | 189 | tqdm_test_descr = tqdm_test_descr_format.format(batch_acc) 190 | tqdm_test_obj.set_description(tqdm_test_descr) 191 | 192 | predicted_all = predicted_all.cpu().numpy() 193 | ground_truth_all = ground_truth_all.cpu().numpy() 194 | total_test_acc = total_test_acc / num_of_batches 195 | 196 | return total_test_acc, predicted_all, ground_truth_all 197 | -------------------------------------------------------------------------------- /models/models_utils.py: -------------------------------------------------------------------------------- 1 | from models.CnnModels import * 2 | from models.AnnModels import * 3 | from models.RnnModels import * 4 | from models.TransferLearnModels import * 5 | from data_utils.data_loaders import * 6 | from models.model_trainers_testers import * 7 | from sklearn.model_selection import GridSearchCV 8 | import xgboost as xgb 9 | from sklearn.ensemble import RandomForestClassifier 10 | from sklearn.neighbors import KNeighborsClassifier 11 | import itertools 12 | 13 | 14 | def 
get_deep_rnn_expr_list(print_grid=True, simple_list=True): 15 | """ 16 | sample template: 17 | { 18 | 'experiment_name': 'rnn_experiment_1', 19 | 'model_name': 'RNNMalware_Model1', 20 | 'batch_size': 512, 21 | 'embedding_dim': 256, 22 | 'hidden_dim': 128, 23 | 'epochs': 50, 24 | 'lr': 0.0001, 25 | 'num_layers': 1, 26 | 'bidirectional': False, 27 | 'dropout': 0, 28 | 'LG': False 29 | } 30 | 31 | supported models so far : 32 | (1) RNNMalware_Model1 : 33 | (2) LSTMMalware_Model1 : 34 | (3) GRUMalware_Model1 : 35 | (4) StackedMalware_Model1 : Stack of LSTM and GRU layers. 36 | 37 | if LG=True => LSTM at bottom and GRU on top, e.g. 38 | +---------+ 39 | | GRUs | 40 | ----------| 41 | | LSTMs | 42 | +---------+ 43 | else: 44 | +---------+ 45 | | LSTMs | 46 | ----------| 47 | | GRUs | 48 | +---------+ 49 | 50 | ########################################################### 51 | Grid Template for RNNMalware_Model1, LSTMMalware_Model1, GRUMalware_Model1 52 | { 'model_name': ['LSTMMalware_Model1', 'GRUMalware_Model1', 'RNNMalware_Model1'], 53 | 'batch_size': [128], 54 | 'embedding_dim': [256], 55 | 'hidden_dim': [256], 56 | 'epochs': [2], 57 | 'lr': [0.001], 58 | 'num_layers': [1, 3], 59 | 'bidirectional': [True, False], 60 | 'dropout': [0.3], 61 | 'opcode_len': [10] 62 | } 63 | ########################################################### 64 | Simple Template for StackedMalware_Model1 65 | { 66 | 'model_name': 'StackedMalware_Model1', 67 | 'experiment_name': 'rnn_experiment_1', 68 | 'batch_size': 128, 69 | 'embedding_dim': 8, 70 | 'hidden_dim': 8, 71 | 'epochs': 2, 72 | 'lr': 0.001, 73 | 'num_layers': 3, 74 | 'bidirectional': True, 75 | 'dropout': 0.3, 76 | 'LG': True, 77 | 'opcode_len': 500 78 | } 79 | ########################################################### 80 | """ 81 | 82 | get_deep_rnn_expr_grid = { 83 | 'model_name': ['StackedMalware_Model1'], 84 | 'batch_size': [128], 85 | 'embedding_dim': [256, 1024], 86 | 'hidden_dim': [256, 1024], 87 | 'epochs': [20], 88 | 'lr': 
[0.001], 89 | 'num_layers': [1, 3], 90 | 'bidirectional': [True, False], 91 | 'dropout': [0.3], 92 | 'opcode_len': [500], 93 | 'LG': [False] 94 | } 95 | 96 | simple_list = False # overrides the simple_list argument, so the grid above is always used 97 | if not simple_list and print_grid: 98 | print_line() 99 | print('Experiments Grid') 100 | print(get_deep_rnn_expr_grid) 101 | print_line() 102 | 103 | if simple_list: 104 | get_deep_rnn_expr_list = [ 105 | { 106 | 'model_name': 'StackedMalware_Model1', 107 | 'experiment_name': 'rnn_experiment_1', 108 | 'batch_size': 128, 109 | 'embedding_dim': 8, 110 | 'hidden_dim': 8, 111 | 'epochs': 2, 112 | 'lr': 0.001, 113 | 'num_layers': 3, 114 | 'bidirectional': True, 115 | 'dropout': 0.3, 116 | 'LG': True, 117 | 'opcode_len': 10 118 | } 119 | ] 120 | return get_deep_rnn_expr_list 121 | else: 122 | keys, values = zip(*get_deep_rnn_expr_grid.items()) 123 | permutations_dicts = [] 124 | count = 1 125 | for v in itertools.product(*values): 126 | temp_dict = dict(zip(keys, v)) 127 | temp_exp = 'rnn_experiment_' + str(count) 128 | temp_dict['experiment_name'] = temp_exp 129 | permutations_dicts.append(temp_dict) 130 | count += 1 131 | 132 | return permutations_dicts 133 | 134 | 135 | def get_deep_feedforward_expr_list(print_grid=True, simple_list=True): 136 | """ 137 | ########################################################### 138 | Template for ANNMalware_Model1 (same for ANNMalware_Model2, but change model_name) 139 | { 140 | 'model_name': 'ANNMalware_Model1', 141 | 'experiment_name': 'ann_experiment_1', 142 | 'batch_size': 128, 143 | 'image_dim': 256, 144 | 'epochs': 10, 145 | 'lr': 0.001, 146 | 'feature_type': FEATURE_TYPE_IMAGE 147 | } 148 | ########################################################### 149 | Template for CNNMalware_Model1 150 | { 151 | 'model_name': 'CNNMalware_Model1', 152 | 'experiment_name': 'cnn_experiment_1', 153 | 'batch_size': 256, 154 | 'image_dim': 256, 155 | 'epochs': 10, 156 | 'lr': 0.001, 157 | 'feature_type': FEATURE_TYPE_IMAGE 158 | } 159 |
########################################################### 160 | Template for CNNMalware_Model2 161 | { 162 | 'model_name': 'CNNMalware_Model2', 163 | 'experiment_name': 'cnn_experiment_1', 164 | 'batch_size': 512, 165 | 'image_dim': 0, 166 | 'epochs': 10, 167 | 'lr': 0.001, 168 | 'feature_type': FEATURE_TYPE_IMAGE 169 | } 170 | ########################################################### 171 | Template for CNNMalware_Model3 172 | { 173 | 'model_name': 'CNNMalware_Model3', 174 | 'experiment_name': 'cnn_experiment_1', 175 | 'batch_size': 512, 176 | 'image_dim': 0, 177 | 'epochs': 1, 178 | 'lr': 0.001, 179 | 'conv1d_image_dim_w': 1024 * 4, 180 | 'feature_type': FEATURE_TYPE_IMAGE 181 | } 182 | ########################################################### 183 | Template for CNNMalware_Model4 184 | { 185 | 'model_name': 'CNNMalware_Model4', 186 | 'experiment_name': 'cnn_experiment_1', 187 | 'batch_size': 512, 188 | 'image_dim': 0, 189 | 'epochs': 15, 190 | 'lr': 0.001, 191 | 'conv1d_image_dim_w': 1024 * 4, 192 | 'c1_out': 64, 193 | 'c1_kernel': 32, 194 | 'c1_padding': 2, 195 | 'c1_stride': 2, 196 | 'c2_out': 32, 197 | 'c2_kernel': 8, 198 | 'c2_padding': 2, 199 | 'c2_stride': 2, 200 | 'feature_type': FEATURE_TYPE_IMAGE 201 | } 202 | ########################################################### 203 | Template for CNNMalware_Model5 204 | { 205 | 'model_name': 'CNNMalware_Model5', 206 | 'experiment_name': 'cnn_experiment_1', 207 | 'batch_size': 512, 208 | 'image_dim': 256, 209 | 'epochs': 15, 210 | 'lr': 0.001, 211 | 'opcode_len': 500, 212 | 'embedding_dim': 128, 213 | 'hidden_dim': 128, 214 | 'n_filters': 3, 215 | 'filter_sizes': [3, 6], 216 | 'dropout': 0, 217 | 'feature_type': FEATURE_TYPE_OPCODE 218 | } 219 | ########################################################### 220 | """ 221 | simple_list = False 222 | get_deep_feedforward_expr_grid = { 223 | 'model_name': ['CNNMalware_Model5'], 224 | 'batch_size': [256], 225 | 'epochs': [20], 226 | 'lr': [0.001], 227 | 
'opcode_len': [5000], 228 | 'embedding_dim': [512], 229 | 'n_filters': [16, 32], 230 | 'filter_sizes': [[16, 24, 32], [32, 64, 128]], 231 | 'dropout': [0.3], 232 | 'feature_type': [FEATURE_TYPE_OPCODE] 233 | } 234 | 235 | if not simple_list and print_grid: 236 | print_line() 237 | print(f'Experiments Grid') 238 | print(get_deep_feedforward_expr_grid) 239 | print_line() 240 | 241 | if simple_list: 242 | get_deep_feedforward_expr_list = [ 243 | {'model_name': 'CNNMalware_Model2', 'batch_size': 256, 'image_dim': 256, 'epochs': 50, 'lr': 0.001, 244 | 'experiment_name': 'experiment_35', 'feature_type': FEATURE_TYPE_IMAGE}, 245 | {'model_name': 'CNNMalware_Model2', 'batch_size': 256, 'image_dim': 256, 'epochs': 50, 'lr': 0.0001, 246 | 'experiment_name': 'experiment_36', 'feature_type': FEATURE_TYPE_IMAGE}, 247 | {'model_name': 'CNNMalware_Model2', 'batch_size': 256, 'image_dim': 512, 'epochs': 50, 'lr': 0.001, 248 | 'experiment_name': 'experiment_37', 'feature_type': FEATURE_TYPE_IMAGE}, 249 | {'model_name': 'CNNMalware_Model2', 'batch_size': 256, 'image_dim': 512, 'epochs': 50, 'lr': 0.0001, 250 | 'experiment_name': 'experiment_38', 'feature_type': FEATURE_TYPE_IMAGE}, 251 | {'model_name': 'CNNMalware_Model2', 'batch_size': 64, 'image_dim': 1024, 'epochs': 20, 'lr': 0.001, 252 | 'experiment_name': 'experiment_39', 'feature_type': FEATURE_TYPE_IMAGE}, 253 | {'model_name': 'CNNMalware_Model2', 'batch_size': 64, 'image_dim': 1024, 'epochs': 20, 'lr': 0.0001, 254 | 'experiment_name': 'experiment_40', 'feature_type': FEATURE_TYPE_IMAGE} 255 | ] 256 | return get_deep_feedforward_expr_list 257 | else: 258 | keys, values = zip(*get_deep_feedforward_expr_grid.items()) 259 | permutations_dicts = [] 260 | count = 1 261 | for v in itertools.product(*values): 262 | temp_dict = dict(zip(keys, v)) 263 | temp_exp = 'experiment_' + str(count) 264 | temp_dict['experiment_name'] = temp_exp 265 | permutations_dicts.append(temp_dict) 266 | count += 1 267 | 268 | return permutations_dicts 
269 | 270 | 271 | def get_shallow_expr_list(): 272 | """list_shallow_expr = [ 273 | { 274 | 'experiment_name': 'XGB_experiment_1', 275 | 'model_name': 'XGB', 276 | 'param_grid': { 277 | 'max_depth': [5, 15, 20, 50, 100], 278 | 'learning_rate': [0.1, 0.01, 0.001], 279 | 'n_estimators': list(range(10, 500, 50)), 280 | } 281 | }, 282 | { 283 | 'experiment_name': 'RandomForest_experiment_1', 284 | 'model_name': 'RandomForest', 285 | 'param_grid': { 286 | 'criterion': ['gini', 'entropy'], 287 | 'n_estimators': [10, 40, 100, 500], 288 | } 289 | }, 290 | { 291 | 'experiment_name': 'Knn_experiment_1', 292 | 'model_name': 'Knn', 293 | 'param_grid': { 294 | 'n_neighbors': list(range(130, 190, 10)), 295 | 'p': [1, 2], 296 | } 297 | }, 298 | { 299 | 'experiment_name': 'Knn_experiment_2', 300 | 'model_name': 'Knn', 301 | 'param_grid': { 302 | 'n_neighbors': list(range(130, 190, 10)), 303 | 'p': [2], 304 | } 305 | } 306 | 307 | { 308 | 'experiment_name': 'Knn_experiment_1', 309 | 'model_name': 'Knn', 310 | 'param_grid': { 311 | 'n_neighbors': list(range(50, 160, 30)), 312 | 'p': [1, 2, 3], 313 | 'weights': ['uniform', 'distance'], 314 | 'algorithm': ['ball_tree', 'kd_tree', 'brute'] 315 | 316 | } 317 | } 318 | 319 | ]""" 320 | 321 | list_shallow_expr = [ 322 | { 323 | 'experiment_name': 'XGB_experiment_1', 324 | 'model_name': 'XGB', 325 | 'param_grid': { 326 | 'max_depth': [5, 15, 20, 50, 100], 327 | 'learning_rate': [0.1, 0.01, 0.001], 328 | 'n_estimators': list(range(10, 500, 50)), 329 | } 330 | }, 331 | { 332 | 'experiment_name': 'RandomForest_experiment_1', 333 | 'model_name': 'RandomForest', 334 | 'param_grid': { 335 | 'criterion': ['gini', 'entropy'], 336 | 'n_estimators': [10, 20, 30, 40, 100, 500], 337 | 'max_features': ['sqrt', 'log2'], 338 | } 339 | }, 340 | { 341 | 'experiment_name': 'Knn_experiment_1', 342 | 'model_name': 'Knn', 343 | 'param_grid': { 344 | 'n_neighbors': list(range(1, 170, 10)), 345 | 'p': [1, 2, 3], 346 | 'weights': ['uniform', 'distance'], 347 | 
'algorithm': ['ball_tree', 'kd_tree', 'brute'] 348 | } 349 | } 350 | ] 351 | 352 | return list_shallow_expr 353 | 354 | 355 | def get_conv_transfer_learning_expr_list(): 356 | list_tl_expr = [ 357 | { 358 | 'experiment_name': 'tl_experiment_1', 359 | 'model_name': 'vgg19', 360 | 'batch_size': 256, 361 | 'image_dim': 256, 362 | 'epochs': 20, 363 | 'lr': 0.0001 364 | } 365 | ] 366 | 367 | return list_tl_expr 368 | 369 | 370 | def get_malware_experiments_list(expr_type): 371 | expr_list = None 372 | if expr_type == DEEP_FF: 373 | expr_list = get_deep_feedforward_expr_list() 374 | if expr_type == DEEP_RNN: 375 | expr_list = get_deep_rnn_expr_list() 376 | if expr_type == SHALLOW_ML: 377 | expr_list = get_shallow_expr_list() 378 | 379 | if expr_list is None: 380 | raise Exception('Unknown experiment type') 381 | else: 382 | return expr_list 383 | 384 | 385 | def create_deep_image_model(model_params): 386 | image_dim = model_params['image_dim'] 387 | num_of_classes = model_params['num_of_classes'] 388 | model_name = model_params['model_name'] 389 | 390 | model = None 391 | if model_name == 'CNNMalware_Model1': 392 | if image_dim == 0: 393 | raise Exception("CNNMalware_Model1 needs image_dim != 0") 394 | model = CNNMalware_Model1(image_dim=image_dim, num_of_classes=num_of_classes).to(device) 395 | if model_name == 'CNNMalware_Model2': 396 | if image_dim == 0: 397 | raise Exception("CNNMalware_Model2 needs image_dim != 0") 398 | model = CNNMalware_Model2(image_dim=image_dim, num_of_classes=num_of_classes).to(device) 399 | 400 | if model_name == 'CNNMalware_Model3': 401 | conv1d_image_dim_w = model_params['conv1d_image_dim_w'] 402 | if image_dim != 0: 403 | raise Exception("CNNMalware_Model3 needs image_dim = 0") 404 | model = CNNMalware_Model3(image_dim_w=conv1d_image_dim_w, num_of_classes=num_of_classes).to(device) 405 | 406 | if model_name == 'CNNMalware_Model4': 407 | if image_dim != 0: 408 | raise Exception("CNNMalware_Model4 needs image_dim = 0") 409 | 410 | 
conv1d_image_dim_w = model_params['conv1d_image_dim_w'] 411 | c1_out = model_params['c1_out'] 412 | c1_kernel = model_params['c1_kernel'] 413 | c1_padding = model_params['c1_padding'] 414 | c1_stride = model_params['c1_stride'] 415 | 416 | c2_out = model_params['c2_out'] 417 | c2_kernel = model_params['c2_kernel'] 418 | c2_padding = model_params['c2_padding'] 419 | c2_stride = model_params['c2_stride'] 420 | 421 | model = CNNMalware_Model4(image_dim_w=conv1d_image_dim_w, num_of_classes=num_of_classes, 422 | c1_out=c1_out, c1_kernel=c1_kernel, c1_padding=c1_padding, c1_stride=c1_stride, 423 | c2_out=c2_out, c2_kernel=c2_kernel, c2_padding=c2_padding, c2_stride=c2_stride, 424 | ).to(device) 425 | 426 | if model_name == 'ANNMalware_Model1': 427 | model = ANNMalware_Model1(image_dim=image_dim, num_of_classes=num_of_classes).to(device) 428 | if model_name == 'ANNMalware_Model2': 429 | model = ANNMalware_Model2(image_dim=image_dim, num_of_classes=num_of_classes).to(device) 430 | 431 | if model is None: 432 | raise Exception("Unknown Image-based model name given") 433 | return model 434 | 435 | 436 | def get_pretrained_image_dim(model_name): 437 | if model_name == 'resnet152': 438 | return 224 439 | if model_name == 'vgg19': 440 | return 224 441 | raise Exception("Unknown model name given for tl") 442 | 443 | 444 | def create_deep_opcode_model(model_params): 445 | input_dim = model_params['input_dim'] 446 | output_dim = model_params['output_dim'] 447 | embedding_dim = model_params['embedding_dim'] 448 | batch_size = model_params['batch_size'] 449 | num_of_classes = model_params['num_of_classes'] 450 | model_name = model_params['model_name'] 451 | 452 | hidden_dim = 0 453 | if 'hidden_dim' in model_params.keys(): 454 | hidden_dim = model_params['hidden_dim'] 455 | 456 | if 'bidirectional' in model_params.keys(): 457 | bidirectional = model_params['bidirectional'] 458 | 459 | if 'dropout' in model_params.keys(): 460 | dropout = model_params['dropout'] 461 | 462 | if 
'num_layers' in model_params.keys(): 463 | num_layers = model_params['num_layers'] 464 | 465 | model = None 466 | if model_name == 'RNNMalware_Model1': 467 | model = RNNMalware_Model1(input_dim=input_dim, embedding_dim=embedding_dim, hidden_dim=hidden_dim, 468 | output_dim=output_dim, batch_size=batch_size, 469 | num_layers=num_layers, bidirectional=bidirectional, dropout=dropout) 470 | 471 | if model_name == 'LSTMMalware_Model1': 472 | model = LSTMMalware_Model1(input_dim=input_dim, embedding_dim=embedding_dim, hidden_dim=hidden_dim, 473 | output_dim=output_dim, batch_size=batch_size, 474 | num_layers=num_layers, bidirectional=bidirectional, dropout=dropout) 475 | 476 | if model_name == 'GRUMalware_Model1': 477 | model = GRUMalware_Model1(input_dim=input_dim, embedding_dim=embedding_dim, hidden_dim=hidden_dim, 478 | output_dim=output_dim, batch_size=batch_size, 479 | num_layers=num_layers, bidirectional=bidirectional, dropout=dropout) 480 | 481 | if model_name == 'StackedMalware_Model1': 482 | LG = model_params['LG'] 483 | model = StackedMalware_Model1(input_dim=input_dim, embedding_dim=embedding_dim, hidden_dim=hidden_dim, 484 | output_dim=output_dim, batch_size=batch_size, 485 | num_layers=num_layers, bidirectional=bidirectional, dropout=dropout, LG=LG) 486 | 487 | if model_name == 'CNNMalware_Model5': 488 | feature_type = model_params['feature_type'] 489 | if feature_type != FEATURE_TYPE_OPCODE: 490 | raise Exception("CNNMalware_Model5 needs Opcode features") 491 | 492 | input_dim = model_params['input_dim'] 493 | output_dim = model_params['output_dim'] 494 | embedding_dim = model_params['embedding_dim'] 495 | n_filters = model_params['n_filters'] 496 | filter_sizes = model_params['filter_sizes'] 497 | dropout = model_params['dropout'] 498 | 499 | model = CNNMalware_Model5(input_dim=input_dim, embedding_dim=embedding_dim, 500 | n_filters=n_filters, filter_sizes=filter_sizes, 501 | output_dim=output_dim, dropout=dropout 502 | ).to(device) 503 | if model is None: 
        raise Exception("Unknown Opcode-based model name given")

    return model


def create_conv_tl_model(model_params):
    """Build the transfer-learning CNN wrapper named by model_params['model_name']."""
    model = None
    model_name = model_params['model_name']
    print(f'Creating model {model_name}')
    if model_name == 'resnet152':
        model = Resnet152_wrapper(model_params)
    elif model_name == 'vgg19':
        model = VGG19_wrapper(model_params)
    if model is None:
        raise Exception("Unknown Model name given")
    print(model)
    return model


def create_shallow_model(model_params=None):
    """Return (estimator, GridSearchCV) for the classical ML model named in model_params."""
    model_name = model_params['model_name']
    param_grid = model_params['param_grid']
    model = None

    if model_name == 'XGB':
        model = xgb.XGBClassifier(use_label_encoder=False)
    elif model_name == 'RandomForest':
        model = RandomForestClassifier()
    elif model_name == 'Knn':
        model = KNeighborsClassifier()

    # Validate the name before building the grid search so an unknown name fails early.
    if model is None:
        raise Exception("Unknown Model name given")

    gsc = GridSearchCV(estimator=model, param_grid=param_grid,
                       cv=5, scoring='accuracy', verbose=1, n_jobs=-1, refit=True,
                       return_train_score=True)

    return model, gsc
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
from config import *
import json
import os
import pickle
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.metrics as metrics
import torch  # needed for torch.save() below
import torch.nn as nn
from sklearn.model_selection import GridSearchCV


def save_model_results_to_log(model=None, model_params=None,
                              train_losses=None, train_accuracy=None,
                              predicted=None, ground_truth=None, best_params=None,
                              misc_data=None, log_dir=None):
    print('Saving model results',
          end='')
    experiment_name = model_params['experiment_name']
    model_name = model_params['model_name']
    num_of_classes = model_params['num_of_classes']
    class_names = model_params['class_names']

    # All artifacts for one experiment go into <log_dir>/<experiment_name>/
    model_log_dir = os.path.join(log_dir, experiment_name)
    os.makedirs(model_log_dir, exist_ok=True)
    model_log_file = os.path.join(model_log_dir, MODEL_INFO_LOG)
    model_train_losses_log_file = os.path.join(model_log_dir, MODEL_LOSS_INFO_LOG)
    model_train_accuracy_log_file = os.path.join(model_log_dir, MODEL_ACC_INFO_LOG)
    model_save_path = os.path.join(model_log_dir, model_name + '.pt')
    model_conf_mat_csv = os.path.join(model_log_dir, MODEL_CONF_MATRIX_CSV)
    model_conf_mat_png = os.path.join(model_log_dir, MODEL_CONF_MATRIX_PNG)
    model_conf_mat_normalized_csv = os.path.join(model_log_dir, MODEL_CONF_MATRIX_NORMALIZED_CSV)
    model_conf_mat_normalized_png = os.path.join(model_log_dir, MODEL_CONF_MATRIX_NORMALIZED_PNG)

    model_loss_png = os.path.join(model_log_dir, MODEL_LOSS_PNG)
    model_accuracy_png = os.path.join(model_log_dir, MODEL_ACCURACY_PNG)

    grid_cv_filepath = os.path.join(model_log_dir, GRID_CV_EXPERIMENT_RESULTS)
    print('.', end='')
    # generate and save confusion matrix
    plot_x_label = "Predictions"
    plot_y_label = "Actual"
    cmap = plt.cm.Blues
    # Label the matrix with every class that occurs in the ground truth or the
    # predictions; confusion_matrix() uses the same sorted union by default, so
    # the names stay aligned even if some class is never predicted.
    pred_class_indexes = sorted(np.union1d(ground_truth, predicted))
    pred_num_classes = len(pred_class_indexes)
    target_class_names = [class_names[i] for i in pred_class_indexes]

    cm = metrics.confusion_matrix(ground_truth, predicted)

    print('.', end='')
    df_confusion = pd.DataFrame(cm)
    df_confusion.index = target_class_names
    df_confusion.columns = target_class_names
    # round() returns a new DataFrame, so keep the result
    df_confusion = df_confusion.round(2)
    df_confusion.to_csv(model_conf_mat_csv)
    fig = plt.figure(figsize=(20, 20))
    sns.heatmap(df_confusion, annot=True, cmap=cmap)
    plt.xlabel(plot_x_label)
    plt.ylabel(plot_y_label)
    plt.title('Confusion Matrix')
    plt.savefig(model_conf_mat_png)
    plt.close(fig)

    print('.', end='')
    # normalize='all' divides every cell by the total number of samples
    cm = metrics.confusion_matrix(ground_truth, predicted, normalize='all')
    df_confusion = pd.DataFrame(cm)
    df_confusion.index = target_class_names
    df_confusion.columns = target_class_names
    # round() returns a new DataFrame, so keep the result
    df_confusion = df_confusion.round(2)
    df_confusion.to_csv(model_conf_mat_normalized_csv)
    fig = plt.figure(figsize=(20, 20))
    sns.heatmap(df_confusion, annot=True, cmap=cmap)
    plt.xlabel(plot_x_label)
    plt.ylabel(plot_y_label)
    plt.title('Normalized Confusion Matrix')
    plt.savefig(model_conf_mat_normalized_png)
    plt.close(fig)

    if train_losses is not None:
        print('.', end='')
        fig = plt.figure(figsize=(8, 8))
        plt.plot(train_losses, label='Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Training Loss')
        plt.legend()
        plt.savefig(model_loss_png)
        plt.close(fig)

        print('.', end='')
        # save model training stats
        with open(model_train_losses_log_file, 'wb') as file:
            pickle.dump(train_losses, file)

    if train_accuracy is not None:
        print('.', end='')
        fig = plt.figure(figsize=(8, 8))
        plt.plot(train_accuracy, label='Accuracy')
        plt.xlabel('Epoch')
        plt.ylabel('Accuracy')
        plt.title('Training Accuracy')
        plt.legend()
        plt.savefig(model_accuracy_png)
        plt.close(fig)

        print('.', end='')
        with open(model_train_accuracy_log_file, 'wb') as file:
            pickle.dump(train_accuracy, file)

    print('.', end='')
    report = metrics.classification_report(ground_truth, predicted, target_names=list(target_class_names))

    # cv_results_ exists only on a fitted GridSearchCV object
    if isinstance(model, GridSearchCV):
        cv_df = pd.DataFrame(model.cv_results_)
        cv_df.to_csv(grid_cv_filepath)

    # save model arch and params
    with open(model_log_file, 'a') as file:
        file.write('-' * LINE_LEN + '\n')
        file.write('model architecture' + '\n')
        file.write('-' * LINE_LEN + '\n')
        file.write(str(model) + '\n')
        file.write('-' * LINE_LEN + '\n')
        file.write('model params' + '\n')
        file.write('-' * LINE_LEN + '\n')
        file.write(str(model_params) + '\n')
        file.write('-' * LINE_LEN + '\n')
        if isinstance(model, GridSearchCV):
            file.write('GridSearchCV results' + '\n')
            file.write(str(model.cv_results_) + '\n')
            file.write('-' * LINE_LEN + '\n')

        if misc_data:
            # str() so that non-string misc data does not raise a TypeError
            file.write('misc data: ' + str(misc_data) + '\n')
            file.write('-' * LINE_LEN + '\n')

        if best_params is not None:
            file.write('best params of the grid search' + '\n')
            file.write('-' * LINE_LEN + '\n')
            file.write(str(best_params) + '\n')
            file.write('-' * LINE_LEN + '\n')

        file.write('classification report' + '\n')
        file.write('-' * LINE_LEN + '\n')
        file.write(report + '\n')
        file.write('-' * LINE_LEN + '\n')

    print('.', end='')
    if isinstance(model, nn.Module):
        # save model as pytorch state dict
        torch.save(model.state_dict(), model_save_path)
    else:
        # pickle the scikit-learn object; the context manager closes the file
        with open(model_save_path, 'wb') as file:
            pickle.dump(model, file)

    print('Done')
    sys.stdout.flush()


def save_models_metadata_to_log(list_of_model_params, LOG_DIR, logfile=MODEL_META_INFO_LOG):
    """Append the parameters of every configured model to the meta-info log."""
    logfile = os.path.join(LOG_DIR, logfile)
    with open(logfile, 'a') as file:
        file.write('-' * LINE_LEN + '\n')
        for i in list_of_model_params:
            file.write(str(i) + '\n')
        file.write('-' * LINE_LEN + '\n')


def print_line(print_len=LINE_LEN):
    print('-' * print_len)
--------------------------------------------------------------------------------
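`save_model_results_to_log` writes the training-loss and training-accuracy histories with `pickle.dump`, so they can be read back with `pickle.load`. The following is a minimal sketch, not part of the repository: the helper `load_training_history` and the file name `train_losses.pkl` are hypothetical, since the real file name comes from `MODEL_LOSS_INFO_LOG` in `config.py`, which is not shown here.

```python
import os
import pickle
import tempfile


def load_training_history(log_dir, filename):
    """Read back a history list written with pickle.dump (hypothetical helper)."""
    with open(os.path.join(log_dir, filename), 'rb') as file:
        return pickle.load(file)


# Round-trip demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    losses = [0.9, 0.5, 0.3]  # made-up per-epoch values, for illustration only
    with open(os.path.join(tmp_dir, 'train_losses.pkl'), 'wb') as file:
        pickle.dump(losses, file)
    restored = load_training_history(tmp_dir, 'train_losses.pkl')
```

Only unpickle files you produced yourself: `pickle.load` can execute arbitrary code from an untrusted file.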