├── README.md ├── calls.zip ├── deep_learnin_lstm_malware_detection.ipynb ├── fig-1.png └── types.zip /README.md: -------------------------------------------------------------------------------- 1 | # Deep LSTM based Malware Analysis 2 | 3 | ## Deep learning based Sequential model for malware analysis using Windows exe API Calls 4 | 5 | Malware has become increasingly diverse in architecture and features. This growth in malware capabilities poses a severe threat and opens new research directions in malware detection. This study focuses on metamorphic malware, the most advanced member of the malware family. It is nearly impossible for anti-virus applications that rely on traditional signature-based methods to detect metamorphic malware, which in turn makes this type of malware difficult to classify. Recent research on malware detection and classification discusses this issue in terms of malware behavior. 6 | 7 | **Cite the Dataset** 8 | If you find this implementation useful, please cite it: 9 | 10 | @article{10.7717/peerj-cs.285, 11 | title = {Deep learning based Sequential model for malware analysis using Windows exe API Calls}, 12 | author = {Catak, Ferhat Ozgur and Yazı, Ahmet Faruk and Elezaj, Ogerta and Ahmed, Javed}, 13 | year = 2020, 14 | month = jul, 15 | keywords = {Malware analysis, Sequential models, Network security, Long-short-term memory, Malware dataset}, 16 | volume = 6, 17 | pages = {e285}, 18 | journal = {PeerJ Computer Science}, 19 | issn = {2376-5992}, 20 | url = {https://doi.org/10.7717/peerj-cs.285}, 21 | doi = {10.7717/peerj-cs.285} 22 | } 23 | 24 | You can access the dataset from [my GitHub repository](https://github.com/ocatak/lstm_malware_detection). 25 | 26 | ## Introduction 27 | 28 | Malicious software, commonly known as malware, is any software intentionally designed to cause damage to computer systems and compromise user security. 
An application or piece of code is considered malware if it secretly acts against the interests of the computer user and performs malicious activities. Malware targets various platforms such as servers, personal computers, mobile phones, and cameras to gain unauthorized access, steal personal data, and disrupt the normal functioning of the system. 29 | 30 | One approach to the malware protection problem is to identify the malicious software and evaluate its behavior. Usually, this problem is solved through the analysis of malware behavior. This analysis closely follows the malware family model, since a family also reflects a pattern of malicious behavior. Very few studies have demonstrated methods of classification according to malware families. 31 | 32 | The operating system API calls that a program makes show its overall direction. Examining these actions in depth reveals whether the program is malware and, if so, which malware family it belongs to. Each operating system API call made by the malware is a data attribute, and the sequence in which those API calls are generated is also critical for detecting the malware family: performing specific API calls in a particular order represents a behavior. LSTM (long short-term memory), a deep learning method, is commonly used to process such time-sequential data. 33 | 34 | ## System Architecture 35 | 36 | This research has two main objectives: first, we created a relevant dataset, and then, using this dataset, we carried out a comparative study of various machine learning methods to detect and classify malware automatically based on its type. 37 | 38 | ### Dataset Creation 39 | 40 | One of the most important contributions of this work is the new Windows PE Malware API sequence dataset, which contains malware analysis information. The dataset contains 7107 malware samples from different classes. 
The Cuckoo Sandbox application is used to obtain the Windows API call sequences of malicious software, and the VirusTotal service is used to identify the malware classes. 41 | 42 | The following figure illustrates the system architecture used to collect the data and to classify it using LSTM algorithms. 43 | ![Malware dataset](fig-1.png) 44 | 45 | Our system consists of three main parts: data collection, data pre-processing and analysis, and data classification. 46 | 47 | The dataset was created as follows. 48 | 49 | The Cuckoo Sandbox application was installed on a computer running the Ubuntu Linux distribution. The analysis machine ran as a virtual server on which malware was executed and analyzed. The Windows operating system was installed on this server. 50 | -------------------------------------------------------------------------------- /calls.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ocatak/lstm_malware_detection/5110304a56a3abc7d4ec551b529b321cc666fccb/calls.zip -------------------------------------------------------------------------------- /deep_learnin_lstm_malware_detection.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Deep LSTM based Malware Analysis" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Deep learning based Sequential model for malware analysis using Windows exe API Calls" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "Malware has become increasingly diverse in architecture and features. This growth in malware capabilities poses a severe threat and opens new research directions in malware detection. 
This study focuses on metamorphic malware, the most advanced member of the malware family. It is nearly impossible for anti-virus applications that rely on traditional signature-based methods to detect metamorphic malware, which in turn makes this type of malware difficult to classify. Recent research on malware detection and classification discusses this issue in terms of malware behavior." 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "**Cite the Dataset** \n", 29 | "If you find this implementation useful, please cite it:\n", 30 | "\n", 31 | " @article{catak_lstm2020,\n", 32 | " author = {Ferhat Ozgur Catak}, \n", 33 | " title = {Deep learning based Sequential model for malware analysis using Windows exe API Calls},\n", 34 | " journal = {PeerJ Computer Science},\n", 35 | " year = 2020,\n", 36 | " pages = {1-17},\n", 37 | " month = 7\n", 38 | " }\n", 39 | " \n", 40 | "You can access the dataset from [my GitHub repository](https://github.com/ocatak/lstm_malware_detection)." 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "## Introduction" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "Malicious software, commonly known as malware, is any software intentionally designed to cause damage to computer systems and compromise user security. An application or piece of code is considered malware if it secretly acts against the interests of the computer user and performs malicious activities. Malware targets various platforms such as servers, personal computers, mobile phones, and cameras to gain unauthorized access, steal personal data, and disrupt the normal functioning of the system. \n", 55 | "\n", 56 | "One approach to the malware protection problem is to identify the malicious software and evaluate its behavior. Usually, this problem is solved through the analysis of malware behavior. 
This analysis closely follows the malware family model, since a family also reflects a pattern of malicious behavior. Very few studies have demonstrated methods of classification according to malware families.\n", 57 | "\n", 58 | "The operating system API calls that a program makes show its overall direction. Examining these actions in depth reveals whether the program is malware and, if so, which malware family it belongs to. Each operating system API call made by the malware is a data attribute, and the sequence in which those API calls are generated is also critical for detecting the malware family: performing specific API calls in a particular order represents a behavior. LSTM (long short-term memory), a deep learning method, is commonly used to process such time-sequential data." 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "## System Architecture" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "This research has two main objectives: first, we created a relevant dataset, and then, using this dataset, we carried out a comparative study of various machine learning methods to detect and classify malware automatically based on its type." 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "### Dataset Creation" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "One of the most important contributions of this work is the new Windows PE Malware API sequence dataset, which contains malware analysis information. The dataset contains 7107 malware samples from different classes. 
The Cuckoo Sandbox application is used to obtain the Windows API call sequences of malicious software, and the VirusTotal service is used to identify the malware classes.\n", 87 | "\n", 88 | "The following figure illustrates the system architecture used to collect the data and to classify it using LSTM algorithms.\n", 89 | "![Malware dataset](fig-1.png)\n", 90 | "\n", 91 | "Our system consists of three main parts: data collection, data pre-processing and analysis, and data classification.\n", 92 | "\n", 93 | "The dataset was created as follows.\n", 94 | "\n", 95 | "The Cuckoo Sandbox application was installed on a computer running the Ubuntu Linux distribution. The analysis machine ran as a virtual server on which malware was executed and analyzed. The Windows operating system was installed on this server." 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "## Let’s code\n", 103 | "We import the usual libraries needed to build an LSTM model for malware detection." 
 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 1, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "import pandas as pd\n", 113 | "import matplotlib.pyplot as plt\n", 114 | "import seaborn as sns\n", 115 | "from sklearn.preprocessing import LabelEncoder\n", 116 | "from sklearn.model_selection import train_test_split\n", 117 | "from sklearn.metrics import confusion_matrix\n", 118 | "from keras.preprocessing.text import Tokenizer\n", 119 | "from keras.layers import LSTM, Dense, Dropout, Embedding\n", 120 | "from keras.preprocessing import sequence\n", 121 | "from keras.utils import np_utils\n", 122 | "from keras.models import Sequential\n", 123 | "from keras.layers import SpatialDropout1D\n", 124 | "from mlxtend.plotting import plot_confusion_matrix" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "In this work, we use our malware dataset to show the results. You can access the dataset from [my GitHub repository](https://github.com/ocatak/lstm_malware_detection). We need to merge the API call and label datasets." 
 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 2, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "malware_calls_df = pd.read_csv(\"calls.zip\", compression=\"zip\",\n", 141 | " sep=\"\\t\", names=[\"API_Calls\"])\n", 142 | "\n", 143 | "malware_labels_df = pd.read_csv(\"types.zip\", compression=\"zip\",\n", 144 | " sep=\"\\t\", names=[\"API_Labels\"])\n", 145 | "\n", 146 | "malware_calls_df[\"API_Labels\"] = malware_labels_df.API_Labels\n", 147 | "malware_calls_df[\"API_Calls\"] = malware_calls_df.API_Calls.apply(lambda x: \" \".join(x.split(\",\")))\n", 148 | "\n", 149 | "malware_calls_df[\"API_Labels\"] = malware_calls_df.API_Labels.apply(lambda x: 1 if x == \"Virus\" else 0)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "Let's analyze the class distribution." 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 15, 162 | "metadata": {}, 163 | "outputs": [ 164 | { 165 | "data": { 166 | "image/png": 
"(base64-encoded PNG omitted: count plot titled 'Class distribution', x-axis 'Labels')"
, 167 | "text/plain": [ 168 | "
" 169 | ] 170 | }, 171 | "metadata": { 172 | "needs_background": "light" 173 | }, 174 | "output_type": "display_data" 175 | } 176 | ], 177 | "source": [ 178 | "sns.countplot(x=malware_calls_df.API_Labels)\n", 179 | "plt.xlabel('Labels')\n", 180 | "plt.title('Class distribution')\n", 181 | "plt.savefig(\"class_distribution.png\")\n", 182 | "plt.show()" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "![Class distribution](class_distribution.png)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "Now we can create our sequence matrix. To build an LSTM model, we need a tokenization-based sequence matrix as the input dataset." 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 4, 202 | "metadata": {}, 203 | "outputs": [ 204 | { 205 | "name": "stdout", 206 | "output_type": "stream", 207 | "text": [ 208 | "Found 278 unique tokens.\n", 209 | "Shape of data tensor: (7107, 100)\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "max_words = 800\n", 215 | "max_len = 100\n", 216 | "\n", 217 | "X = malware_calls_df.API_Calls\n", 218 | "Y = malware_calls_df.API_Labels.astype('category').cat.codes\n", 219 | "\n", 220 | "tok = Tokenizer(num_words=max_words)\n", 221 | "tok.fit_on_texts(X)\n", 222 | "print('Found %s unique tokens.' 
% len(tok.word_index))\n", 223 | "X = tok.texts_to_sequences(X.values)\n", 224 | "X = sequence.pad_sequences(X, maxlen=max_len)\n", 225 | "print('Shape of data tensor:', X.shape)\n", 226 | "\n", 227 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y,\n", 228 | " test_size=0.15)\n", 229 | "\n", 230 | "le = LabelEncoder()\n", 231 | "Y_train_enc = le.fit_transform(Y_train)\n", 232 | "Y_train_enc = np_utils.to_categorical(Y_train_enc)\n", 233 | "\n", 234 | "Y_test_enc = le.transform(Y_test)\n", 235 | "Y_test_enc = np_utils.to_categorical(Y_test_enc)" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "An example LSTM-based classification model is given here:" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 5, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "def malware_model(act_func=\"softsign\"):\n", 252 | "    model = Sequential()\n", 253 | "    model.add(Embedding(max_words, 300, input_length=max_len))\n", 254 | "    model.add(SpatialDropout1D(0.1))\n", 255 | "    model.add(LSTM(32, dropout=0.1, recurrent_dropout=0.1,\n", 256 | "                   return_sequences=True, activation=act_func))\n", 257 | "    model.add(LSTM(32, dropout=0.1, activation=act_func, return_sequences=True))\n", 258 | "    model.add(LSTM(32, dropout=0.1, activation=act_func))\n", 259 | "    model.add(Dense(128, activation=act_func))\n", 260 | "    model.add(Dropout(0.1))\n", 261 | "    model.add(Dense(256, activation=act_func))\n", 262 | "    model.add(Dropout(0.1))\n", 263 | "    model.add(Dense(128, activation=act_func))\n", 264 | "    model.add(Dropout(0.1))\n", 265 | "    model.add(Dense(1, name='out_layer', activation=\"linear\"))\n", 266 | "    return model" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "The next step is to train the model. I have already trained and saved my model. Because of the dataset, the training stage takes a lot of time. 
To reduce the execution time, you can load my previously trained model from the GitHub repository." 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 6, 279 | "metadata": {}, 280 | "outputs": [ 281 | { 282 | "name": "stdout", 283 | "output_type": "stream", 284 | "text": [ 285 | "Model: \"sequential\"\n", 286 | "_________________________________________________________________\n", 287 | "Layer (type) Output Shape Param # \n", 288 | "=================================================================\n", 289 | "embedding (Embedding) (None, 100, 300) 240000 \n", 290 | "_________________________________________________________________\n", 291 | "spatial_dropout1d (SpatialDr (None, 100, 300) 0 \n", 292 | "_________________________________________________________________\n", 293 | "lstm (LSTM) (None, 100, 32) 42624 \n", 294 | "_________________________________________________________________\n", 295 | "lstm_1 (LSTM) (None, 100, 32) 8320 \n", 296 | "_________________________________________________________________\n", 297 | "lstm_2 (LSTM) (None, 32) 8320 \n", 298 | "_________________________________________________________________\n", 299 | "dense (Dense) (None, 128) 4224 \n", 300 | "_________________________________________________________________\n", 301 | "dropout (Dropout) (None, 128) 0 \n", 302 | "_________________________________________________________________\n", 303 | "dense_1 (Dense) (None, 256) 33024 \n", 304 | "_________________________________________________________________\n", 305 | "dropout_1 (Dropout) (None, 256) 0 \n", 306 | "_________________________________________________________________\n", 307 | "dense_2 (Dense) (None, 128) 32896 \n", 308 | "_________________________________________________________________\n", 309 | "dropout_2 (Dropout) (None, 128) 0 \n", 310 | "_________________________________________________________________\n", 311 | "out_layer (Dense) (None, 1) 129 \n", 312 | 
"=================================================================\n", 313 | "Total params: 369,537\n", 314 | "Trainable params: 369,537\n", 315 | "Non-trainable params: 0\n", 316 | "_________________________________________________________________\n", 317 | "None\n", 318 | "Epoch 1/10\n", 319 | "7/7 [==============================] - 22s 3s/step - loss: 0.0486 - accuracy: 0.9487 - val_loss: 0.0311 - val_accuracy: 0.9672\n", 320 | "Epoch 2/10\n", 321 | "7/7 [==============================] - 21s 3s/step - loss: 0.0378 - accuracy: 0.9591 - val_loss: 0.0302 - val_accuracy: 0.9672\n", 322 | "Epoch 3/10\n", 323 | "7/7 [==============================] - 21s 3s/step - loss: 0.0364 - accuracy: 0.9604 - val_loss: 0.0362 - val_accuracy: 0.9625\n", 324 | "Epoch 4/10\n", 325 | "7/7 [==============================] - 20s 3s/step - loss: 0.0378 - accuracy: 0.9593 - val_loss: 0.0328 - val_accuracy: 0.9616\n", 326 | "Epoch 5/10\n", 327 | "7/7 [==============================] - 22s 3s/step - loss: 0.0365 - accuracy: 0.9609 - val_loss: 0.0351 - val_accuracy: 0.9606\n", 328 | "Epoch 6/10\n", 329 | "7/7 [==============================] - 21s 3s/step - loss: 0.0369 - accuracy: 0.9601 - val_loss: 0.0369 - val_accuracy: 0.9606\n", 330 | "Epoch 7/10\n", 331 | "7/7 [==============================] - 22s 3s/step - loss: 0.0371 - accuracy: 0.9594 - val_loss: 0.0395 - val_accuracy: 0.9625\n", 332 | "Epoch 8/10\n", 333 | "7/7 [==============================] - 22s 3s/step - loss: 0.0378 - accuracy: 0.9601 - val_loss: 0.0365 - val_accuracy: 0.9588\n", 334 | "Epoch 9/10\n", 335 | "7/7 [==============================] - 22s 3s/step - loss: 0.0358 - accuracy: 0.9618 - val_loss: 0.0440 - val_accuracy: 0.9456\n", 336 | "Epoch 10/10\n", 337 | "7/7 [==============================] - 21s 3s/step - loss: 0.0373 - accuracy: 0.9589 - val_loss: 0.0354 - val_accuracy: 0.9644\n" 338 | ] 339 | } 340 | ], 341 | "source": [ 342 | "model = malware_model()\n", 343 | "print(model.summary())\n", 344 | 
"model.compile(loss='mse', optimizer=\"rmsprop\",\n", 345 | " metrics=['accuracy'])\n", 346 | "\n", 347 | "filepath = \"lstm-malware-model.hdf5\"\n", 348 | "model.load_weights(filepath)\n", 349 | "\n", 350 | "history = model.fit(X_train, Y_train, batch_size=1000, epochs=10,\n", 351 | " validation_data=(X_test, Y_test), verbose=1)" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "## Model Evaluation" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "We have now finished training the LSTM model. We can evaluate the model's classification performance using the confusion matrix. According to the confusion matrix, the model’s classification performance is quite good." 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": 12, 371 | "metadata": {}, 372 | "outputs": [ 373 | { 374 | "data": { 375 | "image/png": "(base64-encoded PNG omitted: figure output of the model evaluation cell)
vYu77+fuf0jlsXaFBlmZ7NOhNV+v2pDwPu3btiBvdfQ3eiSyjc3fFLFby6a0b9OSvDXftRRWrN1E+zbRSe/uztLl+fTs0qF2T6CeKSkpYdmXX7D3PvuwcuUKOu753dWQDh07VjpZPb5cVlYWzVu0YP369dXun5GRwX777c/cTz5J8RkFRIJdzzB2P0OndW4zCgq/TWqfyprg7pX/wD3uPsq6DYW0a9Mi6RjlO/n5+bRo+d0viooq/9lUXq6m/du0acuqVXX+knFCjHrcUgubouISGjdqkNQ+K9ZsouMeuQBkZmbQvFk2Gwq2sGLtJjrunlterkPblqxaV1C+3bhRA4q2lv6gPklcdnY2xcXFAHTo0JG85d+NMFqRl0e7du1/sE98ubKyMjYXFNCqVasa9y/eWkx2dnaqTiVw1FILiU2FRWRmZNCoYeIjYV58c175nc2RQ3rz5uzoahxTZy5kyBHdaJmTTcucbIYc0Y2pMxeW77f/Xm1ZuHRV7Z5APZObm0skEqG4uJihxx7HtGmvsXHjRjZu3Mi0aa8x9NjjfrDPicNP5vFHo3c2n33maY4eNBgzq3H/JYsXc0D3A3fZuaVbkG8UaOmhJE2btZB+vffjf+8tYtq/rqJLp91plt2IJa/8notunsS0dxdyw8Un8uGCr3nxzXk89NxMJt5yNp9OvomNm7cw5toHAdi4+Vv+9M9XmPHYrwH444RX2Lg52rVt2yqH4q0lrM7fnLbzDIshQ45l5jszGHzMEP7vuhsYcMShAFx3/Y20ahUdXvO7397IwYf0YfhJJ3Pu2HGMPXcMB3bbn9zcVjz6+JMAtGrVqsr916xZQ+PsbNq1a5eGM0yDNLbCEmGVXStIl4wmbb1R19PTHUa1DurakStGD2bcDdUP5dgZl581iM1binn4uXdTdozasnH23ekOoVoff/QRd93xVyY+/GjKjnHXHX+jefPmnDt2XMqOUVv69+3DBx/M2amU1KR9V+/8s3sTKjv3d0M+qOYZBSmhllqSPlmUx5tzFpORYSlbhntTYRGTXqzyuRKShF69e3P0wEFEIhEyMzNTcoyWLVty5ugxKak7qILcUlNS2wGPTJ6V0vofnZLa+uubc84bm9L6zz73vJTWH0RBnialpCYiyQn4NTUlNRFJSnTuZ3CzmpKaiCQtwDlNSU1Ekpeu2QKJUFITkeQEfD01JTURScr29dSCSklNRJJUT58mJSLhFeCcpqQmIkky3SgQkRDRODURCR0lNREJlQDnNCU1EUmeWmoiEh6a0C4iYWKk76EqiVBSE5GkZQS4qaYHr4hI0mrraVJmNszMFpnZEjO7tooyA83sYzObb2Zv1lSnWmoikhSrpQntZpYJ3AMMBfKA2WY2xd0XxJVpCdwLDHP3r82sbU31qqUmIknLsMReNTgMWOLuX7h7CfAkMKJCmTOBZ939awB3X1tjbMmfjojUd0k8ob21mc2Je10QV00HYHncdl7ss3hdgFwze8PMPjCzs2uKrcrup5kVAtsfl7Q953rsvbt785oqF5HwMaJ3QBOUX80j8iqrpOIj2rKAQ4BjgGzgXTOb5e6LqzpglUnN3XNqCFZE6qlaGtGRB+wZt90RWFlJmXx33wJsMbO3gIOAKpNaQt1PMxtgZufF3rc2s07JRC4iIWLR9dQSedVgNtDZzDqZWUPgDGBKhTKTgSPNLMvMmgB9gYXVVVrj3U8zuwnoA3QFHgQaAo8B/WvaV0TCqTaGqbl7mZldBrwKZAIT3X2+mV0U+/4+d19oZq8Ac4FtwAPu/ml19SYypONUoDfwYexAK81MXVOResqovcG37v4S8FKFz+6rsD0eGJ9onYkktRJ3dzNzADNrmmjlIhJOQZ4mlcg1tafM7H6gpZn9DJgG/DO1YYlIUCU6myBdM6lqbKm5++1mNhTYTHTMyI3uPjXlkYlIYAV57mei06TmER0j4rH3IlKPBTelJdD9NLPzgfeBkcBpwCwzG5vqwEQkuGpp
SEdKJNJS+xXQ293XA5jZbsBMYGIqAxORYIre/Ux3FFVLJKnlAYVx24V8f76WiNQnVkcXiTSzq2NvVwDvmdlkotfURhDtjopIPVVXn1GwfYDt0thru8mpC0dEgq7Odj/d/eZdGYiI1B11taUGgJm1AX4NHAg03v65uw9OYVwiEmDBTWmJzSh4HPgM6ATcDCwjOrteROohM8jMsIRe6ZBIUtvN3f8FlLr7m+4+Fjg8xXGJSIDV9XFqpbE/V5nZiUQXceuYupBEJOgCfEktoaR2i5m1AH4B/B1oDvw8pVGJSGAZVrfnfrr7C7G3BcCg1IYjIoGXxhU4ElHd4Nu/88OHIJRz9ytqO5gu+3bggX//vrarlRTqce3L6Q5BkpC3oqBW6qmrQzrm7LIoRKTOMCCzLiY1d394VwYiInVHnZxRICJSFSU1EQmN6FLdwc1qSmoikrQgt9QSWfm2i5lNN7NPY9s9zew3qQ9NRIIqyA9eSWSa1D+B/yM2s8Dd5xJ9krKI1EMGZJkl9EqHRLqfTdz9/Qp96LIUxSMidUCAL6kllNTyzWw/YgNxzew0YFVKoxKRwDKr49OkgEuBCUA3M1sBfAmMTmlUIhJoAc5pCc39/AIYYmZNgQx3L6xpHxEJtyDf/Uxk5dsbK2wD4O6/S1FMIhJgBmlbADIRiXQ/t8S9bwwMBxamJhwRCTyr4y01d/9L/LaZ3Q5MSVlEIhJ4FuCnFOzIjIImwL61HYiI1A119hF525nZPL5bVy0TaAPoeppIPVankxrRa2jblQFr3F2Db0XqsTo7od3MMoAX3b3HLopHRAIu+oi8dEdRtWpDc/dtwCdmttcuikdE6oCM2KyCml41MbNhZrbIzJaY2bXVlDvUzCKxGU3VSqT72Q6Yb2bvEze8w91PTmBfEQmZ2rpRYGaZwD3AUCAPmG1mU9x9QSXl/gy8mki9iSS1m5OMVURCrpYuqR0GLInNWsLMngRGAAsqlLsceAY4NJFKE0lqJ7j7NfEfmNmfgTcTOYCIhI2Rkfg4tdZmFv8QpwnuPiH2vgOwPO67PKDv945k1gE4FRhMgkktkct9Qyv57PhEKheR8DGSWiQy3937xL0mVKiqooqP5bwDuMbdI4nGV91zPy8GLgH2NbO5cV/lAO8kegARCRmDrNoZqJYH7Bm33RFYWaFMH+DJ2BCS1sAJZlbm7s9VVWl13c9JwMvAn4D4uxKF7r4hicBFJES2t9RqwWygs5l1AlYQXVH7zPgC7t6p/LhmDwEvVJfQoPrnfhYABcCoHY9ZRMKoNhaJdPcyM7uM6F3NTGCiu883s4ti39+3I/XqaVIikrTamlDg7i8BL1X4rNJk5u7nJlKnkpqIJMVI7A5juiipiUhyrHa6n6mipCYiSYnOKFBSE5EQCW5KU1ITkR0Q4IaakpqIJMvq7npqIiIV6e6niISObhSISHhYHV7OW0SkInU/RSR01FITkVAJbkpTUhORJBmQqZaaiIRJgHOakpqIJMuwAHdAldREJGlqqdUTW7cWc/lZwykp2UokUsbA405m3BX/B8DTj07g2cceIDMrkyOOPpZLfq0nD6bLn07/EYO7t2H9NyWccPsMAFpkN+DOMb3omJtN3sYirnj0IzYXldFzzxbccloPIPoP+a7XljD10zXpDD/tokM6gpvVlNRqUcOGjbjj4edo0rQZZaWlXHLm8Rx+1BC2FhczY/rLPPT82zRs2IiN69elO9R67dk5eTz2zleMH9Wz/LMLB+/Lu5+v5/7/fcGFg/blwsH7Mf7FRSxeXcipd84kss1pk9OIF37Rn9cXrCWyreJDj+oRC3ZLLchj6OocM6NJ02YAlJWVUlZWBmY898RERl9wJQ0bNgIgd7c26Qyz3pv9xUY2fVv6vc+GHNiWZ+esAODZOSsYemBbAIpLt5UnsEYNMvB6nMviZZgl9EoHtdRqWSQS4fyRg1jx9ZeceuY4DjyoD8uXLeWTOe8y4W+30LBRYy799e84oOfB
6Q5V4rTOacS6wq0ArCvcym7NGpV/d9BeLbj19B/RPjebXz4xt3630ti+SGS6o6iakloty8zM5MHJb1G4uYDrLx3DF4sXEImUUbi5gPufmsrCeR9y01Vj+ff0jwI9Klu+88nXBRx/+wz2a9uU287oyZufraOkbFu6w0qrIN/9VPczRXKat6B33/689/Z02uzenqOHDsfM6N7zECwjg00b16c7RImTX7iVNjnR1lmbnEas/2brD8osXbuFopIIXfZotqvDC5wkntC+yymp1aKNG/Ip3FwAwNbiIubMfJO99u3CkUNO5INZbwHw9ZdLKCstoWXubukMVSqYvmAtI/t0AGBknw5Mm78WgI6tssmM9bXa5zamU5umrNhQlLY4g8IS/C8dUtb9NLOJwHBgrbv3SNVxgmT92jX88dpLiEQiuG9j0LBT6D/oOEpLSvjTdZdz9vB+ZDVoyHW33quuZxr97ayD6LtfK3KbNmTGbwZx52ufc//rX3DXmF785LCOrNxUxOWPfAxAn31yuXDwvpRGHHfnpmfns7HCTYb6JujX1MxTdDvHzI4CvgEeSTSpdevR2x949vWUxCOpce6EWekOQZKQ99gVFK/+fKdSUrcfJf7v9MgurT5w9z47c7xkpaz76e5vARtSVb+IpI8l+EqHtN/9NLMLgAsAdm/fMc3RiEhNgv7cz7TfKHD3Ce7ex937tMxtne5wRCQBaqmJSLgEt6GW/pZaXbO1uIjLRg8nEonw8n+fYNSxfRh1bB9e/u8TlZZ/8sF7GH3C4Zxz0gCuPOcUVq9YXv7dmpV5XD12JKOP78voEw5nVd7XANz083EsX7Z0l5xP2DXKymDSxX3JMDi1TwemXXMU0645ilNjwzcquv7kbkz5eX+m/Lw/U685ig9/P+R73zdrlMWMGwZx06ndyz+746yD2Lt1k5SeR9DUy2lSZvYEMBBobWZ5wE3u/q9UHW9XefGZxzl66HC2FG7mwbtv44FnXsfMGDdyEAMGH09Oi5bfK9/lgJ488MzrNM5uwn8nTeQf42/i5jsmAnDLNRdz9kVXc2j/QXy75RsyMqK/Y04ZNZZJD9zFNbfcucvPL2x+clhHXp23mpzGDbh86P6cesdMHOe5q/ozff4aNheVfa/8H6Z8Vv5+TP+96d6h+fe+v2pYZ95f+v37X5PeXc4FA/fl+qc/Td2JBEyAG2opvfs5yt3buXsDd+8YhoQGMPX5/zDgmBN4f8brHNp/IM1b5pLToiWH9h/Ie29P/0H5gw8/ksbZ0d/iB/bqw9rVKwH4cslnRMrKOLT/IACaNG1WXu6gPkfwwcw3oxPiZaecfHB7ps1fy5FdW/PO4nwKikrZXFTGO4vzOapr9QsLnNS7HS98tLJ8+8AOzWndrCEzFud/r9zsLzfQr8tu5YN064UAX1RT9zMJpSUlrFz+Fe067sW6NStpu8d3XZg2u7dn3ZqV1ewNLz79GIcfFe3OLF+2lGbNW3D9ZWcz9pSjuefPNxKJRADIyMigw96dWPpZ/fnNnwoNMo09WzVhxcYidm/RmFWbisu/W11QzO4tGle5b/vcxnRslc27S6LT2czgupO78ecXFv2grDt8lf8t3drl1P5JBFA0XwV3RoGSWhIKNq6nWU4LACobtFzdLIFXJz/FZ59+xKjzLwcgUlbG3Dnvcuk1v2PC09NZlbeMl5+dVF4+t1Ub8teuruUzqF9ymzZkc3F09H/lP5qqB54P79WeV+auZvuCHKP77cUbC9exqqC40vLrvympNkmGSoLzPjX3sw5o1DibkpLoX+q2e3Rg7eoV5d+tW7OS1m3bVbrfnJlv8Oh9f+HWf0wqX1Ot7R7t6dy9J+333IesrCwGHHMiixfMLd+npKSYRo3ryT+SFCkujdAoK/pXfPWmYtq1/O7/5x4tGrOm4IeT1rcb3qsdz3+0qny71965jOm/N29cdzTXntSNUw/pwK9O6FL+faOsDIpLIyk4i2Cqrd6nmQ0zs0VmtsTM
rq3k+7PMbG7sNdPMDqqpTg3pSEJOi5Zsi0TYurWYwwYMZsJff09hwSYAZs/4HxdefeMP9lm8YC7jb7ya2x/4z/cWh+z2o4MpLNjExg355LZqzYfvvUW3Hr3Lv1++bCn77N8t9ScVYpuLysjMMBpmZfD2onx+cUIXmmdH/8oP6Nqa219aXOl+ndo0pXl2Fh99tan8s19M+qT8/cg+HfjRni0YH7d/pzZN+Xz1Nyk6k6CxWpm7bGaZwD3AUCAPmG1mU9x9QVyxL4Gj3X2jmR0PTAD6VlevklqSDu0/iHkfzKJPv4Gcc8kv+dlpxwBwzqW/onnLXAAeuPOPdOvRmwHHHM+9t91E0bdbuPHK8wDYvV1Hbr1vEpmZmVx6ze+46pxTAKfLgb046SdnA7Ahfy2NGmXTuu0eaTnHMJmxKJ8+nXKZ+fl67pm6lP9e2Q+Au6cuoaAo2jW98rjOfLq8gOkLoitznNS7HS9+vKrKOivarVlDiksj5YtM1ge11LU8DFji7l9E67QngRFAeVJz95lx5WcBNU47StmE9h1RFya0L14wl38/eC83jL8vZcf490P30rRpDsN/MiZlx6gtQZ/Q3r19c8YevQ+/fGJuzYV30HlH7sM3W8v4z/t5KTtGbamNCe0H9jzYJ73wZkJle+3dvMoJ7WZ2GjDM3c+PbY8B+rr7ZVWU/yXQbXv5qqillqQu3XtycN8BRCIRMjMzU3KMZjktOG7ET1NSd32zYOVmZi3ZQIZBqlbh3lxcynMfVH/nO3QST4utzWxO3PYEd59QTS2V/pTMbBAwDhhQ0wGV1HbAiaeNTm39Pz4rpfXXN0/PTm0L6pnZK2ouFDJJDNfIr2bpoTxgz7jtjsAPfjuYWU/gAeB4d69xyWjd/RSRpNXSkI7ZQGcz62RmDYEzgCnfP47tBTwLjHH3yu/sVKCWmogkp5bGoLl7mZldBrwKZAIT3X2+mV0U+/4+4EZgN+De2B3XspoWnVRSE5Gk1dZsAXd/CXipwmf3xb0/H6j2xkBFSmoikhQj2E9oV1ITkaQFOKcpqYnIDghwVlNSE5GkBfkZBUpqIpK04KY0JTUR2REBzmpKaiKSlO2LRAaVkpqIJCeNC0AmQklNRJIW4JympCYiyaqdRSJTRUlNRJIW4JympCYiyUnj0+8SoqQmIskLcFZTUhORpGlIh4iEiq6piUh4GGQoqYlIuAQ3qympiUhStEikiIROgHOakpqIJE8tNREJFU2TEpFQCW5KU1ITkSQl+KDitFFSE5GkaUaBiIRLcHOakpqIJC/AOU1JTUSSZXpEnoiER9BnFGSkOwARkdqklpqIJC3ILTUlNRFJmoZ0iEh4aPCtiIRJ0G8UKKmJSNLU/RSRUFFLTURCJcA5TUlNRHZAgLOakpqIJMUg0NOkzN3THUM5M1sHfJXuOFKgNZCf7iAkKWH9me3t7m12pgIze4Xo/59E5Lv7sJ05XrICldTCyszmuHufdMchidPPrO7S3E8RCRUlNREJFSW1XWNCugOQpOlnVkfpmpqIhIpaaiISKkpqIhIqSmopZGbDzGyRmS0xs2vTHY/UzMwmmtlaM/s03bHIjlFSSxEzywTuAY4HugOjzKx7eqOSBDwE7NLBolK7lNRS5zBgibt/4e4lwJPAiDTHJDVw97eADemOQ3acklrqdACWx23nxT4TkRRSUkudymb8avyMSIopqaVOHrBn3HZHYGWaYhGpN5TUUmc20NnMOplZQ+AMYEqaYxIJPSW1FHH3MuAy4FVgIfCUu89Pb1RSEzN7AngX6GpmeWY2Lt0xSXI0TUpEQkUtNREJFSU1EQkVJTURCRUlNREJFSU1EQkVJbV6yswGmtkLsfcnV7eKiJm1NLNLduAYvzWzXyb6eYUyD5nZaUkcax+trCGgpBY6sdVBkuLuU9z91mqKtASSTmoi6aCkVkfEWiKfmdnDZjbXzJ42syax75aZ2Y1mNgP4iZkda2bvmtmHZvYfM2sWKzcsVscM
YGRc3eea2d2x97ub2X/N7JPYqx9wK7CfmX1sZuNj5X5lZrNjsdwcV9f1sTXkpgFdEzivn8Xq+WxGISQAAAJ9SURBVMTMntl+TjFDzOxtM1tsZsNj5TPNbHzcsS/c2f+3Ei5KanVLV2CCu/cENvP91lOxuw8ApgG/AYa4+8HAHOBqM2sM/BM4CTgS2KOKY9wFvOnuBwEHA/OBa4Gl7t7L3X9lZscCnYkur9QLOMTMjjKzQ4hOB+tNNGkemsA5Pevuh8aOtxCIH8G/D3A0cCJwX+wcxgEF7n5orP6fmVmnBI4j9URWugOQpCx393di7x8DrgBuj23/O/bn4UQXpXzHzAAaEp320w340t0/BzCzx4ALKjnGYOBsAHePAAVmlluhzLGx10ex7WZEk1wO8F93/zZ2jETmuvYws1uIdnGbEZ1Wtt1T7r4N+NzMvoidw7FAz7jrbS1ix16cwLGkHlBSq1sqzmmL394S+9OAqe4+Kr6gmfWqZP8dZcCf3P3+Cse4ageO8RBwirt/YmbnAgPjvqvsfA243N3jkx9mtk+Sx5WQUvezbtnLzI6IvR8FzKikzCygv5ntD2BmTcysC/AZ0MnM9ovbvzLTgYtj+2aaWXOgkGgrbLtXgbFx1+o6mFlb4C3gVDPLNrMcol3dmuQAq8ysAXBWhe9+YmYZsZj3BRbFjn1xrDxm1sXMmiZwHKknlNTqloXAOWY2F2gF/KNiAXdfB5wLPBErNwvo5u7FRLubL8ZuFHxVxTGuBAaZ2TzgA+BAd19PtDv7qZmNd/fXgEnAu7FyTwM57v4h0W7wx8AzwNsJnNMNwHvAVKKJN94i4E3gZeCi2Dk8ACwAPowN4bgf9TgkjlbpqCNi3asX3L1HmkMRCTS11EQkVNRSE5FQUUtNREJFSU1EQkVJTURCRUlNREJFSU1EQuX/AYRZ60jWv/ijAAAAAElFTkSuQmCC\n", 376 | "text/plain": [ 377 | "
" 378 | ] 379 | }, 380 | "metadata": { 381 | "needs_background": "light" 382 | }, 383 | "output_type": "display_data" 384 | } 385 | ], 386 | "source": [ 387 | "y_test_pred = model.predict_classes(X_test)\n", 388 | "cm = confusion_matrix(Y_test, y_test_pred)\n", 389 | "\n", 390 | "plot_confusion_matrix(conf_mat=cm,\n", 391 | " show_absolute=True,\n", 392 | " show_normed=True,\n", 393 | " colorbar=True)\n", 394 | "plt.savefig(\"confusion_matrix.png\")\n", 395 | "plt.show()" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "![Confusion matrix](confusion_matrix.png)\n", 403 | "\n", 404 | "Let's continue with the training history of our model." 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 14, 410 | "metadata": {}, 411 | "outputs": [ 412 | { 413 | "data": { 414 | "image/png": "\n", 415 | "text/plain": [ 416 | "
" 417 | ] 418 | }, 419 | "metadata": { 420 | "needs_background": "light" 421 | }, 422 | "output_type": "display_data" 423 | }, 424 | { 425 | "data": { 426 | "image/png": "\n", 427 | "text/plain": [ 428 | "
" 429 | ] 430 | }, 431 | "metadata": { 432 | "needs_background": "light" 433 | }, 434 | "output_type": "display_data" 435 | } 436 | ], 437 | "source": [ 438 | "plt.plot(history.history['accuracy'])\n", 439 | "plt.plot(history.history['val_accuracy'])\n", 440 | "plt.title('model accuracy')\n", 441 | "plt.ylabel('accuracy')\n", 442 | "plt.xlabel('epoch')\n", 443 | "plt.legend(['train', 'test'], loc='upper left')\n", 444 | "plt.grid()\n", 445 | "plt.savefig(\"accuracy.png\")\n", 446 | "plt.show()\n", 447 | "\n", 448 | "plt.plot(history.history['loss'])\n", 449 | "plt.plot(history.history['val_loss'])\n", 450 | "plt.title('model loss')\n", 451 | "plt.ylabel('loss')\n", 452 | "plt.xlabel('epoch')\n", 453 | "plt.legend(['train', 'test'], loc='upper left')\n", 454 | "plt.grid()\n", 455 | "plt.savefig(\"loss.png\")\n", 456 | "plt.show()" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "![Accuracy history](accuracy.png)\n", 464 | "![Loss history](loss.png)" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "metadata": {}, 470 | "source": [ 471 | "## Conclusion" 472 | ] 473 | }, 474 | { 475 | "cell_type": "markdown", 476 | "metadata": {}, 477 | "source": [ 478 | "The purpose of this study was to create an LSTM based malware detection model using my previous malware dataset. Although our dataset contains instances that belong to some malware families with unbalanced distribution, we have shown that this problem does not affect classification performance." 
479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [] 487 | } 488 | ], 489 | "metadata": { 490 | "kernelspec": { 491 | "display_name": "Python 3", 492 | "language": "python", 493 | "name": "python3" 494 | }, 495 | "language_info": { 496 | "codemirror_mode": { 497 | "name": "ipython", 498 | "version": 3 499 | }, 500 | "file_extension": ".py", 501 | "mimetype": "text/x-python", 502 | "name": "python", 503 | "nbconvert_exporter": "python", 504 | "pygments_lexer": "ipython3", 505 | "version": "3.7.6" 506 | } 507 | }, 508 | "nbformat": 4, 509 | "nbformat_minor": 4 510 | } 511 | -------------------------------------------------------------------------------- /fig-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ocatak/lstm_malware_detection/5110304a56a3abc7d4ec551b529b321cc666fccb/fig-1.png -------------------------------------------------------------------------------- /types.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ocatak/lstm_malware_detection/5110304a56a3abc7d4ec551b529b321cc666fccb/types.zip --------------------------------------------------------------------------------