├── Data
    ├── Oversampled
    │   ├── Class Distribution.png
    │   └── dt.png
    ├── Undersampled
    │   ├── Class Distribution.png
    │   └── dt.png
    └── Unsampled
    │   ├── Class Distribution.png
    │   ├── dt.png
    │   └── pca_dt.png
├── Dataset
    └── Android_Permission.csv
├── Group_7_Presentation.pdf
├── Group_7_Report.pdf
├── Plots
    ├── 2.png
    ├── 3.png
    ├── 4.png
    ├── 5.png
    ├── c1.png
    ├── c2.png
    ├── c3.png
    ├── c4.png
    ├── e1.png
    ├── e2.png
    ├── e3.png
    ├── e4.png
    ├── e5.png
    ├── o1.png
    ├── o2.png
    ├── o3.png
    ├── p2.png
    └── t1.png
├── README.md
├── Results
    ├── t2.png
    ├── t3.png
    ├── t4.png
    └── t5.png
└── code.ipynb

/Data/Oversampled/Class Distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Data/Oversampled/Class Distribution.png -------------------------------------------------------------------------------- /Data/Oversampled/dt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Data/Oversampled/dt.png -------------------------------------------------------------------------------- /Data/Undersampled/Class Distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Data/Undersampled/Class Distribution.png -------------------------------------------------------------------------------- /Data/Undersampled/dt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Data/Undersampled/dt.png
-------------------------------------------------------------------------------- /Data/Unsampled/Class Distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Data/Unsampled/Class Distribution.png -------------------------------------------------------------------------------- /Data/Unsampled/dt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Data/Unsampled/dt.png -------------------------------------------------------------------------------- /Data/Unsampled/pca_dt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Data/Unsampled/pca_dt.png -------------------------------------------------------------------------------- /Group_7_Presentation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Group_7_Presentation.pdf -------------------------------------------------------------------------------- /Group_7_Report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Group_7_Report.pdf -------------------------------------------------------------------------------- /Plots/2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/2.png -------------------------------------------------------------------------------- /Plots/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/3.png -------------------------------------------------------------------------------- /Plots/4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/4.png -------------------------------------------------------------------------------- /Plots/5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/5.png -------------------------------------------------------------------------------- /Plots/c1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/c1.png -------------------------------------------------------------------------------- /Plots/c2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/c2.png -------------------------------------------------------------------------------- /Plots/c3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/c3.png -------------------------------------------------------------------------------- /Plots/c4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/c4.png -------------------------------------------------------------------------------- /Plots/e1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/e1.png -------------------------------------------------------------------------------- /Plots/e2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/e2.png -------------------------------------------------------------------------------- /Plots/e3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/e3.png -------------------------------------------------------------------------------- /Plots/e4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/e4.png -------------------------------------------------------------------------------- /Plots/e5.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/e5.png -------------------------------------------------------------------------------- /Plots/o1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/o1.png -------------------------------------------------------------------------------- /Plots/o2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/o2.png -------------------------------------------------------------------------------- /Plots/o3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/o3.png -------------------------------------------------------------------------------- /Plots/p2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/p2.png -------------------------------------------------------------------------------- /Plots/t1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Plots/t1.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 
Android Malware Detection System Using Machine Learning
2 |
3 | > ## Purpose:
4 | Project at [IIITD](https://www.iiitd.ac.in/)
5 | under the course [CSE343 : Machine Learning](http://techtree.iiitd.edu.in/viewDescription/filename?=ECE363 "Course Description") under the guidance of Professor [Anubha Gupta](https://www.iiitd.ac.in/anubha "Profile")
6 |
7 | > ## Contributors:
8 | - [Bijendar Prasad](https://Findcoding "GitHub Profile")
9 |
10 | > ## Motivation:
11 |
12 | As the Android market continues to expand, so does the prevalence of malicious apps. According to [ZDNet](https://www.zdnet.com/article/play-store-identified-as-main-distribution-vector-for-most-android-malware), as many as 10%-24% of apps available on the Play Store could be malicious. These apps may appear innocuous at first glance, but they can harm a user’s system in a variety of ways. Unfortunately, current methods for detecting malware are both resource-intensive and exhaustive, and they struggle to keep up with the rapid pace at which new malware is being developed.
13 |
14 | **What can help us overcome these challenges?**
15 | - Developing a comprehensive strategy to assess and analyze data from confirmed malicious applications.
16 | - Creating a model that can accurately predict the presence of malicious applications based on their permissions.
17 | - Introducing a machine learning-based malware detection model that utilizes publicly available metadata information. This model will be evaluated to determine its effectiveness as a first-stage filter for detecting Android malware.
18 |
19 |
20 | > ## Introduction:
21 |
22 | Despite the growing threat of malware, there is still no reliable and robust method for detecting malicious applications. However,
23 | with the increasing use of machine learning in various fields, we believe that this issue can be addressed through the application
24 | of machine learning techniques.
Our project aims to conduct a thorough and systematic investigation into the use of machine
25 | learning for malware detection, with the ultimate goal of developing an efficient ML model capable of accurately classifying
26 | apps as either **benign (0)** or **malware (1)** based on their requested permissions.
27 | This study proposes:
28 | - Conducting an in-depth examination and evaluation of Android metadata and permissions as predictors of malware.
29 | - Introducing a machine learning-based malware detection strategy that utilizes publicly available metadata information.
30 | - Analyzing the effectiveness of this model and assessing its potential as a first-stage filter for detecting Android malware.
31 |
32 |
33 |
34 | > ## Dataset Description:
35 | - The dataset is taken from [Kaggle](https://www.kaggle.com/saurabhshahane/android-permission-dataset/).
36 | - The data contains the permission details of almost 30,000 apps.
37 | - There are 183 features in the dataset, such as Dangerous Permissions Count, Default : Access DRM content, Default : Move application resource, etc.
38 | - There is one binary target column (0/1) named ‘Class’, indicating benign (0) and malware (1) applications.
39 | - There are 29,999 records: 20,000 malware apps and 9,999 benign apps.
40 |
41 | **Preprocessing, Visualization and Analysis:**
42 | The data is first imported from a CSV file and loaded into a dataframe for ease of
43 | use. The necessary attributes are then extracted from the dataset. To gain a better understanding of the data, several plots are
44 | generated. The data is checked for null or missing values, and any such values are replaced with the mean of the corresponding
45 | column. The distribution of malware and benign applications across various settings is then analyzed, and the results are
46 | visualized through a series of plots created using **Matplotlib** and **Seaborn**.
47 |
48 |
49 | > ## Plots:
50 |
51 |
52 | Unsampled Class Distribution 53 | Undersampled Class Distribution 54 | Oversampled Class Distribution 55 |
56 |
57 |
58 | Columns Name vs Missing Values 59 |
60 | 61 |
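The preprocessing described under Dataset Description (load the CSV, replace missing values with the column mean, then look at the class distribution) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the notebook's actual code: the in-memory sample and the column names `perm_a` and `perm_b` are hypothetical, and only the `Class` column name comes from the README.

```python
import csv
import io
from collections import Counter
from statistics import mean

# Tiny in-memory stand-in for Dataset/Android_Permission.csv.
# "perm_a" and "perm_b" are hypothetical feature names; "Class" is the README's target column.
SAMPLE_CSV = """perm_a,perm_b,Class
1,0,1
,1,1
0,,0
1,1,1
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, turning empty cells into None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: (float(value) if value != "" else None) for name, value in row.items()}
        for row in reader
    ]


def impute_with_column_mean(rows, target="Class"):
    """Replace missing feature values with the mean of the observed values in that column."""
    feature_names = [name for name in rows[0] if name != target]
    for name in feature_names:
        observed = [row[name] for row in rows if row[name] is not None]
        column_mean = mean(observed)
        for row in rows:
            if row[name] is None:
                row[name] = column_mean
    return rows


rows = impute_with_column_mean(load_rows(SAMPLE_CSV))

# Class distribution: how many malware (1) and benign (0) rows there are.
class_counts = Counter(int(row["Class"]) for row in rows)

# Sanity check that no missing values remain after imputation.
missing_left = sum(value is None for row in rows for value in row.values())
```

With a dataframe library the same steps collapse to a few calls, but the explicit loops make it clear that each column is imputed with its own mean and that the target column is left untouched.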
62 |
63 | > ## Exploratory Data Analysis (EDA):
64 | The EDA for the Android Permission Dataset provided valuable insights into the relationships between different features in the
65 | dataset and helped us identify the most important features for predicting whether an app is malware. It also provided a foundation for further
66 | analysis using machine learning techniques.
67 |
77 |
78 | > ## Methodology:
79 |
80 | After preprocessing, the data is split into training and testing sets at an **8:2 ratio**. We attempted both undersampling and oversampling
81 | techniques on the dataset, but the results were not promising. We then applied various classifiers, including logistic regression,
82 | decision trees, and Naive Bayes, but the outcomes were unsatisfactory. Upon further inspection of the dataset, we discovered
83 | that it contained several multivariate data tables, which required us to apply **PCA** to each dataset. We plotted the variance
84 | percentage after using PCA and chose to use the inverse transform. We then applied Random Forest to the dataset, which
85 | raised test accuracy from about 0.68 with the basic models to 0.86. We then used the boosting approach to further increase prediction accuracy,
86 | both on an unsampled dataset and on one with reliable features selected. The results showed that the model was improving.
87 | Finally, we applied **SVM** and **MLP** to the final dataset, which reached a test accuracy of 0.85, close to Random Forest. Comparing the results obtained after
88 | feature selection and **boosting** with the basic models shows significant progress toward our final accuracy of 86%.
89 |
90 |
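The 8:2 split and the two resampling strategies mentioned above can be illustrated with a small standard-library sketch. The assumptions here are mine, not the README's: the split is stratified by class, resampling is applied only to the training indices, and the helper names `stratified_split` and `random_resample` are hypothetical. The README lists Scikit-Learn and Imblearn, which provide equivalent utilities.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def random_resample(indices, labels, mode, seed=42):
    """Balance classes by duplicating minority rows ("over") or dropping majority rows ("under")."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index in indices:
        by_class[labels[index]].append(index)
    sizes = [len(group) for group in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    balanced = []
    for group in by_class.values():
        if len(group) >= target:
            # Class is large enough: keep exactly `target` rows, without replacement.
            balanced.extend(rng.sample(group, target))
        else:
            # Class is too small: keep every row and add random duplicates.
            balanced.extend(group + rng.choices(group, k=target - len(group)))
    return balanced


# Toy labels with the README's class counts: 20,000 malware (1) and 9,999 benign (0).
labels = [1] * 20000 + [0] * 9999
train_idx, test_idx = stratified_split(labels)
over_idx = random_resample(train_idx, labels, mode="over")
under_idx = random_resample(train_idx, labels, mode="under")
```

Resampling only the training indices keeps the test set at its original class ratio, so test metrics still reflect the imbalance a deployed filter would face.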
91 | 92 | PCA features vs Variance Percentage 93 |
94 | 95 |
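The "PCA features vs Variance Percentage" figure shows how much of the total variance each principal component explains. As a rough illustration of that quantity (not the notebook's code, which presumably relies on a library PCA), the sketch below computes the explained-variance ratios for two-feature toy data, where the eigenvalues of the covariance matrix have a closed form. The function name and the toy points are hypothetical.

```python
import math


def explained_variance_ratio_2d(points):
    """Explained-variance ratios of the two principal components of 2-D data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the centred data.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form, largest first.
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


# Toy data lying almost on the line y = x, so the first component dominates.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.1)]
ratios = explained_variance_ratio_2d(points)

# Cumulative share of variance, the curve usually used to pick the number of components.
cumulative = [sum(ratios[: k + 1]) for k in range(len(ratios))]
```

For the 183-feature dataset the same ratios come from the eigenvalues of a 183 x 183 covariance matrix; the number of components is typically chosen where the cumulative curve flattens.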
96 |
97 |
98 | > ## Libraries Used:
99 | - [NumPy](https://numpy.org/)
100 | - [Pandas](https://pandas.pydata.org/)
101 | - [Matplotlib](https://matplotlib.org/)
102 | - [Seaborn](https://seaborn.pydata.org/)
103 | - [Scikit-Learn](https://scikit-learn.org/)
104 | - [Imblearn](https://imbalanced-learn.org/stable/)
105 | - [XGBoost](https://xgboost.ai/)
106 |
107 | > ## Results and Analysis:
108 |
109 | ### On Basic Models
110 |
111 | | Models | Unsampled | Oversampled | Undersampled |
112 | | --- | --- | --- | --- |
113 | | **Logistic Regression** | Training Accuracy 0.69; Test Accuracy 0.68; Recall Score 0.95; ROC Score 0.53 | Training Accuracy 0.63; Test Accuracy 0.62; Recall Score 0.66; ROC Score 0.61 | Training Accuracy 0.63; Test Accuracy 0.63; Recall Score 0.67; ROC Score 0.62 |
114 | | **Naive Bayes** | Training Accuracy 0.68; Test Accuracy 0.67; Recall Score 0.97; ROC Score 0.52 | Training Accuracy 0.53; Test Accuracy 0.53; Recall Score 0.98; ROC Score 0.51 | Training Accuracy 0.53; Test Accuracy 0.53; Recall Score 0.99; ROC Score 0.50 |
115 | | **Decision Tree** | Training Accuracy 0.67; Test Accuracy 0.67; Recall Score 0.99; ROC Score 0.51 | Training Accuracy 0.57; Test Accuracy 0.55; Recall Score 0.68; ROC Score 0.54 | Training Accuracy 0.55; Test Accuracy 0.56; Recall Score 0.79; ROC Score 0.55 |
116 |
117 | *As we can see, sampling is not effective in our case, so we move forward with the unsampled data only.*
118 |
119 |
120 | | Models | Optimal Parameter | Accuracy | Recall | ROC |
121 | | --- | --- | --- | --- | --- |
122 | | **SVM** | default | Training Accuracy 0.85; Test Accuracy 0.85 | 0.94 | 0.80 |
123 | | **Random Forest** | n_estimators=200, n_jobs = -1 | Training Accuracy 0.87; Test Accuracy 0.86 | 0.93 | 0.81 |
124 | | **MLP** | random_state = 42, max_iter = 300 | Training Accuracy 0.85; Test Accuracy 0.85 | 0.95 | 0.80 |
125 |
126 | All three models perform about the same, with Random Forest reaching the highest test accuracy of 86%. As the table shows, accuracy follows the order **Random Forest > MLP > SVM**.
127 |
128 | > ## Conclusion:
129 |
130 | - Learning
131 | We learned different ways to visualize the data for a better understanding of the features; how to model the problem with machine learning models such as Logistic Regression, Naive Bayes and Decision Tree; how to use platforms like Kaggle and Google Colab; and how to work and collaborate in teams.
132 |
133 | > ## References:
134 |
135 | - [[1](https://www.ijrdet.com/files/Volume11Issue2/IJRDET_0222_03.pdf)] Android Malware Prediction using Machine Learning Techniques: A Review
136 |
137 | - [[2](https://www.sciencedirect.com/science/article/pii/S1877050921014186)] An Efficient Android Malware Prediction Using Ensemble Machine Learning Algorithms
138 |
139 | - [[3](https://www.kaggle.com/saurabhshahane/android-permission-dataset/)] Android Permission Dataset
140 |
141 |
-------------------------------------------------------------------------------- /Results/t2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Results/t2.png -------------------------------------------------------------------------------- /Results/t3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Results/t3.png -------------------------------------------------------------------------------- /Results/t4.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Results/t4.png -------------------------------------------------------------------------------- /Results/t5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Findcoding/Android-Malware-Detection-System-Using-Machine-Learning/d802d3f6ab04c287afe747c94c2d890a757c934d/Results/t5.png --------------------------------------------------------------------------------