├── .gitignore ├── AXMLPrinter2.jar ├── README.md ├── __pycache__ └── tensorflow.cpython-39.pyc ├── best_tree.pdf ├── combine_features.py ├── d2j-dex2jar.sh ├── demo.py ├── demo ├── __pycache__ │ └── extract_features.cpython-39.pyc ├── extract_features.py └── malware_check.py ├── demo_copy.py ├── dex2jar-2.0 ├── d2j-baksmali.bat ├── d2j-baksmali.sh ├── d2j-dex-recompute-checksum.bat ├── d2j-dex-recompute-checksum.sh ├── d2j-dex2jar.bat ├── d2j-dex2jar.sh ├── d2j-dex2smali.bat ├── d2j-dex2smali.sh ├── d2j-jar2dex.bat ├── d2j-jar2dex.sh ├── d2j-jar2jasmin.bat ├── d2j-jar2jasmin.sh ├── d2j-jasmin2jar.bat ├── d2j-jasmin2jar.sh ├── d2j-smali.bat ├── d2j-smali.sh ├── d2j-std-apk.bat ├── d2j-std-apk.sh ├── d2j_invoke.bat ├── d2j_invoke.sh └── lib │ ├── antlr-runtime-3.5.jar │ ├── asm-debug-all-4.1.jar │ ├── d2j-base-cmd-2.0.jar │ ├── d2j-jasmin-2.0.jar │ ├── d2j-smali-2.0.jar │ ├── dex-ir-2.0.jar │ ├── dex-reader-2.0.jar │ ├── dex-reader-api-2.0.jar │ ├── dex-tools-2.0.jar │ ├── dex-translator-2.0.jar │ ├── dex-writer-2.0.jar │ └── dx-1.7.jar ├── extract_apk.sh ├── extract_apk_demo.sh ├── extract_apks_parallel.sh ├── find_top_features.py ├── get_dates.py ├── match_features.py ├── model_training ├── saved_model │ └── model_android_malware_classification │ │ ├── keras_metadata.pb │ │ ├── saved_model.pb │ │ └── variables │ │ ├── variables.data-00000-of-00001 │ │ └── variables.index └── train_model.py ├── parse_disassembled.py ├── parse_xml.py ├── parse_xml_demo.py ├── plot_data.py ├── requirements.txt ├── resources.arsc ├── run_trials.sh ├── sklearn_forest.py ├── sklearn_svm.py ├── sklearn_tree.py ├── sort_malicious.py ├── tensorflow_learn.py └── valid_apks.txt /.gitignore: -------------------------------------------------------------------------------- 1 | config.ini 2 | all_apks/ 3 | malicious_apk/ 4 | demo_apk/ 5 | current/ 6 | benign_apk/ 7 | external/ 8 | bak/ 9 | invalid_andrototal_responses 10 | *.json 11 | *.csv 12 | *.png 13 | *.apk 14 | results_* 15 | .DS_Store 16 | -------------------------------------------------------------------------------- /AXMLPrinter2.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/AXMLPrinter2.jar -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 1. Get datasets 2 | 3 | - Pull the dataset of benign apk files and store them to \/benign_apk/ 4 | 5 | http://205.174.165.80/CICDataset/CICMalAnal2017/Dataset/APKs/Benign-APKs-2017.zip 6 | 7 | - Pull the datasets of malicious apk files and store them to \/malicious_apk/ 8 | 9 | http://205.174.165.80/CICDataset/CICMalAnal2017/Dataset/APKs/ (exclude Benign.tar.gz) 10 | 11 | - Rename them to .apk. 12 | 13 | # 2. Extract data 14 | 15 | Run `extract_apks_parallel.sh` unpacks the .apk files into folders and processes some of the data there in. 16 | You can monitor it in another shell by running `watch "wc -l benign_apk/valid_apks.txt; wc -l malicious_apk/valid_apks.txt"` 17 | 18 | # 3. Generate feature vectors 19 | 20 | Run one of the following scripts to generate feature vectors: 21 | 22 | - `parse_xml.py` for permissions. "app_permission_vectors.json" is generated 23 | - `parse_maline_output.py` for syscalls. "app_syscall_vectors.json" is generated. You will have to run [maline](https://github.com/soarlab/maline) first for this to work. 
24 | - `parse_disassembled.py` for API calls. "app_method_vectors.json" is generated. 25 | - `parse_ssdeep.py` for fuzzy hashes. "app_hash_vectors.json" is generated. You will have to run [ssdeep](http://ssdeep.sourceforge.net/) first for this to work. 26 | - `combine_features.py` for a combination of the top-weighted features. "app_feature_vectors.json" is generated. This only works if you've previously trained a network on the specified features, and the feature-weight files are named appropriately. 27 | 28 | # 4. Trials 29 | Run `run_trials.sh app_feature_vectors.json` (or whichever JSON you want); it runs the `tensorflow_learn.py` script (where the ML happens) a number of times and puts the results into a folder. It also runs `plot_data.py` and `match_features.py` to create a plot and a list of the top-weighted features, respectively. 30 | 31 | # 5. Tuning 32 | 33 | Change the parameters or input data and repeat step 4 (Trials). It should be non-destructive, so you can compare the results of different runs. 34 | 35 | Note: If you want to use an SVM instead of a neural network, use `sklearn_svm.py` in place of `tensorflow_learn.py`. You can also use `sklearn_tree.py` for a decision tree. 36 | -------------------------------------------------------------------------------- /__pycache__/tensorflow.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/__pycache__/tensorflow.cpython-39.pyc -------------------------------------------------------------------------------- /best_tree.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/best_tree.pdf -------------------------------------------------------------------------------- /combine_features.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | """ 4 | This script reads JSON files for different features (permissions, system calls, etc.) 5 | with data on a set of apps, reads the features' weights from a trained model, and 6 | builds a dataset from the most heavily weighted features of each type.
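The script's settings come from a gitignored config.ini. A minimal sketch of the
expected [AMA] section, with purely illustrative values; the names listed under
FEATURES must match the app_<feature>_vectors.json and <feature>_weights.json
files that actually exist on disk (e.g. the ones produced by parse_xml.py,
parse_disassembled.py, and a previous training run):

    [AMA]
    FEATURES = permission,method
    TOP_N_FEATURES = 100
    INCLUDE_DATES = False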
7 | 8 | The output data format is as follows: 9 | {"features": ["ANDROID.PERMISSION.READ_PHONE_STATE", "java/security/Signature",...], 10 | "apps": {"999eca2457729e371355aea5faa38e14.apk": {"vector": [0,0,0,1], "malicious": [0,1]}, ...}} 11 | """ 12 | 13 | from configparser import ConfigParser 14 | import json 15 | 16 | 17 | 18 | def main(): 19 | config = ConfigParser() 20 | config.read('config.ini') 21 | FEATURES = config.get('AMA', 'FEATURES').split(',') 22 | TOP_N_FEATURES = config.getint('AMA', 'TOP_N_FEATURES') 23 | INCLUDE_DATES = config.getboolean('AMA', 'INCLUDE_DATES') 24 | 25 | all_features = [] # list of strings naming each feature used in the combined dataset 26 | app_feature_map = {} # mapping from android app names to lists of features 27 | app_malicious_map = {} # mapping from android app names to 1 or 0 for malware or goodware 28 | for feature in FEATURES: 29 | with open(feature + '_weights.json') as weights: 30 | feature_weights = json.load(weights) 31 | print('Found ' + str(len(feature_weights)) + ' sets of weights for ' + feature) 32 | # no need to look at benign weights; they're complementary 33 | malicious_weights = [weight[0] for weight in feature_weights] 34 | malicious_indices = sorted(range(len(malicious_weights)), key=lambda k: malicious_weights[k], reverse=True) 35 | with open('app_' + feature + '_vectors.json') as vectors: 36 | feature_data = json.load(vectors) 37 | feature_names = feature_data['features'] 38 | print('Selecting ' + str(TOP_N_FEATURES) + ' top features of ' + str(len(feature_names))) 39 | for i in range(min(int(len(malicious_indices) / 2), int(TOP_N_FEATURES / 2))): 40 | index = malicious_indices[i] 41 | all_features.append(feature_names[index]) 42 | for i in range(min(int(len(malicious_indices) / 2), int(TOP_N_FEATURES / 2))): 43 | index = malicious_indices[-i] 44 | all_features.append(feature_names[index]) 45 | # The date feature has equal numbers of apps in each range to avoid it 46 | # being used as a feature directly, so only use those apps 47 | if INCLUDE_DATES: 48 | with open('app_date_vectors.json') as vectors: 49 | feature_data = json.load(vectors) 50 | date_buckets = feature_data['features'] 51 | all_features += date_buckets 52 | date_apps = feature_data['apps'] 53 | for app in date_apps: 54 | if app not in app_malicious_map: 55 | app_malicious_map[app] = date_apps[app]['malicious'] 56 | if app not in app_feature_map: 57 | app_feature_map[app] = [] 58 | for bucket in date_buckets: 59 | index = date_buckets.index(bucket) 60 | if date_apps[app]['vector'][index] == 1: 61 | app_feature_map[app].append(bucket) 62 | for feature in FEATURES: 63 | with open('app_' + feature + '_vectors.json') as vectors: 64 | feature_data = json.load(vectors) 65 | feature_names = feature_data['features'] 66 | feature_apps = feature_data['apps'] 67 | print('Found ' + str(len(feature_apps)) + ' apps for ' + feature) 68 | for app in feature_apps: 69 | if INCLUDE_DATES and app not in date_apps: 70 | continue 71 | if app not in app_malicious_map: 72 | app_malicious_map[app] = feature_apps[app]['malicious'] 73 | if app not in app_feature_map: 74 | app_feature_map[app] = [] 75 | for feature_name in all_features: 76 | if feature_name in feature_names: 77 | index = feature_names.index(feature_name) 78 | if feature_apps[app]['vector'][index] == 1: 79 | app_feature_map[app].append(feature_name) 80 | all_apps = {} # mapping combining app_feature_map and app_malicious_map using bits 81 | for app_name in app_feature_map: 82 | bit_vector = [1 if p in app_feature_map[app_name] 
else 0 for p in all_features] 83 | all_apps[app_name] = {'vector': bit_vector, 'malicious': app_malicious_map[app_name]} 84 | with open('app_feature_vectors.json', 'w') as outfile: 85 | json.dump({'features': all_features, 'apps': all_apps}, outfile) 86 | print('Wrote data on ' + str(len(all_features)) + ' features and ' + str(len(all_apps)) + ' apps to a file.') 87 | 88 | if __name__=='__main__': 89 | main() 90 | -------------------------------------------------------------------------------- /d2j-dex2jar.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | # call d2j_invoke.sh to setup java environment 36 | "$PRGDIR/d2j_invoke.sh" "com.googlecode.dex2jar.tools.Dex2jarCmd" "$@" 37 | -------------------------------------------------------------------------------- /demo.py: -------------------------------------------------------------------------------- 1 | #Basic classification: Classify Android malwares 2 | import json 3 | import sys 4 | 5 | # TensorFlow and tf.keras 6 | import tensorflow as tf 7 | 8 | # Helper libraries 9 | import numpy as np 10 | import matplotlib.pyplot as plt 11 | 12 | print(tf.__version__) 13 | 14 | # Import the datasets from CIC 15 | with open(sys.argv[1]) as infile: 16 | dataset = json.load(infile) 17 | 18 | app_names = [app for app in dataset['apps']] 19 | features = np.array(dataset['features']) 20 | train_files = np.array([np.array(dataset['apps'][app]['vector']) for app in app_names]) 21 | print(train_files.shape) 22 | train_labels = np.array([dataset['apps'][app]['malicious'][0] for app in app_names]) 23 | 24 | test_files = np.array([dataset['apps'][app]['vector'] for app in app_names]) 25 | test_labels = np.array([dataset['apps'][app]['malicious'][0] for app in app_names]) 26 | 27 | class_names = ['benign', 'malicious'] 28 | 29 | ## Explore the data 30 | 31 | # print(train_files.shape) 32 | # print(len(train_labels)) 33 | # print(train_labels) 34 | # print(test_files.shape) 35 | # print(len(test_labels)) 36 | 37 | ## Build the model 38 | ### Set up the layers 39 | model = tf.keras.Sequential([ 40 | tf.keras.layers.Flatten(input_shape=(100, )), 41 | tf.keras.layers.Dense(128, activation='relu'), 42 | tf.keras.layers.Dense(10) 43 | ]) 44 | 45 | ### Compile the model 46 | model.compile(optimizer='adam', 47 | loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 48 | metrics=['accuracy']) 49 | 50 | ## Train the model 51 | ### 
Feed the model 52 | model.fit(train_files, train_labels, epochs=10) 53 | test_loss, test_acc = model.evaluate(test_files, test_labels, verbose=2) 54 | 55 | print('\nTest accuracy:', test_acc) 56 | 57 | ### Make predictions 58 | probability_model = tf.keras.Sequential([model, 59 | tf.keras.layers.Softmax()]) 60 | 61 | predictions = probability_model.predict(test_files) 62 | # print(app_names[190]) 63 | # print(predictions[190]) 64 | # print(np.argmax(predictions[190])) 65 | # print(test_labels[190]) 66 | 67 | model.save('saved_model/model_android_malware_classification') 68 | ## Use the trained model 69 | file = test_files[190] 70 | print(file.shape) 71 | file = (np.expand_dims(file,0)) 72 | print(file.shape) 73 | predictions_single = probability_model.predict(file) 74 | print(predictions_single) 75 | print(np.argmax(predictions_single[0])) -------------------------------------------------------------------------------- /demo/__pycache__/extract_features.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/demo/__pycache__/extract_features.cpython-39.pyc -------------------------------------------------------------------------------- /demo/extract_features.py: -------------------------------------------------------------------------------- 1 | from subprocess import call 2 | from defusedxml import ElementTree 3 | import os 4 | import sys 5 | import json 6 | import numpy as np 7 | from termcolor import colored 8 | 9 | def extract(file): 10 | rc = call("../extract_apk_demo.sh", shell=True) 11 | #parse_xml 12 | root_dir = os.getcwd() 13 | directory="../demo_apk" 14 | # for i, directory in enumerate(['benign_apk', 'malicious_apk']): 15 | os.chdir(directory) 16 | category_root_dir = os.getcwd() 17 | filename = file 18 | 19 | with open("../model_training/app_feature_vectors.json") as infile: 20 | dataset = json.load(infile) 21 | 22 | features = np.array(dataset['features']) 23 | #print(features) 24 | vector = [0]*len(features) 25 | #print(vector) 26 | 27 | # for filename in glob.glob('*.apk'): 28 | #print('Processing ' + filename) 29 | try: 30 | os.chdir(filename[:-4]) 31 | with open('AndroidManifest.xml') as xml_file: 32 | et = ElementTree.parse(xml_file) 33 | except (ElementTree.ParseError, UnicodeDecodeError, FileNotFoundError): 34 | #print('Parsing error encountered for ' + filename) 35 | os.chdir(category_root_dir) 36 | app_name = filename 37 | # make a one-hot bit vector of length 2. 
1st bit set if malicious, otherwise 2nd bit 38 | # app_malicious_map[app_name] = [1,0] if i else [0,1] 39 | permissions = et.getroot().findall('./uses-permission') 40 | all_permissions = [] 41 | for permission in permissions: 42 | try: 43 | permission_name = permission.attrib['{http://schemas.android.com/apk/res/android}name'].upper() 44 | if not permission_name.startswith('ANDROID.PERMISSION'): continue # ignore custom permissions 45 | if permission_name not in all_permissions: 46 | all_permissions.append(permission_name) 47 | except: 48 | pass 49 | print(colored("API PERMISSIONS:", "blue")) 50 | for permission in all_permissions: 51 | try: 52 | result = np.where(features == permission) 53 | #print(result[0][0]) 54 | vector[result[0][0]]=1 55 | print(colored("X","red"), end=" ") 56 | print(permission) 57 | except: 58 | print(colored(" ","red"), end=" ") 59 | print(permission) 60 | pass 61 | try: 62 | #print('Processing ' + filename) 63 | all_methods = [] 64 | with open('disassembled.code') as disassembled_code: 65 | app_name = filename 66 | # make a one-hot bit vector of length 2. 1st bit set if malicious, otherwise 2nd bit 67 | # parse the file and record any interesting methods 68 | for line in disassembled_code.readlines(): 69 | try: 70 | method = line.split('// Method ')[1].split(':')[0] 71 | #if not method.startswith('java') and not method.startswith('android'): 72 | if not method.startswith('java'): 73 | continue 74 | # Comment the below line to use methods rather than classes 75 | method = method.split('.')[0] 76 | # the method is probably obfuscated; ignore it 77 | if len(method.split('/')[-1]) < 4 or len(method.split('/')[-2]) == 1: 78 | continue 79 | if method not in all_methods: 80 | all_methods.append(method) 81 | except: 82 | continue 83 | 84 | print(colored("API METHODS:", "yellow")) 85 | for method in all_methods: 86 | try: 87 | result = np.where(features == method) 88 | #print(result[0][0]) 89 | vector[result[0][0]]=1 90 | print(colored("X","red"), end=" ") 91 | print(method) 92 | except: 93 | print(colored(" ","red"), end=" ") 94 | print(method) 95 | pass 96 | except FileNotFoundError as e: 97 | print(e) 98 | finally: 99 | os.chdir(category_root_dir) 100 | os.chdir(root_dir) 101 | #print(vector) 102 | return vector -------------------------------------------------------------------------------- /demo/malware_check.py: -------------------------------------------------------------------------------- 1 | import json 2 | import sys 3 | import subprocess 4 | from termcolor import colored 5 | 6 | # TensorFlow and tf.keras 7 | import tensorflow as tf 8 | 9 | # Helper libraries 10 | import numpy as np 11 | import matplotlib.pyplot as plt 12 | 13 | import extract_features 14 | #1. process the input as an .apk file 15 | filename = sys.argv[1] 16 | class_name = ["BENIGN","MALICIOUS"] 17 | test_files = extract_features.extract(filename) 18 | # print(test_files) 19 | #2. 
load the saved model and predict 20 | 21 | new_model = tf.keras.models.load_model('../model_training/saved_model/model_android_malware_classification') 22 | 23 | # print(new_model.summary()) 24 | probability_model = tf.keras.Sequential([new_model, 25 | tf.keras.layers.Softmax()]) 26 | file = np.array(test_files) 27 | # print(file.shape) 28 | file = (np.expand_dims(file,0)) 29 | # print(file.shape) 30 | predictions_single = probability_model.predict(file) 31 | # print(predictions_single) 32 | label=np.argmax(predictions_single[0]) 33 | color = "green" if label == 0 else "red" 34 | print(colored("This file is "+class_name[label],color)) 35 | -------------------------------------------------------------------------------- /demo_copy.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """classification.ipynb 3 | 4 | Automatically generated by Colaboratory. 5 | 6 | Original file is located at 7 | https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/classification.ipynb 8 | 9 | ##### Copyright 2018 The TensorFlow Authors. 10 | """ 11 | 12 | #@title Licensed under the Apache License, Version 2.0 (the "License"); 13 | # you may not use this file except in compliance with the License. 14 | # You may obtain a copy of the License at 15 | # 16 | # https://www.apache.org/licenses/LICENSE-2.0 17 | # 18 | # Unless required by applicable law or agreed to in writing, software 19 | # distributed under the License is distributed on an "AS IS" BASIS, 20 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 21 | # See the License for the specific language governing permissions and 22 | # limitations under the License. 23 | 24 | #@title MIT License 25 | # 26 | # Copyright (c) 2017 François Chollet 27 | # 28 | # Permission is hereby granted, free of charge, to any person obtaining a 29 | # copy of this software and associated documentation files (the "Software"), 30 | # to deal in the Software without restriction, including without limitation 31 | # the rights to use, copy, modify, merge, publish, distribute, sublicense, 32 | # and/or sell copies of the Software, and to permit persons to whom the 33 | # Software is furnished to do so, subject to the following conditions: 34 | # 35 | # The above copyright notice and this permission notice shall be included in 36 | # all copies or substantial portions of the Software. 37 | # 38 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 39 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 40 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL 41 | # THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 42 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 43 | # FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 44 | # DEALINGS IN THE SOFTWARE. 45 | 46 | """# Basic classification: Classify images of clothing 47 | 48 | 49 | 52 | 55 | 58 | 61 |
50 | [Links: View on TensorFlow.org, Run in Google Colab, View source on GitHub, Download notebook] 60 |
62 | 63 | This guide trains a neural network model to classify images of clothing, like sneakers and shirts. It's okay if you don't understand all the details; this is a fast-paced overview of a complete TensorFlow program with the details explained as you go. 64 | 65 | This guide uses [tf.keras](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow. 66 | """ 67 | 68 | # TensorFlow and tf.keras 69 | import tensorflow as tf 70 | 71 | # Helper libraries 72 | import numpy as np 73 | import matplotlib.pyplot as plt 74 | 75 | print(tf.__version__) 76 | 77 | """## Import the Fashion MNIST dataset 78 | 79 | This guide uses the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset which contains 70,000 grayscale images in 10 categories. The images show individual articles of clothing at low resolution (28 by 28 pixels), as seen here: 80 | 81 | 82 | 86 | 89 |
83 | [Figure: Fashion MNIST sprite]
87 | Figure 1. Fashion-MNIST samples (by Zalando, MIT License).
88 |
90 | 91 | Fashion MNIST is intended as a drop-in replacement for the classic [MNIST](http://yann.lecun.com/exdb/mnist/) dataset—often used as the "Hello, World" of machine learning programs for computer vision. The MNIST dataset contains images of handwritten digits (0, 1, 2, etc.) in a format identical to that of the articles of clothing you'll use here. 92 | 93 | This guide uses Fashion MNIST for variety, and because it's a slightly more challenging problem than regular MNIST. Both datasets are relatively small and are used to verify that an algorithm works as expected. They're good starting points to test and debug code. 94 | 95 | Here, 60,000 images are used to train the network and 10,000 images to evaluate how accurately the network learned to classify images. You can access the Fashion MNIST directly from TensorFlow. Import and [load the Fashion MNIST data](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/fashion_mnist/load_data) directly from TensorFlow: 96 | """ 97 | 98 | fashion_mnist = tf.keras.datasets.fashion_mnist 99 | 100 | (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data() 101 | 102 | """Loading the dataset returns four NumPy arrays: 103 | 104 | * The `train_images` and `train_labels` arrays are the *training set*—the data the model uses to learn. 105 | * The model is tested against the *test set*, the `test_images`, and `test_labels` arrays. 106 | 107 | The images are 28x28 NumPy arrays, with pixel values ranging from 0 to 255. The *labels* are an array of integers, ranging from 0 to 9. These correspond to the *class* of clothing the image represents: 108 |
Label  Class
0      T-shirt/top
1      Trouser
2      Pullover
3      Dress
4      Coat
5      Sandal
6      Shirt
7      Sneaker
8      Bag
9      Ankle boot
155 | 156 | Each image is mapped to a single label. Since the *class names* are not included with the dataset, store them here to use later when plotting the images: 157 | """ 158 | 159 | class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 160 | 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'] 161 | 162 | """## Explore the data 163 | 164 | Let's explore the format of the dataset before training the model. The following shows there are 60,000 images in the training set, with each image represented as 28 x 28 pixels: 165 | """ 166 | 167 | train_images.shape 168 | 169 | """Likewise, there are 60,000 labels in the training set:""" 170 | 171 | len(train_labels) 172 | 173 | """Each label is an integer between 0 and 9:""" 174 | 175 | train_labels 176 | 177 | """There are 10,000 images in the test set. Again, each image is represented as 28 x 28 pixels:""" 178 | 179 | test_images.shape 180 | 181 | """And the test set contains 10,000 images labels:""" 182 | 183 | len(test_labels) 184 | 185 | """## Preprocess the data 186 | 187 | The data must be preprocessed before training the network. If you inspect the first image in the training set, you will see that the pixel values fall in the range of 0 to 255: 188 | """ 189 | 190 | plt.figure() 191 | plt.imshow(train_images[0]) 192 | plt.colorbar() 193 | plt.grid(False) 194 | plt.show() 195 | 196 | """Scale these values to a range of 0 to 1 before feeding them to the neural network model. To do so, divide the values by 255. It's important that the *training set* and the *testing set* be preprocessed in the same way:""" 197 | 198 | train_images = train_images / 255.0 199 | 200 | test_images = test_images / 255.0 201 | 202 | """To verify that the data is in the correct format and that you're ready to build and train the network, let's display the first 25 images from the *training set* and display the class name below each image.""" 203 | 204 | plt.figure(figsize=(10,10)) 205 | for i in range(25): 206 | plt.subplot(5,5,i+1) 207 | plt.xticks([]) 208 | plt.yticks([]) 209 | plt.grid(False) 210 | plt.imshow(train_images[i], cmap=plt.cm.binary) 211 | plt.xlabel(class_names[train_labels[i]]) 212 | plt.show() 213 | 214 | """## Build the model 215 | 216 | Building the neural network requires configuring the layers of the model, then compiling the model. 217 | 218 | ### Set up the layers 219 | 220 | The basic building block of a neural network is the [*layer*](https://www.tensorflow.org/api_docs/python/tf/keras/layers). Layers extract representations from the data fed into them. Hopefully, these representations are meaningful for the problem at hand. 221 | 222 | Most of deep learning consists of chaining together simple layers. Most layers, such as `tf.keras.layers.Dense`, have parameters that are learned during training. 223 | """ 224 | 225 | model = tf.keras.Sequential([ 226 | tf.keras.layers.Flatten(input_shape=(28, 28)), 227 | tf.keras.layers.Dense(128, activation='relu'), 228 | tf.keras.layers.Dense(10) 229 | ]) 230 | 231 | """The first layer in this network, `tf.keras.layers.Flatten`, transforms the format of the images from a two-dimensional array (of 28 by 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data. 232 | 233 | After the pixels are flattened, the network consists of a sequence of two `tf.keras.layers.Dense` layers. 
These are densely connected, or fully connected, neural layers. The first `Dense` layer has 128 nodes (or neurons). The second (and last) layer returns a logits array with length of 10. Each node contains a score that indicates the current image belongs to one of the 10 classes. 234 | 235 | ### Compile the model 236 | 237 | Before the model is ready for training, it needs a few more settings. These are added during the model's [*compile*](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile) step: 238 | 239 | * [*Loss function*](https://www.tensorflow.org/api_docs/python/tf/keras/losses) —This measures how accurate the model is during training. You want to minimize this function to "steer" the model in the right direction. 240 | * [*Optimizer*](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) —This is how the model is updated based on the data it sees and its loss function. 241 | * [*Metrics*](https://www.tensorflow.org/api_docs/python/tf/keras/metrics) —Used to monitor the training and testing steps. The following example uses *accuracy*, the fraction of the images that are correctly classified. 242 | """ 243 | 244 | model.compile(optimizer='adam', 245 | loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 246 | metrics=['accuracy']) 247 | 248 | """## Train the model 249 | 250 | Training the neural network model requires the following steps: 251 | 252 | 1. Feed the training data to the model. In this example, the training data is in the `train_images` and `train_labels` arrays. 253 | 2. The model learns to associate images and labels. 254 | 3. You ask the model to make predictions about a test set—in this example, the `test_images` array. 255 | 4. Verify that the predictions match the labels from the `test_labels` array. 256 | 257 | ### Feed the model 258 | 259 | To start training, call the [`model.fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method—so called because it "fits" the model to the training data: 260 | """ 261 | 262 | model.fit(train_images, train_labels, epochs=10) 263 | 264 | """As the model trains, the loss and accuracy metrics are displayed. This model reaches an accuracy of about 0.91 (or 91%) on the training data. 265 | 266 | ### Evaluate accuracy 267 | 268 | Next, compare how the model performs on the test dataset: 269 | """ 270 | 271 | test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2) 272 | 273 | print('\nTest accuracy:', test_acc) 274 | 275 | """It turns out that the accuracy on the test dataset is a little less than the accuracy on the training dataset. This gap between training accuracy and test accuracy represents *overfitting*. Overfitting happens when a machine learning model performs worse on new, previously unseen inputs than it does on the training data. An overfitted model "memorizes" the noise and details in the training dataset to a point where it negatively impacts the performance of the model on the new data. For more information, see the following: 276 | * [Demonstrate overfitting](https://www.tensorflow.org/tutorials/keras/overfit_and_underfit#demonstrate_overfitting) 277 | * [Strategies to prevent overfitting](https://www.tensorflow.org/tutorials/keras/overfit_and_underfit#strategies_to_prevent_overfitting) 278 | 279 | ### Make predictions 280 | 281 | With the model trained, you can use it to make predictions about some images. 282 | The model's linear outputs, [logits](https://developers.google.com/machine-learning/glossary#logits). 
Attach a softmax layer to convert the logits to probabilities, which are easier to interpret. 283 | """ 284 | 285 | probability_model = tf.keras.Sequential([model, 286 | tf.keras.layers.Softmax()]) 287 | 288 | predictions = probability_model.predict(test_images) 289 | 290 | """Here, the model has predicted the label for each image in the testing set. Let's take a look at the first prediction:""" 291 | 292 | predictions[0] 293 | 294 | """A prediction is an array of 10 numbers. They represent the model's "confidence" that the image corresponds to each of the 10 different articles of clothing. You can see which label has the highest confidence value:""" 295 | 296 | np.argmax(predictions[0]) 297 | 298 | """So, the model is most confident that this image is an ankle boot, or `class_names[9]`. Examining the test label shows that this classification is correct:""" 299 | 300 | test_labels[0] 301 | 302 | """Graph this to look at the full set of 10 class predictions.""" 303 | 304 | def plot_image(i, predictions_array, true_label, img): 305 | true_label, img = true_label[i], img[i] 306 | plt.grid(False) 307 | plt.xticks([]) 308 | plt.yticks([]) 309 | 310 | plt.imshow(img, cmap=plt.cm.binary) 311 | 312 | predicted_label = np.argmax(predictions_array) 313 | if predicted_label == true_label: 314 | color = 'blue' 315 | else: 316 | color = 'red' 317 | 318 | plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label], 319 | 100*np.max(predictions_array), 320 | class_names[true_label]), 321 | color=color) 322 | 323 | def plot_value_array(i, predictions_array, true_label): 324 | true_label = true_label[i] 325 | plt.grid(False) 326 | plt.xticks(range(10)) 327 | plt.yticks([]) 328 | thisplot = plt.bar(range(10), predictions_array, color="#777777") 329 | plt.ylim([0, 1]) 330 | predicted_label = np.argmax(predictions_array) 331 | 332 | thisplot[predicted_label].set_color('red') 333 | thisplot[true_label].set_color('blue') 334 | 335 | """### Verify predictions 336 | 337 | With the model trained, you can use it to make predictions about some images. 338 | 339 | Let's look at the 0th image, predictions, and prediction array. Correct prediction labels are blue and incorrect prediction labels are red. The number gives the percentage (out of 100) for the predicted label. 340 | """ 341 | 342 | i = 0 343 | plt.figure(figsize=(6,3)) 344 | plt.subplot(1,2,1) 345 | plot_image(i, predictions[i], test_labels, test_images) 346 | plt.subplot(1,2,2) 347 | plot_value_array(i, predictions[i], test_labels) 348 | plt.show() 349 | 350 | i = 12 351 | plt.figure(figsize=(6,3)) 352 | plt.subplot(1,2,1) 353 | plot_image(i, predictions[i], test_labels, test_images) 354 | plt.subplot(1,2,2) 355 | plot_value_array(i, predictions[i], test_labels) 356 | plt.show() 357 | 358 | """Let's plot several images with their predictions. Note that the model can be wrong even when very confident.""" 359 | 360 | # Plot the first X test images, their predicted labels, and the true labels. 361 | # Color correct predictions in blue and incorrect predictions in red. 
362 | num_rows = 5 363 | num_cols = 3 364 | num_images = num_rows*num_cols 365 | plt.figure(figsize=(2*2*num_cols, 2*num_rows)) 366 | for i in range(num_images): 367 | plt.subplot(num_rows, 2*num_cols, 2*i+1) 368 | plot_image(i, predictions[i], test_labels, test_images) 369 | plt.subplot(num_rows, 2*num_cols, 2*i+2) 370 | plot_value_array(i, predictions[i], test_labels) 371 | plt.tight_layout() 372 | plt.show() 373 | 374 | """## Use the trained model 375 | 376 | Finally, use the trained model to make a prediction about a single image. 377 | """ 378 | 379 | # Grab an image from the test dataset. 380 | img = test_images[1] 381 | 382 | print(img.shape) 383 | 384 | """`tf.keras` models are optimized to make predictions on a *batch*, or collection, of examples at once. Accordingly, even though you're using a single image, you need to add it to a list:""" 385 | 386 | # Add the image to a batch where it's the only member. 387 | img = (np.expand_dims(img,0)) 388 | 389 | print(img.shape) 390 | 391 | """Now predict the correct label for this image:""" 392 | 393 | predictions_single = probability_model.predict(img) 394 | 395 | print(predictions_single) 396 | 397 | plot_value_array(1, predictions_single[0], test_labels) 398 | _ = plt.xticks(range(10), class_names, rotation=45) 399 | plt.show() 400 | 401 | """`tf.keras.Model.predict` returns a list of lists—one list for each image in the batch of data. Grab the predictions for our (only) image in the batch:""" 402 | 403 | np.argmax(predictions_single[0]) 404 | 405 | """And the model predicts a label as expected.""" -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-baksmali.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | REM 4 | REM dex2jar - Tools to work with android .dex and java .class files 5 | REM Copyright (c) 2009-2013 Panxiaobo 6 | REM 7 | REM Licensed under the Apache License, Version 2.0 (the "License"); 8 | REM you may not use this file except in compliance with the License. 9 | REM You may obtain a copy of the License at 10 | REM 11 | REM http://www.apache.org/licenses/LICENSE-2.0 12 | REM 13 | REM Unless required by applicable law or agreed to in writing, software 14 | REM distributed under the License is distributed on an "AS IS" BASIS, 15 | REM WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | REM See the License for the specific language governing permissions and 17 | REM limitations under the License. 18 | REM 19 | 20 | REM call d2j_invoke.bat to setup java environment 21 | @"%~dp0d2j_invoke.bat" com.googlecode.d2j.smali.BaksmaliCmd %* 22 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-baksmali.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | # call d2j_invoke.sh to setup java environment 36 | "$PRGDIR/d2j_invoke.sh" "com.googlecode.d2j.smali.BaksmaliCmd" "$@" 37 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-dex-recompute-checksum.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | REM 4 | REM dex2jar - Tools to work with android .dex and java .class files 5 | REM Copyright (c) 2009-2013 Panxiaobo 6 | REM 7 | REM Licensed under the Apache License, Version 2.0 (the "License"); 8 | REM you may not use this file except in compliance with the License. 9 | REM You may obtain a copy of the License at 10 | REM 11 | REM http://www.apache.org/licenses/LICENSE-2.0 12 | REM 13 | REM Unless required by applicable law or agreed to in writing, software 14 | REM distributed under the License is distributed on an "AS IS" BASIS, 15 | REM WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | REM See the License for the specific language governing permissions and 17 | REM limitations under the License. 18 | REM 19 | 20 | REM call d2j_invoke.bat to setup java environment 21 | @"%~dp0d2j_invoke.bat" com.googlecode.dex2jar.tools.DexRecomputeChecksum %* 22 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-dex-recompute-checksum.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 
18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | # call d2j_invoke.sh to setup java environment 36 | "$PRGDIR/d2j_invoke.sh" "com.googlecode.dex2jar.tools.DexRecomputeChecksum" "$@" 37 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-dex2jar.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | REM 4 | REM dex2jar - Tools to work with android .dex and java .class files 5 | REM Copyright (c) 2009-2013 Panxiaobo 6 | REM 7 | REM Licensed under the Apache License, Version 2.0 (the "License"); 8 | REM you may not use this file except in compliance with the License. 9 | REM You may obtain a copy of the License at 10 | REM 11 | REM http://www.apache.org/licenses/LICENSE-2.0 12 | REM 13 | REM Unless required by applicable law or agreed to in writing, software 14 | REM distributed under the License is distributed on an "AS IS" BASIS, 15 | REM WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | REM See the License for the specific language governing permissions and 17 | REM limitations under the License. 18 | REM 19 | 20 | REM call d2j_invoke.bat to setup java environment 21 | @"%~dp0d2j_invoke.bat" com.googlecode.dex2jar.tools.Dex2jarCmd %* 22 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-dex2jar.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | # call d2j_invoke.sh to setup java environment 36 | "$PRGDIR/d2j_invoke.sh" "com.googlecode.dex2jar.tools.Dex2jarCmd" "$@" 37 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-dex2smali.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | REM 4 | REM dex2jar - Tools to work with android .dex and java .class files 5 | REM Copyright (c) 2009-2013 Panxiaobo 6 | REM 7 | REM Licensed under the Apache License, Version 2.0 (the "License"); 8 | REM you may not use this file except in compliance with the License. 
9 | REM You may obtain a copy of the License at 10 | REM 11 | REM http://www.apache.org/licenses/LICENSE-2.0 12 | REM 13 | REM Unless required by applicable law or agreed to in writing, software 14 | REM distributed under the License is distributed on an "AS IS" BASIS, 15 | REM WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | REM See the License for the specific language governing permissions and 17 | REM limitations under the License. 18 | REM 19 | 20 | REM call d2j_invoke.bat to setup java environment 21 | @"%~dp0d2j_invoke.bat" com.googlecode.d2j.smali.BaksmaliCmd %* 22 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-dex2smali.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | # call d2j_invoke.sh to setup java environment 36 | "$PRGDIR/d2j_invoke.sh" "com.googlecode.d2j.smali.BaksmaliCmd" "$@" 37 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-jar2dex.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | REM 4 | REM dex2jar - Tools to work with android .dex and java .class files 5 | REM Copyright (c) 2009-2013 Panxiaobo 6 | REM 7 | REM Licensed under the Apache License, Version 2.0 (the "License"); 8 | REM you may not use this file except in compliance with the License. 9 | REM You may obtain a copy of the License at 10 | REM 11 | REM http://www.apache.org/licenses/LICENSE-2.0 12 | REM 13 | REM Unless required by applicable law or agreed to in writing, software 14 | REM distributed under the License is distributed on an "AS IS" BASIS, 15 | REM WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | REM See the License for the specific language governing permissions and 17 | REM limitations under the License. 
18 | REM 19 | 20 | REM call d2j_invoke.bat to setup java environment 21 | @"%~dp0d2j_invoke.bat" com.googlecode.dex2jar.tools.Jar2Dex %* 22 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-jar2dex.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | # call d2j_invoke.sh to setup java environment 36 | "$PRGDIR/d2j_invoke.sh" "com.googlecode.dex2jar.tools.Jar2Dex" "$@" 37 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-jar2jasmin.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | REM 4 | REM dex2jar - Tools to work with android .dex and java .class files 5 | REM Copyright (c) 2009-2013 Panxiaobo 6 | REM 7 | REM Licensed under the Apache License, Version 2.0 (the "License"); 8 | REM you may not use this file except in compliance with the License. 9 | REM You may obtain a copy of the License at 10 | REM 11 | REM http://www.apache.org/licenses/LICENSE-2.0 12 | REM 13 | REM Unless required by applicable law or agreed to in writing, software 14 | REM distributed under the License is distributed on an "AS IS" BASIS, 15 | REM WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | REM See the License for the specific language governing permissions and 17 | REM limitations under the License. 18 | REM 19 | 20 | REM call d2j_invoke.bat to setup java environment 21 | @"%~dp0d2j_invoke.bat" com.googlecode.d2j.jasmin.Jar2JasminCmd %* 22 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-jar2jasmin.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | # call d2j_invoke.sh to setup java environment 36 | "$PRGDIR/d2j_invoke.sh" "com.googlecode.d2j.jasmin.Jar2JasminCmd" "$@" 37 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-jasmin2jar.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | REM 4 | REM dex2jar - Tools to work with android .dex and java .class files 5 | REM Copyright (c) 2009-2013 Panxiaobo 6 | REM 7 | REM Licensed under the Apache License, Version 2.0 (the "License"); 8 | REM you may not use this file except in compliance with the License. 9 | REM You may obtain a copy of the License at 10 | REM 11 | REM http://www.apache.org/licenses/LICENSE-2.0 12 | REM 13 | REM Unless required by applicable law or agreed to in writing, software 14 | REM distributed under the License is distributed on an "AS IS" BASIS, 15 | REM WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | REM See the License for the specific language governing permissions and 17 | REM limitations under the License. 18 | REM 19 | 20 | REM call d2j_invoke.bat to setup java environment 21 | @"%~dp0d2j_invoke.bat" com.googlecode.d2j.jasmin.Jasmin2JarCmd %* 22 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-jasmin2jar.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 
18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | # call d2j_invoke.sh to setup java environment 36 | "$PRGDIR/d2j_invoke.sh" "com.googlecode.d2j.jasmin.Jasmin2JarCmd" "$@" 37 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-smali.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | REM 4 | REM dex2jar - Tools to work with android .dex and java .class files 5 | REM Copyright (c) 2009-2013 Panxiaobo 6 | REM 7 | REM Licensed under the Apache License, Version 2.0 (the "License"); 8 | REM you may not use this file except in compliance with the License. 9 | REM You may obtain a copy of the License at 10 | REM 11 | REM http://www.apache.org/licenses/LICENSE-2.0 12 | REM 13 | REM Unless required by applicable law or agreed to in writing, software 14 | REM distributed under the License is distributed on an "AS IS" BASIS, 15 | REM WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | REM See the License for the specific language governing permissions and 17 | REM limitations under the License. 18 | REM 19 | 20 | REM call d2j_invoke.bat to setup java environment 21 | @"%~dp0d2j_invoke.bat" com.googlecode.d2j.smali.SmaliCmd %* 22 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-smali.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | # call d2j_invoke.sh to setup java environment 36 | "$PRGDIR/d2j_invoke.sh" "com.googlecode.d2j.smali.SmaliCmd" "$@" 37 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-std-apk.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | REM 4 | REM dex2jar - Tools to work with android .dex and java .class files 5 | REM Copyright (c) 2009-2013 Panxiaobo 6 | REM 7 | REM Licensed under the Apache License, Version 2.0 (the "License"); 8 | REM you may not use this file except in compliance with the License. 
9 | REM You may obtain a copy of the License at 10 | REM 11 | REM http://www.apache.org/licenses/LICENSE-2.0 12 | REM 13 | REM Unless required by applicable law or agreed to in writing, software 14 | REM distributed under the License is distributed on an "AS IS" BASIS, 15 | REM WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | REM See the License for the specific language governing permissions and 17 | REM limitations under the License. 18 | REM 19 | 20 | REM call d2j_invoke.bat to setup java environment 21 | @"%~dp0d2j_invoke.bat" com.googlecode.dex2jar.tools.StdApkCmd %* 22 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j-std-apk.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | # call d2j_invoke.sh to setup java environment 36 | "$PRGDIR/d2j_invoke.sh" "com.googlecode.dex2jar.tools.StdApkCmd" "$@" 37 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j_invoke.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | REM better invocation scripts for windows from lanchon, release in public domain. thanks! 3 | REM https://code.google.com/p/dex2jar/issues/detail?id=192 4 | 5 | setlocal enabledelayedexpansion 6 | 7 | set LIB=%~dp0lib 8 | 9 | set CP= 10 | for %%X in ("%LIB%"\*.jar) do ( 11 | set CP=!CP!%%X; 12 | ) 13 | 14 | java -Xms512m -Xmx1024m -cp "%CP%" %* 15 | -------------------------------------------------------------------------------- /dex2jar-2.0/d2j_invoke.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # 4 | # dex2jar - Tools to work with android .dex and java .class files 5 | # Copyright (c) 2009-2013 Panxiaobo 6 | # 7 | # Licensed under the Apache License, Version 2.0 (the "License"); 8 | # you may not use this file except in compliance with the License. 9 | # You may obtain a copy of the License at 10 | # 11 | # http://www.apache.org/licenses/LICENSE-2.0 12 | # 13 | # Unless required by applicable law or agreed to in writing, software 14 | # distributed under the License is distributed on an "AS IS" BASIS, 15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | # See the License for the specific language governing permissions and 17 | # limitations under the License. 
18 | # 19 | 20 | # copy from $Tomcat/bin/startup.sh 21 | # resolve links - $0 may be a softlink 22 | PRG="$0" 23 | while [ -h "$PRG" ] ; do 24 | ls=`ls -ld "$PRG"` 25 | link=`expr "$ls" : '.*-> \(.*\)$'` 26 | if expr "$link" : '/.*' > /dev/null; then 27 | PRG="$link" 28 | else 29 | PRG=`dirname "$PRG"`/"$link" 30 | fi 31 | done 32 | PRGDIR=`dirname "$PRG"` 33 | # 34 | 35 | _classpath="." 36 | if [ `uname -a | grep -i -c cygwin` -ne 0 ]; then # Cygwin, translate the path 37 | for k in "$PRGDIR"/lib/*.jar 38 | do 39 | _classpath="${_classpath};`cygpath -w ${k}`" 40 | done 41 | else 42 | for k in "$PRGDIR"/lib/*.jar 43 | do 44 | _classpath="${_classpath}:${k}" 45 | done 46 | fi 47 | 48 | java -Xms512m -Xmx1024m -classpath "${_classpath}" "$@" 49 | -------------------------------------------------------------------------------- /dex2jar-2.0/lib/antlr-runtime-3.5.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/antlr-runtime-3.5.jar -------------------------------------------------------------------------------- /dex2jar-2.0/lib/asm-debug-all-4.1.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/asm-debug-all-4.1.jar -------------------------------------------------------------------------------- /dex2jar-2.0/lib/d2j-base-cmd-2.0.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/d2j-base-cmd-2.0.jar -------------------------------------------------------------------------------- /dex2jar-2.0/lib/d2j-jasmin-2.0.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/d2j-jasmin-2.0.jar -------------------------------------------------------------------------------- /dex2jar-2.0/lib/d2j-smali-2.0.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/d2j-smali-2.0.jar -------------------------------------------------------------------------------- /dex2jar-2.0/lib/dex-ir-2.0.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/dex-ir-2.0.jar -------------------------------------------------------------------------------- /dex2jar-2.0/lib/dex-reader-2.0.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/dex-reader-2.0.jar -------------------------------------------------------------------------------- /dex2jar-2.0/lib/dex-reader-api-2.0.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/dex-reader-api-2.0.jar 
-------------------------------------------------------------------------------- /dex2jar-2.0/lib/dex-tools-2.0.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/dex-tools-2.0.jar -------------------------------------------------------------------------------- /dex2jar-2.0/lib/dex-translator-2.0.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/dex-translator-2.0.jar -------------------------------------------------------------------------------- /dex2jar-2.0/lib/dex-writer-2.0.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/dex-writer-2.0.jar -------------------------------------------------------------------------------- /dex2jar-2.0/lib/dx-1.7.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/dex2jar-2.0/lib/dx-1.7.jar -------------------------------------------------------------------------------- /extract_apk.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # 4 | # This script is called by extract_apks_paprellel.sh with an apk filename, 5 | # and does several things: 6 | # 1. The apk is extracted into a folder 7 | # 2. The AndroidManifest.xml files are converted from binary to ASCII 8 | # 3. The classes.dex file is converted to classes-dex2jar.jar 9 | # 4. The classes in the jar file are printed to classes.list 10 | # 5. The jar file is disassembled to disassembled.code 11 | # 12 | # Dependencies: unzip, java, javap, xmllint, dex2jar 13 | # 14 | 15 | if [ -z "$1" ]; then 16 | echo "Usage: extract_apk.sh /path/to/apk" 17 | exit 1 18 | fi 19 | 20 | apk="$1" 21 | length=${#apk} 22 | folder="${apk:0:length-4}" 23 | mkdir -p "$folder" 24 | echo "Extracting $apk" 25 | unzip -u -d "$folder" "$apk" 1>/dev/null 26 | if [ $(ls -l "$folder" | wc -l) -gt 1 ]; then 27 | echo `basename $apk` >> valid_apks.txt 28 | cd "$folder" 29 | if [ -e "AndroidManifest.xml" ] && [ ! -e "AndroidManifest_ascii.xml" ]; then 30 | echo "Converting $folder/AndroidManifest.xml to ASCII" 31 | cp AndroidManifest.xml AndroidManifest_bin.xml 32 | java -jar ../../AXMLPrinter2.jar AndroidManifest_bin.xml > AndroidManifest_ascii.xml 33 | echo "Cleaning up $folder/AndroidManifest.xml" 34 | xmllint AndroidManifest_ascii.xml > AndroidManifest.xml 35 | if [ $? -ne 0 ]; then 36 | echo `basename $apk` >> ../malformed_xml.txt 37 | fi 38 | fi 39 | if [ -e "classes.dex" ] && [ ! -e "classes.list" ]; then 40 | echo "Converting $folder/classes.dex to jar format" 41 | echo `pwd` 42 | ../../dex2jar-2.0/d2j-dex2jar.sh classes.dex 43 | # dex2jar classes.dex 44 | if [ $? -ne 0 ]; then 45 | echo `basename $apk` >> ../malformed_dex.txt 46 | else 47 | jar -tf classes-dex2jar.jar > classes.list 48 | javap -c -classpath classes-dex2jar.jar $(jar -tf classes-dex2jar.jar | grep "class$" | sed s/\.class$//) > disassembled.code 49 | fi 50 | fi 51 | cd .. 
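# Example of running this script by hand on one APK (the path below is hypothetical;
# extract_apks_parallel.sh normally supplies it). The ../../AXMLPrinter2.jar and
# ../../dex2jar-2.0/ paths above assume the script is invoked from inside
# benign_apk/ or malicious_apk/, e.g.:
#   cd benign_apk && ../extract_apk.sh "$(pwd)/cc.pacer.androidapp.apk"
# Afterwards the cc.pacer.androidapp/ folder should contain the ASCII
# AndroidManifest.xml, classes-dex2jar.jar, classes.list and disassembled.code.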
52 | fi 53 | -------------------------------------------------------------------------------- /extract_apk_demo.sh: -------------------------------------------------------------------------------- 1 | 2 | original_dir=$(pwd) 3 | cd ../ 4 | for dir in demo_apk; do 5 | echo "Entering $dir" 6 | cd $dir 7 | find `pwd` -type f -name "*.apk" | xargs -n 1 ../extract_apk.sh 8 | cat valid_apks.txt | sort | uniq > valid_apks_tmp.txt 9 | cat malformed_xml.txt | sort | uniq > malformed_xml_tmp.txt 10 | cat malformed_dex.txt | sort | uniq > malformed_dex_tmp.txt 11 | cp valid_apks_tmp.txt valid_apks.txt 2>/dev/null 12 | cp malformed_xml_tmp.txt malformed_xml.txt 2>/dev/null 13 | cp malformed_dex_tmp.txt malformed_dex.txt 2>/dev/null 14 | rm -f valid_apks_tmp.txt malformed_xml_tmp.txt malformed_dex_tmp.txt 15 | cd $original_dir 16 | done 17 | -------------------------------------------------------------------------------- /extract_apks_parallel.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # 4 | # This script calls the extract_apk.sh script many times in parallel 5 | # for each of the sorted applications to unpack apk files and parse 6 | # their contents. 7 | # 8 | 9 | original_dir=$(pwd) 10 | for dir in malicious_apk benign_apk; do 11 | echo "Entering $dir" 12 | cd $dir 13 | # cpus=$(ls -d /sys/devices/system/cpu/cpu[[:digit:]]* | wc -w) 14 | # cpus=$(expr $cpus - 2) 15 | # find `pwd` -type f -name "*.apk" | xargs --max-args=1 --max-procs=$cpus ../extract_apk.sh 16 | find `pwd` -type f -name "*.apk" | xargs -n 1 ../extract_apk.sh 17 | cat valid_apks.txt | sort | uniq > valid_apks_tmp.txt 18 | cat malformed_xml.txt | sort | uniq > malformed_xml_tmp.txt 19 | cat malformed_dex.txt | sort | uniq > malformed_dex_tmp.txt 20 | cp valid_apks_tmp.txt valid_apks.txt 2>/dev/null 21 | cp malformed_xml_tmp.txt malformed_xml.txt 2>/dev/null 22 | cp malformed_dex_tmp.txt malformed_dex.txt 2>/dev/null 23 | rm -f valid_apks_tmp.txt malformed_xml_tmp.txt malformed_dex_tmp.txt 24 | cd $original_dir 25 | done 26 | -------------------------------------------------------------------------------- /find_top_features.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | """ 4 | This script prints each feature in the given JSON file along with 5 | how many apps use it, and how many of each type (malicious or benign). 
6 | """ 7 | 8 | import sys 9 | import json 10 | 11 | def main(): 12 | with open(sys.argv[1]) as f: 13 | j = json.load(f) 14 | all_features = j['features'] 15 | num_apps_each = {} 16 | for app in j['apps']: 17 | for i,bit in enumerate(j['apps'][app]['vector']): 18 | if bit == 1: 19 | if j['apps'][app]['malicious'] == [1,0]: 20 | if all_features[i] in num_apps_each: 21 | num_apps_each[all_features[i]][0] += 1 22 | else: 23 | num_apps_each[all_features[i]] = [1,0] 24 | else: 25 | if all_features[i] in num_apps_each: 26 | num_apps_each[all_features[i]][1] += 1 27 | else: 28 | num_apps_each[all_features[i]] = [0,1] 29 | # This sorts by number of malicious apps using it, but can easily be changed 30 | for feature in sorted(num_apps_each.items(), key=lambda item: item[1][0], reverse=True): 31 | print('{} was used by {} apps ({} malicious and {} benign)'.format(feature[0], str(feature[1][0] + feature[1][1]), str(feature[1][0]), str(feature[1][1]))) 32 | 33 | if __name__=='__main__': 34 | if len(sys.argv) == 1: 35 | print('Usage: python3 {} '.format(sys.argv[0])) 36 | sys.exit(1) 37 | main() 38 | -------------------------------------------------------------------------------- /get_dates.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | """ 4 | This script reads the resources.arsc files in the malicous_apk 5 | and benign_apk folders and copies the modified file dates into a 6 | JSON file for later analysis 7 | 8 | The output data format is as follows: 9 | {"features": ["1222000000_to_1222111111", ...], 10 | "apps": {"999eca2457729e371355aea5faa38e14.apk": {"vector": [0,0,0,1], "malicious": [0,1]}, ...}} 11 | """ 12 | 13 | import os 14 | import json 15 | import glob 16 | import time 17 | import random 18 | from configparser import ConfigParser 19 | 20 | __author__='mkkeffeler' 21 | 22 | def main(): 23 | config = ConfigParser() 24 | config.read('config.ini') 25 | NUM_DATE_BUCKETS = config.getint('AMA', 'NUM_DATE_BUCKETS') 26 | 27 | date_buckets = [] # list of strings naming each date range used in the dataset 28 | app_date_map = {} # mapping from android app names to lists of dates 29 | app_malicious_map = {} # mapping from android app names to 1 or 0 for malware or goodware 30 | apps_per_bucket = {} # number of apps in each date range 31 | relevant_buckets = [] # subset of buckets that will be used (based on NUM_DATE_BUCKETS) 32 | all_apps = {} # mapping combining app_date_map and app_malicious_map using bits 33 | apps_found_per_bucket = {} # number of apps added to all_apps for each bucket 34 | root_dir = os.getcwd() 35 | 36 | num_apps_before = 0 37 | num_apps_after = 0 38 | num_file_not_found = 0 39 | fnfe = open('file_not_found_error', 'w') 40 | for i, directory in enumerate(['benign_apk', 'malicious_apk']): 41 | os.chdir(directory) 42 | for filename in glob.glob('*.apk'): 43 | #print('Processing ' + filename) 44 | try: 45 | os.chdir(filename[:-4]) 46 | if os.path.exists('classes.dex'): 47 | mtime = os.stat('classes.dex') 48 | else: 49 | mtime = os.stat('resources.arsc') 50 | if mtime.st_mtime < 1222000000: 51 | num_apps_before += 1 52 | if mtime.st_mtime > time.time(): 53 | num_apps_after += 1 54 | except FileNotFoundError: 55 | num_file_not_found += 1 56 | fnfe.write(filename + '\n') 57 | os.chdir(os.path.join(root_dir, directory)) 58 | continue 59 | app_date_map[filename] = int(mtime.st_mtime) 60 | app_name = filename 61 | # make a one-hot bit vector of length 2. 
1st bit set if malicious, otherwise 2nd bit 62 | app_malicious_map[app_name] = [1,0] if i else [0,1] 63 | os.chdir(os.pardir) 64 | os.chdir(root_dir) 65 | fnfe.close() 66 | 67 | # Android was released Sept. 23, 2008 68 | startdate = 1222000000 69 | secondsinmonth = 60 * 60 * 24 * 28 70 | enddate = startdate + secondsinmonth 71 | while True: 72 | date_buckets.append(str(startdate)+"_to_"+str(enddate)) 73 | # Apps can't have been made in the future 74 | if(enddate >= time.time()): 75 | break 76 | startdate = enddate 77 | enddate = startdate + secondsinmonth 78 | 79 | # Count the number of apps per date range so we can ensure there's an equal number in each 80 | for app_name in app_date_map: 81 | for bucket in date_buckets: 82 | mtime = app_date_map[app_name] 83 | startdate = int(bucket.split("_to_")[0]) 84 | enddate = int(bucket.split("_to_")[1]) 85 | if (startdate <= mtime) and (mtime < enddate): 86 | if bucket not in apps_per_bucket: 87 | apps_per_bucket[bucket] = app_malicious_map[app_name] 88 | else: 89 | if app_malicious_map[app_name][0] == 1: 90 | apps_per_bucket[bucket][0] += 1 91 | else: 92 | apps_per_bucket[bucket][1] += 1 93 | break 94 | with open('apps_per_bucket.json', 'w') as f: 95 | json.dump(apps_per_bucket, f) 96 | relevant_buckets = sorted(apps_per_bucket, key=lambda bucket: min(apps_per_bucket[bucket]), reverse=True) 97 | if len(relevant_buckets) > NUM_DATE_BUCKETS: 98 | relevant_buckets = relevant_buckets[:NUM_DATE_BUCKETS] 99 | apps_per_bucket_limit = min(apps_per_bucket[relevant_buckets[-1]]) # number of apps of each type (benign/malicious) of each bucket 100 | 101 | # Now add apps_per_bucket_limit apps from each bucket to all_apps 102 | for bucket in relevant_buckets: 103 | apps_found_per_bucket[bucket] = [0,0] 104 | for app_name in app_date_map: 105 | date_vector = [] 106 | in_relevant_bucket = False 107 | this_bucket = '' 108 | for bucket in relevant_buckets: 109 | mtime = app_date_map[app_name] 110 | startdate = int(bucket.split("_to_")[0]) 111 | enddate = int(bucket.split("_to_")[1]) 112 | if (startdate <= mtime) and (mtime < enddate): 113 | date_vector.append(1) 114 | in_relevant_bucket = True 115 | this_bucket = bucket 116 | else: 117 | date_vector.append(0) 118 | malicious = (app_malicious_map[app_name] == [1,0]) 119 | if in_relevant_bucket and apps_found_per_bucket[this_bucket][0 if malicious else 1] < apps_per_bucket_limit: 120 | apps_found_per_bucket[this_bucket][0 if malicious else 1] += 1 121 | all_apps[app_name] = {'vector': date_vector, 'malicious': app_malicious_map[app_name]} 122 | with open('app_date_vectors.json', 'w') as outfile: 123 | json.dump({'features': relevant_buckets, 'apps': all_apps}, outfile) 124 | print('Wrote data on ' + str(len(relevant_buckets)) + ' date buckets and ' + str(len(all_apps)) + ' apps to a file.') 125 | print('{} apps were before Android began and {} were after today'.format(num_apps_before, num_apps_after)) 126 | print('{} classes.dex files were not found'.format(num_file_not_found)) 127 | 128 | if __name__=='__main__': 129 | main() 130 | -------------------------------------------------------------------------------- /match_features.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | """ 4 | This script reads the JSON file passed as the first command line argument to 5 | get the names of features and feature_weights.json (written by tensorflow_learn.py) 6 | to get the feature weights from a trained ML model. 
It then matches the feature 7 | weights to the human-readable names for them, and prints them out, sorted by the weights. 8 | So the first feature listed is the one most indicative of maliciousness, and the first 9 | one in the second list is the one most indicative of benign (the lists are just mirror 10 | images of each other). 11 | """ 12 | 13 | import json 14 | import sys 15 | 16 | def main(): 17 | with open(sys.argv[1]) as vectors: 18 | # Dataset of feature names that were used in the model 19 | feature_names = json.load(vectors)['features'] 20 | 21 | with open('feature_weights.json') as weights: 22 | # Tensorflow model calculated weights for every feature 23 | feature_weights = json.load(weights) 24 | 25 | # Separate malicous and benign weights 26 | malicious_weights = [weight[0] for weight in feature_weights] 27 | benign_weights = [weight[1] for weight in feature_weights] 28 | 29 | # Sort weights in descending order 30 | malicious_indices=sorted(range(len(malicious_weights)), key=lambda k: malicious_weights[k], reverse=True) 31 | benign_indices=sorted(range(len(benign_weights)), key=lambda k: benign_weights[k], reverse=True) 32 | 33 | # Prints the rank of each feature, its weight, and the feature name 34 | print('MALICIOUS FEATURE RANKINGS:\n') 35 | for i,x in enumerate(malicious_indices): 36 | print ('{}. {} {}'.format(i, feature_names[x], malicious_weights[x])) 37 | 38 | print ('\n\n\n\n\nBENIGN FEATURE RANKINGS:\n') 39 | for i,x in enumerate(benign_indices): 40 | print ('{}. {} {}'.format(i, feature_names[x], benign_weights[x])) 41 | 42 | if __name__=='__main__': 43 | main() 44 | -------------------------------------------------------------------------------- /model_training/saved_model/model_android_malware_classification/keras_metadata.pb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/model_training/saved_model/model_android_malware_classification/keras_metadata.pb -------------------------------------------------------------------------------- /model_training/saved_model/model_android_malware_classification/saved_model.pb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/model_training/saved_model/model_android_malware_classification/saved_model.pb -------------------------------------------------------------------------------- /model_training/saved_model/model_android_malware_classification/variables/variables.data-00000-of-00001: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/model_training/saved_model/model_android_malware_classification/variables/variables.data-00000-of-00001 -------------------------------------------------------------------------------- /model_training/saved_model/model_android_malware_classification/variables/variables.index: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/model_training/saved_model/model_android_malware_classification/variables/variables.index -------------------------------------------------------------------------------- /model_training/train_model.py: 
-------------------------------------------------------------------------------- 1 | #Basic classification: Classify Android malwares 2 | import json 3 | import sys 4 | 5 | # TensorFlow and tf.keras 6 | import tensorflow as tf 7 | 8 | # Helper libraries 9 | import numpy as np 10 | import matplotlib.pyplot as plt 11 | 12 | print(tf.__version__) 13 | 14 | # Import the datasets from CIC 15 | with open(sys.argv[1]) as infile: 16 | dataset = json.load(infile) 17 | 18 | app_names = [app for app in dataset['apps']] 19 | features = np.array(dataset['features']) 20 | train_files = np.array([np.array(dataset['apps'][app]['vector']) for app in app_names]) 21 | print(train_files.shape) 22 | train_labels = np.array([dataset['apps'][app]['malicious'][0] for app in app_names]) 23 | 24 | test_files = np.array([dataset['apps'][app]['vector'] for app in app_names]) 25 | test_labels = np.array([dataset['apps'][app]['malicious'][0] for app in app_names]) 26 | 27 | class_names = ['benign', 'malicious'] 28 | 29 | ## Explore the data 30 | 31 | # print(train_files.shape) 32 | # print(len(train_labels)) 33 | # print(train_labels) 34 | # print(test_files.shape) 35 | # print(len(test_labels)) 36 | 37 | ## Build the model 38 | ### Set up the layers 39 | model = tf.keras.Sequential([ 40 | tf.keras.layers.Flatten(input_shape=(100, )), 41 | tf.keras.layers.Dense(128, activation='relu'), 42 | tf.keras.layers.Dense(10) 43 | ]) 44 | 45 | ### Compile the model 46 | model.compile(optimizer='adam', 47 | loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 48 | metrics=['accuracy']) 49 | 50 | ## Train the model 51 | ### Feed the model 52 | model.fit(train_files, train_labels, epochs=10) 53 | model.save('saved_model/model_android_malware_classification') 54 | test_loss, test_acc = model.evaluate(test_files, test_labels, verbose=2) 55 | 56 | print('\nTest accuracy:', test_acc) 57 | 58 | ### Make predictions 59 | probability_model = tf.keras.Sequential([model, 60 | tf.keras.layers.Softmax()]) 61 | 62 | predictions = probability_model.predict(test_files) 63 | # print(app_names[190]) 64 | # print(predictions[190]) 65 | # print(np.argmax(predictions[190])) 66 | # print(test_labels[190]) 67 | 68 | ## Use the trained model 69 | file = test_files[190] 70 | print(file.shape) 71 | file = (np.expand_dims(file,0)) 72 | print(file.shape) 73 | predictions_single = probability_model.predict(file) 74 | 75 | print(predictions_single) 76 | 77 | print(np.argmax(predictions_single[0])) -------------------------------------------------------------------------------- /parse_disassembled.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | """ 4 | This script reads the disassembled.code files in the malicous_apk 5 | and benign_apk folders and copies the Android and Java methods called 6 | into a JSON file for later analysis. 
7 | 8 | The output data format is as follows: 9 | {"features": ["java/lang/String.length", ...], 10 | "apps": {"999eca2457729e371355aea5faa38e14.apk": {"vector": [0,0,0,1], "malicious": [0,1]}, ...}} 11 | """ 12 | 13 | import os 14 | import json 15 | import glob 16 | 17 | 18 | 19 | def main(): 20 | all_methods = [] # list of strings naming each method used in the dataset 21 | app_method_map = {} # mapping from android app names to lists of methods 22 | app_malicious_map = {} # mapping from android app names to 1 or 0 for malware or goodware 23 | root_dir = os.getcwd() 24 | for i, directory in enumerate(['benign_apk', 'malicious_apk']): 25 | os.chdir(directory) 26 | category_root_dir = os.getcwd() 27 | for filename in glob.glob('*.apk'): 28 | try: 29 | print('Processing ' + filename) 30 | os.chdir(filename[:-4]) 31 | with open('disassembled.code') as disassembled_code: 32 | app_name = filename 33 | # make a one-hot bit vector of length 2. 1st bit set if malicious, otherwise 2nd bit 34 | app_malicious_map[app_name] = [1,0] if i else [0,1] 35 | # parse the file and record any interesting methods 36 | app_method_map[app_name] = [] 37 | for line in disassembled_code.readlines(): 38 | try: 39 | method = line.split('// Method ')[1].split(':')[0] 40 | #if not method.startswith('java') and not method.startswith('android'): 41 | if not method.startswith('java'): 42 | continue 43 | # Comment the below line to use methods rather than classes 44 | method = method.split('.')[0] 45 | # the method is probably obfuscated; ignore it 46 | if len(method.split('/')[-1]) < 4 or len(method.split('/')[-2]) == 1: 47 | continue 48 | if method not in all_methods: 49 | all_methods.append(method) 50 | if method not in app_method_map[app_name]: 51 | app_method_map[app_name].append(method) 52 | except IndexError: 53 | continue 54 | except FileNotFoundError as e: 55 | print(e) 56 | finally: 57 | os.chdir(category_root_dir) 58 | os.chdir(root_dir) 59 | all_apps = {} # mapping combining app_methods_map and app_malicious_map using bits 60 | for app_name in app_method_map: 61 | bit_vector = [1 if m in app_method_map[app_name] else 0 for m in all_methods] 62 | all_apps[app_name] = {'vector': bit_vector, 'malicious': app_malicious_map[app_name]} 63 | with open('app_method_vectors.json', 'w') as outfile: 64 | json.dump({'features': all_methods, 'apps': all_apps}, outfile) 65 | print('Wrote data on ' + str(len(all_methods)) + ' methods and ' + str(len(all_apps)) + ' apps to a file.') 66 | 67 | if __name__=='__main__': 68 | main() 69 | -------------------------------------------------------------------------------- /parse_xml.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | """ 4 | This script reads the AndroidManifest.xml files in the malicous_apk 5 | and benign_apk folders and copies the requested permissions into a 6 | JSON file for later analysis 7 | 8 | The output data format is as follows: 9 | {"features": ["ANDROID.PERMISSION.RECEIVE_BOOT_COMPLETED", ...], 10 | "apps": {"999eca2457729e371355aea5faa38e14.apk": {"vector": [0,0,0,1], "malicious": [0,1]}, ...}} 11 | """ 12 | 13 | import os 14 | from defusedxml import ElementTree 15 | import json 16 | import glob 17 | 18 | 19 | 20 | def main(): 21 | all_permissions = [] # list of strings naming each permission used in the dataset 22 | app_permission_map = {} # mapping from android app names to lists of permissions 23 | app_malicious_map = {} # mapping from android app names to 1 or 0 for malware or goodware 24 | 
root_dir = os.getcwd() 25 | for i, directory in enumerate(['benign_apk', 'malicious_apk']): 26 | os.chdir(directory) 27 | category_root_dir = os.getcwd() 28 | for filename in glob.glob('*.apk'): 29 | print('Processing ' + filename) 30 | try: 31 | os.chdir(filename[:-4]) 32 | with open('AndroidManifest.xml') as xml_file: 33 | et = ElementTree.parse(xml_file) 34 | except (ElementTree.ParseError, UnicodeDecodeError, FileNotFoundError): 35 | print('Parsing error encountered for ' + filename) 36 | os.chdir(category_root_dir) 37 | continue 38 | app_name = filename 39 | # make a one-hot bit vector of length 2. 1st bit set if malicious, otherwise 2nd bit 40 | app_malicious_map[app_name] = [1,0] if i else [0,1] 41 | permissions = et.getroot().findall('./uses-permission') 42 | app_permission_map[app_name] = [] 43 | for permission in permissions: 44 | try: 45 | permission_name = permission.attrib['{http://schemas.android.com/apk/res/android}name'].upper() 46 | if not permission_name.startswith('ANDROID.PERMISSION'): continue # ignore custom permissions 47 | if permission_name not in all_permissions: all_permissions.append(permission_name) 48 | app_permission_map[app_name].append(permission_name) 49 | except KeyError: 50 | pass 51 | os.chdir(os.pardir) 52 | os.chdir(root_dir) 53 | all_apps = {} # mapping combining app_permission_map and app_malicious_map using bits 54 | for app_name in app_permission_map: 55 | bit_vector = [1 if p in app_permission_map[app_name] else 0 for p in all_permissions] 56 | all_apps[app_name] = {'vector': bit_vector, 'malicious': app_malicious_map[app_name]} 57 | with open('app_permission_vectors.json', 'w') as outfile: 58 | json.dump({'features': all_permissions, 'apps': all_apps}, outfile) 59 | print('Wrote data on ' + str(len(all_permissions)) + ' permissions and ' + str(len(all_apps)) + ' apps to a file.') 60 | 61 | if __name__=='__main__': 62 | main() 63 | -------------------------------------------------------------------------------- /parse_xml_demo.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | """ 4 | This script reads the AndroidManifest.xml files in the malicous_apk 5 | and benign_apk folders and copies the requested permissions into a 6 | JSON file for later analysis 7 | 8 | The output data format is as follows: 9 | {"features": ["ANDROID.PERMISSION.RECEIVE_BOOT_COMPLETED", ...], 10 | "apps": {"999eca2457729e371355aea5faa38e14.apk": {"vector": [0,0,0,1], "malicious": [0,1]}, ...}} 11 | """ 12 | 13 | import os 14 | from defusedxml import ElementTree 15 | import json 16 | import glob 17 | 18 | 19 | 20 | def main(): 21 | all_permissions = [] # list of strings naming each permission used in the dataset 22 | app_permission_map = {} # mapping from android app names to lists of permissions 23 | app_malicious_map = {} # mapping from android app names to 1 or 0 for malware or goodware 24 | root_dir = os.getcwd() 25 | for i, directory in enumerate(['benign_apk', 'malicious_apk']): 26 | os.chdir(directory) 27 | category_root_dir = os.getcwd() 28 | for filename in glob.glob('*.apk'): 29 | print('Processing ' + filename) 30 | try: 31 | os.chdir(filename[:-4]) 32 | with open('AndroidManifest.xml') as xml_file: 33 | et = ElementTree.parse(xml_file) 34 | except (ElementTree.ParseError, UnicodeDecodeError, FileNotFoundError): 35 | print('Parsing error encountered for ' + filename) 36 | os.chdir(category_root_dir) 37 | continue 38 | app_name = filename 39 | # make a one-hot bit vector of length 2. 
1st bit set if malicious, otherwise 2nd bit 40 | app_malicious_map[app_name] = [1,0] if i else [0,1] 41 | permissions = et.getroot().findall('./uses-permission') 42 | app_permission_map[app_name] = [] 43 | for permission in permissions: 44 | try: 45 | permission_name = permission.attrib['{http://schemas.android.com/apk/res/android}name'].upper() 46 | if not permission_name.startswith('ANDROID.PERMISSION'): continue # ignore custom permissions 47 | if permission_name not in all_permissions: all_permissions.append(permission_name) 48 | app_permission_map[app_name].append(permission_name) 49 | except KeyError: 50 | pass 51 | os.chdir(os.pardir) 52 | os.chdir(root_dir) 53 | all_apps = {} # mapping combining app_permission_map and app_malicious_map using bits 54 | for app_name in app_permission_map: 55 | bit_vector = [1 if p in app_permission_map[app_name] else 0 for p in all_permissions] 56 | all_apps[app_name] = {'vector': bit_vector, 'malicious': app_malicious_map[app_name]} 57 | with open('app_permission_vectors.json', 'w') as outfile: 58 | json.dump({'features': all_permissions, 'apps': all_apps}, outfile) 59 | print('Wrote data on ' + str(len(all_permissions)) + ' permissions and ' + str(len(all_apps)) + ' apps to a file.') 60 | 61 | if __name__=='__main__': 62 | main() 63 | -------------------------------------------------------------------------------- /plot_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | # This script plots the data produced by run_trials.sh 4 | 5 | import matplotlib.pyplot as plt 6 | from matplotlib.axes import Axes 7 | import numpy 8 | 9 | CSV_FILE = 'results.csv' 10 | 11 | def main(): 12 | # read the data from the CSV files 13 | data = numpy.genfromtxt(CSV_FILE, delimiter=',', names=True) 14 | 15 | # plot training steps vs classification accuracy 16 | plt.plot(data['n'], data['false_positive'], 'r.', markersize=20) 17 | plt.plot(data['n'], data['false_negative'], 'g.', markersize=20) 18 | plt.plot(data['n'], data['accuracy'], 'b.', markersize=20) 19 | plt.title('Number of Training Steps vs Classification Accuracy \n and False Negative/Positive Rates') 20 | plt.xlabel('Number of Training Steps') 21 | plt.ylabel('Classification Accuracy') 22 | plt.grid(True) 23 | plt.savefig('training_steps_vs_accuracy.png') 24 | 25 | if __name__=='__main__': 26 | main() 27 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | matplotlib 2 | pydotplus 3 | sklearn 4 | numpy 5 | -------------------------------------------------------------------------------- /resources.arsc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/daivc96/android-malware-classification/da779870ceeb458f55a076c0ea414780eb149201/resources.arsc -------------------------------------------------------------------------------- /run_trials.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | if [ ! 
-e "$1" ]; then 6 | echo "Usage: $0 " 7 | exit 1 8 | fi 9 | 10 | rm -f results.csv 11 | echo "n,false_positive,false_negative,accuracy" >> results.csv 12 | for n in 20 40 60 80 100 120 140 160 180 200; do 13 | m=10 14 | false_positive_sum=0 15 | false_negative_sum=0 16 | accuracy_sum=0 17 | for i in `seq 1 $m`; do 18 | output=`python3 tensorflow_learn.py $1 $n 2>/dev/null` 19 | output_arr=(${output//$'\n'/ }) 20 | echo "n: $n false_positive: ${output_arr[0]} false_negative: ${output_arr[1]} accuracy: ${output_arr[2]}" 21 | false_positive_sum=$(echo "scale=2;$false_positive_sum + ${output_arr[0]}" | bc) 22 | false_negative_sum=$(echo "scale=2;$false_negative_sum + ${output_arr[1]}" | bc) 23 | accuracy_sum=$(echo "scale=2;$accuracy_sum + ${output_arr[2]}" | bc) 24 | NUMBER_MALICIOUS="${output_arr[3]}" 25 | NUMBER_BENIGN="${output_arr[4]}" 26 | done 27 | false_positive_avg=$(echo "scale=2;$false_positive_sum/$m" | bc) 28 | false_negative_avg=$(echo "scale=2;$false_negative_sum/$m" | bc) 29 | accuracy_avg=$(echo "scale=2;$accuracy_sum/$m" | bc) 30 | echo "$n,$false_positive_avg,$false_negative_avg,$accuracy_avg" >> results.csv 31 | done 32 | 33 | python3 plot_data.py 34 | python3 match_features.py $1 >> top_features.txt 35 | 36 | rm -f RUN_INFO 37 | grep "LEARNING_RATE" config.ini >> RUN_INFO 38 | grep "NUM_CHUNKS" config.ini >> RUN_INFO 39 | grep "SHUFFLE_CHUNKS" config.ini >> RUN_INFO 40 | grep "DECAY_RATE" config.ini >> RUN_INFO 41 | echo "NUMBER_MALICIOUS = $NUMBER_MALICIOUS" >> RUN_INFO 42 | echo "NUMBER_BENIGN = $NUMBER_BENIGN" >> RUN_INFO 43 | 44 | RUN_FOLDER="results_`date +%s`" 45 | mkdir "$RUN_FOLDER" 46 | mv results.csv RUN_INFO feature_weights.json training_steps_vs_accuracy.png top_features.txt "$RUN_FOLDER" 47 | cp $1 "$RUN_FOLDER" 48 | -------------------------------------------------------------------------------- /sklearn_forest.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | import os 4 | import json 5 | import sys 6 | import sklearn 7 | from sklearn.model_selection import train_test_split 8 | from sklearn.model_selection import GridSearchCV 9 | from sklearn import ensemble 10 | from sklearn.metrics import confusion_matrix, classification_report 11 | # from sklearn.externals import joblib 12 | import joblib 13 | 14 | def train_forest_classifer(features, labels, model_output_path): 15 | """ 16 | train_forest_classifer will train a RandomForestClassifier 17 | 18 | features: 2D array of each input feature for each sample 19 | labels: array of string labels classifying each sample 20 | model_output_path: path for storing the trained forest model 21 | """ 22 | # save 20% of data for performance evaluation 23 | X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2) 24 | 25 | param = [ 26 | { 27 | "max_depth": [None, 10, 100, 1000, 10000], 28 | "n_estimators": [1, 10, 100] 29 | } 30 | ] 31 | 32 | forest = ensemble.RandomForestClassifier(random_state=0) 33 | 34 | # 10-fold cross validation, use 4 thread as each fold and each parameter set can be train in parallel 35 | clf = GridSearchCV(forest, param, 36 | cv=10, n_jobs=20, verbose=3) 37 | 38 | clf.fit(X_train, y_train) 39 | 40 | if os.path.exists(model_output_path): 41 | joblib.dump(clf.best_estimator_, model_output_path) 42 | else: 43 | print("Cannot save trained forest model to {0}.".format(model_output_path)) 44 | 45 | print("\nBest parameters set:") 46 | print(clf.best_params_) 47 | 48 | y_predict=clf.predict(X_test) 49 | 50 | 
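    # The labels built in main() below are the strings '1' (malicious) and '0' (benign),
    # so sorting the unique values orders the confusion matrix as ['0', '1']: benign
    # counts appear in the first row/column and malicious counts in the second.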
labels=sorted(list(set(labels))) 51 | print("\nConfusion matrix:") 52 | print("Labels: {0}\n".format(",".join(labels))) 53 | print(confusion_matrix(y_test, y_predict, labels=labels)) 54 | 55 | print("\nClassification report:") 56 | print(classification_report(y_test, y_predict)) 57 | 58 | def main(): 59 | # load the feature data from a file 60 | with open(sys.argv[1]) as infile: 61 | dataset = json.load(infile) 62 | app_names = list(dataset['apps'].keys()) 63 | feature_vectors = [dataset['apps'][app]['vector'] for app in app_names] 64 | labels = ['1' if dataset['apps'][app]['malicious'] == [1,0] else '0' for app in app_names] 65 | train_forest_classifer(feature_vectors, labels, 'model.out') 66 | 67 | if __name__=='__main__': 68 | main() 69 | -------------------------------------------------------------------------------- /sklearn_svm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | import os 4 | import json 5 | import sys 6 | import sklearn 7 | from sklearn.model_selection import train_test_split 8 | from sklearn.model_selection import GridSearchCV 9 | from sklearn import ensemble 10 | from sklearn.svm import SVC 11 | from sklearn.metrics import confusion_matrix, classification_report 12 | # from sklearn.externals import joblib 13 | import joblib 14 | 15 | # Code credit to https://code.oursky.com/tensorflow-svm-image-classifications-engine/ 16 | 17 | def train_svm_classifer(features, labels, model_output_path): 18 | """ 19 | train_svm_classifer will train a SVM, saved the trained and SVM model and 20 | report the classification performance 21 | 22 | features: 2D array of each input feature for each sample 23 | labels: array of string labels classifying each sample 24 | model_output_path: path for storing the trained svm model 25 | """ 26 | # save 20% of data for performance evaluation 27 | X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2) 28 | 29 | param = [ 30 | { 31 | "kernel": ["linear"], 32 | "C": [1, 10, 100, 1000] 33 | }, 34 | { 35 | "kernel": ["rbf"], 36 | "C": [1, 10, 100, 1000], 37 | "gamma": [1e-2, 1e-3, 1e-4, 1e-5] 38 | } 39 | ] 40 | 41 | # request probability estimation 42 | svm = SVC(probability=True) 43 | 44 | # 10-fold cross validation, use 4 thread as each fold and each parameter set can be train in parallel 45 | clf = GridSearchCV(svm, param, 46 | cv=10, n_jobs=20, verbose=3) 47 | 48 | clf.fit(X_train, y_train) 49 | 50 | if os.path.exists(model_output_path): 51 | joblib.dump(clf.best_estimator_, model_output_path) 52 | else: 53 | print("Cannot save trained svm model to {0}.".format(model_output_path)) 54 | 55 | print("\nBest parameters set:") 56 | print(clf.best_params_) 57 | 58 | y_predict=clf.predict(X_test) 59 | 60 | labels=sorted(list(set(labels))) 61 | print("\nConfusion matrix:") 62 | print("Labels: {0}\n".format(",".join(labels))) 63 | print(confusion_matrix(y_test, y_predict, labels=labels)) 64 | 65 | print("\nClassification report:") 66 | print(classification_report(y_test, y_predict)) 67 | 68 | def main(): 69 | # load the feature data from a file 70 | with open(sys.argv[1]) as infile: 71 | dataset = json.load(infile) 72 | app_names = list(dataset['apps'].keys()) 73 | feature_vectors = [dataset['apps'][app]['vector'] for app in app_names] 74 | labels = ['1' if dataset['apps'][app]['malicious'] == [1,0] else '0' for app in app_names] 75 | train_svm_classifer(feature_vectors, labels, 'model.out') 76 | 77 | if __name__=='__main__': 78 | main() 
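# A minimal usage sketch (the filenames are assumptions, and it reuses the json and
# joblib imports above): reload the estimator that train_svm_classifer persists and
# score one app's feature vector from the same JSON layout the parse_* scripts emit.
# Note the function above only calls joblib.dump() when model_output_path already
# exists, so create an empty model.out before training if the model should be saved.
def score_one_app(vectors_json_path, model_path='model.out'):
    # load the SVC selected by GridSearchCV (refit on the training split)
    clf = joblib.load(model_path)
    with open(vectors_json_path) as infile:
        dataset = json.load(infile)
    # labels were stored as the strings '1' (malicious) and '0' (benign)
    app_name, entry = next(iter(dataset['apps'].items()))
    print(app_name, clf.predict([entry['vector']])[0])
    print(clf.predict_proba([entry['vector']]))  # available because SVC(probability=True)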
-------------------------------------------------------------------------------- /sklearn_tree.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | 4 | import os 5 | import json 6 | import sys 7 | import sklearn 8 | from sklearn.model_selection import train_test_split 9 | from sklearn.model_selection import GridSearchCV 10 | from sklearn import ensemble 11 | from sklearn.svm import SVC 12 | from sklearn.metrics import confusion_matrix, classification_report 13 | import joblib 14 | import sklearn 15 | import pydotplus 16 | from sklearn import tree 17 | from sklearn.metrics import confusion_matrix, classification_report 18 | 19 | 20 | def train_tree_classifer(features, labels, model_output_path): 21 | """ 22 | train_tree_classifer will train a DecisionTree and write it out to a pdf file 23 | 24 | features: 2D array of each input feature for each sample 25 | labels: array of string labels classifying each sample 26 | model_output_path: path for storing the trained tree model 27 | """ 28 | # save 20% of data for performance evaluation 29 | X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2) 30 | 31 | param = [ 32 | { 33 | "max_depth": [None, 10, 100, 1000, 10000] 34 | } 35 | ] 36 | 37 | dtree = tree.DecisionTreeClassifier(random_state=0) 38 | 39 | # 10-fold cross validation, use 4 thread as each fold and each parameter set can be train in parallel 40 | clf = GridSearchCV(dtree, param, 41 | cv=10, n_jobs=20, verbose=3) 42 | 43 | clf.fit(X_train, y_train) 44 | 45 | if os.path.exists(model_output_path): 46 | joblib.dump(clf.best_estimator_, model_output_path) 47 | else: 48 | print("Cannot save trained tree model to {0}.".format(model_output_path)) 49 | 50 | dot_data = tree.export_graphviz(clf.best_estimator_, out_file=None) 51 | graph = pydotplus.graph_from_dot_data(dot_data) 52 | graph.write_pdf('best_tree.pdf') 53 | 54 | print("\nBest parameters set:") 55 | print(clf.best_params_) 56 | 57 | y_predict=clf.predict(X_test) 58 | 59 | labels=sorted(list(set(labels))) 60 | print("\nConfusion matrix:") 61 | print("Labels: {0}\n".format(",".join(labels))) 62 | print(confusion_matrix(y_test, y_predict, labels=labels)) 63 | 64 | print("\nClassification report:") 65 | print(classification_report(y_test, y_predict)) 66 | 67 | def main(): 68 | # load the feature data from a file 69 | with open(sys.argv[1]) as infile: 70 | dataset = json.load(infile) 71 | app_names = list(dataset['apps'].keys()) 72 | feature_vectors = [dataset['apps'][app]['vector'] for app in app_names] 73 | labels = ['1' if dataset['apps'][app]['malicious'] == [1,0] else '0' for app in app_names] 74 | train_tree_classifer(feature_vectors, labels, 'model.out') 75 | 76 | if __name__=='__main__': 77 | main() 78 | -------------------------------------------------------------------------------- /sort_malicious.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | # This script looks at the .apk files in the 'all_apks' folder, 4 | # submits their hashes to andrototal.org to see if they're malicious, 5 | # and sorts them into folders based on that ('malicious_apk' and 'benign_apk') 6 | # (make sure those folders exist before running this) 7 | 8 | import subprocess 9 | import os 10 | import json 11 | import glob 12 | from configparser import ConfigParser 13 | 14 | def main(): 15 | config = ConfigParser() 16 | config.read('config.ini') 17 | API_KEY = config.get('AMA', 'API_KEY') 18 | for apk 
in glob.glob('all_apks/*.apk'): 19 | if not os.path.isfile(apk[:-4] + '_andrototal.json'): 20 | print('Checking ' + apk) 21 | try: 22 | analysis = subprocess.check_output('./tools/andrototal_cli.py analysis -at-key {} {}'.format(API_KEY, apk.split('/')[1][:-4]), shell=True).decode('utf-8') 23 | except subprocess.CalledProcessError as e: 24 | print(str(e)) 25 | continue 26 | with open(apk[:-4] + '_andrototal.json', 'w') as out_file: 27 | out_file.write(analysis) 28 | try: 29 | with open(apk[:-4] + '_andrototal.json') as json_file: 30 | analysis = json.load(json_file) 31 | if type(analysis) == str: raise ValueError 32 | except ValueError: 33 | with open('invalid_andrototal_responses', 'a') as out_file: 34 | out_file.write(apk.split('/')[1] + '\n') 35 | continue 36 | if all([test['result'] == 'NO_THREAT_FOUND' for test in analysis['tests']]): 37 | os.rename(apk, 'benign_apk/' + apk.split('/')[1]) 38 | else: 39 | os.rename(apk, 'malicious_apk/' + apk.split('/')[1]) 40 | 41 | if __name__=='__main__': 42 | main() 43 | -------------------------------------------------------------------------------- /tensorflow_learn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | """ 4 | This script reads app_permission_vectors.json (written by parse_xml.py) and 5 | feeds the data into a tensorflow "neural network" to try to learn from it. 6 | """ 7 | 8 | 9 | import tensorflow.compat.v1 as tf 10 | tf.disable_v2_behavior() 11 | import numpy as np 12 | import json 13 | import math 14 | import random 15 | import sys 16 | from configparser import ConfigParser 17 | 18 | def main(): 19 | config = ConfigParser() 20 | config.read('config.ini') 21 | LEARNING_RATE = config.getfloat('AMA', 'LEARNING_RATE') 22 | NUM_CHUNKS = config.getint('AMA', 'NUM_CHUNKS') 23 | SHUFFLE_CHUNKS = config.getboolean('AMA', 'SHUFFLE_CHUNKS') 24 | DECAY_RATE = config.getfloat('AMA', 'DECAY_RATE') 25 | 26 | # load the data from a file 27 | with open(sys.argv[1]) as infile: 28 | dataset = json.load(infile) 29 | 30 | # placeholder for any number of bit vectors 31 | x = tf.placeholder(tf.float32, [None, len(dataset['features'])]) 32 | 33 | # weights for each synapse 34 | #Creates a tensor with all elements set to zero. 
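# (What follows is a single-layer softmax model, i.e. logistic regression over the
# feature bits: y = softmax(xW + b). W has shape [num_features, 2]; its first column
# scores the malicious class and the second the benign class, matching the [1,0]/[0,1]
# one-hot labels. After training, W is dumped to feature_weights.json and ranked by
# match_features.py.)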
35 | W = tf.Variable(tf.zeros([len(dataset['features']), 2])) 36 | 37 | b = tf.Variable(tf.zeros([2])) 38 | 39 | # results 40 | #the softmax function is used to normalize the outputs, converting them from weighted sum values into probabilities that sum to one 41 | y = tf.nn.softmax(tf.matmul(x, W) + b) 42 | 43 | # placeholder for correct answers 44 | y_ = tf.placeholder(tf.float32, [None, 2]) 45 | 46 | cross_entropy = -tf.reduce_sum(y_*tf.log(y)) 47 | init = tf.initialize_all_variables() 48 | # allocate resources and start the session 49 | sess = tf.Session() 50 | 51 | sess.run(init) 52 | 53 | #TODO make a clean interface for this 54 | #TODO make sure the training sample contains reasonable proportions of benign/malicious 55 | malicious_app_names = [app for app in dataset['apps'] if dataset['apps'][app]['malicious'] == [1,0]] 56 | benign_app_names = [app for app in dataset['apps'] if dataset['apps'][app]['malicious'] == [0,1]] 57 | # break up the data into chunks for training and testing 58 | malicious_app_name_chunks = list(chunks(malicious_app_names, math.floor(len(dataset['apps']) / NUM_CHUNKS))) 59 | benign_app_name_chunks = list(chunks(benign_app_names, math.floor(len(dataset['apps']) / NUM_CHUNKS))) 60 | if SHUFFLE_CHUNKS: 61 | random.shuffle(malicious_app_name_chunks) 62 | random.shuffle(benign_app_name_chunks) 63 | 64 | # the first chunk of each will be used for testing (and the rest for training) 65 | for i in range(int(sys.argv[2])): 66 | # Apply exponential decay to the learning rate 67 | # decayed Learning Rate increases accuracy by 3-5% 68 | decayed_learning_rate = tf.train.exponential_decay(LEARNING_RATE, i, 1, DECAY_RATE, staircase=False) 69 | train_step = tf.train.GradientDescentOptimizer(decayed_learning_rate).minimize(cross_entropy) 70 | 71 | j = random.randrange(1, len(malicious_app_name_chunks)) 72 | k = random.randrange(1, len(benign_app_name_chunks)) 73 | app_names_chunk = malicious_app_name_chunks[j] + benign_app_name_chunks[k] 74 | batch_xs = [dataset['apps'][app]['vector'] for app in app_names_chunk] 75 | batch_ys = [dataset['apps'][app]['malicious'] for app in app_names_chunk] 76 | sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys}) 77 | # ''' 78 | # Runs operations and evaluates tensors in fetches. 79 | # run( 80 | # fetches, feed_dict=None, options=None, run_metadata=None 81 | # ) 82 | # ''' 83 | feature_weights=sess.run([W])[0] 84 | # store feature weights that were calculated. 
Used by match_features.py 85 | with open ('feature_weights.json','w') as f: 86 | json.dump(feature_weights.tolist(),f) 87 | 88 | #TESTING 89 | 90 | app_names_chunk = malicious_app_name_chunks[0] + benign_app_name_chunks[0] 91 | test_xs = [dataset['apps'][app]['vector'] for app in app_names_chunk] 92 | test_ys = [dataset['apps'][app]['malicious'] for app in app_names_chunk] 93 | 94 | prediction_diff = tf.subtract(tf.argmax(y,1), tf.argmax(y_,1)) 95 | # calculate how often benign apps were thought to be malicious 96 | false_positive = tf.equal(prediction_diff, tf.constant(-1,shape=[len(test_ys)],dtype=tf.int64)) 97 | # calculate how often malicious apps were thought to be benign 98 | false_negative = tf.equal(prediction_diff, tf.constant(1,shape=[len(test_ys)],dtype=tf.int64)) 99 | # recalculate prediction accuracy 100 | correct_prediction = tf.equal(prediction_diff, tf.constant(0,shape=[len(test_ys)],dtype=tf.int64)) 101 | 102 | false_positive_rate = tf.reduce_mean(tf.cast(false_positive, tf.float32)) 103 | false_negative_rate = tf.reduce_mean(tf.cast(false_negative, tf.float32)) 104 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 105 | 106 | print(sess.run(false_positive_rate, feed_dict={x: test_xs, y_: test_ys})) 107 | print(sess.run(false_negative_rate, feed_dict={x: test_xs, y_: test_ys})) 108 | print(sess.run(accuracy, feed_dict={x: test_xs, y_: test_ys})) 109 | print(str(len(malicious_app_names))) 110 | print(str(len(benign_app_names))) 111 | 112 | def chunks(l, n): 113 | """Yield successive n-sized chunks from l.""" 114 | for i in range(0, len(l), n): 115 | yield l[i:i+n] 116 | 117 | if __name__=='__main__': 118 | main() 119 | -------------------------------------------------------------------------------- /valid_apks.txt: -------------------------------------------------------------------------------- 1 | cc.pacer.androidapp.apk 2 | --------------------------------------------------------------------------------