├── requirements.txt
├── .gitattributes
├── Images
│   └── output_images.PNG
├── Classification
│   ├── svm_binary.py
│   └── svm_multi.py
├── Pre-processing
│   ├── hist_L7_all_classes.py
│   ├── hist_all_layers_binary.py
│   └── hist_all_layers_all_classes.py
└── README.md

/requirements.txt:
--------------------------------------------------------------------------------
opencv-python
numpy
tqdm
pandas
scikit-image
scikit-learn
--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto
--------------------------------------------------------------------------------
/Images/output_images.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shivmohith/Network-Traffic-Classification/HEAD/Images/output_images.PNG
--------------------------------------------------------------------------------
/Classification/svm_binary.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
from sklearn import model_selection, svm, metrics

# Load the histogram features produced by the pre-processing scripts.
# Update this path to the CSV generated for the binary (L7) dataset.
df = pd.read_csv('dataset_L7_binary.csv')
df.replace('?', -99999, inplace=True)

X = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1)

# Linear SVM for benign (1) vs. malware (0) classification
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
svm_predictions = clf.predict(X_test)

print('BINARY CLASSIFICATION L7')
print('')

# Model accuracy on the held-out test set
accuracy = clf.score(X_test, y_test)
print("Accuracy =", accuracy)
print('')

#cm = metrics.confusion_matrix(y_test, svm_predictions)
#print("confusion matrix")
#print(cm)
#print('')

report = metrics.classification_report(y_test, svm_predictions)
print("Report")
print(report)
print()

su_vec = clf.support_vectors_
print('Support vectors')
print(su_vec)
print('')
su_vec_n = clf.n_support_
print('Number of support vectors for each class')
print(su_vec_n)
print('')
print('Total number of support vectors')
print(sum(su_vec_n))
print('')
weights = clf.coef_  # feature weights; available because the kernel is linear
print('The weights associated with each feature')
print(weights)
--------------------------------------------------------------------------------
/Classification/svm_multi.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
from sklearn import preprocessing, model_selection, metrics
from sklearn.svm import SVC

# Load the histogram features produced by the pre-processing scripts.
# Update this path to the CSV generated for the multi-class dataset.
df = pd.read_csv('dataset_L7_bin_size_8.csv')
df.replace('?', -99999, inplace=True)

X = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

# Scale each feature to [0, 1]; the scaler is fitted on the training data only.
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)

svm_model_linear = SVC(kernel='linear', C=1).fit(X_train_minmax, y_train)
svm_predictions = svm_model_linear.predict(X_test_minmax)

print('MULTI-CLASS CLASSIFICATION_L7 BIN SIZE 8')
print('')

# Model accuracy on the held-out test set
accuracy = svm_model_linear.score(X_test_minmax, y_test)
print("Accuracy =", accuracy)
print('')

report = metrics.classification_report(y_test, svm_predictions)
print("Report")
print(report)
print()

# creating a confusion matrix
#cm = metrics.confusion_matrix(y_test, svm_predictions)
#print("confusion matrix")
#print(cm)
print('')

su_vec = svm_model_linear.support_vectors_
print('Support vectors')
print(su_vec)
print('')
su_vec_n = svm_model_linear.n_support_
print('Number of support vectors for each class')
print(su_vec_n)
print('')
print('Total number of support vectors')
print(sum(su_vec_n))
print('')
weights = svm_model_linear.coef_  # one weight vector per pair of classes; available because the kernel is linear
print('The weights associated with each feature')
print(weights)
--------------------------------------------------------------------------------
/Pre-processing/hist_L7_all_classes.py:
--------------------------------------------------------------------------------
import cv2
import os
from random import shuffle
from tqdm import tqdm
import pandas as pd

# Directory containing the grayscale traffic images (update to your own path).
TRAIN_DIR = r"D:\ML mini project\main codes\train_L7"

temp = []   # histogram features of the current image
temp2 = []  # one feature row per image

BENIGN_CLASSES = ["BitTorrent", "Facetime", "FTP", "Gmail", "MySQL",
                  "Outlook", "Skype", "SMB", "Weibo", "WorldOfWarcraft"]
MALWARE_CLASSES = ["Cridex", "Geodo", "Htbot", "Miuref", "Neris",
                   "Nsis-ay", "Shifu", "Tinba", "Virut", "Zeus"]

def label(path):
    """Binary label derived from the file path: 1 for benign classes, 0 for malware classes."""
    for name in BENIGN_CLASSES:
        if name in path:
            return 1
    for name in MALWARE_CLASSES:
        if name in path:
            return 0

for fname in tqdm(os.listdir(TRAIN_DIR)):
    path = os.path.join(TRAIN_DIR, fname)
    t = label(path)
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # histSize=[32] gives 32 bins over [0, 256), i.e. a bin width of 8 intensity
    # levels; change the 4th parameter to vary the bin size.
    hist = cv2.calcHist([img], [0], None, [32], [0, 256])

    for i in range(len(hist)):
        temp.append(hist[i][0])
    temp.append(t)          # class label as the last value of the row
    temp2.append(temp)
    temp = []

shuffle(temp2)
temp3 = pd.DataFrame(temp2)  # one row per image: histogram values + label
# Name the last column 'label' so the classification scripts can select it by name.
temp3.columns = list(range(temp3.shape[1] - 1)) + ['label']
print(temp3.shape)
temp3.to_csv('dataset_L7_hog_binary.csv', index=False)
--------------------------------------------------------------------------------
/Pre-processing/hist_all_layers_binary.py:
--------------------------------------------------------------------------------
import cv2
import os
from random import shuffle
from tqdm import tqdm
import pandas as pd

# Directory containing the grayscale traffic images (update to your own path).
TRAIN_DIR = r"D:\ML mini project\main codes\train_L7"

temp = []   # histogram features of the current image
temp2 = []  # one feature row per image

BENIGN_CLASSES = ["BitTorrent", "Facetime", "FTP", "Gmail", "MySQL",
                  "Outlook", "Skype", "SMB", "Weibo", "WorldOfWarcraft"]
MALWARE_CLASSES = ["Cridex", "Geodo", "Htbot", "Miuref", "Neris",
                   "Nsis-ay", "Shifu", "Tinba", "Virut", "Zeus"]

def label(path):
    """Binary label derived from the file path: 1 for benign classes, 0 for malware classes."""
    for name in BENIGN_CLASSES:
        if name in path:
            return 1
    for name in MALWARE_CLASSES:
        if name in path:
            return 0

for fname in tqdm(os.listdir(TRAIN_DIR)):
    path = os.path.join(TRAIN_DIR, fname)
    t = label(path)
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # histSize=[32] gives 32 bins over [0, 256), i.e. a bin width of 8 intensity
    # levels; change the 4th parameter to vary the bin size.
    hist = cv2.calcHist([img], [0], None, [32], [0, 256])

    for i in range(len(hist)):
        temp.append(hist[i][0])
    temp.append(t)          # class label as the last value of the row
    temp2.append(temp)
    temp = []

shuffle(temp2)
temp3 = pd.DataFrame(temp2)  # one row per image: histogram values + label
# Name the last column 'label' so the classification scripts can select it by name.
temp3.columns = list(range(temp3.shape[1] - 1)) + ['label']
print(temp3.shape)
temp3.to_csv('dataset_L7_binary_bin_size_8.csv', index=False)
--------------------------------------------------------------------------------
/Pre-processing/hist_all_layers_all_classes.py:
--------------------------------------------------------------------------------
import cv2
import os
from random import shuffle
from tqdm import tqdm
import pandas as pd

# Directory containing the grayscale traffic images (update to your own path).
TRAIN_DIR = r"D:\ML mini project\main codes\train_L7"

temp = []   # histogram features of the current image
temp2 = []  # one feature row per image

# Benign classes are labelled 0 to 9, malware classes -1 to -10.
CLASS_LABELS = {
    "BitTorrent": 0, "Facetime": 1, "FTP": 2, "Gmail": 3, "MySQL": 4,
    "Outlook": 5, "Skype": 6, "SMB": 7, "Weibo": 8, "WorldOfWarcraft": 9,
    "Cridex": -1, "Geodo": -2, "Htbot": -3, "Miuref": -4, "Neris": -5,
    "Nsis-ay": -6, "Shifu": -7, "Tinba": -8, "Virut": -9, "Zeus": -10,
}

def label(path):
    """Class label derived from the class name appearing in the file path."""
    for name, value in CLASS_LABELS.items():
        if name in path:
            return value
for fname in tqdm(os.listdir(TRAIN_DIR)):
    path = os.path.join(TRAIN_DIR, fname)
    t = label(path)
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # histSize=[32] gives 32 bins over [0, 256), i.e. a bin width of 8 intensity
    # levels; change the 4th parameter to vary the bin size.
    hist = cv2.calcHist([img], [0], None, [32], [0, 256])

    for i in range(len(hist)):
        temp.append(hist[i][0])
    temp.append(t)          # class label as the last value of the row
    temp2.append(temp)
    temp = []

shuffle(temp2)
temp3 = pd.DataFrame(temp2)  # one row per image: histogram values + label
# Name the last column 'label' so the classification scripts can select it by name.
temp3.columns = list(range(temp3.shape[1] - 1)) + ['label']
print(temp3.shape)
temp3.to_csv('dataset_all_layers_bin_size_8.csv', index=False)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Network-Traffic-Classification
This project was done in March and April 2019 as a mini project for a machine learning course at my university.

## Description

Traffic classification is the first step for network anomaly detection or a network-based intrusion detection system, and it plays an important role in the network security domain. The traffic observed in a network can be classified into two main types: normal (benign) traffic and malware traffic. Traffic from malicious sources can cause many problems, such as flooding the network bandwidth, mounting denial-of-service (DoS) attacks and overwhelming the receiver, so detecting such unwanted traffic and separating it from benign traffic is extremely important.

This project involves converting raw network traffic, collected using the Wireshark tool, into images. These images are then used to classify the network traffic.
There are 2 types of classification:
- Classifying as benign or malware - Binary classification
- Classifying into each class under benign and malware - Multi-class classification

An accuracy of **100%** and **98%** was achieved for binary and multi-class classification respectively.

#### Dataset

The dataset used for training is USTC-TFC2016, which consists of benign traffic and malware traffic.
The following table shows the 10 classes under benign and malware.

|Benign Classes |Malware Classes |
|---------------|----------------|
|BitTorrent |Cridex |
|Facetime |Geodo |
|FTP |Htbot |
|Gmail |Miuref |
|MySQL |Neris |
|Outlook |Nsis-ay |
|Skype |Shifu |
|SMB |Tinba |
|Weibo |Virut |
|World of Warcraft |Zeus |

There are 2 types of data for each class: one contains the traffic of only the 7th layer (application layer) of the OSI model, and the other contains the traffic of all the layers of the OSI model. This project classifies the network traffic using each of these datasets individually.

## Methodology

1. Convert the raw Wireshark data in .pcap format into images. This was done using the code available at https://github.com/echowei/DeepTraffic/tree/master/1.malware_traffic_classification/2.PreprocessedTools(USTC-TK2016)
![Example Image from each class](Images/output_images.PNG)

The above images are grayscale images in which each pixel value ranges from 0 to 255.

2. Extract the histogram of each image and use it as the image's feature vector. These features are stored in a CSV file with the class label as the last column.
The histogram is computed for different bin sizes (1, 2, 4, 8).

3. Use Support Vector Machines (SVMs) for classification (a minimal sketch of steps 2 and 3 is shown after this list).
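Below is a minimal, self-contained sketch of steps 2 and 3, condensed from the scripts in the *Pre-processing* and *Classification* directories. The image directory, the output CSV name and the one-keyword labelling rule are placeholders for illustration only; the repository's scripts handle all 20 classes and several bin counts.

```python
import os
import cv2
import pandas as pd
from sklearn import model_selection, svm, metrics

IMAGE_DIR = "train_L7"   # placeholder: directory of grayscale traffic images
N_BINS = 32              # 32 bins over [0, 256) -> a bin size of 8 intensity levels

rows = []
for fname in os.listdir(IMAGE_DIR):
    img = cv2.imread(os.path.join(IMAGE_DIR, fname), cv2.IMREAD_GRAYSCALE)
    if img is None:      # skip files OpenCV cannot read
        continue
    # Grayscale histogram used as the feature vector for this image.
    hist = cv2.calcHist([img], [0], None, [N_BINS], [0, 256]).flatten()
    label = 1 if "BitTorrent" in fname else 0   # placeholder rule: 1 = benign, 0 = malware
    rows.append(list(hist) + [label])

# Store the features with the class label as the last column.
df = pd.DataFrame(rows, columns=list(range(N_BINS)) + ["label"])
df.to_csv("dataset_binary.csv", index=False)    # placeholder output name

# Train and evaluate a linear SVM on the extracted features.
X = df.drop(columns=["label"]).values
y = df["label"].values
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1)
clf = svm.SVC(kernel="linear").fit(X_train, y_train)
print("Accuracy =", clf.score(X_test, y_test))
print(metrics.classification_report(y_test, clf.predict(X_test)))
```

Writing the features to a CSV first, as the repository's scripts do, keeps feature extraction and classification decoupled, so different bin sizes can be compared without re-reading the images.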
## Running the application

1. Install the required libraries specified in *requirements.txt*
2. Extract the histogram features of the images and save them using the Python files in the Pre-processing directory
3. Perform classification by running the files in the Classification directory
   - *svm_binary.py*: performs binary classification
   - *svm_multi.py*: performs multi-class classification

**Replace the file paths in the code with the appropriate file paths.**

## References
*Wei Wang, Ming Zhu, ‘Malware Traffic Classification Using Convolutional Neural Network for Representation Learning’*
--------------------------------------------------------------------------------