├── requirements.txt
├── .gitattributes
├── Images
│   └── output_images.PNG
├── Classification
│   ├── svm_binary.py
│   └── svm_multi.py
├── Pre-processing
│   ├── hist_L7_all_classes.py
│   ├── hist_all_layers_binary.py
│   └── hist_all_layers_all_classes.py
└── README.md

/requirements.txt:
--------------------------------------------------------------------------------
opencv-python
numpy
tqdm
pandas
scikit-image
scikit-learn
--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto
--------------------------------------------------------------------------------
/Images/output_images.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shivmohith/Network-Traffic-Classification/HEAD/Images/output_images.PNG
--------------------------------------------------------------------------------
/Classification/svm_binary.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
from sklearn import model_selection, svm, metrics

# Load the histogram features produced by the pre-processing scripts.
# Update this path to the CSV generated for the binary (L7) dataset.
df = pd.read_csv('dataset_L7_binary.csv')
df.replace('?', -99999, inplace=True)

X = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1)

# Linear SVM for benign (1) vs. malware (0) classification
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
svm_predictions = clf.predict(X_test)

print('BINARY CLASSIFICATION L7')
print('')

# Model accuracy on the held-out test set
accuracy = clf.score(X_test, y_test)
print("Accuracy =", accuracy)
print('')

#cm = metrics.confusion_matrix(y_test, svm_predictions)
#print("confusion matrix")
#print(cm)
#print('')

report = metrics.classification_report(y_test, svm_predictions)
print("Report")
print(report)
print()

su_vec = clf.support_vectors_
print('Support vectors')
print(su_vec)
print('')
su_vec_n = clf.n_support_
print('Number of support vectors for each class')
print(su_vec_n)
print('')
print('Total number of support vectors')
print(sum(su_vec_n))
print('')
weights = clf.coef_  # feature weights; available because the kernel is linear
print('The weights associated with each feature')
print(weights)
--------------------------------------------------------------------------------
/Classification/svm_multi.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
from sklearn import preprocessing, model_selection, metrics
from sklearn.svm import SVC

# Load the histogram features produced by the pre-processing scripts.
# Update this path to the CSV generated for the multi-class dataset.
df = pd.read_csv('dataset_L7_bin_size_8.csv')
df.replace('?', -99999, inplace=True)

X = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

# Scale each feature to [0, 1]; the scaler is fitted on the training data only.
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)

svm_model_linear = SVC(kernel='linear', C=1).fit(X_train_minmax, y_train)
svm_predictions = svm_model_linear.predict(X_test_minmax)

print('MULTI-CLASS CLASSIFICATION_L7 BIN SIZE 8')
print('')

# Model accuracy on the held-out test set
accuracy = svm_model_linear.score(X_test_minmax, y_test)
print("Accuracy =", accuracy)
print('')

report = metrics.classification_report(y_test, svm_predictions)
print("Report")
print(report)
print()

# creating a confusion matrix
#cm = metrics.confusion_matrix(y_test, svm_predictions)
#print("confusion matrix")
#print(cm)
print('')

su_vec = svm_model_linear.support_vectors_
print('Support vectors')
print(su_vec)
print('')
su_vec_n = svm_model_linear.n_support_
print('Number of support vectors for each class')
print(su_vec_n)
print('')
print('Total number of support vectors')
print(sum(su_vec_n))
print('')
weights = svm_model_linear.coef_  # one weight vector per pair of classes; available because the kernel is linear
print('The weights associated with each feature')
print(weights)
--------------------------------------------------------------------------------
/Pre-processing/hist_L7_all_classes.py:
--------------------------------------------------------------------------------
import cv2
import os
from random import shuffle
from tqdm import tqdm
import pandas as pd

# Directory containing the grayscale traffic images (update to your own path).
TRAIN_DIR = r"D:\ML mini project\main codes\train_L7"

temp = []   # histogram features of the current image
temp2 = []  # one feature row per image

BENIGN_CLASSES = ["BitTorrent", "Facetime", "FTP", "Gmail", "MySQL",
                  "Outlook", "Skype", "SMB", "Weibo", "WorldOfWarcraft"]
MALWARE_CLASSES = ["Cridex", "Geodo", "Htbot", "Miuref", "Neris",
                   "Nsis-ay", "Shifu", "Tinba", "Virut", "Zeus"]

def label(path):
    """Binary label derived from the file path: 1 for benign classes, 0 for malware classes."""
    for name in BENIGN_CLASSES:
        if name in path:
            return 1
    for name in MALWARE_CLASSES:
        if name in path:
            return 0

for fname in tqdm(os.listdir(TRAIN_DIR)):
    path = os.path.join(TRAIN_DIR, fname)
    t = label(path)
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # histSize=[32] gives 32 bins over [0, 256), i.e. a bin width of 8 intensity
    # levels; change the 4th parameter to vary the bin size.
    hist = cv2.calcHist([img], [0], None, [32], [0, 256])

    for i in range(len(hist)):
        temp.append(hist[i][0])
    temp.append(t)          # class label as the last value of the row
    temp2.append(temp)
    temp = []

shuffle(temp2)
temp3 = pd.DataFrame(temp2)  # one row per image: histogram values + label
# Name the last column 'label' so the classification scripts can select it by name.
temp3.columns = list(range(temp3.shape[1] - 1)) + ['label']
print(temp3.shape)
temp3.to_csv('dataset_L7_hog_binary.csv', index=False)
--------------------------------------------------------------------------------
/Pre-processing/hist_all_layers_binary.py:
--------------------------------------------------------------------------------
import cv2
import os
from random import shuffle
from tqdm import tqdm
import pandas as pd

# Directory containing the grayscale traffic images (update to your own path).
TRAIN_DIR = r"D:\ML mini project\main codes\train_L7"

temp = []   # histogram features of the current image
temp2 = []  # one feature row per image

BENIGN_CLASSES = ["BitTorrent", "Facetime", "FTP", "Gmail", "MySQL",
                  "Outlook", "Skype", "SMB", "Weibo", "WorldOfWarcraft"]
MALWARE_CLASSES = ["Cridex", "Geodo", "Htbot", "Miuref", "Neris",
                   "Nsis-ay", "Shifu", "Tinba", "Virut", "Zeus"]

def label(path):
    """Binary label derived from the file path: 1 for benign classes, 0 for malware classes."""
    for name in BENIGN_CLASSES:
        if name in path:
            return 1
    for name in MALWARE_CLASSES:
        if name in path:
            return 0

for fname in tqdm(os.listdir(TRAIN_DIR)):
    path = os.path.join(TRAIN_DIR, fname)
    t = label(path)
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # histSize=[32] gives 32 bins over [0, 256), i.e. a bin width of 8 intensity
    # levels; change the 4th parameter to vary the bin size.
    hist = cv2.calcHist([img], [0], None, [32], [0, 256])

    for i in range(len(hist)):
        temp.append(hist[i][0])
    temp.append(t)          # class label as the last value of the row
    temp2.append(temp)
    temp = []

shuffle(temp2)
temp3 = pd.DataFrame(temp2)  # one row per image: histogram values + label
# Name the last column 'label' so the classification scripts can select it by name.
temp3.columns = list(range(temp3.shape[1] - 1)) + ['label']
print(temp3.shape)
temp3.to_csv('dataset_L7_binary_bin_size_8.csv', index=False)
--------------------------------------------------------------------------------
/Pre-processing/hist_all_layers_all_classes.py:
--------------------------------------------------------------------------------
import cv2
import os
from random import shuffle
from tqdm import tqdm
import pandas as pd

# Directory containing the grayscale traffic images (update to your own path).
TRAIN_DIR = r"D:\ML mini project\main codes\train_L7"

temp = []   # histogram features of the current image
temp2 = []  # one feature row per image

# Benign classes are labelled 0 to 9, malware classes -1 to -10.
CLASS_LABELS = {
    "BitTorrent": 0, "Facetime": 1, "FTP": 2, "Gmail": 3, "MySQL": 4,
    "Outlook": 5, "Skype": 6, "SMB": 7, "Weibo": 8, "WorldOfWarcraft": 9,
    "Cridex": -1, "Geodo": -2, "Htbot": -3, "Miuref": -4, "Neris": -5,
    "Nsis-ay": -6, "Shifu": -7, "Tinba": -8, "Virut": -9, "Zeus": -10,
}

def label(path):
    """Class label derived from the class name appearing in the file path."""
    for name, value in CLASS_LABELS.items():
        if name in path:
            return value
for fname in tqdm(os.listdir(TRAIN_DIR)):
    path = os.path.join(TRAIN_DIR, fname)
    t = label(path)
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # histSize=[32] gives 32 bins over [0, 256), i.e. a bin width of 8 intensity
    # levels; change the 4th parameter to vary the bin size.
    hist = cv2.calcHist([img], [0], None, [32], [0, 256])

    for i in range(len(hist)):
        temp.append(hist[i][0])
    temp.append(t)          # class label as the last value of the row
    temp2.append(temp)
    temp = []

shuffle(temp2)
temp3 = pd.DataFrame(temp2)  # one row per image: histogram values + label
# Name the last column 'label' so the classification scripts can select it by name.
temp3.columns = list(range(temp3.shape[1] - 1)) + ['label']
print(temp3.shape)
temp3.to_csv('dataset_all_layers_bin_size_8.csv', index=False)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Network-Traffic-Classification
This project was done in March and April 2019 as a mini project for a machine learning course at my university.

## Description

Traffic classification is the first step for network anomaly detection or a network-based intrusion detection system, and it plays an important role in the network security domain. The traffic observed in a network can be classified into two main types: normal (benign) traffic and malware traffic. Traffic from malicious sources can cause many problems, such as flooding the network bandwidth, mounting denial-of-service (DoS) attacks and overwhelming the receiver, so detecting such unwanted traffic and separating it from benign traffic is extremely important.

This project involves converting raw network traffic, collected using the Wireshark tool, into images. These images are then used to classify the network traffic.
There are 2 types of classification:
- Classifying as benign or malware - Binary classification
- Classifying into each class under benign and malware - Multi-class classification

An accuracy of **100%** and **98%** was achieved for binary and multi-class classification respectively.

#### Dataset

The dataset used for training is USTC-TFC2016, which consists of benign traffic and malware traffic.
The following table shows the 10 classes under benign and malware.

|Benign Classes |Malware Classes |
|---------------|----------------|
|BitTorrent |Cridex |
|Facetime |Geodo |
|FTP |Htbot |
|Gmail |Miuref |
|MySQL |Neris |
|Outlook |Nsis-ay |
|Skype |Shifu |
|SMB |Tinba |
|Weibo |Virut |
|World of Warcraft |Zeus |

There are 2 types of data for each class: one contains the traffic of only the 7th layer (application layer) of the OSI model, and the other contains the traffic of all the layers of the OSI model. This project classifies the network traffic using each of these datasets individually.

## Methodology

1. Convert the raw Wireshark data in .pcap format into images. This was done using the code available at https://github.com/echowei/DeepTraffic/tree/master/1.malware_traffic_classification/2.PreprocessedTools(USTC-TK2016)
![Example Image from each class](Images/output_images.PNG)

The above images are grayscale images in which each pixel value ranges from 0 to 255.

2. Extract the histogram of each image and use it as the image's feature vector. These features are stored in a CSV file with the class label as the last column.
The histogram is computed for different bin sizes (1, 2, 4, 8).

3. Use Support Vector Machines (SVMs) for classification (a minimal sketch of steps 2 and 3 is shown after this list).
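Below is a minimal, self-contained sketch of steps 2 and 3, condensed from the scripts in the *Pre-processing* and *Classification* directories. The image directory, the output CSV name and the one-keyword labelling rule are placeholders for illustration only; the repository's scripts handle all 20 classes and several bin counts.

```python
import os
import cv2
import pandas as pd
from sklearn import model_selection, svm, metrics

IMAGE_DIR = "train_L7"   # placeholder: directory of grayscale traffic images
N_BINS = 32              # 32 bins over [0, 256) -> a bin size of 8 intensity levels

rows = []
for fname in os.listdir(IMAGE_DIR):
    img = cv2.imread(os.path.join(IMAGE_DIR, fname), cv2.IMREAD_GRAYSCALE)
    if img is None:      # skip files OpenCV cannot read
        continue
    # Grayscale histogram used as the feature vector for this image.
    hist = cv2.calcHist([img], [0], None, [N_BINS], [0, 256]).flatten()
    label = 1 if "BitTorrent" in fname else 0   # placeholder rule: 1 = benign, 0 = malware
    rows.append(list(hist) + [label])

# Store the features with the class label as the last column.
df = pd.DataFrame(rows, columns=list(range(N_BINS)) + ["label"])
df.to_csv("dataset_binary.csv", index=False)    # placeholder output name

# Train and evaluate a linear SVM on the extracted features.
X = df.drop(columns=["label"]).values
y = df["label"].values
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1)
clf = svm.SVC(kernel="linear").fit(X_train, y_train)
print("Accuracy =", clf.score(X_test, y_test))
print(metrics.classification_report(y_test, clf.predict(X_test)))
```

Writing the features to a CSV first, as the repository's scripts do, keeps feature extraction and classification decoupled, so different bin sizes can be compared without re-reading the images.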
## Running the application

1. Install the required libraries specified in *requirements.txt*
2. Extract the histogram features of the images and save them using the Python files in the Pre-processing directory
3. Perform classification by running the files in the Classification directory
   - *svm_binary.py*: performs binary classification
   - *svm_multi.py*: performs multi-class classification

**Replace the file paths in the code with the appropriate file paths.**

## References
*Wei Wang, Ming Zhu, ‘Malware Traffic Classification Using Convolutional Neural Network for Representation Learning’*
--------------------------------------------------------------------------------