├── .gitignore
├── LICENSE
├── Master_Thesis.pdf
├── README.md
├── convert_pcap_to_h5.py
├── data_exploration.py
├── datasets
    ├── LinuxChrome
    │   ├── 8
    │   │   └── extracted_8-2104_1422.h5
    │   └── 16
    │   │   └── extracted_16-2104_1523.h5
    ├── WindowsAndreas
    │   ├── 8
    │   │   └── extracted_8-2304_0930.h5
    │   └── 16
    │   │   └── extracted_16-2304_0932.h5
    ├── WindowsChrome
    │   ├── 8
    │   │   └── extracted_8-2004_1553.h5
    │   └── 16
    │   │   └── extracted_16-2004_1615.h5
    ├── WindowsFirefox
    │   ├── 8
    │   │   └── extracted_8-2004_2104.h5
    │   └── 16
    │   │   └── extracted_16-2004_2210.h5
    └── WindowsSalik
    │   ├── 8
    │       └── extracted_8-2004_1525.h5
    │   └── 16
    │       └── extracted_16-2004_1542.h5
├── extract_headers.py
├── extract_payload.py
├── filter_http_https.py
├── ip_header_test.py
├── pca
    ├── dataanalyzer.py
    ├── pca.py
    └── summarystats.py
├── pcap
    └── pcaptools.py
├── tf
    ├── confusionmatrix.py
    ├── dataset.py
    ├── early_stopping.py
    └── tf_utils.py
├── trafficgen
    ├── PyTgen
    │   ├── config.py
    │   ├── core
    │   │   ├── __init__.py
    │   │   ├── generator.py
    │   │   ├── runner.py
    │   │   └── scheduler.py
    │   ├── nslookup.py
    │   └── run.py
    └── Streaming
    │   ├── streaming_generator.py
    │   ├── streaming_types.py
    │   ├── unix_capture.py
    │   └── win_capture.py
├── train
    ├── train_header.py
    ├── train_logistic.py
    └── train_payload.py
├── utils.py
└── visualization
    ├── classes_module.py
    ├── heatmap.py
    ├── histogram.py
    ├── pca_plots.py
    ├── t-sne_compare.py
    ├── t_sne.py
    ├── vis_utils.py
    └── visualize_activations.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | trafficgen/__pycache__/
 2 | trafficgen/core/__pycache__/
 3 | __pycache__/
 4 | *.pcap
 5 | .idea/
 6 | .vscode/
 7 | constants.py
 8 | trained_models/
 9 | tensorboard/
10 | *.png
11 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 Salik Lennert Pedersen
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/Master_Thesis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/Master_Thesis.pdf


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Classification of encrypted traffic using deep learning
 2 | This repository contains the code used and developed during a master thesis at DTU Compute in 2018.  
 3 | Professor [Ole Winther](http://cogsys.imm.dtu.dk/staff/winther/) has been supervisor for this master thesis.  
 4 | Alex Omø Agerholm from [Napatech](https://www.napatech.com/) has been co-supervisor for this project.
 5 | 
 6 | In this thesis we examined and evaluated different ways of classifying encrypted network traffic by use of neural networks. For this purpose we created a dataset with a streaming/non-streaming focus. The dataset comprises seven different classes, five streaming and two non-streaming.
 7 | The thesis serves as a preliminary proof-of-concept for Napatech A/S. 
 8 | 
 9 | We propose a novel approach where the unencrypted parts of network traffic, namely the headers are utilized. This is done by concatenating the initial headers from a session thus forming a signature datapoint as shown in the following figure: 
10 | 
11 | <img src="https://saliklp.github.io/plots/Header-datapoint.png" alt="Header datapoint" width="50%">
12 | 
13 | The datasets created by use of the first 8 and 16 headers are available in the datasets folder in this repository.
14 | We explored the dataset by running t-SNE on the concatenated headers dataset. As can be seen in the t-SNE plot below, which shows all the individual datasets merged, it seems possible to perform classification of individual classes.
15 | 
16 | <img src="https://saliklp.github.io/plots/t-SNE_16headers_all_merged_perplexity30.png" alt="t-SNE plot" width="50%">
17 | 
18 | In experiments using the header-based approach we achieve very promising results, showing that a simple neural network with a single hidden layer of less than 50 units, can predict the individual classes with an accuracy of 96.4\% and an AUC of 0.99 to 1.00 for the individual classes, as shown in the following figures.
19 | 
20 | <img src="https://saliklp.github.io/plots/trainAllMerged_acc964.png" alt="Confusion matrix of all 7 classes" width="49%"> <img src="https://saliklp.github.io/plots/ROC_16header_12unit_all_.png" alt="ROC Plot" width="49.4%">
21 | 
22 | The thesis hereby provides a solution to network traffic classification using the unencrypted headers.
23 | 


--------------------------------------------------------------------------------
/convert_pcap_to_h5.py:
--------------------------------------------------------------------------------
1 | import pcap.pcaptools as pcap
2 | 
3 | 
4 | 
5 | if __name__ == '__main__':
6 |     pcap.process_pcap_to_h5('/home/mclrn/Data/', '/home/mclrn/Data/h5/', session_threshold=5000)


--------------------------------------------------------------------------------
/data_exploration.py:
--------------------------------------------------------------------------------
  1 | import pca.dataanalyzer
  2 | import utils
  3 | import glob
  4 | import os
  5 | import pandas as pd
  6 | import numpy as np
  7 | from matplotlib import pyplot as plt
  8 | from pca.dataanalyzer import byteindextoheaderfield
  9 | 
 10 | num_headers = 8
 11 | train_dir = 'C:/Users/salik/Documents/Data/LinuxChrome/{}/'.format(num_headers)
 12 | dataframes = []
 13 | for fullname in glob.iglob(train_dir + '*.h5'):
 14 |     filename = os.path.basename(fullname)
 15 |     df = utils.load_h5(train_dir, filename)
 16 |     dataframes.append(df)
 17 | # create one large dataframe
 18 | data = pd.concat(dataframes)
 19 | print(len(data))
 20 | print("drtv:", len(data[data['label'] == 'drtv']))
 21 | print("hbo:", len(data[data['label'] == 'hbo']))
 22 | print("http:", len(data[data['label'] == 'http']))
 23 | print("https:", len(data[data['label'] == 'https']))
 24 | print("netflix:", len(data[data['label'] == 'netflix']))
 25 | print("twitch:", len(data[data['label'] == 'twitch']))
 26 | print("youtube:", len(data[data['label'] == 'youtube']))
 27 | 
 28 | 
 29 | # dr_mean, dr_mean_sub, dr_std = getmeanstd(data, 'drtv')
 30 | # nf_mean, nf_mean_sub, nf_std = getmeanstd(data, 'netflix')
 31 | # mean_diff = dr_mean - nf_mean
 32 | # sort_diff = (-abs(mean_diff)).argsort() #Sort on absolute values in decending order
 33 | # for i in range(10):
 34 | #     packetnumber = math.ceil(sort_diff[i] / 54)
 35 | #     bytenumber = sort_diff[i] % 54
 36 | #     print('Index %i is bytenumber %i in packet: %i' % (sort_diff[i],bytenumber, packetnumber), byteindextoheaderfield(sort_diff[i]))
 37 | #
 38 | # bytes = getbytes(data)
 39 | # labels = data['label']
 40 | # pca = p.runpca(bytes, 50)
 41 | # # p.plotvarianceexp(pca, 50)
 42 | # Z = p.componentprojection(bytes, pca)
 43 | # for pc in range(17):
 44 | #     p.plotprojection(Z, pc, labels)
 45 | # p.showplots()
 46 | # id_max = np.argmax(mean_diff)
 47 | # print(byteindextoheaderfield(id_max))
 48 | # id_min = np.argmin(mean_diff)
 49 | # print(dr_mean-nf_mean)
 50 | 
 51 | def createBoxplotsFromColumns(title,param, param1):
 52 | 
 53 |     plt.title(title)
 54 |     plt.boxplot([param,param1])
 55 | 
 56 |     plt.savefig('boxplots/boxplot:%s.png' % title,dpi=300)
 57 |     plt.gcf().clear()
 58 | 
 59 | def compare_data(path1, path2, nrheaders):
 60 |     df1 = pd.DataFrame()
 61 |     df2 = pd.DataFrame()
 62 | 
 63 |     for fullname in glob.iglob(path1 + '*.h5'):
 64 |         filename = os.path.basename(fullname)
 65 |         df1 = utils.load_h5(path1, filename)
 66 | 
 67 |     for fullname in glob.iglob(path2 + '*.h5'):
 68 |         filename = os.path.basename(fullname)
 69 |         df2 = utils.load_h5(path2, filename)
 70 | 
 71 |     classes = []
 72 |     # find all classes
 73 |     for label in set(df1['label']):
 74 |         classes.append(label)
 75 |     print(set(df1['label']))
 76 |     print(set(df2['label']))
 77 |     # filter on classes
 78 |     for c in classes:
 79 |         ## Exclude youtube as it contains both UDP and TCP
 80 |         if(c == 'youtube'):
 81 |             continue
 82 | 
 83 | 
 84 |         # create selector
 85 |         df1_selector = df1['label'] == c
 86 |         df2_selector = df2['label'] == c
 87 | 
 88 | 
 89 |         df1_values = df1[df1_selector]['bytes'].values
 90 |         df2_values = df2[df2_selector]['bytes'].values
 91 | 
 92 |         df1_bytes = np.zeros((df1_values.shape[0], nrheaders * 54))
 93 |         df2_bytes = np.zeros((df2_values.shape[0], nrheaders * 54))
 94 | 
 95 |         for i, v in enumerate(df1_values):
 96 |             payload = np.zeros(nrheaders * 54, dtype=np.uint8)
 97 |             payload[:v.shape[0]] = v
 98 |             df1_bytes[i] = payload
 99 | 
100 |         for i, v in enumerate(df2_values):
101 |             payload = np.zeros(nrheaders * 54, dtype=np.uint8)
102 |             payload[:v.shape[0]] = v
103 |             df2_bytes[i] = payload
104 | 
105 | 
106 | 
107 |         # Extract byte 23 to determine the protocol.
108 |         TCP = True if int(df2_bytes[0][23]) == 6 else False
109 | 
110 | 
111 |         df1_mean = np.mean(df1_bytes, axis=0)
112 |         df2_mean = np.mean(df2_bytes, axis=0)
113 | 
114 |         df1_min = np.min(df1_bytes, axis=0)
115 |         df2_min = np.min(df2_bytes, axis=0)
116 | 
117 |         df1_max = np.max(df1_bytes, axis=0)
118 |         df2_max = np.max(df2_bytes, axis=0)
119 | 
120 | 
121 | 
122 |         for index, mean in enumerate(df1_mean):
123 |             if(index % 25 == 0):
124 |                 print(c,index)
125 |             if df1_mean[index] > 0 or df2_mean[index] > 0:
126 |                 if(int(df1_min[index]) != int(df1_max[index]) or int(df2_min[index]) != int(df2_max[index])):
127 |                     if(int(df1_mean[index]) != int(df2_mean[index]) or int(df1_min[index]) != int(df2_min[index]) or int(df1_max[index]) != int(df2_max[index])):
128 |                         print(index, " : ", int(df1_mean[index]), ' : ' , int(df2_mean[index]), int(df1_min[index]), int(df2_min[index]), int(df1_max[index]), int(df2_max[index]))
129 |                         headername = byteindextoheaderfield(index, TCP)
130 |                         headername = headername.replace('/',' ')
131 |                         createBoxplotsFromColumns(c+headername +':'+str(index), df1_bytes[:,index], df2_bytes[:,index])
132 | 
133 | 
134 | # compare_data('/home/mclrn/Data/linux/no_checksum/8/', '/home/mclrn/Data/windows_firefox/no_checksum/8/',8)
135 | 
136 | 


--------------------------------------------------------------------------------
/datasets/LinuxChrome/16/extracted_16-2104_1523.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/LinuxChrome/16/extracted_16-2104_1523.h5


--------------------------------------------------------------------------------
/datasets/LinuxChrome/8/extracted_8-2104_1422.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/LinuxChrome/8/extracted_8-2104_1422.h5


--------------------------------------------------------------------------------
/datasets/WindowsAndreas/16/extracted_16-2304_0932.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsAndreas/16/extracted_16-2304_0932.h5


--------------------------------------------------------------------------------
/datasets/WindowsAndreas/8/extracted_8-2304_0930.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsAndreas/8/extracted_8-2304_0930.h5


--------------------------------------------------------------------------------
/datasets/WindowsChrome/16/extracted_16-2004_1615.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsChrome/16/extracted_16-2004_1615.h5


--------------------------------------------------------------------------------
/datasets/WindowsChrome/8/extracted_8-2004_1553.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsChrome/8/extracted_8-2004_1553.h5


--------------------------------------------------------------------------------
/datasets/WindowsFirefox/16/extracted_16-2004_2210.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsFirefox/16/extracted_16-2004_2210.h5


--------------------------------------------------------------------------------
/datasets/WindowsFirefox/8/extracted_8-2004_2104.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsFirefox/8/extracted_8-2004_2104.h5


--------------------------------------------------------------------------------
/datasets/WindowsSalik/16/extracted_16-2004_1542.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsSalik/16/extracted_16-2004_1542.h5


--------------------------------------------------------------------------------
/datasets/WindowsSalik/8/extracted_8-2004_1525.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsSalik/8/extracted_8-2004_1525.h5


--------------------------------------------------------------------------------
/extract_headers.py:
--------------------------------------------------------------------------------
 1 | import datetime
 2 | 
 3 | import utils
 4 | 
 5 | if __name__ == '__main__':
 6 |     loaddir = "/home/mclrn/Data/h5/"
 7 |     headers = [1, 2, 4, 8, 16]
 8 |     for headersize in headers:
 9 |         savedir= "/home/mclrn/Data/h5/" + str(headersize) + "/"
10 |         now = datetime.datetime.now()
11 |         savename = "extracted_%d-%.2d%.2d_%.2d%.2d" % (headersize, now.day, now.month, now.hour, now.minute)
12 |         utils.saveextractedheaders(loaddir, savedir, savename, num_headers=headersize)


--------------------------------------------------------------------------------
/extract_payload.py:
--------------------------------------------------------------------------------
 1 | import datetime
 2 | import utils
 3 | import glob
 4 | import os
 5 | import numpy as np
 6 | import pandas as pd
 7 | 
 8 | 
 9 | if __name__ == '__main__':
10 |     loaddir = "E:/Data/h5/"
11 |     labels = ['https', 'netflix']
12 |     max_packet_length = 1514
13 |     for label in labels:
14 |         print("Starting label: " + label)
15 |         savedir = loaddir + label + "/"
16 |         now = datetime.datetime.now()
17 |         savename = "payload_%s-%.2d%.2d_%.2d%.2d" % (label, now.day, now.month, now.hour, now.minute)
18 |         filelist = glob.glob(loaddir + label + '*.h5')
19 |         # Try only one of each file
20 |         fullname = filelist[0]
21 |         # for fullname in filelist:
22 |         load_dir, filename = os.path.split(fullname)
23 |         print("Loading: {0}".format(filename))
24 |         df = utils.load_h5(load_dir, filename)
25 |         packets = df['bytes'].values
26 |         payloads = []
27 |         labels = []
28 |         filenames = []
29 |         for packet in packets:
30 |             if len(packet) == max_packet_length:
31 |                 # Extract the payload from the packet should have length 1460
32 |                 payload = packet[54:]
33 |                 p = np.fromstring(payload, dtype=np.uint8)
34 |                 payloads.append(p)
35 |                 labels.append(label)
36 |                 filenames.append(filename)
37 |         d = {'filename': filenames, 'bytes': payloads, 'label': labels}
38 |         dataframe = pd.DataFrame(data=d)
39 |         key = savename.split('-')[0]
40 |         dataframe.to_hdf(savedir + savename + '.h5', key=key, mode='w')
41 |         # utils.saveextractedheaders(loaddir, savedir, savename, num_headers=headersize)
42 |         print("Done with label: " + label)
43 | 


--------------------------------------------------------------------------------
/filter_http_https.py:
--------------------------------------------------------------------------------
 1 | import utils
 2 | import trafficgen.PyTgen.config as cf
 3 | import socket
 4 | 
 5 | 
 6 | def save_dataframe_h5(df, dir, filename):
 7 |     key = filename.split('-')[0]
 8 |     df.to_hdf(dir + filename + '.h5', key=key)
 9 | 
10 | 
11 | Conf = cf.Conf
12 | all_ips = []
13 | for http in Conf.http_urls:
14 |     url = http.split('/')[2]
15 |     ip_list = []
16 |     ais = socket.getaddrinfo(url, 0, 0, 0, 0)
17 |     for result in ais:
18 |         ip_list.append(result[-1][0])
19 |     ip_list = list(set(ip_list))
20 |     all_ips.append(ip_list)
21 | 
22 | http_list = [ip for ips in all_ips for ip in ips]
23 | 
24 | all_ips = []
25 | for http in Conf.https_urls:
26 |     url = http.split('/')[2]
27 |     ip_list = []
28 |     ais = socket.getaddrinfo(url, 0, 0, 0, 0)
29 |     for result in ais:
30 |         ip_list.append(result[-1][0])
31 |     ip_list = list(set(ip_list))
32 |     all_ips.append(ip_list)
33 | https_list = [ip for ips in all_ips for ip in ips]
34 | 
35 | dir = 'E:/Data/'
36 | filename = 'http_https-browse'
37 | http_dataframe = utils.filter_pcap_by_ip(dir, filename, http_list, 'http')
38 | save_dataframe_h5(http_dataframe, dir, 'http-browse-1104_2010')
39 | 
40 | https_dataframe = utils.filter_pcap_by_ip(dir, filename, https_list, 'https')
41 | save_dataframe_h5(https_dataframe, dir, 'https-browse-1104_2020')
42 | 


--------------------------------------------------------------------------------
/ip_header_test.py:
--------------------------------------------------------------------------------
 1 | import glob
 2 | import os
 3 | import utils
 4 | import numpy as np
 5 | import multiprocessing
 6 | 
 7 | num_headers = 1
 8 | 
 9 | load_dir = 'E:/Data/h5/'
10 | 
11 | 
12 | def ipheadertask(filelist):
13 |     j = 1
14 |     for fullname in filelist:
15 |         print("Loading filenr: {}".format(j))
16 |         load_dir, filename = os.path.split(fullname)
17 |         df = utils.load_h5(load_dir, filename)
18 |         frames = df['bytes'].values
19 |         for i, frame in enumerate(frames):
20 |             p = np.fromstring(frame, dtype=np.uint8)
21 |             if p[14] != 69:
22 |                 print("IP Header length not 20! in file {0}".format(filename))
23 |         j += 1
24 | 
25 | if __name__ == '__main__':
26 |     filelist = glob.glob(load_dir + '*.h5')
27 |     filesplits = utils.split_list(filelist, 4)
28 | 
29 |     threads = []
30 |     for split in filesplits:
31 |         # create a thread for each
32 |         t = multiprocessing.Process(target=ipheadertask, args=(split,))
33 |         threads.append(t)
34 |         t.start()
35 |     # create one large dataframe
36 | 
37 |     for t in threads:
38 |         t.join()
39 |         print("Process joined: ", t)
40 | 


--------------------------------------------------------------------------------
/pca/dataanalyzer.py:
--------------------------------------------------------------------------------
 1 | import utils
 2 | import glob
 3 | import os
 4 | import pandas as pd
 5 | import numpy as np
 6 | import math
 7 | import pca as p
 8 | 
 9 | 
10 | def getbytes(dataframe, payload_length=810):
11 |     values = dataframe['bytes'].values
12 |     bytes = np.zeros((values.shape[0], payload_length))
13 |     for i, v in enumerate(values):
14 |         payload = np.zeros(payload_length, dtype=np.uint8)
15 |         payload[:v.shape[0]] = v
16 |         bytes[i] = payload
17 |     return bytes
18 | 
19 | 
20 | def getmeanstd(dataframe, label):
21 |     labels = dataframe['label'] == label
22 |     bytes = getbytes(dataframe[labels])
23 |     # values = dataframe[labels]['bytes'].values
24 |     # bytes = np.zeros((values.shape[0], values[0].shape[0]))
25 |     # for i, v in enumerate(values):
26 |     #     bytes[i] = v
27 | 
28 |     # Ys = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
29 |     mean = np.mean(bytes, axis=0)
30 |     mean_sub = np.subtract(bytes, mean)
31 |     std = mean_sub / np.std(bytes, axis=0)
32 |     return mean, mean_sub, std
33 | 
34 | 
35 | def byteindextoheaderfield(number, TCP=True):
36 |     if TCP:
37 |         bytenumber = number % 54
38 |     else:
39 |         bytenumber = number % 42
40 | 
41 |     if bytenumber in range(6):
42 |         return "Destination MAC"
43 |     if bytenumber in range(6, 12):
44 |         return "Source MAC"
45 |     if bytenumber in (12, 13):
46 |         return "Eth. Type"
47 |     if bytenumber == 14:
48 |         return "IP Version and header length"
49 |     if bytenumber == 15:
50 |         return "Explicit Congestion Notification"
51 |     if bytenumber in (16, 17):
52 |         return "Total Length (IP header)"
53 |     if bytenumber in (18, 19):
54 |         return "Identification (IP header)"
55 |     if bytenumber in (20, 21):
56 |         return "Fragment offset (IP header)"
57 |     if bytenumber == 22:
58 |         return "Time to live (IP header)"
59 |     if bytenumber == 23:
60 |         return "Protocol (IP header)"
61 |     if bytenumber in (24, 25):
62 |         return "Header checksum (IP header)"
63 |     if bytenumber in range(26, 30):
64 |         return "Source IP (IP header)"
65 |     if bytenumber in range(30, 34):
66 |         return "Destination IP (IP header)"
67 |     if bytenumber in (34, 35):
68 |         return "Source Port (TCP/UDP header)"
69 |     if bytenumber in (36, 37):
70 |         return "Destination Port (TCP/UDP header)"
71 |     if bytenumber in range(38, 42):
72 |         if TCP:
73 |             return "Sequence number (TCP header)"
74 |         elif bytenumber in (38, 39):
75 |             return "Length of data (UDP Header)"
76 |         else:
77 |             return "UDP Checksum (UDP Header)"
78 |     if bytenumber in range(42, 46):
79 |         return "ACK number (TCP header)"
80 |     if bytenumber == 46:
81 |         return "TCP Header length or Nonce (TCP header)"
82 |     if bytenumber == 47:
83 |         return "TCP FLAGS (CWR, ECN-ECHO, ACK, PUSH, RST, SYN, FIN) (TCP header)"
84 |     if bytenumber in (48, 49):
85 |         return "Window size (TCP header)"
86 |     if bytenumber in (50, 51):
87 |         return "Checksum (TCP header)"
88 |     if bytenumber in (52, 53):
89 |         return "Urgent Pointer (TCP header)"
90 | 
91 | 
92 | 
93 | 
94 | 


--------------------------------------------------------------------------------
/pca/pca.py:
--------------------------------------------------------------------------------
 1 | from sklearn.decomposition import PCA
 2 | import matplotlib.pyplot as plt
 3 | import numpy as np
 4 | 
 5 | 
 6 | def runpca(X, num_comp=None):
 7 |     pca = PCA(n_components=num_comp, svd_solver='full')
 8 |     pca.fit(X)
 9 |     # print(pca.n_components_)
10 |     # print(pca.explained_variance_ratio_)
11 |     # print(sum(pca.explained_variance_ratio_))
12 |     return pca
13 | 
14 | 
15 | def plotvarianceexp(pca, num_comp):
16 |     # plt.ion()
17 |     rho = pca.explained_variance_ratio_
18 |     rhosum = np.empty(len(rho))
19 |     for i in range(1, len(rho) + 1):
20 |         rhosum[i - 1] = np.sum(rho[0:i])
21 |     ind = np.arange(num_comp)
22 |     width = 0.35
23 |     opacity = 0.8
24 |     fig, ax = plt.subplots()
25 |     ax.set_ylim(0, 1.1)
26 |     # Variance explained by single component
27 |     bars1 = ax.bar(ind, rho[0:num_comp], width, alpha=opacity, color="xkcd:blue", label='Single')
28 |     # Variance explained by cummulative component
29 |     bars2 = ax.bar(ind + width, rhosum[0:num_comp], width, alpha=opacity, color="g", label='Cummulative')
30 |     # add some text for labels, title and axes ticks
31 |     ax.set_ylabel('Variance Explained', fontsize=16)
32 |     # ax.set_title('Variance Explained by principal components')
33 |     ax.set_xticks(ind + width / 2)
34 | 
35 |     labels = ["" for x in range(num_comp)]
36 |     for k in range(0, num_comp):
37 |         labels[k] = ('$v{0}$'.format(k + 1))
38 |     ax.set_xticklabels(labels, fontsize=16)
39 |     ax.legend(fontsize=16)
40 |     autolabel(bars2, ax)
41 |     plt.yticks(fontsize=16)
42 |     plt.draw()
43 | 
44 | 
45 | def componentprojection(data, pca):
46 |     V = pca.components_.T
47 |     Z = data @ V
48 |     return Z
49 | 
50 | def plotprojection(Z, pc, labels, class_labels):
51 |     diff_labels = np.unique(labels)
52 |     opacity = 0.8
53 |     fig, ax = plt.subplots()
54 |     color_map = {0: 'orangered', 1: 'royalblue', 2: 'lightgreen', 3: 'darkorchid', 4: 'teal', 5: 'darkslategrey',
55 |                  6: 'darkgreen', 7: 'darkgrey'}
56 |     for label in diff_labels:
57 |         idx = labels == label
58 |         ax.plot(Z[idx, pc], Z[idx, pc + 1], 'o', alpha=opacity, c=color_map[label], label='{label}'.format(label=class_labels[label]))
59 |     # ax.plot(Z[idx_below, pc], Z[idx_below, pc + 1], 'o', alpha=opacity,
60 |     #         label='{name} below mean'.format(name=attributeNames[att]))
61 |     ax.set_ylabel('$v{0}$'.format(pc + 2))
62 |     ax.set_xlabel('$v{0}$'.format(pc + 1))
63 |     ax.legend()
64 |     ax.set_title('Data projected on v{0} and v{1}'.format(pc+1, pc+2))
65 |     # fig.savefig('v{0}_v{1}_{att}.png'.format(pc + 1, pc + 2, att=attributeNames[att]), dpi=300)
66 |     plt.draw()
67 | 
68 | 
69 | def showplots():
70 |     plt.show()
71 | 
72 | 
73 | def autolabel(rects, ax):
74 |     """
75 |     Attach a text label above each bar displaying its height
76 |     """
77 |     for rect in rects:
78 |         height = rect.get_height()
79 |         ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
80 |                 '%4.2f' % height,
81 |                 ha='center', va='bottom', fontsize=16)


--------------------------------------------------------------------------------
/pca/summarystats.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import pandas as pd
 3 | import utils
 4 | # import seaborn as sns
 5 | import glob
 6 | import os
 7 | from scipy import stats, integrate
 8 | import matplotlib.pyplot as plt
 9 | # sns.set(color_codes=True)
10 | 
11 | 
12 | def remove_checksum(nrheaders):
13 |     # nrheaders = 1
14 |     read_dir = '/home/mclrn/Data/linux/'
15 |     headers = '{0}/'.format(nrheaders)
16 |     dataframes = []
17 |     for fullname in glob.iglob(read_dir + headers + '*.h5'):
18 |         filename = os.path.basename(fullname)
19 |         df = utils.load_h5(read_dir + headers, filename)
20 |         dataframes.append(df)
21 |     # create one large dataframe
22 |     df = pd.concat(dataframes)
23 |     # df = pd.read_hdf('/home/mclrn/Data/salik_windows/{0}/'.format(nrheaders), key="extracted_{0}".format(nrheaders))
24 |     df = df.sample(frac=1).reset_index(drop=True)
25 |     values, counts = np.unique(df['label'], return_counts=True)
26 |     print(values, counts)
27 | 
28 |     #selector = df['label'] == 'youtube'
29 | 
30 |     values = df['bytes'].values
31 |     bytes = np.zeros((values.shape[0], nrheaders * 54))
32 |     for i, v in enumerate(values):
33 |         payload = np.zeros(nrheaders * 54, dtype=np.uint8)
34 |         payload[:v.shape[0]] = v
35 |         bytes[i] = payload
36 | 
37 |     #mean = np.mean(bytes, axis=0)
38 |     #min = np.min(bytes, axis=0)
39 |     #max = np.max(bytes, axis=0)
40 |     #print(np.max(bytes[0:, 23]))  # Protocol field if value = 6 then TCP if value = 17 the UDP
41 |     bytes_no_checksum = []
42 |     for j, b in enumerate(bytes):
43 |         if b[23] == 6:
44 |             # TCP
45 |             # if bytenumber in (50, 51):
46 |             #   return "Checksum (TCP header)"
47 |             for i in range(nrheaders):
48 |                 b[i * 54 + 50] = 0
49 |                 b[i * 54 + 51] = 0
50 |                 b[i * 54 + 24] = 0
51 |                 b[i * 54 + 25] = 0
52 |         elif b[23] == 17:
53 |             # UDP
54 |             # if bytenumber in (40,41)
55 |             # return "UDP Checksum (UDP Header)"
56 |             for i in range(nrheaders):
57 |                 b[i * 42 + 40] = 0
58 |                 b[i * 42 + 41] = 0
59 |                 b[i * 42 + 24] = 0
60 |                 b[i * 42 + 25] = 0
61 |         else:
62 |             print("Byte was not 6 nor 17 but: %d" % bytes[23])
63 | 
64 |         bytes_no_checksum.append(b)
65 |     new_data = {'bytes': bytes_no_checksum, 'label': df['label'].values}
66 |     new_df = pd.DataFrame(new_data)
67 |     # print(df)
68 |     # print(new_df)
69 |     save_dir = read_dir + "no_checksum/" + headers
70 |     # if not os.path.exists(save_dir):
71 |     os.makedirs(save_dir, exist_ok=True)
72 |     new_df.to_hdf(save_dir +
73 |                   "extracted_{0}-no_checksum".format(nrheaders) + '.h5',
74 |                   key='extracted_{0}'.format(nrheaders), mode='w')
75 | 
76 | 
77 | nr = [1,2,4,8,16]
78 | for n in nr:
79 |     remove_checksum(n)
80 | 
81 | 
82 | #sns.distplot(bytes[0:,370], kde=False, rug=True)
83 | 
84 | '''
85 | for index, m in enumerate(mean):
86 |     if m > 0 and min[index] != max[index]:
87 |         print(index, min[index], max[index], mean[index])
88 |         sns.distplot(bytes[0:,index], kde=False, rug=True)
89 |         plt.show()
90 | '''
91 | #plt.show()
92 | #plt.savefig("dist.png")
93 | 
94 | #for i in mean:
95 | #    print("%.2f" % i)
96 | 
97 | 
98 | 


--------------------------------------------------------------------------------
/pcap/pcaptools.py:
--------------------------------------------------------------------------------
  1 | import pandas as pd
  2 | from scapy.all import *
  3 | import glob
  4 | import os
  5 | import multiprocessing
  6 | 
  7 | from utils import split_list
  8 | 
  9 | 
 10 | def pcap_cleaner(dir):
 11 |     """
 12 |     This method can be used for cleaning pcap files.
 13 |     :param dir: Directory containing the pcap files that should be filtered
 14 |     :return: None
 15 |     """
 16 |     for fullname in glob.iglob(dir + '*.pcap'):
 17 |         dir, filename = os.path.split(fullname)
 18 |         command = 'tshark -r %s -2 -R "!(eth.dst[0]&1) && !(tcp.port==5901) && ip" -w %s/filtered/%s' % (fullname,dir, filename)
 19 |         os.system(command)
 20 | 
 21 | 
 22 | def save_pcap_task(files, save_dir, session_threshold):
 23 |     """
 24 |     This method takes all files in a list (full path names) and uses the method save_pcap that converts a pcap to h5 format
 25 |     :param files:
 26 |     :param session_threshold:
 27 |     :return:
 28 |     """
 29 |     for fullname in files:
 30 |         print('Currently saving file: ', fullname)
 31 | 
 32 |         save_pcap(fullname, save_dir, session_threshold)
 33 | 
 34 | 
 35 | def process_pcap_to_h5(read_dir, save_dir, session_threshold=5000):
 36 |     """
 37 |     Use this method to process all pcap files in a directory to a h5 format.
 38 |     Session threshold is used to filter out all sessions containing fewer packets
 39 |     :param save_dir:
 40 |     :param read_dir: Directory containing pcap files that should be converted into h5 format
 41 |     :param session_threshold: Threshold to filter out session with less packets
 42 |     :return: None
 43 |     """
 44 |     h5files = []
 45 | 
 46 |     for h5 in glob.iglob(save_dir + '*.h5'):
 47 |         h5files.append(os.path.basename(h5))
 48 |     # Load all files
 49 |     files = []
 50 |     for fullname in glob.iglob(read_dir + '*.pcap'):
 51 |         filename = os.path.basename(fullname)
 52 |         h5name = filename +'.h5'
 53 |         if h5name in h5files:
 54 |             os.rename(fullname, read_dir + '/processed_pcap/' + filename)
 55 |         else:
 56 |             files.append(fullname)
 57 | 
 58 |     splits = 4
 59 |     files_splits = split_list(files, splits)
 60 |     processes = []
 61 |     for file_split in files_splits:
 62 |         # create a thread for each
 63 |         t1 = multiprocessing.Process(target=save_pcap_task, args=(file_split, save_dir, session_threshold))
 64 |         print("Starting process", t1)
 65 |         processes.append(t1)
 66 |         t1.start()
 67 | 
 68 |     for process in processes:
 69 |         process.join()
 70 |         print("Process joined", process)
 71 | 
 72 | def save_pcap(fullname, save_dir, session_threshold=0):
 73 |     """
 74 |     This method read a pcap file and saves it to an h5 dataframe.
 75 |     The file is overwritten if it already exists.
 76 |     :param dir: The folder containing the pcap file
 77 |     :param filename: The name of the pcap file
 78 |     :return: Nothing
 79 |     """
 80 |     dir_n, filename = os.path.split(fullname)
 81 |     df = read_pcap(fullname, filename, session_threshold)
 82 |     key = filename.split('-')[0]
 83 |     df.to_hdf(save_dir + filename + '.h5', key=key, mode='w')
 84 | 
 85 | 
 86 | def read_pcap(fullname, filename, session_threshold=0):
 87 |     """
 88 |     This method will extract the packets of the major session within the pcap file. It will label the packets according
 89 |     to the filename.
 90 |     The method excludes packets between local/internal ip adresses (ip.src and ip.dst startswith 10.....)
 91 |     The method finds the major sessions by counting the packets for each session and calculate a threshold dependent
 92 |     on the session with most packets. All sessions with more packets than the threshold value is extracted and placed
 93 |     in the dataframe.
 94 | 
 95 |     :param dir: The directory in which the pcap file is located. Should end with a /
 96 |     :param filename: The name of the pcap file. It is expected to contain the label of the data before the first - char
 97 |     :return: A dataframe containing the extracted packets.
 98 |     """
 99 |     time_s = time.clock()
100 |     label = filename.split('-')[0]
101 |     print("Read PCAP, label is %s" % label)
102 |     if not filename.endswith('.pcap'):
103 |         filename += '.pcap'
104 |     data = rdpcap(fullname)
105 |     # Workaround/speedup for pandas append to dataframe
106 |     frametimes =[]
107 |     dsts = []
108 |     srcs = []
109 |     protocols = []
110 |     dports = []
111 |     sports = []
112 |     bytes = []
113 |     labels = []
114 | 
115 |     time_r = time.clock()
116 |     time_read = time_r-time_s
117 |     print("Time to read PCAP: "+ str(time_read))
118 |     sessions = data.sessions(session_extractor=session_extractor)
119 |     for id, session in sessions.items():
120 |         if session_threshold is not None:
121 |             if len(session) < session_threshold:
122 |                 continue
123 |         for packet in session:
124 |             # Check that the packet is transferred by either UDP or TCP and ensure that it is not a packet between to local/internal IP adresses (occurs when using vnc and such)
125 |             if IP in packet and (UDP in packet or TCP in packet) and not (packet[IP].dst.startswith('10.') and packet[IP].src.startswith('10.')):
126 |                 ip_layer = packet[IP]
127 |                 transport_layer = ip_layer.payload
128 |                 frametimes.append(packet.time)
129 |                 dsts.append(ip_layer.dst)
130 |                 srcs.append(ip_layer.src)
131 |                 protocols.append(transport_layer.name)
132 |                 dports.append(transport_layer.dport)
133 |                 sports.append(transport_layer.sport)
134 |                 # Save the raw byte string
135 |                 raw_payload = raw(packet)
136 |                 bytes.append(raw_payload)
137 |                 labels.append(label)
138 |     time_t = time.clock()
139 |     print("Time spend: %ds" % (time_t-time_r))
140 |     d = {'time': frametimes,
141 |          'ip.dst': dsts,
142 |          'ip.src': srcs,
143 |          'protocol': protocols,
144 |          'port.dst': dports,
145 |          'port.src': sports,
146 |          'bytes': bytes,
147 |          'label': labels}
148 |     df = pd.DataFrame(data=d)
149 |     time_e = time.clock()
150 |     total_time = time_e - time_s
151 |     print("Time to convert PCAP to dataframe: " + str(total_time))
152 |     return df
153 | 
154 | 
155 | def session_extractor(p):
156 |     """
157 |     Custom session extractor to use for scapy to group bi directional sessions instead of a uni directional flows.
158 |     :param p: packet as used by scapy
159 |     :return: session string to use a key in dict
160 |     """
161 |     sess = "Other"
162 |     if 'Ether' in p:
163 |         if 'IP' in p:
164 |             src = p[IP].src
165 |             dst = p[IP].dst
166 |             if NTP in p:
167 |                 if src.startswith('10.') or src.startswith('192.168.'):
168 |                     sess = p.sprintf("NTP %IP.src%:%r,UDP.sport% > %IP.dst%:%r,UDP.dport%")
169 |                 elif dst.startswith('10.') or dst.startswith('192.168.'):
170 |                     sess = p.sprintf("NTP %IP.dst%:%r,UDP.dport% > %IP.src%:%r,UDP.sport%")
171 |             elif 'TCP' in p:
172 |                 if src.startswith('10.') or src.startswith('192.168.'):
173 |                     sess = p.sprintf("TCP %IP.src%:%r,TCP.sport% > %IP.dst%:%r,TCP.dport%")
174 |                 elif dst.startswith('10.') or dst.startswith('192.168.'):
175 |                     sess = p.sprintf("TCP %IP.dst%:%r,TCP.dport% > %IP.src%:%r,TCP.sport%")
176 |             elif 'UDP' in p:
177 |                 if src.startswith('10.') or src.startswith('192.168.'):
178 |                     sess = p.sprintf("UDP %IP.src%:%r,UDP.sport% > %IP.dst%:%r,UDP.dport%")
179 |                 elif dst.startswith('10.') or dst.startswith('192.168.'):
180 |                     sess = p.sprintf("UDP %IP.dst%:%r,UDP.dport% > %IP.src%:%r,UDP.sport%")
181 |             elif 'ICMP' in p:
182 |                 sess = p.sprintf("ICMP %IP.src% > %IP.dst% type=%r,ICMP.type% code=%r,ICMP.code% id=%ICMP.id%")
183 |             else:
184 |                 sess = p.sprintf("IP %IP.src% > %IP.dst% proto=%IP.proto%")
185 |         elif 'ARP' in p:
186 |             sess = p.sprintf("ARP %ARP.psrc% > %ARP.pdst%")
187 |         else:
188 |             sess = p.sprintf("Ethernet type=%04xr,Ether.type%")
189 |     return sess


--------------------------------------------------------------------------------
/tf/confusionmatrix.py:
--------------------------------------------------------------------------------
  1 | import numpy as np
  2 | import matplotlib
  3 | 
  4 | 
  5 | class ConfusionMatrix:
  6 |     """
  7 |        Simple confusion matrix class
  8 |        row is the true class, column is the predicted class
  9 |     """
 10 |     def __init__(self, num_classes, class_names=None):
 11 |         self.n_classes = num_classes
 12 |         if class_names is None:
 13 |             self.class_names = map(str, range(num_classes))
 14 |         else:
 15 |             self.class_names = class_names
 16 | 
 17 |         # find max class_name and pad
 18 |         max_len = max(map(len, self.class_names))
 19 |         self.max_len = max_len
 20 |         for idx, name in enumerate(self.class_names):
 21 |             if len(self.class_names) < max_len:
 22 |                 self.class_names[idx] = name + " "*(max_len-len(name))
 23 | 
 24 |         self.mat = np.zeros((num_classes, num_classes), dtype='int')
 25 | 
 26 |     def __str__(self):
 27 |         # calucate row and column sums
 28 |         col_sum = np.sum(self.mat, axis=1)
 29 |         row_sum = np.sum(self.mat, axis=0)
 30 | 
 31 |         s = []
 32 | 
 33 |         mat_str = self.mat.__str__()
 34 |         mat_str = mat_str.replace('[','').replace(']','').split('\n')
 35 | 
 36 |         for idx, row in enumerate(mat_str):
 37 |             if idx == 0:
 38 |                 pad = " "
 39 |             else:
 40 |                 pad = ""
 41 |             class_name = self.class_names[idx]
 42 |             class_name = " {:10s}".format(class_name)
 43 |             class_name += " |"
 44 |             row_str = class_name + pad + row
 45 |             row_str += " |" + str(col_sum[idx])
 46 |             s.append(row_str)
 47 | 
 48 |         row_sum = [(self.max_len+7)*" "+" ".join(map(str, row_sum))]
 49 |         hline = [(1+self.max_len)*" "+"-"*len(row_sum[0])]
 50 | 
 51 |         s = hline + s + hline + row_sum
 52 | 
 53 |         # add linebreaks
 54 |         s_out = [line+'\n' for line in s]
 55 |         return "".join(s_out)
 56 | 
 57 |     def batch_add(self, targets, preds):
 58 |         assert targets.shape == preds.shape
 59 |         assert len(targets) == len(preds)
 60 |         assert max(targets) < self.n_classes
 61 |         assert max(preds) < self.n_classes
 62 |         targets = targets.flatten()
 63 |         preds = preds.flatten()
 64 |         for i in range(len(targets)):
 65 |                 self.mat[targets[i], preds[i]] += 1
 66 | 
 67 |     def get_errors(self):
 68 |         tp = np.asarray(np.diag(self.mat).flatten(), dtype='float')
 69 |         fn = np.asarray(np.sum(self.mat, axis=1).flatten(), dtype='float') - tp
 70 |         fp = np.asarray(np.sum(self.mat, axis=0).flatten(), dtype='float') - tp
 71 |         tn = np.asarray(np.sum(self.mat)*np.ones(self.n_classes).flatten(),
 72 |                         dtype='float') - tp - fn - fp
 73 |         return tp, fn, fp, tn
 74 | 
 75 |     def accuracy(self):
 76 |         """
 77 |         Calculates global accuracy
 78 |         :return: accuracy
 79 |         :example: >>> conf = ConfusionMatrix(3)
 80 |                   >>> conf.batchAdd([0,0,1],[0,0,2])
 81 |                   >>> print conf.accuracy()
 82 |         """
 83 |         tp, _, _, _ = self.get_errors()
 84 |         n_samples = np.sum(self.mat)
 85 |         return np.sum(tp) / n_samples
 86 | 
 87 |     def sensitivity(self):
 88 |         tp, tn, fp, fn = self.get_errors()
 89 |         res = tp / (tp + fn)
 90 |         res = res[~np.isnan(res)]
 91 |         return res
 92 | 
 93 |     def specificity(self):
 94 |         tp, tn, fp, fn = self.get_errors()
 95 |         res = tn / (tn + fp)
 96 |         res = res[~np.isnan(res)]
 97 |         return res
 98 | 
 99 |     def positive_predictive_value(self):
100 |         tp, tn, fp, fn = self.get_errors()
101 |         res = tp / (tp + fp)
102 |         res = res[~np.isnan(res)]
103 |         return res
104 | 
105 |     def negative_predictive_value(self):
106 |         tp, tn, fp, fn = self.get_errors()
107 |         res = tn / (tn + fn)
108 |         res = res[~np.isnan(res)]
109 |         return res
110 | 
111 |     def false_positive_rate(self):
112 |         tp, tn, fp, fn = self.get_errors()
113 |         res = fp / (fp + tn)
114 |         res = res[~np.isnan(res)]
115 |         return res
116 | 
117 |     def false_discovery_rate(self):
118 |         tp, tn, fp, fn = self.get_errors()
119 |         res = fp / (tp + fp)
120 |         res = res[~np.isnan(res)]
121 |         return res
122 | 
123 |     def F1(self):
124 |         tp, tn, fp, fn = self.get_errors()
125 |         res = (2*tp) / (2*tp + fp + fn)
126 |         res = res[~np.isnan(res)]
127 |         return res
128 | 
129 |     def matthews_correlation(self):
130 |         tp, tn, fp, fn = self.get_errors()
131 |         numerator = tp*tn - fp*fn
132 |         denominator = np.sqrt((tp + fp)*(tp + fn)*(tn + fp)*(tn + fn))
133 |         res = numerator / denominator
134 |         res = res[~np.isnan(res)]
135 |         return res


--------------------------------------------------------------------------------
/tf/dataset.py:
--------------------------------------------------------------------------------
  1 | import numpy as np
  2 | from tensorflow.python.framework import dtypes
  3 | from tensorflow.python.framework import random_seed
  4 | from tensorflow.contrib.learn.python.learn.datasets import base
  5 | import glob
  6 | import os
  7 | import utils
  8 | import pandas as pd
  9 | from sklearn.preprocessing import LabelEncoder
 10 | 
 11 | _label_encoder = LabelEncoder()
 12 | 
 13 | 
 14 | class DataSet(object):
 15 | 
 16 |     def __init__(self,
 17 |                  payloads,
 18 |                  labels,
 19 |                  dtype=dtypes.float32,
 20 |                  seed=None):
 21 |         """Construct a DataSet.
 22 |         one_hot arg is used only if fake_data is true.  `dtype` can be either
 23 |         `uint8` to leave the input as `[0, 255]`, or `float32` to rescale into
 24 |         `[0, 1]`.  Seed arg provides for convenient deterministic testing.
 25 |         """
 26 |         seed1, seed2 = random_seed.get_seed(seed)
 27 |         # If op level seed is not set, use whatever graph level seed is returned
 28 |         np.random.seed(seed1 if seed is None else seed2)
 29 |         dtype = dtypes.as_dtype(dtype).base_dtype
 30 |         if dtype not in (dtypes.uint8, dtypes.float32):
 31 |             raise TypeError('Invalid payload dtype %r, expected uint8 or float32' %
 32 |                             dtype)
 33 | 
 34 |         assert payloads.shape[0] == labels.shape[0], (
 35 |                 'payloads.shape: %s labels.shape: %s' % (payloads.shape, labels.shape))
 36 |         self._num_examples = payloads.shape[0]
 37 | 
 38 |         if dtype == dtypes.float32:
 39 |             # Convert from [0, 255] -> [0.0, 1.0].
 40 |             payloads = payloads.astype(np.float32)
 41 |             payloads = np.multiply(payloads, 1.0 / 255.0)
 42 | 
 43 |         self._payloads = payloads
 44 |         self._labels = labels
 45 |         self._epochs_completed = 0
 46 |         self._index_in_epoch = 0
 47 | 
 48 |     @property
 49 |     def payloads(self):
 50 |         return self._payloads
 51 | 
 52 |     @property
 53 |     def labels(self):
 54 |         return self._labels
 55 | 
 56 |     @property
 57 |     def num_examples(self):
 58 |         return self._num_examples
 59 | 
 60 |     @property
 61 |     def epochs_completed(self):
 62 |         return self._epochs_completed
 63 | 
 64 |     def next_batch(self, batch_size, shuffle=True):
 65 |         """Return the next `batch_size` examples from this data set."""
 66 |         start = self._index_in_epoch
 67 |         # Shuffle for the first epoch
 68 | 
 69 |         if self._epochs_completed == 0 and start == 0 and shuffle:
 70 |             perm0 = np.arange(self._num_examples)
 71 |             np.random.shuffle(perm0)
 72 |             self._payloads = self.payloads[perm0]
 73 |             self._labels = self.labels[perm0]
 74 |         # Go to the next epoch
 75 |         if start + batch_size > self._num_examples:
 76 |             # Finished epoch
 77 |             self._epochs_completed += 1
 78 |             # Get the rest examples in this epoch
 79 |             rest_num_examples = self._num_examples - start
 80 |             payloads_rest_part = self._payloads[start:self._num_examples]
 81 |             labels_rest_part = self._labels[start:self._num_examples]
 82 |             # Shuffle the data
 83 |             if shuffle:
 84 |                 perm = np.arange(self._num_examples)
 85 |                 np.random.shuffle(perm)
 86 |                 self._payloads = self.payloads[perm]
 87 |                 self._labels = self.labels[perm]
 88 |             # Start next epoch
 89 |             start = 0
 90 |             self._index_in_epoch = batch_size - rest_num_examples
 91 |             end = self._index_in_epoch
 92 |             images_new_part = self._payloads[start:end]
 93 |             labels_new_part = self._labels[start:end]
 94 |             return np.concatenate((payloads_rest_part, images_new_part), axis=0), np.concatenate(
 95 |                 (labels_rest_part, labels_new_part), axis=0)
 96 |         else:
 97 |             self._index_in_epoch += batch_size
 98 |             end = self._index_in_epoch
 99 |             return self._payloads[start:end], self._labels[start:end]
100 | 
101 | 
102 | def dense_to_one_hot(labels_dense, num_classes):
103 |     """Convert class labels from scalars to one-hot vectors."""
104 |     num_labels = labels_dense.shape[0]
105 |     index_offset = np.arange(num_labels) * num_classes
106 |     labels_one_hot = np.zeros((num_labels, num_classes), dtype=np.int8)
107 |     labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
108 |     return labels_one_hot
109 | 
110 | 
111 | def extract_labels(dataframe, one_hot=False, num_classes=10):
112 |     """Extract the labels into a 1D uint8 numpy array [index].
113 | 
114 |     Args:
115 |     dataframe: A pandas dataframe object.
116 |     one_hot: Does one hot encoding for the result.
117 |     num_classes: Number of classes for the one hot encoding.
118 | 
119 |     Returns:
120 |     labels: a 1D uint8 numpy array.
121 |     """
122 |     print('Extracting labels', )
123 |     labels = dataframe['label'].values
124 |     labels = _label_encoder.fit_transform(labels)
125 |     if one_hot:
126 |         return dense_to_one_hot(labels, num_classes)
127 |     return labels
128 | 
129 | 
130 | def read_data_sets(train_dirs=[], test_dirs=None,
131 |                    merge_data=True,
132 |                    one_hot=False,
133 |                    dtype=dtypes.float32,
134 |                    validation_size=0.2,
135 |                    test_size=0.2,
136 |                    seed=None,
137 |                    balance_classes=False,
138 |                    payload_length=810):
139 |     trainframes = []
140 |     testframes = []
141 |     for train_dir in train_dirs:
142 |         for fullname in glob.iglob(train_dir + '*.h5'):
143 |             filename = os.path.basename(fullname)
144 |             df = utils.load_h5(train_dir, filename)
145 |             trainframes.append(df)
146 |         # create one large dataframe
147 |     train_data = pd.concat(trainframes)
148 |     if test_dirs != train_dirs:
149 |         for test_dir in test_dirs:
150 |             for fullname in glob.iglob(test_dir + '*.h5'):
151 |                 filename = os.path.basename(fullname)
152 |                 df = utils.load_h5(test_dir, filename)
153 |                 testframes.append(df)
154 |         test_data = pd.concat(testframes)
155 |     else:
156 |         test_data = pd.DataFrame()
157 | 
158 |     if merge_data:
159 |         train_data = pd.concat([test_data, train_data])
160 | 
161 |     num_classes = len(train_data['label'].unique())
162 | 
163 |     if balance_classes:
164 |         values, counts = np.unique(train_data['label'], return_counts=True)
165 |         smallest_class = np.argmin(counts)
166 |         amount = counts[smallest_class]
167 |         new_data = []
168 |         for v in values:
169 |             sample = train_data.loc[train_data['label'] == v].sample(n=amount)
170 |             new_data.append(sample)
171 |         train_data = new_data
172 |         train_data = pd.concat(train_data)
173 | 
174 | 
175 |     # shuffle the dataframe and reset the index
176 |     train_data = train_data.sample(frac=1, random_state=seed).reset_index(drop=True)
177 |     #
178 |     # youtube_selector = train_data['label'] == 'youtube'
179 |     # youtube_data = train_data[youtube_selector]
180 |     # for index, row in youtube_data.iterrows():
181 |     #     bytes = row[0]
182 |     #     if bytes[23] == 17.0:
183 |     #         train_data.loc[index, 'label'] = 'youtube_udp'
184 |     #     else:
185 |     #         train_data.loc[index, 'label'] = 'youtube_tcp'
186 | 
187 |     if test_dirs != train_dirs:
188 |         test_data = test_data.sample(frac=1, random_state=seed).reset_index(drop=True)
189 |         test_labels = extract_labels(test_data, one_hot=one_hot, num_classes=num_classes)
190 |         test_payloads = test_data['bytes'].values
191 |         test_payloads = utils.pad_arrays_with_zero(test_payloads, payload_length=payload_length)
192 |     train_labels = extract_labels(train_data, one_hot=one_hot, num_classes=num_classes)
193 |     train_payloads = train_data['bytes'].values
194 |     # pad with zero up to payload_length length
195 |     train_payloads = utils.pad_arrays_with_zero(train_payloads, payload_length=payload_length)
196 | 
197 |     # TODO make seperate TEST SET ONCE ready
198 |     total_length = len(train_payloads)
199 |     validation_amount = int(total_length * validation_size)
200 |     if merge_data:
201 |         test_amount = int(total_length * test_size)
202 |         test_payloads = train_payloads[:test_amount]
203 |         test_labels = train_labels[:test_amount]
204 |         val_payloads = train_payloads[test_amount:(validation_amount + test_amount)]
205 |         val_labels = train_labels[test_amount:(validation_amount + test_amount)]
206 |         train_payloads = train_payloads[(validation_amount + test_amount):]
207 |         train_labels = train_labels[(validation_amount + test_amount):]
208 |     else:
209 |         val_payloads = train_payloads[:validation_amount]
210 |         val_labels = train_labels[:validation_amount]
211 |         train_payloads = train_payloads[validation_amount:]
212 |         train_labels = train_labels[validation_amount:]
213 | 
214 |     options = dict(dtype=dtype, seed=seed)
215 |     print("Training set size: {0}".format(len(train_payloads)))
216 |     print("Validation set size: {0}".format(len(val_payloads)))
217 |     print("Test set size: {0}".format(len(test_payloads)))
218 |     train = DataSet(train_payloads, train_labels, **options)
219 |     validation = DataSet(val_payloads, val_labels, **options)
220 |     test = DataSet(test_payloads, test_labels, **options)
221 | 
222 |     return base.Datasets(train=train, validation=validation, test=test)
223 | 


--------------------------------------------------------------------------------
/tf/early_stopping.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | 
 3 | 
 4 | class EarlyStopping:
 5 |     """Stop training when a monitored quantity has stopped improving.
 6 |   Arguments:
 7 |       min_delta: minimum change in the monitored quantity
 8 |           to qualify as an improvement, i.e. an absolute
 9 |           change of less than min_delta, will count as no
10 |           improvement.
11 |       patience: number of epochs with no improvement
12 |           after which training will be stopped.
13 |   """
14 | 
15 |     def __init__(self,
16 |                  min_delta=0,
17 |                  patience=0):
18 |         self.stop_training = False
19 |         self.patience = patience
20 |         self.min_delta = min_delta
21 |         self.wait = 0
22 |         self.stopped_epoch = 0
23 | 
24 |         self.monitor_op = np.less
25 |         self.min_delta *= -1
26 |         self.best = np.Inf
27 | 
28 |     def on_train_begin(self):
29 |         # Allow instances to be re-used
30 |         self.wait = 0
31 |         self.stopped_epoch = 0
32 |         self.best = np.Inf
33 |         self.stop_training = False
34 | 
35 |     def on_epoch_end(self, epoch, current):
36 |         if self.monitor_op(current - self.min_delta, self.best):
37 |             self.best = current
38 |             self.wait = 0
39 |         else:
40 |             if self.wait >= self.patience:
41 |                 self.stopped_epoch = epoch
42 |                 self.stop_training = True
43 |             self.wait += 1
44 | 
45 |     def on_train_end(self):
46 |         if self.stopped_epoch > 0:
47 |             print('Epoch {}: early stopping'.format(self.stopped_epoch))
48 | 


--------------------------------------------------------------------------------
/tf/tf_utils.py:
--------------------------------------------------------------------------------
  1 | import tensorflow as tf
  2 | 
  3 | 
  4 | def train_input_fn(features, labels, batch_size):
  5 |     """An input function for training"""
  6 |     # Convert the inputs to a Dataset.
  7 |     dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
  8 |     # Shuffle, repeat, and batch the examples.
  9 |     dataset = dataset.shuffle(1000).repeat().batch(batch_size)
 10 |     # Build the Iterator, and return the read end of the pipeline.
 11 |     return dataset.make_one_shot_iterator().get_next()
 12 | 
 13 | 
 14 | def ffn_layer(name, inputs, hidden_units, activation=tf.nn.relu, seed=None):
 15 |     """Reusable code for making a simple neural net layer.
 16 | 
 17 |     It does a matrix multiply, bias add, and then uses relu to nonlinearize.
 18 |     It also sets up name scoping so that the resultant graph is easy to read,
 19 |     and adds a number of summary ops.
 20 |     """
 21 |     input_dim = inputs.get_shape().as_list()[1]
 22 |     # use xavier glorot intitializer as regular uniform dist did not work
 23 |     weight_initializer = tf.contrib.layers.xavier_initializer(uniform=True, seed=seed, dtype=tf.float32)
 24 |     with tf.variable_scope(name):
 25 |         # This Variable will hold the state of the weights for the layer
 26 |         with tf.name_scope('weights'):
 27 |             weights = tf.get_variable('W', [input_dim, hidden_units], initializer=weight_initializer)
 28 |             variable_summaries(weights)
 29 |         with tf.name_scope('biases'):
 30 |             biases = tf.get_variable('b', [hidden_units], initializer=tf.zeros_initializer)
 31 |             variable_summaries(biases)
 32 |         with tf.name_scope('Wx_plus_b'):
 33 |             preactivate = tf.matmul(inputs, weights) + biases
 34 |             tf.summary.histogram('pre_activations', preactivate)
 35 |         activations = activation(preactivate, name='activation')
 36 |         tf.summary.histogram('activations', activations)
 37 |         return activations
 38 |         # layer = tf.layers.dense(inputs=inputs, units=hidden_units, activation=activation)
 39 |         # return layer
 40 | 
 41 | 
 42 | def conv_layer_1d(name, inputs, num_filters=1, filter_size=(1, 1), strides=1, activation=None):
 43 |     """"A simple function to create a 1D conv layer"""
 44 |     with tf.variable_scope(name):
 45 |         # TensorFlow operation for convolution
 46 |         layer = tf.layers.conv1d(inputs=inputs,
 47 |                                  filters=num_filters,
 48 |                                  kernel_size=filter_size,
 49 |                                  strides=strides,
 50 |                                  padding='same',
 51 |                                  activation=activation)
 52 |         return layer
 53 | 
 54 | 
 55 | def conv_layer_2d(name, inputs, num_filters=1, filter_size=(1, 1), strides=(1,1), activation=None):
 56 |     """"A simple function to create a 2D conv layer"""
 57 |     with tf.variable_scope(name):
 58 |         # TensorFlow operation for convolution
 59 |         layer = tf.layers.conv2d(inputs=inputs,
 60 |                                  filters=num_filters,
 61 |                                  kernel_size=filter_size,
 62 |                                  strides=strides,
 63 |                                  padding='same',
 64 |                                  activation=activation)
 65 |         return layer
 66 | 
 67 | 
 68 | def max_pool_layer(inputs, name, pool_size=(1, 1), strides=(1, 1), padding='same'):
 69 |     """"A simple function to create a max pool layer"""
 70 |     with tf.variable_scope(name):
 71 |         # TensorFlow operation for max pooling
 72 |         layer = tf.layers.max_pooling2d(inputs=inputs,
 73 |                                         pool_size=pool_size,
 74 |                                         strides=strides,
 75 |                                         padding=padding)
 76 |         return layer
 77 | 
 78 | def max_pool_layer1d(inputs, name, pool_size=(1, 1), strides=(1, 1), padding='same'):
 79 |     """"A simple function to create a max pool layer"""
 80 |     with tf.variable_scope(name):
 81 |         # TensorFlow operation for max pooling
 82 |         layer = tf.layers.max_pooling1d(inputs=inputs,
 83 |                                         pool_size=pool_size,
 84 |                                         strides=strides,
 85 |                                         padding=padding)
 86 |         return layer
 87 | def dropout(inputs, keep_prob=1.0):
 88 |     with tf.name_scope('dropout'):
 89 |         # keep_prob = tf.placeholder(tf.float32)
 90 |         tf.summary.scalar('dropout_keep_probability', keep_prob)
 91 |         return tf.nn.dropout(inputs, keep_prob)
 92 | 
 93 | 
 94 | def variable_summaries(var):
 95 |   """Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
 96 |   with tf.name_scope('summaries'):
 97 |     mean = tf.reduce_mean(var)
 98 |     tf.summary.scalar('mean', mean)
 99 |     with tf.name_scope('stddev'):
100 |       stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
101 |     tf.summary.scalar('stddev', stddev)
102 |     tf.summary.scalar('max', tf.reduce_max(var))
103 |     tf.summary.scalar('min', tf.reduce_min(var))
104 |     tf.summary.histogram('histogram', var)
105 | 
106 | #
107 | # mnist_data = input_data.read_data_sets('MNIST_data',
108 | #                                        one_hot=True,   # Convert the labels into one hot encoding
109 | #                                        dtype='float32', # rescale images to `[0, 1]`
110 | #                                        reshape=False, # Don't flatten the images to vectors
111 | #                                        )
112 | 
113 | # # Simple MNIST test of the layers
114 | #
115 | # tf.reset_default_graph()
116 | # num_classes = 10
117 | # cm = conf.ConfusionMatrix(num_classes)
118 | # height, width, nchannels = 28, 28, 1
119 | # gpu_opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.45)
120 | # filters_1 = 16
121 | # kernel_size_1 = (5,5)
122 | # pool_size_1 = (2,2)
123 | # x_pl = tf.placeholder(tf.float32, [None, height, width, nchannels], name='xPlaceholder')
124 | # y_pl = tf.placeholder(tf.float64, [None, num_classes], name='yPlaceholder')
125 | # y_pl = tf.cast(y_pl, tf.float32)
126 | #
127 | # x = conv_layer_2d('layer1', x_pl, filters_1, kernel_size_1, activation=tf.nn.relu)
128 | # x = max_pool_layer(x, 'max_pool', pool_size = pool_size_1, strides=pool_size_1)
129 | # x = tf.contrib.layers.flatten(x)
130 | #
131 | # y = ffn_layer('output_layer', x, hidden_units=num_classes, activation=tf.nn.softmax)
132 | #
133 | # with tf.variable_scope('loss'):
134 | #     # computing cross entropy per sample
135 | #     cross_entropy = -tf.reduce_sum(y_pl * tf.log(y + 1e-8), reduction_indices=[1])
136 | #
137 | #     # averaging over samples
138 | #     cross_entropy = tf.reduce_mean(cross_entropy)
139 | #
140 | # with tf.variable_scope('training'):
141 | #     # defining our optimizer
142 | #     optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
143 | #
144 | #     # applying the gradients
145 | #     train_op = optimizer.minimize(cross_entropy)
146 | #
147 | # with tf.variable_scope('performance'):
148 | #     # making a one-hot encoded vector of correct (1) and incorrect (0) predictions
149 | #     correct_prediction = tf.equal(tf.argmax(y, axis=1), tf.argmax(y_pl, axis=1))
150 | #
151 | #     # averaging the one-hot encoded vector
152 | #     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
153 | #
154 | #
155 | # # Training Loop
156 | # batch_size = 100
157 | # max_epochs = 10
158 | #
159 | # valid_loss, valid_accuracy = [], []
160 | # train_loss, train_accuracy = [], []
161 | # test_loss, test_accuracy = [], []
162 | #
163 | # with tf.Session() as sess:
164 | #     sess.run(tf.global_variables_initializer())
165 | #     print('Begin training loop')
166 | #
167 | #     try:
168 | #         while mnist_data.train.epochs_completed < max_epochs:
169 | #             _train_loss, _train_accuracy = [], []
170 | #
171 | #             ## Run train op
172 | #             x_batch, y_batch = mnist_data.train.next_batch(batch_size)
173 | #             fetches_train = [train_op, cross_entropy, accuracy]
174 | #             feed_dict_train = {x_pl: x_batch, y_pl: y_batch}
175 | #             _, _loss, _acc = sess.run(fetches_train, feed_dict_train)
176 | #
177 | #             _train_loss.append(_loss)
178 | #             _train_accuracy.append(_acc)
179 | #
180 | #             ## Compute validation loss and accuracy
181 | #             if mnist_data.train.epochs_completed % 1 == 0 \
182 | #                     and mnist_data.train._index_in_epoch <= batch_size:
183 | #                 train_loss.append(np.mean(_train_loss))
184 | #                 train_accuracy.append(np.mean(_train_accuracy))
185 | #
186 | #                 fetches_valid = [cross_entropy, accuracy]
187 | #
188 | #                 feed_dict_valid = {x_pl: mnist_data.validation.images, y_pl: mnist_data.validation.labels}
189 | #                 _loss, _acc = sess.run(fetches_valid, feed_dict_valid)
190 | #
191 | #                 valid_loss.append(_loss)
192 | #                 valid_accuracy.append(_acc)
193 | #                 print(
194 | #                     "Epoch {} : Train Loss {:6.3f}, Train acc {:6.3f},  Valid loss {:6.3f},  Valid acc {:6.3f}".format(
195 | #                         mnist_data.train.epochs_completed, train_loss[-1], train_accuracy[-1], valid_loss[-1],
196 | #                         valid_accuracy[-1]))
197 | #
198 | #         test_epoch = mnist_data.test.epochs_completed
199 | #         while mnist_data.test.epochs_completed == test_epoch:
200 | #             x_batch, y_batch = mnist_data.test.next_batch(batch_size)
201 | #             feed_dict_test = {x_pl: x_batch, y_pl: y_batch}
202 | #             _loss, _acc = sess.run(fetches_valid, feed_dict_test)
203 | #             y_preds = sess.run(fetches=y, feed_dict=feed_dict_test)
204 | #             y_preds = tf.argmax(y_preds, axis=1).eval()
205 | #             y_true = tf.argmax(y_batch, axis=1).eval()
206 | #             cm.batch_add(y_true,y_preds)
207 | #             test_loss.append(_loss)
208 | #             test_accuracy.append(_acc)
209 | #         print('Test Loss {:6.3f}, Test acc {:6.3f}'.format(
210 | #             np.mean(test_loss), np.mean(test_accuracy)))
211 | #
212 | #
213 | #     except KeyboardInterrupt:
214 | #         pass
215 | #
216 | #
217 | # print(cm.accuracy())
218 | # epoch = np.arange(len(train_loss))
219 | # plt.figure()
220 | # plt.plot(epoch, train_accuracy,'r', epoch, valid_accuracy,'b')
221 | # plt.legend(['Train Acc','Val Acc'], loc=4)
222 | # plt.xlabel('Epochs'), plt.ylabel('Acc'), plt.ylim([0.75,1.03])
223 | # plt.show()


--------------------------------------------------------------------------------
/trafficgen/PyTgen/config.py:
--------------------------------------------------------------------------------
  1 | '''
  2 | Config file adaptation/modification of the PyTgen generator config.py
  3 | '''
  4 | 
  5 | import logging
  6 | 
  7 | 
  8 | #
  9 | # This is a default configuration for the classification of encrypted traffic project
 10 | #
 11 | class Conf(object):
 12 |     # maximum number of worker threads that can be used to execute the jobs.
 13 |     # the program will start using 3 threads and spawn new ones if needed.
 14 |     # this setting depends on the number of jobs that have to be executed
 15 |     # simultaneously (not the number of jobs given in the config file).
 16 |     maxthreads = 15
 17 | 
 18 |     # set to "logging.INFO" or "logging.DEBUG"
 19 |     loglevel = logging.DEBUG
 20 | 
 21 |     # ssh commands that will be randomly executed by the ssh traffic generator
 22 |     ssh_commands = ['ls', 'cd', 'cd /etc', 'ps ax', 'date', 'mount', 'free', 'vmstat',
 23 |                     'touch /tmp/tmpfile', 'rm /tmp/tmpfile', 'ls /tmp/tmpfile',
 24 |                     'tail /etc/hosts', 'tail /etc/passwd', 'tail /etc/fstab',
 25 |                     'cat /var/log/messages', 'cat /etc/group', 'cat /etc/mtab']
 26 | 
 27 |     # urls the http generator will randomly fetch from
 28 |     https_urls = ['https://www.dr.dk/', 'https://da.wikipedia.org/wiki/Forside',
 29 |                   'https://en.wikipedia.org/wiki/Main_Page', 'https://www.dk-hostmaster.dk/',
 30 |                   'https://www.cph.dk/', 'https://translate.google.com/', 'https://www.borger.dk/',
 31 |                   'https://www.sdu.dk/da/', 'https://www.sundhed.dk/', 'https://www.facebook.com/',
 32 |                   'https://www.ug.dk/', 'https://erhvervsstyrelsen.dk/', 'https://www.nets.eu/dk-da',
 33 |                   'https://www.jobindex.dk/', 'https://www.rejseplanen.dk/webapp/index.html', 'https://yousee.dk/',
 34 |                   'https://www.sparnord.dk/', 'https://gigahost.dk/', 'https://www.information.dk/',
 35 |                   'https://stps.dk/', 'https://www.skat.dk/', 'https://danskebank.dk/privat', 'https://www.sst.dk/']
 36 | 
 37 |     http_urls = ['http://naturstyrelsen.dk/', 'http://www.valutakurser.dk/', 'http://ordnet.dk/ddo/forside',
 38 |                  'http://www.speedtest.net/', 'http://bygningsreglementet.dk/', 'http://www.ft.dk/', 'http://tv2.dk/',
 39 |                  'http://www.kl.dk/', 'http://www.symbiosis.dk/', 'http://www.noegletal.dk/',
 40 |                  'http://novonordiskfonden.dk/da', 'http://frida.fooddata.dk/',
 41 |                  'http://www.arbejdsmiljoforskning.dk/da', 'http://www.su.dk/', 'http://www.trafikstyrelsen.dk/da.aspx',
 42 |                  'http://www.regioner.dk/', 'http://www.geus.dk/UK/Pages/default.aspx', 'http://bm.dk/',
 43 |                  'http://www.m.dk/#!/', 'http://www.regionsjaelland.dk/Sider/default.aspx',
 44 |                  'http://www.trafikstyrelsen.dk/da.aspx']
 45 |     # http_intern = ['http://web.intern.ndsec']
 46 | 
 47 |     # a number of files that will randomly be used for ftp upload
 48 |     ftp_put = ['~/files/file%s' % i for i in range(0, 9)]
 49 | 
 50 |     # a number of files that will randomly be used for ftp download
 51 |     ftp_get = ['~/files/file%s' % i for i in range(0, 9)]
 52 | 
 53 |     # array of source-destination tuples for sftp upload
 54 |     sftp_put = [('~/files/file%s' % i, '/tmp/file%s' % i) for i in range(0, 9)]
 55 | 
 56 |     # array of source-destination tuples for sftp download
 57 |     sftp_get = [('/media/share/files/file%s' % i, '~/files/tmp/file%s' % i) for i in range(0, 9)]
 58 | 
 59 |     # significant part of the shell prompt to be able to recognize
 60 |     # the end of a telnet data transmission
 61 |     telnet_prompt = "$ "
 62 | 
 63 |     # job configuration (see config.example.py)
 64 |     jobdef = [
 65 |               # http (intern)
 66 |               # ('http_gen', [(start_hour, start_min), (end_hour, end_min), (interval_min, interval_sec)], [urls, retry, sleep_multiply])              
 67 |               # ('http_gen', [(9, 0), (16, 30), (60, 0)], [http_intern, 2, 30]),
 68 |               # ('http_gen', [(9, 55), (9, 30), (5, 0)], [http_intern, 5, 20]),
 69 |               # ('http_gen', [(12, 0), (12, 30), (2, 0)], [http_intern, 6, 10]),
 70 |               # ('http_gen', [(10, 50), (12, 0), (10, 0)], [http_intern, 2, 10]),
 71 |               # ('http_gen', [(15, 0), (17, 30), (30, 0)], [http_intern, 8, 20]),
 72 |               #
 73 |               # http (extern)
 74 |               # ('http_gen', [(12, 0), (12, 30), (5, 0)], [http_extern, 10, 20]),
 75 |               # ('http_gen', [(9, 0), (17, 0), (30, 0)], [http_extern, 5, 30]),
 76 |               ('http_gen', [(11, 0), (13, 0), (0, 10)], [http_urls, 1, 5]),
 77 |               ('http_gen', [(11, 0), (13, 0), (0, 10)], [https_urls, 1, 5]),
 78 |               # ('http_gen', [(9, 0), (17, 0), (90, 0)], [http_extern, 10, 30]),
 79 |               # ('http_gen', [(12, 0), (12, 10), (5, 0)], [http_extern, 15, 20]),
 80 |               #
 81 |               # smtp
 82 |               # ('smtp_gen', [(9, 0), (18, 0), (120, 0)], ['mail.extern.ndsec', 'mail2', 'mail', 'mail2@mail.extern.ndsec', 'mail52@mail.extern.ndsec']),
 83 |               # ('smtp_gen', [(12, 0), (13, 0), (30, 0)], ['mail.extern.ndsec', 'mail20', 'mail', 'mail20@mail.extern.ndsec', 'mail51@mail.extern.ndsec']),
 84 |               #
 85 |               # ftp
 86 |               # ('ftp_gen', [(9, 0), (11, 0), (15, 0)], ['ftp.intern.ndsec', 'ndsec', 'ndsec', ftp_put, ftp_get, 10, False, 5]),
 87 |               # ('ftp_gen', [(10, 0), (18, 0), (135, 0)], ['ftp.intern.ndsec', 'ndsec', 'ndsec', ftp_put, [], 2, False]),
 88 |               #
 89 |               # nfs / smb
 90 |               # ('copy_gen', [(9, 0), (12, 0), (90, 0)], [None, 'Z:/tmp/dummyfile.txt', 30]),
 91 |               # ('copy_gen', [(10, 0), (16, 0), (120, 0)], [None, 'Z:/tmp/dummyfile.txt', 80]),
 92 |               # ('copy_gen', [(12, 0), (17, 0), (160, 0)], [None, 'Z:/tmp/dummyfile.txt', 180]),
 93 |               # ('copy_gen', [(9, 0), (18, 0), (0, 10)], ['file1', 'file2']),
 94 |               #
 95 |               # telnet
 96 |               # ('telnet_gen', [(9, 0), (18, 0), (60, 0)], ['telnet.intern.ndsec', None, 'ndsec', 'ndsec', 5, ssh_commands, telnet_prompt, 10]),
 97 |               # ('telnet_gen', [(9, 0), (18, 0), (240, 0)], ['telnet.intern.ndsec', 23, 'ndsec', 'ndsec', 2, [], telnet_prompt]),
 98 |               # ('telnet_gen', [(16, 0), (18, 0), (120, 0)], ['telnet.intern.ndsec', 23, 'ndsec', 'wrongpass', 2, [], telnet_prompt]),
 99 |               #
100 |               # ssh
101 |             #   ('ssh_gen', [(9, 0), (18, 0), (120, 0)], ['ssh.intern.ndsec', 22, 'ndsec', 'ndsec', 10, ssh_commands]),
102 |             #   ('ssh_gen', [(9, 0), (18, 0), (240, 0)], ['ssh.intern.ndsec', 22, 'ndsec', 'ndsec', 60, [], 30]),
103 |               # ('ssh_gen', [(9, 0), (18, 0), (120, 0)], ['192.168.10.50', 22, 'dummy1', 'dummy1', 5, ssh_commands]),
104 |               # ('ssh_gen', [(12, 0), (14, 0), (120, 0)], ['ssh.intern.ndsec', 22, 'dummy1', 'wrongpass', 5, ssh_commands]),
105 |               #
106 |               # sftp
107 |               # ('sftp_gen', [(17, 0), (18, 0), (60, 0)], ['127.0.0.1', 22, 'user', 'pass', sftp_put, sftp_get, 5, 1]),
108 |               ]


--------------------------------------------------------------------------------
/trafficgen/PyTgen/core/__init__.py:
--------------------------------------------------------------------------------
1 | '''
2 | Python 3 adaption/modification of the PyTgen __init__
3 | '''
4 | from .runner import runner
5 | from .scheduler import scheduler


--------------------------------------------------------------------------------
/trafficgen/PyTgen/core/generator.py:
--------------------------------------------------------------------------------
  1 | '''
  2 | Python 3 adaption/modification of the PyTgen generator
  3 | '''
  4 | import logging
  5 | import random
  6 | import datetime
  7 | import time
  8 | import base64
  9 | import string
 10 | import os
 11 | 
 12 | import smtplib      # Used for smtp/mail
 13 | import ftplib       # Used for ftp
 14 | import shutil       # Used for network copy
 15 | import telnetlib    # Used for telnet traffic
 16 | import urllib3      # Used for http/https
 17 | # import paramiko     # Used for ssh + sftp
 18 | urllib3.disable_warnings()
 19 | logging.getLogger("urllib3").setLevel(logging.WARNING)
 20 | 
 21 | class http_gen():
 22 |     '''
 23 |     http and https generator
 24 |     send HTTP GET requests to a webserver to retrieve a given URL n times.
 25 |     delay the requests by a random time controlled by a multiplier. 
 26 |     to get best results with this generator, the size of the http answers 
 27 |     should differ (send random data)
 28 |     '''
 29 |     __generator__ = 'http'
 30 | 
 31 |     def __init__(self,
 32 |                  params):
 33 |         self._urls = params[0]
 34 |         self._num = params[1]
 35 |         self._multiplier = 5
 36 | 
 37 |         if len(params) == 3:
 38 |             self._multiplier = params[2]
 39 | 
 40 |     def __call__(self):
 41 |         for _ in range(self._num):
 42 |             url = self._urls[random.randint(0, (len(self._urls) - 1))]
 43 |             logging.getLogger(self.__generator__).info("Requesting: %s", url)
 44 | 
 45 |             try:
 46 |                 for __ in range(int(random.random() * 10 + 1)):
 47 |                   http = urllib3.PoolManager()
 48 |                   response = http.request('GET', url)
 49 |                   logging.getLogger(self.__generator__).debug("Recieved %s bytes from %s",
 50 |                                                               str(len(response.data)),
 51 |                                                               url)
 52 | 
 53 |             except:
 54 |                 logging.getLogger(self.__generator__).debug("Failed to request %s",
 55 |                                                             url)
 56 | 
 57 |             time.sleep(random.random() * self._multiplier)
 58 | 
 59 | class smtp_gen():
 60 |     '''
 61 |     smtp generator
 62 |     connect to an smtp server and send an email containing a random length
 63 |     string to a destination email address.
 64 |     '''
 65 |     __generator__ = "smtp"
 66 | 
 67 |     def __init__(self,
 68 |                  params):
 69 |         self._host = params[0]
 70 |         self._user = params[1]
 71 |         self._pass = params[2]
 72 |         self._from = params[3]
 73 |         self._to = params[4]
 74 | 
 75 |     def __call__(self):
 76 |         rnd = ''.join(random.choice(string.ascii_letters) for _ in range(int(3000 * random.random())))
 77 | 
 78 |         msg = "From: " + self._from + "\r\n" \
 79 |             + "To: " + self._to + "\r\n" \
 80 |             + "Subject: PyTgen " + str(datetime.datetime.now()) + "\r\n\r\n" \
 81 |             + rnd + "\r\n"
 82 | 
 83 |         logging.getLogger(self.__generator__).info("Sending email to %s (size: %s)",
 84 |                                                    self._host,
 85 |                                                    len(rnd))
 86 | 
 87 |         try:
 88 |             sender = smtplib.SMTP(self._host, 25)
 89 | 
 90 |             try:
 91 |                 logging.getLogger(self.__generator__).debug("Using TLS")
 92 |                 sender.starttls()
 93 | 
 94 |             except:
 95 |                 pass
 96 | 
 97 |             try:
 98 |                 sender.login(self._user, self._pass)
 99 | 
100 |             except smtplib.SMTPAuthenticationError:
101 |                 logging.getLogger(self.__generator__).debug("Using PLAIN auth")
102 |                 sender.docmd("AUTH LOGIN", base64.b64encode(self._user))
103 |                 sender.docmd(base64.b64encode(self._pass), "")
104 | 
105 |             sender.sendmail(self._from, self._to, msg)
106 |             logging.getLogger(self.__generator__).debug("Email sent successful")
107 | 
108 |         except:
109 |             raise
110 | 
111 |         else:
112 |             sender.quit()
113 | 
114 | class ftp_gen():
115 |     '''
116 |     ftp and ftp_tls generator
117 |     connect to a host using ftp and start uploading and downloading files. 
118 |     The files to be put or retrieved are specified in an array and are
119 |     randomly choosen. An empty array will skip upload or download.
120 |     '''
121 |     __generator__ = 'ftp'
122 | 
123 |     def __init__(self,
124 |                params):
125 |         self._host = params[0]
126 |         self._user = params[1]
127 |         self._pass = params[2]
128 |         self._put = params[3]
129 |         self._get = params[4]
130 |         self._num = params[5]
131 |         self._tls = params[6]
132 |         self._multiplier = 5
133 | 
134 |         if len(params) == 7:
135 |             self._multiplier = params[6]
136 | 
137 |     def __call__(self):
138 |         ftp = None
139 |         if self._tls == True:
140 |             logging.getLogger(self.__generator__).info("Connecting to ftps://%s",
141 |                                                        self._host)
142 |             try:
143 |                 # 20% chanche to login with a wrong password at first try
144 |                 if random.random() > 0.8:
145 |                     try:
146 |                         logging.getLogger(self.__generator__).debug("Logging in with wrong credentials")
147 |                         ftp = ftplib.FTP_TLS(self._host,
148 |                                              self._user,
149 |                                              "wrongpass")
150 |                     except:
151 |                         pass
152 | 
153 |                     time.sleep(2 * random.random())
154 | 
155 |                 ftp = ftplib.FTP_TLS(self._host,
156 |                                      self._user,
157 |                                      self._pass)
158 |                 ftp.prot_p()
159 | 
160 |             except:
161 |                 logging.getLogger(self.__generator__).debug("Error connecting to ftps://%s",
162 |                                                             self._host)
163 | 
164 |         else:
165 |             logging.getLogger(self.__generator__).info("Connecting to ftp://%s",
166 |                                                        self._host)
167 |             try:
168 |                 # 20% chanche to login with a wrong password at first try
169 |                 if random.random() > 0.8:
170 |                     try:
171 |                         logging.getLogger(self.__generator__).debug("Logging in with wrong credentials")
172 |                         ftp = ftplib.FTP(self._host,
173 |                                          self._user,
174 |                                          "wrongpass")
175 |                     except:
176 |                         pass
177 | 
178 |                     time.sleep(2 * random.random())
179 | 
180 |                 ftp = ftplib.FTP(self._host,
181 |                                  self._user,
182 |                                  self._pass)
183 | 
184 |             except:
185 |                 logging.getLogger(self.__generator__).debug("Error connecting to ftp://%s",
186 |                                                             self._host)
187 | 
188 |         if ftp is not None:
189 |             ftp.retrlines('LIST')
190 | 
191 |             for _ in range(self._num):
192 |                 if len(self._put) is not 0:
193 |                     ressource = self._put[random.randint(0, (len(self._put) - 1))]
194 |                     (path, filename) = os.path.split(ressource)
195 | 
196 |                     logging.getLogger(self.__generator__).debug("Uploading %s",
197 |                                                                 ressource)
198 | 
199 |                     f = open(ressource, 'r')
200 |                     ftp.storbinary("STOR " + filename, f)
201 |                     f.close()
202 | 
203 |                 time.sleep(self._multiplier * random.random())
204 | 
205 |                 if len(self._get) is not 0:
206 |                     ressource = self._get[random.randint(0, (len(self._get) - 1))]
207 | 
208 |                     logging.getLogger(self.__generator__).debug("Downloading %s",
209 |                                                                 ressource)
210 | 
211 |                     ftp.retrbinary('RETR ' + ressource, self._getfile)
212 | 
213 |                 time.sleep(self._multiplier * random.random())
214 | 
215 |             ftp.quit()
216 | 
217 |     def _getfile(self,
218 |                  ressource):
219 |         pass
220 | 
221 | class copy_gen():
222 |     '''
223 |     copy generator.
224 |     copy files or directories from a source to a destination. this generator
225 |     can be used to generate traffic on network filesystems like nfs or smb.
226 |     A random source file will be generated if the source parameter is set to
227 |     "None". The size of the generated source file can be controlled by an 
228 |     optional size parameter (default = 8192 byte)
229 |     '''
230 |     __generator__ = "copy"
231 | 
232 |     def __init__(self,
233 |                  params):
234 |         self._src = params[0]
235 |         self._dst = params[1]
236 |         self._size = 8192
237 | 
238 |         if len(params) == 3:
239 |             self._size = params[2] * 1024
240 | 
241 |     def __call__(self):
242 |         if self._src is not None:
243 |             logging.getLogger(self.__generator__).info("Copying from %s to %s",
244 |                                                        self._src,
245 |                                                        self._dst)
246 | 
247 |             if os.path.isdir(self._src):
248 |                 dst = self._dst + "/" + self._src
249 | 
250 |                 if os.path.exists(dst):
251 |                     logging.getLogger(self.__generator__).debug("Destination %s exists. Deleting it.",
252 |                                                                 dst)
253 |                     shutil.rmtree(dst)
254 | 
255 |                 try:
256 |                     shutil.copytree(self._src, dst)
257 |                 except:
258 |                     logging.getLogger(self.__generator__).debug("Error copying %s to %s",
259 |                                                                 self._src,
260 |                                                                 dst)
261 | 
262 |             else:
263 |                 try:
264 |                     shutil.copy2(self._src, self._dst)
265 |                 except:
266 |                     logging.getLogger(self.__generator__).debug("Error copying %s to %s",
267 |                                                                 self._src,
268 |                                                                 self._dst)
269 | 
270 |         else:
271 |             if (not os.path.exists(self._dst)) or (os.path.isfile(self._dst)):
272 |                 rnd = ''.join(random.choice(string.ascii_letters) for _ in range(int(self._size * random.random())))
273 | 
274 |                 logging.getLogger(self.__generator__).info("Writing %s byte to %s",
275 |                                                            len(rnd),
276 |                                                            self._dst)
277 | 
278 |                 try:
279 |                     f = open(self._dst, "w")
280 |                     f.write(rnd)
281 | 
282 |                 except:
283 |                     logging.getLogger(self.__generator__).debug("Error writing to %s",
284 |                                                                 self._dst)
285 | 
286 |                 else:
287 |                     f.close()
288 | 
289 |             else:
290 |                 logging.getLogger(self.__generator__).info("Destination %s is not a file",
291 |                                                            self._dst)
292 | 
293 | 
294 | class telnet_gen():
295 |     '''
296 |     telnet generator.
297 |     connect to a host using telnet and start sending commands to the host. The 
298 |     connection will be kept open until the connection time provided in the 
299 |     config is over. If the commands array is empty, the connection will idle 
300 |     until connection time is over.
301 |     '''
302 |     __generator__ = "telnet"
303 | 
304 |     def __init__(self,
305 |                  params):
306 |         self._host = params[0]
307 |         self._port = params[1]
308 |         self._user = params[2]
309 |         self._pass = params[3]
310 |         self._time = params[4]
311 |         self._cmds = params[5]
312 |         self._prompt = params[6]
313 |         self._multiplier = 60
314 | 
315 |         if len(params) == 8:
316 |             self._multiplier = params[7]
317 | 
318 |     def __call__(self):
319 |         logging.getLogger(self.__generator__).info("Connecting to %s",
320 |                                                    self._host)
321 | 
322 |         realmin = self._time * 2 * random.random()
323 |         endtime = datetime.datetime.now() + datetime.timedelta(minutes = realmin)
324 | 
325 |         try:
326 |             tn = telnetlib.Telnet(self._host, self._port)
327 | 
328 |             try:
329 |                 tn.read_until("login: ")
330 |                 tn.write(self._user + "\n")
331 |                 if self._pass is not None:
332 |                     tn.read_until("Password: ")
333 |                     tn.write(self._pass + "\n")
334 | 
335 |             except:
336 |                 logging.getLogger(self.__generator__).debug("Error logging in")
337 | 
338 |         except:
339 |             logging.getLogger(self.__generator__).debug("Error connecting to %s",
340 |                                                         self._host)
341 | 
342 |         else:
343 |             while datetime.datetime.now() < endtime:
344 |                 if len(self._cmds) is not 0:
345 |                     self._send_cmds(tn)
346 | 
347 |                 else:
348 |                     time.sleep(realmin)
349 |                     break
350 | 
351 |                 time.sleep(self._multiplier * random.random())
352 | 
353 |             tn.write("exit\n")
354 |             tn.read_all()
355 | 
356 |     def _send_cmds(self,
357 |                    tn):
358 |         for _ in range(int(6 * random.random())):
359 |             cmd = self._cmds[random.randint(0, (len(self._cmds) - 1))]
360 |             logging.getLogger(self.__generator__).debug("Sending command: %s",
361 |                                                         cmd)
362 |             tn.read_very_eager()
363 |             tn.write(cmd + '\n')
364 |             tn.read_eager()
365 |             time.sleep(5 * random.random())
366 | 
367 | class ssh_gen():
368 |     '''
369 |     ssh generator.
370 |     connect to a host using ssh and start sending commands to the host. The 
371 |     connection will be kept open until the connection time provided in the 
372 |     config is over. If the commands array is empty, the connection will idle 
373 |     until connection time is over.
374 |     '''
375 |     __generator__ = "ssh"
376 | 
377 |     def __init__(self,
378 |                  params):
379 |         self._host = params[0]
380 |         self._port = params[1]
381 |         self._user = params[2]
382 |         self._pass = params[3]
383 |         self._time = params[4]
384 |         self._cmds = params[5]
385 |         self._multiplier = 60
386 | 
387 |         if len(params) == 7:
388 |             self._multiplier = params[6]
389 | 
390 |     def __call__(self):
391 |         logging.getLogger("paramiko").setLevel(logging.INFO)
392 |         logging.getLogger(self.__generator__).info("Connecting to %s",
393 |                                                    self._host)
394 | 
395 |         realmin = self._time * 2 * random.random()
396 |         endtime = datetime.datetime.now() + datetime.timedelta(minutes = realmin)
397 | 
398 |         client = paramiko.SSHClient()
399 |         client.load_system_host_keys()
400 |         client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
401 | 
402 |         try:
403 |             # 20% chanche to login with a wrong password at first try
404 |             if random.random() > 0.8:
405 |                 try:
406 |                     client.connect(self._host,
407 |                                    self._port,
408 |                                    self._user,
409 |                                    "wrongpass")
410 |                 except:
411 |                     pass
412 | 
413 |                 time.sleep(2 * random.random())
414 | 
415 |             client.connect(self._host,
416 |                            self._port,
417 |                            self._user,
418 |                            self._pass)
419 | 
420 |         except:
421 |             logging.getLogger(self.__generator__).debug("Error connecting to %s",
422 |                                                         self._host)
423 | 
424 |         else:
425 |             while datetime.datetime.now() < endtime:
426 |                 if len(self._cmds) is not 0:
427 |                     self._send_cmds(client)
428 | 
429 |                 else:
430 |                     time.sleep(realmin)
431 |                     break
432 | 
433 |                 time.sleep(self._multiplier * random.random())
434 | 
435 |             client.close()
436 | 
437 |     def _send_cmds(self,
438 |                    client):
439 |         for _ in range(3):
440 |             cmd = self._cmds[random.randint(0, (len(self._cmds) - 1))]
441 |             logging.getLogger(self.__generator__).debug("Sending command: %s",
442 |                                                         cmd)
443 |             client.exec_command(cmd)
444 |             time.sleep(5 * random.random())
445 | 
446 | class sftp_gen():
447 |     '''
448 |     sftp generator.
449 |     this generator will connect to a host using ssh subsystem sftp. It then 
450 |     starts sending uploading and downloading files provided via the _get and
451 |     _put parameters.
452 |     '''
453 |     __generator__ = "sftp"
454 | 
455 |     def __init__(self,
456 |                  params):
457 |         self._host = params[0]
458 |         self._port = params[1]
459 |         self._user = params[2]
460 |         self._pass = params[3]
461 |         self._put = params[4]
462 |         self._get = params[5]
463 |         self._time = params[6]
464 |         self._multiplier = 60
465 | 
466 |         if len(params) == 8:
467 |             self._multiplier = params[7]
468 | 
469 |     def __call__(self):
470 |         logging.getLogger("paramiko").setLevel(logging.INFO)
471 |         logging.getLogger(self.__generator__).info("Connecting to %s", self._host)
472 | 
473 |         realmin = self._time * 2 * random.random()
474 |         endtime = datetime.datetime.now() + datetime.timedelta(minutes = realmin)
475 | 
476 |         try:
477 |             transport = paramiko.Transport((self._host,
478 |                                             self._port))
479 |             transport.connect(username = self._user,
480 |                               password = self._pass)
481 | 
482 |             sftp = paramiko.SFTPClient.from_transport(transport)
483 | 
484 |         except:
485 |             logging.getLogger(self.__generator__).debug("Error connecting to %s",
486 |                                                         self._host)
487 | 
488 |         else:
489 |             while datetime.datetime.now() < endtime:
490 |                 put = self._put
491 |                 get = self._get
492 | 
493 |                 if len(put) is 0 and len(get) is 0:
494 |                     time.sleep(realmin)
495 |                     break
496 | 
497 |                 while len(put) is not 0:
498 |                     (src, dst) = put.pop(len(put) - 1)
499 | 
500 |                     logging.getLogger(self.__generator__).debug("Uploading %s to %s",
501 |                                                                 src, dst)
502 |                     sftp.put(src, dst)
503 |                     time.sleep(2 * random.random())
504 | 
505 |                 while len(get) is not 0:
506 |                     (src, dst) = get.pop(len(get) - 1)
507 | 
508 |                     logging.getLogger(self.__generator__).debug("Downloading %s to %s",
509 |                                                                 src, dst)
510 |                     sftp.get(src, dst)
511 |                     time.sleep(2 * random.random())
512 | 
513 |                 time.sleep(self._multiplier * random.random())
514 | 
515 |             sftp.close()
516 |             time.sleep(0.2)
517 |             transport.close()


--------------------------------------------------------------------------------
/trafficgen/PyTgen/core/runner.py:
--------------------------------------------------------------------------------
  1 | '''
  2 | Python 3 adaption/modification of the PyTgen runner
  3 | '''
  4 | 
  5 | import threading
  6 | import queue
  7 | import logging
  8 | 
  9 | class worker(threading.Thread):
 10 |     def __init__(self,
 11 |                  name,
 12 |                  queue,
 13 |                  create,
 14 |                  destroy):
 15 |         threading.Thread.__init__(self)
 16 | 
 17 |         self.__name = name
 18 |         self.__queue = queue
 19 | 
 20 |         self.__create = create
 21 |         self.__destroy = destroy
 22 | 
 23 |         self.__dismissed = threading.Event()
 24 | 
 25 |         self.daemon = True
 26 |         self.name = self.__name
 27 | 
 28 |         self.start()
 29 | 
 30 |     def run(self):
 31 |         if self.__create:
 32 |             self.__create()
 33 | 
 34 |         logging.getLogger(self.__name).debug('main loop started')
 35 | 
 36 |         while True:
 37 |             if self.__dismissed.is_set():
 38 |                 break
 39 | 
 40 |             try:
 41 |                 action = self.__queue.get(block=True,
 42 |                                           timeout=10)
 43 | 
 44 |             except:
 45 |                 continue
 46 | 
 47 |             else:
 48 |                 try:
 49 |                     action()
 50 | 
 51 |                 except:
 52 |                     raise
 53 | 
 54 |         logging.getLogger(self.__name).debug('main loop finished')
 55 | 
 56 |         if self.__destroy:
 57 |             self.__destroy()
 58 | 
 59 |     def dismiss(self):
 60 |         self.__dismissed.set()
 61 | 
 62 | class runner(object):
 63 |     def __init__(self,
 64 |                  maxthreads=10,
 65 |                  thread_create=None,
 66 |                  thread_destroy=None):
 67 |         self.__queue = queue.Queue()
 68 |         self.__maxthreads = maxthreads
 69 | 
 70 |         logging.getLogger('runner').info('creating runner with 3 initial threads')
 71 | 
 72 |         self.__workers = []
 73 |         for i in range(0, 3):
 74 |             name = 'worker_%d' % i
 75 | 
 76 |             logging.getLogger('runner').debug('creating worker thread: %s',
 77 |                                               name)
 78 | 
 79 |             self.__workers.append(worker(name=name,
 80 |                                          queue=self.__queue,
 81 |                                          create=thread_create,
 82 |                                          destroy=thread_destroy))
 83 | 
 84 |     def __call__(self,
 85 |                  action):
 86 |         self.__queue.put(action)
 87 | 
 88 |         if (self.__queue.qsize() > 2):
 89 |             if (len(self.__workers) < self.__maxthreads):
 90 |                 self._spawn()
 91 |             else:
 92 |                 logging.getLogger('Runner').warning('Not enough worker threads to handle queue')
 93 | 
 94 |     def _spawn(self):
 95 |         name = 'worker_%d' % (len(self.__workers) + 1)
 96 |         logging.getLogger('Runner').info('spawning new worker thread: %s', name)
 97 |         self.__workers.append(worker(name=name,
 98 |                                      queue=self.__queue,
 99 |                                      create=None,
100 |                                      destroy=None))
101 | 
102 |     def stop(self):
103 |         for worker in self.__workers:
104 |             worker.dismiss()
105 | 
106 |         for worker in self.__workers:
107 |             worker.join()


--------------------------------------------------------------------------------
/trafficgen/PyTgen/core/scheduler.py:
--------------------------------------------------------------------------------
  1 | '''
  2 | Python 3 adaption/modification of the PyTgen scheduler
  3 | '''
  4 | 
  5 | import datetime
  6 | import threading
  7 | import heapq
  8 | import random
  9 | import logging
 10 | 
 11 | class scheduler(threading.Thread):
 12 |     class job(object):
 13 |         def __init__(self, name, action, interval, start, end):
 14 |             self.__name = name
 15 |             self.__action = action
 16 |             self.__interval = interval
 17 |             self.__start = start
 18 |             self.__end = end
 19 |             self.__exec_time = datetime.datetime.now() + datetime.timedelta(seconds = self.__interval[1] * random.random(),
 20 |                                                                             minutes = self.__interval[0] * random.random())
 21 | 
 22 |         def __call__(self):
 23 |             today = datetime.datetime.now()
 24 |             start = today.replace(hour = self.__start[0],
 25 |                                   minute = self.__start[1],
 26 |                                   second = 0, microsecond = 0)
 27 |             end = today.replace(hour = self.__end[0],
 28 |                                 minute = self.__end[1],
 29 |                                 second = 0, microsecond = 0)
 30 | 
 31 |             if start <= self.__exec_time < end:
 32 |                 # enqueue job for "random() * 2 * interval" 
 33 |                 # in average the job will run every interval but differing randomly
 34 |                 self.__exec_time += datetime.timedelta(seconds = self.__interval[1] * random.random() * 2,
 35 |                                                        minutes = self.__interval[0] * random.random() * 2)
 36 | 
 37 |                 if self.__exec_time < datetime.datetime.now():
 38 |                     logging.getLogger("scheduler").warning('scheduler is overloaded!')
 39 | 
 40 |                 return self.__action
 41 | 
 42 |             else:
 43 |                 # enqueue the job until next start time
 44 |                 if self.__exec_time < start and self.__exec_time.day == start.day:
 45 |                     self.__exec_time = start + datetime.timedelta(seconds = 1)
 46 |                 else:
 47 |                     self.__exec_time = start + datetime.timedelta(days = 1)
 48 | 
 49 |                 logging.getLogger("scheduler").info("enqueueing %s until %s",
 50 |                                                     self.__name, self.__exec_time)
 51 |                 return False
 52 | 
 53 | 
 54 |         def __lt__(self, other):
 55 |             try:
 56 |                 if type(other) == scheduler.job:
 57 |                     return self.__exec_time < other.__exec_time
 58 | 
 59 |                 elif type(other) == datetime.datetime:
 60 |                     return self.__exec_time < other
 61 |                 else:
 62 |                     raise Exception()
 63 |             except Exception as e:
 64 |                 raise 
 65 |             
 66 |         def __sub__(self, other):
 67 |             try:
 68 |                 if type(other) == scheduler.job:
 69 |                     return self.__exec_time - other.__exec_time
 70 | 
 71 |                 elif type(other) == datetime.datetime:
 72 |                     return self.__exec_time - other
 73 |                 else:
 74 |                     raise Exception()
 75 |             except Exception as e:
 76 |                 raise
 77 | 
 78 |     def __init__(self, jobs, runner):
 79 |         threading.Thread.__init__(self)
 80 |         self.setName('scheduler')
 81 | 
 82 |         self.__runner = runner
 83 |         self.__jobs = jobs
 84 |         heapq.heapify(self.__jobs)
 85 | 
 86 |         self.__running = False
 87 |         self.__signal = threading.Condition()
 88 | 
 89 |     def run(self):
 90 |         self.__running = True
 91 |         while self.__running:
 92 |             self.__signal.acquire()
 93 |             if not self.__jobs:
 94 |                 self.__signal.wait()
 95 | 
 96 |             else:
 97 |                 now = datetime.datetime.now()
 98 |                 while (self.__jobs[0] < now):
 99 |                     job = heapq.heappop(self.__jobs)
100 | 
101 |                     action = job()
102 |                     if action is not False:
103 |                         self.__runner(action)
104 | 
105 |                     heapq.heappush(self.__jobs, job)
106 | 
107 |                 logging.getLogger("scheduler").debug("Sleeping %s seconds", (self.__jobs[0] - now))
108 |                 self.__signal.wait((self.__jobs[0] - now).total_seconds())
109 | 
110 |             self.__signal.release()
111 | 
112 |     def stop(self):
113 |         self.__running = False
114 | 
115 |         self.__signal.acquire()
116 |         self.__signal.notify_all()
117 |         self.__signal.release()
118 | 
119 |     def set_jobs(self, jobs):
120 |         self.__signal.acquire()
121 | 
122 |         self.__jobs = jobs
123 |         heapq.heapify(self.__jobs)
124 | 
125 |         self.__signal.notify_all()
126 |         self.__signal.release()


--------------------------------------------------------------------------------
/trafficgen/PyTgen/nslookup.py:
--------------------------------------------------------------------------------
 1 | import socket
 2 | import config
 3 | 
 4 | Conf = config.Conf
 5 | # ip_list = []
 6 | # ais = socket.getaddrinfo('en.wikipedia.org', 0, 0, 0, 0)
 7 | # for result in ais:
 8 | #   ip_list.append(result[-1][0])
 9 | # ip_list = list(set(ip_list))
10 | # print(ip_list)
11 | 
12 | count = 0
13 | for http in Conf.https_urls:
14 |     url = http.split('/')[2]
15 |     ip_list = []
16 |     ais = socket.getaddrinfo(url, 0, 0, 0, 0)
17 |     for result in ais:
18 |       ip_list.append(result[-1][0])
19 |     ip_list = list(set(ip_list))
20 |     count +=1
21 |     print(http, ip_list, count)


--------------------------------------------------------------------------------
/trafficgen/PyTgen/run.py:
--------------------------------------------------------------------------------
 1 | '''
 2 | Python 3 adaption/modification of the PyTgen run.py
 3 | '''
 4 | 
 5 | import logging
 6 | import signal
 7 | import platform
 8 | 
 9 | import core.runner
10 | import core.scheduler
11 | import config
12 | 
13 | from core.generator import *
14 | 
15 | 
16 | def create_jobs():
17 |     logging.getLogger('main').info('Creating jobs')
18 | 
19 |     jobs = []
20 |     for next_job in Conf.jobdef:
21 |         logging.getLogger('main').info('creating %s', next_job)
22 | 
23 |         job = core.scheduler.job(name = next_job[0],
24 |                                  action = eval(next_job[0])(next_job[2]),
25 |                                  interval = next_job[1][2],
26 |                                  start = next_job[1][0],
27 |                                  end = next_job[1][1])
28 |         jobs.append(job)
29 | 
30 |     return jobs
31 | 
32 | if __name__ == '__main__':
33 |     # set hostbased parameters
34 |     hostname = platform.node()
35 |     
36 |     log_file = 'logs/' + hostname + '.log'
37 |     config_file = "config.py"
38 |     # file = open(log_file, 'a+')
39 |     # file.close()
40 |     # load the hostbased configuration file
41 |     # _Conf = __import__(config_file, globals(), locals(), ['Conf'], -1)
42 |     # Conf = _Conf.Conf
43 |     Conf = config.Conf
44 | 
45 |     # start logger
46 |     logging.basicConfig(level = Conf.loglevel,
47 |                         format = '%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
48 |                         datefmt = '%Y-%m-%d %H:%M:%S',
49 |                         filename = log_file)
50 | 
51 |     logging.getLogger('main').info('Configuration %s loaded', config_file)
52 | 
53 |     # start runner, create jobs, start scheduling
54 |     runner = core.runner(maxthreads = Conf.maxthreads)
55 | 
56 |     jobs = create_jobs()
57 | 
58 |     scheduler = core.scheduler(jobs = jobs,
59 |                                runner = runner)
60 | 
61 |     # Stop scheduler on exit
62 |     def signal_int(signal, frame):
63 |         logging.getLogger('main').info('Stopping scheduler')
64 |         scheduler.stop()
65 | 
66 |     signal.signal(signal.SIGINT, signal_int)
67 | 
68 |     # Run the scheduler
69 |     logging.getLogger('main').info('Starting scheduler')
70 |     scheduler.start()
71 |     scheduler.join(timeout=None)
72 | 
73 |     # Stop the runner
74 |     runner.stop()


--------------------------------------------------------------------------------
/trafficgen/Streaming/streaming_generator.py:
--------------------------------------------------------------------------------
  1 | import sys
  2 | sys.path.insert(0, "/home/mclrn/dlproject/")
  3 | import datetime
  4 | from threading import Thread
  5 | from selenium import webdriver
  6 | from slackclient import SlackClient
  7 | import traceback
  8 | import os
  9 | from selenium.webdriver.support.ui import WebDriverWait
 10 | 
 11 | 
 12 | # import trafficgen.Streaming.win_capture as cap
 13 | # import trafficgen.Streaming.streaming_types as stream
 14 | 
 15 | import unix_capture as cap
 16 | import streaming_types as stream
 17 | from constants import SLACK_TOKEN
 18 | 
 19 | def notifySlack(message):
 20 |     sc = SlackClient(SLACK_TOKEN)
 21 |     try:
 22 |         sc.api_call("chat.postMessage", channel="#server", text=message)
 23 |     except:
 24 |         sc.api_call("chat.postMessage", channel="#server", text="Could not send stacktrace")
 25 | 
 26 | 
 27 | def generate_streaming(duration, dir, total_iterations, options=None):
 28 |     iterations = 0
 29 |     stream_types = {
 30 |         # 'hbo': (stream.HboNordic, 1),
 31 |         # 'netflix': (stream.Netflix, 1),
 32 |         'twitch': (stream.Twitch, 5),
 33 |         'youtube': (stream.Youtube, 5),
 34 |         'drtv': (stream.DrTv, 5),
 35 |     }
 36 |     while iterations < total_iterations:
 37 |         print("Iteration:", iterations)
 38 |         if iterations % 25 == 0:
 39 |             notifySlack("Starting iteration: " + str(iterations))
 40 |         try:
 41 |             for stream_type in stream_types.keys():
 42 |                 browsers, capture_thread, file, streaming_threads, = [], [], [], []
 43 |                 type = stream_types[stream_type][0]
 44 |                 num_threads = stream_types[stream_type][1]
 45 | 
 46 |                 browsers, capture_thread, file, streaming_threads = generate_threaded_streaming(type, stream_type, dir,
 47 |                                                                                                 duration, options,
 48 |                                                                                                 num_threads=num_threads)
 49 |                 try:
 50 |                     capture_thread.start()
 51 |                     for thread in streaming_threads:
 52 |                         # Start streaming threads
 53 |                         thread.start()
 54 |                     print("streaming started", stream_type)
 55 |                     capture_thread.join()  # Stream until the capture thread joins
 56 |                     print("capture done - thread has joined")
 57 |                 except Exception as e:
 58 |                     notifySlack("Something went wrong %s" % traceback.format_exc())
 59 |                     # Wait for capture thread
 60 |                     capture_thread.join()
 61 |                     # Do a cleanup since somthing went wrong
 62 |                     cap.cleanup(file)
 63 |                 try:
 64 |                     for browser in browsers:
 65 |                         browser.close()
 66 |                         browser.quit()
 67 |                 except Exception as e:
 68 |                     notifySlack("Something went wrong %s" % traceback.format_exc())
 69 |                     # os.system("killall chrome")
 70 |                     # os.system("killall chromedriver")
 71 |         except Exception as ex:
 72 |             notifySlack("Something went wrong when setting up the threads \n %s" % traceback.format_exc())
 73 | 
 74 |         iterations += 1
 75 | 
 76 | 
 77 | def generate_threaded_streaming(obj: stream.Streaming, stream_name, dir, duration, chrome_options=None, num_threads=5):
 78 |     #### STREAMING ####
 79 |     # Create filename
 80 |     now = datetime.datetime.now()
 81 |     file = dir + "/%s-%.2d%.2d_%.2d%.2d%.2d.pcap" % (stream_name, now.day, now.month, now.hour, now.minute, now.second)
 82 |     # Instantiate thread
 83 |     capture_thread = Thread(target=cap.captureTraffic, args=(1, duration, dir, file))
 84 |     # Create five threads for streaming
 85 |     streaming_threads = []
 86 |     browsers = []
 87 |     for i in range(num_threads):
 88 |         browser = webdriver.Chrome(options=chrome_options)
 89 |         browser.implicitly_wait(10)
 90 |         browsers.append(browser)
 91 |         t = Thread(target=obj.stream_video, args=(obj, browser))
 92 |         streaming_threads.append(t)
 93 | 
 94 |     return browsers, capture_thread, file, streaming_threads
 95 | 
 96 | 
 97 | def get_clear_browsing_button(driver):
 98 |     """Find the "CLEAR BROWSING BUTTON" on the Chrome settings page. /deep/ to go past shadow roots"""
 99 |     return driver.find_element_by_css_selector('* /deep/ #clearBrowsingDataConfirm')
100 | 
101 | 
102 | def clear_cache(driver, timeout=60):
103 |     """Clear the cookies and cache for the ChromeDriver instance."""
104 |     # navigate to the settings page
105 |     driver.get('chrome://settings/clearBrowserData')
106 | 
107 |     # wait for the button to appear
108 |     wait = WebDriverWait(driver, timeout)
109 |     wait.until(get_clear_browsing_button)
110 | 
111 |     # click the button to clear the cache
112 |     get_clear_browsing_button(driver).click()
113 | 
114 |     # wait for the button to be gone before returning
115 |     wait.until_not(get_clear_browsing_button)
116 | 
117 | 
118 | if __name__ == "__main__":
119 |     #netflixuser = os.environ["netflixuser"]
120 |     #netflixpassword = os.environ["netflixpassword"]
121 |     #hbouser = os.environ["hbouser"]
122 |     #hbopassword = os.environ["hbopassword"]
123 |     # slack_token = os.environ['slack_token']
124 |     # Specify duration in seconds
125 |     duration = 60 * 1
126 |     total_iterations = 1000
127 |     save_dir = '/home/mclrn/Data'
128 |     chrome_profile_dir = "/home/mclrn/.config/google-chrome/"
129 |     options = webdriver.ChromeOptions()
130 |     #options.add_argument('user-data-dir=' + chrome_profile_dir)
131 |     options.add_argument("--enable-quic")
132 |     # options.add_argument('headless')
133 |     generate_streaming(duration, save_dir, total_iterations, options)
134 |     print("something")


--------------------------------------------------------------------------------
/trafficgen/Streaming/streaming_types.py:
--------------------------------------------------------------------------------
 1 | from random import randint
 2 | import time
 3 | class Streaming:
 4 | 
 5 |     def stream_video(self, browser):
 6 |         pass
 7 | 
 8 | 
 9 | class Twitch(Streaming):
10 | 
11 |     def stream_video(self, browser):
12 |         browser.get('https://www.twitch.tv/directory/game/League%20of%20Legends/videos/all')
13 |         # Choose random video
14 |         time.sleep(1)
15 |         videos = browser.find_elements_by_css_selector("a[href*='/videos/']")
16 |         video = videos[randint(0, len(videos))]
17 |         link = video.get_attribute('href')
18 |         browser.get(link)
19 | 
20 | 
21 | class Youtube(Streaming):
22 | 
23 |     def stream_video(self, browser):
24 |         browser.get('https://www.youtube.com')
25 |         # Choose random video
26 |         time.sleep(1)
27 |         videos = browser.find_elements_by_css_selector("ytd-grid-video-renderer  a[href*='/watch?v=']")
28 |         video = videos[randint(0, len(videos))]
29 |         # thumbnail
30 |         link = video.get_attribute('href')
31 |         browser.get(link)
32 | 
33 | 
34 | class Netflix(Streaming):
35 | 
36 |     def stream_video(self, browser):
37 |         # Change to correct profile
38 |         browser.get("https://www.netflix.com/SwitchProfile?tkn=I42P4G75VVDM7LV626VKTXTXGI")
39 |         # Choose random video
40 |         videos = browser.find_elements_by_css_selector("div.title-card-container a[href*='/watch/']")
41 |         video = videos[randint(0, len(videos))]
42 |         link = video.get_attribute('href')
43 |         browser.get(link)
44 | 
45 | 
46 | class DrTv(Streaming):
47 | 
48 |     def stream_video(self, browser):
49 |         browser.get('https://www.dr.dk/tv')
50 |         # Choose random video
51 |         videos = browser.find_elements_by_class_name('program-link')
52 |         video = videos[randint(0, len(videos))]
53 |         link = video.get_attribute('href')
54 |         browser.get(link)
55 |         play_button = browser.find_element_by_css_selector('button[title="Afspil"]')
56 |         play_button.click()
57 | 
58 | 
59 | class HboNordic(Streaming):
60 |     def stream_video(self, browser):
61 |         browser.get("https://dk.hbonordic.com/home")
62 |         videos = browser.find_elements_by_css_selector("a[data-automation='play-button']")
63 |         video = videos[randint(0, len(videos))]
64 |         video_url = video.get_attribute("href")
65 |         browser.get(video_url)
66 | 


--------------------------------------------------------------------------------
/trafficgen/Streaming/unix_capture.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | 
 3 | 
 4 | def captureTraffic(interfaceNumber, duration, dir, file):
 5 |     '''
 6 |     Interfacenumber: specifies the network interface that should be captured (use tshark -D to list the options)
 7 |     duration: specifies the number of seconds that the capture should go on
 8 |     dir: is the folder to which the pcap file of the capture should be saved
 9 |     name: is the name of that pcap file (this will always be appended by date and time)
10 |     '''
11 |     # makedir if it does not exist
12 |     if not os.path.isdir(dir):
13 |         os.mkdir(dir)
14 |     #open(file, "w") # overwrites if file already exists
15 |     os.system("echo %s |sudo -S tshark -i %d -a duration:%d -w %s port not 5901 and ip and not broadcast and not multicast" % ('Napatech10',interfaceNumber, duration, file))
16 |     os.system("echo %s |sudo -S chown mclrn:mclrn %s" % ('Napatech10', file))
17 | 
18 | 
19 | 
20 | '''
21 | To test
22 | captureTraffic(1,10, 'C:/users/arhjo/desktop', "test")
23 | '''
24 | 
25 | 
26 | def cleanup(file):
27 |     os.system("echo %s |sudo -S rm %s" % ('Napatech10', file))
28 | 


--------------------------------------------------------------------------------
/trafficgen/Streaming/win_capture.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import datetime
 3 | 
 4 | 
 5 | def captureTraffic(interfaceNumber, duration, dir, file):
 6 |     '''
 7 |     Interfacenumber: specifies the network interface that should be captured (use tshark -D to list the options)
 8 |     duration: specifies the number of seconds that the capture should go on
 9 |     dir: is the folder to which the pcap file of the capture should be saved
10 |     name: is the name of that pcap file (this will always be appended by date and time)
11 |     '''
12 | 
13 |     # makedir if it does not exist
14 |     if not os.path.isdir(dir):
15 |         os.mkdir(dir)
16 | 
17 |     open(file, "w") # overwrites if file already exists
18 |     os.system("tshark -i %d -a duration:%d -w %s port not 5901 and ((port 80) or (port 443)) and ip and not broadcast and not multicast" % (interfaceNumber, duration, file))
19 | 
20 | 
21 | def cleanup(file):
22 |     os.system("del %s" % file)
23 | 
24 | '''
25 | To test
26 | captureTraffic(1,10, 'C:/users/arhjo/desktop', "test")
27 | '''


--------------------------------------------------------------------------------
/train/train_header.py:
--------------------------------------------------------------------------------
  1 | import tensorflow as tf
  2 | from tf import tf_utils as tfu, confusionmatrix as conf, dataset, early_stopping as es
  3 | import numpy as np
  4 | import datetime
  5 | from sklearn import metrics
  6 | from sklearn.preprocessing import label_binarize
  7 | import utils
  8 | import os
  9 | 
 10 | now = datetime.datetime.now()
 11 | 
 12 | summaries_dir = '../tensorboard'
 13 | 
 14 | # hidden_units = 12
 15 | units = [5]
 16 | acc_list = []
 17 | num_headers_train = []
 18 | hidden_units_train = []
 19 | num_headers = [8]
 20 | for num_header in num_headers:
 21 |     for hidden_units in units:
 22 |         hidden_units_train.append(hidden_units)
 23 |         train_dirs = ["E:/Data/LinuxChrome/{}/".format(num_header),
 24 |                       "E:/Data/WindowsSalik/{}/".format(num_header),
 25 |                       "E:/Data/WindowsAndreas/{}/".format(num_header),
 26 |                       "E:/Data/WindowsFirefox/{}/".format(num_header)
 27 |                       ]
 28 | 
 29 |         test_dirs = ["E:/Data/WindowsChrome/{}/".format(num_header)]
 30 | 
 31 |         trainstr = "train:"
 32 |         for traindir in train_dirs:
 33 |             trainstr += traindir.split('Data/')[1].split("/")[0]
 34 |             trainstr += ":"
 35 |         teststr = "test:"
 36 |         for testdir in test_dirs:
 37 |             teststr += testdir.split('Data/')[1].split("/")[0]
 38 |             teststr += ":"
 39 |         timestamp = "%.2d%.2d_%.2d%.2d" % (now.day, now.month, now.hour, now.minute)
 40 |         save_dir = "../trained_models/{0}/{1}/{2}/".format(num_header, hidden_units, timestamp)
 41 |         os.makedirs(save_dir, exist_ok=True)
 42 |         seed = 0
 43 |         namestr = trainstr+teststr+str(num_header)+":"+str(hidden_units)
 44 |         # Beta for L2 regularization
 45 |         beta = 1.0
 46 |         val_size = [0.1]
 47 | 
 48 |         early_stop = es.EarlyStopping(patience=20, min_delta=0.05)
 49 |         subdir = "/{0}/{1}/{2}/".format(num_header, hidden_units, timestamp)
 50 |         input_size = num_header*54
 51 |         data = dataset.read_data_sets(test_dirs, train_dirs, merge_data=True, one_hot=True,
 52 |                                       validation_size=0.1,
 53 |                                       test_size=0.1,
 54 |                                       balance_classes=False,
 55 |                                       payload_length=input_size,
 56 |                                       seed=seed)
 57 |         tf.reset_default_graph()
 58 |         num_headers_train.append(num_header)
 59 |         num_classes = len(dataset._label_encoder.classes_)
 60 |         labels = dataset._label_encoder.classes_
 61 | 
 62 |         # cm = conf.ConfusionMatrix(num_classes, class_names=labels)
 63 |         gpu_opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.85)
 64 | 
 65 |         x_pl = tf.placeholder(tf.float32, [None, input_size], name='xPlaceholder')
 66 |         y_pl = tf.placeholder(tf.float64, [None, num_classes], name='yPlaceholder')
 67 |         y_pl = tf.cast(y_pl, tf.float32)
 68 | 
 69 |         x = tfu.ffn_layer('layer1', x_pl, hidden_units, activation=tf.nn.relu, seed=seed)
 70 |         # x = tf.layers.dense(x_pl, hidden_units, tf.nn.relu)
 71 |         # x = tfu.dropout(x, 0.5)
 72 |         # x = tfu.ffn_layer('layer2', x, hidden_units, activation=tf.nn.relu)
 73 |         # x = tfu.ffn_layer('layer2', x, 50, activation=tf.nn.sigmoid)
 74 |         # x = tfu.ffn_layer('layer3', x, 730, activation=tf.nn.relu)
 75 |         y = tfu.ffn_layer('output_layer', x, hidden_units=num_classes, activation=tf.nn.softmax, seed=seed)
 76 |         y_ = tf.argmax(y, axis=1)
 77 |         # with tf.name_scope('cross_entropy'):
 78 |         #   # The raw formulation of cross-entropy,
 79 |         #   #
 80 |         #   # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)),
 81 |         #   #                               reduction_indices=[1]))
 82 |         #   #
 83 |         #   # can be numerically unstable.
 84 |         #   #
 85 |         #   # So here we use tf.losses.sparse_softmax_cross_entropy on the
 86 |         #   # raw logit outputs of the nn_layer above.
 87 |         #   with tf.name_scope('total'):
 88 |         #     cross_entropy = tf.losses.softmax_cross_entropy(onehot_labels=y_pl, logits=y)
 89 | 
 90 | 
 91 |         with tf.variable_scope('loss'):
 92 |             # computing cross entropy per sample
 93 |             cross_entropy = -tf.reduce_sum(y_pl * tf.log(y + 1e-8), reduction_indices=[1])
 94 | 
 95 |             W1 = tf.get_default_graph().get_tensor_by_name("layer1/W:0")
 96 |             loss = tf.nn.l2_loss(W1)
 97 |             # averaging over samples
 98 |             cross_entropy = tf.reduce_mean(cross_entropy)
 99 |             loss = tf.reduce_mean(cross_entropy + loss * beta)
100 | 
101 |         tf.summary.scalar('cross_entropy', cross_entropy)
102 | 
103 |         with tf.variable_scope('training'):
104 |             # defining our optimizer
105 |             optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
106 | 
107 |             # applying the gradients
108 |             train_op = optimizer.minimize(cross_entropy)
109 | 
110 |         with tf.variable_scope('performance'):
111 |             # making a one-hot encoded vector of correct (1) and incorrect (0) predictions
112 |             correct_prediction = tf.equal(tf.argmax(y, axis=1), tf.argmax(y_pl, axis=1))
113 | 
114 |             # averaging the one-hot encoded vector
115 |             accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
116 | 
117 |         tf.summary.scalar('accuracy', accuracy)
118 | 
119 |         # Merge all the summaries and write them out
120 |         merged = tf.summary.merge_all()
121 |         train_writer = tf.summary.FileWriter(summaries_dir + '/train/' + subdir)
122 |         val_writer = tf.summary.FileWriter(summaries_dir + '/validation/' + subdir)
123 | 
124 |         # Training Loop
125 |         batch_size = 100
126 |         max_epochs = 200
127 | 
128 |         valid_loss, valid_accuracy = [], []
129 |         train_loss, train_accuracy = [], []
130 |         test_loss, test_accuracy = [], []
131 |         epochs = []
132 |         with tf.Session() as sess:
133 |             early_stop.on_train_begin()
134 |             train_writer.add_graph(sess.graph)
135 |             sess.run(tf.global_variables_initializer())
136 |             total_parameters = 0
137 |             print("Calculating trainable parameters!")
138 |             for variable in tf.trainable_variables():
139 |                 # shape is an array of tf.Dimension
140 |                 shape = variable.get_shape()
141 |                 print("Shape: {}".format(shape))
142 |                 variable_parameters = 1
143 |                 for dim in shape:
144 |                     variable_parameters *= dim.value
145 |                 print("Shape {0} gives {1} trainable parameters".format(shape, variable_parameters))
146 |                 total_parameters += variable_parameters
147 |             print("Trainable parameters: {}".format(total_parameters))
148 |             print('Begin training loop')
149 |             saver = tf.train.Saver()
150 |             try:
151 |                 while data.train.epochs_completed < max_epochs:
152 |                     _train_loss, _train_accuracy = [], []
153 |                     ## Run train op
154 |                     x_batch, y_batch = data.train.next_batch(batch_size)
155 |                     fetches_train = [train_op, cross_entropy, accuracy, merged]
156 |                     feed_dict_train = {x_pl: x_batch, y_pl: y_batch}
157 |                     _, _loss, _acc, _summary = sess.run(fetches_train, feed_dict_train)
158 | 
159 |                     _train_loss.append(_loss)
160 |                     _train_accuracy.append(_acc)
161 |                     ## Compute validation loss and accuracy
162 |                     if data.train.epochs_completed % 1 == 0 \
163 |                             and data.train._index_in_epoch <= batch_size:
164 | 
165 |                         train_writer.add_summary(_summary, data.train.epochs_completed)
166 | 
167 |                         train_loss.append(np.mean(_train_loss))
168 |                         train_accuracy.append(np.mean(_train_accuracy))
169 | 
170 |                         fetches_valid = [cross_entropy, accuracy, merged]
171 | 
172 |                         feed_dict_valid = {x_pl: data.validation.payloads, y_pl: data.validation.labels}
173 |                         _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_valid)
174 |                         epochs.append(data.train.epochs_completed)
175 |                         valid_loss.append(_loss)
176 |                         valid_accuracy.append(_acc)
177 |                         val_writer.add_summary(_summary, data.train.epochs_completed)
178 |                         current = valid_loss[-1]
179 |                         early_stop.on_epoch_end(data.train.epochs_completed, current)
180 |                         print("Epoch {} : Train Loss {:6.3f}, Train acc {:6.3f},  Valid loss {:6.3f},  Valid acc {:6.3f}"
181 |                               .format(data.train.epochs_completed, train_loss[-1], train_accuracy[-1], valid_loss[-1],
182 |                                       valid_accuracy[-1]))
183 |                         if early_stop.stop_training:
184 |                             early_stop.on_train_end()
185 |                             break
186 | 
187 | 
188 |                 test_epoch = data.test.epochs_completed
189 |                 while data.test.epochs_completed == test_epoch:
190 |                     batch_size = 1000
191 |                     x_batch, y_batch = data.test.next_batch(batch_size)
192 |                     feed_dict_test = {x_pl: x_batch, y_pl: y_batch}
193 |                     _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_test)
194 |                     y_preds = sess.run(fetches=y, feed_dict=feed_dict_test)
195 |                     y_preds = tf.argmax(y_preds, axis=1).eval()
196 |                     y_true = tf.argmax(y_batch, axis=1).eval()
197 |                     # cm.batch_add(y_true, y_preds)
198 |                     test_loss.append(_loss)
199 |                     test_accuracy.append(_acc)
200 |                     # test_writer.add_summary(_summary, data.train.epochs_completed)
201 |                 print('Test Loss {:6.3f}, Test acc {:6.3f}'.format(
202 |                     np.mean(test_loss), np.mean(test_accuracy)))
203 |                 namestr += ":acc{:.3f}".format(np.mean(test_accuracy))
204 |                 acc_list.append("{:.3f}".format(np.mean(test_accuracy)))
205 |                 saver.save(sess, save_dir+'header_{0}_{1}_units.ckpt'.format(num_header, hidden_units))
206 |                 feed_dict_test = {x_pl: data.test.payloads, y_pl: data.test.labels}
207 |                 # Create ROC curve for all classes
208 |                 y_preds = sess.run(fetches=y, feed_dict=feed_dict_test)
209 |                 y_true = data.test.labels
210 |                 utils.plot_ROC(y_true, y_preds, num_classes, labels, micro=False, macro=False)
211 |                 # Compute different metrics for confusionmatrix and more
212 |                 y_preds = sess.run(fetches=y_, feed_dict=feed_dict_test)
213 |                 y_true = tf.argmax(data.test.labels, axis=1).eval()
214 |                 y_true = [labels[i] for i in y_true]
215 |                 y_preds = [labels[i] for i in y_preds]
216 |                 conf = metrics.confusion_matrix(y_true, y_preds, labels=labels)
217 |                 report = metrics.classification_report(y_true, y_preds, labels=labels)
218 |                 nostream_dict = ['http', 'https']
219 |                 y_stream_true = []
220 |                 y_stream_preds = []
221 |                 for i, v in enumerate(y_true):
222 |                     pred = y_preds[i]
223 |                     if v in nostream_dict:
224 |                         y_stream_true.append('non-streaming')
225 |                     else:
226 |                         y_stream_true.append('streaming')
227 |                     if pred in nostream_dict:
228 |                         y_stream_preds.append('non-streaming')
229 |                     else:
230 |                         y_stream_preds.append('streaming')
231 |                 stream_acc = len([v for i, v in enumerate(y_stream_preds) if v == y_stream_true[i]]) / len(
232 |                     y_stream_true)
233 | 
234 |                 # Binarize the output
235 |                 y_stream_true1 = label_binarize(y_stream_true, classes=['non-streaming', 'streaming'])
236 |                 y_stream_preds1 = label_binarize(y_stream_preds, classes=['non-streaming', 'streaming'])
237 |                 n_classes = y_stream_true1.shape[1]
238 |                 # Stream/non-stream ROC curve
239 |                 utils.plot_ROC(y_stream_true1, y_stream_preds1, n_classes, ['non-streaming', 'streaming'], micro=False, macro=False)
240 | 
241 |                 conf1 = metrics.confusion_matrix(y_true, y_preds, labels=labels)
242 |                 conf2 = metrics.confusion_matrix(y_stream_true, y_stream_preds, labels=['non-streaming', 'streaming'])
243 |                 report = metrics.classification_report(y_true, y_preds, labels=labels)
244 |                 report2 = metrics.classification_report(y_stream_true, y_stream_preds, labels=['non-streaming', 'streaming'])
245 | 
246 |             except KeyboardInterrupt:
247 |                 pass
248 |         print(namestr)
249 |         utils.plot_confusion_matrix(conf1, labels, save=False, title=namestr)
250 |         utils.plot_confusion_matrix(conf2, ['non-streaming', 'streaming'], save=False,
251 |                                     title="StreamNoStream_acc{}".format(stream_acc))
252 |         print(report)
253 |         print(report2)
254 |         # utils.plot_metric_graph(x_list=epochs, y_list=valid_accuracy, save=False, x_label="epochs", y_label="Accuracy", title="Accuracy")
255 |         # utils.plot_metric_graph(x_list=epochs, y_list=valid_loss, save=False,  x_label="epochs", y_label="loss", title="Loss")
256 |         # utils.plot_confusion_matrix(conf, labels, save=False, title=namestr)
257 |         utils.show_plot()
258 | 
259 | acc_list = list(map(float, acc_list))
260 | print(acc_list, hidden_units_train)
261 | 
262 | 
263 | 


--------------------------------------------------------------------------------
/train/train_logistic.py:
--------------------------------------------------------------------------------
 1 | from sklearn.linear_model import LogisticRegression
 2 | import tensorflow as tf
 3 | from tf import tf_utils as tfu, dataset, early_stopping as es
 4 | import numpy as np
 5 | import datetime
 6 | from sklearn import metrics
 7 | import utils
 8 | import operator
 9 | from sklearn.metrics import classification_report
10 | import pandas as pd
11 | from collections import Counter
12 | 
13 | def input_fn(data):
14 |     return (data.payloads, data.labels)
15 | 
16 | 
17 | now = datetime.datetime.now()
18 | 
19 | num_headers = 16
20 | hidden_units = 15
21 | train_dirs = ['/home/mclrn/Data/WindowsSalik/{0}/'.format(num_headers),
22 |               '/home/mclrn/Data/WindowsChrome/{0}/'.format(num_headers),
23 |               '/home/mclrn/Data/WindowsLinux/{0}/'.format(num_headers),
24 |               '/home/mclrn/Data/WindowsFirefox/{0}/'.format(num_headers)]
25 | 
26 | test_dirs = ['/home/mclrn/Data/WindowsAndreas/{0}/'.format(num_headers)]
27 | 
28 | val = 0.1
29 | input_size = num_headers * 54
30 | seed = 0
31 | train_size = []
32 | data = dataset.read_data_sets(train_dirs, test_dirs, merge_data=True, one_hot=False,
33 |                               validation_size=val,
34 |                               test_size=0.1,
35 |                               balance_classes=False,
36 |                               payload_length=input_size,
37 |                               seed=seed)
38 | 
39 | train_size.append(len(data.train.payloads))
40 | num_classes = len(dataset._label_encoder.classes_)
41 | labels = dataset._label_encoder.classes_
42 | 
43 | classifier = LogisticRegression(verbose=2)
44 | 
45 | classifier.fit(data.train.payloads, data.train.labels)
46 | 
47 | score = classifier.score(data.test.payloads, data.test.labels)
48 | 
49 | predict_proba = classifier.predict_proba(data.test.payloads)
50 | 
51 | predict = classifier.predict(data.test.payloads)
52 | 
53 | correct = list(map(operator.sub,predict,data.test.labels)).count(0)
54 | 
55 | 
56 | confusion = metrics.confusion_matrix(data.test.labels, predict)
57 | utils.plot_confusion_matrix(confusion, labels, save=True, title="Logistic regression confusion matrixacc")
58 | accuracy = correct/len(predict)
59 | 
60 | report = classification_report(data.test.labels, predict)
61 | print(accuracy)
62 | skleranacc = metrics.accuracy_score(data.test.labels,predict)
63 | print("Accuracy calculated by sklearn.metrics: {}".format(skleranacc))
64 | most_occurences = Counter(data.test.labels).most_common(1)
65 | print("Naive classifier that just guesses the most frequents label will obtain an accuracy of: %s" % ((most_occurences[0][1])/len(data.test.labels)))
66 | print(report)


--------------------------------------------------------------------------------
/train/train_payload.py:
--------------------------------------------------------------------------------
  1 | import tensorflow as tf
  2 | from tf import tf_utils as tfu, confusionmatrix as conf, dataset, early_stopping as es
  3 | import numpy as np
  4 | from sklearn import metrics
  5 | import utils
  6 | summaries_dir = '../tensorboard'
  7 | 
  8 | train_dirs = ["E:/Data/h5/https/"]
  9 | test_dirs = ["E:/Data/h5/netflix/"]
 10 | pay_len = 1460
 11 | hidden_units = 2000
 12 | save_dir = "../trained_models/"
 13 | seed = 1
 14 | 
 15 | data = dataset.read_data_sets(train_dirs, test_dirs, merge_data=True, one_hot=True, validation_size=0.1, test_size=0.1, balance_classes=True, payload_length=pay_len, seed=seed)
 16 | tf.reset_default_graph()
 17 | num_classes = len(dataset._label_encoder.classes_)
 18 | labels = dataset._label_encoder.classes_
 19 | 
 20 | early_stop = es.EarlyStopping(patience=100, min_delta=0.05)
 21 | 
 22 | cm = conf.ConfusionMatrix(num_classes, class_names=dataset._label_encoder.classes_)
 23 | gpu_opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.45)
 24 | 
 25 | x_pl = tf.placeholder(tf.float32, [None, pay_len], name='xPlaceholder')
 26 | y_pl = tf.placeholder(tf.float64, [None, num_classes], name='yPlaceholder')
 27 | y_pl = tf.cast(y_pl, tf.float32)
 28 | 
 29 | x = tfu.ffn_layer('layer1', x_pl, hidden_units, activation=tf.nn.relu, seed=seed)
 30 | x = tfu.ffn_layer('layer2', x, hidden_units, activation=tf.nn.relu, seed=seed)
 31 | x = tfu.ffn_layer('layer3', x, hidden_units, activation=tf.nn.relu, seed=seed)
 32 | y = tfu.ffn_layer('output_layer', x, hidden_units=num_classes, activation=tf.nn.softmax, seed=seed)
 33 | y_ = tf.argmax(y, axis=1)
 34 | # with tf.name_scope('cross_entropy'):
 35 | #   # The raw formulation of cross-entropy,
 36 | #   #
 37 | #   # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)),
 38 | #   #                               reduction_indices=[1]))
 39 | #   #
 40 | #   # can be numerically unstable.
 41 | #   #
 42 | #   # So here we use tf.losses.sparse_softmax_cross_entropy on the
 43 | #   # raw logit outputs of the nn_layer above.
 44 | #   with tf.name_scope('total'):
 45 | #     cross_entropy = tf.losses.softmax_cross_entropy(onehot_labels=y_pl, logits=y)
 46 | 
 47 | 
 48 | with tf.variable_scope('loss'):
 49 |     # computing cross entropy per sample
 50 |     cross_entropy = -tf.reduce_sum(y_pl * tf.log(y + 1e-8), reduction_indices=[1])
 51 | 
 52 |     # averaging over samples
 53 |     cross_entropy = tf.reduce_mean(cross_entropy)
 54 | 
 55 | tf.summary.scalar('cross_entropy', cross_entropy)
 56 | 
 57 | with tf.variable_scope('training'):
 58 |     # defining our optimizer
 59 |     optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
 60 | 
 61 |     # applying the gradients
 62 |     train_op = optimizer.minimize(cross_entropy)
 63 | 
 64 | with tf.variable_scope('performance'):
 65 |     # making a one-hot encoded vector of correct (1) and incorrect (0) predictions
 66 |     correct_prediction = tf.equal(tf.argmax(y, axis=1), tf.argmax(y_pl, axis=1))
 67 | 
 68 |     # averaging the one-hot encoded vector
 69 |     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
 70 | 
 71 | tf.summary.scalar('accuracy', accuracy)
 72 | 
 73 | # Merge all the summaries and write them out
 74 | merged = tf.summary.merge_all()
 75 | train_writer = tf.summary.FileWriter(summaries_dir + '/train')
 76 | val_writer = tf.summary.FileWriter(summaries_dir + '/validation')
 77 | test_writer = tf.summary.FileWriter(summaries_dir + '/test')
 78 | 
 79 | # Training Loop
 80 | batch_size = 100
 81 | max_epochs = 200
 82 | 
 83 | valid_loss, valid_accuracy = [], []
 84 | train_loss, train_accuracy = [], []
 85 | test_loss, test_accuracy = [], []
 86 | 
 87 | with tf.Session() as sess:
 88 |     early_stop.on_train_begin()
 89 |     train_writer.add_graph(sess.graph)
 90 |     sess.run(tf.global_variables_initializer())
 91 |     total_parameters = 0
 92 |     print("Calculating trainable parameters!")
 93 |     for variable in tf.trainable_variables():
 94 |         # shape is an array of tf.Dimension
 95 |         shape = variable.get_shape()
 96 |         # print("Shape: {}".format(shape))
 97 |         variable_parameters = 1
 98 |         for dim in shape:
 99 |             variable_parameters *= dim.value
100 |         print("Shape {0} gives {1} trainable parameters".format(shape, variable_parameters))
101 |         total_parameters += variable_parameters
102 |     print("Trainable parameters: {}".format(total_parameters))
103 |     print('Begin training loop')
104 |     saver = tf.train.Saver()
105 |     try:
106 |         while data.train.epochs_completed < max_epochs:
107 |             _train_loss, _train_accuracy = [], []
108 | 
109 |             ## Run train op
110 |             x_batch, y_batch = data.train.next_batch(batch_size)
111 |             fetches_train = [train_op, cross_entropy, accuracy, merged]
112 |             feed_dict_train = {x_pl: x_batch, y_pl: y_batch}
113 |             _, _loss, _acc, _summary = sess.run(fetches_train, feed_dict_train)
114 | 
115 |             _train_loss.append(_loss)
116 |             _train_accuracy.append(_acc)
117 |             ## Compute validation loss and accuracy
118 |             if data.train.epochs_completed % 1 == 0 \
119 |                     and data.train._index_in_epoch <= batch_size:
120 | 
121 |                 train_writer.add_summary(_summary, data.train.epochs_completed)
122 | 
123 |                 train_loss.append(np.mean(_train_loss))
124 |                 train_accuracy.append(np.mean(_train_accuracy))
125 | 
126 |                 fetches_valid = [cross_entropy, accuracy, merged]
127 | 
128 |                 feed_dict_valid = {x_pl: data.validation.payloads, y_pl: data.validation.labels}
129 |                 _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_valid)
130 | 
131 |                 valid_loss.append(_loss)
132 |                 valid_accuracy.append(_acc)
133 |                 val_writer.add_summary(_summary, data.train.epochs_completed)
134 |                 current = valid_loss[-1]
135 |                 early_stop.on_epoch_end(data.train.epochs_completed, current)
136 |                 print("Epoch {} : Train Loss {:6.3f}, Train acc {:6.3f},  Valid loss {:6.3f},  Valid acc {:6.3f}"
137 |                         .format(data.train.epochs_completed, train_loss[-1], train_accuracy[-1], valid_loss[-1],
138 |                         valid_accuracy[-1]))
139 |                 if early_stop.stop_training:
140 |                     early_stop.on_train_end()
141 |                     break
142 | 
143 | 
144 |         test_epoch = data.test.epochs_completed
145 |         while data.test.epochs_completed == test_epoch:
146 |             batch_size = 1000
147 |             x_batch, y_batch = data.test.next_batch(batch_size)
148 |             feed_dict_test = {x_pl: x_batch, y_pl: y_batch}
149 |             _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_test)
150 |             y_preds = sess.run(fetches=y, feed_dict=feed_dict_test)
151 |             y_preds = tf.argmax(y_preds, axis=1).eval()
152 |             y_true = tf.argmax(y_batch, axis=1).eval()
153 |             # cm.batch_add(y_true, y_preds)
154 |             test_loss.append(_loss)
155 |             test_accuracy.append(_acc)
156 |             # test_writer.add_summary(_summary, data.train.epochs_completed)
157 |         print('Test Loss {:6.3f}, Test acc {:6.3f}'.format(
158 |             np.mean(test_loss), np.mean(test_accuracy)))
159 |         saver.save(sess, save_dir+'payload_{0}_{1}_units.ckpt'.format(pay_len, hidden_units))
160 |         feed_dict_test = {x_pl: data.test.payloads, y_pl: data.test.labels}
161 |         y_preds = sess.run(fetches=y_, feed_dict=feed_dict_test)
162 |         y_true = tf.argmax(data.test.labels, axis=1).eval()
163 |         y_true = [labels[i] for i in y_true]
164 |         y_preds = [labels[i] for i in y_preds]
165 |         conf = metrics.confusion_matrix(y_true, y_preds, labels=labels)
166 | 
167 |     except KeyboardInterrupt:
168 |         pass
169 | # print(namestr)
170 | utils.plot_confusion_matrix(conf, labels, save=True, title='payload_{0}_{1}_units_acc{2}'.format(pay_len, hidden_units, np.mean(test_accuracy)))
171 | # acc_list = list(map(float, acc_list))
172 | # print(acc_list, train_size)
173 | #
174 | # with tf.Session() as sess:
175 | #     sess.run(tf.global_variables_initializer())
176 | #     print('Begin training loop')
177 | #
178 | #     try:
179 | #         while data.train.epochs_completed < max_epochs:
180 | #             _train_loss, _train_accuracy = [], []
181 | #
182 | #             ## Run train op
183 | #             x_batch, y_batch = data.train.next_batch(batch_size)
184 | #             fetches_train = [train_op, cross_entropy, accuracy, merged]
185 | #             feed_dict_train = {x_pl: x_batch, y_pl: y_batch}
186 | #             _, _loss, _acc, _summary = sess.run(fetches_train, feed_dict_train)
187 | #
188 | #             _train_loss.append(_loss)
189 | #             _train_accuracy.append(_acc)
190 | #             ## Compute validation loss and accuracy
191 | #             if data.train.epochs_completed % 1 == 0 \
192 | #                     and data.train._index_in_epoch <= batch_size:
193 | #
194 | #                 train_writer.add_summary(_summary, data.train.epochs_completed)
195 | #
196 | #                 train_loss.append(np.mean(_train_loss))
197 | #                 train_accuracy.append(np.mean(_train_accuracy))
198 | #
199 | #                 fetches_valid = [cross_entropy, accuracy, merged]
200 | #
201 | #                 feed_dict_valid = {x_pl: data.validation.payloads, y_pl: data.validation.labels}
202 | #                 _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_valid)
203 | #
204 | #                 valid_loss.append(_loss)
205 | #                 valid_accuracy.append(_acc)
206 | #                 val_writer.add_summary(_summary, data.train.epochs_completed)
207 | #                 print(
208 | #                     "Epoch {} : Train Loss {:6.3f}, Train acc {:6.3f},  Valid loss {:6.3f},  Valid acc {:6.3f}".format(
209 | #                         data.train.epochs_completed, train_loss[-1], train_accuracy[-1], valid_loss[-1],
210 | #                         valid_accuracy[-1]))
211 | #         test_epoch = data.test.epochs_completed
212 | #         while data.test.epochs_completed == test_epoch:
213 | #             x_batch, y_batch = data.test.next_batch(batch_size)
214 | #             feed_dict_test = {x_pl: x_batch, y_pl: y_batch}
215 | #             _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_test)
216 | #             y_preds = sess.run(fetches=y, feed_dict=feed_dict_test)
217 | #             y_preds = tf.argmax(y_preds, axis=1).eval()
218 | #             y_true = tf.argmax(y_batch, axis=1).eval()
219 | #             cm.batch_add(y_true, y_preds)
220 | #             test_loss.append(_loss)
221 | #             test_accuracy.append(_acc)
222 | #             # test_writer.add_summary(_summary, data.train.epochs_completed)
223 | #         print('Test Loss {:6.3f}, Test acc {:6.3f}'.format(
224 | #             np.mean(test_loss), np.mean(test_accuracy)))
225 | #
226 | #
227 | #     except KeyboardInterrupt:
228 | #         pass
229 | 
230 | 
231 | # print(cm)
232 | #
233 | # # df = utils.load_h5(dir + filename+'.h5', key=filename.split('-')[0])
234 | # # print("Load done!")
235 | # # print(df.shape)
236 | # # payloads = df['payload'].values
237 | # # payloads = utils.pad_elements_with_zero(payloads)
238 | # # df['payload'] = payloads
239 | # #
240 | # # # Converting hex string to list of int... Maybe takes to long?
241 | # # payloads = [[int(i, 16) for i in list(x)] for x in payloads]
242 | # # np_payloads = numpy.array(payloads)
243 | # # # dataset = DataSet(np_payloads, df['label'].values)
244 | # # x, y = dataset.next_batch(10)
245 | # # batch_size = 100
246 | # # features = {'payload': payloads}
247 | # #
248 | # #
249 | # # gb = df.groupby(['ip.dst', 'ip.src', 'port.dst', 'port.src'])
250 | # #
251 | # #
252 | # # l = dict(list(gb))
253 | # #
254 | # #
255 | # # s = [[k, len(v)] for k, v in sorted(l.items(), key=lambda x: len(x[1]), reverse=True)]
256 | #
257 | #
258 | # print("DONE")
259 | 


--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
  1 | import multiprocessing
  2 | from sklearn import metrics
  3 | from scapy.all import *
  4 | import matplotlib.pyplot as plt
  5 | import os
  6 | import numpy as np
  7 | import time
  8 | import pandas as pd
  9 | import glob
 10 | from scipy import interp
 11 | 
 12 | 
 13 | def filter_pcap_by_ip(dir, filename, ip_list, label):
 14 |     '''
 15 |     This method can be used  to extract certain packets (associted with specified ip adresses) from a pcap file.
 16 |     The method expects user knowledge of the communication protocol used by the specified ip adresses for the label to be correct.
 17 | 
 18 |     The label is intended to be either http or https.
 19 |     :param dir: The directory in which the pcap file is located
 20 |     :param filename: The name of the pcap file that should be loaded
 21 |     :param ip_list: A list of ip adresses of interest. Packets with either ip.src or ip.dst in ip_list will be extracted
 22 |     :param label: The label that should be applied to the extracted packets. Note that the ip_list should contain
 23 |                   adresses that we know communicate over http or https in order to match with the label
 24 |     :return: A dataframe containing information about the extracted packets from the pcap file.
 25 |     '''
 26 |     time_s = time.clock()
 27 |     count = 0
 28 |     print("Read PCAP, label is %s" % label)
 29 |     data = rdpcap(dir + filename + '.pcap')
 30 |     totalPackets = len(data)
 31 |     percentage = int(totalPackets / 100)
 32 |     # Workaround/speedup for pandas append to dataframe
 33 |     frametimes = []
 34 |     dsts = []
 35 |     srcs = []
 36 |     protocols = []
 37 |     dports = []
 38 |     sports = []
 39 |     bytes = []
 40 |     labels = []
 41 | 
 42 |     print("Total packages: %d" % totalPackets)
 43 |     time_r = time.clock()
 44 |     time_read = time_r - time_s
 45 |     print("Time to read PCAP: " + str(time_read))
 46 |     for packet in data:
 47 |         if IP in packet and \
 48 |                 (UDP in packet or TCP in packet) and \
 49 |                 (packet[IP].dst in ip_list or packet[IP].src in ip_list):
 50 |             ip_layer = packet[IP]
 51 |             transport_layer = ip_layer.payload
 52 |             frametimes.append(packet.time)
 53 |             dsts.append(ip_layer.dst)
 54 |             srcs.append(ip_layer.src)
 55 |             protocols.append(transport_layer.name)
 56 |             dports.append(transport_layer.dport)
 57 |             sports.append(transport_layer.sport)
 58 |             # Save the raw byte string
 59 |             raw_payload = raw(packet)
 60 |             bytes.append(raw_payload)
 61 |             labels.append(label)
 62 |             if (count % (percentage * 5) == 0):
 63 |                 print(str(count / percentage) + '%')
 64 |             count += 1
 65 |     time_t = time.clock()
 66 |     print("Time spend: %ds" % (time_t - time_r))
 67 |     d = {'time': frametimes,
 68 |          'ip.dst': dsts,
 69 |          'ip.src': srcs,
 70 |          'protocol': protocols,
 71 |          'port.dst': dports,
 72 |          'port.src': sports,
 73 |          'bytes': bytes,
 74 |          'label': labels}
 75 |     df = pd.DataFrame(data=d)
 76 |     time_e = time.clock()
 77 |     total_time = time_e - time_s
 78 |     print("Time to convert PCAP to dataframe: " + str(total_time))
 79 |     return df
 80 | 
 81 | 
 82 | def load_h5(dir, filename):
 83 |     timeS = time.clock()
 84 |     df = pd.read_hdf(dir + "/" + filename, key=filename.split('-')[0])
 85 |     timeE = time.clock()
 86 |     loadTime = timeE - timeS
 87 |     print("Time to load " + filename + ": " + str(loadTime))
 88 |     return df
 89 | 
 90 | 
 91 | def plotHex(hexvalues, filename):
 92 |     '''
 93 |         Plot an example as an image
 94 |         hexvalues: list of byte values
 95 |         average: allows for providing more than one list of hexvalues and create an average over all
 96 |     '''
 97 | 
 98 |     size = 39
 99 |     hex_placeholder = [0] * (size * size)  # create placeholder of correct size
100 | 
101 |     if (type(hexvalues[0]) is np.ndarray):
102 |         print("Multiple payloads")
103 |         for hex_list in hexvalues:
104 |             hex_placeholder[0:len(hex_list)] += hex_list  # overwrite zero values with values of
105 |         hex_placeholder = np.array(hex_placeholder) / len(hexvalues)  # average the elements of the placeholder
106 |     else:
107 |         print("Single payload")
108 |         hex_placeholder[0:len(hexvalues)] = hexvalues  # overwrite zero values with values of
109 | 
110 |     canvas = np.reshape(np.array(hex_placeholder), (size, size))
111 |     plt.figure(figsize=(4, 4))
112 |     plt.axis('off')
113 |     plt.imshow(canvas, cmap='gray')
114 |     plt.title(filename)
115 |     plt.show()
116 |     return canvas
117 | 
118 | 
119 | def pad_string_elements_with_zero(payloads):
120 |     # Assume max payload to be 1460 bytes but as each byte is now 2 hex digits we take double length
121 |     max_payload_len = 1460 * 2
122 |     # Pad with '0'
123 |     payloads = [s.ljust(max_payload_len, '0') for s in payloads]
124 |     return payloads
125 | 
126 | 
127 | def pad_arrays_with_zero(payloads, payload_length=810):
128 |     tmp_payloads = []
129 |     for x in payloads:
130 |         payload = np.zeros(payload_length, dtype=np.uint8)
131 |         # pl = np.fromstring(x, dtype=np.uint8)
132 |         payload[:x.shape[0]] = x
133 |         tmp_payloads.append(payload)
134 | 
135 |     # payloads = [np.fromstring(x) for x in payloads]
136 |     return np.array(tmp_payloads)
137 | 
138 | 
139 | def hash_elements(payloads):
140 |     return payloads
141 | 
142 | 
143 | def packetanonymizer(packet):
144 |     """"
145 |     Takes a packet as a bytestring in hex format and convert to unsigned 8bit integers [0-255]
146 |     Sets the header fields which contain MAC, IP and Port information to 0
147 |     """
148 |     # Should work with TCP and UDP
149 | 
150 |     p = np.fromstring(packet, dtype=np.uint8)
151 |     # set MACs to 0
152 |     p[0:12] = 0
153 |     # Remove IP checksum
154 |     p[24:26] = 0
155 |     # set IPs to 0
156 |     p[26:34] = 0
157 |     # set ports to 0
158 |     p[34:36] = 0
159 |     p[36:38] = 0
160 | 
161 |     # IP protocol field check if TCP
162 |     if p[23] == 6:
163 |         #Remove TCP checksum
164 |         p[50:52] = 0
165 |     else:
166 |         # Remove UDP checksum
167 |         p[40:42] = 0
168 |     return p
169 | 
170 | 
171 | def extractdatapoints(dataframe, filename, num_headers=15, session=True):
172 |     """"
173 |     Extracts the concatenated header datapoints from a dataframe while anonomizing the individual header
174 |     :returns a dataframe with datapoints (bytes) and labels
175 |     """
176 |     group_by = dataframe.sort_values(['time']).groupby(['ip.dst', 'ip.src', 'port.dst', 'port.src'])
177 |     gb_dict = dict(list(group_by))
178 |     data_points = []
179 |     labels = []
180 |     filenames = []
181 |     sessions = []
182 |     done = set()
183 |     num_too_short = 0
184 |     # Iterate over sessions
185 |     for k, v in gb_dict.items():
186 |         # v is a DataFrame
187 |         # k is a tuple (src, dst, sport, dport)
188 |         if k in done:
189 |             continue
190 |         done.add(k)
191 |         if session:
192 |             other_direction_key = (k[1], k[0], k[3], k[2])
193 |             other_direction = gb_dict[other_direction_key]
194 |             v = pd.concat([v, other_direction]).sort_values(['time'])
195 |             done.add(other_direction_key)
196 |         if len(v) < num_headers:
197 |             num_too_short += 1
198 |             continue
199 |         # extract num_headers of packets (one row each)
200 |         packets = v['bytes'].values[:num_headers]
201 |         headers = []
202 |         label = v['label'].iloc[0]
203 |         protocol = v['protocol'].iloc[0]
204 | 
205 |         # Skip session if UDP and not youtube
206 |         if protocol == 'UDP' and label != 'youtube':
207 |             continue
208 |         # For each packet
209 |         packetindex = 0
210 |         headeradded = 0
211 | 
212 |         while headeradded < num_headers:
213 |             if packetindex < len(packets):
214 |                 p = packets[packetindex]
215 |                 packetindex += 1
216 |             else:
217 |                 break
218 |             p_an = packetanonymizer(p)
219 | 
220 |             # assuming a session utilize the same protocol throughout
221 |             # Extract headers (TCP = 54 Bytes, UDP = 42 Bytes - Maybe + 4 Bytes for VLAN tagging) from x first packets of session/flow
222 |             header = np.zeros(54, dtype=np.uint8)
223 |             if protocol == 'TCP':
224 |                 # TCP
225 |                 header[:54] = p_an[:54]
226 |             else:
227 |                 # UDP
228 |                 header[:42] = p_an[:42]  # pad zeros
229 | 
230 |             # Skip if header packet is fragmented
231 |             if (0 < header[20] < 64) or header[21] != 0:
232 |                 continue
233 |             headers.append(header)
234 |             headeradded += 1
235 | 
236 |         # Concatenate headers as the feature vector
237 |         if len(headers) == num_headers:
238 |             feature_vector = np.concatenate(headers).ravel()
239 |             data_points.append(feature_vector)
240 |             labels.append(label)
241 |             filenames.append(filename)
242 |             sessions.append(k)
243 |     d = {'filename': filenames, 'session': sessions, 'bytes': data_points, 'label': labels}
244 |     return pd.DataFrame(data=d)
245 | 
246 | 
247 | def saveheaderstask(filelist, num_headers, session, dataframes):
248 |     datapointslist = []
249 |     for fullname in filelist:
250 |         load_dir, filename = os.path.split(fullname)
251 |         print("Loading: {0}".format(filename))
252 |         df = load_h5(load_dir, filename)
253 |         datapoints = extractdatapoints(df, filename, num_headers, session)
254 |         datapointslist.append(datapoints)
255 | 
256 |     # Extend the shared dataframe
257 |     dataframes.extend(datapointslist)
258 | 
259 | 
260 | def saveextractedheaders(load_dir, save_dir, savename, num_headers=15, session=True):
261 |     """"
262 |     Extracts datapoints from all .h5 files in train_dir and saves the them in a new .h5 file
263 |     :param load_dir: The directory to load from
264 |     :param save_dir: The directory to save the extracted headers
265 |     :param savename: The filename to save
266 |     :param num_headers: The amount of headers to use as datapoint
267 |     :param session: session or flow
268 |     """
269 |     manager = multiprocessing.Manager()
270 |     dataframes = manager.list()
271 |     filelist = glob.glob(load_dir + '*.h5')
272 |     filesplits = split_list(filelist, 4)
273 | 
274 |     threads = []
275 |     for split in filesplits:
276 |         # create a thread for each
277 |         t = multiprocessing.Process(target=saveheaderstask, args=(split, num_headers, session, dataframes))
278 |         threads.append(t)
279 |         t.start()
280 |     # create one large dataframe
281 | 
282 |     for t in threads:
283 |         t.join()
284 |         print("Process joined: ", t)
285 |     data = pd.concat(dataframes)
286 |     key = savename.split('-')[0]
287 |     if not os.path.exists(save_dir):
288 |         os.makedirs(save_dir)
289 |     data.to_hdf(save_dir + savename + '.h5', key=key, mode='w')
290 | 
291 | 
292 | # saveextractedheaders('./', 'extracted-0103_1136')
293 | # read_pcap('../Data/', 'drtv-2302_1031')
294 | def split_list(list, chunks):
295 |     '''
296 |     Takes a list an splits it to equal sized chunks.
297 |     :param list: list to split
298 |     :param chunks: number of chunks (int)
299 |     :return: a list containing chunks (lists) as elements
300 |     '''
301 |     avg = len(list) / float(chunks)
302 |     out = []
303 |     last = 0.0
304 | 
305 |     while last < len(list):
306 |         out.append(list[int(last):int(last + avg)])
307 |         last += avg
308 | 
309 |     return out
310 | 
311 | 
312 | def plot_confusion_matrix(cm, classes,
313 |                           normalize=False,
314 |                           title='Confusion matrix',
315 |                           cmap=plt.cm.get_cmap(name='Blues'), save=False):
316 |     """
317 |     This function prints and plots the confusion matrix.
318 |     Normalization can be applied by setting `normalize=True`.
319 |     """
320 |     import itertools
321 |     from matplotlib import rcParams
322 |     # Make room for xlabel which is otherwise cut off
323 |     rcParams.update({'figure.autolayout': True})
324 | 
325 |     if normalize:
326 |         cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
327 |         print("Normalized confusion matrix")
328 |     else:
329 |         print('Confusion matrix, without normalization')
330 | 
331 |     print(cm)
332 |     plt.figure()
333 |     plt.imshow(cm, interpolation='nearest', cmap=cmap)
334 |     plt.title("Accuracy: {0}".format(title.split("acc")[1]))
335 |     plt.colorbar()
336 |     tick_marks = np.arange(len(classes))
337 |     plt.xticks(tick_marks, classes, rotation='vertical')
338 |     plt.yticks(tick_marks, classes)
339 | 
340 |     fmt = '.2f' if normalize else 'd'
341 |     thresh = cm.max() / 1.5
342 |     # thresh = 2260
343 |     for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
344 |         plt.text(j, i, format(cm[i, j], fmt),
345 |                  horizontalalignment="center",
346 |                  color="white" if cm[i, j] > thresh else "black")
347 | 
348 |     plt.tight_layout()
349 |     plt.ylabel('True label')
350 |     plt.xlabel('Predicted label')
351 |     if save:
352 |         i = 0
353 |         filename = "{}".format(title)
354 |         while os.path.exists('{}{:d}.png'.format(filename, i)):
355 |             i += 1
356 |         plt.savefig('{}{:d}.png'.format(filename, i), dpi=300)
357 |     plt.draw()
358 |     # plt.gcf().clear()
359 | 
360 | 
361 | def plot_metric_graph(x_list, y_list, x_label="Datapoints", y_label="Accuracy",
362 |                           title='Metric list', save=False):
363 |     from matplotlib import rcParams
364 |     # Make room for xlabel which is otherwise cut off
365 |     rcParams.update({'figure.autolayout': True})
366 |     plt.figure()
367 |     plt.plot(x_list, y_list, label="90/10 split")
368 |     # plt.plot(x_list2, y_list2, label="Seperate testset")
369 |     # Calculate min and max of y scale
370 |     ymin = np.min(y_list)
371 |     ymin = np.floor(ymin * 10) / 10
372 |     ymax = np.max(y_list)
373 |     ymax = np.ceil(ymax * 10) / 10
374 |     plt.ylim(ymin, ymax)
375 |     plt.title("{0}".format(title))
376 |     plt.tight_layout()
377 |     plt.ylabel(y_label)
378 |     plt.xlabel(x_label)
379 |     # plt.legend()
380 |     if save:
381 |         i = 0
382 |         filename = "{}".format(title)
383 |         while os.path.exists('{}{:d}.png'.format(filename, i)):
384 |             i += 1
385 |         plt.savefig('{}{:d}.png'.format(filename, i), dpi=300)
386 |     plt.draw()
387 |     # plt.gcf().clear()
388 | 
389 | def plot_class_ROC(fpr, tpr, roc_auc, class_idx, labels):
390 |     from matplotlib import rcParams
391 |     # Make room for xlabel which is otherwise cut off
392 |     rcParams.update({'figure.autolayout': True})
393 |     plt.figure()
394 |     lw = 2
395 |     plt.plot(fpr[class_idx], tpr[class_idx], color='darkorange',
396 |              lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[class_idx])
397 |     plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
398 |     plt.xlim([0.0, 1.0])
399 |     plt.ylim([0.0, 1.05])
400 |     plt.xlabel('False Positive Rate')
401 |     plt.ylabel('True Positive Rate')
402 |     plt.title('Receiver operating characteristic of class {}'.format(labels[class_idx]))
403 |     plt.legend(loc="lower right")
404 |     plt.tight_layout()
405 | 
406 | def plot_multi_ROC(fpr, tpr, roc_auc, num_classes, labels, micro=True, macro=True):
407 |     from matplotlib import rcParams
408 |     # Make room for xlabel which is otherwise cut off
409 |     rcParams.update({'figure.autolayout': True})
410 |     # Plot all ROC curves
411 |     plt.figure()
412 |     lw = 2
413 |     if micro:
414 |         plt.plot(fpr["micro"], tpr["micro"],
415 |              label='micro-average ROC curve (area = {0:0.2f})'
416 |                    ''.format(roc_auc["micro"]),
417 |              color='deeppink', linestyle=':', linewidth=4)
418 |     if macro:
419 |         plt.plot(fpr["macro"], tpr["macro"],
420 |              label='macro-average ROC curve (area = {0:0.2f})'
421 |                    ''.format(roc_auc["macro"]),
422 |              color='navy', linestyle=':', linewidth=4)
423 |     color_map = {0: '#487fff', 1: '#2ee3ff', 2: '#4eff4e', 3: '#ffca43', 4: '#ff365e', 5: '#d342ff', 6: '#626663'}
424 |     # colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
425 |     for i in range(num_classes):
426 |         plt.plot(fpr[i], tpr[i], color=color_map[i], lw=lw,
427 |                  label='ROC curve of {0} (area = {1:0.2f})'
428 |                        ''.format(labels[i], roc_auc[i]))
429 | 
430 |     plt.plot([0, 1], [0, 1], 'k--', lw=lw)
431 |     plt.xlim([0.0, 1.0])
432 |     plt.ylim([0.0, 1.05])
433 |     plt.xlabel('False Positive Rate')
434 |     plt.ylabel('True Positive Rate')
435 |     plt.title('Receiver operating characteristic of all classes')
436 |     plt.legend(loc="lower right")
437 |     plt.tight_layout()
438 | 
439 | 
440 | def plot_ROC(y_true, y_preds, num_classes, labels, micro=True, macro=True):
441 |     # Compute ROC curve and ROC area for each class
442 |     fpr = dict()
443 |     tpr = dict()
444 |     roc_auc = dict()
445 |     for i in range(num_classes):
446 |         fpr[i], tpr[i], _ = metrics.roc_curve(y_true[:, i], y_preds[:, i])
447 |         roc_auc[i] = metrics.auc(fpr[i], tpr[i])
448 |     if micro:
449 |         # Compute micro-average ROC curve and ROC area
450 |         fpr["micro"], tpr["micro"], _ = metrics.roc_curve(y_true.ravel(), y_preds.ravel())
451 |         roc_auc["micro"] = metrics.auc(fpr["micro"], tpr["micro"])
452 |     if macro:
453 |         # Compute macro-average ROC curve and ROC area
454 |         # First aggregate all false positive rates
455 |         all_fpr = np.unique(np.concatenate([fpr[i] for i in range(num_classes)]))
456 |         # Then interpolate all ROC curves at this points
457 |         mean_tpr = np.zeros_like(all_fpr)
458 |         for i in range(num_classes):
459 |             mean_tpr += interp(all_fpr, fpr[i], tpr[i])
460 |         # Finally average it and compute AUC
461 |         mean_tpr /= num_classes
462 |         fpr["macro"] = all_fpr
463 |         tpr["macro"] = mean_tpr
464 |         roc_auc["macro"] = metrics.auc(fpr["macro"], tpr["macro"])
465 |     for i in range(num_classes):
466 |         plot_class_ROC(fpr, tpr, roc_auc, i, labels)
467 |     plot_multi_ROC(fpr, tpr, roc_auc, num_classes, labels, micro, macro)
468 | 
469 | 
470 | def show_plot():
471 |     plt.show()
472 | # y_list_test, x_list_test = [0.265, 0.572, 0.636, 0.704, 0.746, 0.763, 0.769, 0.784, 0.789, 0.818, 0.82, 0.817, 0.831, 0.845, 0.846, 0.848, 0.859, 0.887, 0.876], [54, 268, 536, 1072, 1608, 2144, 2680, 3216, 3752, 4288, 4824, 5360, 8039, 13398, 18758, 24117, 29476, 34835, 40194]
473 | # y_list_merge, x_list_merge = [0.269, 0.75, 0.79, 0.818, 0.833, 0.844, 0.856, 0.866, 0.87, 0.868, 0.883, 0.881, 0.902, 0.925, 0.923, 0.941, 0.946, 0.943], [59, 291, 582, 1163, 1744, 2325, 2906, 3487, 4068, 4649, 5230, 5811, 8717, 14528, 20339, 26150, 31961, 37772]
474 | # plot_metric_graph(x_list_merge, y_list_merge, x_list_test, y_list_test, title="#Training Datapoints vs. Accuracy", save=True)


--------------------------------------------------------------------------------
/visualization/classes_module.py:
--------------------------------------------------------------------------------
  1 | import copy
  2 | import numpy
  3 | from visualization import vis_utils
  4 | 
  5 | # -------------------------
  6 | # Feed-forward network
  7 | # -------------------------
  8 | class Network:
  9 | 
 10 |     def __init__(self, layers):
 11 |         self.layers = layers
 12 | 
 13 |     def forward(self, Z):
 14 |         for l in self.layers:
 15 |             Z = l.forward(Z)
 16 |         return Z
 17 | 
 18 |     def gradprop(self, DZ):
 19 |         for l in self.layers[::-1]:
 20 |             DZ = l.gradprop(DZ)
 21 |         return DZ
 22 | 
 23 |     def relprop(self, R):
 24 |         for l in self.layers[::-1]:
 25 |             R = l.relprop(R)
 26 |         return R
 27 | 
 28 | 
 29 | 
 30 | # -------------------------
 31 | # ReLU activation layer
 32 | # -------------------------
 33 | class ReLU:
 34 | 
 35 |     def forward(self, X):
 36 |         self.Z = X > 0
 37 |         return X*self.Z
 38 | 
 39 |     def gradprop(self, DY):
 40 |         return DY*self.Z
 41 | 
 42 |     def relprop(self, R):
 43 |         return R
 44 | 
 45 | # -------------------------
 46 | # Fully-connected layer
 47 | # -------------------------
 48 | class Linear:
 49 | 
 50 |     def __init__(self, W, b):
 51 |         self.W = W
 52 |         self.B = b
 53 | 
 54 |     def forward(self, X):
 55 |         self.X = X
 56 |         return numpy.dot(self.X, self.W)+self.B
 57 | 
 58 |     def gradprop(self, DY):
 59 |         self.DY = DY
 60 |         return numpy.dot(self.DY, self.W.T)
 61 | 
 62 |     def relprop(self, R):
 63 |         V = numpy.maximum(0, self.W)
 64 |         Z = numpy.dot(self.X, V) + 1e-9
 65 |         S = R / Z
 66 |         C = numpy.dot(S, V.T)
 67 |         R = self.X * C
 68 |         return R
 69 | 
 70 | 
 71 | class AlphaBetaLinear:
 72 | 
 73 |     def __init__(self, W, b, alpha):
 74 |         self.W = W
 75 |         self.B = b
 76 |         self.alpha = alpha
 77 |         self.beta = alpha -1
 78 | 
 79 |     def forward(self, X):
 80 |         self.X = X
 81 |         return numpy.dot(self.X, self.W)+self.B
 82 | 
 83 |     def gradprop(self, DY):
 84 |         self.DY = DY
 85 |         return numpy.dot(self.DY, self.W.T)
 86 | 
 87 |     def relprop(self, R):
 88 |         pself = copy.deepcopy(self)
 89 |         nself = copy.deepcopy(self)
 90 |         pself.B *= 0
 91 |         pself.W = numpy.maximum( 1e-9, pself.W)
 92 |         nself.B *= 0
 93 |         nself.W = numpy.minimum(-1e-9, pself.W)
 94 | 
 95 |         X = self.X + 1e-9
 96 |         ZA = pself.forward(X)
 97 |         SA = self.alpha*R/ZA
 98 |         ZB = nself.forward(X)
 99 |         SB = -self.beta *R/ZB
100 |         R = X * (pself.gradprop(SA) + nself.gradprop(SB))
101 |         return R
102 | 
103 | class FirstLinear(Linear):
104 | 
105 |     def __init__(self, W, b):
106 |         self.W = W
107 |         self.B = b
108 | 
109 |     def relprop(self, R):
110 |         W, V, U = self.W, numpy.maximum(0, self.W), numpy.minimum(0, self.W)
111 |         X, L, H = self.X, self.X * 0 + vis_utils.lowest, self.X * 0 + vis_utils.highest
112 | 
113 |         Z = numpy.dot(X, W) - numpy.dot(L, V) - numpy.dot(H, U) + 1e-9;
114 |         S = R / Z
115 |         R = X * numpy.dot(S, W.T) - L * numpy.dot(S, V.T) - H * numpy.dot(S, U.T)
116 |         return R
117 | 
118 | 
119 | # # -------------------------
120 | # # Sum-pooling layer
121 | # # -------------------------
122 | # class Pooling:
123 | #
124 | #     def forward(self, X):
125 | #         self.X = X
126 | #         self.Y = 0.5*(X[:, ::2, ::2, :]+X[:, ::2, 1::2, :]+X[:, 1::2, ::2, :]+X[:, 1::2, 1::2, :])
127 | #         return self.Y
128 | #
129 | #     def gradprop(self, DY):
130 | #         self.DY = DY
131 | #         DX = self.X*0
132 | #         for i, j in [(0, 0), (0, 1), (1, 0), (1, 1)]:
133 | #             DX[:, i::2, j::2, :] += DY*0.5
134 | #         return DX
135 | #
136 | # # -------------------------
137 | # # Convolution layer
138 | # # -------------------------
139 | # class Convolution:
140 | #
141 | #     def __init__(self,name):
142 | #         wshape = map(int,list(name.split("-")[-1].split("x")))
143 | #         self.W = numpy.loadtxt(name+'-W.txt').reshape(wshape)
144 | #         self.B = numpy.loadtxt(name+'-B.txt')
145 | #
146 | #     def forward(self,X):
147 | #
148 | #         self.X = X
149 | #         mb,wx,hx,nx = X.shape
150 | #         ww,hw,nx,ny = self.W.shape
151 | #         wy,hy       = wx-ww+1,hx-hw+1
152 | #
153 | #         Y = numpy.zeros([mb,wy,hy,ny],dtype='float32')
154 | #
155 | #         for i in range(ww):
156 | #             for j in range(hw):
157 | #                 Y += numpy.dot(X[:,i:i+wy,j:j+hy,:],self.W[i,j,:,:])
158 | #
159 | #         return Y+self.B
160 | #
161 | #     def gradprop(self,DY):
162 | #
163 | #         self.DY = DY
164 | #         mb,wy,hy,ny = DY.shape
165 | #         ww,hw,nx,ny = self.W.shape
166 | #
167 | #         DX = self.X*0
168 | #
169 | #         for i in range(ww):
170 | #             for j in range(hw):
171 | #                 DX[:,i:i+wy,j:j+hy,:] += numpy.dot(DY,self.W[i,j,:,:].T)
172 | #
173 | #         return DX
174 | 


--------------------------------------------------------------------------------
/visualization/heatmap.py:
--------------------------------------------------------------------------------
 1 | import pandas as pd
 2 | import numpy as np
 3 | from sklearn.preprocessing import LabelEncoder
 4 | from sklearn.cross_validation import train_test_split
 5 | import utils
 6 | import glob, os
 7 | import pca.dataanalyzer as da
 8 | 
 9 | 
10 | # visulaize the important characteristics of the dataset
11 | import matplotlib.pyplot as plt
12 | 
13 | data_len = 1460
14 | seed = 0
15 | # dirs = ["C:/Users/salik/Documents/Data/LinuxChrome/{}/".format(num_headers),
16 | #         "C:/Users/salik/Documents/Data/WindowsFirefox/{}/".format(num_headers),
17 | #         "C:/Users/salik/Documents/Data/WindowsChrome/{}/".format(num_headers),
18 | #         "C:/Users/salik/Documents/Data/WindowsSalik/{}/".format(num_headers),
19 | #         "C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
20 | dirs = ["E:/Data/h5/https/", "E:/Data/h5/netflix/"]
21 | 
22 | # step 1: get the data
23 | dataframes = []
24 | num_examples = 0
25 | for dir in dirs:
26 |     for fullname in glob.iglob(dir + '*.h5'):
27 |         filename = os.path.basename(fullname)
28 |         df = utils.load_h5(dir, filename)
29 |         dataframes.append(df)
30 |         num_examples = len(df.values)
31 |     # create one large dataframe
32 | data = pd.concat(dataframes)
33 | data.sample(frac=1, random_state=seed).reset_index(drop=True)
34 | num_rows = data.shape[0]
35 | columns = data.columns
36 | print(columns)
37 | 
38 | # step 2: get x and convert it to numpy array
39 | x = da.getbytes(data, data_len)
40 | 
41 | # step 3: get class labels y and then encode it into number
42 | # get class label data
43 | y = data['label'].values
44 | # encode the class label
45 | class_labels = np.unique(y)
46 | label_encoder = LabelEncoder()
47 | y = label_encoder.fit_transform(y)
48 | # step 4: split the data into training set and test set
49 | test_percentage = 0.5
50 | x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_percentage, random_state=seed)
51 | plot_savename = "histogram_payload"
52 | 
53 | from matplotlib import rcParams
54 | # Make room for xlabel which is otherwise cut off
55 | rcParams.update({'figure.autolayout': True})
56 | # Heatmap plot how plot the sample points among 5 classes
57 | for idx, cl in enumerate(np.unique(y_test)):
58 |     plt.figure()
59 |     print("Starting class: " + class_labels[cl] +" With len:" + str(len(x_test[y_test == cl])))
60 |     positioncounts = np.zeros(shape=(256, data_len))
61 |     for x in x_test[y_test == cl]:
62 |         for i, v in enumerate(x):
63 |             positioncounts[int(v), i] += 1
64 |     plt.imshow(positioncounts, cmap="YlGnBu", interpolation='nearest')
65 |     plt.title('Heatmap of : {}'.format(class_labels[cl]))
66 |     plt.colorbar()
67 |     plt.tight_layout()
68 | # plt.savefig('{0}{1}.png'.format(plot_savename, int(perplexity)), dpi=300)
69 | plt.show()


--------------------------------------------------------------------------------
/visualization/histogram.py:
--------------------------------------------------------------------------------
 1 | import pandas as pd
 2 | import numpy as np
 3 | from sklearn.preprocessing import LabelEncoder
 4 | from sklearn.preprocessing import StandardScaler
 5 | from sklearn.cross_validation import train_test_split
 6 | import utils
 7 | import glob, os
 8 | import pca.dataanalyzer as da, pca.pca as pca
 9 | from sklearn.metrics import accuracy_score
10 | 
11 | # visulaize the important characteristics of the dataset
12 | import matplotlib.pyplot as plt
13 | seed = 0
14 | num_headers = 16
15 | data_len = 54*num_headers #1460
16 | dirs = ["C:/Users/salik/Documents/Data/LinuxChrome/{}/".format(num_headers),
17 |         "C:/Users/salik/Documents/Data/WindowsFirefox/{}/".format(num_headers),
18 |         "C:/Users/salik/Documents/Data/WindowsChrome/{}/".format(num_headers),
19 |         "C:/Users/salik/Documents/Data/WindowsSalik/{}/".format(num_headers),
20 |         "C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
21 | # dirs = ["E:/Data/h5/https/", "E:/Data/h5/netflix/"]
22 | 
23 | # step 1: get the data
24 | dataframes = []
25 | num_examples = 0
26 | for dir in dirs:
27 |     for fullname in glob.iglob(dir + '*.h5'):
28 |         filename = os.path.basename(fullname)
29 |         df = utils.load_h5(dir, filename)
30 |         dataframes.append(df)
31 |         num_examples = len(df.values)
32 |     # create one large dataframe
33 | data = pd.concat(dataframes)
34 | data.sample(frac=1, random_state=seed).reset_index(drop=True)
35 | num_rows = data.shape[0]
36 | columns = data.columns
37 | print(columns)
38 | 
39 | # step 2: get features (x) and convert it to numpy array
40 | x = da.getbytes(data, data_len)
41 | 
42 | # step 3: get class labels y and then encode it into number
43 | # get class label data
44 | y = data['label'].values
45 | 
46 | # encode the class label
47 | class_labels = np.unique(y)
48 | label_encoder = LabelEncoder()
49 | y = label_encoder.fit_transform(y)
50 | 
51 | # step 4: split the data into training set and test set
52 | test_percentage = 0.5
53 | x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_percentage, random_state=seed)
54 | 
55 | plot_savename = "histogram_payload"
56 | 
57 | from matplotlib import rcParams
58 | # Make room for xlabel which is otherwise cut off
59 | rcParams.update({'figure.autolayout': True})
60 | 
61 | 
62 | 
63 | # scatter plot the sample points among 5 classes
64 | # markers = ('s', 'd', 'o', '^', 'v', ".", ",", "<", ">", "8", "p", "P", "*", "h", "H", "+", "x", "X", "D", "|", "_")
65 | color_map = {0: '#487fff', 1: '#d342ff', 2: '#4eff4e', 3: '#2ee3ff', 4: '#ffca43', 5:'#ff365e', 6:'#626663'}
66 | plt.figure()
67 | for idx, cl in enumerate(np.unique(y_test)):
68 |     # Get count of unique values
69 |     values, counts = np.unique(x_test[y_test == cl], return_counts=True)
70 |     # Maybe remove zero as there is a lot of zeros in the header
71 |     # values = values[1:]
72 |     # counts = counts[1:]
73 |     n, bins, patches = plt.hist(values, weights=counts, bins=256, facecolor=color_map[idx], label=class_labels[cl],  alpha=0.8)
74 | 
75 | plt.legend(loc='upper right')
76 | plt.title('Histogram of : {}'.format(class_labels))
77 | plt.tight_layout()
78 | # plt.savefig('{0}{1}.png'.format(plot_savename, int(perplexity)), dpi=300)
79 | plt.show()


--------------------------------------------------------------------------------
/visualization/pca_plots.py:
--------------------------------------------------------------------------------
 1 | import pca.dataanalyzer as da, pca.pca as pca
 2 | import glob
 3 | import os
 4 | import pandas as pd
 5 | import numpy as np
 6 | from sklearn.preprocessing import StandardScaler, LabelEncoder
 7 | import utils
 8 | 
 9 | 
10 | seed = 0
11 | num_headers = 16
12 | dirs =  ["E:/Data/LinuxChrome/{}/".format(num_headers),
13 |          "E:/Data/WindowsSalik/{}/".format(num_headers),
14 |          "E:/Data/WindowsAndreas/{}/".format(num_headers),
15 |          "E:/Data/WindowsFirefox/{}/".format(num_headers),
16 |          "E:/Data/WindowsChrome/{}/".format(num_headers),
17 |                       ]
18 | # dirs = ["C:/Users/salik/Documents/Data/h5/https/", "C:/Users/salik/Documents/Data/h5/netflix/"]
19 | # dirs = ["C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
20 | # step 1: get the data
21 | dataframes = []
22 | num_examples = 0
23 | for dir in dirs:
24 |     for fullname in glob.iglob(dir + '*.h5'):
25 |         filename = os.path.basename(fullname)
26 |         df = utils.load_h5(dir, filename)
27 |         dataframes.append(df)
28 |         num_examples = len(df.values)
29 |     # create one large dataframe
30 | data = pd.concat(dataframes)
31 | data = data.sample(frac=0.1, random_state=seed).reset_index(drop=True)
32 | num_rows = data.shape[0]
33 | columns = data.columns
34 | print(columns)
35 | 
36 | # step 3: get features (x) and scale the features
37 | # get x and convert it to numpy array
38 | x = da.getbytes(data, num_headers*54)
39 | standard_scaler = StandardScaler()
40 | # x_stds = []
41 | # ys = []
42 | # for data in dataframes:
43 | # x = da.getbytes(data, 1460)
44 | 
45 | # x = [[-0.5, -0.5],
46 | #      [-0.5, 0.5],
47 | #      [0.5, -0.5],
48 | #      [0.5, 0.5],
49 | #      [-0.4, -0.4],
50 | #      [-0.4, 0.4],
51 | #      [0.4, -0.4],
52 | #      [0.4, 0.4]
53 | #      ]
54 | x_std = standard_scaler.fit_transform(x)
55 | y = data['label'].values
56 | # y = ['drtv', 'netflix', 'youtube', 'twitch', 'twitch', 'youtube', 'netflix', 'drtv']
57 | # encode the class label
58 | class_labels = np.unique(y)
59 | label_encoder = LabelEncoder()
60 | y = label_encoder.fit_transform(y)
61 | 
62 | 
63 | p = pca.runpca(x_std, num_comp=25)
64 | z = pca.componentprojection(x_std, p)
65 | pca.plotprojection(z, 0, y, class_labels)
66 | pca.plotvarianceexp(p, 25)
67 | pca.showplots()
68 | # plot_savename = "PCA_header_all_cumulative"
69 | # plt.savefig('{0}.png'.format(plot_savename), dpi=300)


--------------------------------------------------------------------------------
/visualization/t-sne_compare.py:
--------------------------------------------------------------------------------
  1 | import pandas as pd
  2 | import numpy as np
  3 | from sklearn.preprocessing import LabelEncoder
  4 | from sklearn.preprocessing import StandardScaler
  5 | from sklearn.cross_validation import train_test_split
  6 | import utils
  7 | import glob, os
  8 | import pca.dataanalyzer as da, pca.pca as pca
  9 | from sklearn.metrics import accuracy_score
 10 | 
 11 | # visulaize the important characteristics of the dataset
 12 | import matplotlib.pyplot as plt
 13 | seed = 0
 14 | num_headers = 16
 15 | dirs = ["E:/Data/WindowsFirefox/{}/".format(num_headers),
 16 |         "E:/Data/WindowsChrome/{}/".format(num_headers)]
 17 | # dirs = ["E:/Data/h5/https/", "E:/Data/h5/netflix/"]
 18 | # dirs = ["C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
 19 | # step 1: get the data
 20 | dataframes = []
 21 | num_examples = 0
 22 | for dir in dirs:
 23 |     for fullname in glob.iglob(dir + '*.h5'):
 24 |         filename = os.path.basename(fullname)
 25 |         df = utils.load_h5(dir, filename)
 26 |         dataframes.append(df)
 27 |         num_examples = len(df.values)
 28 |     # create one large dataframe
 29 | data = pd.concat(dataframes)
 30 | num_rows = data.shape[0]
 31 | columns = data.columns
 32 | print(columns)
 33 | 
 34 | # step 2: get features (x) and  convert it to numpy array
 35 | standard_scaler = StandardScaler()
 36 | x_stds = []
 37 | ys = []
 38 | class_labels = []
 39 | for data in dataframes:
 40 |     x = da.getbytes(data, num_headers*54)
 41 | # x = da.getbytes(data, 1460)
 42 |     x_std = standard_scaler.fit_transform(x)
 43 |     x_stds.append(x_std)
 44 | # step 4: get class labels y and then encode it into number
 45 | # get class label data
 46 |     y = data['label'].values
 47 |     # encode the class label
 48 |     class_labels = np.unique(y)
 49 |     label_encoder = LabelEncoder()
 50 |     y = label_encoder.fit_transform(y)
 51 |     ys.append(y)
 52 | class_labels1 = ["Firefox " + x for x in class_labels]
 53 | class_labels2 = ["Chrome " + x for x in class_labels]
 54 | # step 5: split the data into training set and test set
 55 | test_percentage = 0.1
 56 | x_tests = []
 57 | y_tests = []
 58 | for i, x_std in enumerate(x_stds):
 59 |     x_train, x_test, y_train, y_test = train_test_split(x_std, ys[i], test_size=test_percentage, random_state=seed)
 60 |     x_tests.append(x_test)
 61 |     y_tests.append(y_test)
 62 | 
 63 | x_test = np.append(x_tests[0], x_tests[1], axis=0)
 64 | y_test = np.append(y_tests[0], y_tests[1], axis=0)
 65 | first_set_length = len(y_tests[0])
 66 | print(first_set_length)
 67 | 
 68 | # t-distributed Stochastic Neighbor Embedding (t-SNE) visualization
 69 | plot_savename = "t-sne_16headers_windows_linux_perplexity"
 70 | from sklearn.manifold import TSNE
 71 | perplexities = [30.0]
 72 | 
 73 | from matplotlib import rcParams
 74 | # Make room for xlabel which is otherwise cut off
 75 | rcParams.update({'figure.autolayout': True})
 76 | 
 77 | for perplexity in perplexities:
 78 |     print("Starting perplexity: {}".format(perplexity))
 79 |     tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=1000, random_state=seed, verbose=2)
 80 |     x_test_2d = tsne.fit_transform(x_test)
 81 |     x_test_2d_0 = x_test_2d[:first_set_length, :]
 82 |     x_test_2d_1 = x_test_2d[first_set_length:, :]
 83 | 
 84 |     # scatter plot the sample points among 7 classes
 85 |     # markers = ('s', 'd', 'o', '^', 'v', ".", ",", "<", ">", "8", "p", "P", "*", "h", "H", "+", "x", "X", "D", "|", "_")
 86 |     color_map = {0: '#487fff', 1: '#2ee3ff', 2: '#4eff4e', 3: '#ffca43', 4: '#ff365e', 5: '#d342ff', 6:'#626663'}
 87 |     plt.figure()
 88 |     for idx, cl in enumerate(np.unique(y_test)):
 89 |         # Plot first dataset as +
 90 |         plt.scatter(x=x_test_2d_0[y_tests[0] == cl, 0], y=x_test_2d_0[y_tests[0] == cl, 1],
 91 |                     marker="+", s=30,
 92 |                     c=color_map[idx],
 93 |                     label=class_labels1[cl])
 94 |         # Plot second dataset as o with no fill
 95 |         plt.scatter(x=x_test_2d_1[y_tests[1] == cl, 0], y=x_test_2d_1[y_tests[1] == cl, 1],
 96 |                     marker="o", facecolors="None", s=30, linewidths=1,
 97 |                     edgecolors=color_map[idx],
 98 |                     label=class_labels2[cl])
 99 |     plt.xlabel('X in t-SNE')
100 |     plt.ylabel('Y in t-SNE')
101 |     plt.legend(loc='lower right')
102 |     plt.title('t-SNE visualization with perplexity: {}'.format(perplexity))
103 |     plt.tight_layout()
104 | #     plt.savefig('{0}{1}.png'.format(plot_savename, int(perplexity)), dpi=300)
105 | plt.show()


--------------------------------------------------------------------------------
/visualization/t_sne.py:
--------------------------------------------------------------------------------
 1 | import pandas as pd
 2 | import numpy as np
 3 | from sklearn.preprocessing import LabelEncoder
 4 | from sklearn.preprocessing import StandardScaler
 5 | from sklearn.cross_validation import train_test_split
 6 | import utils
 7 | import glob, os
 8 | import pca.dataanalyzer as da, pca.pca as pca
 9 | from sklearn.metrics import accuracy_score
10 | 
11 | # visulaize the important characteristics of the dataset
12 | import matplotlib.pyplot as plt
13 | import matplotlib.markers as ms
14 | seed = 0
15 | num_headers = 16
16 | dirs = ["C:/Users/salik/Documents/Data/LinuxChrome/{}/".format(num_headers),
17 |         "C:/Users/salik/Documents/Data/WindowsFirefox/{}/".format(num_headers),
18 |         "C:/Users/salik/Documents/Data/WindowsChrome/{}/".format(num_headers),
19 |         "C:/Users/salik/Documents/Data/WindowsSalik/{}/".format(num_headers),
20 |         "C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
21 | # dirs = ["E:/Data/h5/https/", "E:/Data/h5/netflix/"]
22 | # dirs = ["C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
23 | # step 1: get the data
24 | dataframes = []
25 | num_examples = 0
26 | for dir in dirs:
27 |     for fullname in glob.iglob(dir + '*.h5'):
28 |         filename = os.path.basename(fullname)
29 |         df = utils.load_h5(dir, filename)
30 |         dataframes.append(df)
31 |         num_examples = len(df.values)
32 |     # create one large dataframe
33 | data = pd.concat(dataframes)
34 | data.sample(frac=1, random_state=seed).reset_index(drop=True)
35 | num_rows = data.shape[0]
36 | columns = data.columns
37 | print(columns)
38 | 
39 | # step 3: get features (x) and scale the features
40 | # get x and convert it to numpy array
41 | # x = da.getbytes(data, 1460)
42 | standard_scaler = StandardScaler()
43 | x = da.getbytes(data, num_headers*54)
44 | x_std = standard_scaler.fit_transform(x)
45 | # step 4: get class labels y and then encode it into number
46 | # get class label data
47 | y = data['label'].values
48 | # encode the class label
49 | class_labels = np.unique(y)
50 | label_encoder = LabelEncoder()
51 | y = label_encoder.fit_transform(y)
52 | # step 5: split the data into training set and test set
53 | test_percentage = 0.1
54 | x_tests = []
55 | y_tests = []
56 | x_train, x_test, y_train, y_test = train_test_split(x_std, y, test_size=test_percentage, random_state=seed)
57 | # t-distributed Stochastic Neighbor Embedding (t-SNE) visualization
58 | plot_savename = "t-sne_16headers_windows_linux_perplexity"
59 | from sklearn.manifold import TSNE
60 | perplexities = [30.0]
61 | 
62 | from matplotlib import rcParams
63 | # Make room for xlabel which is otherwise cut off
64 | rcParams.update({'figure.autolayout': True})
65 | 
66 | for perplexity in perplexities:
67 |     print("Starting perplexity: {}".format(perplexity))
68 |     tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=1000, random_state=seed, verbose=2)
69 |     x_test_2d = tsne.fit_transform(x_test)
70 | #     # scatter plot the sample points among 7 classes
71 | #     # markers = ('s', 'd', 'o', '^', 'v', ".", ",", "<", ">", "8", "p", "P", "*", "h", "H", "+", "x", "X", "D", "|", "_")
72 |     color_map = {0: '#487fff', 1: '#2ee3ff', 2: '#4eff4e', 3: '#ffca43', 4: '#ff365e', 5: '#d342ff', 6:'#626663'}
73 |     plt.figure()
74 |     for idx, cl in enumerate(np.unique(y_test)):
75 |         plt.scatter(x=x_test_2d[y_test == cl, 0],
76 |                     y=x_test_2d[y_test == cl, 1],
77 |                     marker=".",
78 |                     c=color_map[idx],
79 |                     label=class_labels[cl])
80 |     plt.xlabel('X in t-SNE')
81 |     plt.ylabel('Y in t-SNE')
82 |     plt.legend(loc='lower right')
83 |     # plt.title('t-SNE visualization with perplexity: {}'.format(perplexity))
84 |     plt.tight_layout()
85 | #     plt.savefig('{0}{1}.png'.format(plot_savename, int(perplexity)), dpi=300)
86 | plt.show()


--------------------------------------------------------------------------------
/visualization/vis_utils.py:
--------------------------------------------------------------------------------
  1 | import numpy as np
  2 | # import PIL.Image
  3 | import matplotlib.pyplot as plt
  4 | import matplotlib.cm as cm
  5 | # lowest = -1.0
  6 | lowest = 0.0
  7 | highest = 1.0
  8 | 
  9 | # --------------------------------------
 10 | # Color maps ([-1,1] -> [0,1]^3)
 11 | # --------------------------------------
 12 | 
 13 | 
 14 | def heatmap(x):
 15 | 
 16 |     x = x[..., np.newaxis]
 17 | 
 18 |     # positive relevance
 19 |     hrp = 0.9 - np.clip(x-0.3, 0, 0.7)/0.7*0.5
 20 |     hgp = 0.9 - np.clip(x-0.0, 0, 0.3)/0.3*0.5 - np.clip(x-0.3, 0, 0.7)/0.7*0.4
 21 |     hbp = 0.9 - np.clip(x-0.0, 0, 0.3)/0.3*0.5 - np.clip(x-0.3, 0, 0.7)/0.7*0.4
 22 | 
 23 |     # negative relevance
 24 |     hrn = 0.9 - np.clip(-x-0.0, 0, 0.3)/0.3*0.5 - np.clip(-x-0.3, 0, 0.7)/0.7*0.4
 25 |     hgn = 0.9 - np.clip(-x-0.0, 0, 0.3)/0.3*0.5 - np.clip(-x-0.3, 0, 0.7)/0.7*0.4
 26 |     hbn = 0.9 - np.clip(-x-0.3, 0, 0.7)/0.7*0.5
 27 | 
 28 |     r = hrp*(x >= 0)+hrn*(x < 0)
 29 |     g = hgp*(x >= 0)+hgn*(x < 0)
 30 |     b = hbp*(x >= 0)+hbn*(x < 0)
 31 | 
 32 |     return np.concatenate([r, g, b], axis=-1)
 33 | 
 34 | 
 35 | def graymap(x):
 36 | 
 37 |     x = x[..., np.newaxis]
 38 |     return np.concatenate([x, x, x], axis=-1)*0.5+0.5
 39 | 
 40 | # --------------------------------------
 41 | # Visualizing data
 42 | # --------------------------------------
 43 | 
 44 | # def visualize(x,colormap,name):
 45 | #
 46 | #     N = len(x)
 47 | #     assert(N <= 16)
 48 | #
 49 | #     x = colormap(x/np.abs(x).max())
 50 | #
 51 | #     # Create a mosaic and upsample
 52 | #     x = x.reshape([1, N, 29, 29, 3])
 53 | #     x = np.pad(x, ((0, 0), (0, 0), (2, 2), (2, 2), (0, 0)), 'constant', constant_values=1)
 54 | #     x = x.transpose([0, 2, 1, 3, 4]).reshape([1*33, N*33, 3])
 55 | #     x = np.kron(x, np.ones([2, 2, 1]))
 56 | #
 57 | #     PIL.Image.fromarray((x*255).astype('byte'), 'RGB').save(name)
 58 | 
 59 | 
 60 | def plt_vector(x, colormap, num_headers):
 61 |     N = len(x)
 62 |     assert (N <= 16)
 63 |     len_x = 54
 64 |     len_y = num_headers
 65 |     # size = int(np.ceil(np.sqrt(len(x[0]))))
 66 |     length = len_y*len_x
 67 |     data = np.zeros((N, length), dtype=np.float64)
 68 |     data[:, :x.shape[1]] = x
 69 |     data = colormap(data / np.abs(data).max())
 70 |     # data = data.reshape([1, N, size, size, 3])
 71 |     data = data.reshape([1, N, len_y, len_x, 3])
 72 |     # data = np.pad(data, ((0, 0), (0, 0), (2, 2), (2, 2), (0, 0)), 'constant', constant_values=1)
 73 |     data = data.transpose([0, 2, 1, 3, 4]).reshape([1 * (len_y), N * (len_x), 3])
 74 |     return data
 75 |     # data = np.kron(data, np.ones([2, 2, 1])) # scales
 76 | 
 77 | 
 78 | def add_subplot(data, num_plots, plot_index, title, figure):
 79 |     fig = figure
 80 |     ax = fig.add_subplot(num_plots, 1, plot_index)
 81 |     cax = ax.imshow(data, interpolation='nearest', aspect='auto')
 82 |     # cbar = fig.colorbar(cax, ticks=[0, 1])
 83 |     # cbar.ax.set_yticklabels(['0', '> 1'])  # vertically oriented colorbar
 84 |     ax.set_title(title)
 85 | 
 86 | def plot_data(data, title):
 87 |     plt.figure(figsize=(6.4, 2.5))  # figuresize to make 16 headers plot look good
 88 |     # plt.axis('scaled')
 89 |     plt.imshow(data, interpolation='nearest', aspect='auto')
 90 |     plt.title(title)
 91 |     plt.tight_layout()
 92 | 
 93 | 
 94 | def plotNNFilter(units):
 95 |     filters = units.shape[3]
 96 |     plt.figure(1, figsize=(20, 20))
 97 |     n_columns = 6
 98 |     n_rows = np.ceil(filters / n_columns) + 1
 99 |     for i in range(filters):
100 |         plt.subplot(n_rows, n_columns, i+1)
101 |         plt.title('Filter ' + str(i))
102 |         plt.imshow(units[0, :, :, i], interpolation="nearest", cmap="gray")
103 | 
104 | 


--------------------------------------------------------------------------------
/visualization/visualize_activations.py:
--------------------------------------------------------------------------------
 1 | import tensorflow as tf
 2 | import matplotlib.pyplot as plt
 3 | import numpy as np
 4 | from tf import dataset
 5 | from visualization import classes_module as md, vis_utils
 6 | 
 7 | num_headers = 16
 8 | hidden_units = 12
 9 | train_dirs = ['C:/Users/salik/Documents/Data/LinuxChrome/{0}/'.format(num_headers),
10 |                 'C:/Users/salik/Documents/Data/WindowsFirefox/{0}/'.format(num_headers),
11 |                 'C:/Users/salik/Documents/Data/WindowsAndreas/{0}/'.format(num_headers),
12 |                 'C:/Users/salik/Documents/Data/WindowsSalik/{0}/'.format(num_headers)]
13 | 
14 | test_dirs = ['C:/Users/salik/Documents/Data/WindowsChrome/{0}/'.format(num_headers)]
15 | seed = 0
16 | input_size = 54*num_headers
17 | data = dataset.read_data_sets(train_dirs, test_dirs, merge_data=True, one_hot=True,
18 |                                   validation_size=0.1,
19 |                                   test_size=0.1,
20 |                                   balance_classes=False,
21 |                                   payload_length=input_size,
22 |                                   seed=seed)
23 | load_dir = "../trained_models/"
24 | model_name = 'header_{0}_{1}_units.ckpt'.format(num_headers, hidden_units)
25 | sess = tf.Session()
26 | # First let's load meta graph and restore weights
27 | saver = tf.train.import_meta_graph(load_dir + model_name + ".meta")
28 | saver.restore(sess, tf.train.latest_checkpoint(load_dir))
29 | 
30 | # Now, let's access and create placeholders variables and
31 | # create feed-dict to feed new data
32 | graph = tf.get_default_graph()
33 | names = [tensor.name for tensor in graph.as_graph_def().node]
34 | x_pl = graph.get_tensor_by_name("xPlaceholder:0")
35 | y_pl = graph.get_tensor_by_name("yPlaceholder:0")
36 | layer1 = graph.get_tensor_by_name('layer1/activation:0')
37 | W1 = graph.get_tensor_by_name("layer1/W:0")
38 | b1 = graph.get_tensor_by_name("layer1/b:0")
39 | W_out = graph.get_tensor_by_name("output_layer/W:0")
40 | b_out = graph.get_tensor_by_name("output_layer/b:0")
41 | y = graph.get_tensor_by_name("output_layer/activation:0")
42 | # Get index of prediction
43 | y_ = tf.argmax(y, axis=1)
44 | 
45 | feed_dict = {x_pl: data.test.payloads, y_pl: data.test.labels}
46 | y_preds = sess.run(fetches=y_, feed_dict=feed_dict)
47 | y_true = tf.argmax(data.test.labels, axis=1).eval(session=sess)
48 | 
49 | # Number of samples to plot
50 | sample_size = 10
51 | num_samples_picked = 0
52 | sample_payloads = []
53 | sample_labels = []
54 | classes = np.array(dataset._label_encoder.classes_)
55 | # Which class do we want to visualise
56 | class_name = "drtv"
57 | class_number = np.argmax(classes == class_name)
58 | # Get all the places where that class label is the true label
59 | true_idx = [i for i, v in enumerate(y_true) if v == class_number]
60 | # What predictions did the network make at those indicies
61 | preds_at_idx = y_preds[true_idx]
62 | # Iterate over predictions and pick #sample_size where prediction was right
63 | for i, v in enumerate(preds_at_idx):
64 |     if v == class_number and num_samples_picked < sample_size:
65 |         sample_payloads.append(data.test.payloads[true_idx[i]])
66 |         sample_labels.append(data.test.labels[true_idx[i]])
67 |         num_samples_picked += 1
68 | 
69 | 
70 | X = np.array(sample_payloads)
71 | T = np.array(sample_labels)
72 | # Extract the weights and biases from the network
73 | l1_weights, l1_biases, out_weights, out_biases = sess.run([W1, b1, W_out, b_out])
74 | # Create visualisation network in sequential manner
75 | nn = md.Network([md.Linear(l1_weights, l1_biases), md.ReLU(),
76 |                  md.Linear(out_weights, out_biases), md.ReLU()
77 |                  ])
78 | # Make a sensitivity and relevance plot for each sample
79 | for i, v in enumerate(X):
80 |     t = T[i]
81 |     classname = classes[np.argmax(t)]
82 |     Y = np.array([nn.forward(v)])
83 |     S = np.array([nn.gradprop(t)**2])
84 |     D = nn.relprop(Y*t)
85 |     s_data = vis_utils.plt_vector(S, vis_utils.heatmap,  num_headers)
86 |     s_title = "Sensitivity for {0}".format(classname)
87 |     rel_data = vis_utils.plt_vector(D, vis_utils.heatmap, num_headers)
88 |     r_title = "Relevance for {0}".format(classname)
89 |     vis_utils.plot_data(s_data, s_title)
90 |     vis_utils.plot_data(rel_data, r_title)
91 | plt.show()
92 | 


--------------------------------------------------------------------------------