├── .gitignore
├── LICENSE
├── Master_Thesis.pdf
├── README.md
├── convert_pcap_to_h5.py
├── data_exploration.py
├── datasets
│   ├── LinuxChrome
│   │   ├── 8
│   │   │   └── extracted_8-2104_1422.h5
│   │   └── 16
│   │       └── extracted_16-2104_1523.h5
│   ├── WindowsAndreas
│   │   ├── 8
│   │   │   └── extracted_8-2304_0930.h5
│   │   └── 16
│   │       └── extracted_16-2304_0932.h5
│   ├── WindowsChrome
│   │   ├── 8
│   │   │   └── extracted_8-2004_1553.h5
│   │   └── 16
│   │       └── extracted_16-2004_1615.h5
│   ├── WindowsFirefox
│   │   ├── 8
│   │   │   └── extracted_8-2004_2104.h5
│   │   └── 16
│   │       └── extracted_16-2004_2210.h5
│   └── WindowsSalik
│       ├── 8
│       │   └── extracted_8-2004_1525.h5
│       └── 16
│           └── extracted_16-2004_1542.h5
├── extract_headers.py
├── extract_payload.py
├── filter_http_https.py
├── ip_header_test.py
├── pca
│   ├── dataanalyzer.py
│   ├── pca.py
│   └── summarystats.py
├── pcap
│   └── pcaptools.py
├── tf
│   ├── confusionmatrix.py
│   ├── dataset.py
│   ├── early_stopping.py
│   └── tf_utils.py
├── trafficgen
│   ├── PyTgen
│   │   ├── config.py
│   │   ├── core
│   │   │   ├── __init__.py
│   │   │   ├── generator.py
│   │   │   ├── runner.py
│   │   │   └── scheduler.py
│   │   ├── nslookup.py
│   │   └── run.py
│   └── Streaming
│       ├── streaming_generator.py
│       ├── streaming_types.py
│       ├── unix_capture.py
│       └── win_capture.py
├── train
│   ├── train_header.py
│   ├── train_logistic.py
│   └── train_payload.py
├── utils.py
└── visualization
    ├── classes_module.py
    ├── heatmap.py
    ├── histogram.py
    ├── pca_plots.py
    ├── t-sne_compare.py
    ├── t_sne.py
    ├── vis_utils.py
    └── visualize_activations.py
/.gitignore:
--------------------------------------------------------------------------------
1 | trafficgen/__pycache__/
2 | trafficgen/core/__pycache__/
3 | __pycache__/
4 | *.pcap
5 | .idea/
6 | .vscode/
7 | constants.py
8 | trained_models/
9 | tensorboard/
10 | *.png
11 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 Salik Lennert Pedersen
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Master_Thesis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/Master_Thesis.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Classification of encrypted traffic using deep learning
2 | This repository contains the code used and developed during a master's thesis at DTU Compute in 2018.
3 | Professor [Ole Winther](http://cogsys.imm.dtu.dk/staff/winther/) supervised the thesis.
4 | Alex Omø Agerholm from [Napatech](https://www.napatech.com/) was co-supervisor on the project.
5 |
6 | In this thesis we examined and evaluated different ways of classifying encrypted network traffic with neural networks. For this purpose we created a dataset with a streaming/non-streaming focus. The dataset comprises seven classes: five streaming and two non-streaming.
7 | The thesis serves as a preliminary proof-of-concept for Napatech A/S.
8 |
9 | We propose a novel approach that utilizes the unencrypted parts of network traffic, namely the headers. The initial headers of a session are concatenated to form a signature datapoint, as shown in the following figure:
10 |
11 |
12 |
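A minimal sketch of how such a signature datapoint can be built from the raw packets of a session is shown below. The helper name `make_signature` and its arguments are illustrative and not code from this repository; 54 bytes corresponds to an Ethernet + IPv4 + TCP header, the frame size used throughout the repository.

```python
import numpy as np

HEADER_LEN = 54  # Ethernet (14) + IPv4 (20) + TCP (20) bytes per packet

def make_signature(session_packets, num_headers=8):
    """Concatenate the first num_headers packet headers of a session into one
    fixed-length, zero-padded signature vector (illustrative helper)."""
    signature = np.zeros(num_headers * HEADER_LEN, dtype=np.uint8)
    for i, pkt in enumerate(session_packets[:num_headers]):
        header = np.frombuffer(pkt[:HEADER_LEN], dtype=np.uint8)
        signature[i * HEADER_LEN:i * HEADER_LEN + len(header)] = header
    return signature
```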
13 | The datasets created from the first 8 and 16 headers are available in the `datasets` folder of this repository.
14 | We explored the data by running t-SNE on the concatenated-header datasets. As can be seen in the t-SNE plot below, which shows all the individual datasets merged, it appears possible to classify the individual classes.
15 |
16 |
17 |
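The repository's own t-SNE code lives in `visualization/t_sne.py` and is not reproduced in this export. As a rough, self-contained sketch of the same exploration, one could run scikit-learn's t-SNE directly on one of the published 8-header datasets; the zero-padding mirrors `pca/dataanalyzer.py`, and it is assumed that `pd.read_hdf` can infer the single key stored in the file.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# Load one of the published concatenated-header datasets (8 headers).
df = pd.read_hdf('datasets/LinuxChrome/8/extracted_8-2104_1422.h5')

# Zero-pad every signature to the full 8 * 54 bytes, as in pca/dataanalyzer.py.
length = 8 * 54
X = np.zeros((len(df), length), dtype=np.float32)
for i, v in enumerate(df['bytes'].values):
    v = np.frombuffer(v, dtype=np.uint8) if isinstance(v, bytes) else np.asarray(v, dtype=np.uint8)
    X[i, :min(len(v), length)] = v[:length]

# Two-dimensional embedding of the (scaled) signatures.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X / 255.0)
print(embedding.shape, sorted(df['label'].unique()))
```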
18 | In experiments using the header-based approach we achieve very promising results, showing that a simple neural network with a single hidden layer of fewer than 50 units can predict the individual classes with an accuracy of 96.4% and a per-class AUC of 0.99 to 1.00, as shown in the following figures.
19 |
20 |
21 |
22 | The thesis thereby provides a solution for network traffic classification based on the unencrypted headers.
23 |
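The training code for the header-based experiments is presumably `train/train_header.py`, which is not included in this export. A minimal sketch of the kind of model described above, written in the TensorFlow 1.x style used elsewhere in the repository, could look as follows; the 48-unit hidden layer, learning rate and names are illustrative, not the exact thesis configuration.

```python
import tensorflow as tf

num_headers, num_classes = 8, 7
input_dim = num_headers * 54              # length of a concatenated-header signature

x_pl = tf.placeholder(tf.float32, [None, input_dim], name='x_pl')
y_pl = tf.placeholder(tf.float32, [None, num_classes], name='y_pl')

# A single hidden layer with fewer than 50 units, as described above.
hidden = tf.layers.dense(x_pl, units=48, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, units=num_classes)

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_pl, logits=logits))
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)

correct = tf.equal(tf.argmax(logits, axis=1), tf.argmax(y_pl, axis=1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
```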
--------------------------------------------------------------------------------
/convert_pcap_to_h5.py:
--------------------------------------------------------------------------------
1 | import pcap.pcaptools as pcap
2 |
3 |
4 |
5 | if __name__ == '__main__':
6 | pcap.process_pcap_to_h5('/home/mclrn/Data/', '/home/mclrn/Data/h5/', session_threshold=5000)
--------------------------------------------------------------------------------
/data_exploration.py:
--------------------------------------------------------------------------------
1 | import pca.dataanalyzer
2 | import utils
3 | import glob
4 | import os
5 | import pandas as pd
6 | import numpy as np
7 | from matplotlib import pyplot as plt
8 | from pca.dataanalyzer import byteindextoheaderfield
9 |
10 | num_headers = 8
11 | train_dir = 'C:/Users/salik/Documents/Data/LinuxChrome/{}/'.format(num_headers)
12 | dataframes = []
13 | for fullname in glob.iglob(train_dir + '*.h5'):
14 | filename = os.path.basename(fullname)
15 | df = utils.load_h5(train_dir, filename)
16 | dataframes.append(df)
17 | # create one large dataframe
18 | data = pd.concat(dataframes)
19 | print(len(data))
20 | print("drtv:", len(data[data['label'] == 'drtv']))
21 | print("hbo:", len(data[data['label'] == 'hbo']))
22 | print("http:", len(data[data['label'] == 'http']))
23 | print("https:", len(data[data['label'] == 'https']))
24 | print("netflix:", len(data[data['label'] == 'netflix']))
25 | print("twitch:", len(data[data['label'] == 'twitch']))
26 | print("youtube:", len(data[data['label'] == 'youtube']))
27 |
28 |
29 | # dr_mean, dr_mean_sub, dr_std = getmeanstd(data, 'drtv')
30 | # nf_mean, nf_mean_sub, nf_std = getmeanstd(data, 'netflix')
31 | # mean_diff = dr_mean - nf_mean
32 | # sort_diff = (-abs(mean_diff)).argsort()  # Sort on absolute values in descending order
33 | # for i in range(10):
34 | # packetnumber = math.ceil(sort_diff[i] / 54)
35 | # bytenumber = sort_diff[i] % 54
36 | # print('Index %i is bytenumber %i in packet: %i' % (sort_diff[i],bytenumber, packetnumber), byteindextoheaderfield(sort_diff[i]))
37 | #
38 | # bytes = getbytes(data)
39 | # labels = data['label']
40 | # pca = p.runpca(bytes, 50)
41 | # # p.plotvarianceexp(pca, 50)
42 | # Z = p.componentprojection(bytes, pca)
43 | # for pc in range(17):
44 | # p.plotprojection(Z, pc, labels)
45 | # p.showplots()
46 | # id_max = np.argmax(mean_diff)
47 | # print(byteindextoheaderfield(id_max))
48 | # id_min = np.argmin(mean_diff)
49 | # print(dr_mean-nf_mean)
50 |
51 | def createBoxplotsFromColumns(title,param, param1):
52 |
53 | plt.title(title)
54 | plt.boxplot([param,param1])
55 |
56 | plt.savefig('boxplots/boxplot:%s.png' % title,dpi=300)
57 | plt.gcf().clear()
58 |
59 | def compare_data(path1, path2, nrheaders):
60 | df1 = pd.DataFrame()
61 | df2 = pd.DataFrame()
62 |
63 | for fullname in glob.iglob(path1 + '*.h5'):
64 | filename = os.path.basename(fullname)
65 | df1 = utils.load_h5(path1, filename)
66 |
67 | for fullname in glob.iglob(path2 + '*.h5'):
68 | filename = os.path.basename(fullname)
69 | df2 = utils.load_h5(path2, filename)
70 |
71 | classes = []
72 | # find all classes
73 | for label in set(df1['label']):
74 | classes.append(label)
75 | print(set(df1['label']))
76 | print(set(df2['label']))
77 | # filter on classes
78 | for c in classes:
79 | ## Exclude youtube as it contains both UDP and TCP
80 | if(c == 'youtube'):
81 | continue
82 |
83 |
84 | # create selector
85 | df1_selector = df1['label'] == c
86 | df2_selector = df2['label'] == c
87 |
88 |
89 | df1_values = df1[df1_selector]['bytes'].values
90 | df2_values = df2[df2_selector]['bytes'].values
91 |
92 | df1_bytes = np.zeros((df1_values.shape[0], nrheaders * 54))
93 | df2_bytes = np.zeros((df2_values.shape[0], nrheaders * 54))
94 |
95 | for i, v in enumerate(df1_values):
96 | payload = np.zeros(nrheaders * 54, dtype=np.uint8)
97 | payload[:v.shape[0]] = v
98 | df1_bytes[i] = payload
99 |
100 | for i, v in enumerate(df2_values):
101 | payload = np.zeros(nrheaders * 54, dtype=np.uint8)
102 | payload[:v.shape[0]] = v
103 | df2_bytes[i] = payload
104 |
105 |
106 |
107 | # Extract byte 23 to determine the protocol.
108 | TCP = True if int(df2_bytes[0][23]) == 6 else False
109 |
110 |
111 | df1_mean = np.mean(df1_bytes, axis=0)
112 | df2_mean = np.mean(df2_bytes, axis=0)
113 |
114 | df1_min = np.min(df1_bytes, axis=0)
115 | df2_min = np.min(df2_bytes, axis=0)
116 |
117 | df1_max = np.max(df1_bytes, axis=0)
118 | df2_max = np.max(df2_bytes, axis=0)
119 |
120 |
121 |
122 | for index, mean in enumerate(df1_mean):
123 | if(index % 25 == 0):
124 | print(c,index)
125 | if df1_mean[index] > 0 or df2_mean[index] > 0:
126 | if(int(df1_min[index]) != int(df1_max[index]) or int(df2_min[index]) != int(df2_max[index])):
127 | if(int(df1_mean[index]) != int(df2_mean[index]) or int(df1_min[index]) != int(df2_min[index]) or int(df1_max[index]) != int(df2_max[index])):
128 | print(index, " : ", int(df1_mean[index]), ' : ' , int(df2_mean[index]), int(df1_min[index]), int(df2_min[index]), int(df1_max[index]), int(df2_max[index]))
129 | headername = byteindextoheaderfield(index, TCP)
130 | headername = headername.replace('/',' ')
131 | createBoxplotsFromColumns(c+headername +':'+str(index), df1_bytes[:,index], df2_bytes[:,index])
132 |
133 |
134 | # compare_data('/home/mclrn/Data/linux/no_checksum/8/', '/home/mclrn/Data/windows_firefox/no_checksum/8/',8)
135 |
136 |
--------------------------------------------------------------------------------
/datasets/LinuxChrome/16/extracted_16-2104_1523.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/LinuxChrome/16/extracted_16-2104_1523.h5
--------------------------------------------------------------------------------
/datasets/LinuxChrome/8/extracted_8-2104_1422.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/LinuxChrome/8/extracted_8-2104_1422.h5
--------------------------------------------------------------------------------
/datasets/WindowsAndreas/16/extracted_16-2304_0932.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsAndreas/16/extracted_16-2304_0932.h5
--------------------------------------------------------------------------------
/datasets/WindowsAndreas/8/extracted_8-2304_0930.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsAndreas/8/extracted_8-2304_0930.h5
--------------------------------------------------------------------------------
/datasets/WindowsChrome/16/extracted_16-2004_1615.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsChrome/16/extracted_16-2004_1615.h5
--------------------------------------------------------------------------------
/datasets/WindowsChrome/8/extracted_8-2004_1553.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsChrome/8/extracted_8-2004_1553.h5
--------------------------------------------------------------------------------
/datasets/WindowsFirefox/16/extracted_16-2004_2210.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsFirefox/16/extracted_16-2004_2210.h5
--------------------------------------------------------------------------------
/datasets/WindowsFirefox/8/extracted_8-2004_2104.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsFirefox/8/extracted_8-2004_2104.h5
--------------------------------------------------------------------------------
/datasets/WindowsSalik/16/extracted_16-2004_1542.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsSalik/16/extracted_16-2004_1542.h5
--------------------------------------------------------------------------------
/datasets/WindowsSalik/8/extracted_8-2004_1525.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SalikLP/classification-of-encrypted-traffic/3c86e098aab58941f9339bb64945c1112ab556ef/datasets/WindowsSalik/8/extracted_8-2004_1525.h5
--------------------------------------------------------------------------------
/extract_headers.py:
--------------------------------------------------------------------------------
1 | import datetime
2 |
3 | import utils
4 |
5 | if __name__ == '__main__':
6 | loaddir = "/home/mclrn/Data/h5/"
7 | headers = [1, 2, 4, 8, 16]
8 | for headersize in headers:
9 | savedir= "/home/mclrn/Data/h5/" + str(headersize) + "/"
10 | now = datetime.datetime.now()
11 | savename = "extracted_%d-%.2d%.2d_%.2d%.2d" % (headersize, now.day, now.month, now.hour, now.minute)
12 | utils.saveextractedheaders(loaddir, savedir, savename, num_headers=headersize)
--------------------------------------------------------------------------------
/extract_payload.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import utils
3 | import glob
4 | import os
5 | import numpy as np
6 | import pandas as pd
7 |
8 |
9 | if __name__ == '__main__':
10 | loaddir = "E:/Data/h5/"
11 | labels = ['https', 'netflix']
12 | max_packet_length = 1514
13 | for label in labels:
14 | print("Starting label: " + label)
15 | savedir = loaddir + label + "/"
16 | now = datetime.datetime.now()
17 | savename = "payload_%s-%.2d%.2d_%.2d%.2d" % (label, now.day, now.month, now.hour, now.minute)
18 | filelist = glob.glob(loaddir + label + '*.h5')
19 | # Try only one of each file
20 | fullname = filelist[0]
21 | # for fullname in filelist:
22 | load_dir, filename = os.path.split(fullname)
23 | print("Loading: {0}".format(filename))
24 | df = utils.load_h5(load_dir, filename)
25 | packets = df['bytes'].values
26 | payloads = []
27 |         payload_labels = []
28 | filenames = []
29 | for packet in packets:
30 | if len(packet) == max_packet_length:
31 |                 # Extract the payload from the packet; it should have length 1460 bytes
32 | payload = packet[54:]
33 | p = np.fromstring(payload, dtype=np.uint8)
34 | payloads.append(p)
35 |                 payload_labels.append(label)
36 | filenames.append(filename)
37 |         d = {'filename': filenames, 'bytes': payloads, 'label': payload_labels}
38 | dataframe = pd.DataFrame(data=d)
39 | key = savename.split('-')[0]
40 | dataframe.to_hdf(savedir + savename + '.h5', key=key, mode='w')
41 | # utils.saveextractedheaders(loaddir, savedir, savename, num_headers=headersize)
42 | print("Done with label: " + label)
43 |
--------------------------------------------------------------------------------
/filter_http_https.py:
--------------------------------------------------------------------------------
1 | import utils
2 | import trafficgen.PyTgen.config as cf
3 | import socket
4 |
5 |
6 | def save_dataframe_h5(df, dir, filename):
7 | key = filename.split('-')[0]
8 | df.to_hdf(dir + filename + '.h5', key=key)
9 |
10 |
11 | Conf = cf.Conf
12 | all_ips = []
13 | for http in Conf.http_urls:
14 | url = http.split('/')[2]
15 | ip_list = []
16 | ais = socket.getaddrinfo(url, 0, 0, 0, 0)
17 | for result in ais:
18 | ip_list.append(result[-1][0])
19 | ip_list = list(set(ip_list))
20 | all_ips.append(ip_list)
21 |
22 | http_list = [ip for ips in all_ips for ip in ips]
23 |
24 | all_ips = []
25 | for http in Conf.https_urls:
26 | url = http.split('/')[2]
27 | ip_list = []
28 | ais = socket.getaddrinfo(url, 0, 0, 0, 0)
29 | for result in ais:
30 | ip_list.append(result[-1][0])
31 | ip_list = list(set(ip_list))
32 | all_ips.append(ip_list)
33 | https_list = [ip for ips in all_ips for ip in ips]
34 |
35 | dir = 'E:/Data/'
36 | filename = 'http_https-browse'
37 | http_dataframe = utils.filter_pcap_by_ip(dir, filename, http_list, 'http')
38 | save_dataframe_h5(http_dataframe, dir, 'http-browse-1104_2010')
39 |
40 | https_dataframe = utils.filter_pcap_by_ip(dir, filename, https_list, 'https')
41 | save_dataframe_h5(https_dataframe, dir, 'https-browse-1104_2020')
42 |
--------------------------------------------------------------------------------
/ip_header_test.py:
--------------------------------------------------------------------------------
1 | import glob
2 | import os
3 | import utils
4 | import numpy as np
5 | import multiprocessing
6 |
7 | num_headers = 1
8 |
9 | load_dir = 'E:/Data/h5/'
10 |
11 |
12 | def ipheadertask(filelist):
13 | j = 1
14 | for fullname in filelist:
15 | print("Loading filenr: {}".format(j))
16 | load_dir, filename = os.path.split(fullname)
17 | df = utils.load_h5(load_dir, filename)
18 | frames = df['bytes'].values
19 | for i, frame in enumerate(frames):
20 | p = np.fromstring(frame, dtype=np.uint8)
21 |             if p[14] != 69:  # 0x45 = IPv4 with a 20-byte header
22 | print("IP Header length not 20! in file {0}".format(filename))
23 | j += 1
24 |
25 | if __name__ == '__main__':
26 | filelist = glob.glob(load_dir + '*.h5')
27 | filesplits = utils.split_list(filelist, 4)
28 |
29 | threads = []
30 | for split in filesplits:
31 |         # create a process for each split of files
32 | t = multiprocessing.Process(target=ipheadertask, args=(split,))
33 | threads.append(t)
34 | t.start()
35 | # create one large dataframe
36 |
37 | for t in threads:
38 | t.join()
39 | print("Process joined: ", t)
40 |
--------------------------------------------------------------------------------
/pca/dataanalyzer.py:
--------------------------------------------------------------------------------
1 | import utils
2 | import glob
3 | import os
4 | import pandas as pd
5 | import numpy as np
6 | import math
7 | import pca as p
8 |
9 |
10 | def getbytes(dataframe, payload_length=810):
11 | values = dataframe['bytes'].values
12 | bytes = np.zeros((values.shape[0], payload_length))
13 | for i, v in enumerate(values):
14 | payload = np.zeros(payload_length, dtype=np.uint8)
15 | payload[:v.shape[0]] = v
16 | bytes[i] = payload
17 | return bytes
18 |
19 |
20 | def getmeanstd(dataframe, label):
21 | labels = dataframe['label'] == label
22 | bytes = getbytes(dataframe[labels])
23 | # values = dataframe[labels]['bytes'].values
24 | # bytes = np.zeros((values.shape[0], values[0].shape[0]))
25 | # for i, v in enumerate(values):
26 | # bytes[i] = v
27 |
28 | # Ys = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
29 | mean = np.mean(bytes, axis=0)
30 | mean_sub = np.subtract(bytes, mean)
31 | std = mean_sub / np.std(bytes, axis=0)
32 | return mean, mean_sub, std
33 |
34 |
35 | def byteindextoheaderfield(number, TCP=True):
36 | if TCP:
37 | bytenumber = number % 54
38 | else:
39 | bytenumber = number % 42
40 |
41 | if bytenumber in range(6):
42 | return "Destination MAC"
43 | if bytenumber in range(6, 12):
44 | return "Source MAC"
45 | if bytenumber in (12, 13):
46 | return "Eth. Type"
47 | if bytenumber == 14:
48 | return "IP Version and header length"
49 | if bytenumber == 15:
50 |         return "Differentiated Services / ECN (IP header)"
51 | if bytenumber in (16, 17):
52 | return "Total Length (IP header)"
53 | if bytenumber in (18, 19):
54 | return "Identification (IP header)"
55 | if bytenumber in (20, 21):
56 |         return "Flags / Fragment offset (IP header)"
57 | if bytenumber == 22:
58 | return "Time to live (IP header)"
59 | if bytenumber == 23:
60 | return "Protocol (IP header)"
61 | if bytenumber in (24, 25):
62 | return "Header checksum (IP header)"
63 | if bytenumber in range(26, 30):
64 | return "Source IP (IP header)"
65 | if bytenumber in range(30, 34):
66 | return "Destination IP (IP header)"
67 | if bytenumber in (34, 35):
68 | return "Source Port (TCP/UDP header)"
69 | if bytenumber in (36, 37):
70 | return "Destination Port (TCP/UDP header)"
71 | if bytenumber in range(38, 42):
72 | if TCP:
73 | return "Sequence number (TCP header)"
74 | elif bytenumber in (38, 39):
75 | return "Length of data (UDP Header)"
76 | else:
77 | return "UDP Checksum (UDP Header)"
78 | if bytenumber in range(42, 46):
79 | return "ACK number (TCP header)"
80 | if bytenumber == 46:
81 | return "TCP Header length or Nonce (TCP header)"
82 | if bytenumber == 47:
83 | return "TCP FLAGS (CWR, ECN-ECHO, ACK, PUSH, RST, SYN, FIN) (TCP header)"
84 | if bytenumber in (48, 49):
85 | return "Window size (TCP header)"
86 | if bytenumber in (50, 51):
87 | return "Checksum (TCP header)"
88 | if bytenumber in (52, 53):
89 | return "Urgent Pointer (TCP header)"
90 |
91 |
92 |
93 |
94 |
--------------------------------------------------------------------------------
/pca/pca.py:
--------------------------------------------------------------------------------
1 | from sklearn.decomposition import PCA
2 | import matplotlib.pyplot as plt
3 | import numpy as np
4 |
5 |
6 | def runpca(X, num_comp=None):
7 | pca = PCA(n_components=num_comp, svd_solver='full')
8 | pca.fit(X)
9 | # print(pca.n_components_)
10 | # print(pca.explained_variance_ratio_)
11 | # print(sum(pca.explained_variance_ratio_))
12 | return pca
13 |
14 |
15 | def plotvarianceexp(pca, num_comp):
16 | # plt.ion()
17 | rho = pca.explained_variance_ratio_
18 | rhosum = np.empty(len(rho))
19 | for i in range(1, len(rho) + 1):
20 | rhosum[i - 1] = np.sum(rho[0:i])
21 | ind = np.arange(num_comp)
22 | width = 0.35
23 | opacity = 0.8
24 | fig, ax = plt.subplots()
25 | ax.set_ylim(0, 1.1)
26 | # Variance explained by single component
27 | bars1 = ax.bar(ind, rho[0:num_comp], width, alpha=opacity, color="xkcd:blue", label='Single')
28 |     # Variance explained by cumulative components
29 |     bars2 = ax.bar(ind + width, rhosum[0:num_comp], width, alpha=opacity, color="g", label='Cumulative')
30 | # add some text for labels, title and axes ticks
31 | ax.set_ylabel('Variance Explained', fontsize=16)
32 | # ax.set_title('Variance Explained by principal components')
33 | ax.set_xticks(ind + width / 2)
34 |
35 | labels = ["" for x in range(num_comp)]
36 | for k in range(0, num_comp):
37 | labels[k] = ('$v{0}$'.format(k + 1))
38 | ax.set_xticklabels(labels, fontsize=16)
39 | ax.legend(fontsize=16)
40 | autolabel(bars2, ax)
41 | plt.yticks(fontsize=16)
42 | plt.draw()
43 |
44 |
45 | def componentprojection(data, pca):
46 | V = pca.components_.T
47 | Z = data @ V
48 | return Z
49 |
50 | def plotprojection(Z, pc, labels, class_labels):
51 | diff_labels = np.unique(labels)
52 | opacity = 0.8
53 | fig, ax = plt.subplots()
54 | color_map = {0: 'orangered', 1: 'royalblue', 2: 'lightgreen', 3: 'darkorchid', 4: 'teal', 5: 'darkslategrey',
55 | 6: 'darkgreen', 7: 'darkgrey'}
56 | for label in diff_labels:
57 | idx = labels == label
58 | ax.plot(Z[idx, pc], Z[idx, pc + 1], 'o', alpha=opacity, c=color_map[label], label='{label}'.format(label=class_labels[label]))
59 | # ax.plot(Z[idx_below, pc], Z[idx_below, pc + 1], 'o', alpha=opacity,
60 | # label='{name} below mean'.format(name=attributeNames[att]))
61 | ax.set_ylabel('$v{0}$'.format(pc + 2))
62 | ax.set_xlabel('$v{0}$'.format(pc + 1))
63 | ax.legend()
64 | ax.set_title('Data projected on v{0} and v{1}'.format(pc+1, pc+2))
65 | # fig.savefig('v{0}_v{1}_{att}.png'.format(pc + 1, pc + 2, att=attributeNames[att]), dpi=300)
66 | plt.draw()
67 |
68 |
69 | def showplots():
70 | plt.show()
71 |
72 |
73 | def autolabel(rects, ax):
74 | """
75 | Attach a text label above each bar displaying its height
76 | """
77 | for rect in rects:
78 | height = rect.get_height()
79 | ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
80 | '%4.2f' % height,
81 | ha='center', va='bottom', fontsize=16)
--------------------------------------------------------------------------------
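The commented-out block in data_exploration.py shows how these PCA helpers were originally driven. Below is a self-contained usage sketch with random data standing in for the padded header bytes (432 = 8 * 54); the class labels are the seven classes listed in the README, and the import assumes the snippet is run from inside the pca/ folder.

```python
import numpy as np
from pca import runpca, plotvarianceexp, componentprojection, plotprojection, showplots

# Random stand-in for the zero-padded header bytes; in the real pipeline X comes
# from pca.dataanalyzer.getbytes and y from the dataframe's 'label' column.
X = np.random.randint(0, 256, size=(500, 432)).astype(np.float64)
y = np.random.randint(0, 7, size=500)
class_labels = ['drtv', 'hbo', 'http', 'https', 'netflix', 'twitch', 'youtube']

pca = runpca(X, num_comp=50)         # fit PCA with 50 components
plotvarianceexp(pca, num_comp=10)    # bar plot of explained variance
Z = componentprojection(X, pca)      # project onto the principal components
plotprojection(Z, pc=0, labels=y, class_labels=class_labels)
showplots()
```
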
/pca/summarystats.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import utils
4 | # import seaborn as sns
5 | import glob
6 | import os
7 | from scipy import stats, integrate
8 | import matplotlib.pyplot as plt
9 | # sns.set(color_codes=True)
10 |
11 |
12 | def remove_checksum(nrheaders):
13 | # nrheaders = 1
14 | read_dir = '/home/mclrn/Data/linux/'
15 | headers = '{0}/'.format(nrheaders)
16 | dataframes = []
17 | for fullname in glob.iglob(read_dir + headers + '*.h5'):
18 | filename = os.path.basename(fullname)
19 | df = utils.load_h5(read_dir + headers, filename)
20 | dataframes.append(df)
21 | # create one large dataframe
22 | df = pd.concat(dataframes)
23 | # df = pd.read_hdf('/home/mclrn/Data/salik_windows/{0}/'.format(nrheaders), key="extracted_{0}".format(nrheaders))
24 | df = df.sample(frac=1).reset_index(drop=True)
25 | values, counts = np.unique(df['label'], return_counts=True)
26 | print(values, counts)
27 |
28 | #selector = df['label'] == 'youtube'
29 |
30 | values = df['bytes'].values
31 | bytes = np.zeros((values.shape[0], nrheaders * 54))
32 | for i, v in enumerate(values):
33 | payload = np.zeros(nrheaders * 54, dtype=np.uint8)
34 | payload[:v.shape[0]] = v
35 | bytes[i] = payload
36 |
37 | #mean = np.mean(bytes, axis=0)
38 | #min = np.min(bytes, axis=0)
39 | #max = np.max(bytes, axis=0)
40 |     #print(np.max(bytes[0:, 23])) # Protocol field: if value = 6 then TCP, if value = 17 then UDP
41 | bytes_no_checksum = []
42 | for j, b in enumerate(bytes):
43 | if b[23] == 6:
44 | # TCP
45 | # if bytenumber in (50, 51):
46 | # return "Checksum (TCP header)"
47 | for i in range(nrheaders):
48 | b[i * 54 + 50] = 0
49 | b[i * 54 + 51] = 0
50 | b[i * 54 + 24] = 0
51 | b[i * 54 + 25] = 0
52 | elif b[23] == 17:
53 | # UDP
54 | # if bytenumber in (40,41)
55 | # return "UDP Checksum (UDP Header)"
56 | for i in range(nrheaders):
57 | b[i * 42 + 40] = 0
58 | b[i * 42 + 41] = 0
59 | b[i * 42 + 24] = 0
60 | b[i * 42 + 25] = 0
61 | else:
62 |             print("Byte was not 6 nor 17 but: %d" % b[23])
63 |
64 | bytes_no_checksum.append(b)
65 | new_data = {'bytes': bytes_no_checksum, 'label': df['label'].values}
66 | new_df = pd.DataFrame(new_data)
67 | # print(df)
68 | # print(new_df)
69 | save_dir = read_dir + "no_checksum/" + headers
70 | # if not os.path.exists(save_dir):
71 | os.makedirs(save_dir, exist_ok=True)
72 | new_df.to_hdf(save_dir +
73 | "extracted_{0}-no_checksum".format(nrheaders) + '.h5',
74 | key='extracted_{0}'.format(nrheaders), mode='w')
75 |
76 |
77 | nr = [1,2,4,8,16]
78 | for n in nr:
79 | remove_checksum(n)
80 |
81 |
82 | #sns.distplot(bytes[0:,370], kde=False, rug=True)
83 |
84 | '''
85 | for index, m in enumerate(mean):
86 | if m > 0 and min[index] != max[index]:
87 | print(index, min[index], max[index], mean[index])
88 | sns.distplot(bytes[0:,index], kde=False, rug=True)
89 | plt.show()
90 | '''
91 | #plt.show()
92 | #plt.savefig("dist.png")
93 |
94 | #for i in mean:
95 | # print("%.2f" % i)
96 |
97 |
98 |
--------------------------------------------------------------------------------
/pcap/pcaptools.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from scapy.all import *
3 | import glob
4 | import os
5 | import multiprocessing
6 |
7 | from utils import split_list
8 |
9 |
10 | def pcap_cleaner(dir):
11 | """
12 | This method can be used for cleaning pcap files.
13 | :param dir: Directory containing the pcap files that should be filtered
14 | :return: None
15 | """
16 | for fullname in glob.iglob(dir + '*.pcap'):
17 | dir, filename = os.path.split(fullname)
18 | command = 'tshark -r %s -2 -R "!(eth.dst[0]&1) && !(tcp.port==5901) && ip" -w %s/filtered/%s' % (fullname,dir, filename)
19 | os.system(command)
20 |
21 |
22 | def save_pcap_task(files, save_dir, session_threshold):
23 | """
24 |     This method takes all files in a list (full path names) and uses save_pcap to convert each pcap to h5 format in save_dir.
25 |     :param files: list of full path names of the pcap files to convert
26 |     :param save_dir: directory to write the resulting h5 files to
27 |     :param session_threshold: minimum number of packets a session must contain; passed on to save_pcap
28 | """
29 | for fullname in files:
30 | print('Currently saving file: ', fullname)
31 |
32 | save_pcap(fullname, save_dir, session_threshold)
33 |
34 |
35 | def process_pcap_to_h5(read_dir, save_dir, session_threshold=5000):
36 | """
37 | Use this method to process all pcap files in a directory to a h5 format.
38 | Session threshold is used to filter out all sessions containing fewer packets
39 | :param save_dir:
40 | :param read_dir: Directory containing pcap files that should be converted into h5 format
41 | :param session_threshold: Threshold to filter out session with less packets
42 | :return: None
43 | """
44 | h5files = []
45 |
46 | for h5 in glob.iglob(save_dir + '*.h5'):
47 | h5files.append(os.path.basename(h5))
48 | # Load all files
49 | files = []
50 | for fullname in glob.iglob(read_dir + '*.pcap'):
51 | filename = os.path.basename(fullname)
52 | h5name = filename +'.h5'
53 | if h5name in h5files:
54 | os.rename(fullname, read_dir + '/processed_pcap/' + filename)
55 | else:
56 | files.append(fullname)
57 |
58 | splits = 4
59 | files_splits = split_list(files, splits)
60 | processes = []
61 | for file_split in files_splits:
62 |         # create a process for each file split
63 | t1 = multiprocessing.Process(target=save_pcap_task, args=(file_split, save_dir, session_threshold))
64 | print("Starting process", t1)
65 | processes.append(t1)
66 | t1.start()
67 |
68 | for process in processes:
69 | process.join()
70 | print("Process joined", process)
71 |
72 | def save_pcap(fullname, save_dir, session_threshold=0):
73 | """
74 |     This method reads a pcap file and saves it as an h5 dataframe.
75 |     The file is overwritten if it already exists.
76 |     :param fullname: The full path of the pcap file
77 |     :param save_dir: The directory to write the h5 file to
78 | :return: Nothing
79 | """
80 | dir_n, filename = os.path.split(fullname)
81 | df = read_pcap(fullname, filename, session_threshold)
82 | key = filename.split('-')[0]
83 | df.to_hdf(save_dir + filename + '.h5', key=key, mode='w')
84 |
85 |
86 | def read_pcap(fullname, filename, session_threshold=0):
87 | """
88 |     This method will extract the packets of the major sessions within the pcap file. It will label the packets
89 |     according to the filename.
90 |     The method excludes packets between local/internal IP addresses (ip.src and ip.dst both starting with 10.),
91 |     as these occur when using VNC and similar tools.
92 |     The major sessions are found by counting the packets of each session: every session with fewer packets than
93 |     session_threshold is skipped, and the packets of the remaining sessions are placed in the dataframe.
94 |
95 |     :param fullname: The full path of the pcap file
96 | :param filename: The name of the pcap file. It is expected to contain the label of the data before the first - char
97 | :return: A dataframe containing the extracted packets.
98 | """
99 | time_s = time.clock()
100 | label = filename.split('-')[0]
101 | print("Read PCAP, label is %s" % label)
102 | if not filename.endswith('.pcap'):
103 | filename += '.pcap'
104 | data = rdpcap(fullname)
105 | # Workaround/speedup for pandas append to dataframe
106 | frametimes =[]
107 | dsts = []
108 | srcs = []
109 | protocols = []
110 | dports = []
111 | sports = []
112 | bytes = []
113 | labels = []
114 |
115 | time_r = time.clock()
116 | time_read = time_r-time_s
117 | print("Time to read PCAP: "+ str(time_read))
118 | sessions = data.sessions(session_extractor=session_extractor)
119 | for id, session in sessions.items():
120 | if session_threshold is not None:
121 | if len(session) < session_threshold:
122 | continue
123 | for packet in session:
124 |             # Check that the packet is transferred by either UDP or TCP and ensure that it is not a packet between two local/internal IP addresses (occurs when using VNC and such)
125 | if IP in packet and (UDP in packet or TCP in packet) and not (packet[IP].dst.startswith('10.') and packet[IP].src.startswith('10.')):
126 | ip_layer = packet[IP]
127 | transport_layer = ip_layer.payload
128 | frametimes.append(packet.time)
129 | dsts.append(ip_layer.dst)
130 | srcs.append(ip_layer.src)
131 | protocols.append(transport_layer.name)
132 | dports.append(transport_layer.dport)
133 | sports.append(transport_layer.sport)
134 | # Save the raw byte string
135 | raw_payload = raw(packet)
136 | bytes.append(raw_payload)
137 | labels.append(label)
138 | time_t = time.clock()
139 | print("Time spend: %ds" % (time_t-time_r))
140 | d = {'time': frametimes,
141 | 'ip.dst': dsts,
142 | 'ip.src': srcs,
143 | 'protocol': protocols,
144 | 'port.dst': dports,
145 | 'port.src': sports,
146 | 'bytes': bytes,
147 | 'label': labels}
148 | df = pd.DataFrame(data=d)
149 | time_e = time.clock()
150 | total_time = time_e - time_s
151 | print("Time to convert PCAP to dataframe: " + str(total_time))
152 | return df
153 |
154 |
155 | def session_extractor(p):
156 | """
157 |     Custom session extractor used by scapy to group bidirectional sessions instead of unidirectional flows.
158 |     :param p: packet as used by scapy
159 |     :return: session string to use as a key in the sessions dict
160 | """
161 | sess = "Other"
162 | if 'Ether' in p:
163 | if 'IP' in p:
164 | src = p[IP].src
165 | dst = p[IP].dst
166 | if NTP in p:
167 | if src.startswith('10.') or src.startswith('192.168.'):
168 | sess = p.sprintf("NTP %IP.src%:%r,UDP.sport% > %IP.dst%:%r,UDP.dport%")
169 | elif dst.startswith('10.') or dst.startswith('192.168.'):
170 | sess = p.sprintf("NTP %IP.dst%:%r,UDP.dport% > %IP.src%:%r,UDP.sport%")
171 | elif 'TCP' in p:
172 | if src.startswith('10.') or src.startswith('192.168.'):
173 | sess = p.sprintf("TCP %IP.src%:%r,TCP.sport% > %IP.dst%:%r,TCP.dport%")
174 | elif dst.startswith('10.') or dst.startswith('192.168.'):
175 | sess = p.sprintf("TCP %IP.dst%:%r,TCP.dport% > %IP.src%:%r,TCP.sport%")
176 | elif 'UDP' in p:
177 | if src.startswith('10.') or src.startswith('192.168.'):
178 | sess = p.sprintf("UDP %IP.src%:%r,UDP.sport% > %IP.dst%:%r,UDP.dport%")
179 | elif dst.startswith('10.') or dst.startswith('192.168.'):
180 | sess = p.sprintf("UDP %IP.dst%:%r,UDP.dport% > %IP.src%:%r,UDP.sport%")
181 | elif 'ICMP' in p:
182 | sess = p.sprintf("ICMP %IP.src% > %IP.dst% type=%r,ICMP.type% code=%r,ICMP.code% id=%ICMP.id%")
183 | else:
184 | sess = p.sprintf("IP %IP.src% > %IP.dst% proto=%IP.proto%")
185 | elif 'ARP' in p:
186 | sess = p.sprintf("ARP %ARP.psrc% > %ARP.pdst%")
187 | else:
188 | sess = p.sprintf("Ethernet type=%04xr,Ether.type%")
189 | return sess
--------------------------------------------------------------------------------
/tf/confusionmatrix.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib
3 |
4 |
5 | class ConfusionMatrix:
6 | """
7 | Simple confusion matrix class
8 | row is the true class, column is the predicted class
9 | """
10 | def __init__(self, num_classes, class_names=None):
11 | self.n_classes = num_classes
12 | if class_names is None:
13 |             self.class_names = list(map(str, range(num_classes)))
14 | else:
15 | self.class_names = class_names
16 |
17 | # find max class_name and pad
18 | max_len = max(map(len, self.class_names))
19 | self.max_len = max_len
20 | for idx, name in enumerate(self.class_names):
21 |             if len(name) < max_len:
22 | self.class_names[idx] = name + " "*(max_len-len(name))
23 |
24 | self.mat = np.zeros((num_classes, num_classes), dtype='int')
25 |
26 | def __str__(self):
27 |         # calculate row and column sums
28 | col_sum = np.sum(self.mat, axis=1)
29 | row_sum = np.sum(self.mat, axis=0)
30 |
31 | s = []
32 |
33 | mat_str = self.mat.__str__()
34 | mat_str = mat_str.replace('[','').replace(']','').split('\n')
35 |
36 | for idx, row in enumerate(mat_str):
37 | if idx == 0:
38 | pad = " "
39 | else:
40 | pad = ""
41 | class_name = self.class_names[idx]
42 | class_name = " {:10s}".format(class_name)
43 | class_name += " |"
44 | row_str = class_name + pad + row
45 | row_str += " |" + str(col_sum[idx])
46 | s.append(row_str)
47 |
48 | row_sum = [(self.max_len+7)*" "+" ".join(map(str, row_sum))]
49 | hline = [(1+self.max_len)*" "+"-"*len(row_sum[0])]
50 |
51 | s = hline + s + hline + row_sum
52 |
53 | # add linebreaks
54 | s_out = [line+'\n' for line in s]
55 | return "".join(s_out)
56 |
57 | def batch_add(self, targets, preds):
58 | assert targets.shape == preds.shape
59 | assert len(targets) == len(preds)
60 | assert max(targets) < self.n_classes
61 | assert max(preds) < self.n_classes
62 | targets = targets.flatten()
63 | preds = preds.flatten()
64 | for i in range(len(targets)):
65 | self.mat[targets[i], preds[i]] += 1
66 |
67 | def get_errors(self):
68 | tp = np.asarray(np.diag(self.mat).flatten(), dtype='float')
69 | fn = np.asarray(np.sum(self.mat, axis=1).flatten(), dtype='float') - tp
70 | fp = np.asarray(np.sum(self.mat, axis=0).flatten(), dtype='float') - tp
71 | tn = np.asarray(np.sum(self.mat)*np.ones(self.n_classes).flatten(),
72 | dtype='float') - tp - fn - fp
73 |         return tp, tn, fp, fn
74 |
75 | def accuracy(self):
76 | """
77 | Calculates global accuracy
78 | :return: accuracy
79 | :example: >>> conf = ConfusionMatrix(3)
80 |                   >>> conf.batch_add(np.array([0, 0, 1]), np.array([0, 0, 2]))
81 |                   >>> print(conf.accuracy())
82 | """
83 | tp, _, _, _ = self.get_errors()
84 | n_samples = np.sum(self.mat)
85 | return np.sum(tp) / n_samples
86 |
87 | def sensitivity(self):
88 | tp, tn, fp, fn = self.get_errors()
89 | res = tp / (tp + fn)
90 | res = res[~np.isnan(res)]
91 | return res
92 |
93 | def specificity(self):
94 | tp, tn, fp, fn = self.get_errors()
95 | res = tn / (tn + fp)
96 | res = res[~np.isnan(res)]
97 | return res
98 |
99 | def positive_predictive_value(self):
100 | tp, tn, fp, fn = self.get_errors()
101 | res = tp / (tp + fp)
102 | res = res[~np.isnan(res)]
103 | return res
104 |
105 | def negative_predictive_value(self):
106 | tp, tn, fp, fn = self.get_errors()
107 | res = tn / (tn + fn)
108 | res = res[~np.isnan(res)]
109 | return res
110 |
111 | def false_positive_rate(self):
112 | tp, tn, fp, fn = self.get_errors()
113 | res = fp / (fp + tn)
114 | res = res[~np.isnan(res)]
115 | return res
116 |
117 | def false_discovery_rate(self):
118 | tp, tn, fp, fn = self.get_errors()
119 | res = fp / (tp + fp)
120 | res = res[~np.isnan(res)]
121 | return res
122 |
123 | def F1(self):
124 | tp, tn, fp, fn = self.get_errors()
125 | res = (2*tp) / (2*tp + fp + fn)
126 | res = res[~np.isnan(res)]
127 | return res
128 |
129 | def matthews_correlation(self):
130 | tp, tn, fp, fn = self.get_errors()
131 | numerator = tp*tn - fp*fn
132 | denominator = np.sqrt((tp + fp)*(tp + fn)*(tn + fp)*(tn + fn))
133 | res = numerator / denominator
134 | res = res[~np.isnan(res)]
135 | return res
--------------------------------------------------------------------------------
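A short usage sketch for the ConfusionMatrix class above; the class names, targets and predictions are made up for illustration, and the import assumes the snippet is run from inside the tf/ folder.

```python
import numpy as np
from confusionmatrix import ConfusionMatrix

cm = ConfusionMatrix(num_classes=3, class_names=['http', 'https', 'netflix'])
targets = np.array([0, 0, 1, 2, 2, 2])   # true classes
preds = np.array([0, 1, 1, 2, 2, 0])     # predicted classes
cm.batch_add(targets, preds)

print(cm)                  # pretty-printed matrix with row/column sums
print(cm.accuracy())       # global accuracy, here 4/6
print(cm.sensitivity())    # per-class recall
```
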
/tf/dataset.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from tensorflow.python.framework import dtypes
3 | from tensorflow.python.framework import random_seed
4 | from tensorflow.contrib.learn.python.learn.datasets import base
5 | import glob
6 | import os
7 | import utils
8 | import pandas as pd
9 | from sklearn.preprocessing import LabelEncoder
10 |
11 | _label_encoder = LabelEncoder()
12 |
13 |
14 | class DataSet(object):
15 |
16 | def __init__(self,
17 | payloads,
18 | labels,
19 | dtype=dtypes.float32,
20 | seed=None):
21 | """Construct a DataSet.
22 |         `dtype` can be either `uint8` to leave the input as `[0, 255]`,
23 |         or `float32` to rescale into `[0, 1]`. The seed arg provides for
24 |         convenient deterministic testing.
25 | """
26 | seed1, seed2 = random_seed.get_seed(seed)
27 | # If op level seed is not set, use whatever graph level seed is returned
28 | np.random.seed(seed1 if seed is None else seed2)
29 | dtype = dtypes.as_dtype(dtype).base_dtype
30 | if dtype not in (dtypes.uint8, dtypes.float32):
31 | raise TypeError('Invalid payload dtype %r, expected uint8 or float32' %
32 | dtype)
33 |
34 | assert payloads.shape[0] == labels.shape[0], (
35 | 'payloads.shape: %s labels.shape: %s' % (payloads.shape, labels.shape))
36 | self._num_examples = payloads.shape[0]
37 |
38 | if dtype == dtypes.float32:
39 | # Convert from [0, 255] -> [0.0, 1.0].
40 | payloads = payloads.astype(np.float32)
41 | payloads = np.multiply(payloads, 1.0 / 255.0)
42 |
43 | self._payloads = payloads
44 | self._labels = labels
45 | self._epochs_completed = 0
46 | self._index_in_epoch = 0
47 |
48 | @property
49 | def payloads(self):
50 | return self._payloads
51 |
52 | @property
53 | def labels(self):
54 | return self._labels
55 |
56 | @property
57 | def num_examples(self):
58 | return self._num_examples
59 |
60 | @property
61 | def epochs_completed(self):
62 | return self._epochs_completed
63 |
64 | def next_batch(self, batch_size, shuffle=True):
65 | """Return the next `batch_size` examples from this data set."""
66 | start = self._index_in_epoch
67 | # Shuffle for the first epoch
68 |
69 | if self._epochs_completed == 0 and start == 0 and shuffle:
70 | perm0 = np.arange(self._num_examples)
71 | np.random.shuffle(perm0)
72 | self._payloads = self.payloads[perm0]
73 | self._labels = self.labels[perm0]
74 | # Go to the next epoch
75 | if start + batch_size > self._num_examples:
76 | # Finished epoch
77 | self._epochs_completed += 1
78 | # Get the rest examples in this epoch
79 | rest_num_examples = self._num_examples - start
80 | payloads_rest_part = self._payloads[start:self._num_examples]
81 | labels_rest_part = self._labels[start:self._num_examples]
82 | # Shuffle the data
83 | if shuffle:
84 | perm = np.arange(self._num_examples)
85 | np.random.shuffle(perm)
86 | self._payloads = self.payloads[perm]
87 | self._labels = self.labels[perm]
88 | # Start next epoch
89 | start = 0
90 | self._index_in_epoch = batch_size - rest_num_examples
91 | end = self._index_in_epoch
92 |             payloads_new_part = self._payloads[start:end]
93 | labels_new_part = self._labels[start:end]
94 |             return np.concatenate((payloads_rest_part, payloads_new_part), axis=0), np.concatenate(
95 | (labels_rest_part, labels_new_part), axis=0)
96 | else:
97 | self._index_in_epoch += batch_size
98 | end = self._index_in_epoch
99 | return self._payloads[start:end], self._labels[start:end]
100 |
101 |
102 | def dense_to_one_hot(labels_dense, num_classes):
103 | """Convert class labels from scalars to one-hot vectors."""
104 | num_labels = labels_dense.shape[0]
105 | index_offset = np.arange(num_labels) * num_classes
106 | labels_one_hot = np.zeros((num_labels, num_classes), dtype=np.int8)
107 | labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
108 | return labels_one_hot
109 |
110 |
111 | def extract_labels(dataframe, one_hot=False, num_classes=10):
112 | """Extract the labels into a 1D uint8 numpy array [index].
113 |
114 | Args:
115 | dataframe: A pandas dataframe object.
116 | one_hot: Does one hot encoding for the result.
117 | num_classes: Number of classes for the one hot encoding.
118 |
119 | Returns:
120 | labels: a 1D uint8 numpy array.
121 | """
122 | print('Extracting labels', )
123 | labels = dataframe['label'].values
124 | labels = _label_encoder.fit_transform(labels)
125 | if one_hot:
126 | return dense_to_one_hot(labels, num_classes)
127 | return labels
128 |
129 |
130 | def read_data_sets(train_dirs=[], test_dirs=None,
131 | merge_data=True,
132 | one_hot=False,
133 | dtype=dtypes.float32,
134 | validation_size=0.2,
135 | test_size=0.2,
136 | seed=None,
137 | balance_classes=False,
138 | payload_length=810):
139 | trainframes = []
140 | testframes = []
141 | for train_dir in train_dirs:
142 | for fullname in glob.iglob(train_dir + '*.h5'):
143 | filename = os.path.basename(fullname)
144 | df = utils.load_h5(train_dir, filename)
145 | trainframes.append(df)
146 | # create one large dataframe
147 | train_data = pd.concat(trainframes)
148 | if test_dirs != train_dirs:
149 | for test_dir in test_dirs:
150 | for fullname in glob.iglob(test_dir + '*.h5'):
151 | filename = os.path.basename(fullname)
152 | df = utils.load_h5(test_dir, filename)
153 | testframes.append(df)
154 | test_data = pd.concat(testframes)
155 | else:
156 | test_data = pd.DataFrame()
157 |
158 | if merge_data:
159 | train_data = pd.concat([test_data, train_data])
160 |
161 | num_classes = len(train_data['label'].unique())
162 |
163 | if balance_classes:
164 | values, counts = np.unique(train_data['label'], return_counts=True)
165 | smallest_class = np.argmin(counts)
166 | amount = counts[smallest_class]
167 | new_data = []
168 | for v in values:
169 | sample = train_data.loc[train_data['label'] == v].sample(n=amount)
170 | new_data.append(sample)
171 | train_data = new_data
172 | train_data = pd.concat(train_data)
173 |
174 |
175 | # shuffle the dataframe and reset the index
176 | train_data = train_data.sample(frac=1, random_state=seed).reset_index(drop=True)
177 | #
178 | # youtube_selector = train_data['label'] == 'youtube'
179 | # youtube_data = train_data[youtube_selector]
180 | # for index, row in youtube_data.iterrows():
181 | # bytes = row[0]
182 | # if bytes[23] == 17.0:
183 | # train_data.loc[index, 'label'] = 'youtube_udp'
184 | # else:
185 | # train_data.loc[index, 'label'] = 'youtube_tcp'
186 |
187 | if test_dirs != train_dirs:
188 | test_data = test_data.sample(frac=1, random_state=seed).reset_index(drop=True)
189 | test_labels = extract_labels(test_data, one_hot=one_hot, num_classes=num_classes)
190 | test_payloads = test_data['bytes'].values
191 | test_payloads = utils.pad_arrays_with_zero(test_payloads, payload_length=payload_length)
192 | train_labels = extract_labels(train_data, one_hot=one_hot, num_classes=num_classes)
193 | train_payloads = train_data['bytes'].values
194 | # pad with zero up to payload_length length
195 | train_payloads = utils.pad_arrays_with_zero(train_payloads, payload_length=payload_length)
196 |
197 |     # TODO make a separate test set once ready
198 | total_length = len(train_payloads)
199 | validation_amount = int(total_length * validation_size)
200 | if merge_data:
201 | test_amount = int(total_length * test_size)
202 | test_payloads = train_payloads[:test_amount]
203 | test_labels = train_labels[:test_amount]
204 | val_payloads = train_payloads[test_amount:(validation_amount + test_amount)]
205 | val_labels = train_labels[test_amount:(validation_amount + test_amount)]
206 | train_payloads = train_payloads[(validation_amount + test_amount):]
207 | train_labels = train_labels[(validation_amount + test_amount):]
208 | else:
209 | val_payloads = train_payloads[:validation_amount]
210 | val_labels = train_labels[:validation_amount]
211 | train_payloads = train_payloads[validation_amount:]
212 | train_labels = train_labels[validation_amount:]
213 |
214 | options = dict(dtype=dtype, seed=seed)
215 | print("Training set size: {0}".format(len(train_payloads)))
216 | print("Validation set size: {0}".format(len(val_payloads)))
217 | print("Test set size: {0}".format(len(test_payloads)))
218 | train = DataSet(train_payloads, train_labels, **options)
219 | validation = DataSet(val_payloads, val_labels, **options)
220 | test = DataSet(test_payloads, test_labels, **options)
221 |
222 | return base.Datasets(train=train, validation=validation, test=test)
223 |
--------------------------------------------------------------------------------
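A sketch of how read_data_sets and DataSet.next_batch above might be driven; the real training scripts live in the train/ folder and are not part of this export. The directory path is the one written by extract_headers.py, payload_length is 8 concatenated 54-byte headers, and the import assumes the snippet is run from inside the tf/ folder with the repository root on the path (for utils).

```python
from dataset import read_data_sets

data = read_data_sets(train_dirs=['/home/mclrn/Data/h5/8/'],
                      test_dirs=['/home/mclrn/Data/h5/8/'],
                      merge_data=True,
                      one_hot=True,
                      validation_size=0.2,
                      test_size=0.2,
                      payload_length=8 * 54)

# Mini-batches are drawn from the shuffled training split.
batch_payloads, batch_labels = data.train.next_batch(batch_size=64)
print(batch_payloads.shape, batch_labels.shape)
```
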
/tf/early_stopping.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 |
4 | class EarlyStopping:
5 | """Stop training when a monitored quantity has stopped improving.
6 | Arguments:
7 | min_delta: minimum change in the monitored quantity
8 | to qualify as an improvement, i.e. an absolute
9 | change of less than min_delta, will count as no
10 | improvement.
11 | patience: number of epochs with no improvement
12 | after which training will be stopped.
13 | """
14 |
15 | def __init__(self,
16 | min_delta=0,
17 | patience=0):
18 | self.stop_training = False
19 | self.patience = patience
20 | self.min_delta = min_delta
21 | self.wait = 0
22 | self.stopped_epoch = 0
23 |
24 | self.monitor_op = np.less
25 | self.min_delta *= -1
26 | self.best = np.Inf
27 |
28 | def on_train_begin(self):
29 | # Allow instances to be re-used
30 | self.wait = 0
31 | self.stopped_epoch = 0
32 | self.best = np.Inf
33 | self.stop_training = False
34 |
35 | def on_epoch_end(self, epoch, current):
36 | if self.monitor_op(current - self.min_delta, self.best):
37 | self.best = current
38 | self.wait = 0
39 | else:
40 | if self.wait >= self.patience:
41 | self.stopped_epoch = epoch
42 | self.stop_training = True
43 | self.wait += 1
44 |
45 | def on_train_end(self):
46 | if self.stopped_epoch > 0:
47 | print('Epoch {}: early stopping'.format(self.stopped_epoch))
48 |
--------------------------------------------------------------------------------
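A brief sketch of how the EarlyStopping helper above is meant to be driven from a training loop; the validation losses are made-up numbers chosen to trigger the stop.

```python
from early_stopping import EarlyStopping  # assuming the snippet is run from inside the tf/ folder

stopper = EarlyStopping(min_delta=0.001, patience=3)
stopper.on_train_begin()

for epoch, val_loss in enumerate([0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.69]):
    # ... run one epoch of training and compute val_loss here ...
    stopper.on_epoch_end(epoch, val_loss)
    if stopper.stop_training:
        break

stopper.on_train_end()  # prints "Epoch 6: early stopping" for this sequence
```
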
/tf/tf_utils.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 |
3 |
4 | def train_input_fn(features, labels, batch_size):
5 | """An input function for training"""
6 | # Convert the inputs to a Dataset.
7 | dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
8 | # Shuffle, repeat, and batch the examples.
9 | dataset = dataset.shuffle(1000).repeat().batch(batch_size)
10 | # Build the Iterator, and return the read end of the pipeline.
11 | return dataset.make_one_shot_iterator().get_next()
12 |
13 |
14 | def ffn_layer(name, inputs, hidden_units, activation=tf.nn.relu, seed=None):
15 | """Reusable code for making a simple neural net layer.
16 |
17 | It does a matrix multiply, bias add, and then uses relu to nonlinearize.
18 | It also sets up name scoping so that the resultant graph is easy to read,
19 | and adds a number of summary ops.
20 | """
21 | input_dim = inputs.get_shape().as_list()[1]
22 |     # use Xavier/Glorot initializer, as a regular uniform dist did not work
23 | weight_initializer = tf.contrib.layers.xavier_initializer(uniform=True, seed=seed, dtype=tf.float32)
24 | with tf.variable_scope(name):
25 | # This Variable will hold the state of the weights for the layer
26 | with tf.name_scope('weights'):
27 | weights = tf.get_variable('W', [input_dim, hidden_units], initializer=weight_initializer)
28 | variable_summaries(weights)
29 | with tf.name_scope('biases'):
30 | biases = tf.get_variable('b', [hidden_units], initializer=tf.zeros_initializer)
31 | variable_summaries(biases)
32 | with tf.name_scope('Wx_plus_b'):
33 | preactivate = tf.matmul(inputs, weights) + biases
34 | tf.summary.histogram('pre_activations', preactivate)
35 | activations = activation(preactivate, name='activation')
36 | tf.summary.histogram('activations', activations)
37 | return activations
38 | # layer = tf.layers.dense(inputs=inputs, units=hidden_units, activation=activation)
39 | # return layer
40 |
41 |
42 | def conv_layer_1d(name, inputs, num_filters=1, filter_size=(1, 1), strides=1, activation=None):
43 |     """A simple function to create a 1D conv layer"""
44 | with tf.variable_scope(name):
45 | # TensorFlow operation for convolution
46 | layer = tf.layers.conv1d(inputs=inputs,
47 | filters=num_filters,
48 | kernel_size=filter_size,
49 | strides=strides,
50 | padding='same',
51 | activation=activation)
52 | return layer
53 |
54 |
55 | def conv_layer_2d(name, inputs, num_filters=1, filter_size=(1, 1), strides=(1,1), activation=None):
56 |     """A simple function to create a 2D conv layer"""
57 | with tf.variable_scope(name):
58 | # TensorFlow operation for convolution
59 | layer = tf.layers.conv2d(inputs=inputs,
60 | filters=num_filters,
61 | kernel_size=filter_size,
62 | strides=strides,
63 | padding='same',
64 | activation=activation)
65 | return layer
66 |
67 |
68 | def max_pool_layer(inputs, name, pool_size=(1, 1), strides=(1, 1), padding='same'):
69 |     """A simple function to create a 2D max pool layer"""
70 | with tf.variable_scope(name):
71 | # TensorFlow operation for max pooling
72 | layer = tf.layers.max_pooling2d(inputs=inputs,
73 | pool_size=pool_size,
74 | strides=strides,
75 | padding=padding)
76 | return layer
77 |
78 | def max_pool_layer1d(inputs, name, pool_size=(1, 1), strides=(1, 1), padding='same'):
79 |     """A simple function to create a 1D max pool layer"""
80 | with tf.variable_scope(name):
81 | # TensorFlow operation for max pooling
82 | layer = tf.layers.max_pooling1d(inputs=inputs,
83 | pool_size=pool_size,
84 | strides=strides,
85 | padding=padding)
86 | return layer
87 | def dropout(inputs, keep_prob=1.0):
88 | with tf.name_scope('dropout'):
89 | # keep_prob = tf.placeholder(tf.float32)
90 | tf.summary.scalar('dropout_keep_probability', keep_prob)
91 | return tf.nn.dropout(inputs, keep_prob)
92 |
93 |
94 | def variable_summaries(var):
95 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
96 | with tf.name_scope('summaries'):
97 | mean = tf.reduce_mean(var)
98 | tf.summary.scalar('mean', mean)
99 | with tf.name_scope('stddev'):
100 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
101 | tf.summary.scalar('stddev', stddev)
102 | tf.summary.scalar('max', tf.reduce_max(var))
103 | tf.summary.scalar('min', tf.reduce_min(var))
104 | tf.summary.histogram('histogram', var)
105 |
106 | #
107 | # mnist_data = input_data.read_data_sets('MNIST_data',
108 | # one_hot=True, # Convert the labels into one hot encoding
109 | # dtype='float32', # rescale images to `[0, 1]`
110 | # reshape=False, # Don't flatten the images to vectors
111 | # )
112 |
113 | # # Simple MNIST test of the layers
114 | #
115 | # tf.reset_default_graph()
116 | # num_classes = 10
117 | # cm = conf.ConfusionMatrix(num_classes)
118 | # height, width, nchannels = 28, 28, 1
119 | # gpu_opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.45)
120 | # filters_1 = 16
121 | # kernel_size_1 = (5,5)
122 | # pool_size_1 = (2,2)
123 | # x_pl = tf.placeholder(tf.float32, [None, height, width, nchannels], name='xPlaceholder')
124 | # y_pl = tf.placeholder(tf.float64, [None, num_classes], name='yPlaceholder')
125 | # y_pl = tf.cast(y_pl, tf.float32)
126 | #
127 | # x = conv_layer_2d('layer1', x_pl, filters_1, kernel_size_1, activation=tf.nn.relu)
128 | # x = max_pool_layer(x, 'max_pool', pool_size = pool_size_1, strides=pool_size_1)
129 | # x = tf.contrib.layers.flatten(x)
130 | #
131 | # y = ffn_layer('output_layer', x, hidden_units=num_classes, activation=tf.nn.softmax)
132 | #
133 | # with tf.variable_scope('loss'):
134 | # # computing cross entropy per sample
135 | # cross_entropy = -tf.reduce_sum(y_pl * tf.log(y + 1e-8), reduction_indices=[1])
136 | #
137 | # # averaging over samples
138 | # cross_entropy = tf.reduce_mean(cross_entropy)
139 | #
140 | # with tf.variable_scope('training'):
141 | # # defining our optimizer
142 | # optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
143 | #
144 | # # applying the gradients
145 | # train_op = optimizer.minimize(cross_entropy)
146 | #
147 | # with tf.variable_scope('performance'):
148 | # # making a one-hot encoded vector of correct (1) and incorrect (0) predictions
149 | # correct_prediction = tf.equal(tf.argmax(y, axis=1), tf.argmax(y_pl, axis=1))
150 | #
151 | # # averaging the one-hot encoded vector
152 | # accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
153 | #
154 | #
155 | # # Training Loop
156 | # batch_size = 100
157 | # max_epochs = 10
158 | #
159 | # valid_loss, valid_accuracy = [], []
160 | # train_loss, train_accuracy = [], []
161 | # test_loss, test_accuracy = [], []
162 | #
163 | # with tf.Session() as sess:
164 | # sess.run(tf.global_variables_initializer())
165 | # print('Begin training loop')
166 | #
167 | # try:
168 | # while mnist_data.train.epochs_completed < max_epochs:
169 | # _train_loss, _train_accuracy = [], []
170 | #
171 | # ## Run train op
172 | # x_batch, y_batch = mnist_data.train.next_batch(batch_size)
173 | # fetches_train = [train_op, cross_entropy, accuracy]
174 | # feed_dict_train = {x_pl: x_batch, y_pl: y_batch}
175 | # _, _loss, _acc = sess.run(fetches_train, feed_dict_train)
176 | #
177 | # _train_loss.append(_loss)
178 | # _train_accuracy.append(_acc)
179 | #
180 | # ## Compute validation loss and accuracy
181 | # if mnist_data.train.epochs_completed % 1 == 0 \
182 | # and mnist_data.train._index_in_epoch <= batch_size:
183 | # train_loss.append(np.mean(_train_loss))
184 | # train_accuracy.append(np.mean(_train_accuracy))
185 | #
186 | # fetches_valid = [cross_entropy, accuracy]
187 | #
188 | # feed_dict_valid = {x_pl: mnist_data.validation.images, y_pl: mnist_data.validation.labels}
189 | # _loss, _acc = sess.run(fetches_valid, feed_dict_valid)
190 | #
191 | # valid_loss.append(_loss)
192 | # valid_accuracy.append(_acc)
193 | # print(
194 | # "Epoch {} : Train Loss {:6.3f}, Train acc {:6.3f}, Valid loss {:6.3f}, Valid acc {:6.3f}".format(
195 | # mnist_data.train.epochs_completed, train_loss[-1], train_accuracy[-1], valid_loss[-1],
196 | # valid_accuracy[-1]))
197 | #
198 | # test_epoch = mnist_data.test.epochs_completed
199 | # while mnist_data.test.epochs_completed == test_epoch:
200 | # x_batch, y_batch = mnist_data.test.next_batch(batch_size)
201 | # feed_dict_test = {x_pl: x_batch, y_pl: y_batch}
202 | # _loss, _acc = sess.run(fetches_valid, feed_dict_test)
203 | # y_preds = sess.run(fetches=y, feed_dict=feed_dict_test)
204 | # y_preds = tf.argmax(y_preds, axis=1).eval()
205 | # y_true = tf.argmax(y_batch, axis=1).eval()
206 | # cm.batch_add(y_true,y_preds)
207 | # test_loss.append(_loss)
208 | # test_accuracy.append(_acc)
209 | # print('Test Loss {:6.3f}, Test acc {:6.3f}'.format(
210 | # np.mean(test_loss), np.mean(test_accuracy)))
211 | #
212 | #
213 | # except KeyboardInterrupt:
214 | # pass
215 | #
216 | #
217 | # print(cm.accuracy())
218 | # epoch = np.arange(len(train_loss))
219 | # plt.figure()
220 | # plt.plot(epoch, train_accuracy,'r', epoch, valid_accuracy,'b')
221 | # plt.legend(['Train Acc','Val Acc'], loc=4)
222 | # plt.xlabel('Epochs'), plt.ylabel('Acc'), plt.ylim([0.75,1.03])
223 | # plt.show()
--------------------------------------------------------------------------------
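Note on tf_utils.py: the functions above are thin wrappers around the TF 1.x layers API and are meant to be composed into a graph, as the commented-out MNIST block illustrates. A minimal sketch of such a composition, assuming the call signatures used in that block and in train/train_header.py (conv_layer_2d and ffn_layer are defined earlier in the file):

    import tensorflow as tf
    from tf import tf_utils as tfu

    tf.reset_default_graph()
    x_pl = tf.placeholder(tf.float32, [None, 28, 28, 1], name='xPlaceholder')
    h = tfu.conv_layer_2d('conv1', x_pl, 16, (5, 5), activation=tf.nn.relu)  # 16 filters of size 5x5
    h = tfu.max_pool_layer(h, 'pool1', pool_size=(2, 2), strides=(2, 2))     # downsample by 2
    h = tf.contrib.layers.flatten(h)
    h = tfu.dropout(h, keep_prob=0.5)                                        # dropout regularization
    y = tfu.ffn_layer('output_layer', h, hidden_units=10, activation=tf.nn.softmax)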
/trafficgen/PyTgen/config.py:
--------------------------------------------------------------------------------
1 | '''
2 | Config file adaptation/modification of the PyTgen generator config.py
3 | '''
4 |
5 | import logging
6 |
7 |
8 | #
9 | # This is a default configuration for the classification of encrypted traffic project
10 | #
11 | class Conf(object):
12 | # maximum number of worker threads that can be used to execute the jobs.
13 | # the program will start using 3 threads and spawn new ones if needed.
14 | # this setting depends on the number of jobs that have to be executed
15 | # simultaneously (not the number of jobs given in the config file).
16 | maxthreads = 15
17 |
18 | # set to "logging.INFO" or "logging.DEBUG"
19 | loglevel = logging.DEBUG
20 |
21 | # ssh commands that will be randomly executed by the ssh traffic generator
22 | ssh_commands = ['ls', 'cd', 'cd /etc', 'ps ax', 'date', 'mount', 'free', 'vmstat',
23 | 'touch /tmp/tmpfile', 'rm /tmp/tmpfile', 'ls /tmp/tmpfile',
24 | 'tail /etc/hosts', 'tail /etc/passwd', 'tail /etc/fstab',
25 | 'cat /var/log/messages', 'cat /etc/group', 'cat /etc/mtab']
26 |
27 | # urls the http generator will randomly fetch from
28 | https_urls = ['https://www.dr.dk/', 'https://da.wikipedia.org/wiki/Forside',
29 | 'https://en.wikipedia.org/wiki/Main_Page', 'https://www.dk-hostmaster.dk/',
30 | 'https://www.cph.dk/', 'https://translate.google.com/', 'https://www.borger.dk/',
31 | 'https://www.sdu.dk/da/', 'https://www.sundhed.dk/', 'https://www.facebook.com/',
32 | 'https://www.ug.dk/', 'https://erhvervsstyrelsen.dk/', 'https://www.nets.eu/dk-da',
33 | 'https://www.jobindex.dk/', 'https://www.rejseplanen.dk/webapp/index.html', 'https://yousee.dk/',
34 | 'https://www.sparnord.dk/', 'https://gigahost.dk/', 'https://www.information.dk/',
35 | 'https://stps.dk/', 'https://www.skat.dk/', 'https://danskebank.dk/privat', 'https://www.sst.dk/']
36 |
37 | http_urls = ['http://naturstyrelsen.dk/', 'http://www.valutakurser.dk/', 'http://ordnet.dk/ddo/forside',
38 | 'http://www.speedtest.net/', 'http://bygningsreglementet.dk/', 'http://www.ft.dk/', 'http://tv2.dk/',
39 | 'http://www.kl.dk/', 'http://www.symbiosis.dk/', 'http://www.noegletal.dk/',
40 | 'http://novonordiskfonden.dk/da', 'http://frida.fooddata.dk/',
41 | 'http://www.arbejdsmiljoforskning.dk/da', 'http://www.su.dk/', 'http://www.trafikstyrelsen.dk/da.aspx',
42 | 'http://www.regioner.dk/', 'http://www.geus.dk/UK/Pages/default.aspx', 'http://bm.dk/',
43 | 'http://www.m.dk/#!/', 'http://www.regionsjaelland.dk/Sider/default.aspx',
44 | 'http://www.trafikstyrelsen.dk/da.aspx']
45 | # http_intern = ['http://web.intern.ndsec']
46 |
47 | # a number of files that will randomly be used for ftp upload
48 | ftp_put = ['~/files/file%s' % i for i in range(0, 9)]
49 |
50 | # a number of files that will randomly be used for ftp download
51 | ftp_get = ['~/files/file%s' % i for i in range(0, 9)]
52 |
53 | # array of source-destination tuples for sftp upload
54 | sftp_put = [('~/files/file%s' % i, '/tmp/file%s' % i) for i in range(0, 9)]
55 |
56 | # array of source-destination tuples for sftp download
57 | sftp_get = [('/media/share/files/file%s' % i, '~/files/tmp/file%s' % i) for i in range(0, 9)]
58 |
59 | # significant part of the shell prompt to be able to recognize
60 | # the end of a telnet data transmission
61 | telnet_prompt = "$ "
62 |
63 | # job configuration (see config.example.py)
64 | jobdef = [
65 | # http (intern)
66 | # ('http_gen', [(start_hour, start_min), (end_hour, end_min), (interval_min, interval_sec)], [urls, retry, sleep_multiply])
67 | # ('http_gen', [(9, 0), (16, 30), (60, 0)], [http_intern, 2, 30]),
68 | # ('http_gen', [(9, 55), (9, 30), (5, 0)], [http_intern, 5, 20]),
69 | # ('http_gen', [(12, 0), (12, 30), (2, 0)], [http_intern, 6, 10]),
70 | # ('http_gen', [(10, 50), (12, 0), (10, 0)], [http_intern, 2, 10]),
71 | # ('http_gen', [(15, 0), (17, 30), (30, 0)], [http_intern, 8, 20]),
72 | #
73 | # http (extern)
74 | # ('http_gen', [(12, 0), (12, 30), (5, 0)], [http_extern, 10, 20]),
75 | # ('http_gen', [(9, 0), (17, 0), (30, 0)], [http_extern, 5, 30]),
76 | ('http_gen', [(11, 0), (13, 0), (0, 10)], [http_urls, 1, 5]),
77 | ('http_gen', [(11, 0), (13, 0), (0, 10)], [https_urls, 1, 5]),
78 | # ('http_gen', [(9, 0), (17, 0), (90, 0)], [http_extern, 10, 30]),
79 | # ('http_gen', [(12, 0), (12, 10), (5, 0)], [http_extern, 15, 20]),
80 | #
81 | # smtp
82 | # ('smtp_gen', [(9, 0), (18, 0), (120, 0)], ['mail.extern.ndsec', 'mail2', 'mail', 'mail2@mail.extern.ndsec', 'mail52@mail.extern.ndsec']),
83 | # ('smtp_gen', [(12, 0), (13, 0), (30, 0)], ['mail.extern.ndsec', 'mail20', 'mail', 'mail20@mail.extern.ndsec', 'mail51@mail.extern.ndsec']),
84 | #
85 | # ftp
86 | # ('ftp_gen', [(9, 0), (11, 0), (15, 0)], ['ftp.intern.ndsec', 'ndsec', 'ndsec', ftp_put, ftp_get, 10, False, 5]),
87 | # ('ftp_gen', [(10, 0), (18, 0), (135, 0)], ['ftp.intern.ndsec', 'ndsec', 'ndsec', ftp_put, [], 2, False]),
88 | #
89 | # nfs / smb
90 | # ('copy_gen', [(9, 0), (12, 0), (90, 0)], [None, 'Z:/tmp/dummyfile.txt', 30]),
91 | # ('copy_gen', [(10, 0), (16, 0), (120, 0)], [None, 'Z:/tmp/dummyfile.txt', 80]),
92 | # ('copy_gen', [(12, 0), (17, 0), (160, 0)], [None, 'Z:/tmp/dummyfile.txt', 180]),
93 | # ('copy_gen', [(9, 0), (18, 0), (0, 10)], ['file1', 'file2']),
94 | #
95 | # telnet
96 | # ('telnet_gen', [(9, 0), (18, 0), (60, 0)], ['telnet.intern.ndsec', None, 'ndsec', 'ndsec', 5, ssh_commands, telnet_prompt, 10]),
97 | # ('telnet_gen', [(9, 0), (18, 0), (240, 0)], ['telnet.intern.ndsec', 23, 'ndsec', 'ndsec', 2, [], telnet_prompt]),
98 | # ('telnet_gen', [(16, 0), (18, 0), (120, 0)], ['telnet.intern.ndsec', 23, 'ndsec', 'wrongpass', 2, [], telnet_prompt]),
99 | #
100 | # ssh
101 | # ('ssh_gen', [(9, 0), (18, 0), (120, 0)], ['ssh.intern.ndsec', 22, 'ndsec', 'ndsec', 10, ssh_commands]),
102 | # ('ssh_gen', [(9, 0), (18, 0), (240, 0)], ['ssh.intern.ndsec', 22, 'ndsec', 'ndsec', 60, [], 30]),
103 | # ('ssh_gen', [(9, 0), (18, 0), (120, 0)], ['192.168.10.50', 22, 'dummy1', 'dummy1', 5, ssh_commands]),
104 | # ('ssh_gen', [(12, 0), (14, 0), (120, 0)], ['ssh.intern.ndsec', 22, 'dummy1', 'wrongpass', 5, ssh_commands]),
105 | #
106 | # sftp
107 | # ('sftp_gen', [(17, 0), (18, 0), (60, 0)], ['127.0.0.1', 22, 'user', 'pass', sftp_put, sftp_get, 5, 1]),
108 | ]
--------------------------------------------------------------------------------
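Each Conf.jobdef entry follows the pattern documented in the comment above: a generator name, a schedule [(start_hour, start_min), (end_hour, end_min), (interval_min, interval_sec)], and a generator-specific parameter list. As an illustration only (not part of this configuration), an entry that fetches the HTTPS url list between 08:00 and 17:30, roughly every two minutes, with three requests per run and a sleep multiplier of 10, would read:

    # inside Conf.jobdef:
    ('http_gen', [(8, 0), (17, 30), (2, 0)], [https_urls, 3, 10]),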
/trafficgen/PyTgen/core/__init__.py:
--------------------------------------------------------------------------------
1 | '''
2 | Python 3 adaption/modification of the PyTgen __init__
3 | '''
4 | from .runner import runner
5 | from .scheduler import scheduler
--------------------------------------------------------------------------------
/trafficgen/PyTgen/core/generator.py:
--------------------------------------------------------------------------------
1 | '''
2 | Python 3 adaption/modification of the PyTgen generator
3 | '''
4 | import logging
5 | import random
6 | import datetime
7 | import time
8 | import base64
9 | import string
10 | import os
11 |
12 | import smtplib # Used for smtp/mail
13 | import ftplib # Used for ftp
14 | import shutil # Used for network copy
15 | import telnetlib # Used for telnet traffic
16 | import urllib3 # Used for http/https
17 | # import paramiko  # Used for ssh + sftp (uncomment to enable ssh_gen and sftp_gen)
18 | urllib3.disable_warnings()
19 | logging.getLogger("urllib3").setLevel(logging.WARNING)
20 |
21 | class http_gen():
22 | '''
23 | http and https generator
24 | send HTTP GET requests to a webserver to retrieve a given URL n times.
25 | delay the requests by a random time controlled by a multiplier.
26 |     for best results with this generator, the size of the HTTP responses
27 |     should vary (e.g. the server returns random data)
28 | '''
29 | __generator__ = 'http'
30 |
31 | def __init__(self,
32 | params):
33 | self._urls = params[0]
34 | self._num = params[1]
35 | self._multiplier = 5
36 |
37 | if len(params) == 3:
38 | self._multiplier = params[2]
39 |
40 | def __call__(self):
41 | for _ in range(self._num):
42 | url = self._urls[random.randint(0, (len(self._urls) - 1))]
43 | logging.getLogger(self.__generator__).info("Requesting: %s", url)
44 |
45 | try:
46 | for __ in range(int(random.random() * 10 + 1)):
47 | http = urllib3.PoolManager()
48 | response = http.request('GET', url)
49 |                     logging.getLogger(self.__generator__).debug("Received %s bytes from %s",
50 | str(len(response.data)),
51 | url)
52 |
53 | except:
54 | logging.getLogger(self.__generator__).debug("Failed to request %s",
55 | url)
56 |
57 | time.sleep(random.random() * self._multiplier)
58 |
59 | class smtp_gen():
60 | '''
61 | smtp generator
62 | connect to an smtp server and send an email containing a random length
63 | string to a destination email address.
64 | '''
65 | __generator__ = "smtp"
66 |
67 | def __init__(self,
68 | params):
69 | self._host = params[0]
70 | self._user = params[1]
71 | self._pass = params[2]
72 | self._from = params[3]
73 | self._to = params[4]
74 |
75 | def __call__(self):
76 | rnd = ''.join(random.choice(string.ascii_letters) for _ in range(int(3000 * random.random())))
77 |
78 | msg = "From: " + self._from + "\r\n" \
79 | + "To: " + self._to + "\r\n" \
80 | + "Subject: PyTgen " + str(datetime.datetime.now()) + "\r\n\r\n" \
81 | + rnd + "\r\n"
82 |
83 | logging.getLogger(self.__generator__).info("Sending email to %s (size: %s)",
84 | self._host,
85 | len(rnd))
86 |
87 | try:
88 | sender = smtplib.SMTP(self._host, 25)
89 |
90 | try:
91 | logging.getLogger(self.__generator__).debug("Using TLS")
92 | sender.starttls()
93 |
94 | except:
95 | pass
96 |
97 | try:
98 | sender.login(self._user, self._pass)
99 |
100 | except smtplib.SMTPAuthenticationError:
101 | logging.getLogger(self.__generator__).debug("Using PLAIN auth")
102 |                 sender.docmd("AUTH LOGIN", base64.b64encode(self._user.encode()).decode())
103 |                 sender.docmd(base64.b64encode(self._pass.encode()).decode(), "")
104 |
105 | sender.sendmail(self._from, self._to, msg)
106 |             logging.getLogger(self.__generator__).debug("Email sent successfully")
107 |
108 | except:
109 | raise
110 |
111 | else:
112 | sender.quit()
113 |
114 | class ftp_gen():
115 | '''
116 | ftp and ftp_tls generator
117 | connect to a host using ftp and start uploading and downloading files.
118 | The files to be put or retrieved are specified in an array and are
119 |     randomly chosen. An empty array will skip upload or download.
120 | '''
121 | __generator__ = 'ftp'
122 |
123 | def __init__(self,
124 | params):
125 | self._host = params[0]
126 | self._user = params[1]
127 | self._pass = params[2]
128 | self._put = params[3]
129 | self._get = params[4]
130 | self._num = params[5]
131 | self._tls = params[6]
132 | self._multiplier = 5
133 |
134 |         if len(params) == 8:
135 |             self._multiplier = params[7]
136 |
137 | def __call__(self):
138 | ftp = None
139 | if self._tls == True:
140 | logging.getLogger(self.__generator__).info("Connecting to ftps://%s",
141 | self._host)
142 | try:
143 |             # 20% chance of logging in with a wrong password on the first try
144 | if random.random() > 0.8:
145 | try:
146 | logging.getLogger(self.__generator__).debug("Logging in with wrong credentials")
147 | ftp = ftplib.FTP_TLS(self._host,
148 | self._user,
149 | "wrongpass")
150 | except:
151 | pass
152 |
153 | time.sleep(2 * random.random())
154 |
155 | ftp = ftplib.FTP_TLS(self._host,
156 | self._user,
157 | self._pass)
158 | ftp.prot_p()
159 |
160 | except:
161 | logging.getLogger(self.__generator__).debug("Error connecting to ftps://%s",
162 | self._host)
163 |
164 | else:
165 | logging.getLogger(self.__generator__).info("Connecting to ftp://%s",
166 | self._host)
167 | try:
168 |             # 20% chance of logging in with a wrong password on the first try
169 | if random.random() > 0.8:
170 | try:
171 | logging.getLogger(self.__generator__).debug("Logging in with wrong credentials")
172 | ftp = ftplib.FTP(self._host,
173 | self._user,
174 | "wrongpass")
175 | except:
176 | pass
177 |
178 | time.sleep(2 * random.random())
179 |
180 | ftp = ftplib.FTP(self._host,
181 | self._user,
182 | self._pass)
183 |
184 | except:
185 | logging.getLogger(self.__generator__).debug("Error connecting to ftp://%s",
186 | self._host)
187 |
188 | if ftp is not None:
189 | ftp.retrlines('LIST')
190 |
191 | for _ in range(self._num):
192 |                 if len(self._put) != 0:
193 | ressource = self._put[random.randint(0, (len(self._put) - 1))]
194 | (path, filename) = os.path.split(ressource)
195 |
196 | logging.getLogger(self.__generator__).debug("Uploading %s",
197 | ressource)
198 |
199 |                     f = open(ressource, 'rb')  # storbinary expects a binary-mode file object
200 | ftp.storbinary("STOR " + filename, f)
201 | f.close()
202 |
203 | time.sleep(self._multiplier * random.random())
204 |
205 |                 if len(self._get) != 0:
206 | ressource = self._get[random.randint(0, (len(self._get) - 1))]
207 |
208 | logging.getLogger(self.__generator__).debug("Downloading %s",
209 | ressource)
210 |
211 | ftp.retrbinary('RETR ' + ressource, self._getfile)
212 |
213 | time.sleep(self._multiplier * random.random())
214 |
215 | ftp.quit()
216 |
217 | def _getfile(self,
218 | ressource):
219 | pass
220 |
221 | class copy_gen():
222 | '''
223 | copy generator.
224 | copy files or directories from a source to a destination. this generator
225 | can be used to generate traffic on network filesystems like nfs or smb.
226 | A random source file will be generated if the source parameter is set to
227 | "None". The size of the generated source file can be controlled by an
228 |     optional size parameter (default = 8192 bytes)
229 | '''
230 | __generator__ = "copy"
231 |
232 | def __init__(self,
233 | params):
234 | self._src = params[0]
235 | self._dst = params[1]
236 | self._size = 8192
237 |
238 | if len(params) == 3:
239 | self._size = params[2] * 1024
240 |
241 | def __call__(self):
242 | if self._src is not None:
243 | logging.getLogger(self.__generator__).info("Copying from %s to %s",
244 | self._src,
245 | self._dst)
246 |
247 | if os.path.isdir(self._src):
248 | dst = self._dst + "/" + self._src
249 |
250 | if os.path.exists(dst):
251 | logging.getLogger(self.__generator__).debug("Destination %s exists. Deleting it.",
252 | dst)
253 | shutil.rmtree(dst)
254 |
255 | try:
256 | shutil.copytree(self._src, dst)
257 | except:
258 | logging.getLogger(self.__generator__).debug("Error copying %s to %s",
259 | self._src,
260 | dst)
261 |
262 | else:
263 | try:
264 | shutil.copy2(self._src, self._dst)
265 | except:
266 | logging.getLogger(self.__generator__).debug("Error copying %s to %s",
267 | self._src,
268 | self._dst)
269 |
270 | else:
271 | if (not os.path.exists(self._dst)) or (os.path.isfile(self._dst)):
272 | rnd = ''.join(random.choice(string.ascii_letters) for _ in range(int(self._size * random.random())))
273 |
274 | logging.getLogger(self.__generator__).info("Writing %s byte to %s",
275 | len(rnd),
276 | self._dst)
277 |
278 | try:
279 | f = open(self._dst, "w")
280 | f.write(rnd)
281 |
282 | except:
283 | logging.getLogger(self.__generator__).debug("Error writing to %s",
284 | self._dst)
285 |
286 | else:
287 | f.close()
288 |
289 | else:
290 | logging.getLogger(self.__generator__).info("Destination %s is not a file",
291 | self._dst)
292 |
293 |
294 | class telnet_gen():
295 | '''
296 | telnet generator.
297 | connect to a host using telnet and start sending commands to the host. The
298 | connection will be kept open until the connection time provided in the
299 | config is over. If the commands array is empty, the connection will idle
300 | until connection time is over.
301 | '''
302 | __generator__ = "telnet"
303 |
304 | def __init__(self,
305 | params):
306 | self._host = params[0]
307 | self._port = params[1]
308 | self._user = params[2]
309 | self._pass = params[3]
310 | self._time = params[4]
311 | self._cmds = params[5]
312 | self._prompt = params[6]
313 | self._multiplier = 60
314 |
315 | if len(params) == 8:
316 | self._multiplier = params[7]
317 |
318 | def __call__(self):
319 | logging.getLogger(self.__generator__).info("Connecting to %s",
320 | self._host)
321 |
322 | realmin = self._time * 2 * random.random()
323 | endtime = datetime.datetime.now() + datetime.timedelta(minutes = realmin)
324 |
325 | try:
326 | tn = telnetlib.Telnet(self._host, self._port)
327 |
328 | try:
329 | tn.read_until("login: ")
330 | tn.write(self._user + "\n")
331 | if self._pass is not None:
332 | tn.read_until("Password: ")
333 | tn.write(self._pass + "\n")
334 |
335 | except:
336 | logging.getLogger(self.__generator__).debug("Error logging in")
337 |
338 | except:
339 | logging.getLogger(self.__generator__).debug("Error connecting to %s",
340 | self._host)
341 |
342 | else:
343 | while datetime.datetime.now() < endtime:
344 |                 if len(self._cmds) != 0:
345 | self._send_cmds(tn)
346 |
347 | else:
348 | time.sleep(realmin)
349 | break
350 |
351 | time.sleep(self._multiplier * random.random())
352 |
353 | tn.write("exit\n")
354 | tn.read_all()
355 |
356 | def _send_cmds(self,
357 | tn):
358 | for _ in range(int(6 * random.random())):
359 | cmd = self._cmds[random.randint(0, (len(self._cmds) - 1))]
360 | logging.getLogger(self.__generator__).debug("Sending command: %s",
361 | cmd)
362 | tn.read_very_eager()
363 |             tn.write(cmd.encode('ascii') + b'\n')
364 | tn.read_eager()
365 | time.sleep(5 * random.random())
366 |
367 | class ssh_gen():
368 | '''
369 | ssh generator.
370 | connect to a host using ssh and start sending commands to the host. The
371 | connection will be kept open until the connection time provided in the
372 | config is over. If the commands array is empty, the connection will idle
373 | until connection time is over.
374 | '''
375 | __generator__ = "ssh"
376 |
377 | def __init__(self,
378 | params):
379 | self._host = params[0]
380 | self._port = params[1]
381 | self._user = params[2]
382 | self._pass = params[3]
383 | self._time = params[4]
384 | self._cmds = params[5]
385 | self._multiplier = 60
386 |
387 | if len(params) == 7:
388 | self._multiplier = params[6]
389 |
390 | def __call__(self):
391 | logging.getLogger("paramiko").setLevel(logging.INFO)
392 | logging.getLogger(self.__generator__).info("Connecting to %s",
393 | self._host)
394 |
395 | realmin = self._time * 2 * random.random()
396 | endtime = datetime.datetime.now() + datetime.timedelta(minutes = realmin)
397 |
398 | client = paramiko.SSHClient()
399 | client.load_system_host_keys()
400 | client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
401 |
402 | try:
403 |             # 20% chance of logging in with a wrong password on the first try
404 | if random.random() > 0.8:
405 | try:
406 | client.connect(self._host,
407 | self._port,
408 | self._user,
409 | "wrongpass")
410 | except:
411 | pass
412 |
413 | time.sleep(2 * random.random())
414 |
415 | client.connect(self._host,
416 | self._port,
417 | self._user,
418 | self._pass)
419 |
420 | except:
421 | logging.getLogger(self.__generator__).debug("Error connecting to %s",
422 | self._host)
423 |
424 | else:
425 | while datetime.datetime.now() < endtime:
426 |                 if len(self._cmds) != 0:
427 | self._send_cmds(client)
428 |
429 | else:
430 | time.sleep(realmin)
431 | break
432 |
433 | time.sleep(self._multiplier * random.random())
434 |
435 | client.close()
436 |
437 | def _send_cmds(self,
438 | client):
439 | for _ in range(3):
440 | cmd = self._cmds[random.randint(0, (len(self._cmds) - 1))]
441 | logging.getLogger(self.__generator__).debug("Sending command: %s",
442 | cmd)
443 | client.exec_command(cmd)
444 | time.sleep(5 * random.random())
445 |
446 | class sftp_gen():
447 | '''
448 | sftp generator.
449 | this generator will connect to a host using ssh subsystem sftp. It then
450 |     starts uploading and downloading the files provided via the _put and
451 |     _get parameters.
452 | '''
453 | __generator__ = "sftp"
454 |
455 | def __init__(self,
456 | params):
457 | self._host = params[0]
458 | self._port = params[1]
459 | self._user = params[2]
460 | self._pass = params[3]
461 | self._put = params[4]
462 | self._get = params[5]
463 | self._time = params[6]
464 | self._multiplier = 60
465 |
466 | if len(params) == 8:
467 | self._multiplier = params[7]
468 |
469 | def __call__(self):
470 | logging.getLogger("paramiko").setLevel(logging.INFO)
471 | logging.getLogger(self.__generator__).info("Connecting to %s", self._host)
472 |
473 | realmin = self._time * 2 * random.random()
474 | endtime = datetime.datetime.now() + datetime.timedelta(minutes = realmin)
475 |
476 | try:
477 | transport = paramiko.Transport((self._host,
478 | self._port))
479 | transport.connect(username = self._user,
480 | password = self._pass)
481 |
482 | sftp = paramiko.SFTPClient.from_transport(transport)
483 |
484 | except:
485 | logging.getLogger(self.__generator__).debug("Error connecting to %s",
486 | self._host)
487 |
488 | else:
489 | while datetime.datetime.now() < endtime:
490 | put = self._put
491 | get = self._get
492 |
493 |                 if len(put) == 0 and len(get) == 0:
494 | time.sleep(realmin)
495 | break
496 |
497 |                 while len(put) != 0:
498 | (src, dst) = put.pop(len(put) - 1)
499 |
500 | logging.getLogger(self.__generator__).debug("Uploading %s to %s",
501 | src, dst)
502 | sftp.put(src, dst)
503 | time.sleep(2 * random.random())
504 |
505 |                 while len(get) != 0:
506 | (src, dst) = get.pop(len(get) - 1)
507 |
508 | logging.getLogger(self.__generator__).debug("Downloading %s to %s",
509 | src, dst)
510 | sftp.get(src, dst)
511 | time.sleep(2 * random.random())
512 |
513 | time.sleep(self._multiplier * random.random())
514 |
515 | sftp.close()
516 | time.sleep(0.2)
517 | transport.close()
--------------------------------------------------------------------------------
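A generator can also be exercised on its own, outside the scheduler and runner: each class takes its parameter list in the constructor and performs one round of traffic when called. A minimal sketch, assuming it is run from the trafficgen/PyTgen directory:

    from config import Conf
    from core.generator import http_gen

    job = http_gen([Conf.https_urls, 2, 5])  # urls, requests per run, sleep multiplier
    job()                                    # issues the GET requests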
/trafficgen/PyTgen/core/runner.py:
--------------------------------------------------------------------------------
1 | '''
2 | Python 3 adaption/modification of the PyTgen runner
3 | '''
4 |
5 | import threading
6 | import queue
7 | import logging
8 |
9 | class worker(threading.Thread):
10 | def __init__(self,
11 | name,
12 | queue,
13 | create,
14 | destroy):
15 | threading.Thread.__init__(self)
16 |
17 | self.__name = name
18 | self.__queue = queue
19 |
20 | self.__create = create
21 | self.__destroy = destroy
22 |
23 | self.__dismissed = threading.Event()
24 |
25 | self.daemon = True
26 | self.name = self.__name
27 |
28 | self.start()
29 |
30 | def run(self):
31 | if self.__create:
32 | self.__create()
33 |
34 | logging.getLogger(self.__name).debug('main loop started')
35 |
36 | while True:
37 | if self.__dismissed.is_set():
38 | break
39 |
40 | try:
41 | action = self.__queue.get(block=True,
42 | timeout=10)
43 |
44 | except:
45 | continue
46 |
47 | else:
48 | try:
49 | action()
50 |
51 | except:
52 | raise
53 |
54 | logging.getLogger(self.__name).debug('main loop finished')
55 |
56 | if self.__destroy:
57 | self.__destroy()
58 |
59 | def dismiss(self):
60 | self.__dismissed.set()
61 |
62 | class runner(object):
63 | def __init__(self,
64 | maxthreads=10,
65 | thread_create=None,
66 | thread_destroy=None):
67 | self.__queue = queue.Queue()
68 | self.__maxthreads = maxthreads
69 |
70 | logging.getLogger('runner').info('creating runner with 3 initial threads')
71 |
72 | self.__workers = []
73 | for i in range(0, 3):
74 | name = 'worker_%d' % i
75 |
76 | logging.getLogger('runner').debug('creating worker thread: %s',
77 | name)
78 |
79 | self.__workers.append(worker(name=name,
80 | queue=self.__queue,
81 | create=thread_create,
82 | destroy=thread_destroy))
83 |
84 | def __call__(self,
85 | action):
86 | self.__queue.put(action)
87 |
88 | if (self.__queue.qsize() > 2):
89 | if (len(self.__workers) < self.__maxthreads):
90 | self._spawn()
91 | else:
92 |                 logging.getLogger('runner').warning('Not enough worker threads to handle queue')
93 |
94 | def _spawn(self):
95 | name = 'worker_%d' % (len(self.__workers) + 1)
96 |         logging.getLogger('runner').info('spawning new worker thread: %s', name)
97 | self.__workers.append(worker(name=name,
98 | queue=self.__queue,
99 | create=None,
100 | destroy=None))
101 |
102 | def stop(self):
103 | for worker in self.__workers:
104 | worker.dismiss()
105 |
106 | for worker in self.__workers:
107 | worker.join()
--------------------------------------------------------------------------------
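The runner is independent of the generators: it is a small, growable thread pool that accepts any zero-argument callable. A minimal sketch of driving it directly (stop() may take up to the 10-second queue timeout before the workers join):

    import time
    from core.runner import runner

    pool = runner(maxthreads=5)
    for i in range(3):
        pool(lambda i=i: print('job %d done' % i))  # enqueue three trivial jobs
    time.sleep(1)                                   # let the daemon workers drain the queue
    pool.stop()                                     # dismiss and join the worker threads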
/trafficgen/PyTgen/core/scheduler.py:
--------------------------------------------------------------------------------
1 | '''
2 | Python 3 adaption/modification of the PyTgen scheduler
3 | '''
4 |
5 | import datetime
6 | import threading
7 | import heapq
8 | import random
9 | import logging
10 |
11 | class scheduler(threading.Thread):
12 | class job(object):
13 | def __init__(self, name, action, interval, start, end):
14 | self.__name = name
15 | self.__action = action
16 | self.__interval = interval
17 | self.__start = start
18 | self.__end = end
19 | self.__exec_time = datetime.datetime.now() + datetime.timedelta(seconds = self.__interval[1] * random.random(),
20 | minutes = self.__interval[0] * random.random())
21 |
22 | def __call__(self):
23 | today = datetime.datetime.now()
24 | start = today.replace(hour = self.__start[0],
25 | minute = self.__start[1],
26 | second = 0, microsecond = 0)
27 | end = today.replace(hour = self.__end[0],
28 | minute = self.__end[1],
29 | second = 0, microsecond = 0)
30 |
31 | if start <= self.__exec_time < end:
32 | # enqueue job for "random() * 2 * interval"
33 |                 # on average the job will run once per interval, with random jitter
34 | self.__exec_time += datetime.timedelta(seconds = self.__interval[1] * random.random() * 2,
35 | minutes = self.__interval[0] * random.random() * 2)
36 |
37 | if self.__exec_time < datetime.datetime.now():
38 | logging.getLogger("scheduler").warning('scheduler is overloaded!')
39 |
40 | return self.__action
41 |
42 | else:
43 | # enqueue the job until next start time
44 | if self.__exec_time < start and self.__exec_time.day == start.day:
45 | self.__exec_time = start + datetime.timedelta(seconds = 1)
46 | else:
47 | self.__exec_time = start + datetime.timedelta(days = 1)
48 |
49 | logging.getLogger("scheduler").info("enqueueing %s until %s",
50 | self.__name, self.__exec_time)
51 | return False
52 |
53 |
54 | def __lt__(self, other):
55 | try:
56 | if type(other) == scheduler.job:
57 | return self.__exec_time < other.__exec_time
58 |
59 | elif type(other) == datetime.datetime:
60 | return self.__exec_time < other
61 | else:
62 | raise Exception()
63 | except Exception as e:
64 | raise
65 |
66 | def __sub__(self, other):
67 | try:
68 | if type(other) == scheduler.job:
69 | return self.__exec_time - other.__exec_time
70 |
71 | elif type(other) == datetime.datetime:
72 | return self.__exec_time - other
73 | else:
74 | raise Exception()
75 | except Exception as e:
76 | raise
77 |
78 | def __init__(self, jobs, runner):
79 | threading.Thread.__init__(self)
80 | self.setName('scheduler')
81 |
82 | self.__runner = runner
83 | self.__jobs = jobs
84 | heapq.heapify(self.__jobs)
85 |
86 | self.__running = False
87 | self.__signal = threading.Condition()
88 |
89 | def run(self):
90 | self.__running = True
91 | while self.__running:
92 | self.__signal.acquire()
93 | if not self.__jobs:
94 | self.__signal.wait()
95 |
96 | else:
97 | now = datetime.datetime.now()
98 | while (self.__jobs[0] < now):
99 | job = heapq.heappop(self.__jobs)
100 |
101 | action = job()
102 | if action is not False:
103 | self.__runner(action)
104 |
105 | heapq.heappush(self.__jobs, job)
106 |
107 | logging.getLogger("scheduler").debug("Sleeping %s seconds", (self.__jobs[0] - now))
108 | self.__signal.wait((self.__jobs[0] - now).total_seconds())
109 |
110 | self.__signal.release()
111 |
112 | def stop(self):
113 | self.__running = False
114 |
115 | self.__signal.acquire()
116 | self.__signal.notify_all()
117 | self.__signal.release()
118 |
119 | def set_jobs(self, jobs):
120 | self.__signal.acquire()
121 |
122 | self.__jobs = jobs
123 | heapq.heapify(self.__jobs)
124 |
125 | self.__signal.notify_all()
126 | self.__signal.release()
--------------------------------------------------------------------------------
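The re-scheduling delay in job.__call__ is drawn uniformly from [0, 2 * interval), so its expected value equals the configured interval, which is what the "on average the job will run once per interval" comment relies on. A quick numerical check of that claim:

    import random

    interval_seconds = 120
    draws = [random.random() * 2 * interval_seconds for _ in range(100000)]
    print(sum(draws) / len(draws))  # close to 120, i.e. one run per interval on average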
/trafficgen/PyTgen/nslookup.py:
--------------------------------------------------------------------------------
1 | import socket
2 | import config
3 |
4 | Conf = config.Conf
5 | # ip_list = []
6 | # ais = socket.getaddrinfo('en.wikipedia.org', 0, 0, 0, 0)
7 | # for result in ais:
8 | # ip_list.append(result[-1][0])
9 | # ip_list = list(set(ip_list))
10 | # print(ip_list)
11 |
12 | count = 0
13 | for http in Conf.https_urls:
14 | url = http.split('/')[2]
15 | ip_list = []
16 | ais = socket.getaddrinfo(url, 0, 0, 0, 0)
17 | for result in ais:
18 | ip_list.append(result[-1][0])
19 | ip_list = list(set(ip_list))
20 | count +=1
21 | print(http, ip_list, count)
--------------------------------------------------------------------------------
/trafficgen/PyTgen/run.py:
--------------------------------------------------------------------------------
1 | '''
2 | Python 3 adaption/modification of the PyTgen run.py
3 | '''
4 |
5 | import logging
6 | import signal
7 | import platform
8 |
9 | import core.runner
10 | import core.scheduler
11 | import config
12 |
13 | from core.generator import *
14 |
15 |
16 | def create_jobs():
17 | logging.getLogger('main').info('Creating jobs')
18 |
19 | jobs = []
20 | for next_job in Conf.jobdef:
21 | logging.getLogger('main').info('creating %s', next_job)
22 |
23 | job = core.scheduler.job(name = next_job[0],
24 | action = eval(next_job[0])(next_job[2]),
25 | interval = next_job[1][2],
26 | start = next_job[1][0],
27 | end = next_job[1][1])
28 | jobs.append(job)
29 |
30 | return jobs
31 |
32 | if __name__ == '__main__':
33 | # set hostbased parameters
34 | hostname = platform.node()
35 |
36 | log_file = 'logs/' + hostname + '.log'
37 | config_file = "config.py"
38 | # file = open(log_file, 'a+')
39 | # file.close()
40 | # load the hostbased configuration file
41 | # _Conf = __import__(config_file, globals(), locals(), ['Conf'], -1)
42 | # Conf = _Conf.Conf
43 | Conf = config.Conf
44 |
45 | # start logger
46 | logging.basicConfig(level = Conf.loglevel,
47 | format = '%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
48 | datefmt = '%Y-%m-%d %H:%M:%S',
49 | filename = log_file)
50 |
51 | logging.getLogger('main').info('Configuration %s loaded', config_file)
52 |
53 | # start runner, create jobs, start scheduling
54 | runner = core.runner(maxthreads = Conf.maxthreads)
55 |
56 | jobs = create_jobs()
57 |
58 | scheduler = core.scheduler(jobs = jobs,
59 | runner = runner)
60 |
61 | # Stop scheduler on exit
62 | def signal_int(signal, frame):
63 | logging.getLogger('main').info('Stopping scheduler')
64 | scheduler.stop()
65 |
66 | signal.signal(signal.SIGINT, signal_int)
67 |
68 | # Run the scheduler
69 | logging.getLogger('main').info('Starting scheduler')
70 | scheduler.start()
71 | scheduler.join(timeout=None)
72 |
73 | # Stop the runner
74 | runner.stop()
--------------------------------------------------------------------------------
/trafficgen/Streaming/streaming_generator.py:
--------------------------------------------------------------------------------
1 | import sys
2 | sys.path.insert(0, "/home/mclrn/dlproject/")
3 | import datetime
4 | from threading import Thread
5 | from selenium import webdriver
6 | from slackclient import SlackClient
7 | import traceback
8 | import os
9 | from selenium.webdriver.support.ui import WebDriverWait
10 |
11 |
12 | # import trafficgen.Streaming.win_capture as cap
13 | # import trafficgen.Streaming.streaming_types as stream
14 |
15 | import unix_capture as cap
16 | import streaming_types as stream
17 | from constants import SLACK_TOKEN
18 |
19 | def notifySlack(message):
20 | sc = SlackClient(SLACK_TOKEN)
21 | try:
22 | sc.api_call("chat.postMessage", channel="#server", text=message)
23 | except:
24 | sc.api_call("chat.postMessage", channel="#server", text="Could not send stacktrace")
25 |
26 |
27 | def generate_streaming(duration, dir, total_iterations, options=None):
28 | iterations = 0
29 | stream_types = {
30 | # 'hbo': (stream.HboNordic, 1),
31 | # 'netflix': (stream.Netflix, 1),
32 | 'twitch': (stream.Twitch, 5),
33 | 'youtube': (stream.Youtube, 5),
34 | 'drtv': (stream.DrTv, 5),
35 | }
36 | while iterations < total_iterations:
37 | print("Iteration:", iterations)
38 | if iterations % 25 == 0:
39 | notifySlack("Starting iteration: " + str(iterations))
40 | try:
41 | for stream_type in stream_types.keys():
42 |                 browsers, capture_thread, file, streaming_threads = [], [], [], []
43 | type = stream_types[stream_type][0]
44 | num_threads = stream_types[stream_type][1]
45 |
46 | browsers, capture_thread, file, streaming_threads = generate_threaded_streaming(type, stream_type, dir,
47 | duration, options,
48 | num_threads=num_threads)
49 | try:
50 | capture_thread.start()
51 | for thread in streaming_threads:
52 | # Start streaming threads
53 | thread.start()
54 | print("streaming started", stream_type)
55 | capture_thread.join() # Stream until the capture thread joins
56 | print("capture done - thread has joined")
57 | except Exception as e:
58 | notifySlack("Something went wrong %s" % traceback.format_exc())
59 | # Wait for capture thread
60 | capture_thread.join()
61 |                     # Do a cleanup since something went wrong
62 | cap.cleanup(file)
63 | try:
64 | for browser in browsers:
65 | browser.close()
66 | browser.quit()
67 | except Exception as e:
68 | notifySlack("Something went wrong %s" % traceback.format_exc())
69 | # os.system("killall chrome")
70 | # os.system("killall chromedriver")
71 | except Exception as ex:
72 | notifySlack("Something went wrong when setting up the threads \n %s" % traceback.format_exc())
73 |
74 | iterations += 1
75 |
76 |
77 | def generate_threaded_streaming(obj: stream.Streaming, stream_name, dir, duration, chrome_options=None, num_threads=5):
78 | #### STREAMING ####
79 | # Create filename
80 | now = datetime.datetime.now()
81 | file = dir + "/%s-%.2d%.2d_%.2d%.2d%.2d.pcap" % (stream_name, now.day, now.month, now.hour, now.minute, now.second)
82 | # Instantiate thread
83 | capture_thread = Thread(target=cap.captureTraffic, args=(1, duration, dir, file))
84 | # Create five threads for streaming
85 | streaming_threads = []
86 | browsers = []
87 | for i in range(num_threads):
88 | browser = webdriver.Chrome(options=chrome_options)
89 | browser.implicitly_wait(10)
90 | browsers.append(browser)
91 | t = Thread(target=obj.stream_video, args=(obj, browser))
92 | streaming_threads.append(t)
93 |
94 | return browsers, capture_thread, file, streaming_threads
95 |
96 |
97 | def get_clear_browsing_button(driver):
98 | """Find the "CLEAR BROWSING BUTTON" on the Chrome settings page. /deep/ to go past shadow roots"""
99 | return driver.find_element_by_css_selector('* /deep/ #clearBrowsingDataConfirm')
100 |
101 |
102 | def clear_cache(driver, timeout=60):
103 | """Clear the cookies and cache for the ChromeDriver instance."""
104 | # navigate to the settings page
105 | driver.get('chrome://settings/clearBrowserData')
106 |
107 | # wait for the button to appear
108 | wait = WebDriverWait(driver, timeout)
109 | wait.until(get_clear_browsing_button)
110 |
111 | # click the button to clear the cache
112 | get_clear_browsing_button(driver).click()
113 |
114 | # wait for the button to be gone before returning
115 | wait.until_not(get_clear_browsing_button)
116 |
117 |
118 | if __name__ == "__main__":
119 | #netflixuser = os.environ["netflixuser"]
120 | #netflixpassword = os.environ["netflixpassword"]
121 | #hbouser = os.environ["hbouser"]
122 | #hbopassword = os.environ["hbopassword"]
123 | # slack_token = os.environ['slack_token']
124 | # Specify duration in seconds
125 | duration = 60 * 1
126 | total_iterations = 1000
127 | save_dir = '/home/mclrn/Data'
128 | chrome_profile_dir = "/home/mclrn/.config/google-chrome/"
129 | options = webdriver.ChromeOptions()
130 | #options.add_argument('user-data-dir=' + chrome_profile_dir)
131 | options.add_argument("--enable-quic")
132 | # options.add_argument('headless')
133 | generate_streaming(duration, save_dir, total_iterations, options)
134 | print("something")
--------------------------------------------------------------------------------
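A single capture-and-stream round for one service can be driven without the outer loop in generate_streaming. A minimal sketch, assuming it runs from trafficgen/Streaming with chromedriver on the PATH, a local constants.py providing SLACK_TOKEN (imported at module load), and the tshark/sudo setup from unix_capture.py; the capture directory is a placeholder:

    from selenium import webdriver
    import streaming_types as stream
    from streaming_generator import generate_threaded_streaming

    options = webdriver.ChromeOptions()
    options.add_argument('headless')  # optional: run Chrome without a window
    browsers, capture_thread, pcap_file, stream_threads = generate_threaded_streaming(
        stream.Youtube, 'youtube', '/tmp/captures', 60, chrome_options=options, num_threads=1)
    capture_thread.start()
    for t in stream_threads:
        t.start()
    capture_thread.join()             # the capture stops after the requested duration
    for b in browsers:
        b.quit()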
/trafficgen/Streaming/streaming_types.py:
--------------------------------------------------------------------------------
1 | from random import randint
2 | import time
3 | class Streaming:
4 |
5 | def stream_video(self, browser):
6 | pass
7 |
8 |
9 | class Twitch(Streaming):
10 |
11 | def stream_video(self, browser):
12 | browser.get('https://www.twitch.tv/directory/game/League%20of%20Legends/videos/all')
13 | # Choose random video
14 | time.sleep(1)
15 | videos = browser.find_elements_by_css_selector("a[href*='/videos/']")
16 |         video = videos[randint(0, len(videos) - 1)]
17 | link = video.get_attribute('href')
18 | browser.get(link)
19 |
20 |
21 | class Youtube(Streaming):
22 |
23 | def stream_video(self, browser):
24 | browser.get('https://www.youtube.com')
25 | # Choose random video
26 | time.sleep(1)
27 | videos = browser.find_elements_by_css_selector("ytd-grid-video-renderer a[href*='/watch?v=']")
28 |         video = videos[randint(0, len(videos) - 1)]
29 | # thumbnail
30 | link = video.get_attribute('href')
31 | browser.get(link)
32 |
33 |
34 | class Netflix(Streaming):
35 |
36 | def stream_video(self, browser):
37 | # Change to correct profile
38 | browser.get("https://www.netflix.com/SwitchProfile?tkn=I42P4G75VVDM7LV626VKTXTXGI")
39 | # Choose random video
40 | videos = browser.find_elements_by_css_selector("div.title-card-container a[href*='/watch/']")
41 |         video = videos[randint(0, len(videos) - 1)]
42 | link = video.get_attribute('href')
43 | browser.get(link)
44 |
45 |
46 | class DrTv(Streaming):
47 |
48 | def stream_video(self, browser):
49 | browser.get('https://www.dr.dk/tv')
50 | # Choose random video
51 | videos = browser.find_elements_by_class_name('program-link')
52 |         video = videos[randint(0, len(videos) - 1)]
53 | link = video.get_attribute('href')
54 | browser.get(link)
55 | play_button = browser.find_element_by_css_selector('button[title="Afspil"]')
56 | play_button.click()
57 |
58 |
59 | class HboNordic(Streaming):
60 | def stream_video(self, browser):
61 | browser.get("https://dk.hbonordic.com/home")
62 | videos = browser.find_elements_by_css_selector("a[data-automation='play-button']")
63 |         video = videos[randint(0, len(videos) - 1)]
64 | video_url = video.get_attribute("href")
65 | browser.get(video_url)
66 |
--------------------------------------------------------------------------------
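Adding a new service amounts to subclassing Streaming with a stream_video that navigates the browser to a playable URL, and registering the class in the stream_types dict of streaming_generator.py. An illustrative sketch; the URL and CSS selector below are placeholders, not a tested scraper:

    from random import randint
    import streaming_types as stream

    class ExampleService(stream.Streaming):

        def stream_video(self, browser):
            browser.get('https://video.example.com/')  # placeholder start page
            videos = browser.find_elements_by_css_selector("a[href*='/watch/']")
            video = videos[randint(0, len(videos) - 1)]  # pick a random video link
            browser.get(video.get_attribute('href'))

    # and in streaming_generator.generate_streaming:
    #     'example': (ExampleService, 5),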
/trafficgen/Streaming/unix_capture.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 |
4 | def captureTraffic(interfaceNumber, duration, dir, file):
5 | '''
6 |     interfaceNumber: the network interface to capture on (use tshark -D to list the options)
7 |     duration: the number of seconds the capture should run
8 |     dir: the folder the pcap file should be saved to (created if it does not exist)
9 |     file: the full path of the pcap file to write (the caller appends date and time to the name)
10 | '''
11 | # makedir if it does not exist
12 | if not os.path.isdir(dir):
13 | os.mkdir(dir)
14 | #open(file, "w") # overwrites if file already exists
15 | os.system("echo %s |sudo -S tshark -i %d -a duration:%d -w %s port not 5901 and ip and not broadcast and not multicast" % ('Napatech10',interfaceNumber, duration, file))
16 | os.system("echo %s |sudo -S chown mclrn:mclrn %s" % ('Napatech10', file))
17 |
18 |
19 |
20 | '''
21 | To test
22 | captureTraffic(1,10, 'C:/users/arhjo/desktop', "test")
23 | '''
24 |
25 |
26 | def cleanup(file):
27 | os.system("echo %s |sudo -S rm %s" % ('Napatech10', file))
28 |
--------------------------------------------------------------------------------
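For reference, the same capture can be expressed with subprocess instead of os.system, which avoids echoing the sudo password into the command line; this sketch assumes the user may run tshark directly (for example via dumpcap capture capabilities) and mirrors the flags used above:

    import subprocess

    def capture_traffic(interface_number, duration, out_file):
        cmd = ['tshark', '-i', str(interface_number),
               '-a', 'duration:%d' % duration,
               '-w', out_file,
               'port not 5901 and ip and not broadcast and not multicast']
        subprocess.run(cmd, check=True)  # blocks until the capture duration has elapsed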
/trafficgen/Streaming/win_capture.py:
--------------------------------------------------------------------------------
1 | import os
2 | import datetime
3 |
4 |
5 | def captureTraffic(interfaceNumber, duration, dir, file):
6 | '''
7 |     interfaceNumber: the network interface to capture on (use tshark -D to list the options)
8 |     duration: the number of seconds the capture should run
9 |     dir: the folder the pcap file should be saved to (created if it does not exist)
10 |     file: the full path of the pcap file to write (the caller appends date and time to the name)
11 | '''
12 |
13 | # makedir if it does not exist
14 | if not os.path.isdir(dir):
15 | os.mkdir(dir)
16 |
17 | open(file, "w") # overwrites if file already exists
18 | os.system("tshark -i %d -a duration:%d -w %s port not 5901 and ((port 80) or (port 443)) and ip and not broadcast and not multicast" % (interfaceNumber, duration, file))
19 |
20 |
21 | def cleanup(file):
22 | os.system("del %s" % file)
23 |
24 | '''
25 | To test
26 | captureTraffic(1,10, 'C:/users/arhjo/desktop', "test")
27 | '''
--------------------------------------------------------------------------------
/train/train_header.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | from tf import tf_utils as tfu, confusionmatrix as conf, dataset, early_stopping as es
3 | import numpy as np
4 | import datetime
5 | from sklearn import metrics
6 | from sklearn.preprocessing import label_binarize
7 | import utils
8 | import os
9 |
10 | now = datetime.datetime.now()
11 |
12 | summaries_dir = '../tensorboard'
13 |
14 | # hidden_units = 12
15 | units = [5]
16 | acc_list = []
17 | num_headers_train = []
18 | hidden_units_train = []
19 | num_headers = [8]
20 | for num_header in num_headers:
21 | for hidden_units in units:
22 | hidden_units_train.append(hidden_units)
23 | train_dirs = ["E:/Data/LinuxChrome/{}/".format(num_header),
24 | "E:/Data/WindowsSalik/{}/".format(num_header),
25 | "E:/Data/WindowsAndreas/{}/".format(num_header),
26 | "E:/Data/WindowsFirefox/{}/".format(num_header)
27 | ]
28 |
29 | test_dirs = ["E:/Data/WindowsChrome/{}/".format(num_header)]
30 |
31 | trainstr = "train:"
32 | for traindir in train_dirs:
33 | trainstr += traindir.split('Data/')[1].split("/")[0]
34 | trainstr += ":"
35 | teststr = "test:"
36 | for testdir in test_dirs:
37 | teststr += testdir.split('Data/')[1].split("/")[0]
38 | teststr += ":"
39 | timestamp = "%.2d%.2d_%.2d%.2d" % (now.day, now.month, now.hour, now.minute)
40 | save_dir = "../trained_models/{0}/{1}/{2}/".format(num_header, hidden_units, timestamp)
41 | os.makedirs(save_dir, exist_ok=True)
42 | seed = 0
43 | namestr = trainstr+teststr+str(num_header)+":"+str(hidden_units)
44 | # Beta for L2 regularization
45 | beta = 1.0
46 | val_size = [0.1]
47 |
48 | early_stop = es.EarlyStopping(patience=20, min_delta=0.05)
49 | subdir = "/{0}/{1}/{2}/".format(num_header, hidden_units, timestamp)
50 | input_size = num_header*54
51 | data = dataset.read_data_sets(test_dirs, train_dirs, merge_data=True, one_hot=True,
52 | validation_size=0.1,
53 | test_size=0.1,
54 | balance_classes=False,
55 | payload_length=input_size,
56 | seed=seed)
57 | tf.reset_default_graph()
58 | num_headers_train.append(num_header)
59 | num_classes = len(dataset._label_encoder.classes_)
60 | labels = dataset._label_encoder.classes_
61 |
62 | # cm = conf.ConfusionMatrix(num_classes, class_names=labels)
63 | gpu_opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.85)
64 |
65 | x_pl = tf.placeholder(tf.float32, [None, input_size], name='xPlaceholder')
66 | y_pl = tf.placeholder(tf.float64, [None, num_classes], name='yPlaceholder')
67 | y_pl = tf.cast(y_pl, tf.float32)
68 |
69 | x = tfu.ffn_layer('layer1', x_pl, hidden_units, activation=tf.nn.relu, seed=seed)
70 | # x = tf.layers.dense(x_pl, hidden_units, tf.nn.relu)
71 | # x = tfu.dropout(x, 0.5)
72 | # x = tfu.ffn_layer('layer2', x, hidden_units, activation=tf.nn.relu)
73 | # x = tfu.ffn_layer('layer2', x, 50, activation=tf.nn.sigmoid)
74 | # x = tfu.ffn_layer('layer3', x, 730, activation=tf.nn.relu)
75 | y = tfu.ffn_layer('output_layer', x, hidden_units=num_classes, activation=tf.nn.softmax, seed=seed)
76 | y_ = tf.argmax(y, axis=1)
77 | # with tf.name_scope('cross_entropy'):
78 | # # The raw formulation of cross-entropy,
79 | # #
80 | # # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)),
81 | # # reduction_indices=[1]))
82 | # #
83 | # # can be numerically unstable.
84 | # #
85 | # # So here we use tf.losses.sparse_softmax_cross_entropy on the
86 | # # raw logit outputs of the nn_layer above.
87 | # with tf.name_scope('total'):
88 | # cross_entropy = tf.losses.softmax_cross_entropy(onehot_labels=y_pl, logits=y)
89 |
90 |
91 | with tf.variable_scope('loss'):
92 | # computing cross entropy per sample
93 | cross_entropy = -tf.reduce_sum(y_pl * tf.log(y + 1e-8), reduction_indices=[1])
94 |
95 | W1 = tf.get_default_graph().get_tensor_by_name("layer1/W:0")
96 | loss = tf.nn.l2_loss(W1)
97 | # averaging over samples
98 | cross_entropy = tf.reduce_mean(cross_entropy)
99 | loss = tf.reduce_mean(cross_entropy + loss * beta)
100 |
101 | tf.summary.scalar('cross_entropy', cross_entropy)
102 |
103 | with tf.variable_scope('training'):
104 | # defining our optimizer
105 | optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
106 |
107 | # applying the gradients
108 | train_op = optimizer.minimize(cross_entropy)
109 |
110 | with tf.variable_scope('performance'):
111 | # making a one-hot encoded vector of correct (1) and incorrect (0) predictions
112 | correct_prediction = tf.equal(tf.argmax(y, axis=1), tf.argmax(y_pl, axis=1))
113 |
114 | # averaging the one-hot encoded vector
115 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
116 |
117 | tf.summary.scalar('accuracy', accuracy)
118 |
119 | # Merge all the summaries and write them out
120 | merged = tf.summary.merge_all()
121 | train_writer = tf.summary.FileWriter(summaries_dir + '/train/' + subdir)
122 | val_writer = tf.summary.FileWriter(summaries_dir + '/validation/' + subdir)
123 |
124 | # Training Loop
125 | batch_size = 100
126 | max_epochs = 200
127 |
128 | valid_loss, valid_accuracy = [], []
129 | train_loss, train_accuracy = [], []
130 | test_loss, test_accuracy = [], []
131 | epochs = []
132 | with tf.Session() as sess:
133 | early_stop.on_train_begin()
134 | train_writer.add_graph(sess.graph)
135 | sess.run(tf.global_variables_initializer())
136 | total_parameters = 0
137 | print("Calculating trainable parameters!")
138 | for variable in tf.trainable_variables():
139 | # shape is an array of tf.Dimension
140 | shape = variable.get_shape()
141 | print("Shape: {}".format(shape))
142 | variable_parameters = 1
143 | for dim in shape:
144 | variable_parameters *= dim.value
145 | print("Shape {0} gives {1} trainable parameters".format(shape, variable_parameters))
146 | total_parameters += variable_parameters
147 | print("Trainable parameters: {}".format(total_parameters))
148 | print('Begin training loop')
149 | saver = tf.train.Saver()
150 | try:
151 | while data.train.epochs_completed < max_epochs:
152 | _train_loss, _train_accuracy = [], []
153 | ## Run train op
154 | x_batch, y_batch = data.train.next_batch(batch_size)
155 | fetches_train = [train_op, cross_entropy, accuracy, merged]
156 | feed_dict_train = {x_pl: x_batch, y_pl: y_batch}
157 | _, _loss, _acc, _summary = sess.run(fetches_train, feed_dict_train)
158 |
159 | _train_loss.append(_loss)
160 | _train_accuracy.append(_acc)
161 | ## Compute validation loss and accuracy
162 | if data.train.epochs_completed % 1 == 0 \
163 | and data.train._index_in_epoch <= batch_size:
164 |
165 | train_writer.add_summary(_summary, data.train.epochs_completed)
166 |
167 | train_loss.append(np.mean(_train_loss))
168 | train_accuracy.append(np.mean(_train_accuracy))
169 |
170 | fetches_valid = [cross_entropy, accuracy, merged]
171 |
172 | feed_dict_valid = {x_pl: data.validation.payloads, y_pl: data.validation.labels}
173 | _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_valid)
174 | epochs.append(data.train.epochs_completed)
175 | valid_loss.append(_loss)
176 | valid_accuracy.append(_acc)
177 | val_writer.add_summary(_summary, data.train.epochs_completed)
178 | current = valid_loss[-1]
179 | early_stop.on_epoch_end(data.train.epochs_completed, current)
180 | print("Epoch {} : Train Loss {:6.3f}, Train acc {:6.3f}, Valid loss {:6.3f}, Valid acc {:6.3f}"
181 | .format(data.train.epochs_completed, train_loss[-1], train_accuracy[-1], valid_loss[-1],
182 | valid_accuracy[-1]))
183 | if early_stop.stop_training:
184 | early_stop.on_train_end()
185 | break
186 |
187 |
188 | test_epoch = data.test.epochs_completed
189 | while data.test.epochs_completed == test_epoch:
190 | batch_size = 1000
191 | x_batch, y_batch = data.test.next_batch(batch_size)
192 | feed_dict_test = {x_pl: x_batch, y_pl: y_batch}
193 | _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_test)
194 | y_preds = sess.run(fetches=y, feed_dict=feed_dict_test)
195 | y_preds = tf.argmax(y_preds, axis=1).eval()
196 | y_true = tf.argmax(y_batch, axis=1).eval()
197 | # cm.batch_add(y_true, y_preds)
198 | test_loss.append(_loss)
199 | test_accuracy.append(_acc)
200 | # test_writer.add_summary(_summary, data.train.epochs_completed)
201 | print('Test Loss {:6.3f}, Test acc {:6.3f}'.format(
202 | np.mean(test_loss), np.mean(test_accuracy)))
203 | namestr += ":acc{:.3f}".format(np.mean(test_accuracy))
204 | acc_list.append("{:.3f}".format(np.mean(test_accuracy)))
205 | saver.save(sess, save_dir+'header_{0}_{1}_units.ckpt'.format(num_header, hidden_units))
206 | feed_dict_test = {x_pl: data.test.payloads, y_pl: data.test.labels}
207 | # Create ROC curve for all classes
208 | y_preds = sess.run(fetches=y, feed_dict=feed_dict_test)
209 | y_true = data.test.labels
210 | utils.plot_ROC(y_true, y_preds, num_classes, labels, micro=False, macro=False)
211 | # Compute different metrics for confusionmatrix and more
212 | y_preds = sess.run(fetches=y_, feed_dict=feed_dict_test)
213 | y_true = tf.argmax(data.test.labels, axis=1).eval()
214 | y_true = [labels[i] for i in y_true]
215 | y_preds = [labels[i] for i in y_preds]
216 | conf = metrics.confusion_matrix(y_true, y_preds, labels=labels)
217 | report = metrics.classification_report(y_true, y_preds, labels=labels)
218 | nostream_dict = ['http', 'https']
219 | y_stream_true = []
220 | y_stream_preds = []
221 | for i, v in enumerate(y_true):
222 | pred = y_preds[i]
223 | if v in nostream_dict:
224 | y_stream_true.append('non-streaming')
225 | else:
226 | y_stream_true.append('streaming')
227 | if pred in nostream_dict:
228 | y_stream_preds.append('non-streaming')
229 | else:
230 | y_stream_preds.append('streaming')
231 | stream_acc = len([v for i, v in enumerate(y_stream_preds) if v == y_stream_true[i]]) / len(
232 | y_stream_true)
233 |
234 | # Binarize the output
235 | y_stream_true1 = label_binarize(y_stream_true, classes=['non-streaming', 'streaming'])
236 | y_stream_preds1 = label_binarize(y_stream_preds, classes=['non-streaming', 'streaming'])
237 | n_classes = y_stream_true1.shape[1]
238 | # Stream/non-stream ROC curve
239 | utils.plot_ROC(y_stream_true1, y_stream_preds1, n_classes, ['non-streaming', 'streaming'], micro=False, macro=False)
240 |
241 | conf1 = metrics.confusion_matrix(y_true, y_preds, labels=labels)
242 | conf2 = metrics.confusion_matrix(y_stream_true, y_stream_preds, labels=['non-streaming', 'streaming'])
243 | report = metrics.classification_report(y_true, y_preds, labels=labels)
244 | report2 = metrics.classification_report(y_stream_true, y_stream_preds, labels=['non-streaming', 'streaming'])
245 |
246 | except KeyboardInterrupt:
247 | pass
248 | print(namestr)
249 | utils.plot_confusion_matrix(conf1, labels, save=False, title=namestr)
250 | utils.plot_confusion_matrix(conf2, ['non-streaming', 'streaming'], save=False,
251 | title="StreamNoStream_acc{}".format(stream_acc))
252 | print(report)
253 | print(report2)
254 | # utils.plot_metric_graph(x_list=epochs, y_list=valid_accuracy, save=False, x_label="epochs", y_label="Accuracy", title="Accuracy")
255 | # utils.plot_metric_graph(x_list=epochs, y_list=valid_loss, save=False, x_label="epochs", y_label="loss", title="Loss")
256 | # utils.plot_confusion_matrix(conf, labels, save=False, title=namestr)
257 | utils.show_plot()
258 |
259 | acc_list = list(map(float, acc_list))
260 | print(acc_list, hidden_units_train)
261 |
262 |
263 |
--------------------------------------------------------------------------------
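The checkpoint written by saver.save above can be reloaded for inference by importing the saved meta graph. A minimal TF 1.x sketch; the checkpoint path is a placeholder and the output tensor name depends on how ffn_layer names its ops internally:

    import numpy as np
    import tensorflow as tf

    ckpt = '../trained_models/8/5/2104_1530/header_8_5_units.ckpt'  # placeholder path
    with tf.Session() as sess:
        saver = tf.train.import_meta_graph(ckpt + '.meta')
        saver.restore(sess, ckpt)
        graph = tf.get_default_graph()
        x_pl = graph.get_tensor_by_name('xPlaceholder:0')
        y = graph.get_tensor_by_name('output_layer/Softmax:0')  # name depends on ffn_layer internals
        probs = sess.run(y, feed_dict={x_pl: np.zeros((1, 8 * 54), np.float32)})
        print(probs.argmax(axis=1))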
/train/train_logistic.py:
--------------------------------------------------------------------------------
1 | from sklearn.linear_model import LogisticRegression
2 | import tensorflow as tf
3 | from tf import tf_utils as tfu, dataset, early_stopping as es
4 | import numpy as np
5 | import datetime
6 | from sklearn import metrics
7 | import utils
8 | import operator
9 | from sklearn.metrics import classification_report
10 | import pandas as pd
11 | from collections import Counter
12 |
13 | def input_fn(data):
14 | return (data.payloads, data.labels)
15 |
16 |
17 | now = datetime.datetime.now()
18 |
19 | num_headers = 16
20 | hidden_units = 15
21 | train_dirs = ['/home/mclrn/Data/WindowsSalik/{0}/'.format(num_headers),
22 | '/home/mclrn/Data/WindowsChrome/{0}/'.format(num_headers),
23 | '/home/mclrn/Data/WindowsLinux/{0}/'.format(num_headers),
24 | '/home/mclrn/Data/WindowsFirefox/{0}/'.format(num_headers)]
25 |
26 | test_dirs = ['/home/mclrn/Data/WindowsAndreas/{0}/'.format(num_headers)]
27 |
28 | val = 0.1
29 | input_size = num_headers * 54
30 | seed = 0
31 | train_size = []
32 | data = dataset.read_data_sets(train_dirs, test_dirs, merge_data=True, one_hot=False,
33 | validation_size=val,
34 | test_size=0.1,
35 | balance_classes=False,
36 | payload_length=input_size,
37 | seed=seed)
38 |
39 | train_size.append(len(data.train.payloads))
40 | num_classes = len(dataset._label_encoder.classes_)
41 | labels = dataset._label_encoder.classes_
42 |
43 | classifier = LogisticRegression(verbose=2)
44 |
45 | classifier.fit(data.train.payloads, data.train.labels)
46 |
47 | score = classifier.score(data.test.payloads, data.test.labels)
48 |
49 | predict_proba = classifier.predict_proba(data.test.payloads)
50 |
51 | predict = classifier.predict(data.test.payloads)
52 |
53 | correct = list(map(operator.sub,predict,data.test.labels)).count(0)
54 |
55 |
56 | confusion = metrics.confusion_matrix(data.test.labels, predict)
57 | utils.plot_confusion_matrix(confusion, labels, save=True, title="Logistic regression confusion matrixacc")
58 | accuracy = correct/len(predict)
59 |
60 | report = classification_report(data.test.labels, predict)
61 | print(accuracy)
62 | sklearn_acc = metrics.accuracy_score(data.test.labels, predict)
63 | print("Accuracy calculated by sklearn.metrics: {}".format(sklearn_acc))
64 | most_occurences = Counter(data.test.labels).most_common(1)
65 | print("Naive classifier that just guesses the most frequents label will obtain an accuracy of: %s" % ((most_occurences[0][1])/len(data.test.labels)))
66 | print(report)
--------------------------------------------------------------------------------
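train_logistic.py above uses scikit-learn's LogisticRegression as a flat baseline over the concatenated header bytes. The pattern is easier to see on synthetic data; the sketch below uses made-up array shapes and labels (the real script feeds data.train.payloads of length num_headers * 54 and the labels produced by the dataset module), and the last line mirrors the Counter-based most-frequent-label baseline printed by the script:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

rng = np.random.RandomState(0)
n_train, n_test, n_features = 200, 50, 16 * 54   # 16 headers of 54 bytes each
X_train = rng.randint(0, 256, size=(n_train, n_features)).astype(np.float32)
y_train = rng.randint(0, 3, size=n_train)         # three made-up classes
X_test = rng.randint(0, 256, size=(n_test, n_features)).astype(np.float32)
y_test = rng.randint(0, 3, size=n_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Naive baseline that always predicts the most frequent class:
print("naive accuracy:", np.bincount(y_test).max() / len(y_test))
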
/train/train_payload.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | from tf import tf_utils as tfu, confusionmatrix as conf, dataset, early_stopping as es
3 | import numpy as np
4 | from sklearn import metrics
5 | import utils
6 | summaries_dir = '../tensorboard'
7 |
8 | train_dirs = ["E:/Data/h5/https/"]
9 | test_dirs = ["E:/Data/h5/netflix/"]
10 | pay_len = 1460
11 | hidden_units = 2000
12 | save_dir = "../trained_models/"
13 | seed = 1
14 |
15 | data = dataset.read_data_sets(train_dirs, test_dirs, merge_data=True, one_hot=True, validation_size=0.1, test_size=0.1, balance_classes=True, payload_length=pay_len, seed=seed)
16 | tf.reset_default_graph()
17 | num_classes = len(dataset._label_encoder.classes_)
18 | labels = dataset._label_encoder.classes_
19 |
20 | early_stop = es.EarlyStopping(patience=100, min_delta=0.05)
21 |
22 | cm = conf.ConfusionMatrix(num_classes, class_names=dataset._label_encoder.classes_)
23 | gpu_opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.45)
24 |
25 | x_pl = tf.placeholder(tf.float32, [None, pay_len], name='xPlaceholder')
26 | y_pl = tf.placeholder(tf.float64, [None, num_classes], name='yPlaceholder')
27 | y_pl = tf.cast(y_pl, tf.float32)
28 |
29 | x = tfu.ffn_layer('layer1', x_pl, hidden_units, activation=tf.nn.relu, seed=seed)
30 | x = tfu.ffn_layer('layer2', x, hidden_units, activation=tf.nn.relu, seed=seed)
31 | x = tfu.ffn_layer('layer3', x, hidden_units, activation=tf.nn.relu, seed=seed)
32 | y = tfu.ffn_layer('output_layer', x, hidden_units=num_classes, activation=tf.nn.softmax, seed=seed)
33 | y_ = tf.argmax(y, axis=1)
34 | # with tf.name_scope('cross_entropy'):
35 | # # The raw formulation of cross-entropy,
36 | # #
37 | # # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)),
38 | # # reduction_indices=[1]))
39 | # #
40 | # # can be numerically unstable.
41 | # #
42 | # # So here we use tf.losses.sparse_softmax_cross_entropy on the
43 | # # raw logit outputs of the nn_layer above.
44 | # with tf.name_scope('total'):
45 | # cross_entropy = tf.losses.softmax_cross_entropy(onehot_labels=y_pl, logits=y)
46 |
47 |
48 | with tf.variable_scope('loss'):
49 | # computing cross entropy per sample
50 | cross_entropy = -tf.reduce_sum(y_pl * tf.log(y + 1e-8), reduction_indices=[1])
51 |
52 | # averaging over samples
53 | cross_entropy = tf.reduce_mean(cross_entropy)
54 |
55 | tf.summary.scalar('cross_entropy', cross_entropy)
56 |
57 | with tf.variable_scope('training'):
58 | # defining our optimizer
59 | optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
60 |
61 | # applying the gradients
62 | train_op = optimizer.minimize(cross_entropy)
63 |
64 | with tf.variable_scope('performance'):
65 |     # making a boolean vector of correct (True) and incorrect (False) predictions
66 | correct_prediction = tf.equal(tf.argmax(y, axis=1), tf.argmax(y_pl, axis=1))
67 |
68 |     # averaging the boolean vector (cast to float) gives the accuracy
69 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
70 |
71 | tf.summary.scalar('accuracy', accuracy)
72 |
73 | # Merge all the summaries and write them out
74 | merged = tf.summary.merge_all()
75 | train_writer = tf.summary.FileWriter(summaries_dir + '/train')
76 | val_writer = tf.summary.FileWriter(summaries_dir + '/validation')
77 | test_writer = tf.summary.FileWriter(summaries_dir + '/test')
78 |
79 | # Training Loop
80 | batch_size = 100
81 | max_epochs = 200
82 |
83 | valid_loss, valid_accuracy = [], []
84 | train_loss, train_accuracy = [], []
85 | test_loss, test_accuracy = [], []
86 |
87 | with tf.Session() as sess:
88 | early_stop.on_train_begin()
89 | train_writer.add_graph(sess.graph)
90 | sess.run(tf.global_variables_initializer())
91 | total_parameters = 0
92 | print("Calculating trainable parameters!")
93 | for variable in tf.trainable_variables():
94 | # shape is an array of tf.Dimension
95 | shape = variable.get_shape()
96 | # print("Shape: {}".format(shape))
97 | variable_parameters = 1
98 | for dim in shape:
99 | variable_parameters *= dim.value
100 | print("Shape {0} gives {1} trainable parameters".format(shape, variable_parameters))
101 | total_parameters += variable_parameters
102 | print("Trainable parameters: {}".format(total_parameters))
103 | print('Begin training loop')
104 | saver = tf.train.Saver()
105 | try:
106 | while data.train.epochs_completed < max_epochs:
107 | _train_loss, _train_accuracy = [], []
108 |
109 | ## Run train op
110 | x_batch, y_batch = data.train.next_batch(batch_size)
111 | fetches_train = [train_op, cross_entropy, accuracy, merged]
112 | feed_dict_train = {x_pl: x_batch, y_pl: y_batch}
113 | _, _loss, _acc, _summary = sess.run(fetches_train, feed_dict_train)
114 |
115 | _train_loss.append(_loss)
116 | _train_accuracy.append(_acc)
117 | ## Compute validation loss and accuracy
118 | if data.train.epochs_completed % 1 == 0 \
119 | and data.train._index_in_epoch <= batch_size:
120 |
121 | train_writer.add_summary(_summary, data.train.epochs_completed)
122 |
123 | train_loss.append(np.mean(_train_loss))
124 | train_accuracy.append(np.mean(_train_accuracy))
125 |
126 | fetches_valid = [cross_entropy, accuracy, merged]
127 |
128 | feed_dict_valid = {x_pl: data.validation.payloads, y_pl: data.validation.labels}
129 | _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_valid)
130 |
131 | valid_loss.append(_loss)
132 | valid_accuracy.append(_acc)
133 | val_writer.add_summary(_summary, data.train.epochs_completed)
134 | current = valid_loss[-1]
135 | early_stop.on_epoch_end(data.train.epochs_completed, current)
136 | print("Epoch {} : Train Loss {:6.3f}, Train acc {:6.3f}, Valid loss {:6.3f}, Valid acc {:6.3f}"
137 | .format(data.train.epochs_completed, train_loss[-1], train_accuracy[-1], valid_loss[-1],
138 | valid_accuracy[-1]))
139 | if early_stop.stop_training:
140 | early_stop.on_train_end()
141 | break
142 |
143 |
144 | test_epoch = data.test.epochs_completed
145 | while data.test.epochs_completed == test_epoch:
146 | batch_size = 1000
147 | x_batch, y_batch = data.test.next_batch(batch_size)
148 | feed_dict_test = {x_pl: x_batch, y_pl: y_batch}
149 | _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_test)
150 | y_preds = sess.run(fetches=y, feed_dict=feed_dict_test)
151 | y_preds = tf.argmax(y_preds, axis=1).eval()
152 | y_true = tf.argmax(y_batch, axis=1).eval()
153 | # cm.batch_add(y_true, y_preds)
154 | test_loss.append(_loss)
155 | test_accuracy.append(_acc)
156 | # test_writer.add_summary(_summary, data.train.epochs_completed)
157 | print('Test Loss {:6.3f}, Test acc {:6.3f}'.format(
158 | np.mean(test_loss), np.mean(test_accuracy)))
159 | saver.save(sess, save_dir+'payload_{0}_{1}_units.ckpt'.format(pay_len, hidden_units))
160 | feed_dict_test = {x_pl: data.test.payloads, y_pl: data.test.labels}
161 | y_preds = sess.run(fetches=y_, feed_dict=feed_dict_test)
162 | y_true = tf.argmax(data.test.labels, axis=1).eval()
163 | y_true = [labels[i] for i in y_true]
164 | y_preds = [labels[i] for i in y_preds]
165 | conf = metrics.confusion_matrix(y_true, y_preds, labels=labels)
166 |
167 | except KeyboardInterrupt:
168 | pass
169 | # print(namestr)
170 | utils.plot_confusion_matrix(conf, labels, save=True, title='payload_{0}_{1}_units_acc{2}'.format(pay_len, hidden_units, np.mean(test_accuracy)))
171 | # acc_list = list(map(float, acc_list))
172 | # print(acc_list, train_size)
173 | #
174 | # with tf.Session() as sess:
175 | # sess.run(tf.global_variables_initializer())
176 | # print('Begin training loop')
177 | #
178 | # try:
179 | # while data.train.epochs_completed < max_epochs:
180 | # _train_loss, _train_accuracy = [], []
181 | #
182 | # ## Run train op
183 | # x_batch, y_batch = data.train.next_batch(batch_size)
184 | # fetches_train = [train_op, cross_entropy, accuracy, merged]
185 | # feed_dict_train = {x_pl: x_batch, y_pl: y_batch}
186 | # _, _loss, _acc, _summary = sess.run(fetches_train, feed_dict_train)
187 | #
188 | # _train_loss.append(_loss)
189 | # _train_accuracy.append(_acc)
190 | # ## Compute validation loss and accuracy
191 | # if data.train.epochs_completed % 1 == 0 \
192 | # and data.train._index_in_epoch <= batch_size:
193 | #
194 | # train_writer.add_summary(_summary, data.train.epochs_completed)
195 | #
196 | # train_loss.append(np.mean(_train_loss))
197 | # train_accuracy.append(np.mean(_train_accuracy))
198 | #
199 | # fetches_valid = [cross_entropy, accuracy, merged]
200 | #
201 | # feed_dict_valid = {x_pl: data.validation.payloads, y_pl: data.validation.labels}
202 | # _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_valid)
203 | #
204 | # valid_loss.append(_loss)
205 | # valid_accuracy.append(_acc)
206 | # val_writer.add_summary(_summary, data.train.epochs_completed)
207 | # print(
208 | # "Epoch {} : Train Loss {:6.3f}, Train acc {:6.3f}, Valid loss {:6.3f}, Valid acc {:6.3f}".format(
209 | # data.train.epochs_completed, train_loss[-1], train_accuracy[-1], valid_loss[-1],
210 | # valid_accuracy[-1]))
211 | # test_epoch = data.test.epochs_completed
212 | # while data.test.epochs_completed == test_epoch:
213 | # x_batch, y_batch = data.test.next_batch(batch_size)
214 | # feed_dict_test = {x_pl: x_batch, y_pl: y_batch}
215 | # _loss, _acc, _summary = sess.run(fetches_valid, feed_dict_test)
216 | # y_preds = sess.run(fetches=y, feed_dict=feed_dict_test)
217 | # y_preds = tf.argmax(y_preds, axis=1).eval()
218 | # y_true = tf.argmax(y_batch, axis=1).eval()
219 | # cm.batch_add(y_true, y_preds)
220 | # test_loss.append(_loss)
221 | # test_accuracy.append(_acc)
222 | # # test_writer.add_summary(_summary, data.train.epochs_completed)
223 | # print('Test Loss {:6.3f}, Test acc {:6.3f}'.format(
224 | # np.mean(test_loss), np.mean(test_accuracy)))
225 | #
226 | #
227 | # except KeyboardInterrupt:
228 | # pass
229 |
230 |
231 | # print(cm)
232 | #
233 | # # df = utils.load_h5(dir + filename+'.h5', key=filename.split('-')[0])
234 | # # print("Load done!")
235 | # # print(df.shape)
236 | # # payloads = df['payload'].values
237 | # # payloads = utils.pad_elements_with_zero(payloads)
238 | # # df['payload'] = payloads
239 | # #
240 | # # # Converting hex string to list of int... Maybe takes to long?
241 | # # payloads = [[int(i, 16) for i in list(x)] for x in payloads]
242 | # # np_payloads = numpy.array(payloads)
243 | # # # dataset = DataSet(np_payloads, df['label'].values)
244 | # # x, y = dataset.next_batch(10)
245 | # # batch_size = 100
246 | # # features = {'payload': payloads}
247 | # #
248 | # #
249 | # # gb = df.groupby(['ip.dst', 'ip.src', 'port.dst', 'port.src'])
250 | # #
251 | # #
252 | # # l = dict(list(gb))
253 | # #
254 | # #
255 | # # s = [[k, len(v)] for k, v in sorted(l.items(), key=lambda x: len(x[1]), reverse=True)]
256 | #
257 | #
258 | # print("DONE")
259 |
--------------------------------------------------------------------------------
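The commented-out block near the top of train_payload.py notes that the hand-written -sum(y * log(softmax)) loss can be numerically unstable. The usual remedy in TF 1.x is to let TensorFlow fuse softmax and cross-entropy on the raw logits; a sketch of that alternative is below. The placeholder names are made up for illustration, and the output layer would have to return raw logits (no softmax activation) for this to apply:

import tensorflow as tf

num_classes = 4  # made-up for illustration
y_pl = tf.placeholder(tf.float32, [None, num_classes], name='yPlaceholder')
logits = tf.placeholder(tf.float32, [None, num_classes], name='logits')  # pre-softmax outputs

# Fused, numerically stable cross-entropy on raw logits (TF 1.x API)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_pl, logits=logits))

# Predictions still come from softmax (or directly from argmax of the logits)
y = tf.nn.softmax(logits)
y_hat = tf.argmax(y, axis=1)
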
/utils.py:
--------------------------------------------------------------------------------
1 | import multiprocessing
2 | from sklearn import metrics
3 | from scapy.all import *
4 | import matplotlib.pyplot as plt
5 | import os
6 | import numpy as np
7 | import time
8 | import pandas as pd
9 | import glob
10 | from scipy import interp
11 |
12 |
13 | def filter_pcap_by_ip(dir, filename, ip_list, label):
14 | '''
15 |     This method can be used to extract certain packets (associated with specified ip addresses) from a pcap file.
16 |     The method expects user knowledge of the communication protocol used by the specified ip addresses for the label to be correct.
17 |
18 | The label is intended to be either http or https.
19 | :param dir: The directory in which the pcap file is located
20 | :param filename: The name of the pcap file that should be loaded
21 |     :param ip_list: A list of ip addresses of interest. Packets with either ip.src or ip.dst in ip_list will be extracted
22 |     :param label: The label that should be applied to the extracted packets. Note that the ip_list should contain
23 |     addresses that we know communicate over http or https in order to match with the label
24 | :return: A dataframe containing information about the extracted packets from the pcap file.
25 | '''
26 | time_s = time.clock()
27 | count = 0
28 | print("Read PCAP, label is %s" % label)
29 | data = rdpcap(dir + filename + '.pcap')
30 | totalPackets = len(data)
31 | percentage = int(totalPackets / 100)
32 | # Workaround/speedup for pandas append to dataframe
33 | frametimes = []
34 | dsts = []
35 | srcs = []
36 | protocols = []
37 | dports = []
38 | sports = []
39 | bytes = []
40 | labels = []
41 |
42 | print("Total packages: %d" % totalPackets)
43 | time_r = time.clock()
44 | time_read = time_r - time_s
45 | print("Time to read PCAP: " + str(time_read))
46 | for packet in data:
47 | if IP in packet and \
48 | (UDP in packet or TCP in packet) and \
49 | (packet[IP].dst in ip_list or packet[IP].src in ip_list):
50 | ip_layer = packet[IP]
51 | transport_layer = ip_layer.payload
52 | frametimes.append(packet.time)
53 | dsts.append(ip_layer.dst)
54 | srcs.append(ip_layer.src)
55 | protocols.append(transport_layer.name)
56 | dports.append(transport_layer.dport)
57 | sports.append(transport_layer.sport)
58 | # Save the raw byte string
59 | raw_payload = raw(packet)
60 | bytes.append(raw_payload)
61 | labels.append(label)
62 | if (count % (percentage * 5) == 0):
63 | print(str(count / percentage) + '%')
64 | count += 1
65 | time_t = time.clock()
66 | print("Time spend: %ds" % (time_t - time_r))
67 | d = {'time': frametimes,
68 | 'ip.dst': dsts,
69 | 'ip.src': srcs,
70 | 'protocol': protocols,
71 | 'port.dst': dports,
72 | 'port.src': sports,
73 | 'bytes': bytes,
74 | 'label': labels}
75 | df = pd.DataFrame(data=d)
76 | time_e = time.clock()
77 | total_time = time_e - time_s
78 | print("Time to convert PCAP to dataframe: " + str(total_time))
79 | return df
80 |
81 |
82 | def load_h5(dir, filename):
83 | timeS = time.clock()
84 | df = pd.read_hdf(dir + "/" + filename, key=filename.split('-')[0])
85 | timeE = time.clock()
86 | loadTime = timeE - timeS
87 | print("Time to load " + filename + ": " + str(loadTime))
88 | return df
89 |
90 |
91 | def plotHex(hexvalues, filename):
92 | '''
93 | Plot an example as an image
94 | hexvalues: list of byte values
95 |     If hexvalues contains more than one list, an element-wise average over all of them is plotted
96 | '''
97 |
98 | size = 39
99 | hex_placeholder = [0] * (size * size) # create placeholder of correct size
100 |
101 | if (type(hexvalues[0]) is np.ndarray):
102 | print("Multiple payloads")
103 | for hex_list in hexvalues:
104 |             hex_placeholder[0:len(hex_list)] += hex_list  # accumulate the payload bytes onto the placeholder
105 | hex_placeholder = np.array(hex_placeholder) / len(hexvalues) # average the elements of the placeholder
106 | else:
107 | print("Single payload")
108 |         hex_placeholder[0:len(hexvalues)] = hexvalues  # overwrite the leading zeros with the payload bytes
109 |
110 | canvas = np.reshape(np.array(hex_placeholder), (size, size))
111 | plt.figure(figsize=(4, 4))
112 | plt.axis('off')
113 | plt.imshow(canvas, cmap='gray')
114 | plt.title(filename)
115 | plt.show()
116 | return canvas
117 |
118 |
119 | def pad_string_elements_with_zero(payloads):
120 | # Assume max payload to be 1460 bytes but as each byte is now 2 hex digits we take double length
121 | max_payload_len = 1460 * 2
122 | # Pad with '0'
123 | payloads = [s.ljust(max_payload_len, '0') for s in payloads]
124 | return payloads
125 |
126 |
127 | def pad_arrays_with_zero(payloads, payload_length=810):
128 | tmp_payloads = []
129 | for x in payloads:
130 | payload = np.zeros(payload_length, dtype=np.uint8)
131 | # pl = np.fromstring(x, dtype=np.uint8)
132 | payload[:x.shape[0]] = x
133 | tmp_payloads.append(payload)
134 |
135 | # payloads = [np.fromstring(x) for x in payloads]
136 | return np.array(tmp_payloads)
137 |
138 |
139 | def hash_elements(payloads):
140 | return payloads
141 |
142 |
143 | def packetanonymizer(packet):
144 | """"
145 | Takes a packet as a bytestring in hex format and convert to unsigned 8bit integers [0-255]
146 | Sets the header fields which contain MAC, IP and Port information to 0
147 | """
148 | # Should work with TCP and UDP
149 |
150 | p = np.fromstring(packet, dtype=np.uint8)
151 | # set MACs to 0
152 | p[0:12] = 0
153 | # Remove IP checksum
154 | p[24:26] = 0
155 | # set IPs to 0
156 | p[26:34] = 0
157 | # set ports to 0
158 | p[34:36] = 0
159 | p[36:38] = 0
160 |
161 | # IP protocol field check if TCP
162 | if p[23] == 6:
163 | #Remove TCP checksum
164 | p[50:52] = 0
165 | else:
166 | # Remove UDP checksum
167 | p[40:42] = 0
168 | return p
169 |
170 |
171 | def extractdatapoints(dataframe, filename, num_headers=15, session=True):
172 | """"
173 | Extracts the concatenated header datapoints from a dataframe while anonomizing the individual header
174 | :returns a dataframe with datapoints (bytes) and labels
175 | """
176 | group_by = dataframe.sort_values(['time']).groupby(['ip.dst', 'ip.src', 'port.dst', 'port.src'])
177 | gb_dict = dict(list(group_by))
178 | data_points = []
179 | labels = []
180 | filenames = []
181 | sessions = []
182 | done = set()
183 | num_too_short = 0
184 | # Iterate over sessions
185 | for k, v in gb_dict.items():
186 | # v is a DataFrame
187 |         # k is a tuple (ip.dst, ip.src, port.dst, port.src)
188 | if k in done:
189 | continue
190 | done.add(k)
191 | if session:
192 | other_direction_key = (k[1], k[0], k[3], k[2])
193 | other_direction = gb_dict[other_direction_key]
194 | v = pd.concat([v, other_direction]).sort_values(['time'])
195 | done.add(other_direction_key)
196 | if len(v) < num_headers:
197 | num_too_short += 1
198 | continue
199 | # extract num_headers of packets (one row each)
200 | packets = v['bytes'].values[:num_headers]
201 | headers = []
202 | label = v['label'].iloc[0]
203 | protocol = v['protocol'].iloc[0]
204 |
205 | # Skip session if UDP and not youtube
206 | if protocol == 'UDP' and label != 'youtube':
207 | continue
208 | # For each packet
209 | packetindex = 0
210 | headeradded = 0
211 |
212 | while headeradded < num_headers:
213 | if packetindex < len(packets):
214 | p = packets[packetindex]
215 | packetindex += 1
216 | else:
217 | break
218 | p_an = packetanonymizer(p)
219 |
220 | # assuming a session utilize the same protocol throughout
221 | # Extract headers (TCP = 54 Bytes, UDP = 42 Bytes - Maybe + 4 Bytes for VLAN tagging) from x first packets of session/flow
222 | header = np.zeros(54, dtype=np.uint8)
223 | if protocol == 'TCP':
224 | # TCP
225 | header[:54] = p_an[:54]
226 | else:
227 | # UDP
228 | header[:42] = p_an[:42] # pad zeros
229 |
230 | # Skip if header packet is fragmented
231 | if (0 < header[20] < 64) or header[21] != 0:
232 | continue
233 | headers.append(header)
234 | headeradded += 1
235 |
236 | # Concatenate headers as the feature vector
237 | if len(headers) == num_headers:
238 | feature_vector = np.concatenate(headers).ravel()
239 | data_points.append(feature_vector)
240 | labels.append(label)
241 | filenames.append(filename)
242 | sessions.append(k)
243 | d = {'filename': filenames, 'session': sessions, 'bytes': data_points, 'label': labels}
244 | return pd.DataFrame(data=d)
245 |
246 |
247 | def saveheaderstask(filelist, num_headers, session, dataframes):
248 | datapointslist = []
249 | for fullname in filelist:
250 | load_dir, filename = os.path.split(fullname)
251 | print("Loading: {0}".format(filename))
252 | df = load_h5(load_dir, filename)
253 | datapoints = extractdatapoints(df, filename, num_headers, session)
254 | datapointslist.append(datapoints)
255 |
256 | # Extend the shared dataframe
257 | dataframes.extend(datapointslist)
258 |
259 |
260 | def saveextractedheaders(load_dir, save_dir, savename, num_headers=15, session=True):
261 | """"
262 | Extracts datapoints from all .h5 files in train_dir and saves the them in a new .h5 file
263 | :param load_dir: The directory to load from
264 | :param save_dir: The directory to save the extracted headers
265 | :param savename: The filename to save
266 | :param num_headers: The amount of headers to use as datapoint
267 | :param session: session or flow
268 | """
269 | manager = multiprocessing.Manager()
270 | dataframes = manager.list()
271 | filelist = glob.glob(load_dir + '*.h5')
272 | filesplits = split_list(filelist, 4)
273 |
274 | threads = []
275 | for split in filesplits:
276 |         # create a worker process for each chunk of files
277 | t = multiprocessing.Process(target=saveheaderstask, args=(split, num_headers, session, dataframes))
278 | threads.append(t)
279 | t.start()
280 |     # wait for all worker processes to finish, then build one large dataframe
281 |
282 | for t in threads:
283 | t.join()
284 | print("Process joined: ", t)
285 | data = pd.concat(dataframes)
286 | key = savename.split('-')[0]
287 | if not os.path.exists(save_dir):
288 | os.makedirs(save_dir)
289 | data.to_hdf(save_dir + savename + '.h5', key=key, mode='w')
290 |
291 |
292 | # saveextractedheaders('./', 'extracted-0103_1136')
293 | # read_pcap('../Data/', 'drtv-2302_1031')
294 | def split_list(lst, chunks):
295 |     '''
296 |     Takes a list and splits it into equal sized chunks.
297 |     :param lst: list to split
298 |     :param chunks: number of chunks (int)
299 |     :return: a list containing chunks (lists) as elements
300 |     '''
301 |     avg = len(lst) / float(chunks)
302 |     out = []
303 |     last = 0.0
304 | 
305 |     while last < len(lst):
306 |         out.append(lst[int(last):int(last + avg)])
307 | last += avg
308 |
309 | return out
310 |
311 |
312 | def plot_confusion_matrix(cm, classes,
313 | normalize=False,
314 | title='Confusion matrix',
315 | cmap=plt.cm.get_cmap(name='Blues'), save=False):
316 | """
317 | This function prints and plots the confusion matrix.
318 | Normalization can be applied by setting `normalize=True`.
319 | """
320 | import itertools
321 | from matplotlib import rcParams
322 | # Make room for xlabel which is otherwise cut off
323 | rcParams.update({'figure.autolayout': True})
324 |
325 | if normalize:
326 | cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
327 | print("Normalized confusion matrix")
328 | else:
329 | print('Confusion matrix, without normalization')
330 |
331 | print(cm)
332 | plt.figure()
333 | plt.imshow(cm, interpolation='nearest', cmap=cmap)
334 |     plt.title("Accuracy: {0}".format(title.split("acc")[1]) if "acc" in title else title)
335 | plt.colorbar()
336 | tick_marks = np.arange(len(classes))
337 | plt.xticks(tick_marks, classes, rotation='vertical')
338 | plt.yticks(tick_marks, classes)
339 |
340 | fmt = '.2f' if normalize else 'd'
341 | thresh = cm.max() / 1.5
342 | # thresh = 2260
343 | for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
344 | plt.text(j, i, format(cm[i, j], fmt),
345 | horizontalalignment="center",
346 | color="white" if cm[i, j] > thresh else "black")
347 |
348 | plt.tight_layout()
349 | plt.ylabel('True label')
350 | plt.xlabel('Predicted label')
351 | if save:
352 | i = 0
353 | filename = "{}".format(title)
354 | while os.path.exists('{}{:d}.png'.format(filename, i)):
355 | i += 1
356 | plt.savefig('{}{:d}.png'.format(filename, i), dpi=300)
357 | plt.draw()
358 | # plt.gcf().clear()
359 |
360 |
361 | def plot_metric_graph(x_list, y_list, x_label="Datapoints", y_label="Accuracy",
362 | title='Metric list', save=False):
363 | from matplotlib import rcParams
364 | # Make room for xlabel which is otherwise cut off
365 | rcParams.update({'figure.autolayout': True})
366 | plt.figure()
367 | plt.plot(x_list, y_list, label="90/10 split")
368 | # plt.plot(x_list2, y_list2, label="Seperate testset")
369 | # Calculate min and max of y scale
370 | ymin = np.min(y_list)
371 | ymin = np.floor(ymin * 10) / 10
372 | ymax = np.max(y_list)
373 | ymax = np.ceil(ymax * 10) / 10
374 | plt.ylim(ymin, ymax)
375 | plt.title("{0}".format(title))
376 | plt.tight_layout()
377 | plt.ylabel(y_label)
378 | plt.xlabel(x_label)
379 | # plt.legend()
380 | if save:
381 | i = 0
382 | filename = "{}".format(title)
383 | while os.path.exists('{}{:d}.png'.format(filename, i)):
384 | i += 1
385 | plt.savefig('{}{:d}.png'.format(filename, i), dpi=300)
386 | plt.draw()
387 | # plt.gcf().clear()
388 |
389 | def plot_class_ROC(fpr, tpr, roc_auc, class_idx, labels):
390 | from matplotlib import rcParams
391 | # Make room for xlabel which is otherwise cut off
392 | rcParams.update({'figure.autolayout': True})
393 | plt.figure()
394 | lw = 2
395 | plt.plot(fpr[class_idx], tpr[class_idx], color='darkorange',
396 | lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[class_idx])
397 | plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
398 | plt.xlim([0.0, 1.0])
399 | plt.ylim([0.0, 1.05])
400 | plt.xlabel('False Positive Rate')
401 | plt.ylabel('True Positive Rate')
402 | plt.title('Receiver operating characteristic of class {}'.format(labels[class_idx]))
403 | plt.legend(loc="lower right")
404 | plt.tight_layout()
405 |
406 | def plot_multi_ROC(fpr, tpr, roc_auc, num_classes, labels, micro=True, macro=True):
407 | from matplotlib import rcParams
408 | # Make room for xlabel which is otherwise cut off
409 | rcParams.update({'figure.autolayout': True})
410 | # Plot all ROC curves
411 | plt.figure()
412 | lw = 2
413 | if micro:
414 | plt.plot(fpr["micro"], tpr["micro"],
415 | label='micro-average ROC curve (area = {0:0.2f})'
416 | ''.format(roc_auc["micro"]),
417 | color='deeppink', linestyle=':', linewidth=4)
418 | if macro:
419 | plt.plot(fpr["macro"], tpr["macro"],
420 | label='macro-average ROC curve (area = {0:0.2f})'
421 | ''.format(roc_auc["macro"]),
422 | color='navy', linestyle=':', linewidth=4)
423 | color_map = {0: '#487fff', 1: '#2ee3ff', 2: '#4eff4e', 3: '#ffca43', 4: '#ff365e', 5: '#d342ff', 6: '#626663'}
424 | # colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
425 | for i in range(num_classes):
426 | plt.plot(fpr[i], tpr[i], color=color_map[i], lw=lw,
427 | label='ROC curve of {0} (area = {1:0.2f})'
428 | ''.format(labels[i], roc_auc[i]))
429 |
430 | plt.plot([0, 1], [0, 1], 'k--', lw=lw)
431 | plt.xlim([0.0, 1.0])
432 | plt.ylim([0.0, 1.05])
433 | plt.xlabel('False Positive Rate')
434 | plt.ylabel('True Positive Rate')
435 | plt.title('Receiver operating characteristic of all classes')
436 | plt.legend(loc="lower right")
437 | plt.tight_layout()
438 |
439 |
440 | def plot_ROC(y_true, y_preds, num_classes, labels, micro=True, macro=True):
441 | # Compute ROC curve and ROC area for each class
442 | fpr = dict()
443 | tpr = dict()
444 | roc_auc = dict()
445 | for i in range(num_classes):
446 | fpr[i], tpr[i], _ = metrics.roc_curve(y_true[:, i], y_preds[:, i])
447 | roc_auc[i] = metrics.auc(fpr[i], tpr[i])
448 | if micro:
449 | # Compute micro-average ROC curve and ROC area
450 | fpr["micro"], tpr["micro"], _ = metrics.roc_curve(y_true.ravel(), y_preds.ravel())
451 | roc_auc["micro"] = metrics.auc(fpr["micro"], tpr["micro"])
452 | if macro:
453 | # Compute macro-average ROC curve and ROC area
454 | # First aggregate all false positive rates
455 | all_fpr = np.unique(np.concatenate([fpr[i] for i in range(num_classes)]))
456 |         # Then interpolate all ROC curves at these points
457 | mean_tpr = np.zeros_like(all_fpr)
458 | for i in range(num_classes):
459 | mean_tpr += interp(all_fpr, fpr[i], tpr[i])
460 | # Finally average it and compute AUC
461 | mean_tpr /= num_classes
462 | fpr["macro"] = all_fpr
463 | tpr["macro"] = mean_tpr
464 | roc_auc["macro"] = metrics.auc(fpr["macro"], tpr["macro"])
465 | for i in range(num_classes):
466 | plot_class_ROC(fpr, tpr, roc_auc, i, labels)
467 | plot_multi_ROC(fpr, tpr, roc_auc, num_classes, labels, micro, macro)
468 |
469 |
470 | def show_plot():
471 | plt.show()
472 | # y_list_test, x_list_test = [0.265, 0.572, 0.636, 0.704, 0.746, 0.763, 0.769, 0.784, 0.789, 0.818, 0.82, 0.817, 0.831, 0.845, 0.846, 0.848, 0.859, 0.887, 0.876], [54, 268, 536, 1072, 1608, 2144, 2680, 3216, 3752, 4288, 4824, 5360, 8039, 13398, 18758, 24117, 29476, 34835, 40194]
473 | # y_list_merge, x_list_merge = [0.269, 0.75, 0.79, 0.818, 0.833, 0.844, 0.856, 0.866, 0.87, 0.868, 0.883, 0.881, 0.902, 0.925, 0.923, 0.941, 0.946, 0.943], [59, 291, 582, 1163, 1744, 2325, 2906, 3487, 4068, 4649, 5230, 5811, 8717, 14528, 20339, 26150, 31961, 37772]
474 | # plot_metric_graph(x_list_merge, y_list_merge, x_list_test, y_list_test, title="#Training Datapoints vs. Accuracy", save=True)
--------------------------------------------------------------------------------
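Two of the small helpers in utils.py are easiest to understand from a toy call: split_list divides the file list into roughly equal chunks for the worker processes, and pad_arrays_with_zero right-pads variable-length byte arrays to a fixed feature length. A quick sketch with made-up inputs, assuming utils.py and its dependencies (scapy, matplotlib, pandas) are importable from the working directory:

import numpy as np
import utils  # the module above

# split_list: four chunks of roughly equal size for the four worker processes
files = ['a.h5', 'b.h5', 'c.h5', 'd.h5', 'e.h5', 'f.h5']
print(utils.split_list(files, 4))
# -> e.g. [['a.h5'], ['b.h5', 'c.h5'], ['d.h5'], ['e.h5', 'f.h5']] (chunk boundaries depend on rounding)

# pad_arrays_with_zero: every payload becomes a fixed-length uint8 vector
payloads = [np.array([1, 2, 3], dtype=np.uint8), np.array([9] * 5, dtype=np.uint8)]
padded = utils.pad_arrays_with_zero(payloads, payload_length=8)
print(padded.shape)   # (2, 8)
print(padded[0])      # [1 2 3 0 0 0 0 0]
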
/visualization/classes_module.py:
--------------------------------------------------------------------------------
1 | import copy
2 | import numpy
3 | from visualization import vis_utils
4 |
5 | # -------------------------
6 | # Feed-forward network
7 | # -------------------------
8 | class Network:
9 |
10 | def __init__(self, layers):
11 | self.layers = layers
12 |
13 | def forward(self, Z):
14 | for l in self.layers:
15 | Z = l.forward(Z)
16 | return Z
17 |
18 | def gradprop(self, DZ):
19 | for l in self.layers[::-1]:
20 | DZ = l.gradprop(DZ)
21 | return DZ
22 |
23 | def relprop(self, R):
24 | for l in self.layers[::-1]:
25 | R = l.relprop(R)
26 | return R
27 |
28 |
29 |
30 | # -------------------------
31 | # ReLU activation layer
32 | # -------------------------
33 | class ReLU:
34 |
35 | def forward(self, X):
36 | self.Z = X > 0
37 | return X*self.Z
38 |
39 | def gradprop(self, DY):
40 | return DY*self.Z
41 |
42 | def relprop(self, R):
43 | return R
44 |
45 | # -------------------------
46 | # Fully-connected layer
47 | # -------------------------
48 | class Linear:
49 |
50 | def __init__(self, W, b):
51 | self.W = W
52 | self.B = b
53 |
54 | def forward(self, X):
55 | self.X = X
56 | return numpy.dot(self.X, self.W)+self.B
57 |
58 | def gradprop(self, DY):
59 | self.DY = DY
60 | return numpy.dot(self.DY, self.W.T)
61 |
62 | def relprop(self, R):
63 | V = numpy.maximum(0, self.W)
64 | Z = numpy.dot(self.X, V) + 1e-9
65 | S = R / Z
66 | C = numpy.dot(S, V.T)
67 | R = self.X * C
68 | return R
69 |
70 |
71 | class AlphaBetaLinear:
72 |
73 | def __init__(self, W, b, alpha):
74 | self.W = W
75 | self.B = b
76 | self.alpha = alpha
77 | self.beta = alpha -1
78 |
79 | def forward(self, X):
80 | self.X = X
81 | return numpy.dot(self.X, self.W)+self.B
82 |
83 | def gradprop(self, DY):
84 | self.DY = DY
85 | return numpy.dot(self.DY, self.W.T)
86 |
87 | def relprop(self, R):
88 | pself = copy.deepcopy(self)
89 | nself = copy.deepcopy(self)
90 | pself.B *= 0
91 | pself.W = numpy.maximum( 1e-9, pself.W)
92 | nself.B *= 0
93 |         nself.W = numpy.minimum(-1e-9, self.W)
94 |
95 | X = self.X + 1e-9
96 | ZA = pself.forward(X)
97 | SA = self.alpha*R/ZA
98 | ZB = nself.forward(X)
99 | SB = -self.beta *R/ZB
100 | R = X * (pself.gradprop(SA) + nself.gradprop(SB))
101 | return R
102 |
103 | class FirstLinear(Linear):
104 |
105 | def __init__(self, W, b):
106 | self.W = W
107 | self.B = b
108 |
109 | def relprop(self, R):
110 | W, V, U = self.W, numpy.maximum(0, self.W), numpy.minimum(0, self.W)
111 | X, L, H = self.X, self.X * 0 + vis_utils.lowest, self.X * 0 + vis_utils.highest
112 |
113 | Z = numpy.dot(X, W) - numpy.dot(L, V) - numpy.dot(H, U) + 1e-9;
114 | S = R / Z
115 | R = X * numpy.dot(S, W.T) - L * numpy.dot(S, V.T) - H * numpy.dot(S, U.T)
116 | return R
117 |
118 |
119 | # # -------------------------
120 | # # Sum-pooling layer
121 | # # -------------------------
122 | # class Pooling:
123 | #
124 | # def forward(self, X):
125 | # self.X = X
126 | # self.Y = 0.5*(X[:, ::2, ::2, :]+X[:, ::2, 1::2, :]+X[:, 1::2, ::2, :]+X[:, 1::2, 1::2, :])
127 | # return self.Y
128 | #
129 | # def gradprop(self, DY):
130 | # self.DY = DY
131 | # DX = self.X*0
132 | # for i, j in [(0, 0), (0, 1), (1, 0), (1, 1)]:
133 | # DX[:, i::2, j::2, :] += DY*0.5
134 | # return DX
135 | #
136 | # # -------------------------
137 | # # Convolution layer
138 | # # -------------------------
139 | # class Convolution:
140 | #
141 | # def __init__(self,name):
142 | # wshape = map(int,list(name.split("-")[-1].split("x")))
143 | # self.W = numpy.loadtxt(name+'-W.txt').reshape(wshape)
144 | # self.B = numpy.loadtxt(name+'-B.txt')
145 | #
146 | # def forward(self,X):
147 | #
148 | # self.X = X
149 | # mb,wx,hx,nx = X.shape
150 | # ww,hw,nx,ny = self.W.shape
151 | # wy,hy = wx-ww+1,hx-hw+1
152 | #
153 | # Y = numpy.zeros([mb,wy,hy,ny],dtype='float32')
154 | #
155 | # for i in range(ww):
156 | # for j in range(hw):
157 | # Y += numpy.dot(X[:,i:i+wy,j:j+hy,:],self.W[i,j,:,:])
158 | #
159 | # return Y+self.B
160 | #
161 | # def gradprop(self,DY):
162 | #
163 | # self.DY = DY
164 | # mb,wy,hy,ny = DY.shape
165 | # ww,hw,nx,ny = self.W.shape
166 | #
167 | # DX = self.X*0
168 | #
169 | # for i in range(ww):
170 | # for j in range(hw):
171 | # DX[:,i:i+wy,j:j+hy,:] += numpy.dot(DY,self.W[i,j,:,:].T)
172 | #
173 | # return DX
174 |
--------------------------------------------------------------------------------
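The classes above form a small feed-forward network whose relprop methods perform layer-wise relevance propagation (LRP): a relevance signal is passed backwards from the output, layer by layer, redistributing the selected prediction score over the input bytes. A minimal sketch with random weights (the layer sizes are made up; in the thesis code the weights come from the trained TensorFlow model, as in visualize_activations.py, and the repository root is assumed to be on PYTHONPATH):

import numpy as np
from visualization import classes_module as md

rng = np.random.RandomState(0)
n_in, n_hidden, n_out = 54, 8, 3          # made-up layer sizes
W1, b1 = rng.randn(n_in, n_hidden), np.zeros(n_hidden)
W2, b2 = rng.randn(n_hidden, n_out), np.zeros(n_out)

nn = md.Network([md.Linear(W1, b1), md.ReLU(),
                 md.Linear(W2, b2), md.ReLU()])

x = rng.rand(n_in)                         # one fake input sample
y = nn.forward(x)                          # forward pass stores activations inside each layer
t = np.eye(n_out)[y.argmax()]              # one-hot mask selecting the predicted class

relevance = nn.relprop(y * t)              # redistribute the selected output score to the inputs
sensitivity = nn.gradprop(t) ** 2          # squared input gradient of the selected class
print(relevance.shape, sensitivity.shape)  # both (54,)
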
/visualization/heatmap.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from sklearn.preprocessing import LabelEncoder
4 | from sklearn.model_selection import train_test_split
5 | import utils
6 | import glob, os
7 | import pca.dataanalyzer as da
8 |
9 |
10 | # visualize the important characteristics of the dataset
11 | import matplotlib.pyplot as plt
12 |
13 | data_len = 1460
14 | seed = 0
15 | # dirs = ["C:/Users/salik/Documents/Data/LinuxChrome/{}/".format(num_headers),
16 | # "C:/Users/salik/Documents/Data/WindowsFirefox/{}/".format(num_headers),
17 | # "C:/Users/salik/Documents/Data/WindowsChrome/{}/".format(num_headers),
18 | # "C:/Users/salik/Documents/Data/WindowsSalik/{}/".format(num_headers),
19 | # "C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
20 | dirs = ["E:/Data/h5/https/", "E:/Data/h5/netflix/"]
21 |
22 | # step 1: get the data
23 | dataframes = []
24 | num_examples = 0
25 | for dir in dirs:
26 | for fullname in glob.iglob(dir + '*.h5'):
27 | filename = os.path.basename(fullname)
28 | df = utils.load_h5(dir, filename)
29 | dataframes.append(df)
30 | num_examples = len(df.values)
31 | # create one large dataframe
32 | data = pd.concat(dataframes)
33 | data = data.sample(frac=1, random_state=seed).reset_index(drop=True)
34 | num_rows = data.shape[0]
35 | columns = data.columns
36 | print(columns)
37 |
38 | # step 2: get x and convert it to numpy array
39 | x = da.getbytes(data, data_len)
40 |
41 | # step 3: get class labels y and then encode it into number
42 | # get class label data
43 | y = data['label'].values
44 | # encode the class label
45 | class_labels = np.unique(y)
46 | label_encoder = LabelEncoder()
47 | y = label_encoder.fit_transform(y)
48 | # step 4: split the data into training set and test set
49 | test_percentage = 0.5
50 | x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_percentage, random_state=seed)
51 | plot_savename = "histogram_payload"
52 |
53 | from matplotlib import rcParams
54 | # Make room for xlabel which is otherwise cut off
55 | rcParams.update({'figure.autolayout': True})
56 | # Heatmap plot how plot the sample points among 5 classes
57 | for idx, cl in enumerate(np.unique(y_test)):
58 | plt.figure()
59 | print("Starting class: " + class_labels[cl] +" With len:" + str(len(x_test[y_test == cl])))
60 | positioncounts = np.zeros(shape=(256, data_len))
61 | for x in x_test[y_test == cl]:
62 | for i, v in enumerate(x):
63 | positioncounts[int(v), i] += 1
64 | plt.imshow(positioncounts, cmap="YlGnBu", interpolation='nearest')
65 | plt.title('Heatmap of : {}'.format(class_labels[cl]))
66 | plt.colorbar()
67 | plt.tight_layout()
68 | # plt.savefig('{0}{1}.png'.format(plot_savename, int(perplexity)), dpi=300)
69 | plt.show()
--------------------------------------------------------------------------------
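The inner double loop in heatmap.py that fills positioncounts (one count per byte-value/byte-position pair) is correct but slow for large test sets. If useful, the same counting can be done in vectorized form with np.add.at; a sketch on random stand-in data, where x_class plays the role of x_test[y_test == cl]:

import numpy as np

data_len = 16          # made-up length for the sketch
x_class = np.random.randint(0, 256, size=(100, data_len))

positioncounts = np.zeros((256, data_len))
cols = np.broadcast_to(np.arange(data_len), x_class.shape)
# For every (sample, position) pair, increment the (byte value, position) cell
np.add.at(positioncounts, (x_class.astype(int), cols), 1)
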
/visualization/histogram.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from sklearn.preprocessing import LabelEncoder
4 | from sklearn.preprocessing import StandardScaler
5 | from sklearn.model_selection import train_test_split
6 | import utils
7 | import glob, os
8 | import pca.dataanalyzer as da, pca.pca as pca
9 | from sklearn.metrics import accuracy_score
10 |
11 | # visualize the important characteristics of the dataset
12 | import matplotlib.pyplot as plt
13 | seed = 0
14 | num_headers = 16
15 | data_len = 54*num_headers #1460
16 | dirs = ["C:/Users/salik/Documents/Data/LinuxChrome/{}/".format(num_headers),
17 | "C:/Users/salik/Documents/Data/WindowsFirefox/{}/".format(num_headers),
18 | "C:/Users/salik/Documents/Data/WindowsChrome/{}/".format(num_headers),
19 | "C:/Users/salik/Documents/Data/WindowsSalik/{}/".format(num_headers),
20 | "C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
21 | # dirs = ["E:/Data/h5/https/", "E:/Data/h5/netflix/"]
22 |
23 | # step 1: get the data
24 | dataframes = []
25 | num_examples = 0
26 | for dir in dirs:
27 | for fullname in glob.iglob(dir + '*.h5'):
28 | filename = os.path.basename(fullname)
29 | df = utils.load_h5(dir, filename)
30 | dataframes.append(df)
31 | num_examples = len(df.values)
32 | # create one large dataframe
33 | data = pd.concat(dataframes)
34 | data = data.sample(frac=1, random_state=seed).reset_index(drop=True)
35 | num_rows = data.shape[0]
36 | columns = data.columns
37 | print(columns)
38 |
39 | # step 2: get features (x) and convert it to numpy array
40 | x = da.getbytes(data, data_len)
41 |
42 | # step 3: get class labels y and then encode it into number
43 | # get class label data
44 | y = data['label'].values
45 |
46 | # encode the class label
47 | class_labels = np.unique(y)
48 | label_encoder = LabelEncoder()
49 | y = label_encoder.fit_transform(y)
50 |
51 | # step 4: split the data into training set and test set
52 | test_percentage = 0.5
53 | x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_percentage, random_state=seed)
54 |
55 | plot_savename = "histogram_payload"
56 |
57 | from matplotlib import rcParams
58 | # Make room for xlabel which is otherwise cut off
59 | rcParams.update({'figure.autolayout': True})
60 |
61 |
62 |
63 | # scatter plot the sample points among 5 classes
64 | # markers = ('s', 'd', 'o', '^', 'v', ".", ",", "<", ">", "8", "p", "P", "*", "h", "H", "+", "x", "X", "D", "|", "_")
65 | color_map = {0: '#487fff', 1: '#d342ff', 2: '#4eff4e', 3: '#2ee3ff', 4: '#ffca43', 5:'#ff365e', 6:'#626663'}
66 | plt.figure()
67 | for idx, cl in enumerate(np.unique(y_test)):
68 | # Get count of unique values
69 | values, counts = np.unique(x_test[y_test == cl], return_counts=True)
70 | # Maybe remove zero as there is a lot of zeros in the header
71 | # values = values[1:]
72 | # counts = counts[1:]
73 | n, bins, patches = plt.hist(values, weights=counts, bins=256, facecolor=color_map[idx], label=class_labels[cl], alpha=0.8)
74 |
75 | plt.legend(loc='upper right')
76 | plt.title('Histogram of : {}'.format(class_labels))
77 | plt.tight_layout()
78 | # plt.savefig('{0}{1}.png'.format(plot_savename, int(perplexity)), dpi=300)
79 | plt.show()
--------------------------------------------------------------------------------
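The np.unique(..., return_counts=True) plus weighted plt.hist combination in histogram.py builds a byte-value histogram per class. An equivalent, arguably more direct formulation uses np.bincount; a sketch on random stand-in data, where x_class plays the role of x_test[y_test == cl]:

import numpy as np
import matplotlib.pyplot as plt

# made-up byte matrix for one class: 100 samples, 16 headers of 54 bytes each
x_class = np.random.randint(0, 256, size=(100, 54 * 16))

# One count per byte value 0..255, equivalent to np.unique(..., return_counts=True)
counts = np.bincount(x_class.ravel(), minlength=256)
plt.bar(np.arange(256), counts, width=1.0)
plt.xlabel('byte value')
plt.ylabel('count')
plt.show()
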
/visualization/pca_plots.py:
--------------------------------------------------------------------------------
1 | import pca.dataanalyzer as da, pca.pca as pca
2 | import glob
3 | import os
4 | import pandas as pd
5 | import numpy as np
6 | from sklearn.preprocessing import StandardScaler, LabelEncoder
7 | import utils
8 |
9 |
10 | seed = 0
11 | num_headers = 16
12 | dirs = ["E:/Data/LinuxChrome/{}/".format(num_headers),
13 | "E:/Data/WindowsSalik/{}/".format(num_headers),
14 | "E:/Data/WindowsAndreas/{}/".format(num_headers),
15 | "E:/Data/WindowsFirefox/{}/".format(num_headers),
16 | "E:/Data/WindowsChrome/{}/".format(num_headers),
17 | ]
18 | # dirs = ["C:/Users/salik/Documents/Data/h5/https/", "C:/Users/salik/Documents/Data/h5/netflix/"]
19 | # dirs = ["C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
20 | # step 1: get the data
21 | dataframes = []
22 | num_examples = 0
23 | for dir in dirs:
24 | for fullname in glob.iglob(dir + '*.h5'):
25 | filename = os.path.basename(fullname)
26 | df = utils.load_h5(dir, filename)
27 | dataframes.append(df)
28 | num_examples = len(df.values)
29 | # create one large dataframe
30 | data = pd.concat(dataframes)
31 | data = data.sample(frac=0.1, random_state=seed).reset_index(drop=True)
32 | num_rows = data.shape[0]
33 | columns = data.columns
34 | print(columns)
35 |
36 | # step 3: get features (x) and scale the features
37 | # get x and convert it to numpy array
38 | x = da.getbytes(data, num_headers*54)
39 | standard_scaler = StandardScaler()
40 | # x_stds = []
41 | # ys = []
42 | # for data in dataframes:
43 | # x = da.getbytes(data, 1460)
44 |
45 | # x = [[-0.5, -0.5],
46 | # [-0.5, 0.5],
47 | # [0.5, -0.5],
48 | # [0.5, 0.5],
49 | # [-0.4, -0.4],
50 | # [-0.4, 0.4],
51 | # [0.4, -0.4],
52 | # [0.4, 0.4]
53 | # ]
54 | x_std = standard_scaler.fit_transform(x)
55 | y = data['label'].values
56 | # y = ['drtv', 'netflix', 'youtube', 'twitch', 'twitch', 'youtube', 'netflix', 'drtv']
57 | # encode the class label
58 | class_labels = np.unique(y)
59 | label_encoder = LabelEncoder()
60 | y = label_encoder.fit_transform(y)
61 |
62 |
63 | p = pca.runpca(x_std, num_comp=25)
64 | z = pca.componentprojection(x_std, p)
65 | pca.plotprojection(z, 0, y, class_labels)
66 | pca.plotvarianceexp(p, 25)
67 | pca.showplots()
68 | # plot_savename = "PCA_header_all_cumulative"
69 | # plt.savefig('{0}.png'.format(plot_savename), dpi=300)
--------------------------------------------------------------------------------
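The runpca / componentprojection / plotvarianceexp helpers used above come from pca/pca.py. For reference, the same projection and explained-variance view can be sketched with scikit-learn's PCA on stand-in data (random values below; the real script feeds the standard-scaled header bytes):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
x = rng.rand(500, 54 * 16)                       # fake header bytes, 16 headers of 54 bytes
x_std = StandardScaler().fit_transform(x)

p = PCA(n_components=25)
z = p.fit_transform(x_std)                       # projection onto the first 25 components
print(z.shape)                                   # (500, 25)
print(p.explained_variance_ratio_.cumsum())      # cumulative variance explained per component
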
/visualization/t-sne_compare.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from sklearn.preprocessing import LabelEncoder
4 | from sklearn.preprocessing import StandardScaler
5 | from sklearn.model_selection import train_test_split
6 | import utils
7 | import glob, os
8 | import pca.dataanalyzer as da, pca.pca as pca
9 | from sklearn.metrics import accuracy_score
10 |
11 | # visualize the important characteristics of the dataset
12 | import matplotlib.pyplot as plt
13 | seed = 0
14 | num_headers = 16
15 | dirs = ["E:/Data/WindowsFirefox/{}/".format(num_headers),
16 | "E:/Data/WindowsChrome/{}/".format(num_headers)]
17 | # dirs = ["E:/Data/h5/https/", "E:/Data/h5/netflix/"]
18 | # dirs = ["C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
19 | # step 1: get the data
20 | dataframes = []
21 | num_examples = 0
22 | for dir in dirs:
23 | for fullname in glob.iglob(dir + '*.h5'):
24 | filename = os.path.basename(fullname)
25 | df = utils.load_h5(dir, filename)
26 | dataframes.append(df)
27 | num_examples = len(df.values)
28 | # create one large dataframe
29 | data = pd.concat(dataframes)
30 | num_rows = data.shape[0]
31 | columns = data.columns
32 | print(columns)
33 |
34 | # step 2: get features (x) and convert it to numpy array
35 | standard_scaler = StandardScaler()
36 | x_stds = []
37 | ys = []
38 | class_labels = []
39 | for data in dataframes:
40 | x = da.getbytes(data, num_headers*54)
41 | # x = da.getbytes(data, 1460)
42 | x_std = standard_scaler.fit_transform(x)
43 | x_stds.append(x_std)
44 | # step 4: get class labels y and then encode it into number
45 | # get class label data
46 | y = data['label'].values
47 | # encode the class label
48 | class_labels = np.unique(y)
49 | label_encoder = LabelEncoder()
50 | y = label_encoder.fit_transform(y)
51 | ys.append(y)
52 | class_labels1 = ["Firefox " + x for x in class_labels]
53 | class_labels2 = ["Chrome " + x for x in class_labels]
54 | # step 5: split the data into training set and test set
55 | test_percentage = 0.1
56 | x_tests = []
57 | y_tests = []
58 | for i, x_std in enumerate(x_stds):
59 | x_train, x_test, y_train, y_test = train_test_split(x_std, ys[i], test_size=test_percentage, random_state=seed)
60 | x_tests.append(x_test)
61 | y_tests.append(y_test)
62 |
63 | x_test = np.append(x_tests[0], x_tests[1], axis=0)
64 | y_test = np.append(y_tests[0], y_tests[1], axis=0)
65 | first_set_length = len(y_tests[0])
66 | print(first_set_length)
67 |
68 | # t-distributed Stochastic Neighbor Embedding (t-SNE) visualization
69 | plot_savename = "t-sne_16headers_windows_linux_perplexity"
70 | from sklearn.manifold import TSNE
71 | perplexities = [30.0]
72 |
73 | from matplotlib import rcParams
74 | # Make room for xlabel which is otherwise cut off
75 | rcParams.update({'figure.autolayout': True})
76 |
77 | for perplexity in perplexities:
78 | print("Starting perplexity: {}".format(perplexity))
79 | tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=1000, random_state=seed, verbose=2)
80 | x_test_2d = tsne.fit_transform(x_test)
81 | x_test_2d_0 = x_test_2d[:first_set_length, :]
82 | x_test_2d_1 = x_test_2d[first_set_length:, :]
83 |
84 | # scatter plot the sample points among 7 classes
85 | # markers = ('s', 'd', 'o', '^', 'v', ".", ",", "<", ">", "8", "p", "P", "*", "h", "H", "+", "x", "X", "D", "|", "_")
86 | color_map = {0: '#487fff', 1: '#2ee3ff', 2: '#4eff4e', 3: '#ffca43', 4: '#ff365e', 5: '#d342ff', 6:'#626663'}
87 | plt.figure()
88 | for idx, cl in enumerate(np.unique(y_test)):
89 | # Plot first dataset as +
90 | plt.scatter(x=x_test_2d_0[y_tests[0] == cl, 0], y=x_test_2d_0[y_tests[0] == cl, 1],
91 | marker="+", s=30,
92 | c=color_map[idx],
93 | label=class_labels1[cl])
94 | # Plot second dataset as o with no fill
95 | plt.scatter(x=x_test_2d_1[y_tests[1] == cl, 0], y=x_test_2d_1[y_tests[1] == cl, 1],
96 | marker="o", facecolors="None", s=30, linewidths=1,
97 | edgecolors=color_map[idx],
98 | label=class_labels2[cl])
99 | plt.xlabel('X in t-SNE')
100 | plt.ylabel('Y in t-SNE')
101 | plt.legend(loc='lower right')
102 | plt.title('t-SNE visualization with perplexity: {}'.format(perplexity))
103 | plt.tight_layout()
104 | # plt.savefig('{0}{1}.png'.format(plot_savename, int(perplexity)), dpi=300)
105 | plt.show()
--------------------------------------------------------------------------------
/visualization/t_sne.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from sklearn.preprocessing import LabelEncoder
4 | from sklearn.preprocessing import StandardScaler
5 | from sklearn.model_selection import train_test_split
6 | import utils
7 | import glob, os
8 | import pca.dataanalyzer as da, pca.pca as pca
9 | from sklearn.metrics import accuracy_score
10 |
11 | # visualize the important characteristics of the dataset
12 | import matplotlib.pyplot as plt
13 | import matplotlib.markers as ms
14 | seed = 0
15 | num_headers = 16
16 | dirs = ["C:/Users/salik/Documents/Data/LinuxChrome/{}/".format(num_headers),
17 | "C:/Users/salik/Documents/Data/WindowsFirefox/{}/".format(num_headers),
18 | "C:/Users/salik/Documents/Data/WindowsChrome/{}/".format(num_headers),
19 | "C:/Users/salik/Documents/Data/WindowsSalik/{}/".format(num_headers),
20 | "C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
21 | # dirs = ["E:/Data/h5/https/", "E:/Data/h5/netflix/"]
22 | # dirs = ["C:/Users/salik/Documents/Data/WindowsAndreas/{}/".format(num_headers)]
23 | # step 1: get the data
24 | dataframes = []
25 | num_examples = 0
26 | for dir in dirs:
27 | for fullname in glob.iglob(dir + '*.h5'):
28 | filename = os.path.basename(fullname)
29 | df = utils.load_h5(dir, filename)
30 | dataframes.append(df)
31 | num_examples = len(df.values)
32 | # create one large dataframe
33 | data = pd.concat(dataframes)
34 | data = data.sample(frac=1, random_state=seed).reset_index(drop=True)
35 | num_rows = data.shape[0]
36 | columns = data.columns
37 | print(columns)
38 |
39 | # step 3: get features (x) and scale the features
40 | # get x and convert it to numpy array
41 | # x = da.getbytes(data, 1460)
42 | standard_scaler = StandardScaler()
43 | x = da.getbytes(data, num_headers*54)
44 | x_std = standard_scaler.fit_transform(x)
45 | # step 4: get class labels y and then encode it into number
46 | # get class label data
47 | y = data['label'].values
48 | # encode the class label
49 | class_labels = np.unique(y)
50 | label_encoder = LabelEncoder()
51 | y = label_encoder.fit_transform(y)
52 | # step 5: split the data into training set and test set
53 | test_percentage = 0.1
54 | x_tests = []
55 | y_tests = []
56 | x_train, x_test, y_train, y_test = train_test_split(x_std, y, test_size=test_percentage, random_state=seed)
57 | # t-distributed Stochastic Neighbor Embedding (t-SNE) visualization
58 | plot_savename = "t-sne_16headers_windows_linux_perplexity"
59 | from sklearn.manifold import TSNE
60 | perplexities = [30.0]
61 |
62 | from matplotlib import rcParams
63 | # Make room for xlabel which is otherwise cut off
64 | rcParams.update({'figure.autolayout': True})
65 |
66 | for perplexity in perplexities:
67 | print("Starting perplexity: {}".format(perplexity))
68 | tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=1000, random_state=seed, verbose=2)
69 | x_test_2d = tsne.fit_transform(x_test)
70 | # # scatter plot the sample points among 7 classes
71 | # # markers = ('s', 'd', 'o', '^', 'v', ".", ",", "<", ">", "8", "p", "P", "*", "h", "H", "+", "x", "X", "D", "|", "_")
72 | color_map = {0: '#487fff', 1: '#2ee3ff', 2: '#4eff4e', 3: '#ffca43', 4: '#ff365e', 5: '#d342ff', 6:'#626663'}
73 | plt.figure()
74 | for idx, cl in enumerate(np.unique(y_test)):
75 | plt.scatter(x=x_test_2d[y_test == cl, 0],
76 | y=x_test_2d[y_test == cl, 1],
77 | marker=".",
78 | c=color_map[idx],
79 | label=class_labels[cl])
80 | plt.xlabel('X in t-SNE')
81 | plt.ylabel('Y in t-SNE')
82 | plt.legend(loc='lower right')
83 | # plt.title('t-SNE visualization with perplexity: {}'.format(perplexity))
84 | plt.tight_layout()
85 | # plt.savefig('{0}{1}.png'.format(plot_savename, int(perplexity)), dpi=300)
86 | plt.show()
--------------------------------------------------------------------------------
/visualization/vis_utils.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | # import PIL.Image
3 | import matplotlib.pyplot as plt
4 | import matplotlib.cm as cm
5 | # lowest = -1.0
6 | lowest = 0.0
7 | highest = 1.0
8 |
9 | # --------------------------------------
10 | # Color maps ([-1,1] -> [0,1]^3)
11 | # --------------------------------------
12 |
13 |
14 | def heatmap(x):
15 |
16 | x = x[..., np.newaxis]
17 |
18 | # positive relevance
19 | hrp = 0.9 - np.clip(x-0.3, 0, 0.7)/0.7*0.5
20 | hgp = 0.9 - np.clip(x-0.0, 0, 0.3)/0.3*0.5 - np.clip(x-0.3, 0, 0.7)/0.7*0.4
21 | hbp = 0.9 - np.clip(x-0.0, 0, 0.3)/0.3*0.5 - np.clip(x-0.3, 0, 0.7)/0.7*0.4
22 |
23 | # negative relevance
24 | hrn = 0.9 - np.clip(-x-0.0, 0, 0.3)/0.3*0.5 - np.clip(-x-0.3, 0, 0.7)/0.7*0.4
25 | hgn = 0.9 - np.clip(-x-0.0, 0, 0.3)/0.3*0.5 - np.clip(-x-0.3, 0, 0.7)/0.7*0.4
26 | hbn = 0.9 - np.clip(-x-0.3, 0, 0.7)/0.7*0.5
27 |
28 | r = hrp*(x >= 0)+hrn*(x < 0)
29 | g = hgp*(x >= 0)+hgn*(x < 0)
30 | b = hbp*(x >= 0)+hbn*(x < 0)
31 |
32 | return np.concatenate([r, g, b], axis=-1)
33 |
34 |
35 | def graymap(x):
36 |
37 | x = x[..., np.newaxis]
38 | return np.concatenate([x, x, x], axis=-1)*0.5+0.5
39 |
40 | # --------------------------------------
41 | # Visualizing data
42 | # --------------------------------------
43 |
44 | # def visualize(x,colormap,name):
45 | #
46 | # N = len(x)
47 | # assert(N <= 16)
48 | #
49 | # x = colormap(x/np.abs(x).max())
50 | #
51 | # # Create a mosaic and upsample
52 | # x = x.reshape([1, N, 29, 29, 3])
53 | # x = np.pad(x, ((0, 0), (0, 0), (2, 2), (2, 2), (0, 0)), 'constant', constant_values=1)
54 | # x = x.transpose([0, 2, 1, 3, 4]).reshape([1*33, N*33, 3])
55 | # x = np.kron(x, np.ones([2, 2, 1]))
56 | #
57 | # PIL.Image.fromarray((x*255).astype('byte'), 'RGB').save(name)
58 |
59 |
60 | def plt_vector(x, colormap, num_headers):
61 | N = len(x)
62 | assert (N <= 16)
63 | len_x = 54
64 | len_y = num_headers
65 | # size = int(np.ceil(np.sqrt(len(x[0]))))
66 | length = len_y*len_x
67 | data = np.zeros((N, length), dtype=np.float64)
68 | data[:, :x.shape[1]] = x
69 | data = colormap(data / np.abs(data).max())
70 | # data = data.reshape([1, N, size, size, 3])
71 | data = data.reshape([1, N, len_y, len_x, 3])
72 | # data = np.pad(data, ((0, 0), (0, 0), (2, 2), (2, 2), (0, 0)), 'constant', constant_values=1)
73 | data = data.transpose([0, 2, 1, 3, 4]).reshape([1 * (len_y), N * (len_x), 3])
74 | return data
75 | # data = np.kron(data, np.ones([2, 2, 1])) # scales
76 |
77 |
78 | def add_subplot(data, num_plots, plot_index, title, figure):
79 | fig = figure
80 | ax = fig.add_subplot(num_plots, 1, plot_index)
81 | cax = ax.imshow(data, interpolation='nearest', aspect='auto')
82 | # cbar = fig.colorbar(cax, ticks=[0, 1])
83 | # cbar.ax.set_yticklabels(['0', '> 1']) # vertically oriented colorbar
84 | ax.set_title(title)
85 |
86 | def plot_data(data, title):
87 | plt.figure(figsize=(6.4, 2.5)) # figuresize to make 16 headers plot look good
88 | # plt.axis('scaled')
89 | plt.imshow(data, interpolation='nearest', aspect='auto')
90 | plt.title(title)
91 | plt.tight_layout()
92 |
93 |
94 | def plotNNFilter(units):
95 | filters = units.shape[3]
96 | plt.figure(1, figsize=(20, 20))
97 | n_columns = 6
98 | n_rows = np.ceil(filters / n_columns) + 1
99 | for i in range(filters):
100 | plt.subplot(n_rows, n_columns, i+1)
101 | plt.title('Filter ' + str(i))
102 | plt.imshow(units[0, :, :, i], interpolation="nearest", cmap="gray")
103 |
104 |
--------------------------------------------------------------------------------
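The heatmap colormap in vis_utils.py maps relevance values in roughly [-1, 1] to RGB triples (red-ish for positive relevance, blue-ish for negative). A small sketch that renders the colormap as a color bar, which can help when reading the plots produced by plt_vector / plot_data (assuming the visualization package is importable):

import numpy as np
import matplotlib.pyplot as plt
from visualization import vis_utils

# A gradient from -1 to 1, shaped as one row so imshow draws a color bar
x = np.linspace(-1.0, 1.0, 256).reshape(1, -1)
rgb = vis_utils.heatmap(x)          # shape (1, 256, 3), values in [0, 1]
plt.figure(figsize=(6, 1))
plt.imshow(rgb, aspect='auto')
plt.yticks([])
plt.title('vis_utils.heatmap colormap from -1 (left) to +1 (right)')
plt.show()
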
/visualization/visualize_activations.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import matplotlib.pyplot as plt
3 | import numpy as np
4 | from tf import dataset
5 | from visualization import classes_module as md, vis_utils
6 |
7 | num_headers = 16
8 | hidden_units = 12
9 | train_dirs = ['C:/Users/salik/Documents/Data/LinuxChrome/{0}/'.format(num_headers),
10 | 'C:/Users/salik/Documents/Data/WindowsFirefox/{0}/'.format(num_headers),
11 | 'C:/Users/salik/Documents/Data/WindowsAndreas/{0}/'.format(num_headers),
12 | 'C:/Users/salik/Documents/Data/WindowsSalik/{0}/'.format(num_headers)]
13 |
14 | test_dirs = ['C:/Users/salik/Documents/Data/WindowsChrome/{0}/'.format(num_headers)]
15 | seed = 0
16 | input_size = 54*num_headers
17 | data = dataset.read_data_sets(train_dirs, test_dirs, merge_data=True, one_hot=True,
18 | validation_size=0.1,
19 | test_size=0.1,
20 | balance_classes=False,
21 | payload_length=input_size,
22 | seed=seed)
23 | load_dir = "../trained_models/"
24 | model_name = 'header_{0}_{1}_units.ckpt'.format(num_headers, hidden_units)
25 | sess = tf.Session()
26 | # First let's load meta graph and restore weights
27 | saver = tf.train.import_meta_graph(load_dir + model_name + ".meta")
28 | saver.restore(sess, tf.train.latest_checkpoint(load_dir))
29 |
30 | # Now, let's access the placeholder variables and
31 | # create a feed-dict to feed new data
32 | graph = tf.get_default_graph()
33 | names = [tensor.name for tensor in graph.as_graph_def().node]
34 | x_pl = graph.get_tensor_by_name("xPlaceholder:0")
35 | y_pl = graph.get_tensor_by_name("yPlaceholder:0")
36 | layer1 = graph.get_tensor_by_name('layer1/activation:0')
37 | W1 = graph.get_tensor_by_name("layer1/W:0")
38 | b1 = graph.get_tensor_by_name("layer1/b:0")
39 | W_out = graph.get_tensor_by_name("output_layer/W:0")
40 | b_out = graph.get_tensor_by_name("output_layer/b:0")
41 | y = graph.get_tensor_by_name("output_layer/activation:0")
42 | # Get index of prediction
43 | y_ = tf.argmax(y, axis=1)
44 |
45 | feed_dict = {x_pl: data.test.payloads, y_pl: data.test.labels}
46 | y_preds = sess.run(fetches=y_, feed_dict=feed_dict)
47 | y_true = tf.argmax(data.test.labels, axis=1).eval(session=sess)
48 |
49 | # Number of samples to plot
50 | sample_size = 10
51 | num_samples_picked = 0
52 | sample_payloads = []
53 | sample_labels = []
54 | classes = np.array(dataset._label_encoder.classes_)
55 | # Which class do we want to visualise
56 | class_name = "drtv"
57 | class_number = np.argmax(classes == class_name)
58 | # Get all the places where that class label is the true label
59 | true_idx = [i for i, v in enumerate(y_true) if v == class_number]
60 | # What predictions did the network make at those indices
61 | preds_at_idx = y_preds[true_idx]
62 | # Iterate over predictions and pick #sample_size where prediction was right
63 | for i, v in enumerate(preds_at_idx):
64 | if v == class_number and num_samples_picked < sample_size:
65 | sample_payloads.append(data.test.payloads[true_idx[i]])
66 | sample_labels.append(data.test.labels[true_idx[i]])
67 | num_samples_picked += 1
68 |
69 |
70 | X = np.array(sample_payloads)
71 | T = np.array(sample_labels)
72 | # Extract the weights and biases from the network
73 | l1_weights, l1_biases, out_weights, out_biases = sess.run([W1, b1, W_out, b_out])
74 | # Create visualisation network in sequential manner
75 | nn = md.Network([md.Linear(l1_weights, l1_biases), md.ReLU(),
76 | md.Linear(out_weights, out_biases), md.ReLU()
77 | ])
78 | # Make a sensitivity and relevance plot for each sample
79 | for i, v in enumerate(X):
80 | t = T[i]
81 | classname = classes[np.argmax(t)]
82 | Y = np.array([nn.forward(v)])
83 | S = np.array([nn.gradprop(t)**2])
84 | D = nn.relprop(Y*t)
85 | s_data = vis_utils.plt_vector(S, vis_utils.heatmap, num_headers)
86 | s_title = "Sensitivity for {0}".format(classname)
87 | rel_data = vis_utils.plt_vector(D, vis_utils.heatmap, num_headers)
88 | r_title = "Relevance for {0}".format(classname)
89 | vis_utils.plot_data(s_data, s_title)
90 | vis_utils.plot_data(rel_data, r_title)
91 | plt.show()
92 |
--------------------------------------------------------------------------------