├── LICENSE
├── README.md
├── TNSM2020
│   ├── clf_helpers.py
│   ├── params.pkl
│   └── run_classification.py
├── example_anomaly_detection.py
├── example_classification.py
├── feature_extraction.py
├── requirements.txt
└── temporal_data_representation.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 lcd-dal
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Feature extraction for CERT insider threat test dataset
2 | This is a script for extracting features (in CSV format) from the [CERT insider threat test dataset](https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099) [[1]](#1), [[2]](#2), versions 4.1 to 6.2. For more details, please see this paper: [Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning](https://ieeexplore.ieee.org/document/8962316).
3 |
4 | [1]
5 | Lindauer, Brian (2020): Insider Threat Test Dataset. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/12841247.v1
6 |
7 | [2]
8 | J. Glasser and B. Lindauer, "Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data," 2013 IEEE Security and Privacy Workshops, San Francisco, CA, 2013, pp. 98-104, doi: 10.1109/SPW.2013.37.
9 |
10 | ## Run feature_extraction script
11 | - Requires Python 3.8 and the packages listed in `requirements.txt`. The script has been written and tested on Linux only.
12 | - By default, the script extracts week, day, session, and sub-session data (as in the paper).
13 | - To run the script, place it in the folder of a CERT dataset (e.g. r4.2, decompressed from r4.2.tar.bz2 downloaded [here](https://kilthub.cmu.edu/articles/dataset/Insider_Threat_Test_Dataset/12841247/1)), then run `python3 feature_extraction.py`.
14 | - To change the number of cores used for parallelization (default 8), pass it as an argument, e.g. `python3 feature_extraction.py 16` (see the example below).
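
For example, assuming the r4.2 dataset has been decompressed into a folder named `r4.2` in the current directory (paths are illustrative):

```bash
cp feature_extraction.py r4.2/
cd r4.2
python3 feature_extraction.py 16   # run with 16 cores; output is written to ExtractedData/
```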
15 |
16 | ## Extracted Data
17 | Extracted data is stored in the `ExtractedData` subfolder.
18 |
19 | Note that in the extracted data, `insider` is the label indicating the insider threat scenario (0 is normal). Some extracted columns (`subs_ind`, `starttime`, `endtime`, `sessionid`, `user`, `day`, `week`) are informational and may or may not be used as input features when training machine learning models.
20 |
21 | Pre-extracted data from the CERT insider threat test dataset r5.2 (gzipped) can be found [here](https://web.cs.dal.ca/~lcd/data/CERTr5.2/). The snippet below shows one way to load it.
22 |
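A minimal sketch (mirroring `example_classification.py` further down in this repository) for loading the pre-extracted r5.2 day data and separating the informational columns from the features; the file name is assumed from the download above:

```python
import pandas as pd

data = pd.read_csv('day-r5.2.csv.gz')

# informational columns; `insider` is the label (0 = normal)
info_cols = ['user', 'day', 'week', 'starttime', 'endtime', 'sessionid', 'insider']
x_cols = [c for c in data.columns if c not in info_cols]

X = data[x_cols].values
y_bin = (data['insider'] > 0).astype(int)  # binary label: any insider scenario vs. normal
```
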
23 | ## Data representations
24 | From the extracted data, `temporal_data_representation.py` can be used to generate different data representations, as presented in this paper: [Anomaly Detection for Insider Threats Using Unsupervised Ensembles](https://ieeexplore.ieee.org/document/9399116).
25 |
26 | Run `python3 temporal_data_representation.py --help` to see the available options.
27 |
28 | ## Sample classification and anomaly detection results
29 | Sample code is provided in the following scripts; an example of running them follows the list:
30 |
31 | - `example_classification.py` for classification (as in [Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning](https://ieeexplore.ieee.org/document/8962316)).
32 | - `example_anomaly_detection.py` for anomaly detection (as in [Anomaly Detection for Insider Threats Using Unsupervised Ensembles](https://ieeexplore.ieee.org/document/9399116)).
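
Assuming the input files the scripts expect are in the working directory (`day-r5.2.csv.gz` from the pre-extracted data above, and `week-r5.2-percentile30.pkl` generated with `temporal_data_representation.py`), they can be run directly:

```bash
python3 example_classification.py      # reads day-r5.2.csv.gz
python3 example_anomaly_detection.py   # reads week-r5.2-percentile30.pkl
```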
33 |
34 | ## Citation
35 | If you use the source code or the extracted datasets, please cite the following paper:
36 |
37 | `D. C. Le, N. Zincir-Heywood and M. I. Heywood, "Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning," in IEEE Transactions on Network and Service Management, vol. 17, no. 1, pp. 30-44, March 2020, doi: 10.1109/TNSM.2020.2967721.`
38 |
39 | Data representations and anomaly detection:
40 |
41 | `D. C. Le and N. Zincir-Heywood, "Anomaly Detection for Insider Threats Using Unsupervised Ensembles," in IEEE Transactions on Network and Service Management, vol. 18, no. 2, pp. 1152–1164, June 2021, doi: 10.1109/TNSM.2021.3071928.`
42 |
--------------------------------------------------------------------------------
/TNSM2020/clf_helpers.py:
--------------------------------------------------------------------------------
1 | from sklearn.model_selection import train_test_split
2 | from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, auc
3 | from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
4 | import numpy as np
5 | import time
6 | import gc
7 | import pandas as pd
8 | import random
9 | from joblib import Parallel, delayed
10 |
11 | num_cores = 16
12 |
13 |
14 | def split_data(data, test_size=0.25, random_state=0, y_column='insider',
15 | shuffle=True,
16 | x_rm_cols=('user', 'day', 'week', 'starttime', 'endtime', 'sessionid',
17 | 'timeind', 'Unnamed: 0', 'insider'),
18 | dname='r4.2', normalization='StandardScaler',
19 | rm_empty_cols=True, by_user=False, by_user_time=False,
20 | by_user_time_trainper=0.5, limit_ntrain_user=0):
21 | """
22 |     Split data into train and test sets: randomly by instance, by user, or by user and time, with normalization built in. Returns a dict with the train/test arrays, column lists, and the fitted scaler.
23 | """
24 | np.random.seed(random_state)
25 | random.seed(random_state)
26 |
27 | x_cols = [i for i in data.columns if i not in x_rm_cols]
28 | if rm_empty_cols:
29 | x_cols = [i for i in x_cols if len(set(data[i])) > 1]
30 |
31 | infocols = list(set(data.columns) - set(x_cols))
32 |
33 | # output a dict
34 | out = {}
35 |
36 | # normalization
37 | if normalization == 'StandardScaler':
38 | sc = StandardScaler()
39 | elif normalization == 'MinMaxScaler':
40 | sc = MinMaxScaler()
41 | elif normalization == 'MaxAbsScaler':
42 | sc = MaxAbsScaler()
43 | else:
44 | sc = None
45 | out['sc'] = sc
46 |
47 | # split data randomly by instance
48 | if not by_user and not by_user_time:
49 | x = data[x_cols].values
50 | y_org = data[y_column].values
51 |
52 | y = y_org.copy()
53 | if 'r6' in dname:
54 | y[y != 0] = 1
55 |
56 | x_train, x_test, y_train, y_test = train_test_split(x, y_org, test_size=test_size, shuffle=shuffle)
57 |
58 | if 'sc' in out and out['sc'] is not None:
59 | x_train = sc.fit_transform(x_train)
60 | out['sc'] = sc
61 | if test_size > 0: x_test = sc.transform(x_test)
62 |
63 | # split data by user
64 | elif by_user:
65 | test_users, train_users = [], []
66 | for i in [j for j in list(set(data['insider'])) if j != 0]:
67 | uli = list(set(data[data['insider'] == i]['user']))
68 | random.shuffle(uli)
69 | ind_i = int(np.ceil(test_size * len(uli)))
70 | test_users += uli[:ind_i]
71 | train_users += uli[ind_i:]
72 |
73 | normal_users = list(set(data['user']) - set(data[data['insider'] != 0]['user']))
74 | random.shuffle(normal_users)
75 | if limit_ntrain_user > 0:
76 | normal_ind = limit_ntrain_user - len(train_users)
77 | else:
78 | normal_ind = int(np.ceil((1 - test_size) * len(normal_users)))
79 |
80 | train_users += normal_users[: normal_ind]
81 | test_users += normal_users[normal_ind:]
82 | x_train = data[data['user'].isin(train_users)][x_cols].values
83 | x_test = data[data['user'].isin(test_users)][x_cols].values
84 | y_train = data[data['user'].isin(train_users)][y_column].values
85 | y_test = data[data['user'].isin(test_users)][y_column].values
86 |
87 | out['train_info'] = data[data['user'].isin(train_users)][infocols]
88 | out['test_info'] = data[data['user'].isin(test_users)][infocols]
89 |
90 | out['train_users'] = train_users
91 | if test_size > 0 or limit_ntrain_user > 0:
92 | out['test_users'] = test_users
93 |
94 | if 'sc' in out and out['sc'] is not None:
95 | x_train = sc.fit_transform(x_train)
96 | out['sc'] = sc
97 | if test_size > 0 or (limit_ntrain_user > 0 and limit_ntrain_user < len(set(data['user']))):
98 | x_test = sc.transform(x_test)
99 |
100 | # split by user and time
101 | elif by_user_time:
102 | train_week_max = by_user_time_trainper * max(data['week'])
103 | train_insiders = set(data[(data['week'] <= train_week_max) & (data['insider'] != 0)]['user'])
104 | users_set_later_weeks = set(data[data['week'] > train_week_max]['user'])
105 |
106 | first_part = data[data['week'] <= train_week_max]
107 | second_part = data[data['week'] > train_week_max]
108 |
109 | first_part_split = split_data(first_part, random_state=random_state, test_size=0,
110 | dname=dname, normalization=normalization,
111 | by_user=True, by_user_time=False,
112 | limit_ntrain_user=limit_ntrain_user,
113 | )
114 |
115 | x_train = first_part_split['x_train']
116 | y_train = first_part_split['y_train']
117 | x_cols = first_part_split['x_cols']
118 |
119 | out['train_info'] = first_part_split['train_info']
120 | out['other_trainweeks_users_info'] = first_part_split['test_info']
121 |
122 | if 'sc' in first_part_split and first_part_split['sc'] is not None:
123 | out['sc'] = first_part_split['sc']
124 |
125 | out['x_other_trainweeks_users'] = first_part_split['x_test']
126 | out['y_other_trainweeks_users'] = first_part_split['y_test']
127 | out['y_bin_other_trainweeks_users'] = first_part_split['y_test_bin']
128 | out['other_trainweeks_users'] = first_part_split['test_users'] # users in first half but not in train
129 |
130 | real_train_users = set(first_part_split['train_users'])
131 | real_train_insiders = train_insiders.intersection(real_train_users)
132 | test_users = list(users_set_later_weeks - real_train_insiders)
133 | x_test = second_part[second_part['user'].isin(test_users)][x_cols].values
134 | y_test = second_part[second_part['user'].isin(test_users)][y_column].values
135 | out['test_info'] = second_part[second_part['user'].isin(test_users)][infocols]
136 | if ('sc' in out) and (out['sc'] is not None) and (by_user_time_trainper < 1):
137 | x_test = out['sc'].transform(x_test)
138 |
139 | out['train_users'] = first_part_split['train_users']
140 | out['test_users'] = test_users
141 |
142 | # get binary data
143 | y_train_bin = y_train.copy()
144 | y_train_bin[y_train_bin != 0] = 1
145 |
146 | out['x_train'] = x_train
147 | out['y_train'] = y_train
148 | out['y_train_bin'] = y_train_bin
149 | out['x_cols'] = x_cols
150 | out['info_cols'] = infocols
151 |
152 | out['test_size'] = test_size
153 |
154 | if test_size > 0 or (by_user_time and by_user_time_trainper < 1) or limit_ntrain_user > 0:
155 | y_test_bin = y_test.copy()
156 | y_test_bin[y_test_bin != 0] = 1
157 | out['x_test'] = x_test
158 | out['y_test'] = y_test
159 | out['y_test_bin'] = y_test_bin
160 |
161 | return out
162 |
163 |
164 | def get_result_one_user(u, pred_all, datainfo):
165 | res_u = {}
166 | u_labels = datainfo[datainfo['user'] == u]['insider'].values
167 | utype = list(set(u_labels))
168 | if np.any(u_labels != 0) and len(set(u_labels)) > 1:
169 | utype.remove(0.0)
170 | res_u['type'] = utype[0]
171 | u_idx = np.where(datainfo['user'] == u)[0]
172 | res_u['data_idxs'] = u_idx
173 | pred = pred_all[u_idx]
174 | if len(np.where(u_labels == 0)[0]) > 0:
175 | res_u['norm_per'] = len(np.where(pred[u_labels == 0] == 0)[0]) / len(np.where(u_labels == 0)[0])
176 | if utype[0] != 0:
177 | res_u['mal_per'] = len(np.where(pred[u_labels != 0] != 0)[0]) / len(np.where(u_labels != 0)[0])
178 | res_u['norm_bin'] = int(np.any(pred[u_labels == 0] != 0))
179 | if utype[0] != 0:
180 | res_u['mal_bin'] = int(np.any(pred[u_labels != 0] != 0))
181 | return res_u
182 |
183 |
184 | def get_result_by_users(users, user_list=None, pred_all=None, datainfo=None):
185 | out = {}
186 | cms = {}
187 | out[users] = {}
188 | # out_users = [get_result_one_user(u, pred_all, datainfo, old_res, label_all) for u in user_list]
189 | out_users = Parallel(n_jobs=num_cores)(delayed(get_result_one_user)(u, pred_all, datainfo)
190 | for u in user_list)
191 | users_true_label = []
192 | users_pred_label = []
193 | for i, u in enumerate(user_list):
194 | out[users][u] = out_users[i]
195 | users_true_label.append(out[users][u]['type'])
196 | if out[users][u]['type'] == 0:
197 | users_pred_label.append(out[users][u]['norm_bin'])
198 | else:
199 | users_pred_label.append(out[users][u]['mal_bin'])
200 |
201 | out[users]['true_label'] = users_true_label
202 | out[users]['pred_label'] = users_pred_label
203 | cms[users] = confusion_matrix(users_true_label, users_pred_label)
204 | return out, cms
205 |
206 |
207 | def do_classification(clf, x_train, y_train, x_test, y_test, y_org=None, by_user=False,
208 | split_output=None):
209 | '''
210 |     Train the classifier and collect results: confusion matrices, optional per-user results, predictions, prediction probabilities, and timing.
211 | '''
212 | st = time.time()
213 | clf.fit(x_train, y_train)
214 | train_time = time.time() - st
215 |
216 | cms_train = {}
217 | cms_test = {}
218 |
219 | st = time.time()
220 | y_train_hat = clf.predict(x_train)
221 | y_train_proba = clf.predict_proba(x_train)
222 | pred_time = time.time() - st
223 | cms_train['bin'] = confusion_matrix(y_train, y_train_hat)
224 |
225 | st = time.time()
226 | y_test_hat = clf.predict(x_test)
227 | y_test_proba = clf.predict_proba(x_test)
228 | test_pred_time = time.time() - st
229 | cms_test['bin'] = confusion_matrix(y_test, y_test_hat)
230 |
231 | test_org_labels = y_test
232 | train_org_labels = y_train
233 | if y_org is not None:
234 | cms_train['org'] = confusion_matrix(y_org['train'], y_train_hat)
235 | train_org_labels = y_org['train']
236 | cms_test['org'] = confusion_matrix(y_org['test'], y_test_hat)
237 | test_org_labels = y_org['test']
238 |
239 | userres_train = {}
240 | userres_test = {}
241 | if by_user:
242 | uout, ucm = get_result_by_users('train_users', split_output['train_users'], pred_all=y_train_hat,
243 | datainfo=split_output['train_info'])
244 |
245 | userres_train.update(uout)
246 | cms_train.update(ucm)
247 | uout, ucm = get_result_by_users('test_users', split_output['test_users'], pred_all=y_test_hat,
248 | datainfo=split_output['test_info'])
249 | userres_test.update(uout)
250 | cms_test.update(ucm)
251 |
252 |
253 | return clf, {'by_user': userres_train, 'cms': cms_train, 'train_time': train_time, 'pred_time': pred_time,
254 | 'org_labels': train_org_labels,
255 | 'pred_bin': y_train_hat, 'pred_proba': y_train_proba}, \
256 | {'by_user': userres_test, 'cms': cms_test, 'pred_time': test_pred_time, 'org_labels': test_org_labels,
257 | 'pred_bin': y_test_hat,
258 | 'pred_proba': y_test_proba}
259 |
260 |
261 | def user_auc_roc(ures, users):
262 | nnu = len(users) - sum(ures[:, 0])
263 | nmu = sum(ures[:, 0])
264 | ufpr = np.sum(ures[np.where(ures[:, 0] == 0)[0], 1:], axis=0) / nnu
265 | utpr = np.sum(ures[np.where(ures[:, 0] == 1)[0], 1:], axis=0) / nmu
266 | uauc = auc(ufpr, utpr)
267 | return utpr, ufpr, uauc
268 |
269 |
270 | def user_auc_roc2(u_idxs, y_bin, y_predprob, thresholds):
271 | ures = np.zeros((1, len(thresholds)))
272 | u_y = y_bin[u_idxs]
273 | u_yprob = y_predprob[u_idxs]
274 | for ii in range(len(thresholds[1:])):
275 | ures[0, ii + 1] = int(np.any(u_yprob >= thresholds[ii + 1]))
276 | if np.any(u_y != 0):
277 | ures[0, 0] = 1
278 | return ures
279 |
280 |
281 | def roc_auc_calc(rw, algs=('RF', 'XGB'), nrun=20, dtype=None, data=None, res_names=['test_in']):
282 |
283 | allres = []
284 | fpri = np.linspace(0, 1, 1000) # interpolation
285 |
286 | for i in range(nrun):
287 | for alg in algs:
288 | gc.collect()
289 | if 'all' in res_names:
290 | res_names = [td for td in rw[i][alg].keys() if td not in ['clf'] and 'thres' not in td]
291 |
292 | for resname in res_names:
293 | resname2 = 'train_users' if resname == 'train' else 'test_users'
294 | list_users = sorted([u for u in rw[i][alg][resname]['by_user'][resname2].keys() if type(u) != str])
295 | y_predprob = rw[i][alg][resname]['pred_proba'][:, 1]
296 | y_org = rw[i][alg][resname]['org_labels']
297 | y_bin = np.array(y_org > 0).astype(int)
298 | fpr, tpr, thresholds = roc_curve(y_bin, y_predprob)
299 | tpri = np.interp(fpri, fpr, tpr)
300 |
301 | if len(set(y_bin)) > 1:
302 | aucsc = roc_auc_score(y_bin, y_predprob)
303 | ures = Parallel(n_jobs=num_cores)(
304 | delayed(user_auc_roc2)(rw[i][alg][resname]['by_user'][resname2][u]['data_idxs'], y_bin,
305 | y_predprob, thresholds) for u in list_users)
306 |
307 | utpr, ufpr, uauc = user_auc_roc(np.vstack(ures), list_users)
308 | utpri = np.interp(fpri, ufpr, utpr)
309 | else:
310 | aucsc, ufpr, utpr, uauc, tpri, utpri = None, None, None, None, None, None
311 |
312 | allres.append([i, alg, resname, fpr, tpr, thresholds, aucsc, ufpr, utpr, uauc, fpri, tpri, utpri])
313 |
314 | res = pd.DataFrame(
315 | columns=['run', 'alg', 'test_on', 'fpr', 'tpr', 'threshold', 'auc', 'ufpr', 'utpr', 'uauc', 'fpri', 'tpri',
316 | 'utpri'], data=allres)
317 | if dtype is not None: res['dtype'] = dtype
318 | res['data'] = data
319 | return res
320 |
321 |
322 | def get_cert_roc(r, a, dtype, test_on='test_in', user=True):
323 | fprs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['fpri'].values
324 | if user:
325 | tprs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['utpri'].values
326 | aucs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['uauc'].values
327 | else:
328 | tprs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['tpri'].values
329 | aucs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['auc'].values
330 |
331 | mean_fpr = np.concatenate(([0], np.mean(fprs, axis=0)))
332 | mean_tpr = np.concatenate(([0], np.mean(tprs, axis=0)))
333 |
334 | std_auc = np.std(aucs)
335 | mean_auc = auc(mean_fpr, mean_tpr)
336 |
337 | std_tpr = np.concatenate(([0], np.std(tprs, axis=0)))
338 | tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
339 | tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
340 |
341 | return mean_fpr, mean_tpr, tprs_upper, tprs_lower, mean_auc, std_auc
--------------------------------------------------------------------------------
/TNSM2020/params.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcd-dal/feature-extraction-for-CERT-insider-threat-test-datasets/3441da78ea8ff1bd13ecb27c3a8157d04d0a84fe/TNSM2020/params.pkl
--------------------------------------------------------------------------------
/TNSM2020/run_classification.py:
--------------------------------------------------------------------------------
1 | from copy import deepcopy
2 | import pickle
3 | import gc
4 | import pandas as pd
5 | import time
6 | import clf_helpers
7 | import matplotlib.pyplot as plt
8 |
9 | from sklearn.ensemble import RandomForestClassifier
10 | from sklearn.neural_network import MLPClassifier
11 | from sklearn.linear_model import LogisticRegression
12 | from xgboost import XGBClassifier
13 | import warnings, sklearn
14 |
15 | warnings.filterwarnings("ignore", category=sklearn.exceptions.DataConversionWarning)
16 | n_cores = 16
17 |
18 |
19 | def run_exp_onealg(run, slw, x, clf_name, classifiers, res_by_user, st):
20 | print(clf_name)
21 | clf_copy = deepcopy(classifiers[clf_name])
22 | if hasattr(clf_copy, 'random_state'):
23 | clf_copy.set_params(**{'random_state': run})
24 |
25 | clf_res = clf_helpers.do_classification(clf_copy, x['train'], slw['y_train_bin'],
26 | x['test'], slw['y_test_bin'],
27 | y_org={'train': slw['y_train'], 'test': slw['y_test']},
28 | by_user=res_by_user, split_output=slw)
29 | res = {'train': clf_res[1], 'test_in': clf_res[2]}
30 | print('Training time: ', res['train']['train_time'])
31 | print('Train confusion matrices: ', res['train']['cms'])
32 | print('Test confusion matrices: ', res['test_in']['cms'])
33 | gc.collect()
34 | print(run, 'Done training & res,', (time.time() - st) // 1)
35 | return res
36 |
37 |
38 | def run_exp_onerun(run, classifiers, data_in, test_size, res_by_user, shuffle,
39 | normalization, by_user_time=False, by_user_time_trainper=0.5, limit_ntrain_user=0):
40 | st = time.time()
41 | slw = clf_helpers.split_data(data_in['data'], test_size=test_size, shuffle=shuffle, random_state=run,
42 | normalization=normalization, dname=data_in['name'],
43 | by_user_time=by_user_time, by_user_time_trainper=by_user_time_trainper,
44 | limit_ntrain_user=limit_ntrain_user)
45 | print('\n', run, 'Done splitting,', (time.time() - st) // 1)
46 |
47 | res_clf = {'scaler': slw['sc']}
48 | x_cols = list(slw['x_cols'])
49 | x = {'train': slw['x_train'], 'test': slw['x_test']}
50 |
51 | for clf_name in classifiers:
52 | res = run_exp_onealg(run, slw, x, clf_name, classifiers, res_by_user, st)
53 | res_clf[clf_name] = res
54 |
55 | return res_clf, x_cols
56 |
57 |
58 | def run_experiment(n_run, classifiers, data_in, test_size=0.5, res_by_user=True, shuffle=True,
59 | normalization="StandardScaler", by_user_time=True, by_user_time_trainper=0.5,
60 | limit_ntrain_user=0):
61 |
62 | all_res = {'exp_setting': {'y_col': 'insider', 'train_dname': data_in['name'],
63 | 'shuffle': shuffle, 'norm': normalization, 'res_by_user': res_by_user}}
64 | all_res['exp_setting']['classifiers'] = list(classifiers.keys())
65 | all_res['exp_setting']['n_run'] = n_run
66 | all_res['exp_setting']['in_test_size'] = test_size
67 | all_res['exp_setting']['by_user_time_trainper'] = by_user_time_trainper
68 | all_res['exp_setting']['limit_ntrain_user'] = limit_ntrain_user
69 |
70 | for run in range(n_run):
71 | res_clf, x_cols = run_exp_onerun(run, classifiers, data_in, test_size,
72 | res_by_user=res_by_user, by_user_time=by_user_time,
73 | by_user_time_trainper=by_user_time_trainper,
74 | limit_ntrain_user=limit_ntrain_user,
75 | shuffle=shuffle,
76 | normalization=normalization,
77 | )
78 | gc.collect()
79 | all_res[run] = res_clf
80 | return all_res
81 |
82 |
83 | def load_data(dname, dtype, datafolder='data'):
84 | name = datafolder + '/' + dtype + dname + '.csv.gz'
85 | return pd.read_csv(name)
86 |
87 |
88 | def run_exp(nrun, dname, dtype, mode, ttype=None, limit_ntrain_user=None, train_week_per=None, test_per=0.5, algs=None,
89 | load_params=True, scaler='StandardScaler', savefolder='res'):
90 |
91 | print('\n----------------\n%s %s' % (dname, dtype), '\n----------------\n')
92 |
93 | clfs = {'LR': LogisticRegression(solver='lbfgs', n_jobs=n_cores),
94 | 'MLP': MLPClassifier(solver='adam'),
95 | 'RF': RandomForestClassifier(n_jobs=n_cores),
96 | 'XGB': XGBClassifier(n_jobs=n_cores),
97 | }
98 |
99 | if algs is not None:
100 | clfs = {k:clfs[k] for k in algs}
101 |
102 | if load_params:
103 | with open('params.pkl', 'rb') as f:
104 | loaded_params = pickle.load(f)
105 | for c in clfs:
106 | if c != 'LR':
107 | clfs[c].set_params(**loaded_params[c][dtype])
108 |
109 | data_in = {'name': dname, 'data': load_data(dname, dtype)}
110 |
111 | if mode == 'by_user_time':
112 | res = run_experiment(nrun, clfs, data_in, by_user_time=True,
113 | by_user_time_trainper=train_week_per,
114 | limit_ntrain_user=limit_ntrain_user,
115 | res_by_user=True, normalization=scaler)
116 | elif mode == 'randomsplit':
117 | res = run_experiment(nrun, clfs, data_in, by_user_time=False,
118 | test_size=test_per,
119 | res_by_user=False, normalization=scaler)
120 |
121 | savefile = '%s/%s-%s-%s-%s-%s' % (savefolder, dname, dtype, ttype, mode, '_'.join(algs)) + '.pickle'
122 | with open(savefile, 'wb') as handle:
123 | pickle.dump(res, handle, protocol=4)
124 | return res
125 |
126 |
127 | if __name__ == "__main__":
128 | algs = ['RF']
129 | nrun = 2
130 |
131 | dname = 'r5.2'
132 | dtypes = ['week']
133 | mode = 'by_user_time'
134 |
135 | for dtype in dtypes:
136 | res = run_exp(nrun, dname, dtype, algs=algs, mode=mode, limit_ntrain_user=400, train_week_per=0.5,
137 | load_params=True)
138 | if mode == 'randomsplit': continue
139 |
140 | res = clf_helpers.roc_auc_calc(res, algs=algs, nrun=nrun, dtype=dtype, data=dname)
141 |
142 | colors = ['r', 'g', 'blue', 'orange']
143 | for user in [True, False]:
144 | plt.figure()
145 | restype = 'user' if user else 'org'
146 | for i, alg in enumerate(algs):
147 | tmp = clf_helpers.get_cert_roc(res, alg, dtype, 'test_in', user=user)
148 | plt.plot(tmp[0], tmp[1], label=f'{alg}, AUC = {tmp[4]:.3f}', color=colors[i])
149 | plt.fill_between(tmp[0], tmp[3], tmp[2], color=colors[i], alpha=.1, label=None)
150 | plt.legend()
151 | plt.savefig(f'ROC_{dtype}_{restype}.jpg')
152 |
--------------------------------------------------------------------------------
/example_anomaly_detection.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | import pandas as pd
5 | import numpy as np
6 | # from sklearn.ensemble import IsolationForest
7 | from sklearn.neural_network import MLPRegressor
8 | from sklearn.metrics.pairwise import paired_distances
9 | from sklearn.preprocessing import StandardScaler
10 | from sklearn.metrics import roc_auc_score
11 |
12 |
13 | print('This script runs a sample anomaly detection (using a simple autoencoder) '
14 | 'on the CERT dataset. By default it takes CERT r5.2 data extracted with the percentile '
15 | 'representation, generated using the temporal_data_representation script. '
16 | 'It then trains on data of 200 random users in the first half of the dataset, '
17 | 'and outputs the AUC score and detection rates at different budgets (instance-based).')
18 |
19 | print('For more details, see this paper: Anomaly Detection for Insider Threats Using'
20 | ' Unsupervised Ensembles. Le, D. C.; and Zincir-Heywood, A. N. IEEE Transactions'
21 | ' on Network and Service Management, 18(2): 1152–1164. June 2021.')
22 |
23 | data = pd.read_pickle('week-r5.2-percentile30.pkl')
24 | removed_cols = ['user','day','week','starttime','endtime','sessionid','insider']
25 | x_cols = [i for i in data.columns if i not in removed_cols]
26 |
27 | run = 1
28 | np.random.seed(run)
29 |
30 | data1stHalf = data[data.week <= max(data.week)/2]
31 | dataTest = data[data.week > max(data.week)/2]
32 |
33 | nUsers = np.random.permutation(list(set(data1stHalf.user)))
34 | trainUsers = nUsers[:200]
35 |
36 |
37 | xTrain = data1stHalf[data1stHalf.user.isin(trainUsers)][x_cols].values
38 | yTrain = data1stHalf[data1stHalf.user.isin(trainUsers)]['insider'].values
39 | yTrainBin = yTrain > 0
40 |
41 | xTest = data[x_cols].values
42 | yTest = data['insider'].values
43 | yTestBin = yTest > 0
44 |
45 | scaler = StandardScaler()
46 | xTrain = scaler.fit_transform(xTrain)
47 | xTest = scaler.transform(xTest)
48 |
49 | ae = MLPRegressor(hidden_layer_sizes=(int(data.shape[1]/4), int(data.shape[1]/8),
50 | int(data.shape[1]/4)), max_iter=25, random_state=10)
51 |
52 | ae.fit(xTrain, xTrain)
53 |
54 | reconstructionError = paired_distances(xTest, ae.predict(xTest))
55 |
56 | print('AUC score: ', roc_auc_score(yTestBin, reconstructionError))
57 |
58 | print('Detection rate at different budgets:')
59 | for ib in [0.001, 0.01, 0.05, 0.1, 0.2]:
60 | threshold = np.percentile(reconstructionError, 100-100*ib)
61 | flagged = np.where(reconstructionError>threshold)[0]
62 | dr = sum(yTestBin[flagged]>0)/sum(yTestBin>0)
63 | print(f'{100*ib}%, DR = {100*dr:.2f}%')
--------------------------------------------------------------------------------
/example_classification.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | import pandas as pd
5 | import numpy as np
6 | from sklearn.ensemble import RandomForestClassifier
7 | from sklearn.metrics import recall_score, classification_report, f1_score, accuracy_score
8 |
9 | print('This script trains a sample classifier (a simple RandomForestClassifier) '
10 | 'on the CERT dataset. By default it takes the CERT r5.2 extracted day data '
11 | 'downloaded from https://web.cs.dal.ca/~lcd/data/CERTr5.2/, '
12 | 'trains on data of 400 users in the first half of the dataset, '
13 | 'then outputs a classification report (instance-based).')
14 |
15 | print('For more details, see this paper: Analyzing Data Granularity Levels for'
16 | ' Insider Threat Detection using Machine Learning. Le, D. C.; Zincir-Heywood, '
17 | 'A. N.; and Heywood, M. I. IEEE Transactions on Network and Service Management,'
18 | ' 17(1): 30–44. March 2020.')
19 |
20 | data = pd.read_csv('day-r5.2.csv.gz')
21 | removed_cols = ['user','day','week','starttime','endtime','sessionid','insider']
22 | x_cols = [i for i in data.columns if i not in removed_cols]
23 |
24 | run = 1
25 | np.random.seed(run)
26 |
27 | data1stHalf = data[data.week <= max(data.week)/2]
28 | dataTest = data[data.week > max(data.week)/2]
29 |
30 | selectedTrainUsers = set(data1stHalf[data1stHalf.insider > 0]['user'])
31 | nUsers = np.random.permutation(list(set(data1stHalf.user) - selectedTrainUsers))
32 | trainUsers = np.concatenate((list(selectedTrainUsers), nUsers[:400-len(selectedTrainUsers)]))
33 |
34 | unKnownTestUsers = list(set(dataTest.user) - selectedTrainUsers)
35 |
36 | xTrain = data1stHalf[data1stHalf.user.isin(trainUsers)][x_cols].values
37 | yTrain = data1stHalf[data1stHalf.user.isin(trainUsers)]['insider'].values
38 | yTrainBin = yTrain > 0
39 |
40 | xTest = dataTest[dataTest.user.isin(unKnownTestUsers)][x_cols].values
41 | yTest = dataTest[dataTest.user.isin(unKnownTestUsers)]['insider'].values
42 | yTestBin = yTest > 0
43 |
44 | rf = RandomForestClassifier(n_jobs=-1)
45 |
46 | rf.fit(xTrain, yTrainBin)
47 |
48 | print(classification_report(yTestBin, rf.predict(xTest)))
49 |
50 |
--------------------------------------------------------------------------------
/feature_extraction.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | """
4 | @author: lcd
5 | """
6 | import os, sys
7 | import pandas as pd
8 | import numpy as np
9 | from datetime import datetime, timedelta
10 | import re
11 | import time
12 | import subprocess
13 | from joblib import Parallel, delayed
14 |
15 | def time_convert(inp, mode, real_sd = '2010-01-02', sd_monday= "2009-12-28"):
16 | if mode == 'e2t':
17 | return datetime.fromtimestamp(inp).strftime('%m/%d/%Y %H:%M:%S')
18 | elif mode == 't2e':
19 | return datetime.strptime(inp, '%m/%d/%Y %H:%M:%S').strftime('%s')
20 | elif mode == 't2dt':
21 | return datetime.strptime(inp, '%m/%d/%Y %H:%M:%S')
22 | elif mode == 't2date':
23 | return datetime.strptime(inp, '%m/%d/%Y %H:%M:%S').strftime("%Y-%m-%d")
24 | elif mode == 'dt2t':
25 | return inp.strftime('%m/%d/%Y %H:%M:%S')
26 | elif mode == 'dt2W':
27 | return int(inp.strftime('%W'))
28 | elif mode == 'dt2d':
29 | return inp.strftime('%m/%d/%Y %H:%M:%S')
30 | elif mode == 'dt2date':
31 | return inp.strftime("%Y-%m-%d")
32 | elif mode =='dt2dn': #datetime to day number
33 | startdate = datetime.strptime(sd_monday,'%Y-%m-%d')
34 | return (inp - startdate).days
35 | elif mode =='dn2epoch': #datenum to epoch
36 | dt = datetime.strptime(sd_monday,'%Y-%m-%d') + timedelta(days=inp)
37 | return int(dt.timestamp())
38 | elif mode =='dt2wn': #datetime to week number
39 | startdate = datetime.strptime(real_sd,'%Y-%m-%d')
40 | return (inp - startdate).days//7
41 |     elif mode =='t2wn': #time string to week number
42 | startdate = datetime.strptime(real_sd,'%Y-%m-%d')
43 | return (datetime.strptime(inp, '%m/%d/%Y %H:%M:%S') - startdate).days//7
44 | elif mode == 'dt2wd':
45 | return int(inp.strftime("%w"))
46 | elif mode == 'm2dt':
47 | return datetime.strptime(inp, "%Y-%m")
48 | elif mode == 'datetoweekday':
49 | return int(datetime.strptime(inp,"%Y-%m-%d").strftime('%w'))
50 | elif mode == 'datetoweeknum':
51 | w0 = datetime.strptime(sd_monday,"%Y-%m-%d")
52 | return int((datetime.strptime(inp,"%Y-%m-%d") - w0).days / 7)
53 | elif mode == 'weeknumtodate':
54 | startday = datetime.strptime(sd_monday,"%Y-%m-%d")
55 | return startday+timedelta(weeks = inp)
56 |
57 | def add_action_thisweek(act, columns, lines, act_handles, week_index, stop, firstdate, dname = 'r5.2'):
58 | thisweek_act = []
59 | while True:
60 | if not lines[act]:
61 | stop[act] = 1
62 | break
63 | if dname in ['r6.1','r6.2'] and act in ['email', 'file','http'] and '"' in lines[act]:
64 | tmp = lines[act]
65 | firstpart = tmp[:tmp.find('"')-1]
66 | content = tmp[tmp.find('"')+1:-1]
67 | tmp = firstpart.split(',') + [content]
68 | else:
69 | tmp = lines[act].split(',')
70 | if time_convert(tmp[1], 't2wn', real_sd= firstdate) == week_index:
71 | thisweek_act.append(tmp)
72 | else:
73 | break
74 | lines[act] = act_handles[act].readline()
75 | df = pd.DataFrame(thisweek_act, columns=columns)
76 | df['type']= act
77 | df.index = df['id']
78 |     df.drop('id', axis=1, inplace=True)
79 | return df
80 |
81 | def combine_by_timerange_pandas(dname = 'r4.2'):
82 | allacts = ['device','email','file', 'http','logon']
83 | firstline = str(subprocess.check_output(['head', '-2', 'http.csv'])).split('\\n')[1]
84 | firstdate = time_convert(firstline.split(',')[1],'t2dt')
85 | firstdate = firstdate - timedelta(int(firstdate.strftime("%w")))
86 | firstdate = time_convert(firstdate, 'dt2date')
87 | week_index = 0
88 | act_handles = {}
89 | lines = {}
90 | stop = {}
91 | for act in allacts:
92 | act_handles[act] = open(act+'.csv','r')
93 | next(act_handles[act],None) #skip header row
94 | lines[act] = act_handles[act].readline()
95 | stop[act] = 0 # store stop value indicating if all of the file has been read
96 | while sum(stop.values()) < 5:
97 | thisweekdf = pd.DataFrame()
98 | for act in allacts:
99 | if 'email' == act:
100 | if dname in ['r4.1','r4.2']:
101 | columns = ['id', 'date', 'user', 'pc', 'to', 'cc', 'bcc', 'from', 'size', '#att', 'content']
102 | if dname in ['r6.1','r6.2','r5.2','r5.1']:
103 | columns = ['id', 'date', 'user', 'pc', 'to', 'cc', 'bcc', 'from', 'activity', 'size', 'att', 'content']
104 | elif 'logon' == act:
105 | columns = ['id', 'date', 'user', 'pc', 'activity']
106 | elif 'device' == act:
107 | if dname in ['r4.1','r4.2']:
108 | columns = ['id', 'date', 'user', 'pc', 'activity']
109 | if dname in ['r5.1','r5.2','r6.2','r6.1']:
110 | columns = ['id', 'date', 'user', 'pc', 'content', 'activity']
111 | elif 'http' == act:
112 | if dname in ['r6.1','r6.2']: columns = ['id', 'date', 'user', 'pc', 'url/fname', 'activity', 'content']
113 | if dname in ['r5.1','r5.2','r4.2','r4.1']: columns = ['id', 'date', 'user', 'pc', 'url/fname', 'content']
114 | elif 'file' == act:
115 | if dname in ['r4.1','r4.2']: columns = ['id', 'date', 'user', 'pc', 'url/fname', 'content']
116 | if dname in ['r5.2','r5.1','r6.2','r6.1']: columns = ['id', 'date', 'user', 'pc', 'url/fname','activity','to','from','content']
117 |
118 | df = add_action_thisweek(act, columns, lines, act_handles, week_index, stop, firstdate, dname=dname)
119 | thisweekdf = thisweekdf.append(df, sort=False)
120 |
121 | thisweekdf['date'] = thisweekdf['date'].apply(lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M:%S"))
122 | thisweekdf.to_pickle("DataByWeek/"+str(week_index)+".pickle")
123 | week_index += 1
124 |
125 | ##############################################################################
126 |
127 | def process_user_pc(upd, roles): #figure out which PC belongs to which user
128 | upd['sharedpc'] = None
129 | upd['npc'] = upd['pcs'].apply(lambda x: len(x))
130 |     upd.loc[upd['npc']==1,'pc'] = upd[upd['npc']==1]['pcs'].apply(lambda x: x[0])
131 | multiuser_pcs = np.concatenate(upd[upd['npc']>1]['pcs'].values).tolist()
132 | set_multiuser_pc = list(set(multiuser_pcs))
133 | count = {}
134 | for pc in set_multiuser_pc:
135 | count[pc] = multiuser_pcs.count(pc)
136 | for u in upd[upd['npc']>1].index:
137 | sharedpc = upd.loc[u]['pcs']
138 | count_u_pc = [count[pc] for pc in upd.loc[u]['pcs']]
139 | the_pc = count_u_pc.index(min(count_u_pc))
140 | upd.at[u,'pc'] = sharedpc[the_pc]
141 | if roles.loc[u] != 'ITAdmin':
142 | sharedpc.remove(sharedpc[the_pc])
143 | upd.at[u,'sharedpc']= sharedpc
144 | return upd
145 |
146 | def getuserlist(dname = 'r4.2', psycho = True):
147 | allfiles = ['LDAP/'+f1 for f1 in os.listdir('LDAP') if os.path.isfile('LDAP/'+f1)]
148 | alluser = {}
149 | alreadyFired = []
150 |
151 | for file in allfiles:
152 | af = (pd.read_csv(file,delimiter=',')).values
153 | employeesThisMonth = []
154 | for i in range(len(af)):
155 | employeesThisMonth.append(af[i][1])
156 | if af[i][1] not in alluser:
157 | alluser[af[i][1]] = af[i][0:1].tolist() + af[i][2:].tolist() + [file.split('.')[0] , np.nan]
158 |
159 | firedEmployees = list(set(alluser.keys()) - set(alreadyFired) - set(employeesThisMonth))
160 | alreadyFired = alreadyFired + firedEmployees
161 | for e in firedEmployees:
162 | alluser[e][-1] = file.split('.')[0]
163 |
164 | if psycho and os.path.isfile("psychometric.csv"):
165 |
166 | p_score = pd.read_csv("psychometric.csv",delimiter = ',').values
167 | for id in range(len(p_score)):
168 | alluser[p_score[id,1]] = alluser[p_score[id,1]]+ list(p_score[id,2:])
169 | df = pd.DataFrame.from_dict(alluser, orient='index')
170 | if dname in ['r4.1','r4.2']:
171 | df.columns = ['uname', 'email', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'sup','wstart', 'wend', 'O', 'C', 'E', 'A', 'N']
172 | elif dname in ['r5.2','r5.1','r6.2','r6.1']:
173 | df.columns = ['uname', 'email', 'role', 'project', 'b_unit', 'f_unit', 'dept', 'team', 'sup','wstart', 'wend', 'O', 'C', 'E', 'A', 'N']
174 | else:
175 | df = pd.DataFrame.from_dict(alluser, orient='index')
176 | if dname in ['r4.1','r4.2']:
177 | df.columns = ['uname', 'email', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'sup', 'wstart', 'wend']
178 | elif dname in ['r5.2','r5.1','r6.2','r6.1']:
179 | df.columns = ['uname', 'email', 'role', 'project', 'b_unit', 'f_unit', 'dept', 'team', 'sup', 'wstart', 'wend']
180 |
181 | df['pc'] = None
182 | for i in df.index:
183 | if type(df.loc[i]['sup']) == str:
184 | sup = df[df['uname'] == df.loc[i]['sup']].index[0]
185 | else:
186 | sup = None
187 | df.at[i,'sup'] = sup
188 |
189 | #read first 2 weeks to determine each user's PC
190 | w1 = pd.read_pickle("DataByWeek/1.pickle")
191 | w2 = pd.read_pickle("DataByWeek/2.pickle")
192 | user_pc_dict = pd.DataFrame(index=df.index)
193 | user_pc_dict['pcs'] = None
194 |
195 | for u in df.index:
196 | pc = list(set(w1[w1['user']==u]['pc']) & set(w2[w2['user']==u]['pc']))
197 | user_pc_dict.at[u,'pcs'] = pc
198 | upd = process_user_pc(user_pc_dict, df['role'])
199 | df['pc'] = upd['pc']
200 | df['sharedpc'] = upd['sharedpc']
201 | return df
202 |
203 |
204 | def get_mal_userdata(data = 'r4.2', usersdf = None):
205 |
206 | if not os.path.isdir('answers'):
207 | os.system('wget https://kilthub.cmu.edu/ndownloader/files/24857828 -O answers.tar.bz2')
208 | os.system('tar -xjvf answers.tar.bz2')
209 |
210 | listmaluser = pd.read_csv("answers/insiders.csv")
211 | listmaluser['dataset'] = listmaluser['dataset'].apply(lambda x: str(x))
212 | listmaluser = listmaluser[listmaluser['dataset']==data.replace("r","")]
213 |     #for r6.2, the start time in the scenario 4 answer is incomplete; prepend the missing month.
214 |     if data == 'r6.2': listmaluser.loc[listmaluser['scenario']==4,'start'] = '02'+listmaluser[listmaluser['scenario']==4]['start']
215 | listmaluser[['start','end']] = listmaluser[['start','end']].applymap(lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M:%S"))
216 |
217 | if type(usersdf) != pd.core.frame.DataFrame:
218 | usersdf = getuserlist(data)
219 | usersdf['malscene']=0
220 | usersdf['mstart'] = None
221 | usersdf['mend'] = None
222 | usersdf['malacts'] = None
223 |
224 | for i in listmaluser.index:
225 | usersdf.loc[listmaluser['user'][i], 'mstart'] = listmaluser['start'][i]
226 | usersdf.loc[listmaluser['user'][i], 'mend'] = listmaluser['end'][i]
227 | usersdf.loc[listmaluser['user'][i], 'malscene'] = listmaluser['scenario'][i]
228 |
229 | if data in ['r4.2', 'r5.2']:
230 | malacts = open(f"answers/r{listmaluser['dataset'][i]}-{listmaluser['scenario'][i]}/"+
231 | listmaluser['details'][i],'r').read().strip().split("\n")
232 | else: #only 1 malicious user, no folder
233 | malacts = open("answers/"+ listmaluser['details'][i],'r').read().strip().split("\n")
234 |
235 | malacts = [x.split(',') for x in malacts]
236 |
237 | mal_users = np.array([x[3].strip('"') for x in malacts])
238 | mal_act_ids = np.array([x[1].strip('"') for x in malacts])
239 |
240 | usersdf.at[listmaluser['user'][i], 'malacts'] = mal_act_ids[mal_users==listmaluser['user'][i]]
241 |
242 | return usersdf
243 |
244 | ##############################################################################
245 |
246 | def is_after_whour(dt): #Workhours assumed 7:30-17:30
247 | wday_start = datetime.strptime("7:30", "%H:%M").time()
248 | wday_end = datetime.strptime("17:30", "%H:%M").time()
249 | dt = dt.time()
250 | if dt < wday_start or dt > wday_end:
251 | return True
252 | return False
253 |
254 | def is_weekend(dt):
255 | if dt.strftime("%w") in ['0', '6']:
256 | return True
257 | return False
258 |
259 | def email_process(act, data = 'r4.2', separate_send_receive = True):
260 | receivers = act['to'].split(';')
261 | if type(act['cc']) == str:
262 | receivers = receivers + act['cc'].split(";")
263 | if type(act['bcc']) == str:
264 | bccreceivers = act['bcc'].split(";")
265 | else:
266 | bccreceivers = []
267 | exemail = False
268 | n_exdes = 0
269 | for i in receivers + bccreceivers:
270 | if 'dtaa.com' not in i:
271 | exemail = True
272 | n_exdes += 1
273 |
274 | n_des = len(receivers) + len(bccreceivers)
275 | Xemail = 1 if exemail else 0
276 | n_bccdes = len(bccreceivers)
277 | exbccmail = 0
278 | email_text_len = len(act['content'])
279 | email_text_nwords = act['content'].count(' ') + 1
280 | for i in bccreceivers:
281 | if 'dtaa.com' not in i:
282 | exbccmail = 1
283 | break
284 |
285 | if data in ['r5.1','r5.2','r6.1','r6.2']:
286 | send_mail = 1 if act['activity'] == 'Send' else 0
287 | receive_mail = 1 if act['activity'] in ['Receive','View'] else 0
288 |
289 | atts = act['att'].split(';')
290 | n_atts = len(atts)
291 | size_atts = 0
292 | att_types = [0,0,0,0,0,0]
293 | att_sizes = [0,0,0,0,0,0]
294 | for att in atts:
295 | if '.' in att:
296 | tmp = file_process(att, filetype='att')
297 | att_types = [sum(x) for x in zip(att_types,tmp[0])]
298 | att_sizes = [sum(x) for x in zip(att_sizes,tmp[1])]
299 | size_atts +=sum(tmp[1])
300 | return [send_mail, receive_mail, n_des, n_atts, Xemail, n_exdes,
301 | n_bccdes, exbccmail, int(act['size']), email_text_len,
302 | email_text_nwords] + att_types + att_sizes
303 | elif data in ['r4.1','r4.2']:
304 | return [n_des, int(act['#att']), Xemail, n_exdes, n_bccdes, exbccmail,
305 | int(act['size']), email_text_len, email_text_nwords]
306 |
307 | def http_process(act, data = 'r4.2'):
308 | # basic features:
309 | url_len = len(act['url/fname'])
310 | url_depth = act['url/fname'].count('/')-2
311 | content_len = len(act['content'])
312 | content_nwords = act['content'].count(' ')+1
313 |
314 | domainname = re.findall("//(.*?)/", act['url/fname'])[0]
315 |     domainname = domainname.replace("www.","")
316 | dn = domainname.split(".")
317 | if len(dn) > 2 and not any([x in domainname for x in ["google.com", '.co.uk', '.co.nz', 'live.com']]):
318 | domainname = ".".join(dn[-2:])
319 |
320 | # other 1, socnet 2, cloud 3, job 4, leak 5, hack 6
321 | if domainname in ['dropbox.com', 'drive.google.com', 'mega.co.nz', 'account.live.com']:
322 | r = 3
323 | elif domainname in ['wikileaks.org','freedom.press','theintercept.com']:
324 | r = 5
325 | elif domainname in ['facebook.com','twitter.com','plus.google.com','instagr.am','instagram.com',
326 | 'flickr.com','linkedin.com','reddit.com','about.com','youtube.com','pinterest.com',
327 | 'tumblr.com','quora.com','vine.co','match.com','t.co']:
328 | r = 2
329 | elif domainname in ['indeed.com','monster.com', 'careerbuilder.com','simplyhired.com']:
330 | r = 4
331 |
332 | elif ('job' in domainname and ('hunt' in domainname or 'search' in domainname)) \
333 | or ('aol.com' in domainname and ("recruit" in act['url/fname'] or "job" in act['url/fname'])):
334 | r = 4
335 | elif (domainname in ['webwatchernow.com','actionalert.com', 'relytec.com','refog.com','wellresearchedreviews.com',
336 | 'softactivity.com', 'spectorsoft.com','best-spy-soft.com']):
337 | r = 6
338 | elif ('keylog' in domainname):
339 | r = 6
340 | else:
341 | r = 1
342 | if data in ['r6.1','r6.2']:
343 | http_act_dict = {'www visit': 1, 'www download': 2, 'www upload': 3}
344 | http_act = http_act_dict.get(act['activity'].lower(), 0)
345 | return [r, url_len, url_depth, content_len, content_nwords, http_act]
346 | else:
347 | return [r, url_len, url_depth, content_len, content_nwords]
348 |
349 | def file_process(act, complete_ul = None, data = 'r4.2', filetype = 'act'):
350 | if filetype == 'act':
351 | ftype = act['url/fname'].split(".")[1]
352 | disk = 1 if act['url/fname'][0] == 'C' else 0
353 | if act['url/fname'][0] == 'R': disk = 2
354 | file_depth = act['url/fname'].count('\\')
355 | elif filetype == 'att': #attachments
356 | tmp = act.split('.')[1]
357 | ftype = tmp[:tmp.find('(')]
358 | attsize = int(tmp[tmp.find("(")+1:tmp.find(")")])
359 | r = [[0,0,0,0,0,0], [0,0,0,0,0,0]]
360 | if ftype in ['zip','rar','7z']:
361 | ind = 1
362 | elif ftype in ['jpg', 'png', 'bmp']:
363 | ind = 2
364 | elif ftype in ['doc','docx', 'pdf']:
365 | ind = 3
366 | elif ftype in ['txt','cfg', 'rtf']:
367 | ind = 4
368 | elif ftype in ['exe', 'sh']:
369 | ind = 5
370 | else:
371 | ind = 0
372 | r[0][ind] = 1
373 | r[1][ind] = attsize
374 | return r
375 |
376 | fsize = len(act['content'])
377 | f_nwords = act['content'].count(' ')+1
378 | if ftype in ['zip','rar','7z']:
379 | r = 2
380 | elif ftype in ['jpg', 'png', 'bmp']:
381 | r = 3
382 | elif ftype in ['doc','docx', 'pdf']:
383 | r = 4
384 | elif ftype in ['txt','cfg','rtf']:
385 | r = 5
386 | elif ftype in ['exe', 'sh']:
387 | r = 6
388 | else:
389 | r = 1
390 | if data in ['r5.2','r5.1', 'r6.2','r6.1']:
391 | to_usb = 1 if act['to'] == 'True' else 0
392 | from_usb = 1 if act['from'] == 'True' else 0
393 | file_depth = act['url/fname'].count('\\')
394 | file_act_dict = {'file open': 1, 'file copy': 2, 'file write': 3, 'file delete': 4}
395 | if act['activity'].lower() not in file_act_dict: print(act['activity'].lower())
396 | file_act = file_act_dict.get(act['activity'].lower(), 0)
397 | return [r, fsize, f_nwords, disk, file_depth, file_act, to_usb, from_usb]
398 | elif data in ['r4.1','r4.2']:
399 | return [r, fsize, f_nwords, disk, file_depth]
400 |
401 | def from_pc(act, ul):
402 | #code: 0,1,2,3: own pc, sharedpc, other's pc, supervisor's pc
403 | user_pc = ul.loc[act['user']]['pc']
404 | act_pc = act['pc']
405 | if act_pc == user_pc:
406 | return (0, act_pc) #using normal PC
407 | elif ul.loc[act['user']]['sharedpc'] is not None and act_pc in ul.loc[act['user']]['sharedpc']:
408 | return (1, act_pc)
409 | elif ul.loc[act['user']]['sup'] is not None and act_pc == ul.loc[ul.loc[act['user']]['sup']]['pc']:
410 | return (3, act_pc)
411 | else:
412 | return (2, act_pc)
413 |
414 | def process_week_num(week, users, userlist = 'all', data = 'r4.2'):
415 |
416 | user_dict = {idx: i for (i, idx) in enumerate(users.index)}
417 | acts_week = pd.read_pickle("DataByWeek/"+str(week)+".pickle")
418 | start_week, end_week = min(acts_week.date), max(acts_week.date)
419 | acts_week.sort_values('date', ascending = True, inplace = True)
420 | n_cols = 45 if data in ['r5.2','r5.1'] else 46
421 | if data in ['r4.2','r4.1']: n_cols = 27
422 | u_week = np.zeros((len(acts_week), n_cols))
423 | pc_time = []
424 | if userlist == 'all':
425 | userlist = set(acts_week.user)
426 |
427 | #FOR EACH USER
428 | current_ind = 0
429 | for u in userlist:
430 | df_acts_u = acts_week[acts_week.user == u]
431 | mal_u = 0 #, stop_soon = 0, 0
432 | if users.loc[u].malscene > 0:
433 | if start_week <= users.loc[u].mend and users.loc[u].mstart <= end_week:
434 | mal_u = users.loc[u].malscene
435 |
436 | list_uacts = df_acts_u.type.tolist() #all user's activities
437 | list_activity = df_acts_u.activity.tolist()
438 | list_uacts = [list_activity[i].strip().lower() if (type(list_activity[i])==str and list_activity[i].strip() in ['Logon', 'Logoff', 'Connect', 'Disconnect']) \
439 | else list_uacts[i] for i in range(len(list_uacts))]
440 | uacts_mapping = {'logon':1, 'logoff':2, 'connect':3, 'disconnect':4, 'http':5,'email':6,'file':7}
441 | list_uacts_num = [uacts_mapping[x] for x in list_uacts]
442 |
443 | oneu_week = np.zeros((len(df_acts_u), n_cols))
444 | oneu_pc_time = []
445 | for i in range(len(df_acts_u)):
446 | pc, _ = from_pc(df_acts_u.iloc[i], users)
447 | if is_weekend(df_acts_u.iloc[i]['date']):
448 | if is_after_whour(df_acts_u.iloc[i]['date']):
449 | act_time = 4
450 | else:
451 | act_time = 3
452 | elif is_after_whour(df_acts_u.iloc[i]['date']):
453 | act_time = 2
454 | else:
455 | act_time = 1
456 |
457 | if data in ['r4.2','r4.1']:
458 | device_f = [0]
459 | file_f = [0, 0, 0, 0, 0]
460 | http_f = [0,0,0,0,0]
461 | email_f = [0]*9
462 | elif data in ['r5.2','r5.1','r6.2','r6.1']:
463 | device_f = [0,0]
464 | file_f = [0]*8
465 | http_f = [0,0,0,0,0]
466 | if data in ['r6.2','r6.1']:
467 | http_f = [0,0,0,0,0,0]
468 | email_f = [0]*23
469 |
470 | if list_uacts[i] == 'file':
471 | file_f = file_process(df_acts_u.iloc[i], data = data)
472 | elif list_uacts[i] == 'email':
473 | email_f = email_process(df_acts_u.iloc[i], data = data)
474 | elif list_uacts[i] == 'http':
475 | http_f = http_process(df_acts_u.iloc[i], data=data)
476 | elif list_uacts[i] == 'connect':
477 | tmp = df_acts_u.iloc[i:]
478 | disconnect_acts = tmp[(tmp['activity'] == 'Disconnect\n') & \
479 | (tmp['user'] == df_acts_u.iloc[i]['user']) & \
480 | (tmp['pc'] == df_acts_u.iloc[i]['pc'])]
481 |
482 | connect_acts = tmp[(tmp['activity'] == 'Connect\n') & \
483 | (tmp['user'] == df_acts_u.iloc[i]['user']) & \
484 | (tmp['pc'] == df_acts_u.iloc[i]['pc'])]
485 |
486 | if len(disconnect_acts) > 0:
487 | distime = disconnect_acts.iloc[0]['date']
488 | if len(connect_acts) > 0 and connect_acts.iloc[0]['date'] < distime:
489 | connect_dur = -1
490 | else:
491 | tmp_td = distime - df_acts_u.iloc[i]['date']
492 | connect_dur = tmp_td.days*24*3600 + tmp_td.seconds
493 | else:
494 | connect_dur = -1 # disconnect action not found!
495 |
496 | if data in ['r5.2','r5.1','r6.2','r6.1']:
497 | file_tree_len = len(df_acts_u.iloc[i]['content'].split(';'))
498 | device_f = [connect_dur, file_tree_len]
499 | else:
500 | device_f = [connect_dur]
501 |
502 | is_mal_act = 0
503 | if mal_u > 0 and df_acts_u.index[i] in users.loc[u]['malacts']: is_mal_act = 1
504 |
505 | oneu_week[i,:] = [ user_dict[u], time_convert(df_acts_u.iloc[i]['date'], 'dt2dn'), list_uacts_num[i], pc, act_time] \
506 | + device_f + file_f + http_f + email_f + [is_mal_act, mal_u]
507 |
508 | oneu_pc_time.append([df_acts_u.index[i], df_acts_u.iloc[i]['pc'],df_acts_u.iloc[i]['date']])
509 | u_week[current_ind:current_ind+len(oneu_week),:] = oneu_week
510 | pc_time += oneu_pc_time
511 | current_ind += len(oneu_week)
512 |
513 | u_week = u_week[0:current_ind, :]
514 | col_names = ['user','day','act','pc','time']
515 | if data in ['r4.1','r4.2']:
516 | device_feature_names = ['usb_dur']
517 | file_feature_names = ['file_type', 'file_len', 'file_nwords', 'disk', 'file_depth']
518 | http_feature_names = ['http_type', 'url_len','url_depth', 'http_c_len', 'http_c_nwords']
519 | email_feature_names = ['n_des', 'n_atts', 'Xemail', 'n_exdes', 'n_bccdes', 'exbccmail', 'email_size', 'email_text_slen', 'email_text_nwords']
520 | elif data in ['r5.2','r5.1', 'r6.2','r6.1']:
521 | device_feature_names = ['usb_dur', 'file_tree_len']
522 | file_feature_names = ['file_type', 'file_len', 'file_nwords', 'disk', 'file_depth', 'file_act', 'to_usb', 'from_usb']
523 | http_feature_names = ['http_type', 'url_len','url_depth', 'http_c_len', 'http_c_nwords']
524 | if data in ['r6.2','r6.1']:
525 | http_feature_names = ['http_type', 'url_len','url_depth', 'http_c_len', 'http_c_nwords', 'http_act']
526 | email_feature_names = ['send_mail', 'receive_mail','n_des', 'n_atts', 'Xemail', 'n_exdes', 'n_bccdes', 'exbccmail', 'email_size', 'email_text_slen', 'email_text_nwords']
527 | email_feature_names += ['e_att_other', 'e_att_comp', 'e_att_pho', 'e_att_doc', 'e_att_txt', 'e_att_exe']
528 | email_feature_names += ['e_att_sother', 'e_att_scomp', 'e_att_spho', 'e_att_sdoc', 'e_att_stxt', 'e_att_sexe']
529 |
530 | col_names = col_names + device_feature_names + file_feature_names+ http_feature_names + email_feature_names + ['mal_act','insider']#['stop_soon', 'mal_act','insider']
531 | df_u_week = pd.DataFrame(columns=['actid','pcid','time_stamp'] + col_names, index = np.arange(0,len(pc_time)))
532 | df_u_week[['actid','pcid','time_stamp']] = np.array(pc_time)
533 |
534 | df_u_week[col_names] = u_week
535 | df_u_week[col_names] = df_u_week[col_names].astype(int)
536 | df_u_week.to_pickle("NumDataByWeek/"+str(week)+"_num.pickle")
537 |
538 | ##############################################################################
539 |
540 | # return sessions for each user in a week:
541 | # sessions[sid] = [sessionid, pc, start_with, end_with, start_time, end_time, number_of_concurrent_logins, [action_indices]]
542 | # start_with: whether the session starts with a logon action (1) or not (2)
543 | # end_with: whether the session ends with a logoff (1) or is closed by the next logon on the same PC (2)
544 | def get_sessions(uw, first_sid = 0):
545 | sessions = {}
546 | open_sessions = {}
547 | sid = 0
548 | current_pc = uw.iloc[0]['pcid']
549 | start_time = uw.iloc[0]['time_stamp']
550 | if uw.iloc[0]['act'] == 1:
551 | open_sessions[current_pc] = [current_pc, 1, 0, start_time, start_time, 1, [uw.index[0]]]
552 | else:
553 | open_sessions[current_pc] = [current_pc, 2, 0, start_time, start_time, 1, [uw.index[0]]]
554 |
555 | for i in uw.index[1:]:
556 | current_pc = uw.loc[i]['pcid']
557 | if current_pc in open_sessions: # must be already a session with that pcid
558 | if uw.loc[i]['act'] == 2:
559 | open_sessions[current_pc][2] = 1
560 | open_sessions[current_pc][4] = uw.loc[i]['time_stamp']
561 | open_sessions[current_pc][6].append(i)
562 | sessions[sid] = [first_sid+sid] + open_sessions.pop(current_pc)
563 | sid +=1
564 | elif uw.loc[i]['act'] == 1:
565 | open_sessions[current_pc][2] = 2
566 | sessions[sid] = [first_sid+sid] + open_sessions.pop(current_pc)
567 | sid +=1
568 | #create a new open session
569 | open_sessions[current_pc] = [current_pc, 1, 0, uw.loc[i]['time_stamp'], uw.loc[i]['time_stamp'], 1, [i]]
570 | if len(open_sessions) > 1: #increase the concurent count for all sessions
571 | for k in open_sessions:
572 | open_sessions[k][5] +=1
573 | else:
574 | open_sessions[current_pc][4] = uw.loc[i]['time_stamp']
575 | open_sessions[current_pc][6].append(i)
576 | else:
577 | start_status = 1 if uw.loc[i]['act'] == 1 else 2
578 | open_sessions[current_pc] = [current_pc, start_status, 0, uw.loc[i]['time_stamp'], uw.loc[i]['time_stamp'], 1, [i]]
579 | if len(open_sessions) > 1: #increase the concurent count for all sessions
580 | for k in open_sessions:
581 | open_sessions[k][5] +=1
582 | return sessions
583 |
584 | def get_u_features_dicts(ul, data = 'r5.2'):
585 | ufdict = {}
586 | list_uf=[] if data in ['r4.1','r4.2'] else ['project']
587 | list_uf += ['role','b_unit','f_unit', 'dept','team']
588 | for f in list_uf:
589 | ul[f] = ul[f].astype(str)
590 | tmp = list(set(ul[f]))
591 | tmp.sort()
592 | ufdict[f] = {idx:i for i, idx in enumerate(tmp)}
593 | return (ul,ufdict, list_uf)
594 |
595 | def proc_u_features(uf, ufdict, list_f = None, data = 'r4.2'): #to remove mode
596 | if type(list_f) != list:
597 | list_f=[] if data in ['r4.1','r4.2'] else ['project']
598 | list_f = ['role','b_unit','f_unit', 'dept','team'] + list_f
599 |
600 | out = []
601 | for f in list_f:
602 | out.append(ufdict[f][uf[f]])
603 | return out
604 |
605 | def f_stats_calc(ud, fn, stats_f, countonly_f = {}, get_stats = False):
606 | f_count = len(ud)
607 | r = []
608 | f_names = []
609 |
610 | for f in stats_f:
611 | inp = ud[f].values
612 | if get_stats:
613 | if f_count > 0:
614 | r += [np.min(inp), np.max(inp), np.median(inp), np.mean(inp), np.std(inp)]
615 | else: r += [0, 0, 0, 0, 0]
616 | f_names += [fn+'_min_'+f, fn+'_max_'+f, fn+'_med_'+f, fn+'_mean_'+f, fn+'_std_'+f]
617 | else:
618 | if f_count > 0: r += [np.mean(inp)]
619 | else: r += [0]
620 | f_names += [fn+'_mean_'+f]
621 |
622 | for f in countonly_f:
623 | for v in countonly_f[f]:
624 | r += [sum(ud[f].values == v)]
625 | f_names += [fn+'_n-'+f+str(v)]
626 | return (f_count, r, f_names)
627 |
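# f_calc_subfeatures: overall count and statistics for all actions in ud, plus the same
# statistics recomputed on each subset where filter_col equals filter_vals[i] (named filter_names[i]).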
628 | def f_calc_subfeatures(ud, fname, filter_col, filter_vals, filter_names, sub_features, countonly_subfeatures):
629 | [n, stats, fnames] = f_stats_calc(ud, fname,sub_features, countonly_subfeatures)
630 | allf = [n] + stats
631 | allf_names = ['n_'+fname] + fnames
632 | for i in range(len(filter_vals)):
633 | [n_sf, sf_stats, sf_fnames] = f_stats_calc(ud[ud[filter_col]==filter_vals[i]], filter_names[i], sub_features, countonly_subfeatures)
634 | allf += [n_sf] + sf_stats
635 | allf_names += [fname+'_n_'+filter_names[i]] + [fname + '_' + x for x in sf_fnames]
636 | return (allf, allf_names)
637 |
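# f_calc: build the numerical feature vector (and matching feature names) for one user's actions
# in a week, day or session, split into logon/device/file/email/http groups and, for week and day
# modes, into work-hour / after-hour (and, for weeks, weekend) subsets.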
638 | def f_calc(ud, mode = 'week', data = 'r4.2'):
639 | n_weekendact = (ud['time']==3).sum()
640 | if n_weekendact > 0:
641 | is_weekend = 1
642 | else:
643 | is_weekend = 0
644 |
645 | all_countonlyf = {'pc':[0,1,2,3]} if mode != 'session' else {}
646 | [all_f, all_f_names] = f_calc_subfeatures(ud, 'allact', None, [], [], [], all_countonlyf)
647 | if mode == 'day':
648 | [workhourf, workhourf_names] = f_calc_subfeatures(ud[(ud['time'] == 1) | (ud['time'] == 3)], 'workhourallact', None, [], [], [], all_countonlyf)
649 | [afterhourf, afterhourf_names] = f_calc_subfeatures(ud[(ud['time'] == 2) | (ud['time'] == 4) ], 'afterhourallact', None, [], [], [], all_countonlyf)
650 | elif mode == 'week':
651 | [workhourf, workhourf_names] = f_calc_subfeatures(ud[ud['time'] == 1], 'workhourallact', None, [], [], [], all_countonlyf)
652 | [afterhourf, afterhourf_names] = f_calc_subfeatures(ud[ud['time'] == 2 ], 'afterhourallact', None, [], [], [], all_countonlyf)
653 | [weekendf, weekendf_names] = f_calc_subfeatures(ud[ud['time'] >= 3 ], 'weekendallact', None, [], [], [], all_countonlyf)
654 |
655 | logon_countonlyf = {'pc':[0,1,2,3]} if mode != 'session' else {}
656 | logon_statf = []
657 |
658 | [all_logonf, all_logonf_names] = f_calc_subfeatures(ud[ud['act']==1], 'logon', None, [], [], logon_statf, logon_countonlyf)
659 | if mode == 'day':
660 | [workhourlogonf, workhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & ((ud['time'] == 1) | (ud['time'] == 3) )], 'workhourlogon', None, [], [], logon_statf, logon_countonlyf)
661 | [afterhourlogonf, afterhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & ((ud['time'] == 2) | (ud['time'] == 4) )], 'afterhourlogon', None, [], [], logon_statf, logon_countonlyf)
662 | elif mode == 'week':
663 | [workhourlogonf, workhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & (ud['time'] == 1)], 'workhourlogon', None, [], [], logon_statf, logon_countonlyf)
664 | [afterhourlogonf, afterhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & (ud['time'] == 2) ], 'afterhourlogon', None, [], [], logon_statf, logon_countonlyf)
665 | [weekendlogonf, weekendlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & (ud['time'] >= 3) ], 'weekendlogon', None, [], [], logon_statf, logon_countonlyf)
666 |
667 | device_countonlyf = {'pc':[0,1,2,3]} if mode != 'session' else {}
668 | device_statf = ['usb_dur','file_tree_len'] if data not in ['r4.1','r4.2'] else ['usb_dur']
669 |
670 | [all_devicef, all_devicef_names] = f_calc_subfeatures(ud[ud['act']==3], 'usb', None, [], [], device_statf, device_countonlyf)
671 | if mode == 'day':
672 | [workhourdevicef, workhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & ((ud['time'] == 1) | (ud['time'] == 3) )], 'workhourusb', None, [], [], device_statf, device_countonlyf)
673 | [afterhourdevicef, afterhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & ((ud['time'] == 2) | (ud['time'] == 4) )], 'afterhourusb', None, [], [], device_statf, device_countonlyf)
674 | elif mode == 'week':
675 | [workhourdevicef, workhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & (ud['time'] == 1)], 'workhourusb', None, [], [], device_statf, device_countonlyf)
676 | [afterhourdevicef, afterhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & (ud['time'] == 2) ], 'afterhourusb', None, [], [], device_statf, device_countonlyf)
677 | [weekenddevicef, weekenddevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & (ud['time'] >= 3) ], 'weekendusb', None, [], [], device_statf, device_countonlyf)
678 |
679 | if mode != 'session': file_countonlyf = {'to_usb':[1],'from_usb':[1], 'file_act':[1,2,3,4], 'disk':[0,1], 'pc':[0,1,2,3]}
680 | else: file_countonlyf = {'to_usb':[1],'from_usb':[1], 'file_act':[1,2,3,4], 'disk':[0,1,2]}
681 | if data in ['r4.1','r4.2']:
682 | [file_countonlyf.pop(k) for k in ['to_usb','from_usb', 'file_act']]
683 |
684 | (all_filef, all_filef_names) = f_calc_subfeatures(ud[ud['act']==7], 'file', 'file_type', [1,2,3,4,5,6], \
685 | ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
686 |
687 | if mode == 'day':
688 | (workhourfilef, workhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & ((ud['time'] ==1) | (ud['time'] ==3))], 'workhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
689 | (afterhourfilef, afterhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & ((ud['time'] ==2) | (ud['time'] ==4))], 'afterhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
690 | elif mode == 'week':
691 | (workhourfilef, workhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & (ud['time'] ==1)], 'workhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
692 | (afterhourfilef, afterhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & (ud['time'] ==2)], 'afterhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
693 | (weekendfilef, weekendfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & (ud['time'] >= 3)], 'weekendfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
694 |
695 | email_stats_f = ['n_des', 'n_atts', 'n_exdes', 'n_bccdes', 'email_size', 'email_text_slen', 'email_text_nwords']
696 | if data not in ['r4.1','r4.2']:
697 | email_stats_f += ['e_att_other', 'e_att_comp', 'e_att_pho', 'e_att_doc', 'e_att_txt', 'e_att_exe']
698 | email_stats_f += ['e_att_sother', 'e_att_scomp', 'e_att_spho', 'e_att_sdoc', 'e_att_stxt', 'e_att_sexe']
699 | mail_filter = 'send_mail'
700 | mail_filter_vals = [0,1]
701 | mail_filter_names = ['recvmail','send_mail']
702 | else:
703 | mail_filter, mail_filter_vals, mail_filter_names = None, [], []
704 |
705 | if mode != 'session': mail_countonlyf = {'Xemail':[1],'exbccmail':[1], 'pc':[0,1,2,3]}
706 | else: mail_countonlyf = {'Xemail':[1],'exbccmail':[1]}
707 |
708 | (all_emailf, all_emailf_names) = f_calc_subfeatures(ud[ud['act']==6], 'email', mail_filter, mail_filter_vals, mail_filter_names , email_stats_f, mail_countonlyf)
709 | if mode == 'week':
710 | (workhouremailf, workhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & (ud['time'] == 1)], 'workhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf)
711 | (afterhouremailf, afterhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & (ud['time'] == 2)], 'afterhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf)
712 | (weekendemailf, weekendemailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & (ud['time'] >= 3)], 'weekendemail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf)
713 | elif mode == 'day':
714 | (workhouremailf, workhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & ((ud['time'] ==1) | (ud['time'] ==3))], 'workhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf)
715 | (afterhouremailf, afterhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & ((ud['time'] ==2) | (ud['time'] ==4))], 'afterhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf)
716 |
717 |     if data in ['r4.1','r4.2','r5.1','r5.2']:
718 | http_count_subf = {'pc':[0,1,2,3]}
719 | elif data in ['r6.2','r6.1']:
720 | http_count_subf = {'pc':[0,1,2,3], 'http_act':[1,2,3]}
721 |
722 | if mode == 'session': http_count_subf.pop('pc',None)
723 |
724 | (all_httpf, all_httpf_names) = f_calc_subfeatures(ud[ud['act']==5], 'http', 'http_type', [1,2,3,4,5,6], \
725 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
726 |
727 | if mode == 'week':
728 | (workhourhttpf, workhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & (ud['time'] ==1)], 'workhourhttp', 'http_type', [1,2,3,4,5,6], \
729 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
730 | (afterhourhttpf, afterhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & (ud['time'] ==2)], 'afterhourhttp', 'http_type', [1,2,3,4,5,6], \
731 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
732 | (weekendhttpf, weekendhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & (ud['time'] >=3)], 'weekendhttp', 'http_type', [1,2,3,4,5,6], \
733 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
734 | elif mode == 'day':
735 | (workhourhttpf, workhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & ((ud['time'] ==1) | (ud['time'] ==3))], 'workhourhttp', 'http_type', [1,2,3,4,5,6], \
736 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
737 | (afterhourhttpf, afterhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & ((ud['time'] ==2) | (ud['time'] ==4))], 'afterhourhttp', 'http_type', [1,2,3,4,5,6], \
738 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
739 |
740 | numActs = all_f[0]
741 | mal_u = 0
742 | if (ud['mal_act']).sum() > 0:
743 | tmp = list(set(ud['insider']))
744 | if len(tmp) > 1:
745 | tmp.remove(0.0)
746 | mal_u = tmp[0]
747 |
748 | if mode == 'week':
749 | features_tmp = all_f + workhourf + afterhourf + weekendf +\
750 | all_logonf + workhourlogonf + afterhourlogonf + weekendlogonf +\
751 | all_devicef + workhourdevicef + afterhourdevicef + weekenddevicef +\
752 | all_filef + workhourfilef + afterhourfilef + weekendfilef + \
753 | all_emailf + workhouremailf + afterhouremailf + weekendemailf + all_httpf + workhourhttpf + afterhourhttpf + weekendhttpf
754 | fnames_tmp = all_f_names + workhourf_names + afterhourf_names + weekendf_names +\
755 | all_logonf_names + workhourlogonf_names + afterhourlogonf_names + weekendlogonf_names +\
756 | all_devicef_names + workhourdevicef_names + afterhourdevicef_names + weekenddevicef_names +\
757 | all_filef_names + workhourfilef_names + afterhourfilef_names + weekendfilef_names + \
758 | all_emailf_names + workhouremailf_names + afterhouremailf_names + weekendemailf_names + all_httpf_names + workhourhttpf_names + afterhourhttpf_names + weekendhttpf_names
759 | elif mode == 'day':
760 | features_tmp = all_f + workhourf + afterhourf +\
761 | all_logonf + workhourlogonf + afterhourlogonf +\
762 | all_devicef + workhourdevicef + afterhourdevicef + \
763 | all_filef + workhourfilef + afterhourfilef + \
764 | all_emailf + workhouremailf + afterhouremailf + all_httpf + workhourhttpf + afterhourhttpf
765 | fnames_tmp = all_f_names + workhourf_names + afterhourf_names +\
766 | all_logonf_names + workhourlogonf_names + afterhourlogonf_names +\
767 | all_devicef_names + workhourdevicef_names + afterhourdevicef_names +\
768 | all_filef_names + workhourfilef_names + afterhourfilef_names + \
769 | all_emailf_names + workhouremailf_names + afterhouremailf_names + all_httpf_names + workhourhttpf_names + afterhourhttpf_names
770 | elif mode == 'session':
771 | features_tmp = all_f + all_logonf + all_devicef + all_filef + all_emailf + all_httpf
772 | fnames_tmp = all_f_names + all_logonf_names + all_devicef_names + all_filef_names + all_emailf_names + all_httpf_names
773 |
774 | return [numActs, is_weekend, features_tmp, fnames_tmp, mal_u]
775 |
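# session_instance_calc: build one output row for a session (or sub-session) ud of user v:
# timing and activity-share columns, session metadata from sinfo, the user's categorical and
# OCEAN features, the f_calc features, and finally the insider label.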
776 | def session_instance_calc(ud, sinfo, week, mode, data, uw, v, list_uf):
777 | d = ud.iloc[0]['day']
778 | perworkhour = sum(ud['time']==1)/len(ud)
779 | perafterhour = sum(ud['time']==2)/len(ud)
780 | perweekend = sum(ud['time']==3)/len(ud)
781 | perweekendafterhour = sum(ud['time']==4)/len(ud)
782 | st_timestamp = min(ud['time_stamp'])
783 | end_timestamp = max(ud['time_stamp'])
784 |     s_dur = (end_timestamp - st_timestamp).total_seconds() / 60 # in minutes
785 | s_start = st_timestamp.hour + st_timestamp.minute/60
786 | s_end = end_timestamp.hour + end_timestamp.minute/60
787 | starttime = st_timestamp.timestamp()
788 | endtime = end_timestamp.timestamp()
789 | n_days = len(set(ud['day']))
790 |
791 | tmp = f_calc(ud, mode, data)
792 | session_instance = [starttime, endtime, v, sinfo[0], d, week, ud.iloc[0]['pc'], perworkhour, perafterhour, perweekend,
793 | perweekendafterhour, n_days, s_dur, sinfo[6], sinfo[2], sinfo[3], s_start, s_end] + \
794 | (uw.loc[v, list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N'] ]).tolist() + tmp[2] + [tmp[4]]
795 | return (session_instance, tmp[3])
796 |
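# to_csv: for one week, load the numerical data and compute per-user week/day/session instances
# (plus optional sub-sessions by time or action count for session mode), then pickle them to tmp/
# for later concatenation into the final CSV files.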
797 | def to_csv(week, mode, data, ul, uf_dict, list_uf, subsession_mode = {}):
798 | user_dict = {i : idx for (i, idx) in enumerate(ul.index)}
799 | if mode == 'session':
800 |         first_sid = week*100000 # a unique index for each session; the first 1-2 digits encode the week number
801 | cols2a = ['starttime', 'endtime','user', 'sessionid', 'day', 'week', 'pc', 'isworkhour', 'isafterhour','isweekend',
802 | 'isweekendafterhour', 'n_days', 'duration', 'n_concurrent_sessions', 'start_with', 'end_with', 'ses_start',
803 | 'ses_end'] + list_uf + ['ITAdmin','O','C','E','A','N']
804 | elif mode == 'day':
805 | cols2a = ['starttime', 'endtime','user', 'day', 'week', 'isweekday','isweekend'] + list_uf +\
806 | ['ITAdmin','O','C','E','A','N']
807 | else: cols2a = ['starttime', 'endtime','user','week'] + list_uf + ['ITAdmin','O','C','E','A','N']
808 | cols2b = ['insider']
809 |
810 | w = pd.read_pickle("NumDataByWeek/"+str(week)+"_num.pickle")
811 |
812 | usnlist = list(set(w['user'].astype('int').values))
813 | if True:
814 | cols = ['week']+ list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N', 'insider']
815 | uw = pd.DataFrame(columns = cols, index = user_dict.keys())
816 | uwdict = {}
817 | for v in user_dict:
818 | if v in usnlist:
819 | is_ITAdmin = 1 if ul.loc[user_dict[v], 'role'] == 'ITAdmin' else 0
820 | row = [week] + proc_u_features(ul.loc[user_dict[v]], uf_dict, list_uf, data = data) + [is_ITAdmin] + \
821 | (ul.loc[user_dict[v],['O','C','E','A','N']]).tolist() + [0]
822 | row[-1] = int(list(set(w[w['user']==v]['insider']))[0])
823 | uwdict[v] = row
824 | uw = pd.DataFrame.from_dict(uwdict, orient = 'index',columns = cols)
825 |
826 | towrite = pd.DataFrame()
827 | towrite_list = []
828 |
829 | if mode == 'session' and len(subsession_mode) > 0:
830 | towrite_list_subsession = {}
831 | for k1 in subsession_mode:
832 | towrite_list_subsession[k1] = {}
833 | for k2 in subsession_mode[k1]:
834 | towrite_list_subsession[k1][k2] = []
835 |
836 | days = list(set(w['day']))
837 | for v in user_dict:
838 | if v in usnlist:
839 | uactw = w[w['user']==v]
840 |
841 | if mode == 'week':
842 | a = uactw.iloc[0]['time_stamp']
843 |                 a = a - timedelta(int(a.strftime("%w"))) # go back to the Sunday at the start of that week
844 | starttime = datetime(a.year, a.month, a.day).timestamp()
845 | endtime = (datetime(a.year, a.month, a.day) + timedelta(days=7)).timestamp()
846 |
847 | if len(uactw) > 0:
848 | tmp = f_calc(uactw, mode, data)
849 | i_fnames = tmp[3]
850 | towrite_list.append([starttime, endtime, v, week] + (uw.loc[v, list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N'] ]).tolist() + tmp[2] + [ tmp[4]])
851 |
852 | if mode == 'session':
853 | sessions = get_sessions(uactw, first_sid)
854 | first_sid += len(sessions)
855 | for s in sessions:
856 | sinfo = sessions[s]
857 |
858 | ud = uactw.loc[sessions[s][7]]
859 | if len(ud) > 0:
860 | session_instance, i_fnames = session_instance_calc(ud, sinfo, week, mode, data, uw, v, list_uf)
861 | towrite_list.append(session_instance)
862 |
863 | ## do subsessions:
864 | if 'time' in subsession_mode: # divide a session into subsessions by consecutive time chunks
865 | for subsession_dur in subsession_mode['time']:
866 | n_subsession = int(np.ceil(session_instance[12] / subsession_dur))
867 | if n_subsession == 1:
868 | towrite_list_subsession['time'][subsession_dur].append([0] + session_instance)
869 | else:
870 | sinfo1 = sinfo.copy()
871 | for subsession_ind in range(n_subsession):
872 | sinfo1[3] = 0 if subsession_ind < n_subsession-1 else sinfo[3]
873 |
874 | subsession_ud = ud[(ud['time_stamp'] >= sessions[s][4] + timedelta(minutes = subsession_ind*subsession_dur)) & \
875 | (ud['time_stamp'] < sessions[s][4] + timedelta(minutes = (subsession_ind+1)*subsession_dur))]
876 | if len(subsession_ud) > 0:
877 | ss_instance, _ = session_instance_calc(subsession_ud, sinfo1, week, mode, data, uw, v, list_uf)
878 | towrite_list_subsession['time'][subsession_dur].append([subsession_ind] + ss_instance)
879 |
880 | if 'nact' in subsession_mode:
881 | for ss_nact in subsession_mode['nact']:
882 | n_subsession = int(np.ceil(len(ud) / ss_nact))
883 | if n_subsession == 1:
884 | towrite_list_subsession['nact'][ss_nact].append([0] + session_instance)
885 | else:
886 | sinfo1 = sinfo.copy()
887 | for ss_ind in range(n_subsession):
888 | sinfo1[3] = 0 if ss_ind < n_subsession-1 else sinfo[3]
889 |
890 | ss_ud = ud.iloc[ss_ind*ss_nact : min(len(ud), (ss_ind+1)*ss_nact)]
891 | if len(ss_ud) > 0:
892 | ss_instance,_ = session_instance_calc(ss_ud, sinfo1, week, mode, data, uw, v, list_uf)
893 | towrite_list_subsession['nact'][ss_nact].append([ss_ind] + ss_instance)
894 |
895 | if mode == 'day':
896 | days = sorted(list(set(uactw['day'])))
897 | for d in days:
898 | ud = uactw[uactw['day'] == d]
899 | isweekday = 1 if sum(ud['time']>=3) == 0 else 0
900 | isweekend = 1-isweekday
901 | a = ud.iloc[0]['time_stamp']
902 | starttime = datetime(a.year, a.month, a.day).timestamp()
903 | endtime = (datetime(a.year, a.month, a.day) + timedelta(days=1)).timestamp()
904 |
905 | if len(ud) > 0:
906 | tmp = f_calc(ud, mode, data)
907 | i_fnames = tmp[3]
908 | towrite_list.append([starttime, endtime, v, d, week, isweekday, isweekend] + (uw.loc[v, list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N'] ]).tolist() + tmp[2] + [ tmp[4]])
909 |
910 | towrite = pd.DataFrame(columns = cols2a + i_fnames + cols2b, data = towrite_list)
911 | towrite.to_pickle("tmp/"+str(week) + mode+".pickle")
912 |
913 | if mode == 'session' and len(subsession_mode) > 0:
914 | for k1 in subsession_mode:
915 | for k2 in subsession_mode[k1]:
916 | df_tmp = pd.DataFrame(columns = ['subs_ind']+cols2a + i_fnames + cols2b, data = towrite_list_subsession[k1][k2])
917 | df_tmp.to_pickle("tmp/"+str(week) + mode + k1 + str(k2) + ".pickle")
918 |
919 | if __name__ == "__main__":
920 | dname = os.getcwd().split('/')[-1]
921 | if dname not in ['r4.1','r4.2','r6.2','r6.1','r5.1','r5.2']:
922 | raise Exception('Please put this script in and run it from a CERT data folder (e.g. r4.2)')
923 |     # make working folders (tmp, DataByWeek, NumDataByWeek are removed at the end; ExtractedData keeps the output)
924 | [os.mkdir(x) for x in ["tmp", "ExtractedData", "DataByWeek", "NumDataByWeek"]]
925 |
926 |     subsession_mode = {'nact':[25, 50], 'time':[120, 240]} # sub-session settings (by action count and by minutes); use an empty dict to skip sub-session extraction
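    # Note: with these settings, session extraction additionally writes (for example, on r4.2)
    # ExtractedData/sessionnact25r4.2.csv, sessionnact50r4.2.csv, sessiontime120r4.2.csv and
    # sessiontime240r4.2.csv alongside the week/day/session CSVs (see Step 4 below).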
927 |
928 | numCores = 8
929 | arguments = len(sys.argv) - 1
930 | if arguments > 0:
931 | numCores = int(sys.argv[1])
932 |
933 |     numWeek = 73 if dname in ['r4.1','r4.2'] else 75 # only 73 weeks in the r4.1 and r4.2 datasets
934 | st = time.time()
935 |
936 | #### Step 1: Combine data from sources by week, stored in DataByWeek
937 | combine_by_timerange_pandas(dname)
938 | print(f"Step 1 - Separate data by week - done. Time (mins): {(time.time()-st)/60:.2f}")
939 | st = time.time()
940 |
941 | #### Step 2: Get user list
942 | users = get_mal_userdata(dname)
943 | print(f"Step 2 - Get user list - done. Time (mins): {(time.time()-st)/60:.2f}")
944 | st = time.time()
945 |
946 | #### Step 3: Convert each action to numerical data, stored in NumDataByWeek
947 | Parallel(n_jobs=numCores)(delayed(process_week_num)(i, users, data=dname) for i in range(numWeek))
948 | print(f"Step 3 - Convert each action to numerical data - done. Time (mins): {(time.time()-st)/60:.2f}")
949 | st = time.time()
950 |
951 | #### Step 4: Extract to csv
952 | for mode in ['week','day','session']:
953 |
954 | weekRange = list(range(0, numWeek)) if mode in ['day', 'session'] else list(range(1, numWeek))
955 | (ul, uf_dict, list_uf) = get_u_features_dicts(users, data= dname)
956 |
957 | Parallel(n_jobs=numCores)(delayed(to_csv)(i, mode, dname, ul, uf_dict, list_uf, subsession_mode)
958 | for i in weekRange)
959 |
960 | all_csv = open('ExtractedData/'+mode+dname+'.csv','a')
961 |
962 | towrite = pd.read_pickle("tmp/"+str(weekRange[0]) + mode+".pickle")
963 | towrite.to_csv(all_csv,header=True, index = False)
964 | for w in weekRange[1:]:
965 | towrite = pd.read_pickle("tmp/"+str(w) + mode+".pickle")
966 | towrite.to_csv(all_csv,header=False, index = False)
967 |
968 | if mode == 'session' and len(subsession_mode) > 0:
969 | for k1 in subsession_mode:
970 | for k2 in subsession_mode[k1]:
971 | all_csv = open('ExtractedData/'+mode+ k1 + str(k2) + dname+'.csv','a')
972 | towrite = pd.read_pickle('tmp/'+str(weekRange[0]) + mode + k1 + str(k2)+".pickle")
973 | towrite.to_csv(all_csv,header=True, index = False)
974 | for w in weekRange[1:]:
975 | towrite = pd.read_pickle('tmp/'+str(w) + mode+ k1 + str(k2)+".pickle")
976 | towrite.to_csv(all_csv,header=False, index = False)
977 |
978 | print(f'Extracted {mode} data. Time (mins): {(time.time()-st)/60:.2f}')
979 | st = time.time()
980 |
981 | [os.system(f"rm -r {x}") for x in ["tmp", "DataByWeek", "NumDataByWeek"]]
982 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | #python3.6
2 | numpy==1.19.5
3 | pandas==1.1.5
4 | joblib==1.1.1
5 | scikit-learn==0.24.2
--------------------------------------------------------------------------------
/temporal_data_representation.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | import numpy as np
4 | import pandas as pd
5 | import multiprocessing
6 | from scipy.stats import percentileofscore
7 | import argparse
8 |
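# concat_combination: for each user, concatenate the feature values of the previous
# window_size-1 instances (columns prefixed '-2_', '-1_', ...) with the current instance
# and its info columns, dropping each user's first few rows that lack a full history.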
9 | def concat_combination(data, window_size = 3, dname = 'cert'):
10 |
11 | if dname == 'cert':
12 | info_cols = ['sessionid','day','week',"starttime", "endtime",
13 | 'user', 'project', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'ITAdmin',
14 | 'O', 'C', 'E', 'A', 'N', 'insider']
15 |
16 | combining_features = [ f for f in data.columns if f not in info_cols]
17 | info_features = [f for f in data.columns if f in info_cols]
18 |
19 | data_info = data[info_features].values
20 |
21 | data_combining_features = data[combining_features].values
22 | useridx = data['user'].values
23 |
24 | userset = set(data['user'])
25 |
26 | cols = []
27 | for shiftrange in range(window_size-1,0,-1):
28 | cols += [str(-shiftrange) + '_' + f for f in combining_features]
29 | cols += combining_features + info_features
30 |
31 | combined_data = []
32 | for u in userset:
33 | data_cf_u = data_combining_features[useridx == u, ]
34 |
35 | data_cf_u_shifted = []
36 | for shiftrange in range(window_size-1,0,-1):
37 | data_cf_u_shifted.append(np.roll(data_cf_u, shiftrange, axis = 0))
38 |
39 | data_cf_u_shifted.append(data_cf_u)
40 | data_cf_u_shifted.append(data_info[useridx==u, ])
41 |
42 | combined_data.append(np.hstack(data_cf_u_shifted)[window_size:,])
43 |
44 | combined_data = pd.DataFrame(np.vstack(combined_data), columns=cols)
45 |
46 | return combined_data
47 |
48 |
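# subtract_combination_uworker: worker for a single user u. For each day/week instance it takes
# the preceding window_size days as reference and emits either the difference from that window's
# mean or median (meandiff/meddiff), or the per-feature percentile within the window shifted to be
# centered on 0 (percentile), storing the resulting rows in the shared dict alluserdict.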
49 | def subtract_combination_uworker(u, alluserdict, dtype, calc_type, window_size, udayidx, udata, uinfo, uorg):
50 | if u%200==0:
51 | print(u)
52 |
53 | data_out = []
54 |
55 | if dtype in ['day', 'week']:
56 |
57 | for i in range(len(udayidx)):
58 | t = udayidx[i]
59 |             min_idx = min(udayidx) + window_size
60 |
61 | if t>=min_idx:
62 | if calc_type == 'meandiff':
63 | prevdata = udata[(udayidx > t - 1 - window_size) & (udayidx <= t-1),]
64 | if len(prevdata) < 1: continue
65 | window_mean = np.mean(prevdata, axis = 0)
66 | data_out.append(np.concatenate((udata[i] - window_mean, uorg[i,:], uinfo[i,:])))
67 |
68 | if calc_type == 'meddiff':
69 | prevdata = udata[(udayidx > t - 1 - window_size) & (udayidx <= t-1),]
70 | if len(prevdata) < 1: continue
71 | window_med = np.median(prevdata, axis = 0)
72 | data_out.append(np.concatenate((udata[i] - window_med, uorg[i,:], uinfo[i,:])))
73 | elif calc_type == 'percentile':
74 | window = udata[(udayidx > t - 1 - window_size) & (udayidx <= t-1),]
75 | if window.shape[0] < 1: continue
76 | percentile_i = [percentileofscore(window[:,j], udata[i,j], 'mean') - 50 for j in range(window.shape[1])]
77 | data_out.append(np.concatenate((percentile_i , uorg[i,:], uinfo[i,:])))
78 |
79 | if len(data_out) > 0: alluserdict[u] = np.vstack(data_out)
80 |
81 | def subtract_percentile_combination(data, dtype, calc_type = 'percentile', window_size = 7, dname = 'cert', parallel = True):
82 |     '''
83 |     Combine data to generate different temporal representations ('percentile', 'meandiff' or 'meddiff')
84 |     window_size: window size in days (for CERT data)
85 |     '''
86 | if dname == 'cert':
87 | info_cols = ['sessionid','day','week',"starttime", "endtime",
88 | 'user', 'project', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'ITAdmin',
89 | 'O', 'C', 'E', 'A', 'N', 'insider','subs_ind']
90 | keep_org_cols = ["pc", "isworkhour", "isafterhour", "isweekday", "isweekend", "isweekendafterhour", "n_days",
91 | "duration", "n_concurrent_sessions", "start_with", "end_with", "ses_start", "ses_end"]
92 |
93 | combining_features = [ f for f in data.columns if f not in info_cols]
94 | info_features = [f for f in data.columns if f in info_cols]
95 | keep_org_features = [f for f in data.columns if f in keep_org_cols]
96 |
97 | data_info = data[info_features].values
98 | data_org = data[keep_org_features].values
99 | data_combining_features = data[combining_features].values
100 | useridx = data['user'].values
101 | if dtype in ['day']: dayidx = data['day'].values
102 | if dname == 'cert': weekidx = data['week'].values
103 |
104 | userset = set(data['user'])
105 |
106 | if dtype == 'week':
107 | window_size = np.floor(window_size/7)
108 | idx = weekidx
109 | elif dtype in ['day']: idx = dayidx
110 |
111 | if parallel:
112 | manager = multiprocessing.Manager()
113 | return_dict = manager.dict()
114 | jobs = []
115 | for u in userset:
116 | udayidx = idx[useridx==u]
117 | udata = data_combining_features[useridx==u, ]
118 | uinfo = data_info[useridx==u, ]
119 | uorg = data_org[useridx==u, ]
120 | p = multiprocessing.Process(target=subtract_combination_uworker, args=(u, return_dict, dtype, calc_type,
121 | window_size, udayidx,
122 | udata, uinfo, uorg))
123 | jobs.append(p)
124 | p.start()
125 |
126 | for proc in jobs:
127 | proc.join()
128 | else:
129 | return_dict = {}
130 | for u in userset:
131 | udayidx = idx[useridx==u]
132 | udata = data_combining_features[useridx==u, ]
133 | uinfo = data_info[useridx==u, ]
134 | uorg = data_org[useridx==u, ]
135 | subtract_combination_uworker(u, return_dict, dtype, calc_type,
136 | window_size, udayidx,
137 | udata, uinfo, uorg)
138 |
139 | combined_data = pd.DataFrame(np.vstack([return_dict[ri] for ri in return_dict.keys()]), columns=combining_features+['org_'+f for f in keep_org_features] + info_features)
140 |
141 | return combined_data
142 |
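# Minimal usage sketch (hypothetical file name; assumes a day-level feature CSV produced by
# feature_extraction.py or downloaded pre-extracted):
#
#   df = pd.read_csv('dayr5.2.csv.gz')
#   rep = subtract_percentile_combination(df, 'day', calc_type='percentile',
#                                         window_size=30, dname='cert', parallel=False)
#   rep.to_pickle('dayr5.2-percentile30.pkl')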
143 |
144 | if __name__ == "__main__":
145 | parser=argparse.ArgumentParser()
146 |     parser.add_argument('--representation', help='Data representation to extract (concat, percentile, meandiff, meddiff, or all). Default: percentile',
147 | type= str, default = 'percentile')
148 | parser.add_argument('--file_input', help='CERT input file name. Default: week-r5.2.csv.gz', type= str, default= 'week-r5.2.csv.gz')
149 | parser.add_argument('--window_size', help='Window size for percentile or mean/median difference representation. Default: 30',
150 | type = int, default=30)
151 | parser.add_argument('--num_concat', help='Number of data points for concatenation. Default: 3',
152 | type = int, default=3)
153 | args=parser.parse_args()
154 |
155 |     print('If you get a "too many open files" or "ForkAwareLocal" error, increase the open-file limit first with ulimit, e.g. "$ ulimit -n 10000"')
156 | if args.representation == 'all':
157 | reps = ['concat', 'percentile','meandiff','meddiff']
158 | elif args.representation in ['concat', 'percentile','meandiff','meddiff']:
159 | reps = [args.representation]
160 |
161 | fileName = (args.file_input).replace('.csv','').replace('.gz','')
162 | if 'day' in fileName:
163 | data_type = 'day'
164 | elif 'week' in fileName:
165 | data_type = 'week'
166 | s = pd.read_csv(f'{args.file_input}')
167 |
168 | for rep in reps:
169 | if rep in ['percentile','meandiff','meddiff']:
170 | s1 = subtract_percentile_combination(s, data_type, calc_type = rep, window_size = args.window_size, dname='cert')
171 | s1.to_pickle(f'{fileName}-{rep}{args.window_size}.pkl')
172 | else:
173 | s1 = concat_combination(s, window_size = args.num_concat, dname = 'cert')
174 | s1.to_pickle(f'{fileName}-{rep}{args.num_concat}.pkl')
175 |
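# Example invocation (assumes the gzipped weekly CSV, e.g. the pre-extracted week-r5.2.csv.gz,
# is in the working directory):
#   python3 temporal_data_representation.py --representation percentile --file_input week-r5.2.csv.gz --window_size 30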
176 |
--------------------------------------------------------------------------------