├── LICENSE
├── README.md
├── TNSM2020
│   ├── clf_helpers.py
│   ├── params.pkl
│   └── run_classification.py
├── example_anomaly_detection.py
├── example_classification.py
├── feature_extraction.py
├── requirements.txt
└── temporal_data_representation.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 lcd-dal
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Feature extraction for CERT insider threat test dataset
2 | This is a script for extracting features (in CSV format) from the [CERT insider threat test dataset](https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099) [[1]](#1), [[2]](#2), versions 4.1 to 6.2. For more details, please see this paper: [Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning](https://ieeexplore.ieee.org/document/8962316).
3 |
4 | [1]
5 | Lindauer, Brian (2020): Insider Threat Test Dataset. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/12841247.v1
6 |
7 | [2]
8 | J. Glasser and B. Lindauer, "Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data," 2013 IEEE Security and Privacy Workshops, San Francisco, CA, 2013, pp. 98-104, doi: 10.1109/SPW.2013.37.
9 |
10 | ## Run feature_extraction script
11 | - Requires Python 3.8 and the packages listed in `requirements.txt`. The script has been written and tested on Linux only.
12 | - By default, the script extracts week, day, session, and sub-session data (as in the paper).
13 | - To run the script, place it in the folder of a CERT dataset (e.g. r4.2, decompressed from r4.2.tar.bz2 downloaded [here](https://kilthub.cmu.edu/articles/dataset/Insider_Threat_Test_Dataset/12841247/1)), then run `python3 feature_extraction.py`.
14 | - To change the number of cores used for parallelization (default 8), pass it as an argument, e.g. `python3 feature_extraction.py 16` (see the example below).
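
For example, assuming the r4.2 dataset has been decompressed into a folder named `r4.2` in the current directory (paths are illustrative):

```bash
cp feature_extraction.py r4.2/
cd r4.2
python3 feature_extraction.py 16   # run with 16 cores; output is written to ExtractedData/
```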
15 |
16 | ## Extracted Data
17 | Extracted data is stored in the `ExtractedData` subfolder.
18 |
19 | Note that in the extracted data, `insider` is the label indicating the insider threat scenario (0 is normal). Some extracted columns (`subs_ind`, `starttime`, `endtime`, `sessionid`, `user`, `day`, `week`) are informational and may or may not be used as input features when training machine learning models.
20 |
21 | Pre-extracted data from the CERT insider threat test dataset r5.2 (gzipped) can be found [here](https://web.cs.dal.ca/~lcd/data/CERTr5.2/). The snippet below shows one way to load it.
22 |
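A minimal sketch (mirroring `example_classification.py` further down in this repository) for loading the pre-extracted r5.2 day data and separating the informational columns from the features; the file name is assumed from the download above:

```python
import pandas as pd

data = pd.read_csv('day-r5.2.csv.gz')

# informational columns; `insider` is the label (0 = normal)
info_cols = ['user', 'day', 'week', 'starttime', 'endtime', 'sessionid', 'insider']
x_cols = [c for c in data.columns if c not in info_cols]

X = data[x_cols].values
y_bin = (data['insider'] > 0).astype(int)  # binary label: any insider scenario vs. normal
```
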
23 | ## Data representations
24 | From the extracted data, `temporal_data_representation.py` can be used to generate different data representations, as presented in this paper: [Anomaly Detection for Insider Threats Using Unsupervised Ensembles](https://ieeexplore.ieee.org/document/9399116).
25 |
26 | Run `python3 temporal_data_representation.py --help` to see the available options.
27 |
28 | ## Sample classification and anomaly detection results
29 | Sample code is provided in the following scripts; an example of running them follows the list:
30 |
31 | - `example_classification.py` for classification (as in [Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning](https://ieeexplore.ieee.org/document/8962316)).
32 | - `example_anomaly_detection.py` for anomaly detection (as in [Anomaly Detection for Insider Threats Using Unsupervised Ensembles](https://ieeexplore.ieee.org/document/9399116)).
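
Assuming the input files the scripts expect are in the working directory (`day-r5.2.csv.gz` from the pre-extracted data above, and `week-r5.2-percentile30.pkl` generated with `temporal_data_representation.py`), they can be run directly:

```bash
python3 example_classification.py      # reads day-r5.2.csv.gz
python3 example_anomaly_detection.py   # reads week-r5.2-percentile30.pkl
```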
33 |
34 | ## Citation
35 | If you use the source code or the extracted datasets, please cite the following paper:
36 |
37 | `D. C. Le, N. Zincir-Heywood and M. I. Heywood, "Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning," in IEEE Transactions on Network and Service Management, vol. 17, no. 1, pp. 30-44, March 2020, doi: 10.1109/TNSM.2020.2967721.`
38 |
39 | Data representations and anomaly detection:
40 |
41 | `D. C. Le and N. Zincir-Heywood, "Anomaly Detection for Insider Threats Using Unsupervised Ensembles," in IEEE Transactions on Network and Service Management, vol. 18, no. 2, pp. 1152–1164, June 2021, doi: 10.1109/TNSM.2021.3071928.`
42 |
--------------------------------------------------------------------------------
/TNSM2020/clf_helpers.py:
--------------------------------------------------------------------------------
1 | from sklearn.model_selection import train_test_split
2 | from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, auc
3 | from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
4 | import numpy as np
5 | import time
6 | import gc
7 | import pandas as pd
8 | import random
9 | from joblib import Parallel, delayed
10 |
11 | num_cores = 16
12 |
13 |
14 | def split_data(data, test_size=0.25, random_state=0, y_column='insider',
15 | shuffle=True,
16 | x_rm_cols=('user', 'day', 'week', 'starttime', 'endtime', 'sessionid',
17 | 'timeind', 'Unnamed: 0', 'insider'),
18 | dname='r4.2', normalization='StandardScaler',
19 | rm_empty_cols=True, by_user=False, by_user_time=False,
20 | by_user_time_trainper=0.5, limit_ntrain_user=0):
21 | """
22 |     Split data into train and test sets: randomly by instance, by user, or by user and time, with normalization built in. Returns a dict with the train/test arrays, column lists, and the fitted scaler.
23 | """
24 | np.random.seed(random_state)
25 | random.seed(random_state)
26 |
27 | x_cols = [i for i in data.columns if i not in x_rm_cols]
28 | if rm_empty_cols:
29 | x_cols = [i for i in x_cols if len(set(data[i])) > 1]
30 |
31 | infocols = list(set(data.columns) - set(x_cols))
32 |
33 | # output a dict
34 | out = {}
35 |
36 | # normalization
37 | if normalization == 'StandardScaler':
38 | sc = StandardScaler()
39 | elif normalization == 'MinMaxScaler':
40 | sc = MinMaxScaler()
41 | elif normalization == 'MaxAbsScaler':
42 | sc = MaxAbsScaler()
43 | else:
44 | sc = None
45 | out['sc'] = sc
46 |
47 | # split data randomly by instance
48 | if not by_user and not by_user_time:
49 | x = data[x_cols].values
50 | y_org = data[y_column].values
51 |
52 | y = y_org.copy()
53 | if 'r6' in dname:
54 | y[y != 0] = 1
55 |
56 | x_train, x_test, y_train, y_test = train_test_split(x, y_org, test_size=test_size, shuffle=shuffle)
57 |
58 | if 'sc' in out and out['sc'] is not None:
59 | x_train = sc.fit_transform(x_train)
60 | out['sc'] = sc
61 | if test_size > 0: x_test = sc.transform(x_test)
62 |
63 | # split data by user
64 | elif by_user:
65 | test_users, train_users = [], []
66 | for i in [j for j in list(set(data['insider'])) if j != 0]:
67 | uli = list(set(data[data['insider'] == i]['user']))
68 | random.shuffle(uli)
69 | ind_i = int(np.ceil(test_size * len(uli)))
70 | test_users += uli[:ind_i]
71 | train_users += uli[ind_i:]
72 |
73 | normal_users = list(set(data['user']) - set(data[data['insider'] != 0]['user']))
74 | random.shuffle(normal_users)
75 | if limit_ntrain_user > 0:
76 | normal_ind = limit_ntrain_user - len(train_users)
77 | else:
78 | normal_ind = int(np.ceil((1 - test_size) * len(normal_users)))
79 |
80 | train_users += normal_users[: normal_ind]
81 | test_users += normal_users[normal_ind:]
82 | x_train = data[data['user'].isin(train_users)][x_cols].values
83 | x_test = data[data['user'].isin(test_users)][x_cols].values
84 | y_train = data[data['user'].isin(train_users)][y_column].values
85 | y_test = data[data['user'].isin(test_users)][y_column].values
86 |
87 | out['train_info'] = data[data['user'].isin(train_users)][infocols]
88 | out['test_info'] = data[data['user'].isin(test_users)][infocols]
89 |
90 | out['train_users'] = train_users
91 | if test_size > 0 or limit_ntrain_user > 0:
92 | out['test_users'] = test_users
93 |
94 | if 'sc' in out and out['sc'] is not None:
95 | x_train = sc.fit_transform(x_train)
96 | out['sc'] = sc
97 | if test_size > 0 or (limit_ntrain_user > 0 and limit_ntrain_user < len(set(data['user']))):
98 | x_test = sc.transform(x_test)
99 |
100 | # split by user and time
101 | elif by_user_time:
102 | train_week_max = by_user_time_trainper * max(data['week'])
103 | train_insiders = set(data[(data['week'] <= train_week_max) & (data['insider'] != 0)]['user'])
104 | users_set_later_weeks = set(data[data['week'] > train_week_max]['user'])
105 |
106 | first_part = data[data['week'] <= train_week_max]
107 | second_part = data[data['week'] > train_week_max]
108 |
109 | first_part_split = split_data(first_part, random_state=random_state, test_size=0,
110 | dname=dname, normalization=normalization,
111 | by_user=True, by_user_time=False,
112 | limit_ntrain_user=limit_ntrain_user,
113 | )
114 |
115 | x_train = first_part_split['x_train']
116 | y_train = first_part_split['y_train']
117 | x_cols = first_part_split['x_cols']
118 |
119 | out['train_info'] = first_part_split['train_info']
120 | out['other_trainweeks_users_info'] = first_part_split['test_info']
121 |
122 | if 'sc' in first_part_split and first_part_split['sc'] is not None:
123 | out['sc'] = first_part_split['sc']
124 |
125 | out['x_other_trainweeks_users'] = first_part_split['x_test']
126 | out['y_other_trainweeks_users'] = first_part_split['y_test']
127 | out['y_bin_other_trainweeks_users'] = first_part_split['y_test_bin']
128 | out['other_trainweeks_users'] = first_part_split['test_users'] # users in first half but not in train
129 |
130 | real_train_users = set(first_part_split['train_users'])
131 | real_train_insiders = train_insiders.intersection(real_train_users)
132 | test_users = list(users_set_later_weeks - real_train_insiders)
133 | x_test = second_part[second_part['user'].isin(test_users)][x_cols].values
134 | y_test = second_part[second_part['user'].isin(test_users)][y_column].values
135 | out['test_info'] = second_part[second_part['user'].isin(test_users)][infocols]
136 | if ('sc' in out) and (out['sc'] is not None) and (by_user_time_trainper < 1):
137 | x_test = out['sc'].transform(x_test)
138 |
139 | out['train_users'] = first_part_split['train_users']
140 | out['test_users'] = test_users
141 |
142 | # get binary data
143 | y_train_bin = y_train.copy()
144 | y_train_bin[y_train_bin != 0] = 1
145 |
146 | out['x_train'] = x_train
147 | out['y_train'] = y_train
148 | out['y_train_bin'] = y_train_bin
149 | out['x_cols'] = x_cols
150 | out['info_cols'] = infocols
151 |
152 | out['test_size'] = test_size
153 |
154 | if test_size > 0 or (by_user_time and by_user_time_trainper < 1) or limit_ntrain_user > 0:
155 | y_test_bin = y_test.copy()
156 | y_test_bin[y_test_bin != 0] = 1
157 | out['x_test'] = x_test
158 | out['y_test'] = y_test
159 | out['y_test_bin'] = y_test_bin
160 |
161 | return out
162 |
163 |
164 | def get_result_one_user(u, pred_all, datainfo):
165 | res_u = {}
166 | u_labels = datainfo[datainfo['user'] == u]['insider'].values
167 | utype = list(set(u_labels))
168 | if np.any(u_labels != 0) and len(set(u_labels)) > 1:
169 | utype.remove(0.0)
170 | res_u['type'] = utype[0]
171 | u_idx = np.where(datainfo['user'] == u)[0]
172 | res_u['data_idxs'] = u_idx
173 | pred = pred_all[u_idx]
174 | if len(np.where(u_labels == 0)[0]) > 0:
175 | res_u['norm_per'] = len(np.where(pred[u_labels == 0] == 0)[0]) / len(np.where(u_labels == 0)[0])
176 | if utype[0] != 0:
177 | res_u['mal_per'] = len(np.where(pred[u_labels != 0] != 0)[0]) / len(np.where(u_labels != 0)[0])
178 | res_u['norm_bin'] = int(np.any(pred[u_labels == 0] != 0))
179 | if utype[0] != 0:
180 | res_u['mal_bin'] = int(np.any(pred[u_labels != 0] != 0))
181 | return res_u
182 |
183 |
184 | def get_result_by_users(users, user_list=None, pred_all=None, datainfo=None):
185 | out = {}
186 | cms = {}
187 | out[users] = {}
188 | # out_users = [get_result_one_user(u, pred_all, datainfo, old_res, label_all) for u in user_list]
189 | out_users = Parallel(n_jobs=num_cores)(delayed(get_result_one_user)(u, pred_all, datainfo)
190 | for u in user_list)
191 | users_true_label = []
192 | users_pred_label = []
193 | for i, u in enumerate(user_list):
194 | out[users][u] = out_users[i]
195 | users_true_label.append(out[users][u]['type'])
196 | if out[users][u]['type'] == 0:
197 | users_pred_label.append(out[users][u]['norm_bin'])
198 | else:
199 | users_pred_label.append(out[users][u]['mal_bin'])
200 |
201 | out[users]['true_label'] = users_true_label
202 | out[users]['pred_label'] = users_pred_label
203 | cms[users] = confusion_matrix(users_true_label, users_pred_label)
204 | return out, cms
205 |
206 |
207 | def do_classification(clf, x_train, y_train, x_test, y_test, y_org=None, by_user=False,
208 | split_output=None):
209 | '''
210 |     Train the classifier and collect results: confusion matrices, optional per-user results, predictions, prediction probabilities, and timing.
211 | '''
212 | st = time.time()
213 | clf.fit(x_train, y_train)
214 | train_time = time.time() - st
215 |
216 | cms_train = {}
217 | cms_test = {}
218 |
219 | st = time.time()
220 | y_train_hat = clf.predict(x_train)
221 | y_train_proba = clf.predict_proba(x_train)
222 | pred_time = time.time() - st
223 | cms_train['bin'] = confusion_matrix(y_train, y_train_hat)
224 |
225 | st = time.time()
226 | y_test_hat = clf.predict(x_test)
227 | y_test_proba = clf.predict_proba(x_test)
228 | test_pred_time = time.time() - st
229 | cms_test['bin'] = confusion_matrix(y_test, y_test_hat)
230 |
231 | test_org_labels = y_test
232 | train_org_labels = y_train
233 | if y_org is not None:
234 | cms_train['org'] = confusion_matrix(y_org['train'], y_train_hat)
235 | train_org_labels = y_org['train']
236 | cms_test['org'] = confusion_matrix(y_org['test'], y_test_hat)
237 | test_org_labels = y_org['test']
238 |
239 | userres_train = {}
240 | userres_test = {}
241 | if by_user:
242 | uout, ucm = get_result_by_users('train_users', split_output['train_users'], pred_all=y_train_hat,
243 | datainfo=split_output['train_info'])
244 |
245 | userres_train.update(uout)
246 | cms_train.update(ucm)
247 | uout, ucm = get_result_by_users('test_users', split_output['test_users'], pred_all=y_test_hat,
248 | datainfo=split_output['test_info'])
249 | userres_test.update(uout)
250 | cms_test.update(ucm)
251 |
252 |
253 | return clf, {'by_user': userres_train, 'cms': cms_train, 'train_time': train_time, 'pred_time': pred_time,
254 | 'org_labels': train_org_labels,
255 | 'pred_bin': y_train_hat, 'pred_proba': y_train_proba}, \
256 | {'by_user': userres_test, 'cms': cms_test, 'pred_time': test_pred_time, 'org_labels': test_org_labels,
257 | 'pred_bin': y_test_hat,
258 | 'pred_proba': y_test_proba}
259 |
260 |
261 | def user_auc_roc(ures, users):
262 | nnu = len(users) - sum(ures[:, 0])
263 | nmu = sum(ures[:, 0])
264 | ufpr = np.sum(ures[np.where(ures[:, 0] == 0)[0], 1:], axis=0) / nnu
265 | utpr = np.sum(ures[np.where(ures[:, 0] == 1)[0], 1:], axis=0) / nmu
266 | uauc = auc(ufpr, utpr)
267 | return utpr, ufpr, uauc
268 |
269 |
270 | def user_auc_roc2(u_idxs, y_bin, y_predprob, thresholds):
271 | ures = np.zeros((1, len(thresholds)))
272 | u_y = y_bin[u_idxs]
273 | u_yprob = y_predprob[u_idxs]
274 | for ii in range(len(thresholds[1:])):
275 | ures[0, ii + 1] = int(np.any(u_yprob >= thresholds[ii + 1]))
276 | if np.any(u_y != 0):
277 | ures[0, 0] = 1
278 | return ures
279 |
280 |
281 | def roc_auc_calc(rw, algs=('RF', 'XGB'), nrun=20, dtype=None, data=None, res_names=['test_in']):
282 |
283 | allres = []
284 | fpri = np.linspace(0, 1, 1000) # interpolation
285 |
286 | for i in range(nrun):
287 | for alg in algs:
288 | gc.collect()
289 | if 'all' in res_names:
290 | res_names = [td for td in rw[i][alg].keys() if td not in ['clf'] and 'thres' not in td]
291 |
292 | for resname in res_names:
293 | resname2 = 'train_users' if resname == 'train' else 'test_users'
294 | list_users = sorted([u for u in rw[i][alg][resname]['by_user'][resname2].keys() if type(u) != str])
295 | y_predprob = rw[i][alg][resname]['pred_proba'][:, 1]
296 | y_org = rw[i][alg][resname]['org_labels']
297 | y_bin = np.array(y_org > 0).astype(int)
298 | fpr, tpr, thresholds = roc_curve(y_bin, y_predprob)
299 | tpri = np.interp(fpri, fpr, tpr)
300 |
301 | if len(set(y_bin)) > 1:
302 | aucsc = roc_auc_score(y_bin, y_predprob)
303 | ures = Parallel(n_jobs=num_cores)(
304 | delayed(user_auc_roc2)(rw[i][alg][resname]['by_user'][resname2][u]['data_idxs'], y_bin,
305 | y_predprob, thresholds) for u in list_users)
306 |
307 | utpr, ufpr, uauc = user_auc_roc(np.vstack(ures), list_users)
308 | utpri = np.interp(fpri, ufpr, utpr)
309 | else:
310 | aucsc, ufpr, utpr, uauc, tpri, utpri = None, None, None, None, None, None
311 |
312 | allres.append([i, alg, resname, fpr, tpr, thresholds, aucsc, ufpr, utpr, uauc, fpri, tpri, utpri])
313 |
314 | res = pd.DataFrame(
315 | columns=['run', 'alg', 'test_on', 'fpr', 'tpr', 'threshold', 'auc', 'ufpr', 'utpr', 'uauc', 'fpri', 'tpri',
316 | 'utpri'], data=allres)
317 | if dtype is not None: res['dtype'] = dtype
318 | res['data'] = data
319 | return res
320 |
321 |
322 | def get_cert_roc(r, a, dtype, test_on='test_in', user=True):
323 | fprs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['fpri'].values
324 | if user:
325 | tprs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['utpri'].values
326 | aucs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['uauc'].values
327 | else:
328 | tprs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['tpri'].values
329 | aucs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['auc'].values
330 |
331 | mean_fpr = np.concatenate(([0], np.mean(fprs, axis=0)))
332 | mean_tpr = np.concatenate(([0], np.mean(tprs, axis=0)))
333 |
334 | std_auc = np.std(aucs)
335 | mean_auc = auc(mean_fpr, mean_tpr)
336 |
337 | std_tpr = np.concatenate(([0], np.std(tprs, axis=0)))
338 | tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
339 | tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
340 |
341 | return mean_fpr, mean_tpr, tprs_upper, tprs_lower, mean_auc, std_auc
--------------------------------------------------------------------------------
/TNSM2020/params.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lcd-dal/feature-extraction-for-CERT-insider-threat-test-datasets/3441da78ea8ff1bd13ecb27c3a8157d04d0a84fe/TNSM2020/params.pkl
--------------------------------------------------------------------------------
/TNSM2020/run_classification.py:
--------------------------------------------------------------------------------
1 | from copy import deepcopy
2 | import pickle
3 | import gc
4 | import pandas as pd
5 | import time
6 | import clf_helpers
7 | import matplotlib.pyplot as plt
8 |
9 | from sklearn.ensemble import RandomForestClassifier
10 | from sklearn.neural_network import MLPClassifier
11 | from sklearn.linear_model import LogisticRegression
12 | from xgboost import XGBClassifier
13 | import warnings, sklearn
14 |
15 | warnings.filterwarnings("ignore", category=sklearn.exceptions.DataConversionWarning)
16 | n_cores = 16
17 |
18 |
19 | def run_exp_onealg(run, slw, x, clf_name, classifiers, res_by_user, st):
20 | print(clf_name)
21 | clf_copy = deepcopy(classifiers[clf_name])
22 | if hasattr(clf_copy, 'random_state'):
23 | clf_copy.set_params(**{'random_state': run})
24 |
25 | clf_res = clf_helpers.do_classification(clf_copy, x['train'], slw['y_train_bin'],
26 | x['test'], slw['y_test_bin'],
27 | y_org={'train': slw['y_train'], 'test': slw['y_test']},
28 | by_user=res_by_user, split_output=slw)
29 | res = {'train': clf_res[1], 'test_in': clf_res[2]}
30 | print('Training time: ', res['train']['train_time'])
31 | print('Train confusion matrices: ', res['train']['cms'])
32 | print('Test confusion matrices: ', res['test_in']['cms'])
33 | gc.collect()
34 | print(run, 'Done training & res,', (time.time() - st) // 1)
35 | return res
36 |
37 |
38 | def run_exp_onerun(run, classifiers, data_in, test_size, res_by_user, shuffle,
39 | normalization, by_user_time=False, by_user_time_trainper=0.5, limit_ntrain_user=0):
40 | st = time.time()
41 | slw = clf_helpers.split_data(data_in['data'], test_size=test_size, shuffle=shuffle, random_state=run,
42 | normalization=normalization, dname=data_in['name'],
43 | by_user_time=by_user_time, by_user_time_trainper=by_user_time_trainper,
44 | limit_ntrain_user=limit_ntrain_user)
45 | print('\n', run, 'Done splitting,', (time.time() - st) // 1)
46 |
47 | res_clf = {'scaler': slw['sc']}
48 | x_cols = list(slw['x_cols'])
49 | x = {'train': slw['x_train'], 'test': slw['x_test']}
50 |
51 | for clf_name in classifiers:
52 | res = run_exp_onealg(run, slw, x, clf_name, classifiers, res_by_user, st)
53 | res_clf[clf_name] = res
54 |
55 | return res_clf, x_cols
56 |
57 |
58 | def run_experiment(n_run, classifiers, data_in, test_size=0.5, res_by_user=True, shuffle=True,
59 | normalization="StandardScaler", by_user_time=True, by_user_time_trainper=0.5,
60 | limit_ntrain_user=0):
61 |
62 | all_res = {'exp_setting': {'y_col': 'insider', 'train_dname': data_in['name'],
63 | 'shuffle': shuffle, 'norm': normalization, 'res_by_user': res_by_user}}
64 | all_res['exp_setting']['classifiers'] = list(classifiers.keys())
65 | all_res['exp_setting']['n_run'] = n_run
66 | all_res['exp_setting']['in_test_size'] = test_size
67 | all_res['exp_setting']['by_user_time_trainper'] = by_user_time_trainper
68 | all_res['exp_setting']['limit_ntrain_user'] = limit_ntrain_user
69 |
70 | for run in range(n_run):
71 | res_clf, x_cols = run_exp_onerun(run, classifiers, data_in, test_size,
72 | res_by_user=res_by_user, by_user_time=by_user_time,
73 | by_user_time_trainper=by_user_time_trainper,
74 | limit_ntrain_user=limit_ntrain_user,
75 | shuffle=shuffle,
76 | normalization=normalization,
77 | )
78 | gc.collect()
79 | all_res[run] = res_clf
80 | return all_res
81 |
82 |
83 | def load_data(dname, dtype, datafolder='data'):
84 | name = datafolder + '/' + dtype + dname + '.csv.gz'
85 | return pd.read_csv(name)
86 |
87 |
88 | def run_exp(nrun, dname, dtype, mode, ttype=None, limit_ntrain_user=None, train_week_per=None, test_per=0.5, algs=None,
89 | load_params=True, scaler='StandardScaler', savefolder='res'):
90 |
91 | print('\n----------------\n%s %s' % (dname, dtype), '\n----------------\n')
92 |
93 | clfs = {'LR': LogisticRegression(solver='lbfgs', n_jobs=n_cores),
94 | 'MLP': MLPClassifier(solver='adam'),
95 | 'RF': RandomForestClassifier(n_jobs=n_cores),
96 | 'XGB': XGBClassifier(n_jobs=n_cores),
97 | }
98 |
99 | if algs is not None:
100 | clfs = {k:clfs[k] for k in algs}
101 |
102 | if load_params:
103 | with open('params.pkl', 'rb') as f:
104 | loaded_params = pickle.load(f)
105 | for c in clfs:
106 | if c != 'LR':
107 | clfs[c].set_params(**loaded_params[c][dtype])
108 |
109 | data_in = {'name': dname, 'data': load_data(dname, dtype)}
110 |
111 | if mode == 'by_user_time':
112 | res = run_experiment(nrun, clfs, data_in, by_user_time=True,
113 | by_user_time_trainper=train_week_per,
114 | limit_ntrain_user=limit_ntrain_user,
115 | res_by_user=True, normalization=scaler)
116 | elif mode == 'randomsplit':
117 | res = run_experiment(nrun, clfs, data_in, by_user_time=False,
118 | test_size=test_per,
119 | res_by_user=False, normalization=scaler)
120 |
121 | savefile = '%s/%s-%s-%s-%s-%s' % (savefolder, dname, dtype, ttype, mode, '_'.join(algs)) + '.pickle'
122 | with open(savefile, 'wb') as handle:
123 | pickle.dump(res, handle, protocol=4)
124 | return res
125 |
126 |
127 | if __name__ == "__main__":
128 | algs = ['RF']
129 | nrun = 2
130 |
131 | dname = 'r5.2'
132 | dtypes = ['week']
133 | mode = 'by_user_time'
134 |
135 | for dtype in dtypes:
136 | res = run_exp(nrun, dname, dtype, algs=algs, mode=mode, limit_ntrain_user=400, train_week_per=0.5,
137 | load_params=True)
138 | if mode == 'randomsplit': continue
139 |
140 | res = clf_helpers.roc_auc_calc(res, algs=algs, nrun=nrun, dtype=dtype, data=dname)
141 |
142 | colors = ['r', 'g', 'blue', 'orange']
143 | for user in [True, False]:
144 | plt.figure()
145 | restype = 'user' if user else 'org'
146 | for i, alg in enumerate(algs):
147 | tmp = clf_helpers.get_cert_roc(res, alg, dtype, 'test_in', user=user)
148 | plt.plot(tmp[0], tmp[1], label=f'{alg}, AUC = {tmp[4]:.3f}', color=colors[i])
149 | plt.fill_between(tmp[0], tmp[3], tmp[2], color=colors[i], alpha=.1, label=None)
150 | plt.legend()
151 | plt.savefig(f'ROC_{dtype}_{restype}.jpg')
152 |
--------------------------------------------------------------------------------
/example_anomaly_detection.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | import pandas as pd
5 | import numpy as np
6 | # from sklearn.ensemble import IsolationForest
7 | from sklearn.neural_network import MLPRegressor
8 | from sklearn.metrics.pairwise import paired_distances
9 | from sklearn.preprocessing import StandardScaler
10 | from sklearn.metrics import roc_auc_score
11 |
12 |
13 | print('This script runs a sample anomaly detection (using a simple autoencoder) '
14 | 'on the CERT dataset. By default it takes CERT r5.2 data extracted with the percentile '
15 | 'representation, generated using the temporal_data_representation script. '
16 | 'It then trains on data of 200 random users in the first half of the dataset, '
17 | 'and outputs the AUC score and detection rates at different budgets (instance-based).')
18 |
19 | print('For more details, see this paper: Anomaly Detection for Insider Threats Using'
20 | ' Unsupervised Ensembles. Le, D. C.; and Zincir-Heywood, A. N. IEEE Transactions'
21 | ' on Network and Service Management, 18(2): 1152–1164. June 2021.')
22 |
23 | data = pd.read_pickle('week-r5.2-percentile30.pkl')
24 | removed_cols = ['user','day','week','starttime','endtime','sessionid','insider']
25 | x_cols = [i for i in data.columns if i not in removed_cols]
26 |
27 | run = 1
28 | np.random.seed(run)
29 |
30 | data1stHalf = data[data.week <= max(data.week)/2]
31 | dataTest = data[data.week > max(data.week)/2]
32 |
33 | nUsers = np.random.permutation(list(set(data1stHalf.user)))
34 | trainUsers = nUsers[:200]
35 |
36 |
37 | xTrain = data1stHalf[data1stHalf.user.isin(trainUsers)][x_cols].values
38 | yTrain = data1stHalf[data1stHalf.user.isin(trainUsers)]['insider'].values
39 | yTrainBin = yTrain > 0
40 |
41 | xTest = data[x_cols].values
42 | yTest = data['insider'].values
43 | yTestBin = yTest > 0
44 |
45 | scaler = StandardScaler()
46 | xTrain = scaler.fit_transform(xTrain)
47 | xTest = scaler.transform(xTest)
48 |
49 | ae = MLPRegressor(hidden_layer_sizes=(int(data.shape[1]/4), int(data.shape[1]/8),
50 | int(data.shape[1]/4)), max_iter=25, random_state=10)
51 |
52 | ae.fit(xTrain, xTrain)
53 |
54 | reconstructionError = paired_distances(xTest, ae.predict(xTest))
55 |
56 | print('AUC score: ', roc_auc_score(yTestBin, reconstructionError))
57 |
58 | print('Detection rate at different budgets:')
59 | for ib in [0.001, 0.01, 0.05, 0.1, 0.2]:
60 | threshold = np.percentile(reconstructionError, 100-100*ib)
61 | flagged = np.where(reconstructionError>threshold)[0]
62 | dr = sum(yTestBin[flagged]>0)/sum(yTestBin>0)
63 | print(f'{100*ib}%, DR = {100*dr:.2f}%')
--------------------------------------------------------------------------------
/example_classification.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | import pandas as pd
5 | import numpy as np
6 | from sklearn.ensemble import RandomForestClassifier
7 | from sklearn.metrics import recall_score, classification_report, f1_score, accuracy_score
8 |
9 | print('This script trains a sample classifier (a simple RandomForestClassifier) '
10 | 'on the CERT dataset. By default it takes the CERT r5.2 extracted day data '
11 | 'downloaded from https://web.cs.dal.ca/~lcd/data/CERTr5.2/, '
12 | 'trains on data of 400 users in the first half of the dataset, '
13 | 'then outputs a classification report (instance-based).')
14 |
15 | print('For more details, see this paper: Analyzing Data Granularity Levels for'
16 | ' Insider Threat Detection using Machine Learning. Le, D. C.; Zincir-Heywood, '
17 | 'A. N.; and Heywood, M. I. IEEE Transactions on Network and Service Management,'
18 | ' 17(1): 30–44. March 2020.')
19 |
20 | data = pd.read_csv('day-r5.2.csv.gz')
21 | removed_cols = ['user','day','week','starttime','endtime','sessionid','insider']
22 | x_cols = [i for i in data.columns if i not in removed_cols]
23 |
24 | run = 1
25 | np.random.seed(run)
26 |
27 | data1stHalf = data[data.week <= max(data.week)/2]
28 | dataTest = data[data.week > max(data.week)/2]
29 |
30 | selectedTrainUsers = set(data1stHalf[data1stHalf.insider > 0]['user'])
31 | nUsers = np.random.permutation(list(set(data1stHalf.user) - selectedTrainUsers))
32 | trainUsers = np.concatenate((list(selectedTrainUsers), nUsers[:400-len(selectedTrainUsers)]))
33 |
34 | unKnownTestUsers = list(set(dataTest.user) - selectedTrainUsers)
35 |
36 | xTrain = data1stHalf[data1stHalf.user.isin(trainUsers)][x_cols].values
37 | yTrain = data1stHalf[data1stHalf.user.isin(trainUsers)]['insider'].values
38 | yTrainBin = yTrain > 0
39 |
40 | xTest = dataTest[dataTest.user.isin(unKnownTestUsers)][x_cols].values
41 | yTest = dataTest[dataTest.user.isin(unKnownTestUsers)]['insider'].values
42 | yTestBin = yTest > 0
43 |
44 | rf = RandomForestClassifier(n_jobs=-1)
45 |
46 | rf.fit(xTrain, yTrainBin)
47 |
48 | print(classification_report(yTestBin, rf.predict(xTest)))
49 |
50 |
--------------------------------------------------------------------------------
/feature_extraction.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | """
4 | @author: lcd
5 | """
6 | import os, sys
7 | import pandas as pd
8 | import numpy as np
9 | from datetime import datetime, timedelta
10 | import re
11 | import time
12 | import subprocess
13 | from joblib import Parallel, delayed
14 |
15 | def time_convert(inp, mode, real_sd = '2010-01-02', sd_monday= "2009-12-28"):
16 | if mode == 'e2t':
17 | return datetime.fromtimestamp(inp).strftime('%m/%d/%Y %H:%M:%S')
18 | elif mode == 't2e':
19 | return datetime.strptime(inp, '%m/%d/%Y %H:%M:%S').strftime('%s')
20 | elif mode == 't2dt':
21 | return datetime.strptime(inp, '%m/%d/%Y %H:%M:%S')
22 | elif mode == 't2date':
23 | return datetime.strptime(inp, '%m/%d/%Y %H:%M:%S').strftime("%Y-%m-%d")
24 | elif mode == 'dt2t':
25 | return inp.strftime('%m/%d/%Y %H:%M:%S')
26 | elif mode == 'dt2W':
27 | return int(inp.strftime('%W'))
28 | elif mode == 'dt2d':
29 | return inp.strftime('%m/%d/%Y %H:%M:%S')
30 | elif mode == 'dt2date':
31 | return inp.strftime("%Y-%m-%d")
32 | elif mode =='dt2dn': #datetime to day number
33 | startdate = datetime.strptime(sd_monday,'%Y-%m-%d')
34 | return (inp - startdate).days
35 | elif mode =='dn2epoch': #datenum to epoch
36 | dt = datetime.strptime(sd_monday,'%Y-%m-%d') + timedelta(days=inp)
37 | return int(dt.timestamp())
38 | elif mode =='dt2wn': #datetime to week number
39 | startdate = datetime.strptime(real_sd,'%Y-%m-%d')
40 | return (inp - startdate).days//7
41 |     elif mode =='t2wn': #time string to week number
42 | startdate = datetime.strptime(real_sd,'%Y-%m-%d')
43 | return (datetime.strptime(inp, '%m/%d/%Y %H:%M:%S') - startdate).days//7
44 | elif mode == 'dt2wd':
45 | return int(inp.strftime("%w"))
46 | elif mode == 'm2dt':
47 | return datetime.strptime(inp, "%Y-%m")
48 | elif mode == 'datetoweekday':
49 | return int(datetime.strptime(inp,"%Y-%m-%d").strftime('%w'))
50 | elif mode == 'datetoweeknum':
51 | w0 = datetime.strptime(sd_monday,"%Y-%m-%d")
52 | return int((datetime.strptime(inp,"%Y-%m-%d") - w0).days / 7)
53 | elif mode == 'weeknumtodate':
54 | startday = datetime.strptime(sd_monday,"%Y-%m-%d")
55 | return startday+timedelta(weeks = inp)
56 |
57 | def add_action_thisweek(act, columns, lines, act_handles, week_index, stop, firstdate, dname = 'r5.2'):
58 | thisweek_act = []
59 | while True:
60 | if not lines[act]:
61 | stop[act] = 1
62 | break
63 | if dname in ['r6.1','r6.2'] and act in ['email', 'file','http'] and '"' in lines[act]:
64 | tmp = lines[act]
65 | firstpart = tmp[:tmp.find('"')-1]
66 | content = tmp[tmp.find('"')+1:-1]
67 | tmp = firstpart.split(',') + [content]
68 | else:
69 | tmp = lines[act].split(',')
70 | if time_convert(tmp[1], 't2wn', real_sd= firstdate) == week_index:
71 | thisweek_act.append(tmp)
72 | else:
73 | break
74 | lines[act] = act_handles[act].readline()
75 | df = pd.DataFrame(thisweek_act, columns=columns)
76 | df['type']= act
77 | df.index = df['id']
78 |     df.drop('id', axis=1, inplace=True)
79 | return df
80 |
81 | def combine_by_timerange_pandas(dname = 'r4.2'):
82 | allacts = ['device','email','file', 'http','logon']
83 | firstline = str(subprocess.check_output(['head', '-2', 'http.csv'])).split('\\n')[1]
84 | firstdate = time_convert(firstline.split(',')[1],'t2dt')
85 | firstdate = firstdate - timedelta(int(firstdate.strftime("%w")))
86 | firstdate = time_convert(firstdate, 'dt2date')
87 | week_index = 0
88 | act_handles = {}
89 | lines = {}
90 | stop = {}
91 | for act in allacts:
92 | act_handles[act] = open(act+'.csv','r')
93 | next(act_handles[act],None) #skip header row
94 | lines[act] = act_handles[act].readline()
95 | stop[act] = 0 # store stop value indicating if all of the file has been read
96 | while sum(stop.values()) < 5:
97 | thisweekdf = pd.DataFrame()
98 | for act in allacts:
99 | if 'email' == act:
100 | if dname in ['r4.1','r4.2']:
101 | columns = ['id', 'date', 'user', 'pc', 'to', 'cc', 'bcc', 'from', 'size', '#att', 'content']
102 | if dname in ['r6.1','r6.2','r5.2','r5.1']:
103 | columns = ['id', 'date', 'user', 'pc', 'to', 'cc', 'bcc', 'from', 'activity', 'size', 'att', 'content']
104 | elif 'logon' == act:
105 | columns = ['id', 'date', 'user', 'pc', 'activity']
106 | elif 'device' == act:
107 | if dname in ['r4.1','r4.2']:
108 | columns = ['id', 'date', 'user', 'pc', 'activity']
109 | if dname in ['r5.1','r5.2','r6.2','r6.1']:
110 | columns = ['id', 'date', 'user', 'pc', 'content', 'activity']
111 | elif 'http' == act:
112 | if dname in ['r6.1','r6.2']: columns = ['id', 'date', 'user', 'pc', 'url/fname', 'activity', 'content']
113 | if dname in ['r5.1','r5.2','r4.2','r4.1']: columns = ['id', 'date', 'user', 'pc', 'url/fname', 'content']
114 | elif 'file' == act:
115 | if dname in ['r4.1','r4.2']: columns = ['id', 'date', 'user', 'pc', 'url/fname', 'content']
116 | if dname in ['r5.2','r5.1','r6.2','r6.1']: columns = ['id', 'date', 'user', 'pc', 'url/fname','activity','to','from','content']
117 |
118 | df = add_action_thisweek(act, columns, lines, act_handles, week_index, stop, firstdate, dname=dname)
119 | thisweekdf = thisweekdf.append(df, sort=False)
120 |
121 | thisweekdf['date'] = thisweekdf['date'].apply(lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M:%S"))
122 | thisweekdf.to_pickle("DataByWeek/"+str(week_index)+".pickle")
123 | week_index += 1
124 |
125 | ##############################################################################
126 |
127 | def process_user_pc(upd, roles): #figure out which PC belongs to which user
128 | upd['sharedpc'] = None
129 | upd['npc'] = upd['pcs'].apply(lambda x: len(x))
130 |     upd.loc[upd['npc']==1,'pc'] = upd[upd['npc']==1]['pcs'].apply(lambda x: x[0])
131 | multiuser_pcs = np.concatenate(upd[upd['npc']>1]['pcs'].values).tolist()
132 | set_multiuser_pc = list(set(multiuser_pcs))
133 | count = {}
134 | for pc in set_multiuser_pc:
135 | count[pc] = multiuser_pcs.count(pc)
136 | for u in upd[upd['npc']>1].index:
137 | sharedpc = upd.loc[u]['pcs']
138 | count_u_pc = [count[pc] for pc in upd.loc[u]['pcs']]
139 | the_pc = count_u_pc.index(min(count_u_pc))
140 | upd.at[u,'pc'] = sharedpc[the_pc]
141 | if roles.loc[u] != 'ITAdmin':
142 | sharedpc.remove(sharedpc[the_pc])
143 | upd.at[u,'sharedpc']= sharedpc
144 | return upd
145 |
146 | def getuserlist(dname = 'r4.2', psycho = True):
147 | allfiles = ['LDAP/'+f1 for f1 in os.listdir('LDAP') if os.path.isfile('LDAP/'+f1)]
148 | alluser = {}
149 | alreadyFired = []
150 |
151 | for file in allfiles:
152 | af = (pd.read_csv(file,delimiter=',')).values
153 | employeesThisMonth = []
154 | for i in range(len(af)):
155 | employeesThisMonth.append(af[i][1])
156 | if af[i][1] not in alluser:
157 | alluser[af[i][1]] = af[i][0:1].tolist() + af[i][2:].tolist() + [file.split('.')[0] , np.nan]
158 |
159 | firedEmployees = list(set(alluser.keys()) - set(alreadyFired) - set(employeesThisMonth))
160 | alreadyFired = alreadyFired + firedEmployees
161 | for e in firedEmployees:
162 | alluser[e][-1] = file.split('.')[0]
163 |
164 | if psycho and os.path.isfile("psychometric.csv"):
165 |
166 | p_score = pd.read_csv("psychometric.csv",delimiter = ',').values
167 | for id in range(len(p_score)):
168 | alluser[p_score[id,1]] = alluser[p_score[id,1]]+ list(p_score[id,2:])
169 | df = pd.DataFrame.from_dict(alluser, orient='index')
170 | if dname in ['r4.1','r4.2']:
171 | df.columns = ['uname', 'email', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'sup','wstart', 'wend', 'O', 'C', 'E', 'A', 'N']
172 | elif dname in ['r5.2','r5.1','r6.2','r6.1']:
173 | df.columns = ['uname', 'email', 'role', 'project', 'b_unit', 'f_unit', 'dept', 'team', 'sup','wstart', 'wend', 'O', 'C', 'E', 'A', 'N']
174 | else:
175 | df = pd.DataFrame.from_dict(alluser, orient='index')
176 | if dname in ['r4.1','r4.2']:
177 | df.columns = ['uname', 'email', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'sup', 'wstart', 'wend']
178 | elif dname in ['r5.2','r5.1','r6.2','r6.1']:
179 | df.columns = ['uname', 'email', 'role', 'project', 'b_unit', 'f_unit', 'dept', 'team', 'sup', 'wstart', 'wend']
180 |
181 | df['pc'] = None
182 | for i in df.index:
183 | if type(df.loc[i]['sup']) == str:
184 | sup = df[df['uname'] == df.loc[i]['sup']].index[0]
185 | else:
186 | sup = None
187 | df.at[i,'sup'] = sup
188 |
189 | #read first 2 weeks to determine each user's PC
190 | w1 = pd.read_pickle("DataByWeek/1.pickle")
191 | w2 = pd.read_pickle("DataByWeek/2.pickle")
192 | user_pc_dict = pd.DataFrame(index=df.index)
193 | user_pc_dict['pcs'] = None
194 |
195 | for u in df.index:
196 | pc = list(set(w1[w1['user']==u]['pc']) & set(w2[w2['user']==u]['pc']))
197 | user_pc_dict.at[u,'pcs'] = pc
198 | upd = process_user_pc(user_pc_dict, df['role'])
199 | df['pc'] = upd['pc']
200 | df['sharedpc'] = upd['sharedpc']
201 | return df
202 |
203 |
204 | def get_mal_userdata(data = 'r4.2', usersdf = None):
205 |
206 | if not os.path.isdir('answers'):
207 | os.system('wget https://kilthub.cmu.edu/ndownloader/files/24857828 -O answers.tar.bz2')
208 | os.system('tar -xjvf answers.tar.bz2')
209 |
210 | listmaluser = pd.read_csv("answers/insiders.csv")
211 | listmaluser['dataset'] = listmaluser['dataset'].apply(lambda x: str(x))
212 | listmaluser = listmaluser[listmaluser['dataset']==data.replace("r","")]
213 |     #for r6.2, the start time in the scenario 4 answer is incomplete; prepend the missing month.
214 |     if data == 'r6.2': listmaluser.loc[listmaluser['scenario']==4,'start'] = '02'+listmaluser[listmaluser['scenario']==4]['start']
215 | listmaluser[['start','end']] = listmaluser[['start','end']].applymap(lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M:%S"))
216 |
217 | if type(usersdf) != pd.core.frame.DataFrame:
218 | usersdf = getuserlist(data)
219 | usersdf['malscene']=0
220 | usersdf['mstart'] = None
221 | usersdf['mend'] = None
222 | usersdf['malacts'] = None
223 |
224 | for i in listmaluser.index:
225 | usersdf.loc[listmaluser['user'][i], 'mstart'] = listmaluser['start'][i]
226 | usersdf.loc[listmaluser['user'][i], 'mend'] = listmaluser['end'][i]
227 | usersdf.loc[listmaluser['user'][i], 'malscene'] = listmaluser['scenario'][i]
228 |
229 | if data in ['r4.2', 'r5.2']:
230 | malacts = open(f"answers/r{listmaluser['dataset'][i]}-{listmaluser['scenario'][i]}/"+
231 | listmaluser['details'][i],'r').read().strip().split("\n")
232 | else: #only 1 malicious user, no folder
233 | malacts = open("answers/"+ listmaluser['details'][i],'r').read().strip().split("\n")
234 |
235 | malacts = [x.split(',') for x in malacts]
236 |
237 | mal_users = np.array([x[3].strip('"') for x in malacts])
238 | mal_act_ids = np.array([x[1].strip('"') for x in malacts])
239 |
240 | usersdf.at[listmaluser['user'][i], 'malacts'] = mal_act_ids[mal_users==listmaluser['user'][i]]
241 |
242 | return usersdf
243 |
244 | ##############################################################################
245 |
246 | def is_after_whour(dt): #Workhours assumed 7:30-17:30
247 | wday_start = datetime.strptime("7:30", "%H:%M").time()
248 | wday_end = datetime.strptime("17:30", "%H:%M").time()
249 | dt = dt.time()
250 | if dt < wday_start or dt > wday_end:
251 | return True
252 | return False
253 |
254 | def is_weekend(dt):
255 | if dt.strftime("%w") in ['0', '6']:
256 | return True
257 | return False
258 |
259 | def email_process(act, data = 'r4.2', separate_send_receive = True):
260 | receivers = act['to'].split(';')
261 | if type(act['cc']) == str:
262 | receivers = receivers + act['cc'].split(";")
263 | if type(act['bcc']) == str:
264 | bccreceivers = act['bcc'].split(";")
265 | else:
266 | bccreceivers = []
267 | exemail = False
268 | n_exdes = 0
269 | for i in receivers + bccreceivers:
270 | if 'dtaa.com' not in i:
271 | exemail = True
272 | n_exdes += 1
273 |
274 | n_des = len(receivers) + len(bccreceivers)
275 | Xemail = 1 if exemail else 0
276 | n_bccdes = len(bccreceivers)
277 | exbccmail = 0
278 | email_text_len = len(act['content'])
279 | email_text_nwords = act['content'].count(' ') + 1
280 | for i in bccreceivers:
281 | if 'dtaa.com' not in i:
282 | exbccmail = 1
283 | break
284 |
285 | if data in ['r5.1','r5.2','r6.1','r6.2']:
286 | send_mail = 1 if act['activity'] == 'Send' else 0
287 | receive_mail = 1 if act['activity'] in ['Receive','View'] else 0
288 |
289 | atts = act['att'].split(';')
290 | n_atts = len(atts)
291 | size_atts = 0
292 | att_types = [0,0,0,0,0,0]
293 | att_sizes = [0,0,0,0,0,0]
294 | for att in atts:
295 | if '.' in att:
296 | tmp = file_process(att, filetype='att')
297 | att_types = [sum(x) for x in zip(att_types,tmp[0])]
298 | att_sizes = [sum(x) for x in zip(att_sizes,tmp[1])]
299 | size_atts +=sum(tmp[1])
300 | return [send_mail, receive_mail, n_des, n_atts, Xemail, n_exdes,
301 | n_bccdes, exbccmail, int(act['size']), email_text_len,
302 | email_text_nwords] + att_types + att_sizes
303 | elif data in ['r4.1','r4.2']:
304 | return [n_des, int(act['#att']), Xemail, n_exdes, n_bccdes, exbccmail,
305 | int(act['size']), email_text_len, email_text_nwords]
306 |
307 | def http_process(act, data = 'r4.2'):
308 | # basic features:
309 | url_len = len(act['url/fname'])
310 | url_depth = act['url/fname'].count('/')-2
311 | content_len = len(act['content'])
312 | content_nwords = act['content'].count(' ')+1
313 |
314 | domainname = re.findall("//(.*?)/", act['url/fname'])[0]
315 |     domainname = domainname.replace("www.","")
316 | dn = domainname.split(".")
317 | if len(dn) > 2 and not any([x in domainname for x in ["google.com", '.co.uk', '.co.nz', 'live.com']]):
318 | domainname = ".".join(dn[-2:])
319 |
320 | # other 1, socnet 2, cloud 3, job 4, leak 5, hack 6
321 | if domainname in ['dropbox.com', 'drive.google.com', 'mega.co.nz', 'account.live.com']:
322 | r = 3
323 | elif domainname in ['wikileaks.org','freedom.press','theintercept.com']:
324 | r = 5
325 | elif domainname in ['facebook.com','twitter.com','plus.google.com','instagr.am','instagram.com',
326 | 'flickr.com','linkedin.com','reddit.com','about.com','youtube.com','pinterest.com',
327 | 'tumblr.com','quora.com','vine.co','match.com','t.co']:
328 | r = 2
329 | elif domainname in ['indeed.com','monster.com', 'careerbuilder.com','simplyhired.com']:
330 | r = 4
331 |
332 | elif ('job' in domainname and ('hunt' in domainname or 'search' in domainname)) \
333 | or ('aol.com' in domainname and ("recruit" in act['url/fname'] or "job" in act['url/fname'])):
334 | r = 4
335 | elif (domainname in ['webwatchernow.com','actionalert.com', 'relytec.com','refog.com','wellresearchedreviews.com',
336 | 'softactivity.com', 'spectorsoft.com','best-spy-soft.com']):
337 | r = 6
338 | elif ('keylog' in domainname):
339 | r = 6
340 | else:
341 | r = 1
342 | if data in ['r6.1','r6.2']:
343 | http_act_dict = {'www visit': 1, 'www download': 2, 'www upload': 3}
344 | http_act = http_act_dict.get(act['activity'].lower(), 0)
345 | return [r, url_len, url_depth, content_len, content_nwords, http_act]
346 | else:
347 | return [r, url_len, url_depth, content_len, content_nwords]
348 |
349 | def file_process(act, complete_ul = None, data = 'r4.2', filetype = 'act'):
350 | if filetype == 'act':
351 | ftype = act['url/fname'].split(".")[1]
352 | disk = 1 if act['url/fname'][0] == 'C' else 0
353 | if act['url/fname'][0] == 'R': disk = 2
354 | file_depth = act['url/fname'].count('\\')
355 | elif filetype == 'att': #attachments
356 | tmp = act.split('.')[1]
357 | ftype = tmp[:tmp.find('(')]
358 | attsize = int(tmp[tmp.find("(")+1:tmp.find(")")])
359 | r = [[0,0,0,0,0,0], [0,0,0,0,0,0]]
360 | if ftype in ['zip','rar','7z']:
361 | ind = 1
362 | elif ftype in ['jpg', 'png', 'bmp']:
363 | ind = 2
364 | elif ftype in ['doc','docx', 'pdf']:
365 | ind = 3
366 | elif ftype in ['txt','cfg', 'rtf']:
367 | ind = 4
368 | elif ftype in ['exe', 'sh']:
369 | ind = 5
370 | else:
371 | ind = 0
372 | r[0][ind] = 1
373 | r[1][ind] = attsize
374 | return r
375 |
376 | fsize = len(act['content'])
377 | f_nwords = act['content'].count(' ')+1
378 | if ftype in ['zip','rar','7z']:
379 | r = 2
380 | elif ftype in ['jpg', 'png', 'bmp']:
381 | r = 3
382 | elif ftype in ['doc','docx', 'pdf']:
383 | r = 4
384 | elif ftype in ['txt','cfg','rtf']:
385 | r = 5
386 | elif ftype in ['exe', 'sh']:
387 | r = 6
388 | else:
389 | r = 1
390 | if data in ['r5.2','r5.1', 'r6.2','r6.1']:
391 | to_usb = 1 if act['to'] == 'True' else 0
392 | from_usb = 1 if act['from'] == 'True' else 0
393 | file_depth = act['url/fname'].count('\\')
394 | file_act_dict = {'file open': 1, 'file copy': 2, 'file write': 3, 'file delete': 4}
395 | if act['activity'].lower() not in file_act_dict: print(act['activity'].lower())
396 | file_act = file_act_dict.get(act['activity'].lower(), 0)
397 | return [r, fsize, f_nwords, disk, file_depth, file_act, to_usb, from_usb]
398 | elif data in ['r4.1','r4.2']:
399 | return [r, fsize, f_nwords, disk, file_depth]
400 |
401 | def from_pc(act, ul):
402 | #code: 0,1,2,3: own pc, sharedpc, other's pc, supervisor's pc
403 | user_pc = ul.loc[act['user']]['pc']
404 | act_pc = act['pc']
405 | if act_pc == user_pc:
406 | return (0, act_pc) #using normal PC
407 | elif ul.loc[act['user']]['sharedpc'] is not None and act_pc in ul.loc[act['user']]['sharedpc']:
408 | return (1, act_pc)
409 | elif ul.loc[act['user']]['sup'] is not None and act_pc == ul.loc[ul.loc[act['user']]['sup']]['pc']:
410 | return (3, act_pc)
411 | else:
412 | return (2, act_pc)
413 |
414 | def process_week_num(week, users, userlist = 'all', data = 'r4.2'):
415 |
416 | user_dict = {idx: i for (i, idx) in enumerate(users.index)}
417 | acts_week = pd.read_pickle("DataByWeek/"+str(week)+".pickle")
418 | start_week, end_week = min(acts_week.date), max(acts_week.date)
419 | acts_week.sort_values('date', ascending = True, inplace = True)
420 | n_cols = 45 if data in ['r5.2','r5.1'] else 46
421 | if data in ['r4.2','r4.1']: n_cols = 27
422 | u_week = np.zeros((len(acts_week), n_cols))
423 | pc_time = []
424 | if userlist == 'all':
425 | userlist = set(acts_week.user)
426 |
427 | #FOR EACH USER
428 | current_ind = 0
429 | for u in userlist:
430 | df_acts_u = acts_week[acts_week.user == u]
431 | mal_u = 0 #, stop_soon = 0, 0
432 | if users.loc[u].malscene > 0:
433 | if start_week <= users.loc[u].mend and users.loc[u].mstart <= end_week:
434 | mal_u = users.loc[u].malscene
435 |
436 | list_uacts = df_acts_u.type.tolist() #all user's activities
437 | list_activity = df_acts_u.activity.tolist()
438 | list_uacts = [list_activity[i].strip().lower() if (type(list_activity[i])==str and list_activity[i].strip() in ['Logon', 'Logoff', 'Connect', 'Disconnect']) \
439 | else list_uacts[i] for i in range(len(list_uacts))]
440 | uacts_mapping = {'logon':1, 'logoff':2, 'connect':3, 'disconnect':4, 'http':5,'email':6,'file':7}
441 | list_uacts_num = [uacts_mapping[x] for x in list_uacts]
442 |
443 | oneu_week = np.zeros((len(df_acts_u), n_cols))
444 | oneu_pc_time = []
445 | for i in range(len(df_acts_u)):
446 | pc, _ = from_pc(df_acts_u.iloc[i], users)
447 | if is_weekend(df_acts_u.iloc[i]['date']):
448 | if is_after_whour(df_acts_u.iloc[i]['date']):
449 | act_time = 4
450 | else:
451 | act_time = 3
452 | elif is_after_whour(df_acts_u.iloc[i]['date']):
453 | act_time = 2
454 | else:
455 | act_time = 1
456 |
457 | if data in ['r4.2','r4.1']:
458 | device_f = [0]
459 | file_f = [0, 0, 0, 0, 0]
460 | http_f = [0,0,0,0,0]
461 | email_f = [0]*9
462 | elif data in ['r5.2','r5.1','r6.2','r6.1']:
463 | device_f = [0,0]
464 | file_f = [0]*8
465 | http_f = [0,0,0,0,0]
466 | if data in ['r6.2','r6.1']:
467 | http_f = [0,0,0,0,0,0]
468 | email_f = [0]*23
469 |
470 | if list_uacts[i] == 'file':
471 | file_f = file_process(df_acts_u.iloc[i], data = data)
472 | elif list_uacts[i] == 'email':
473 | email_f = email_process(df_acts_u.iloc[i], data = data)
474 | elif list_uacts[i] == 'http':
475 | http_f = http_process(df_acts_u.iloc[i], data=data)
476 | elif list_uacts[i] == 'connect':
477 | tmp = df_acts_u.iloc[i:]
478 | disconnect_acts = tmp[(tmp['activity'] == 'Disconnect\n') & \
479 | (tmp['user'] == df_acts_u.iloc[i]['user']) & \
480 | (tmp['pc'] == df_acts_u.iloc[i]['pc'])]
481 |
482 | connect_acts = tmp[(tmp['activity'] == 'Connect\n') & \
483 | (tmp['user'] == df_acts_u.iloc[i]['user']) & \
484 | (tmp['pc'] == df_acts_u.iloc[i]['pc'])]
485 |
486 | if len(disconnect_acts) > 0:
487 | distime = disconnect_acts.iloc[0]['date']
488 | if len(connect_acts) > 0 and connect_acts.iloc[0]['date'] < distime:
489 | connect_dur = -1
490 | else:
491 | tmp_td = distime - df_acts_u.iloc[i]['date']
492 | connect_dur = tmp_td.days*24*3600 + tmp_td.seconds
493 | else:
494 | connect_dur = -1 # disconnect action not found!
495 |
496 | if data in ['r5.2','r5.1','r6.2','r6.1']:
497 | file_tree_len = len(df_acts_u.iloc[i]['content'].split(';'))
498 | device_f = [connect_dur, file_tree_len]
499 | else:
500 | device_f = [connect_dur]
501 |
502 | is_mal_act = 0
503 | if mal_u > 0 and df_acts_u.index[i] in users.loc[u]['malacts']: is_mal_act = 1
504 |
505 | oneu_week[i,:] = [ user_dict[u], time_convert(df_acts_u.iloc[i]['date'], 'dt2dn'), list_uacts_num[i], pc, act_time] \
506 | + device_f + file_f + http_f + email_f + [is_mal_act, mal_u]
507 |
508 | oneu_pc_time.append([df_acts_u.index[i], df_acts_u.iloc[i]['pc'],df_acts_u.iloc[i]['date']])
509 | u_week[current_ind:current_ind+len(oneu_week),:] = oneu_week
510 | pc_time += oneu_pc_time
511 | current_ind += len(oneu_week)
512 |
513 | u_week = u_week[0:current_ind, :]
514 | col_names = ['user','day','act','pc','time']
515 | if data in ['r4.1','r4.2']:
516 | device_feature_names = ['usb_dur']
517 | file_feature_names = ['file_type', 'file_len', 'file_nwords', 'disk', 'file_depth']
518 | http_feature_names = ['http_type', 'url_len','url_depth', 'http_c_len', 'http_c_nwords']
519 | email_feature_names = ['n_des', 'n_atts', 'Xemail', 'n_exdes', 'n_bccdes', 'exbccmail', 'email_size', 'email_text_slen', 'email_text_nwords']
520 | elif data in ['r5.2','r5.1', 'r6.2','r6.1']:
521 | device_feature_names = ['usb_dur', 'file_tree_len']
522 | file_feature_names = ['file_type', 'file_len', 'file_nwords', 'disk', 'file_depth', 'file_act', 'to_usb', 'from_usb']
523 | http_feature_names = ['http_type', 'url_len','url_depth', 'http_c_len', 'http_c_nwords']
524 | if data in ['r6.2','r6.1']:
525 | http_feature_names = ['http_type', 'url_len','url_depth', 'http_c_len', 'http_c_nwords', 'http_act']
526 | email_feature_names = ['send_mail', 'receive_mail','n_des', 'n_atts', 'Xemail', 'n_exdes', 'n_bccdes', 'exbccmail', 'email_size', 'email_text_slen', 'email_text_nwords']
527 | email_feature_names += ['e_att_other', 'e_att_comp', 'e_att_pho', 'e_att_doc', 'e_att_txt', 'e_att_exe']
528 | email_feature_names += ['e_att_sother', 'e_att_scomp', 'e_att_spho', 'e_att_sdoc', 'e_att_stxt', 'e_att_sexe']
529 |
530 | col_names = col_names + device_feature_names + file_feature_names+ http_feature_names + email_feature_names + ['mal_act','insider']#['stop_soon', 'mal_act','insider']
531 | df_u_week = pd.DataFrame(columns=['actid','pcid','time_stamp'] + col_names, index = np.arange(0,len(pc_time)))
532 | df_u_week[['actid','pcid','time_stamp']] = np.array(pc_time)
533 |
534 | df_u_week[col_names] = u_week
535 | df_u_week[col_names] = df_u_week[col_names].astype(int)
536 | df_u_week.to_pickle("NumDataByWeek/"+str(week)+"_num.pickle")
537 |
538 | ##############################################################################
539 |
540 | # return sessions for each user in a week:
541 | # sessions[sid] = [sessionid, pc, start_with, end_with, start_time, end_time, number_of_concurrent_logins, [action_indices]]
542 | # start_with: whether the session starts with a logon action (1) or not (2)
543 | # end_with: whether the session ends with a logoff (1) or is closed by the next logon on the same PC (2)
544 | def get_sessions(uw, first_sid = 0):
545 | sessions = {}
546 | open_sessions = {}
547 | sid = 0
548 | current_pc = uw.iloc[0]['pcid']
549 | start_time = uw.iloc[0]['time_stamp']
550 | if uw.iloc[0]['act'] == 1:
551 | open_sessions[current_pc] = [current_pc, 1, 0, start_time, start_time, 1, [uw.index[0]]]
552 | else:
553 | open_sessions[current_pc] = [current_pc, 2, 0, start_time, start_time, 1, [uw.index[0]]]
554 |
555 | for i in uw.index[1:]:
556 | current_pc = uw.loc[i]['pcid']
557 | if current_pc in open_sessions: # must be already a session with that pcid
558 | if uw.loc[i]['act'] == 2:
559 | open_sessions[current_pc][2] = 1
560 | open_sessions[current_pc][4] = uw.loc[i]['time_stamp']
561 | open_sessions[current_pc][6].append(i)
562 | sessions[sid] = [first_sid+sid] + open_sessions.pop(current_pc)
563 | sid +=1
564 | elif uw.loc[i]['act'] == 1:
565 | open_sessions[current_pc][2] = 2
566 | sessions[sid] = [first_sid+sid] + open_sessions.pop(current_pc)
567 | sid +=1
568 | #create a new open session
569 | open_sessions[current_pc] = [current_pc, 1, 0, uw.loc[i]['time_stamp'], uw.loc[i]['time_stamp'], 1, [i]]
570 | if len(open_sessions) > 1: #increase the concurent count for all sessions
571 | for k in open_sessions:
572 | open_sessions[k][5] +=1
573 | else:
574 | open_sessions[current_pc][4] = uw.loc[i]['time_stamp']
575 | open_sessions[current_pc][6].append(i)
576 | else:
577 | start_status = 1 if uw.loc[i]['act'] == 1 else 2
578 | open_sessions[current_pc] = [current_pc, start_status, 0, uw.loc[i]['time_stamp'], uw.loc[i]['time_stamp'], 1, [i]]
579 | if len(open_sessions) > 1: #increase the concurent count for all sessions
580 | for k in open_sessions:
581 | open_sessions[k][5] +=1
582 | return sessions
583 |
584 | def get_u_features_dicts(ul, data = 'r5.2'):
585 | ufdict = {}
586 | list_uf=[] if data in ['r4.1','r4.2'] else ['project']
587 | list_uf += ['role','b_unit','f_unit', 'dept','team']
588 | for f in list_uf:
589 | ul[f] = ul[f].astype(str)
590 | tmp = list(set(ul[f]))
591 | tmp.sort()
592 | ufdict[f] = {idx:i for i, idx in enumerate(tmp)}
593 | return (ul,ufdict, list_uf)
594 |
595 | def proc_u_features(uf, ufdict, list_f = None, data = 'r4.2'): #to remove mode
596 | if type(list_f) != list:
597 | list_f=[] if data in ['r4.1','r4.2'] else ['project']
598 | list_f = ['role','b_unit','f_unit', 'dept','team'] + list_f
599 |
600 | out = []
601 | for f in list_f:
602 | out.append(ufdict[f][uf[f]])
603 | return out
604 |
605 | def f_stats_calc(ud, fn, stats_f, countonly_f = {}, get_stats = False):
606 | f_count = len(ud)
607 | r = []
608 | f_names = []
609 |
610 | for f in stats_f:
611 | inp = ud[f].values
612 | if get_stats:
613 | if f_count > 0:
614 | r += [np.min(inp), np.max(inp), np.median(inp), np.mean(inp), np.std(inp)]
615 | else: r += [0, 0, 0, 0, 0]
616 | f_names += [fn+'_min_'+f, fn+'_max_'+f, fn+'_med_'+f, fn+'_mean_'+f, fn+'_std_'+f]
617 | else:
618 | if f_count > 0: r += [np.mean(inp)]
619 | else: r += [0]
620 | f_names += [fn+'_mean_'+f]
621 |
622 | for f in countonly_f:
623 | for v in countonly_f[f]:
624 | r += [sum(ud[f].values == v)]
625 | f_names += [fn+'_n-'+f+str(v)]
626 | return (f_count, r, f_names)
627 |
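# f_calc_subfeatures: overall count and statistics for all actions in ud, plus the same
# statistics recomputed on each subset where filter_col equals filter_vals[i] (named filter_names[i]).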
628 | def f_calc_subfeatures(ud, fname, filter_col, filter_vals, filter_names, sub_features, countonly_subfeatures):
629 | [n, stats, fnames] = f_stats_calc(ud, fname,sub_features, countonly_subfeatures)
630 | allf = [n] + stats
631 | allf_names = ['n_'+fname] + fnames
632 | for i in range(len(filter_vals)):
633 | [n_sf, sf_stats, sf_fnames] = f_stats_calc(ud[ud[filter_col]==filter_vals[i]], filter_names[i], sub_features, countonly_subfeatures)
634 | allf += [n_sf] + sf_stats
635 | allf_names += [fname+'_n_'+filter_names[i]] + [fname + '_' + x for x in sf_fnames]
636 | return (allf, allf_names)
637 |
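# f_calc: build the numerical feature vector (and matching feature names) for one user's actions
# in a week, day or session, split into logon/device/file/email/http groups and, for week and day
# modes, into work-hour / after-hour (and, for weeks, weekend) subsets.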
638 | def f_calc(ud, mode = 'week', data = 'r4.2'):
639 | n_weekendact = (ud['time']==3).sum()
640 | if n_weekendact > 0:
641 | is_weekend = 1
642 | else:
643 | is_weekend = 0
644 |
645 | all_countonlyf = {'pc':[0,1,2,3]} if mode != 'session' else {}
646 | [all_f, all_f_names] = f_calc_subfeatures(ud, 'allact', None, [], [], [], all_countonlyf)
647 | if mode == 'day':
648 | [workhourf, workhourf_names] = f_calc_subfeatures(ud[(ud['time'] == 1) | (ud['time'] == 3)], 'workhourallact', None, [], [], [], all_countonlyf)
649 | [afterhourf, afterhourf_names] = f_calc_subfeatures(ud[(ud['time'] == 2) | (ud['time'] == 4) ], 'afterhourallact', None, [], [], [], all_countonlyf)
650 | elif mode == 'week':
651 | [workhourf, workhourf_names] = f_calc_subfeatures(ud[ud['time'] == 1], 'workhourallact', None, [], [], [], all_countonlyf)
652 | [afterhourf, afterhourf_names] = f_calc_subfeatures(ud[ud['time'] == 2 ], 'afterhourallact', None, [], [], [], all_countonlyf)
653 | [weekendf, weekendf_names] = f_calc_subfeatures(ud[ud['time'] >= 3 ], 'weekendallact', None, [], [], [], all_countonlyf)
654 |
655 | logon_countonlyf = {'pc':[0,1,2,3]} if mode != 'session' else {}
656 | logon_statf = []
657 |
658 | [all_logonf, all_logonf_names] = f_calc_subfeatures(ud[ud['act']==1], 'logon', None, [], [], logon_statf, logon_countonlyf)
659 | if mode == 'day':
660 | [workhourlogonf, workhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & ((ud['time'] == 1) | (ud['time'] == 3) )], 'workhourlogon', None, [], [], logon_statf, logon_countonlyf)
661 | [afterhourlogonf, afterhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & ((ud['time'] == 2) | (ud['time'] == 4) )], 'afterhourlogon', None, [], [], logon_statf, logon_countonlyf)
662 | elif mode == 'week':
663 | [workhourlogonf, workhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & (ud['time'] == 1)], 'workhourlogon', None, [], [], logon_statf, logon_countonlyf)
664 | [afterhourlogonf, afterhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & (ud['time'] == 2) ], 'afterhourlogon', None, [], [], logon_statf, logon_countonlyf)
665 | [weekendlogonf, weekendlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & (ud['time'] >= 3) ], 'weekendlogon', None, [], [], logon_statf, logon_countonlyf)
666 |
667 | device_countonlyf = {'pc':[0,1,2,3]} if mode != 'session' else {}
668 | device_statf = ['usb_dur','file_tree_len'] if data not in ['r4.1','r4.2'] else ['usb_dur']
669 |
670 | [all_devicef, all_devicef_names] = f_calc_subfeatures(ud[ud['act']==3], 'usb', None, [], [], device_statf, device_countonlyf)
671 | if mode == 'day':
672 | [workhourdevicef, workhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & ((ud['time'] == 1) | (ud['time'] == 3) )], 'workhourusb', None, [], [], device_statf, device_countonlyf)
673 | [afterhourdevicef, afterhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & ((ud['time'] == 2) | (ud['time'] == 4) )], 'afterhourusb', None, [], [], device_statf, device_countonlyf)
674 | elif mode == 'week':
675 | [workhourdevicef, workhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & (ud['time'] == 1)], 'workhourusb', None, [], [], device_statf, device_countonlyf)
676 | [afterhourdevicef, afterhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & (ud['time'] == 2) ], 'afterhourusb', None, [], [], device_statf, device_countonlyf)
677 | [weekenddevicef, weekenddevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & (ud['time'] >= 3) ], 'weekendusb', None, [], [], device_statf, device_countonlyf)
678 |
679 | if mode != 'session': file_countonlyf = {'to_usb':[1],'from_usb':[1], 'file_act':[1,2,3,4], 'disk':[0,1], 'pc':[0,1,2,3]}
680 | else: file_countonlyf = {'to_usb':[1],'from_usb':[1], 'file_act':[1,2,3,4], 'disk':[0,1,2]}
681 | if data in ['r4.1','r4.2']:
682 | [file_countonlyf.pop(k) for k in ['to_usb','from_usb', 'file_act']]
683 |
684 | (all_filef, all_filef_names) = f_calc_subfeatures(ud[ud['act']==7], 'file', 'file_type', [1,2,3,4,5,6], \
685 | ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
686 |
687 | if mode == 'day':
688 | (workhourfilef, workhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & ((ud['time'] ==1) | (ud['time'] ==3))], 'workhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
689 | (afterhourfilef, afterhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & ((ud['time'] ==2) | (ud['time'] ==4))], 'afterhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
690 | elif mode == 'week':
691 | (workhourfilef, workhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & (ud['time'] ==1)], 'workhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
692 | (afterhourfilef, afterhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & (ud['time'] ==2)], 'afterhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
693 | (weekendfilef, weekendfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & (ud['time'] >= 3)], 'weekendfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf)
694 |
695 | email_stats_f = ['n_des', 'n_atts', 'n_exdes', 'n_bccdes', 'email_size', 'email_text_slen', 'email_text_nwords']
696 | if data not in ['r4.1','r4.2']:
697 | email_stats_f += ['e_att_other', 'e_att_comp', 'e_att_pho', 'e_att_doc', 'e_att_txt', 'e_att_exe']
698 | email_stats_f += ['e_att_sother', 'e_att_scomp', 'e_att_spho', 'e_att_sdoc', 'e_att_stxt', 'e_att_sexe']
699 | mail_filter = 'send_mail'
700 | mail_filter_vals = [0,1]
701 | mail_filter_names = ['recvmail','send_mail']
702 | else:
703 | mail_filter, mail_filter_vals, mail_filter_names = None, [], []
704 |
705 | if mode != 'session': mail_countonlyf = {'Xemail':[1],'exbccmail':[1], 'pc':[0,1,2,3]}
706 | else: mail_countonlyf = {'Xemail':[1],'exbccmail':[1]}
707 |
708 | (all_emailf, all_emailf_names) = f_calc_subfeatures(ud[ud['act']==6], 'email', mail_filter, mail_filter_vals, mail_filter_names , email_stats_f, mail_countonlyf)
709 | if mode == 'week':
710 | (workhouremailf, workhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & (ud['time'] == 1)], 'workhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf)
711 | (afterhouremailf, afterhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & (ud['time'] == 2)], 'afterhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf)
712 | (weekendemailf, weekendemailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & (ud['time'] >= 3)], 'weekendemail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf)
713 | elif mode == 'day':
714 | (workhouremailf, workhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & ((ud['time'] ==1) | (ud['time'] ==3))], 'workhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf)
715 | (afterhouremailf, afterhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & ((ud['time'] ==2) | (ud['time'] ==4))], 'afterhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf)
716 |
717 |     if data in ['r4.1','r4.2','r5.1','r5.2']:
718 | http_count_subf = {'pc':[0,1,2,3]}
719 | elif data in ['r6.2','r6.1']:
720 | http_count_subf = {'pc':[0,1,2,3], 'http_act':[1,2,3]}
721 |
722 | if mode == 'session': http_count_subf.pop('pc',None)
723 |
724 | (all_httpf, all_httpf_names) = f_calc_subfeatures(ud[ud['act']==5], 'http', 'http_type', [1,2,3,4,5,6], \
725 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
726 |
727 | if mode == 'week':
728 | (workhourhttpf, workhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & (ud['time'] ==1)], 'workhourhttp', 'http_type', [1,2,3,4,5,6], \
729 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
730 | (afterhourhttpf, afterhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & (ud['time'] ==2)], 'afterhourhttp', 'http_type', [1,2,3,4,5,6], \
731 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
732 | (weekendhttpf, weekendhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & (ud['time'] >=3)], 'weekendhttp', 'http_type', [1,2,3,4,5,6], \
733 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
734 | elif mode == 'day':
735 | (workhourhttpf, workhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & ((ud['time'] ==1) | (ud['time'] ==3))], 'workhourhttp', 'http_type', [1,2,3,4,5,6], \
736 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
737 | (afterhourhttpf, afterhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & ((ud['time'] ==2) | (ud['time'] ==4))], 'afterhourhttp', 'http_type', [1,2,3,4,5,6], \
738 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf)
739 |
740 | numActs = all_f[0]
741 | mal_u = 0
742 | if (ud['mal_act']).sum() > 0:
743 | tmp = list(set(ud['insider']))
744 | if len(tmp) > 1:
745 | tmp.remove(0.0)
746 | mal_u = tmp[0]
747 |
748 | if mode == 'week':
749 | features_tmp = all_f + workhourf + afterhourf + weekendf +\
750 | all_logonf + workhourlogonf + afterhourlogonf + weekendlogonf +\
751 | all_devicef + workhourdevicef + afterhourdevicef + weekenddevicef +\
752 | all_filef + workhourfilef + afterhourfilef + weekendfilef + \
753 | all_emailf + workhouremailf + afterhouremailf + weekendemailf + all_httpf + workhourhttpf + afterhourhttpf + weekendhttpf
754 | fnames_tmp = all_f_names + workhourf_names + afterhourf_names + weekendf_names +\
755 | all_logonf_names + workhourlogonf_names + afterhourlogonf_names + weekendlogonf_names +\
756 | all_devicef_names + workhourdevicef_names + afterhourdevicef_names + weekenddevicef_names +\
757 | all_filef_names + workhourfilef_names + afterhourfilef_names + weekendfilef_names + \
758 | all_emailf_names + workhouremailf_names + afterhouremailf_names + weekendemailf_names + all_httpf_names + workhourhttpf_names + afterhourhttpf_names + weekendhttpf_names
759 | elif mode == 'day':
760 | features_tmp = all_f + workhourf + afterhourf +\
761 | all_logonf + workhourlogonf + afterhourlogonf +\
762 | all_devicef + workhourdevicef + afterhourdevicef + \
763 | all_filef + workhourfilef + afterhourfilef + \
764 | all_emailf + workhouremailf + afterhouremailf + all_httpf + workhourhttpf + afterhourhttpf
765 | fnames_tmp = all_f_names + workhourf_names + afterhourf_names +\
766 | all_logonf_names + workhourlogonf_names + afterhourlogonf_names +\
767 | all_devicef_names + workhourdevicef_names + afterhourdevicef_names +\
768 | all_filef_names + workhourfilef_names + afterhourfilef_names + \
769 | all_emailf_names + workhouremailf_names + afterhouremailf_names + all_httpf_names + workhourhttpf_names + afterhourhttpf_names
770 | elif mode == 'session':
771 | features_tmp = all_f + all_logonf + all_devicef + all_filef + all_emailf + all_httpf
772 | fnames_tmp = all_f_names + all_logonf_names + all_devicef_names + all_filef_names + all_emailf_names + all_httpf_names
773 |
774 | return [numActs, is_weekend, features_tmp, fnames_tmp, mal_u]
775 |
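# session_instance_calc: build one output row for a session (or sub-session) ud of user v:
# timing and activity-share columns, session metadata from sinfo, the user's categorical and
# OCEAN features, the f_calc features, and finally the insider label.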
776 | def session_instance_calc(ud, sinfo, week, mode, data, uw, v, list_uf):
777 | d = ud.iloc[0]['day']
778 | perworkhour = sum(ud['time']==1)/len(ud)
779 | perafterhour = sum(ud['time']==2)/len(ud)
780 | perweekend = sum(ud['time']==3)/len(ud)
781 | perweekendafterhour = sum(ud['time']==4)/len(ud)
782 | st_timestamp = min(ud['time_stamp'])
783 | end_timestamp = max(ud['time_stamp'])
784 |     s_dur = (end_timestamp - st_timestamp).total_seconds() / 60 # in minutes
785 | s_start = st_timestamp.hour + st_timestamp.minute/60
786 | s_end = end_timestamp.hour + end_timestamp.minute/60
787 | starttime = st_timestamp.timestamp()
788 | endtime = end_timestamp.timestamp()
789 | n_days = len(set(ud['day']))
790 |
791 | tmp = f_calc(ud, mode, data)
792 | session_instance = [starttime, endtime, v, sinfo[0], d, week, ud.iloc[0]['pc'], perworkhour, perafterhour, perweekend,
793 | perweekendafterhour, n_days, s_dur, sinfo[6], sinfo[2], sinfo[3], s_start, s_end] + \
794 | (uw.loc[v, list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N'] ]).tolist() + tmp[2] + [tmp[4]]
795 | return (session_instance, tmp[3])
796 |
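# to_csv: for one week, load the numerical data and compute per-user week/day/session instances
# (plus optional sub-sessions by time or action count for session mode), then pickle them to tmp/
# for later concatenation into the final CSV files.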
797 | def to_csv(week, mode, data, ul, uf_dict, list_uf, subsession_mode = {}):
798 | user_dict = {i : idx for (i, idx) in enumerate(ul.index)}
799 | if mode == 'session':
800 |         first_sid = week*100000 # a unique index for each session; the first 1-2 digits encode the week number
801 | cols2a = ['starttime', 'endtime','user', 'sessionid', 'day', 'week', 'pc', 'isworkhour', 'isafterhour','isweekend',
802 | 'isweekendafterhour', 'n_days', 'duration', 'n_concurrent_sessions', 'start_with', 'end_with', 'ses_start',
803 | 'ses_end'] + list_uf + ['ITAdmin','O','C','E','A','N']
804 | elif mode == 'day':
805 | cols2a = ['starttime', 'endtime','user', 'day', 'week', 'isweekday','isweekend'] + list_uf +\
806 | ['ITAdmin','O','C','E','A','N']
807 | else: cols2a = ['starttime', 'endtime','user','week'] + list_uf + ['ITAdmin','O','C','E','A','N']
808 | cols2b = ['insider']
809 |
810 | w = pd.read_pickle("NumDataByWeek/"+str(week)+"_num.pickle")
811 |
812 | usnlist = list(set(w['user'].astype('int').values))
813 | if True:
814 | cols = ['week']+ list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N', 'insider']
815 | uw = pd.DataFrame(columns = cols, index = user_dict.keys())
816 | uwdict = {}
817 | for v in user_dict:
818 | if v in usnlist:
819 | is_ITAdmin = 1 if ul.loc[user_dict[v], 'role'] == 'ITAdmin' else 0
820 | row = [week] + proc_u_features(ul.loc[user_dict[v]], uf_dict, list_uf, data = data) + [is_ITAdmin] + \
821 | (ul.loc[user_dict[v],['O','C','E','A','N']]).tolist() + [0]
822 | row[-1] = int(list(set(w[w['user']==v]['insider']))[0])
823 | uwdict[v] = row
824 | uw = pd.DataFrame.from_dict(uwdict, orient = 'index',columns = cols)
825 |
826 | towrite = pd.DataFrame()
827 | towrite_list = []
828 |
829 | if mode == 'session' and len(subsession_mode) > 0:
830 | towrite_list_subsession = {}
831 | for k1 in subsession_mode:
832 | towrite_list_subsession[k1] = {}
833 | for k2 in subsession_mode[k1]:
834 | towrite_list_subsession[k1][k2] = []
835 |
836 | days = list(set(w['day']))
837 | for v in user_dict:
838 | if v in usnlist:
839 | uactw = w[w['user']==v]
840 |
841 | if mode == 'week':
842 | a = uactw.iloc[0]['time_stamp']
843 |                 a = a - timedelta(int(a.strftime("%w"))) # go back to the Sunday at the start of that week
844 | starttime = datetime(a.year, a.month, a.day).timestamp()
845 | endtime = (datetime(a.year, a.month, a.day) + timedelta(days=7)).timestamp()
846 |
847 | if len(uactw) > 0:
848 | tmp = f_calc(uactw, mode, data)
849 | i_fnames = tmp[3]
850 | towrite_list.append([starttime, endtime, v, week] + (uw.loc[v, list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N'] ]).tolist() + tmp[2] + [ tmp[4]])
851 |
852 | if mode == 'session':
853 | sessions = get_sessions(uactw, first_sid)
854 | first_sid += len(sessions)
855 | for s in sessions:
856 | sinfo = sessions[s]
857 |
858 | ud = uactw.loc[sessions[s][7]]
859 | if len(ud) > 0:
860 | session_instance, i_fnames = session_instance_calc(ud, sinfo, week, mode, data, uw, v, list_uf)
861 | towrite_list.append(session_instance)
862 |
863 | ## do subsessions:
864 | if 'time' in subsession_mode: # divide a session into subsessions by consecutive time chunks
865 | for subsession_dur in subsession_mode['time']:
866 | n_subsession = int(np.ceil(session_instance[12] / subsession_dur))
867 | if n_subsession == 1:
868 | towrite_list_subsession['time'][subsession_dur].append([0] + session_instance)
869 | else:
870 | sinfo1 = sinfo.copy()
871 | for subsession_ind in range(n_subsession):
872 | sinfo1[3] = 0 if subsession_ind < n_subsession-1 else sinfo[3]
873 |
874 | subsession_ud = ud[(ud['time_stamp'] >= sessions[s][4] + timedelta(minutes = subsession_ind*subsession_dur)) & \
875 | (ud['time_stamp'] < sessions[s][4] + timedelta(minutes = (subsession_ind+1)*subsession_dur))]
876 | if len(subsession_ud) > 0:
877 | ss_instance, _ = session_instance_calc(subsession_ud, sinfo1, week, mode, data, uw, v, list_uf)
878 | towrite_list_subsession['time'][subsession_dur].append([subsession_ind] + ss_instance)
879 |
880 | if 'nact' in subsession_mode:
881 | for ss_nact in subsession_mode['nact']:
882 | n_subsession = int(np.ceil(len(ud) / ss_nact))
883 | if n_subsession == 1:
884 | towrite_list_subsession['nact'][ss_nact].append([0] + session_instance)
885 | else:
886 | sinfo1 = sinfo.copy()
887 | for ss_ind in range(n_subsession):
888 | sinfo1[3] = 0 if ss_ind < n_subsession-1 else sinfo[3]
889 |
890 | ss_ud = ud.iloc[ss_ind*ss_nact : min(len(ud), (ss_ind+1)*ss_nact)]
891 | if len(ss_ud) > 0:
892 | ss_instance,_ = session_instance_calc(ss_ud, sinfo1, week, mode, data, uw, v, list_uf)
893 | towrite_list_subsession['nact'][ss_nact].append([ss_ind] + ss_instance)
894 |
895 | if mode == 'day':
896 | days = sorted(list(set(uactw['day'])))
897 | for d in days:
898 | ud = uactw[uactw['day'] == d]
899 | isweekday = 1 if sum(ud['time']>=3) == 0 else 0
900 | isweekend = 1-isweekday
901 | a = ud.iloc[0]['time_stamp']
902 | starttime = datetime(a.year, a.month, a.day).timestamp()
903 | endtime = (datetime(a.year, a.month, a.day) + timedelta(days=1)).timestamp()
904 |
905 | if len(ud) > 0:
906 | tmp = f_calc(ud, mode, data)
907 | i_fnames = tmp[3]
908 | towrite_list.append([starttime, endtime, v, d, week, isweekday, isweekend] + (uw.loc[v, list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N'] ]).tolist() + tmp[2] + [ tmp[4]])
909 |
910 | towrite = pd.DataFrame(columns = cols2a + i_fnames + cols2b, data = towrite_list)
911 | towrite.to_pickle("tmp/"+str(week) + mode+".pickle")
912 |
913 | if mode == 'session' and len(subsession_mode) > 0:
914 | for k1 in subsession_mode:
915 | for k2 in subsession_mode[k1]:
916 | df_tmp = pd.DataFrame(columns = ['subs_ind']+cols2a + i_fnames + cols2b, data = towrite_list_subsession[k1][k2])
917 | df_tmp.to_pickle("tmp/"+str(week) + mode + k1 + str(k2) + ".pickle")
918 |
919 | if __name__ == "__main__":
920 | dname = os.getcwd().split('/')[-1]
921 | if dname not in ['r4.1','r4.2','r6.2','r6.1','r5.1','r5.2']:
922 | raise Exception('Please put this script in and run it from a CERT data folder (e.g. r4.2)')
923 |     # make working folders (tmp, DataByWeek, NumDataByWeek are removed at the end; ExtractedData keeps the output)
924 | [os.mkdir(x) for x in ["tmp", "ExtractedData", "DataByWeek", "NumDataByWeek"]]
925 |
926 |     subsession_mode = {'nact':[25, 50], 'time':[120, 240]} # sub-session settings (by action count and by minutes); use an empty dict to skip sub-session extraction
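    # Note: with these settings, session extraction additionally writes (for example, on r4.2)
    # ExtractedData/sessionnact25r4.2.csv, sessionnact50r4.2.csv, sessiontime120r4.2.csv and
    # sessiontime240r4.2.csv alongside the week/day/session CSVs (see Step 4 below).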
927 |
928 | numCores = 8
929 | arguments = len(sys.argv) - 1
930 | if arguments > 0:
931 | numCores = int(sys.argv[1])
932 |
933 |     numWeek = 73 if dname in ['r4.1','r4.2'] else 75 # only 73 weeks in the r4.1 and r4.2 datasets
934 | st = time.time()
935 |
936 | #### Step 1: Combine data from sources by week, stored in DataByWeek
937 | combine_by_timerange_pandas(dname)
938 | print(f"Step 1 - Separate data by week - done. Time (mins): {(time.time()-st)/60:.2f}")
939 | st = time.time()
940 |
941 | #### Step 2: Get user list
942 | users = get_mal_userdata(dname)
943 | print(f"Step 2 - Get user list - done. Time (mins): {(time.time()-st)/60:.2f}")
944 | st = time.time()
945 |
946 | #### Step 3: Convert each action to numerical data, stored in NumDataByWeek
947 | Parallel(n_jobs=numCores)(delayed(process_week_num)(i, users, data=dname) for i in range(numWeek))
948 | print(f"Step 3 - Convert each action to numerical data - done. Time (mins): {(time.time()-st)/60:.2f}")
949 | st = time.time()
950 |
951 | #### Step 4: Extract to csv
952 | for mode in ['week','day','session']:
953 |
954 | weekRange = list(range(0, numWeek)) if mode in ['day', 'session'] else list(range(1, numWeek))
955 | (ul, uf_dict, list_uf) = get_u_features_dicts(users, data= dname)
956 |
957 | Parallel(n_jobs=numCores)(delayed(to_csv)(i, mode, dname, ul, uf_dict, list_uf, subsession_mode)
958 | for i in weekRange)
959 |
960 | all_csv = open('ExtractedData/'+mode+dname+'.csv','a')
961 |
962 | towrite = pd.read_pickle("tmp/"+str(weekRange[0]) + mode+".pickle")
963 | towrite.to_csv(all_csv,header=True, index = False)
964 | for w in weekRange[1:]:
965 | towrite = pd.read_pickle("tmp/"+str(w) + mode+".pickle")
966 | towrite.to_csv(all_csv,header=False, index = False)
967 |
968 | if mode == 'session' and len(subsession_mode) > 0:
969 | for k1 in subsession_mode:
970 | for k2 in subsession_mode[k1]:
971 | all_csv = open('ExtractedData/'+mode+ k1 + str(k2) + dname+'.csv','a')
972 | towrite = pd.read_pickle('tmp/'+str(weekRange[0]) + mode + k1 + str(k2)+".pickle")
973 | towrite.to_csv(all_csv,header=True, index = False)
974 | for w in weekRange[1:]:
975 | towrite = pd.read_pickle('tmp/'+str(w) + mode+ k1 + str(k2)+".pickle")
976 | towrite.to_csv(all_csv,header=False, index = False)
977 |
978 | print(f'Extracted {mode} data. Time (mins): {(time.time()-st)/60:.2f}')
979 | st = time.time()
980 |
981 | [os.system(f"rm -r {x}") for x in ["tmp", "DataByWeek", "NumDataByWeek"]]
982 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | #python3.6
2 | numpy==1.19.5
3 | pandas==1.1.5
4 | joblib==1.1.1
5 | scikit-learn==0.24.2
--------------------------------------------------------------------------------
/temporal_data_representation.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | import numpy as np
4 | import pandas as pd
5 | import multiprocessing
6 | from scipy.stats import percentileofscore
7 | import argparse
8 |
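# concat_combination: for each user, concatenate the feature values of the previous
# window_size-1 instances (columns prefixed '-2_', '-1_', ...) with the current instance
# and its info columns, dropping each user's first few rows that lack a full history.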
9 | def concat_combination(data, window_size = 3, dname = 'cert'):
10 |
11 | if dname == 'cert':
12 | info_cols = ['sessionid','day','week',"starttime", "endtime",
13 | 'user', 'project', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'ITAdmin',
14 | 'O', 'C', 'E', 'A', 'N', 'insider']
15 |
16 | combining_features = [ f for f in data.columns if f not in info_cols]
17 | info_features = [f for f in data.columns if f in info_cols]
18 |
19 | data_info = data[info_features].values
20 |
21 | data_combining_features = data[combining_features].values
22 | useridx = data['user'].values
23 |
24 | userset = set(data['user'])
25 |
26 | cols = []
27 | for shiftrange in range(window_size-1,0,-1):
28 | cols += [str(-shiftrange) + '_' + f for f in combining_features]
29 | cols += combining_features + info_features
30 |
31 | combined_data = []
32 | for u in userset:
33 | data_cf_u = data_combining_features[useridx == u, ]
34 |
35 | data_cf_u_shifted = []
36 | for shiftrange in range(window_size-1,0,-1):
37 | data_cf_u_shifted.append(np.roll(data_cf_u, shiftrange, axis = 0))
38 |
39 | data_cf_u_shifted.append(data_cf_u)
40 | data_cf_u_shifted.append(data_info[useridx==u, ])
41 |
42 | combined_data.append(np.hstack(data_cf_u_shifted)[window_size:,])
43 |
44 | combined_data = pd.DataFrame(np.vstack(combined_data), columns=cols)
45 |
46 | return combined_data
47 |
48 |
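# subtract_combination_uworker: worker for a single user u. For each day/week instance it takes
# the preceding window_size days as reference and emits either the difference from that window's
# mean or median (meandiff/meddiff), or the per-feature percentile within the window shifted to be
# centered on 0 (percentile), storing the resulting rows in the shared dict alluserdict.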
49 | def subtract_combination_uworker(u, alluserdict, dtype, calc_type, window_size, udayidx, udata, uinfo, uorg):
50 | if u%200==0:
51 | print(u)
52 |
53 | data_out = []
54 |
55 | if dtype in ['day', 'week']:
56 |
57 | for i in range(len(udayidx)):
58 | t = udayidx[i]
59 |             min_idx = min(udayidx) + window_size
60 |
61 | if t>=min_idx:
62 | if calc_type == 'meandiff':
63 | prevdata = udata[(udayidx > t - 1 - window_size) & (udayidx <= t-1),]
64 | if len(prevdata) < 1: continue
65 | window_mean = np.mean(prevdata, axis = 0)
66 | data_out.append(np.concatenate((udata[i] - window_mean, uorg[i,:], uinfo[i,:])))
67 |
68 | if calc_type == 'meddiff':
69 | prevdata = udata[(udayidx > t - 1 - window_size) & (udayidx <= t-1),]
70 | if len(prevdata) < 1: continue
71 | window_med = np.median(prevdata, axis = 0)
72 | data_out.append(np.concatenate((udata[i] - window_med, uorg[i,:], uinfo[i,:])))
73 | elif calc_type == 'percentile':
74 | window = udata[(udayidx > t - 1 - window_size) & (udayidx <= t-1),]
75 | if window.shape[0] < 1: continue
76 | percentile_i = [percentileofscore(window[:,j], udata[i,j], 'mean') - 50 for j in range(window.shape[1])]
77 | data_out.append(np.concatenate((percentile_i , uorg[i,:], uinfo[i,:])))
78 |
79 | if len(data_out) > 0: alluserdict[u] = np.vstack(data_out)
80 |
81 | def subtract_percentile_combination(data, dtype, calc_type = 'percentile', window_size = 7, dname = 'cert', parallel = True):
82 |     '''
83 |     Combine data to generate different temporal representations ('percentile', 'meandiff' or 'meddiff')
84 |     window_size: window size in days (for CERT data)
85 |     '''
86 | if dname == 'cert':
87 | info_cols = ['sessionid','day','week',"starttime", "endtime",
88 | 'user', 'project', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'ITAdmin',
89 | 'O', 'C', 'E', 'A', 'N', 'insider','subs_ind']
90 | keep_org_cols = ["pc", "isworkhour", "isafterhour", "isweekday", "isweekend", "isweekendafterhour", "n_days",
91 | "duration", "n_concurrent_sessions", "start_with", "end_with", "ses_start", "ses_end"]
92 |
93 | combining_features = [ f for f in data.columns if f not in info_cols]
94 | info_features = [f for f in data.columns if f in info_cols]
95 | keep_org_features = [f for f in data.columns if f in keep_org_cols]
96 |
97 | data_info = data[info_features].values
98 | data_org = data[keep_org_features].values
99 | data_combining_features = data[combining_features].values
100 | useridx = data['user'].values
101 | if dtype in ['day']: dayidx = data['day'].values
102 | if dname == 'cert': weekidx = data['week'].values
103 |
104 | userset = set(data['user'])
105 |
106 | if dtype == 'week':
107 | window_size = np.floor(window_size/7)
108 | idx = weekidx
109 | elif dtype in ['day']: idx = dayidx
110 |
111 | if parallel:
112 | manager = multiprocessing.Manager()
113 | return_dict = manager.dict()
114 | jobs = []
115 | for u in userset:
116 | udayidx = idx[useridx==u]
117 | udata = data_combining_features[useridx==u, ]
118 | uinfo = data_info[useridx==u, ]
119 | uorg = data_org[useridx==u, ]
120 | p = multiprocessing.Process(target=subtract_combination_uworker, args=(u, return_dict, dtype, calc_type,
121 | window_size, udayidx,
122 | udata, uinfo, uorg))
123 | jobs.append(p)
124 | p.start()
125 |
126 | for proc in jobs:
127 | proc.join()
128 | else:
129 | return_dict = {}
130 | for u in userset:
131 | udayidx = idx[useridx==u]
132 | udata = data_combining_features[useridx==u, ]
133 | uinfo = data_info[useridx==u, ]
134 | uorg = data_org[useridx==u, ]
135 | subtract_combination_uworker(u, return_dict, dtype, calc_type,
136 | window_size, udayidx,
137 | udata, uinfo, uorg)
138 |
139 | combined_data = pd.DataFrame(np.vstack([return_dict[ri] for ri in return_dict.keys()]), columns=combining_features+['org_'+f for f in keep_org_features] + info_features)
140 |
141 | return combined_data
142 |
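# Minimal usage sketch (hypothetical file name; assumes a day-level feature CSV produced by
# feature_extraction.py or downloaded pre-extracted):
#
#   df = pd.read_csv('dayr5.2.csv.gz')
#   rep = subtract_percentile_combination(df, 'day', calc_type='percentile',
#                                         window_size=30, dname='cert', parallel=False)
#   rep.to_pickle('dayr5.2-percentile30.pkl')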
143 |
144 | if __name__ == "__main__":
145 | parser=argparse.ArgumentParser()
146 |     parser.add_argument('--representation', help='Data representation to extract (concat, percentile, meandiff, meddiff, or all). Default: percentile',
147 | type= str, default = 'percentile')
148 | parser.add_argument('--file_input', help='CERT input file name. Default: week-r5.2.csv.gz', type= str, default= 'week-r5.2.csv.gz')
149 | parser.add_argument('--window_size', help='Window size for percentile or mean/median difference representation. Default: 30',
150 | type = int, default=30)
151 | parser.add_argument('--num_concat', help='Number of data points for concatenation. Default: 3',
152 | type = int, default=3)
153 | args=parser.parse_args()
154 |
155 |     print('If you get a "too many open files" or "ForkAwareLocal" error, increase the open-file limit first with ulimit, e.g. "$ ulimit -n 10000"')
156 | if args.representation == 'all':
157 | reps = ['concat', 'percentile','meandiff','meddiff']
158 | elif args.representation in ['concat', 'percentile','meandiff','meddiff']:
159 | reps = [args.representation]
160 |
161 | fileName = (args.file_input).replace('.csv','').replace('.gz','')
162 | if 'day' in fileName:
163 | data_type = 'day'
164 | elif 'week' in fileName:
165 | data_type = 'week'
166 | s = pd.read_csv(f'{args.file_input}')
167 |
168 | for rep in reps:
169 | if rep in ['percentile','meandiff','meddiff']:
170 | s1 = subtract_percentile_combination(s, data_type, calc_type = rep, window_size = args.window_size, dname='cert')
171 | s1.to_pickle(f'{fileName}-{rep}{args.window_size}.pkl')
172 | else:
173 | s1 = concat_combination(s, window_size = args.num_concat, dname = 'cert')
174 | s1.to_pickle(f'{fileName}-{rep}{args.num_concat}.pkl')
175 |
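# Example invocation (assumes the gzipped weekly CSV, e.g. the pre-extracted week-r5.2.csv.gz,
# is in the working directory):
#   python3 temporal_data_representation.py --representation percentile --file_input week-r5.2.csv.gz --window_size 30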
176 |
--------------------------------------------------------------------------------