├── LICENSE ├── README.md ├── TNSM2020 │   ├── clf_helpers.py │   ├── params.pkl │   └── run_classification.py ├── example_anomaly_detection.py ├── example_classification.py ├── feature_extraction.py ├── requirements.txt └── temporal_data_representation.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 lcd-dal 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Feature extraction for CERT insider threat test dataset 2 | This is a script for extracting features (csv format) from the [CERT insider threat test dataset](https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099) [[1]](#1), [[2]](#2), versions 4.1 to 6.2. For more details, please see this paper: [Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning](https://ieeexplore.ieee.org/document/8962316). 3 | 4 | [1] 5 | Lindauer, Brian (2020): Insider Threat Test Dataset. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/12841247.v1 6 | 7 | [2] 8 | J. Glasser and B. Lindauer, "Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data," 2013 IEEE Security and Privacy Workshops, San Francisco, CA, 2013, pp. 98-104, doi: 10.1109/SPW.2013.37. 9 | 10 | ## Run feature_extraction script 11 | - Requires Python 3.8 and the packages in `requirements.txt`. The script was written and tested on Linux only. 12 | - By default, the script extracts week, day, session, and sub-session data (as in the paper). 13 | - To run the script, place it in the folder of a CERT dataset (e.g. r4.2, decompressed from r4.2.tar.bz2 downloaded [here](https://kilthub.cmu.edu/articles/dataset/Insider_Threat_Test_Dataset/12841247/1)), then run `python3 feature_extraction.py`. 14 | - To change the number of cores used for parallelization (default 8), pass it as an argument: `python3 feature_extraction.py numberOfCores`, e.g. `python3 feature_extraction.py 16`. 15 | 16 | ## Extracted Data 17 | Extracted data is stored in the `ExtractedData` subfolder. 18 | 19 | Note that in the extracted data, `insider` is the label indicating the insider threat scenario (0 is normal). Some extracted features (subs_ind, starttime, endtime, sessionid, user, day, week) are for information only and may or may not be used in training machine learning approaches.
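A minimal sketch of loading an extracted file with pandas and separating the informational columns from the numeric features (the file name `ExtractedData/weekr4.2.csv` below is only an example; the actual name depends on the dataset version and granularity level extracted):

```python
import pandas as pd

# Example file name; adjust to the dataset version/granularity you extracted.
data = pd.read_csv('ExtractedData/weekr4.2.csv')

# Informational columns (if present): kept for bookkeeping, not model inputs.
info_cols = [c for c in ['subs_ind', 'starttime', 'endtime', 'sessionid',
                         'user', 'day', 'week', 'insider'] if c in data.columns]
x_cols = [c for c in data.columns if c not in info_cols]

# `insider` is the threat scenario number; 0 means normal behaviour.
y_bin = (data['insider'] > 0).astype(int)
print(f'{len(data)} instances, {len(x_cols)} features, {y_bin.sum()} malicious instances')
```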
20 | 21 | Pre-extracted data from the CERT insider threat test dataset r5.2 (gzipped) can be found [here](https://web.cs.dal.ca/~lcd/data/CERTr5.2/). 22 | 23 | ## Data representations 24 | From the extracted data, `temporal_data_representation.py` can be used to generate different data representations, as presented in this paper: [Anomaly Detection for Insider Threats Using Unsupervised Ensembles](https://ieeexplore.ieee.org/document/9399116). 25 | 26 | `python3 temporal_data_representation.py --help` 27 | 28 | ## Sample classification and anomaly detection results 29 | Sample code is provided in: 30 | 31 | - `example_classification.py` for classification (as in [Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning](https://ieeexplore.ieee.org/document/8962316)). 32 | - `example_anomaly_detection.py` for anomaly detection (as in [Anomaly Detection for Insider Threats Using Unsupervised Ensembles](https://ieeexplore.ieee.org/document/9399116)). 33 | 34 | ## Citation 35 | If you use the source code or the extracted datasets, please cite the following paper: 36 | 37 | `D. C. Le, N. Zincir-Heywood and M. I. Heywood, "Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning," in IEEE Transactions on Network and Service Management, vol. 17, no. 1, pp. 30-44, March 2020, doi: 10.1109/TNSM.2020.2967721.` 38 | 39 | Data representations and anomaly detection: 40 | 41 | `D. C. Le and N. Zincir-Heywood, "Anomaly Detection for Insider Threats Using Unsupervised Ensembles," in IEEE Transactions on Network and Service Management, vol. 18, no. 2, pp. 1152-1164, June 2021, doi: 10.1109/TNSM.2021.3071928.` 42 | -------------------------------------------------------------------------------- /TNSM2020/clf_helpers.py: -------------------------------------------------------------------------------- 1 | from sklearn.model_selection import train_test_split 2 | from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, auc 3 | from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler 4 | import numpy as np 5 | import time 6 | import gc 7 | import pandas as pd 8 | import random 9 | from joblib import Parallel, delayed 10 | 11 | num_cores = 16 12 | 13 | 14 | def split_data(data, test_size=0.25, random_state=0, y_column='insider', 15 | shuffle=True, 16 | x_rm_cols=('user', 'day', 'week', 'starttime', 'endtime', 'sessionid', 17 | 'timeind', 'Unnamed: 0', 'insider'), 18 | dname='r4.2', normalization='StandardScaler', 19 | rm_empty_cols=True, by_user=False, by_user_time=False, 20 | by_user_time_trainper=0.5, limit_ntrain_user=0): 21 | """ 22 | Split data into train and test sets; the split can be random by instance, by user, or by user and time, with normalization built in. 23 | """ 24 | np.random.seed(random_state) 25 | random.seed(random_state) 26 | 27 | x_cols = [i for i in data.columns if i not in x_rm_cols] 28 | if rm_empty_cols: 29 | x_cols = [i for i in x_cols if len(set(data[i])) > 1] 30 | 31 | infocols = list(set(data.columns) - set(x_cols)) 32 | 33 | # output a dict 34 | out = {} 35 | 36 | # normalization 37 | if normalization == 'StandardScaler': 38 | sc = StandardScaler() 39 | elif normalization == 'MinMaxScaler': 40 | sc = MinMaxScaler() 41 | elif normalization == 'MaxAbsScaler': 42 | sc = MaxAbsScaler() 43 | else: 44 | sc = None 45 | out['sc'] = sc 46 | 47 | # split data randomly by instance 48 | if not by_user and not by_user_time: 49 | x = data[x_cols].values 50 | y_org = data[y_column].values 51 | 52 | y = y_org.copy() 53 | if 'r6' in
dname: 54 | y[y != 0] = 1 55 | 56 | x_train, x_test, y_train, y_test = train_test_split(x, y_org, test_size=test_size, shuffle=shuffle) 57 | 58 | if 'sc' in out and out['sc'] is not None: 59 | x_train = sc.fit_transform(x_train) 60 | out['sc'] = sc 61 | if test_size > 0: x_test = sc.transform(x_test) 62 | 63 | # split data by user 64 | elif by_user: 65 | test_users, train_users = [], [] 66 | for i in [j for j in list(set(data['insider'])) if j != 0]: 67 | uli = list(set(data[data['insider'] == i]['user'])) 68 | random.shuffle(uli) 69 | ind_i = int(np.ceil(test_size * len(uli))) 70 | test_users += uli[:ind_i] 71 | train_users += uli[ind_i:] 72 | 73 | normal_users = list(set(data['user']) - set(data[data['insider'] != 0]['user'])) 74 | random.shuffle(normal_users) 75 | if limit_ntrain_user > 0: 76 | normal_ind = limit_ntrain_user - len(train_users) 77 | else: 78 | normal_ind = int(np.ceil((1 - test_size) * len(normal_users))) 79 | 80 | train_users += normal_users[: normal_ind] 81 | test_users += normal_users[normal_ind:] 82 | x_train = data[data['user'].isin(train_users)][x_cols].values 83 | x_test = data[data['user'].isin(test_users)][x_cols].values 84 | y_train = data[data['user'].isin(train_users)][y_column].values 85 | y_test = data[data['user'].isin(test_users)][y_column].values 86 | 87 | out['train_info'] = data[data['user'].isin(train_users)][infocols] 88 | out['test_info'] = data[data['user'].isin(test_users)][infocols] 89 | 90 | out['train_users'] = train_users 91 | if test_size > 0 or limit_ntrain_user > 0: 92 | out['test_users'] = test_users 93 | 94 | if 'sc' in out and out['sc'] is not None: 95 | x_train = sc.fit_transform(x_train) 96 | out['sc'] = sc 97 | if test_size > 0 or (limit_ntrain_user > 0 and limit_ntrain_user < len(set(data['user']))): 98 | x_test = sc.transform(x_test) 99 | 100 | # split by user and time 101 | elif by_user_time: 102 | train_week_max = by_user_time_trainper * max(data['week']) 103 | train_insiders = set(data[(data['week'] <= train_week_max) & (data['insider'] != 0)]['user']) 104 | users_set_later_weeks = set(data[data['week'] > train_week_max]['user']) 105 | 106 | first_part = data[data['week'] <= train_week_max] 107 | second_part = data[data['week'] > train_week_max] 108 | 109 | first_part_split = split_data(first_part, random_state=random_state, test_size=0, 110 | dname=dname, normalization=normalization, 111 | by_user=True, by_user_time=False, 112 | limit_ntrain_user=limit_ntrain_user, 113 | ) 114 | 115 | x_train = first_part_split['x_train'] 116 | y_train = first_part_split['y_train'] 117 | x_cols = first_part_split['x_cols'] 118 | 119 | out['train_info'] = first_part_split['train_info'] 120 | out['other_trainweeks_users_info'] = first_part_split['test_info'] 121 | 122 | if 'sc' in first_part_split and first_part_split['sc'] is not None: 123 | out['sc'] = first_part_split['sc'] 124 | 125 | out['x_other_trainweeks_users'] = first_part_split['x_test'] 126 | out['y_other_trainweeks_users'] = first_part_split['y_test'] 127 | out['y_bin_other_trainweeks_users'] = first_part_split['y_test_bin'] 128 | out['other_trainweeks_users'] = first_part_split['test_users'] # users in first half but not in train 129 | 130 | real_train_users = set(first_part_split['train_users']) 131 | real_train_insiders = train_insiders.intersection(real_train_users) 132 | test_users = list(users_set_later_weeks - real_train_insiders) 133 | x_test = second_part[second_part['user'].isin(test_users)][x_cols].values 134 | y_test = 
second_part[second_part['user'].isin(test_users)][y_column].values 135 | out['test_info'] = second_part[second_part['user'].isin(test_users)][infocols] 136 | if ('sc' in out) and (out['sc'] is not None) and (by_user_time_trainper < 1): 137 | x_test = out['sc'].transform(x_test) 138 | 139 | out['train_users'] = first_part_split['train_users'] 140 | out['test_users'] = test_users 141 | 142 | # get binary data 143 | y_train_bin = y_train.copy() 144 | y_train_bin[y_train_bin != 0] = 1 145 | 146 | out['x_train'] = x_train 147 | out['y_train'] = y_train 148 | out['y_train_bin'] = y_train_bin 149 | out['x_cols'] = x_cols 150 | out['info_cols'] = infocols 151 | 152 | out['test_size'] = test_size 153 | 154 | if test_size > 0 or (by_user_time and by_user_time_trainper < 1) or limit_ntrain_user > 0: 155 | y_test_bin = y_test.copy() 156 | y_test_bin[y_test_bin != 0] = 1 157 | out['x_test'] = x_test 158 | out['y_test'] = y_test 159 | out['y_test_bin'] = y_test_bin 160 | 161 | return out 162 | 163 | 164 | def get_result_one_user(u, pred_all, datainfo): 165 | res_u = {} 166 | u_labels = datainfo[datainfo['user'] == u]['insider'].values 167 | utype = list(set(u_labels)) 168 | if np.any(u_labels != 0) and len(set(u_labels)) > 1: 169 | utype.remove(0.0) 170 | res_u['type'] = utype[0] 171 | u_idx = np.where(datainfo['user'] == u)[0] 172 | res_u['data_idxs'] = u_idx 173 | pred = pred_all[u_idx] 174 | if len(np.where(u_labels == 0)[0]) > 0: 175 | res_u['norm_per'] = len(np.where(pred[u_labels == 0] == 0)[0]) / len(np.where(u_labels == 0)[0]) 176 | if utype[0] != 0: 177 | res_u['mal_per'] = len(np.where(pred[u_labels != 0] != 0)[0]) / len(np.where(u_labels != 0)[0]) 178 | res_u['norm_bin'] = int(np.any(pred[u_labels == 0] != 0)) 179 | if utype[0] != 0: 180 | res_u['mal_bin'] = int(np.any(pred[u_labels != 0] != 0)) 181 | return res_u 182 | 183 | 184 | def get_result_by_users(users, user_list=None, pred_all=None, datainfo=None): 185 | out = {} 186 | cms = {} 187 | out[users] = {} 188 | # out_users = [get_result_one_user(u, pred_all, datainfo, old_res, label_all) for u in user_list] 189 | out_users = Parallel(n_jobs=num_cores)(delayed(get_result_one_user)(u, pred_all, datainfo) 190 | for u in user_list) 191 | users_true_label = [] 192 | users_pred_label = [] 193 | for i, u in enumerate(user_list): 194 | out[users][u] = out_users[i] 195 | users_true_label.append(out[users][u]['type']) 196 | if out[users][u]['type'] == 0: 197 | users_pred_label.append(out[users][u]['norm_bin']) 198 | else: 199 | users_pred_label.append(out[users][u]['mal_bin']) 200 | 201 | out[users]['true_label'] = users_true_label 202 | out[users]['pred_label'] = users_pred_label 203 | cms[users] = confusion_matrix(users_true_label, users_pred_label) 204 | return out, cms 205 | 206 | 207 | def do_classification(clf, x_train, y_train, x_test, y_test, y_org=None, by_user=False, 208 | split_output=None): 209 | ''' 210 | train classification and get results 211 | ''' 212 | st = time.time() 213 | clf.fit(x_train, y_train) 214 | train_time = time.time() - st 215 | 216 | cms_train = {} 217 | cms_test = {} 218 | 219 | st = time.time() 220 | y_train_hat = clf.predict(x_train) 221 | y_train_proba = clf.predict_proba(x_train) 222 | pred_time = time.time() - st 223 | cms_train['bin'] = confusion_matrix(y_train, y_train_hat) 224 | 225 | st = time.time() 226 | y_test_hat = clf.predict(x_test) 227 | y_test_proba = clf.predict_proba(x_test) 228 | test_pred_time = time.time() - st 229 | cms_test['bin'] = confusion_matrix(y_test, y_test_hat) 230 | 231 | 
test_org_labels = y_test 232 | train_org_labels = y_train 233 | if y_org is not None: 234 | cms_train['org'] = confusion_matrix(y_org['train'], y_train_hat) 235 | train_org_labels = y_org['train'] 236 | cms_test['org'] = confusion_matrix(y_org['test'], y_test_hat) 237 | test_org_labels = y_org['test'] 238 | 239 | userres_train = {} 240 | userres_test = {} 241 | if by_user: 242 | uout, ucm = get_result_by_users('train_users', split_output['train_users'], pred_all=y_train_hat, 243 | datainfo=split_output['train_info']) 244 | 245 | userres_train.update(uout) 246 | cms_train.update(ucm) 247 | uout, ucm = get_result_by_users('test_users', split_output['test_users'], pred_all=y_test_hat, 248 | datainfo=split_output['test_info']) 249 | userres_test.update(uout) 250 | cms_test.update(ucm) 251 | 252 | 253 | return clf, {'by_user': userres_train, 'cms': cms_train, 'train_time': train_time, 'pred_time': pred_time, 254 | 'org_labels': train_org_labels, 255 | 'pred_bin': y_train_hat, 'pred_proba': y_train_proba}, \ 256 | {'by_user': userres_test, 'cms': cms_test, 'pred_time': test_pred_time, 'org_labels': test_org_labels, 257 | 'pred_bin': y_test_hat, 258 | 'pred_proba': y_test_proba} 259 | 260 | 261 | def user_auc_roc(ures, users): 262 | nnu = len(users) - sum(ures[:, 0]) 263 | nmu = sum(ures[:, 0]) 264 | ufpr = np.sum(ures[np.where(ures[:, 0] == 0)[0], 1:], axis=0) / nnu 265 | utpr = np.sum(ures[np.where(ures[:, 0] == 1)[0], 1:], axis=0) / nmu 266 | uauc = auc(ufpr, utpr) 267 | return utpr, ufpr, uauc 268 | 269 | 270 | def user_auc_roc2(u_idxs, y_bin, y_predprob, thresholds): 271 | ures = np.zeros((1, len(thresholds))) 272 | u_y = y_bin[u_idxs] 273 | u_yprob = y_predprob[u_idxs] 274 | for ii in range(len(thresholds[1:])): 275 | ures[0, ii + 1] = int(np.any(u_yprob >= thresholds[ii + 1])) 276 | if np.any(u_y != 0): 277 | ures[0, 0] = 1 278 | return ures 279 | 280 | 281 | def roc_auc_calc(rw, algs=('RF', 'XGB'), nrun=20, dtype=None, data=None, res_names=['test_in']): 282 | 283 | allres = [] 284 | fpri = np.linspace(0, 1, 1000) # interpolation 285 | 286 | for i in range(nrun): 287 | for alg in algs: 288 | gc.collect() 289 | if 'all' in res_names: 290 | res_names = [td for td in rw[i][alg].keys() if td not in ['clf'] and 'thres' not in td] 291 | 292 | for resname in res_names: 293 | resname2 = 'train_users' if resname == 'train' else 'test_users' 294 | list_users = sorted([u for u in rw[i][alg][resname]['by_user'][resname2].keys() if type(u) != str]) 295 | y_predprob = rw[i][alg][resname]['pred_proba'][:, 1] 296 | y_org = rw[i][alg][resname]['org_labels'] 297 | y_bin = np.array(y_org > 0).astype(int) 298 | fpr, tpr, thresholds = roc_curve(y_bin, y_predprob) 299 | tpri = np.interp(fpri, fpr, tpr) 300 | 301 | if len(set(y_bin)) > 1: 302 | aucsc = roc_auc_score(y_bin, y_predprob) 303 | ures = Parallel(n_jobs=num_cores)( 304 | delayed(user_auc_roc2)(rw[i][alg][resname]['by_user'][resname2][u]['data_idxs'], y_bin, 305 | y_predprob, thresholds) for u in list_users) 306 | 307 | utpr, ufpr, uauc = user_auc_roc(np.vstack(ures), list_users) 308 | utpri = np.interp(fpri, ufpr, utpr) 309 | else: 310 | aucsc, ufpr, utpr, uauc, tpri, utpri = None, None, None, None, None, None 311 | 312 | allres.append([i, alg, resname, fpr, tpr, thresholds, aucsc, ufpr, utpr, uauc, fpri, tpri, utpri]) 313 | 314 | res = pd.DataFrame( 315 | columns=['run', 'alg', 'test_on', 'fpr', 'tpr', 'threshold', 'auc', 'ufpr', 'utpr', 'uauc', 'fpri', 'tpri', 316 | 'utpri'], data=allres) 317 | if dtype is not None: res['dtype'] = dtype 318 | 
res['data'] = data 319 | return res 320 | 321 | 322 | def get_cert_roc(r, a, dtype, test_on='test_in', user=True): 323 | fprs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['fpri'].values 324 | if user: 325 | tprs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['utpri'].values 326 | aucs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['uauc'].values 327 | else: 328 | tprs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['tpri'].values 329 | aucs = r[(r['test_on'] == test_on) & (r['alg'] == a) & (r['dtype'] == dtype)]['auc'].values 330 | 331 | mean_fpr = np.concatenate(([0], np.mean(fprs, axis=0))) 332 | mean_tpr = np.concatenate(([0], np.mean(tprs, axis=0))) 333 | 334 | std_auc = np.std(aucs) 335 | mean_auc = auc(mean_fpr, mean_tpr) 336 | 337 | std_tpr = np.concatenate(([0], np.std(tprs, axis=0))) 338 | tprs_upper = np.minimum(mean_tpr + std_tpr, 1) 339 | tprs_lower = np.maximum(mean_tpr - std_tpr, 0) 340 | 341 | return mean_fpr, mean_tpr, tprs_upper, tprs_lower, mean_auc, std_auc -------------------------------------------------------------------------------- /TNSM2020/params.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lcd-dal/feature-extraction-for-CERT-insider-threat-test-datasets/3441da78ea8ff1bd13ecb27c3a8157d04d0a84fe/TNSM2020/params.pkl -------------------------------------------------------------------------------- /TNSM2020/run_classification.py: -------------------------------------------------------------------------------- 1 | from copy import deepcopy 2 | import pickle 3 | import gc 4 | import pandas as pd 5 | import time 6 | import clf_helpers 7 | import matplotlib.pyplot as plt 8 | 9 | from sklearn.ensemble import RandomForestClassifier 10 | from sklearn.neural_network import MLPClassifier 11 | from sklearn.linear_model import LogisticRegression 12 | from xgboost import XGBClassifier 13 | import warnings, sklearn 14 | 15 | warnings.filterwarnings("ignore", category=sklearn.exceptions.DataConversionWarning) 16 | n_cores = 16 17 | 18 | 19 | def run_exp_onealg(run, slw, x, clf_name, classifiers, res_by_user, st): 20 | print(clf_name) 21 | clf_copy = deepcopy(classifiers[clf_name]) 22 | if hasattr(clf_copy, 'random_state'): 23 | clf_copy.set_params(**{'random_state': run}) 24 | 25 | clf_res = clf_helpers.do_classification(clf_copy, x['train'], slw['y_train_bin'], 26 | x['test'], slw['y_test_bin'], 27 | y_org={'train': slw['y_train'], 'test': slw['y_test']}, 28 | by_user=res_by_user, split_output=slw) 29 | res = {'train': clf_res[1], 'test_in': clf_res[2]} 30 | print('Training time: ', res['train']['train_time']) 31 | print('Train confusion matrices: ', res['train']['cms']) 32 | print('Test confusion matrices: ', res['test_in']['cms']) 33 | gc.collect() 34 | print(run, 'Done training & res,', (time.time() - st) // 1) 35 | return res 36 | 37 | 38 | def run_exp_onerun(run, classifiers, data_in, test_size, res_by_user, shuffle, 39 | normalization, by_user_time=False, by_user_time_trainper=0.5, limit_ntrain_user=0): 40 | st = time.time() 41 | slw = clf_helpers.split_data(data_in['data'], test_size=test_size, shuffle=shuffle, random_state=run, 42 | normalization=normalization, dname=data_in['name'], 43 | by_user_time=by_user_time, by_user_time_trainper=by_user_time_trainper, 44 | limit_ntrain_user=limit_ntrain_user) 45 | print('\n', run, 'Done splitting,', (time.time() - st) // 1) 46 | 47 | res_clf = {'scaler': 
slw['sc']} 48 | x_cols = list(slw['x_cols']) 49 | x = {'train': slw['x_train'], 'test': slw['x_test']} 50 | 51 | for clf_name in classifiers: 52 | res = run_exp_onealg(run, slw, x, clf_name, classifiers, res_by_user, st) 53 | res_clf[clf_name] = res 54 | 55 | return res_clf, x_cols 56 | 57 | 58 | def run_experiment(n_run, classifiers, data_in, test_size=0.5, res_by_user=True, shuffle=True, 59 | normalization="StandardScaler", by_user_time=True, by_user_time_trainper=0.5, 60 | limit_ntrain_user=0): 61 | 62 | all_res = {'exp_setting': {'y_col': 'insider', 'train_dname': data_in['name'], 63 | 'shuffle': shuffle, 'norm': normalization, 'res_by_user': res_by_user}} 64 | all_res['exp_setting']['classifiers'] = list(classifiers.keys()) 65 | all_res['exp_setting']['n_run'] = n_run 66 | all_res['exp_setting']['in_test_size'] = test_size 67 | all_res['exp_setting']['by_user_time_trainper'] = by_user_time_trainper 68 | all_res['exp_setting']['limit_ntrain_user'] = limit_ntrain_user 69 | 70 | for run in range(n_run): 71 | res_clf, x_cols = run_exp_onerun(run, classifiers, data_in, test_size, 72 | res_by_user=res_by_user, by_user_time=by_user_time, 73 | by_user_time_trainper=by_user_time_trainper, 74 | limit_ntrain_user=limit_ntrain_user, 75 | shuffle=shuffle, 76 | normalization=normalization, 77 | ) 78 | gc.collect() 79 | all_res[run] = res_clf 80 | return all_res 81 | 82 | 83 | def load_data(dname, dtype, datafolder='data'): 84 | name = datafolder + '/' + dtype + dname + '.csv.gz' 85 | return pd.read_csv(name) 86 | 87 | 88 | def run_exp(nrun, dname, dtype, mode, ttype=None, limit_ntrain_user=None, train_week_per=None, test_per=0.5, algs=None, 89 | load_params=True, scaler='StandardScaler', savefolder='res'): 90 | 91 | print('\n----------------\n%s %s' % (dname, dtype), '\n----------------\n') 92 | 93 | clfs = {'LR': LogisticRegression(solver='lbfgs', n_jobs=n_cores), 94 | 'MLP': MLPClassifier(solver='adam'), 95 | 'RF': RandomForestClassifier(n_jobs=n_cores), 96 | 'XGB': XGBClassifier(n_jobs=n_cores), 97 | } 98 | 99 | if algs is not None: 100 | clfs = {k:clfs[k] for k in algs} 101 | 102 | if load_params: 103 | with open('params.pkl', 'rb') as f: 104 | loaded_params = pickle.load(f) 105 | for c in clfs: 106 | if c != 'LR': 107 | clfs[c].set_params(**loaded_params[c][dtype]) 108 | 109 | data_in = {'name': dname, 'data': load_data(dname, dtype)} 110 | 111 | if mode == 'by_user_time': 112 | res = run_experiment(nrun, clfs, data_in, by_user_time=True, 113 | by_user_time_trainper=train_week_per, 114 | limit_ntrain_user=limit_ntrain_user, 115 | res_by_user=True, normalization=scaler) 116 | elif mode == 'randomsplit': 117 | res = run_experiment(nrun, clfs, data_in, by_user_time=False, 118 | test_size=test_per, 119 | res_by_user=False, normalization=scaler) 120 | 121 | savefile = '%s/%s-%s-%s-%s-%s' % (savefolder, dname, dtype, ttype, mode, '_'.join(algs)) + '.pickle' 122 | with open(savefile, 'wb') as handle: 123 | pickle.dump(res, handle, protocol=4) 124 | return res 125 | 126 | 127 | if __name__ == "__main__": 128 | algs = ['RF'] 129 | nrun = 2 130 | 131 | dname = 'r5.2' 132 | dtypes = ['week'] 133 | mode = 'by_user_time' 134 | 135 | for dtype in dtypes: 136 | res = run_exp(nrun, dname, dtype, algs=algs, mode=mode, limit_ntrain_user=400, train_week_per=0.5, 137 | load_params=True) 138 | if mode == 'randomsplit': continue 139 | 140 | res = clf_helpers.roc_auc_calc(res, algs=algs, nrun=nrun, dtype=dtype, data=dname) 141 | 142 | colors = ['r', 'g', 'blue', 'orange'] 143 | for user in [True, False]: 144 | 
plt.figure() 145 | restype = 'user' if user else 'org' 146 | for i, alg in enumerate(algs): 147 | tmp = clf_helpers.get_cert_roc(res, alg, dtype, 'test_in', user=user) 148 | plt.plot(tmp[0], tmp[1], label=f'{alg}, AUC = {tmp[4]:.3f}', color=colors[i]) 149 | plt.fill_between(tmp[0], tmp[3], tmp[2], color=colors[i], alpha=.1, label=None) 150 | plt.legend() 151 | plt.savefig(f'ROC_{dtype}_{restype}.jpg') 152 | -------------------------------------------------------------------------------- /example_anomaly_detection.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | import pandas as pd 5 | import numpy as np 6 | # from sklearn.ensemble import IsolationForest 7 | from sklearn.neural_network import MLPRegressor 8 | from sklearn.metrics.pairwise import paired_distances 9 | from sklearn.preprocessing import StandardScaler 10 | from sklearn.metrics import roc_auc_score 11 | 12 | 13 | print('This script runs a sample anomaly detection (using simple autoencoder) ' 14 | 'on CERT dataset. By default it takes CERT r5.2 data extracted with percentile ' 15 | 'representation, generated using temporal_data_representation script. ' 16 | 'It then trains on data of 200 random users in first half of the dataset, ' 17 | 'and output AUC score and detection rate at different budgets (instance-based)') 18 | 19 | print('For more details, see this paper: Anomaly Detection for Insider Threats Using' 20 | ' Unsupervised Ensembles. Le, D. C.; and Zincir-Heywood, A. N. IEEE Transactions' 21 | ' on Network and Service Management, 18(2): 1152–1164. June 2021.') 22 | 23 | data = pd.read_pickle('week-r5.2-percentile30.pkl') 24 | removed_cols = ['user','day','week','starttime','endtime','sessionid','insider'] 25 | x_cols = [i for i in data.columns if i not in removed_cols] 26 | 27 | run = 1 28 | np.random.seed(run) 29 | 30 | data1stHalf = data[data.week <= max(data.week)/2] 31 | dataTest = data[data.week > max(data.week)/2] 32 | 33 | nUsers = np.random.permutation(list(set(data1stHalf.user))) 34 | trainUsers = nUsers[:200] 35 | 36 | 37 | xTrain = data1stHalf[data1stHalf.user.isin(trainUsers)][x_cols].values 38 | yTrain = data1stHalf[data1stHalf.user.isin(trainUsers)]['insider'].values 39 | yTrainBin = yTrain > 0 40 | 41 | xTest = data[x_cols].values 42 | yTest = data['insider'].values 43 | yTestBin = yTest > 0 44 | 45 | scaler = StandardScaler() 46 | xTrain = scaler.fit_transform(xTrain) 47 | xTest = scaler.transform(xTest) 48 | 49 | ae = MLPRegressor(hidden_layer_sizes=(int(data.shape[1]/4), int(data.shape[1]/8), 50 | int(data.shape[1]/4)), max_iter=25, random_state=10) 51 | 52 | ae.fit(xTrain, xTrain) 53 | 54 | reconstructionError = paired_distances(xTest, ae.predict(xTest)) 55 | 56 | print('AUC score: ', roc_auc_score(yTestBin, reconstructionError)) 57 | 58 | print('Detection rate at different budgets:') 59 | for ib in [0.001, 0.01, 0.05, 0.1, 0.2]: 60 | threshold = np.percentile(reconstructionError, 100-100*ib) 61 | flagged = np.where(reconstructionError>threshold)[0] 62 | dr = sum(yTestBin[flagged]>0)/sum(yTestBin>0) 63 | print(f'{100*ib}%, DR = {100*dr:.2f}%') -------------------------------------------------------------------------------- /example_classification.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | import pandas as pd 5 | import numpy as np 6 | from sklearn.ensemble import RandomForestClassifier 7 | from sklearn.metrics import 
recall_score, classification_report, f1_score, accuracy_score 8 | 9 | print('This script trains a sample classifier (using simple RandomForestClassifier) ' 10 | 'on CERT dataset. By default it takes CERT r5.2 extracted day data ' 11 | 'downloaded from https://web.cs.dal.ca/~lcd/data/CERTr5.2/' 12 | ', train on data of 400 users in first half of the dataset, ' 13 | 'then output classification report (instance-based)') 14 | 15 | print('For more details, see this paper: Analyzing Data Granularity Levels for' 16 | ' Insider Threat Detection using Machine Learning. Le, D. C.; Zincir-Heywood, ' 17 | 'A. N.; and Heywood, M. I. IEEE Transactions on Network and Service Management,' 18 | ' 17(1): 30–44. March 2020.') 19 | 20 | data = pd.read_csv('day-r5.2.csv.gz') 21 | removed_cols = ['user','day','week','starttime','endtime','sessionid','insider'] 22 | x_cols = [i for i in data.columns if i not in removed_cols] 23 | 24 | run = 1 25 | np.random.seed(run) 26 | 27 | data1stHalf = data[data.week <= max(data.week)/2] 28 | dataTest = data[data.week > max(data.week)/2] 29 | 30 | selectedTrainUsers = set(data1stHalf[data1stHalf.insider > 0]['user']) 31 | nUsers = np.random.permutation(list(set(data1stHalf.user) - selectedTrainUsers)) 32 | trainUsers = np.concatenate((list(selectedTrainUsers), nUsers[:400-len(selectedTrainUsers)])) 33 | 34 | unKnownTestUsers = list(set(dataTest.user) - selectedTrainUsers) 35 | 36 | xTrain = data1stHalf[data1stHalf.user.isin(trainUsers)][x_cols].values 37 | yTrain = data1stHalf[data1stHalf.user.isin(trainUsers)]['insider'].values 38 | yTrainBin = yTrain > 0 39 | 40 | xTest = dataTest[dataTest.user.isin(unKnownTestUsers)][x_cols].values 41 | yTest = dataTest[dataTest.user.isin(unKnownTestUsers)]['insider'].values 42 | yTestBin = yTest > 0 43 | 44 | rf = RandomForestClassifier(n_jobs=-1) 45 | 46 | rf.fit(xTrain, yTrainBin) 47 | 48 | print(classification_report(yTestBin, rf.predict(xTest))) 49 | 50 | -------------------------------------------------------------------------------- /feature_extraction.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | @author: lcd 5 | """ 6 | import os, sys 7 | import pandas as pd 8 | import numpy as np 9 | from datetime import datetime, timedelta 10 | import re 11 | import time 12 | import subprocess 13 | from joblib import Parallel, delayed 14 | 15 | def time_convert(inp, mode, real_sd = '2010-01-02', sd_monday= "2009-12-28"): 16 | if mode == 'e2t': 17 | return datetime.fromtimestamp(inp).strftime('%m/%d/%Y %H:%M:%S') 18 | elif mode == 't2e': 19 | return datetime.strptime(inp, '%m/%d/%Y %H:%M:%S').strftime('%s') 20 | elif mode == 't2dt': 21 | return datetime.strptime(inp, '%m/%d/%Y %H:%M:%S') 22 | elif mode == 't2date': 23 | return datetime.strptime(inp, '%m/%d/%Y %H:%M:%S').strftime("%Y-%m-%d") 24 | elif mode == 'dt2t': 25 | return inp.strftime('%m/%d/%Y %H:%M:%S') 26 | elif mode == 'dt2W': 27 | return int(inp.strftime('%W')) 28 | elif mode == 'dt2d': 29 | return inp.strftime('%m/%d/%Y %H:%M:%S') 30 | elif mode == 'dt2date': 31 | return inp.strftime("%Y-%m-%d") 32 | elif mode =='dt2dn': #datetime to day number 33 | startdate = datetime.strptime(sd_monday,'%Y-%m-%d') 34 | return (inp - startdate).days 35 | elif mode =='dn2epoch': #datenum to epoch 36 | dt = datetime.strptime(sd_monday,'%Y-%m-%d') + timedelta(days=inp) 37 | return int(dt.timestamp()) 38 | elif mode =='dt2wn': #datetime to week number 39 | startdate = 
datetime.strptime(real_sd,'%Y-%m-%d') 40 | return (inp - startdate).days//7 41 | elif mode =='t2wn': #datetime to week number 42 | startdate = datetime.strptime(real_sd,'%Y-%m-%d') 43 | return (datetime.strptime(inp, '%m/%d/%Y %H:%M:%S') - startdate).days//7 44 | elif mode == 'dt2wd': 45 | return int(inp.strftime("%w")) 46 | elif mode == 'm2dt': 47 | return datetime.strptime(inp, "%Y-%m") 48 | elif mode == 'datetoweekday': 49 | return int(datetime.strptime(inp,"%Y-%m-%d").strftime('%w')) 50 | elif mode == 'datetoweeknum': 51 | w0 = datetime.strptime(sd_monday,"%Y-%m-%d") 52 | return int((datetime.strptime(inp,"%Y-%m-%d") - w0).days / 7) 53 | elif mode == 'weeknumtodate': 54 | startday = datetime.strptime(sd_monday,"%Y-%m-%d") 55 | return startday+timedelta(weeks = inp) 56 | 57 | def add_action_thisweek(act, columns, lines, act_handles, week_index, stop, firstdate, dname = 'r5.2'): 58 | thisweek_act = [] 59 | while True: 60 | if not lines[act]: 61 | stop[act] = 1 62 | break 63 | if dname in ['r6.1','r6.2'] and act in ['email', 'file','http'] and '"' in lines[act]: 64 | tmp = lines[act] 65 | firstpart = tmp[:tmp.find('"')-1] 66 | content = tmp[tmp.find('"')+1:-1] 67 | tmp = firstpart.split(',') + [content] 68 | else: 69 | tmp = lines[act].split(',') 70 | if time_convert(tmp[1], 't2wn', real_sd= firstdate) == week_index: 71 | thisweek_act.append(tmp) 72 | else: 73 | break 74 | lines[act] = act_handles[act].readline() 75 | df = pd.DataFrame(thisweek_act, columns=columns) 76 | df['type']= act 77 | df.index = df['id'] 78 | df.drop('id',1, inplace = True) 79 | return df 80 | 81 | def combine_by_timerange_pandas(dname = 'r4.2'): 82 | allacts = ['device','email','file', 'http','logon'] 83 | firstline = str(subprocess.check_output(['head', '-2', 'http.csv'])).split('\\n')[1] 84 | firstdate = time_convert(firstline.split(',')[1],'t2dt') 85 | firstdate = firstdate - timedelta(int(firstdate.strftime("%w"))) 86 | firstdate = time_convert(firstdate, 'dt2date') 87 | week_index = 0 88 | act_handles = {} 89 | lines = {} 90 | stop = {} 91 | for act in allacts: 92 | act_handles[act] = open(act+'.csv','r') 93 | next(act_handles[act],None) #skip header row 94 | lines[act] = act_handles[act].readline() 95 | stop[act] = 0 # store stop value indicating if all of the file has been read 96 | while sum(stop.values()) < 5: 97 | thisweekdf = pd.DataFrame() 98 | for act in allacts: 99 | if 'email' == act: 100 | if dname in ['r4.1','r4.2']: 101 | columns = ['id', 'date', 'user', 'pc', 'to', 'cc', 'bcc', 'from', 'size', '#att', 'content'] 102 | if dname in ['r6.1','r6.2','r5.2','r5.1']: 103 | columns = ['id', 'date', 'user', 'pc', 'to', 'cc', 'bcc', 'from', 'activity', 'size', 'att', 'content'] 104 | elif 'logon' == act: 105 | columns = ['id', 'date', 'user', 'pc', 'activity'] 106 | elif 'device' == act: 107 | if dname in ['r4.1','r4.2']: 108 | columns = ['id', 'date', 'user', 'pc', 'activity'] 109 | if dname in ['r5.1','r5.2','r6.2','r6.1']: 110 | columns = ['id', 'date', 'user', 'pc', 'content', 'activity'] 111 | elif 'http' == act: 112 | if dname in ['r6.1','r6.2']: columns = ['id', 'date', 'user', 'pc', 'url/fname', 'activity', 'content'] 113 | if dname in ['r5.1','r5.2','r4.2','r4.1']: columns = ['id', 'date', 'user', 'pc', 'url/fname', 'content'] 114 | elif 'file' == act: 115 | if dname in ['r4.1','r4.2']: columns = ['id', 'date', 'user', 'pc', 'url/fname', 'content'] 116 | if dname in ['r5.2','r5.1','r6.2','r6.1']: columns = ['id', 'date', 'user', 'pc', 'url/fname','activity','to','from','content'] 117 | 118 | df = 
add_action_thisweek(act, columns, lines, act_handles, week_index, stop, firstdate, dname=dname) 119 | thisweekdf = thisweekdf.append(df, sort=False) 120 | 121 | thisweekdf['date'] = thisweekdf['date'].apply(lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M:%S")) 122 | thisweekdf.to_pickle("DataByWeek/"+str(week_index)+".pickle") 123 | week_index += 1 124 | 125 | ############################################################################## 126 | 127 | def process_user_pc(upd, roles): #figure out which PC belongs to which user 128 | upd['sharedpc'] = None 129 | upd['npc'] = upd['pcs'].apply(lambda x: len(x)) 130 | upd.at[upd['npc']==1,'pc'] = upd[upd['npc']==1]['pcs'].apply(lambda x: x[0]) 131 | multiuser_pcs = np.concatenate(upd[upd['npc']>1]['pcs'].values).tolist() 132 | set_multiuser_pc = list(set(multiuser_pcs)) 133 | count = {} 134 | for pc in set_multiuser_pc: 135 | count[pc] = multiuser_pcs.count(pc) 136 | for u in upd[upd['npc']>1].index: 137 | sharedpc = upd.loc[u]['pcs'] 138 | count_u_pc = [count[pc] for pc in upd.loc[u]['pcs']] 139 | the_pc = count_u_pc.index(min(count_u_pc)) 140 | upd.at[u,'pc'] = sharedpc[the_pc] 141 | if roles.loc[u] != 'ITAdmin': 142 | sharedpc.remove(sharedpc[the_pc]) 143 | upd.at[u,'sharedpc']= sharedpc 144 | return upd 145 | 146 | def getuserlist(dname = 'r4.2', psycho = True): 147 | allfiles = ['LDAP/'+f1 for f1 in os.listdir('LDAP') if os.path.isfile('LDAP/'+f1)] 148 | alluser = {} 149 | alreadyFired = [] 150 | 151 | for file in allfiles: 152 | af = (pd.read_csv(file,delimiter=',')).values 153 | employeesThisMonth = [] 154 | for i in range(len(af)): 155 | employeesThisMonth.append(af[i][1]) 156 | if af[i][1] not in alluser: 157 | alluser[af[i][1]] = af[i][0:1].tolist() + af[i][2:].tolist() + [file.split('.')[0] , np.nan] 158 | 159 | firedEmployees = list(set(alluser.keys()) - set(alreadyFired) - set(employeesThisMonth)) 160 | alreadyFired = alreadyFired + firedEmployees 161 | for e in firedEmployees: 162 | alluser[e][-1] = file.split('.')[0] 163 | 164 | if psycho and os.path.isfile("psychometric.csv"): 165 | 166 | p_score = pd.read_csv("psychometric.csv",delimiter = ',').values 167 | for id in range(len(p_score)): 168 | alluser[p_score[id,1]] = alluser[p_score[id,1]]+ list(p_score[id,2:]) 169 | df = pd.DataFrame.from_dict(alluser, orient='index') 170 | if dname in ['r4.1','r4.2']: 171 | df.columns = ['uname', 'email', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'sup','wstart', 'wend', 'O', 'C', 'E', 'A', 'N'] 172 | elif dname in ['r5.2','r5.1','r6.2','r6.1']: 173 | df.columns = ['uname', 'email', 'role', 'project', 'b_unit', 'f_unit', 'dept', 'team', 'sup','wstart', 'wend', 'O', 'C', 'E', 'A', 'N'] 174 | else: 175 | df = pd.DataFrame.from_dict(alluser, orient='index') 176 | if dname in ['r4.1','r4.2']: 177 | df.columns = ['uname', 'email', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'sup', 'wstart', 'wend'] 178 | elif dname in ['r5.2','r5.1','r6.2','r6.1']: 179 | df.columns = ['uname', 'email', 'role', 'project', 'b_unit', 'f_unit', 'dept', 'team', 'sup', 'wstart', 'wend'] 180 | 181 | df['pc'] = None 182 | for i in df.index: 183 | if type(df.loc[i]['sup']) == str: 184 | sup = df[df['uname'] == df.loc[i]['sup']].index[0] 185 | else: 186 | sup = None 187 | df.at[i,'sup'] = sup 188 | 189 | #read first 2 weeks to determine each user's PC 190 | w1 = pd.read_pickle("DataByWeek/1.pickle") 191 | w2 = pd.read_pickle("DataByWeek/2.pickle") 192 | user_pc_dict = pd.DataFrame(index=df.index) 193 | user_pc_dict['pcs'] = None 194 | 195 | for u in df.index: 196 | pc = 
list(set(w1[w1['user']==u]['pc']) & set(w2[w2['user']==u]['pc'])) 197 | user_pc_dict.at[u,'pcs'] = pc 198 | upd = process_user_pc(user_pc_dict, df['role']) 199 | df['pc'] = upd['pc'] 200 | df['sharedpc'] = upd['sharedpc'] 201 | return df 202 | 203 | 204 | def get_mal_userdata(data = 'r4.2', usersdf = None): 205 | 206 | if not os.path.isdir('answers'): 207 | os.system('wget https://kilthub.cmu.edu/ndownloader/files/24857828 -O answers.tar.bz2') 208 | os.system('tar -xjvf answers.tar.bz2') 209 | 210 | listmaluser = pd.read_csv("answers/insiders.csv") 211 | listmaluser['dataset'] = listmaluser['dataset'].apply(lambda x: str(x)) 212 | listmaluser = listmaluser[listmaluser['dataset']==data.replace("r","")] 213 | #for r6.2, new time in scenario 4 answer is incomplete. 214 | if data == 'r6.2': listmaluser.at[listmaluser['scenario']==4,'start'] = '02'+listmaluser[listmaluser['scenario']==4]['start'] 215 | listmaluser[['start','end']] = listmaluser[['start','end']].applymap(lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M:%S")) 216 | 217 | if type(usersdf) != pd.core.frame.DataFrame: 218 | usersdf = getuserlist(data) 219 | usersdf['malscene']=0 220 | usersdf['mstart'] = None 221 | usersdf['mend'] = None 222 | usersdf['malacts'] = None 223 | 224 | for i in listmaluser.index: 225 | usersdf.loc[listmaluser['user'][i], 'mstart'] = listmaluser['start'][i] 226 | usersdf.loc[listmaluser['user'][i], 'mend'] = listmaluser['end'][i] 227 | usersdf.loc[listmaluser['user'][i], 'malscene'] = listmaluser['scenario'][i] 228 | 229 | if data in ['r4.2', 'r5.2']: 230 | malacts = open(f"answers/r{listmaluser['dataset'][i]}-{listmaluser['scenario'][i]}/"+ 231 | listmaluser['details'][i],'r').read().strip().split("\n") 232 | else: #only 1 malicious user, no folder 233 | malacts = open("answers/"+ listmaluser['details'][i],'r').read().strip().split("\n") 234 | 235 | malacts = [x.split(',') for x in malacts] 236 | 237 | mal_users = np.array([x[3].strip('"') for x in malacts]) 238 | mal_act_ids = np.array([x[1].strip('"') for x in malacts]) 239 | 240 | usersdf.at[listmaluser['user'][i], 'malacts'] = mal_act_ids[mal_users==listmaluser['user'][i]] 241 | 242 | return usersdf 243 | 244 | ############################################################################## 245 | 246 | def is_after_whour(dt): #Workhours assumed 7:30-17:30 247 | wday_start = datetime.strptime("7:30", "%H:%M").time() 248 | wday_end = datetime.strptime("17:30", "%H:%M").time() 249 | dt = dt.time() 250 | if dt < wday_start or dt > wday_end: 251 | return True 252 | return False 253 | 254 | def is_weekend(dt): 255 | if dt.strftime("%w") in ['0', '6']: 256 | return True 257 | return False 258 | 259 | def email_process(act, data = 'r4.2', separate_send_receive = True): 260 | receivers = act['to'].split(';') 261 | if type(act['cc']) == str: 262 | receivers = receivers + act['cc'].split(";") 263 | if type(act['bcc']) == str: 264 | bccreceivers = act['bcc'].split(";") 265 | else: 266 | bccreceivers = [] 267 | exemail = False 268 | n_exdes = 0 269 | for i in receivers + bccreceivers: 270 | if 'dtaa.com' not in i: 271 | exemail = True 272 | n_exdes += 1 273 | 274 | n_des = len(receivers) + len(bccreceivers) 275 | Xemail = 1 if exemail else 0 276 | n_bccdes = len(bccreceivers) 277 | exbccmail = 0 278 | email_text_len = len(act['content']) 279 | email_text_nwords = act['content'].count(' ') + 1 280 | for i in bccreceivers: 281 | if 'dtaa.com' not in i: 282 | exbccmail = 1 283 | break 284 | 285 | if data in ['r5.1','r5.2','r6.1','r6.2']: 286 | send_mail = 1 if 
act['activity'] == 'Send' else 0 287 | receive_mail = 1 if act['activity'] in ['Receive','View'] else 0 288 | 289 | atts = act['att'].split(';') 290 | n_atts = len(atts) 291 | size_atts = 0 292 | att_types = [0,0,0,0,0,0] 293 | att_sizes = [0,0,0,0,0,0] 294 | for att in atts: 295 | if '.' in att: 296 | tmp = file_process(att, filetype='att') 297 | att_types = [sum(x) for x in zip(att_types,tmp[0])] 298 | att_sizes = [sum(x) for x in zip(att_sizes,tmp[1])] 299 | size_atts +=sum(tmp[1]) 300 | return [send_mail, receive_mail, n_des, n_atts, Xemail, n_exdes, 301 | n_bccdes, exbccmail, int(act['size']), email_text_len, 302 | email_text_nwords] + att_types + att_sizes 303 | elif data in ['r4.1','r4.2']: 304 | return [n_des, int(act['#att']), Xemail, n_exdes, n_bccdes, exbccmail, 305 | int(act['size']), email_text_len, email_text_nwords] 306 | 307 | def http_process(act, data = 'r4.2'): 308 | # basic features: 309 | url_len = len(act['url/fname']) 310 | url_depth = act['url/fname'].count('/')-2 311 | content_len = len(act['content']) 312 | content_nwords = act['content'].count(' ')+1 313 | 314 | domainname = re.findall("//(.*?)/", act['url/fname'])[0] 315 | domainname.replace("www.","") 316 | dn = domainname.split(".") 317 | if len(dn) > 2 and not any([x in domainname for x in ["google.com", '.co.uk', '.co.nz', 'live.com']]): 318 | domainname = ".".join(dn[-2:]) 319 | 320 | # other 1, socnet 2, cloud 3, job 4, leak 5, hack 6 321 | if domainname in ['dropbox.com', 'drive.google.com', 'mega.co.nz', 'account.live.com']: 322 | r = 3 323 | elif domainname in ['wikileaks.org','freedom.press','theintercept.com']: 324 | r = 5 325 | elif domainname in ['facebook.com','twitter.com','plus.google.com','instagr.am','instagram.com', 326 | 'flickr.com','linkedin.com','reddit.com','about.com','youtube.com','pinterest.com', 327 | 'tumblr.com','quora.com','vine.co','match.com','t.co']: 328 | r = 2 329 | elif domainname in ['indeed.com','monster.com', 'careerbuilder.com','simplyhired.com']: 330 | r = 4 331 | 332 | elif ('job' in domainname and ('hunt' in domainname or 'search' in domainname)) \ 333 | or ('aol.com' in domainname and ("recruit" in act['url/fname'] or "job" in act['url/fname'])): 334 | r = 4 335 | elif (domainname in ['webwatchernow.com','actionalert.com', 'relytec.com','refog.com','wellresearchedreviews.com', 336 | 'softactivity.com', 'spectorsoft.com','best-spy-soft.com']): 337 | r = 6 338 | elif ('keylog' in domainname): 339 | r = 6 340 | else: 341 | r = 1 342 | if data in ['r6.1','r6.2']: 343 | http_act_dict = {'www visit': 1, 'www download': 2, 'www upload': 3} 344 | http_act = http_act_dict.get(act['activity'].lower(), 0) 345 | return [r, url_len, url_depth, content_len, content_nwords, http_act] 346 | else: 347 | return [r, url_len, url_depth, content_len, content_nwords] 348 | 349 | def file_process(act, complete_ul = None, data = 'r4.2', filetype = 'act'): 350 | if filetype == 'act': 351 | ftype = act['url/fname'].split(".")[1] 352 | disk = 1 if act['url/fname'][0] == 'C' else 0 353 | if act['url/fname'][0] == 'R': disk = 2 354 | file_depth = act['url/fname'].count('\\') 355 | elif filetype == 'att': #attachments 356 | tmp = act.split('.')[1] 357 | ftype = tmp[:tmp.find('(')] 358 | attsize = int(tmp[tmp.find("(")+1:tmp.find(")")]) 359 | r = [[0,0,0,0,0,0], [0,0,0,0,0,0]] 360 | if ftype in ['zip','rar','7z']: 361 | ind = 1 362 | elif ftype in ['jpg', 'png', 'bmp']: 363 | ind = 2 364 | elif ftype in ['doc','docx', 'pdf']: 365 | ind = 3 366 | elif ftype in ['txt','cfg', 'rtf']: 367 | ind = 4 368 
| elif ftype in ['exe', 'sh']: 369 | ind = 5 370 | else: 371 | ind = 0 372 | r[0][ind] = 1 373 | r[1][ind] = attsize 374 | return r 375 | 376 | fsize = len(act['content']) 377 | f_nwords = act['content'].count(' ')+1 378 | if ftype in ['zip','rar','7z']: 379 | r = 2 380 | elif ftype in ['jpg', 'png', 'bmp']: 381 | r = 3 382 | elif ftype in ['doc','docx', 'pdf']: 383 | r = 4 384 | elif ftype in ['txt','cfg','rtf']: 385 | r = 5 386 | elif ftype in ['exe', 'sh']: 387 | r = 6 388 | else: 389 | r = 1 390 | if data in ['r5.2','r5.1', 'r6.2','r6.1']: 391 | to_usb = 1 if act['to'] == 'True' else 0 392 | from_usb = 1 if act['from'] == 'True' else 0 393 | file_depth = act['url/fname'].count('\\') 394 | file_act_dict = {'file open': 1, 'file copy': 2, 'file write': 3, 'file delete': 4} 395 | if act['activity'].lower() not in file_act_dict: print(act['activity'].lower()) 396 | file_act = file_act_dict.get(act['activity'].lower(), 0) 397 | return [r, fsize, f_nwords, disk, file_depth, file_act, to_usb, from_usb] 398 | elif data in ['r4.1','r4.2']: 399 | return [r, fsize, f_nwords, disk, file_depth] 400 | 401 | def from_pc(act, ul): 402 | #code: 0,1,2,3: own pc, sharedpc, other's pc, supervisor's pc 403 | user_pc = ul.loc[act['user']]['pc'] 404 | act_pc = act['pc'] 405 | if act_pc == user_pc: 406 | return (0, act_pc) #using normal PC 407 | elif ul.loc[act['user']]['sharedpc'] is not None and act_pc in ul.loc[act['user']]['sharedpc']: 408 | return (1, act_pc) 409 | elif ul.loc[act['user']]['sup'] is not None and act_pc == ul.loc[ul.loc[act['user']]['sup']]['pc']: 410 | return (3, act_pc) 411 | else: 412 | return (2, act_pc) 413 | 414 | def process_week_num(week, users, userlist = 'all', data = 'r4.2'): 415 | 416 | user_dict = {idx: i for (i, idx) in enumerate(users.index)} 417 | acts_week = pd.read_pickle("DataByWeek/"+str(week)+".pickle") 418 | start_week, end_week = min(acts_week.date), max(acts_week.date) 419 | acts_week.sort_values('date', ascending = True, inplace = True) 420 | n_cols = 45 if data in ['r5.2','r5.1'] else 46 421 | if data in ['r4.2','r4.1']: n_cols = 27 422 | u_week = np.zeros((len(acts_week), n_cols)) 423 | pc_time = [] 424 | if userlist == 'all': 425 | userlist = set(acts_week.user) 426 | 427 | #FOR EACH USER 428 | current_ind = 0 429 | for u in userlist: 430 | df_acts_u = acts_week[acts_week.user == u] 431 | mal_u = 0 #, stop_soon = 0, 0 432 | if users.loc[u].malscene > 0: 433 | if start_week <= users.loc[u].mend and users.loc[u].mstart <= end_week: 434 | mal_u = users.loc[u].malscene 435 | 436 | list_uacts = df_acts_u.type.tolist() #all user's activities 437 | list_activity = df_acts_u.activity.tolist() 438 | list_uacts = [list_activity[i].strip().lower() if (type(list_activity[i])==str and list_activity[i].strip() in ['Logon', 'Logoff', 'Connect', 'Disconnect']) \ 439 | else list_uacts[i] for i in range(len(list_uacts))] 440 | uacts_mapping = {'logon':1, 'logoff':2, 'connect':3, 'disconnect':4, 'http':5,'email':6,'file':7} 441 | list_uacts_num = [uacts_mapping[x] for x in list_uacts] 442 | 443 | oneu_week = np.zeros((len(df_acts_u), n_cols)) 444 | oneu_pc_time = [] 445 | for i in range(len(df_acts_u)): 446 | pc, _ = from_pc(df_acts_u.iloc[i], users) 447 | if is_weekend(df_acts_u.iloc[i]['date']): 448 | if is_after_whour(df_acts_u.iloc[i]['date']): 449 | act_time = 4 450 | else: 451 | act_time = 3 452 | elif is_after_whour(df_acts_u.iloc[i]['date']): 453 | act_time = 2 454 | else: 455 | act_time = 1 456 | 457 | if data in ['r4.2','r4.1']: 458 | device_f = [0] 459 | file_f = [0, 
0, 0, 0, 0] 460 | http_f = [0,0,0,0,0] 461 | email_f = [0]*9 462 | elif data in ['r5.2','r5.1','r6.2','r6.1']: 463 | device_f = [0,0] 464 | file_f = [0]*8 465 | http_f = [0,0,0,0,0] 466 | if data in ['r6.2','r6.1']: 467 | http_f = [0,0,0,0,0,0] 468 | email_f = [0]*23 469 | 470 | if list_uacts[i] == 'file': 471 | file_f = file_process(df_acts_u.iloc[i], data = data) 472 | elif list_uacts[i] == 'email': 473 | email_f = email_process(df_acts_u.iloc[i], data = data) 474 | elif list_uacts[i] == 'http': 475 | http_f = http_process(df_acts_u.iloc[i], data=data) 476 | elif list_uacts[i] == 'connect': 477 | tmp = df_acts_u.iloc[i:] 478 | disconnect_acts = tmp[(tmp['activity'] == 'Disconnect\n') & \ 479 | (tmp['user'] == df_acts_u.iloc[i]['user']) & \ 480 | (tmp['pc'] == df_acts_u.iloc[i]['pc'])] 481 | 482 | connect_acts = tmp[(tmp['activity'] == 'Connect\n') & \ 483 | (tmp['user'] == df_acts_u.iloc[i]['user']) & \ 484 | (tmp['pc'] == df_acts_u.iloc[i]['pc'])] 485 | 486 | if len(disconnect_acts) > 0: 487 | distime = disconnect_acts.iloc[0]['date'] 488 | if len(connect_acts) > 0 and connect_acts.iloc[0]['date'] < distime: 489 | connect_dur = -1 490 | else: 491 | tmp_td = distime - df_acts_u.iloc[i]['date'] 492 | connect_dur = tmp_td.days*24*3600 + tmp_td.seconds 493 | else: 494 | connect_dur = -1 # disconnect action not found! 495 | 496 | if data in ['r5.2','r5.1','r6.2','r6.1']: 497 | file_tree_len = len(df_acts_u.iloc[i]['content'].split(';')) 498 | device_f = [connect_dur, file_tree_len] 499 | else: 500 | device_f = [connect_dur] 501 | 502 | is_mal_act = 0 503 | if mal_u > 0 and df_acts_u.index[i] in users.loc[u]['malacts']: is_mal_act = 1 504 | 505 | oneu_week[i,:] = [ user_dict[u], time_convert(df_acts_u.iloc[i]['date'], 'dt2dn'), list_uacts_num[i], pc, act_time] \ 506 | + device_f + file_f + http_f + email_f + [is_mal_act, mal_u] 507 | 508 | oneu_pc_time.append([df_acts_u.index[i], df_acts_u.iloc[i]['pc'],df_acts_u.iloc[i]['date']]) 509 | u_week[current_ind:current_ind+len(oneu_week),:] = oneu_week 510 | pc_time += oneu_pc_time 511 | current_ind += len(oneu_week) 512 | 513 | u_week = u_week[0:current_ind, :] 514 | col_names = ['user','day','act','pc','time'] 515 | if data in ['r4.1','r4.2']: 516 | device_feature_names = ['usb_dur'] 517 | file_feature_names = ['file_type', 'file_len', 'file_nwords', 'disk', 'file_depth'] 518 | http_feature_names = ['http_type', 'url_len','url_depth', 'http_c_len', 'http_c_nwords'] 519 | email_feature_names = ['n_des', 'n_atts', 'Xemail', 'n_exdes', 'n_bccdes', 'exbccmail', 'email_size', 'email_text_slen', 'email_text_nwords'] 520 | elif data in ['r5.2','r5.1', 'r6.2','r6.1']: 521 | device_feature_names = ['usb_dur', 'file_tree_len'] 522 | file_feature_names = ['file_type', 'file_len', 'file_nwords', 'disk', 'file_depth', 'file_act', 'to_usb', 'from_usb'] 523 | http_feature_names = ['http_type', 'url_len','url_depth', 'http_c_len', 'http_c_nwords'] 524 | if data in ['r6.2','r6.1']: 525 | http_feature_names = ['http_type', 'url_len','url_depth', 'http_c_len', 'http_c_nwords', 'http_act'] 526 | email_feature_names = ['send_mail', 'receive_mail','n_des', 'n_atts', 'Xemail', 'n_exdes', 'n_bccdes', 'exbccmail', 'email_size', 'email_text_slen', 'email_text_nwords'] 527 | email_feature_names += ['e_att_other', 'e_att_comp', 'e_att_pho', 'e_att_doc', 'e_att_txt', 'e_att_exe'] 528 | email_feature_names += ['e_att_sother', 'e_att_scomp', 'e_att_spho', 'e_att_sdoc', 'e_att_stxt', 'e_att_sexe'] 529 | 530 | col_names = col_names + device_feature_names + file_feature_names+ 
http_feature_names + email_feature_names + ['mal_act','insider']#['stop_soon', 'mal_act','insider'] 531 | df_u_week = pd.DataFrame(columns=['actid','pcid','time_stamp'] + col_names, index = np.arange(0,len(pc_time))) 532 | df_u_week[['actid','pcid','time_stamp']] = np.array(pc_time) 533 | 534 | df_u_week[col_names] = u_week 535 | df_u_week[col_names] = df_u_week[col_names].astype(int) 536 | df_u_week.to_pickle("NumDataByWeek/"+str(week)+"_num.pickle") 537 | 538 | ############################################################################## 539 | 540 | # return sessions for each user in a week: 541 | # sessions[sid] = [sessionid, pc, start_with, end_with, start time, end time,number_of_concurent_login, [action_indices]] 542 | # start_with: in the beginning of a week, action start with log in or not (1, 2) 543 | # end_with: log off, next log on same computer (1, 2) 544 | def get_sessions(uw, first_sid = 0): 545 | sessions = {} 546 | open_sessions = {} 547 | sid = 0 548 | current_pc = uw.iloc[0]['pcid'] 549 | start_time = uw.iloc[0]['time_stamp'] 550 | if uw.iloc[0]['act'] == 1: 551 | open_sessions[current_pc] = [current_pc, 1, 0, start_time, start_time, 1, [uw.index[0]]] 552 | else: 553 | open_sessions[current_pc] = [current_pc, 2, 0, start_time, start_time, 1, [uw.index[0]]] 554 | 555 | for i in uw.index[1:]: 556 | current_pc = uw.loc[i]['pcid'] 557 | if current_pc in open_sessions: # must be already a session with that pcid 558 | if uw.loc[i]['act'] == 2: 559 | open_sessions[current_pc][2] = 1 560 | open_sessions[current_pc][4] = uw.loc[i]['time_stamp'] 561 | open_sessions[current_pc][6].append(i) 562 | sessions[sid] = [first_sid+sid] + open_sessions.pop(current_pc) 563 | sid +=1 564 | elif uw.loc[i]['act'] == 1: 565 | open_sessions[current_pc][2] = 2 566 | sessions[sid] = [first_sid+sid] + open_sessions.pop(current_pc) 567 | sid +=1 568 | #create a new open session 569 | open_sessions[current_pc] = [current_pc, 1, 0, uw.loc[i]['time_stamp'], uw.loc[i]['time_stamp'], 1, [i]] 570 | if len(open_sessions) > 1: #increase the concurent count for all sessions 571 | for k in open_sessions: 572 | open_sessions[k][5] +=1 573 | else: 574 | open_sessions[current_pc][4] = uw.loc[i]['time_stamp'] 575 | open_sessions[current_pc][6].append(i) 576 | else: 577 | start_status = 1 if uw.loc[i]['act'] == 1 else 2 578 | open_sessions[current_pc] = [current_pc, start_status, 0, uw.loc[i]['time_stamp'], uw.loc[i]['time_stamp'], 1, [i]] 579 | if len(open_sessions) > 1: #increase the concurent count for all sessions 580 | for k in open_sessions: 581 | open_sessions[k][5] +=1 582 | return sessions 583 | 584 | def get_u_features_dicts(ul, data = 'r5.2'): 585 | ufdict = {} 586 | list_uf=[] if data in ['r4.1','r4.2'] else ['project'] 587 | list_uf += ['role','b_unit','f_unit', 'dept','team'] 588 | for f in list_uf: 589 | ul[f] = ul[f].astype(str) 590 | tmp = list(set(ul[f])) 591 | tmp.sort() 592 | ufdict[f] = {idx:i for i, idx in enumerate(tmp)} 593 | return (ul,ufdict, list_uf) 594 | 595 | def proc_u_features(uf, ufdict, list_f = None, data = 'r4.2'): #to remove mode 596 | if type(list_f) != list: 597 | list_f=[] if data in ['r4.1','r4.2'] else ['project'] 598 | list_f = ['role','b_unit','f_unit', 'dept','team'] + list_f 599 | 600 | out = [] 601 | for f in list_f: 602 | out.append(ufdict[f][uf[f]]) 603 | return out 604 | 605 | def f_stats_calc(ud, fn, stats_f, countonly_f = {}, get_stats = False): 606 | f_count = len(ud) 607 | r = [] 608 | f_names = [] 609 | 610 | for f in stats_f: 611 | inp = ud[f].values 612 | if 
get_stats: 613 | if f_count > 0: 614 | r += [np.min(inp), np.max(inp), np.median(inp), np.mean(inp), np.std(inp)] 615 | else: r += [0, 0, 0, 0, 0] 616 | f_names += [fn+'_min_'+f, fn+'_max_'+f, fn+'_med_'+f, fn+'_mean_'+f, fn+'_std_'+f] 617 | else: 618 | if f_count > 0: r += [np.mean(inp)] 619 | else: r += [0] 620 | f_names += [fn+'_mean_'+f] 621 | 622 | for f in countonly_f: 623 | for v in countonly_f[f]: 624 | r += [sum(ud[f].values == v)] 625 | f_names += [fn+'_n-'+f+str(v)] 626 | return (f_count, r, f_names) 627 | 628 | def f_calc_subfeatures(ud, fname, filter_col, filter_vals, filter_names, sub_features, countonly_subfeatures): 629 | [n, stats, fnames] = f_stats_calc(ud, fname,sub_features, countonly_subfeatures) 630 | allf = [n] + stats 631 | allf_names = ['n_'+fname] + fnames 632 | for i in range(len(filter_vals)): 633 | [n_sf, sf_stats, sf_fnames] = f_stats_calc(ud[ud[filter_col]==filter_vals[i]], filter_names[i], sub_features, countonly_subfeatures) 634 | allf += [n_sf] + sf_stats 635 | allf_names += [fname+'_n_'+filter_names[i]] + [fname + '_' + x for x in sf_fnames] 636 | return (allf, allf_names) 637 | 638 | def f_calc(ud, mode = 'week', data = 'r4.2'): 639 | n_weekendact = (ud['time']==3).sum() 640 | if n_weekendact > 0: 641 | is_weekend = 1 642 | else: 643 | is_weekend = 0 644 | 645 | all_countonlyf = {'pc':[0,1,2,3]} if mode != 'session' else {} 646 | [all_f, all_f_names] = f_calc_subfeatures(ud, 'allact', None, [], [], [], all_countonlyf) 647 | if mode == 'day': 648 | [workhourf, workhourf_names] = f_calc_subfeatures(ud[(ud['time'] == 1) | (ud['time'] == 3)], 'workhourallact', None, [], [], [], all_countonlyf) 649 | [afterhourf, afterhourf_names] = f_calc_subfeatures(ud[(ud['time'] == 2) | (ud['time'] == 4) ], 'afterhourallact', None, [], [], [], all_countonlyf) 650 | elif mode == 'week': 651 | [workhourf, workhourf_names] = f_calc_subfeatures(ud[ud['time'] == 1], 'workhourallact', None, [], [], [], all_countonlyf) 652 | [afterhourf, afterhourf_names] = f_calc_subfeatures(ud[ud['time'] == 2 ], 'afterhourallact', None, [], [], [], all_countonlyf) 653 | [weekendf, weekendf_names] = f_calc_subfeatures(ud[ud['time'] >= 3 ], 'weekendallact', None, [], [], [], all_countonlyf) 654 | 655 | logon_countonlyf = {'pc':[0,1,2,3]} if mode != 'session' else {} 656 | logon_statf = [] 657 | 658 | [all_logonf, all_logonf_names] = f_calc_subfeatures(ud[ud['act']==1], 'logon', None, [], [], logon_statf, logon_countonlyf) 659 | if mode == 'day': 660 | [workhourlogonf, workhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & ((ud['time'] == 1) | (ud['time'] == 3) )], 'workhourlogon', None, [], [], logon_statf, logon_countonlyf) 661 | [afterhourlogonf, afterhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & ((ud['time'] == 2) | (ud['time'] == 4) )], 'afterhourlogon', None, [], [], logon_statf, logon_countonlyf) 662 | elif mode == 'week': 663 | [workhourlogonf, workhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & (ud['time'] == 1)], 'workhourlogon', None, [], [], logon_statf, logon_countonlyf) 664 | [afterhourlogonf, afterhourlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & (ud['time'] == 2) ], 'afterhourlogon', None, [], [], logon_statf, logon_countonlyf) 665 | [weekendlogonf, weekendlogonf_names] = f_calc_subfeatures(ud[(ud['act']==1) & (ud['time'] >= 3) ], 'weekendlogon', None, [], [], logon_statf, logon_countonlyf) 666 | 667 | device_countonlyf = {'pc':[0,1,2,3]} if mode != 'session' else {} 668 | device_statf = ['usb_dur','file_tree_len'] if data not in 
['r4.1','r4.2'] else ['usb_dur'] 669 | 670 | [all_devicef, all_devicef_names] = f_calc_subfeatures(ud[ud['act']==3], 'usb', None, [], [], device_statf, device_countonlyf) 671 | if mode == 'day': 672 | [workhourdevicef, workhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & ((ud['time'] == 1) | (ud['time'] == 3) )], 'workhourusb', None, [], [], device_statf, device_countonlyf) 673 | [afterhourdevicef, afterhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & ((ud['time'] == 2) | (ud['time'] == 4) )], 'afterhourusb', None, [], [], device_statf, device_countonlyf) 674 | elif mode == 'week': 675 | [workhourdevicef, workhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & (ud['time'] == 1)], 'workhourusb', None, [], [], device_statf, device_countonlyf) 676 | [afterhourdevicef, afterhourdevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & (ud['time'] == 2) ], 'afterhourusb', None, [], [], device_statf, device_countonlyf) 677 | [weekenddevicef, weekenddevicef_names] = f_calc_subfeatures(ud[(ud['act']==3) & (ud['time'] >= 3) ], 'weekendusb', None, [], [], device_statf, device_countonlyf) 678 | 679 | if mode != 'session': file_countonlyf = {'to_usb':[1],'from_usb':[1], 'file_act':[1,2,3,4], 'disk':[0,1], 'pc':[0,1,2,3]} 680 | else: file_countonlyf = {'to_usb':[1],'from_usb':[1], 'file_act':[1,2,3,4], 'disk':[0,1,2]} 681 | if data in ['r4.1','r4.2']: 682 | [file_countonlyf.pop(k) for k in ['to_usb','from_usb', 'file_act']] 683 | 684 | (all_filef, all_filef_names) = f_calc_subfeatures(ud[ud['act']==7], 'file', 'file_type', [1,2,3,4,5,6], \ 685 | ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf) 686 | 687 | if mode == 'day': 688 | (workhourfilef, workhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & ((ud['time'] ==1) | (ud['time'] ==3))], 'workhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf) 689 | (afterhourfilef, afterhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & ((ud['time'] ==2) | (ud['time'] ==4))], 'afterhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf) 690 | elif mode == 'week': 691 | (workhourfilef, workhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & (ud['time'] ==1)], 'workhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf) 692 | (afterhourfilef, afterhourfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & (ud['time'] ==2)], 'afterhourfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf) 693 | (weekendfilef, weekendfilef_names) = f_calc_subfeatures(ud[(ud['act']==7) & (ud['time'] >= 3)], 'weekendfile', 'file_type', [1,2,3,4,5,6], ['otherf','compf','phof','docf','txtf','exef'], ['file_len', 'file_depth', 'file_nwords'], file_countonlyf) 694 | 695 | email_stats_f = ['n_des', 'n_atts', 'n_exdes', 'n_bccdes', 'email_size', 'email_text_slen', 'email_text_nwords'] 696 | if data not in ['r4.1','r4.2']: 697 | email_stats_f += ['e_att_other', 'e_att_comp', 'e_att_pho', 'e_att_doc', 'e_att_txt', 'e_att_exe'] 698 | email_stats_f += ['e_att_sother', 'e_att_scomp', 'e_att_spho', 'e_att_sdoc', 'e_att_stxt', 'e_att_sexe'] 699 | mail_filter = 'send_mail' 700 | mail_filter_vals = [0,1] 701 | mail_filter_names = 
['recvmail','send_mail'] 702 | else: 703 | mail_filter, mail_filter_vals, mail_filter_names = None, [], [] 704 | 705 | if mode != 'session': mail_countonlyf = {'Xemail':[1],'exbccmail':[1], 'pc':[0,1,2,3]} 706 | else: mail_countonlyf = {'Xemail':[1],'exbccmail':[1]} 707 | 708 | (all_emailf, all_emailf_names) = f_calc_subfeatures(ud[ud['act']==6], 'email', mail_filter, mail_filter_vals, mail_filter_names , email_stats_f, mail_countonlyf) 709 | if mode == 'week': 710 | (workhouremailf, workhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & (ud['time'] == 1)], 'workhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf) 711 | (afterhouremailf, afterhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & (ud['time'] == 2)], 'afterhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf) 712 | (weekendemailf, weekendemailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & (ud['time'] >= 3)], 'weekendemail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf) 713 | elif mode == 'day': 714 | (workhouremailf, workhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & ((ud['time'] ==1) | (ud['time'] ==3))], 'workhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf) 715 | (afterhouremailf, afterhouremailf_names) = f_calc_subfeatures(ud[(ud['act']==6) & ((ud['time'] ==2) | (ud['time'] ==4))], 'afterhouremail', mail_filter, mail_filter_vals, mail_filter_names, email_stats_f, mail_countonlyf) 716 | 717 | if data in ['r5.2','r5.1'] or data in ['r4.1','r4.2']: 718 | http_count_subf = {'pc':[0,1,2,3]} 719 | elif data in ['r6.2','r6.1']: 720 | http_count_subf = {'pc':[0,1,2,3], 'http_act':[1,2,3]} 721 | 722 | if mode == 'session': http_count_subf.pop('pc',None) 723 | 724 | (all_httpf, all_httpf_names) = f_calc_subfeatures(ud[ud['act']==5], 'http', 'http_type', [1,2,3,4,5,6], \ 725 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf) 726 | 727 | if mode == 'week': 728 | (workhourhttpf, workhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & (ud['time'] ==1)], 'workhourhttp', 'http_type', [1,2,3,4,5,6], \ 729 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf) 730 | (afterhourhttpf, afterhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & (ud['time'] ==2)], 'afterhourhttp', 'http_type', [1,2,3,4,5,6], \ 731 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf) 732 | (weekendhttpf, weekendhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & (ud['time'] >=3)], 'weekendhttp', 'http_type', [1,2,3,4,5,6], \ 733 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf) 734 | elif mode == 'day': 735 | (workhourhttpf, workhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & ((ud['time'] ==1) | (ud['time'] ==3))], 'workhourhttp', 'http_type', [1,2,3,4,5,6], \ 736 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 'http_c_len', 'http_c_nwords'], http_count_subf) 737 | (afterhourhttpf, afterhourhttpf_names) = f_calc_subfeatures(ud[(ud['act']==5) & ((ud['time'] ==2) | (ud['time'] ==4))], 'afterhourhttp', 'http_type', [1,2,3,4,5,6], \ 738 | ['otherf','socnetf','cloudf','jobf','leakf','hackf'], ['url_len', 'url_depth', 
'http_c_len', 'http_c_nwords'], http_count_subf) 739 | 740 | numActs = all_f[0] 741 | mal_u = 0 742 | if (ud['mal_act']).sum() > 0: 743 | tmp = list(set(ud['insider'])) 744 | if len(tmp) > 1: 745 | tmp.remove(0.0) 746 | mal_u = tmp[0] 747 | 748 | if mode == 'week': 749 | features_tmp = all_f + workhourf + afterhourf + weekendf +\ 750 | all_logonf + workhourlogonf + afterhourlogonf + weekendlogonf +\ 751 | all_devicef + workhourdevicef + afterhourdevicef + weekenddevicef +\ 752 | all_filef + workhourfilef + afterhourfilef + weekendfilef + \ 753 | all_emailf + workhouremailf + afterhouremailf + weekendemailf + all_httpf + workhourhttpf + afterhourhttpf + weekendhttpf 754 | fnames_tmp = all_f_names + workhourf_names + afterhourf_names + weekendf_names +\ 755 | all_logonf_names + workhourlogonf_names + afterhourlogonf_names + weekendlogonf_names +\ 756 | all_devicef_names + workhourdevicef_names + afterhourdevicef_names + weekenddevicef_names +\ 757 | all_filef_names + workhourfilef_names + afterhourfilef_names + weekendfilef_names + \ 758 | all_emailf_names + workhouremailf_names + afterhouremailf_names + weekendemailf_names + all_httpf_names + workhourhttpf_names + afterhourhttpf_names + weekendhttpf_names 759 | elif mode == 'day': 760 | features_tmp = all_f + workhourf + afterhourf +\ 761 | all_logonf + workhourlogonf + afterhourlogonf +\ 762 | all_devicef + workhourdevicef + afterhourdevicef + \ 763 | all_filef + workhourfilef + afterhourfilef + \ 764 | all_emailf + workhouremailf + afterhouremailf + all_httpf + workhourhttpf + afterhourhttpf 765 | fnames_tmp = all_f_names + workhourf_names + afterhourf_names +\ 766 | all_logonf_names + workhourlogonf_names + afterhourlogonf_names +\ 767 | all_devicef_names + workhourdevicef_names + afterhourdevicef_names +\ 768 | all_filef_names + workhourfilef_names + afterhourfilef_names + \ 769 | all_emailf_names + workhouremailf_names + afterhouremailf_names + all_httpf_names + workhourhttpf_names + afterhourhttpf_names 770 | elif mode == 'session': 771 | features_tmp = all_f + all_logonf + all_devicef + all_filef + all_emailf + all_httpf 772 | fnames_tmp = all_f_names + all_logonf_names + all_devicef_names + all_filef_names + all_emailf_names + all_httpf_names 773 | 774 | return [numActs, is_weekend, features_tmp, fnames_tmp, mal_u] 775 | 776 | def session_instance_calc(ud, sinfo, week, mode, data, uw, v, list_uf): 777 | d = ud.iloc[0]['day'] 778 | perworkhour = sum(ud['time']==1)/len(ud) 779 | perafterhour = sum(ud['time']==2)/len(ud) 780 | perweekend = sum(ud['time']==3)/len(ud) 781 | perweekendafterhour = sum(ud['time']==4)/len(ud) 782 | st_timestamp = min(ud['time_stamp']) 783 | end_timestamp = max(ud['time_stamp']) 784 | s_dur = (end_timestamp - st_timestamp).total_seconds() / 60 # in minute 785 | s_start = st_timestamp.hour + st_timestamp.minute/60 786 | s_end = end_timestamp.hour + end_timestamp.minute/60 787 | starttime = st_timestamp.timestamp() 788 | endtime = end_timestamp.timestamp() 789 | n_days = len(set(ud['day'])) 790 | 791 | tmp = f_calc(ud, mode, data) 792 | session_instance = [starttime, endtime, v, sinfo[0], d, week, ud.iloc[0]['pc'], perworkhour, perafterhour, perweekend, 793 | perweekendafterhour, n_days, s_dur, sinfo[6], sinfo[2], sinfo[3], s_start, s_end] + \ 794 | (uw.loc[v, list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N'] ]).tolist() + tmp[2] + [tmp[4]] 795 | return (session_instance, tmp[3]) 796 | 797 | def to_csv(week, mode, data, ul, uf_dict, list_uf, subsession_mode = {}): 798 | user_dict = {i : idx for (i, idx) in 
enumerate(ul.index)} 799 | if mode == 'session': 800 | first_sid = week*100000 # to get an unique index for each session, also, first 1 or 2 number in index would be week number 801 | cols2a = ['starttime', 'endtime','user', 'sessionid', 'day', 'week', 'pc', 'isworkhour', 'isafterhour','isweekend', 802 | 'isweekendafterhour', 'n_days', 'duration', 'n_concurrent_sessions', 'start_with', 'end_with', 'ses_start', 803 | 'ses_end'] + list_uf + ['ITAdmin','O','C','E','A','N'] 804 | elif mode == 'day': 805 | cols2a = ['starttime', 'endtime','user', 'day', 'week', 'isweekday','isweekend'] + list_uf +\ 806 | ['ITAdmin','O','C','E','A','N'] 807 | else: cols2a = ['starttime', 'endtime','user','week'] + list_uf + ['ITAdmin','O','C','E','A','N'] 808 | cols2b = ['insider'] 809 | 810 | w = pd.read_pickle("NumDataByWeek/"+str(week)+"_num.pickle") 811 | 812 | usnlist = list(set(w['user'].astype('int').values)) 813 | if True: 814 | cols = ['week']+ list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N', 'insider'] 815 | uw = pd.DataFrame(columns = cols, index = user_dict.keys()) 816 | uwdict = {} 817 | for v in user_dict: 818 | if v in usnlist: 819 | is_ITAdmin = 1 if ul.loc[user_dict[v], 'role'] == 'ITAdmin' else 0 820 | row = [week] + proc_u_features(ul.loc[user_dict[v]], uf_dict, list_uf, data = data) + [is_ITAdmin] + \ 821 | (ul.loc[user_dict[v],['O','C','E','A','N']]).tolist() + [0] 822 | row[-1] = int(list(set(w[w['user']==v]['insider']))[0]) 823 | uwdict[v] = row 824 | uw = pd.DataFrame.from_dict(uwdict, orient = 'index',columns = cols) 825 | 826 | towrite = pd.DataFrame() 827 | towrite_list = [] 828 | 829 | if mode == 'session' and len(subsession_mode) > 0: 830 | towrite_list_subsession = {} 831 | for k1 in subsession_mode: 832 | towrite_list_subsession[k1] = {} 833 | for k2 in subsession_mode[k1]: 834 | towrite_list_subsession[k1][k2] = [] 835 | 836 | days = list(set(w['day'])) 837 | for v in user_dict: 838 | if v in usnlist: 839 | uactw = w[w['user']==v] 840 | 841 | if mode == 'week': 842 | a = uactw.iloc[0]['time_stamp'] 843 | a = a - timedelta(int(a.strftime("%w"))) # get the nearest Sunday 844 | starttime = datetime(a.year, a.month, a.day).timestamp() 845 | endtime = (datetime(a.year, a.month, a.day) + timedelta(days=7)).timestamp() 846 | 847 | if len(uactw) > 0: 848 | tmp = f_calc(uactw, mode, data) 849 | i_fnames = tmp[3] 850 | towrite_list.append([starttime, endtime, v, week] + (uw.loc[v, list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N'] ]).tolist() + tmp[2] + [ tmp[4]]) 851 | 852 | if mode == 'session': 853 | sessions = get_sessions(uactw, first_sid) 854 | first_sid += len(sessions) 855 | for s in sessions: 856 | sinfo = sessions[s] 857 | 858 | ud = uactw.loc[sessions[s][7]] 859 | if len(ud) > 0: 860 | session_instance, i_fnames = session_instance_calc(ud, sinfo, week, mode, data, uw, v, list_uf) 861 | towrite_list.append(session_instance) 862 | 863 | ## do subsessions: 864 | if 'time' in subsession_mode: # divide a session into subsessions by consecutive time chunks 865 | for subsession_dur in subsession_mode['time']: 866 | n_subsession = int(np.ceil(session_instance[12] / subsession_dur)) 867 | if n_subsession == 1: 868 | towrite_list_subsession['time'][subsession_dur].append([0] + session_instance) 869 | else: 870 | sinfo1 = sinfo.copy() 871 | for subsession_ind in range(n_subsession): 872 | sinfo1[3] = 0 if subsession_ind < n_subsession-1 else sinfo[3] 873 | 874 | subsession_ud = ud[(ud['time_stamp'] >= sessions[s][4] + timedelta(minutes = subsession_ind*subsession_dur)) & \ 875 | (ud['time_stamp'] 
< sessions[s][4] + timedelta(minutes = (subsession_ind+1)*subsession_dur))] 876 | if len(subsession_ud) > 0: 877 | ss_instance, _ = session_instance_calc(subsession_ud, sinfo1, week, mode, data, uw, v, list_uf) 878 | towrite_list_subsession['time'][subsession_dur].append([subsession_ind] + ss_instance) 879 | 880 | if 'nact' in subsession_mode: 881 | for ss_nact in subsession_mode['nact']: 882 | n_subsession = int(np.ceil(len(ud) / ss_nact)) 883 | if n_subsession == 1: 884 | towrite_list_subsession['nact'][ss_nact].append([0] + session_instance) 885 | else: 886 | sinfo1 = sinfo.copy() 887 | for ss_ind in range(n_subsession): 888 | sinfo1[3] = 0 if ss_ind < n_subsession-1 else sinfo[3] 889 | 890 | ss_ud = ud.iloc[ss_ind*ss_nact : min(len(ud), (ss_ind+1)*ss_nact)] 891 | if len(ss_ud) > 0: 892 | ss_instance,_ = session_instance_calc(ss_ud, sinfo1, week, mode, data, uw, v, list_uf) 893 | towrite_list_subsession['nact'][ss_nact].append([ss_ind] + ss_instance) 894 | 895 | if mode == 'day': 896 | days = sorted(list(set(uactw['day']))) 897 | for d in days: 898 | ud = uactw[uactw['day'] == d] 899 | isweekday = 1 if sum(ud['time']>=3) == 0 else 0 900 | isweekend = 1-isweekday 901 | a = ud.iloc[0]['time_stamp'] 902 | starttime = datetime(a.year, a.month, a.day).timestamp() 903 | endtime = (datetime(a.year, a.month, a.day) + timedelta(days=1)).timestamp() 904 | 905 | if len(ud) > 0: 906 | tmp = f_calc(ud, mode, data) 907 | i_fnames = tmp[3] 908 | towrite_list.append([starttime, endtime, v, d, week, isweekday, isweekend] + (uw.loc[v, list_uf + ['ITAdmin', 'O', 'C', 'E', 'A', 'N'] ]).tolist() + tmp[2] + [ tmp[4]]) 909 | 910 | towrite = pd.DataFrame(columns = cols2a + i_fnames + cols2b, data = towrite_list) 911 | towrite.to_pickle("tmp/"+str(week) + mode+".pickle") 912 | 913 | if mode == 'session' and len(subsession_mode) > 0: 914 | for k1 in subsession_mode: 915 | for k2 in subsession_mode[k1]: 916 | df_tmp = pd.DataFrame(columns = ['subs_ind']+cols2a + i_fnames + cols2b, data = towrite_list_subsession[k1][k2]) 917 | df_tmp.to_pickle("tmp/"+str(week) + mode + k1 + str(k2) + ".pickle") 918 | 919 | if __name__ == "__main__": 920 | dname = os.getcwd().split('/')[-1] 921 | if dname not in ['r4.1','r4.2','r6.2','r6.1','r5.1','r5.2']: 922 | raise Exception('Please put this script in and run it from a CERT data folder (e.g. r4.2)') 923 | #make temporary folders 924 | [os.mkdir(x) for x in ["tmp", "ExtractedData", "DataByWeek", "NumDataByWeek"]] 925 | 926 | subsession_mode = {'nact':[25, 50], 'time':[120, 240]}#this can be an empty dict 927 | 928 | numCores = 8 929 | arguments = len(sys.argv) - 1 930 | if arguments > 0: 931 | numCores = int(sys.argv[1]) 932 | 933 | numWeek = 73 if dname in ['r4.1','r4.2'] else 75 # only 73 weeks in r4.1 and r4.2 dataset 934 | st = time.time() 935 | 936 | #### Step 1: Combine data from sources by week, stored in DataByWeek 937 | combine_by_timerange_pandas(dname) 938 | print(f"Step 1 - Separate data by week - done. Time (mins): {(time.time()-st)/60:.2f}") 939 | st = time.time() 940 | 941 | #### Step 2: Get user list 942 | users = get_mal_userdata(dname) 943 | print(f"Step 2 - Get user list - done. Time (mins): {(time.time()-st)/60:.2f}") 944 | st = time.time() 945 | 946 | #### Step 3: Convert each action to numerical data, stored in NumDataByWeek 947 | Parallel(n_jobs=numCores)(delayed(process_week_num)(i, users, data=dname) for i in range(numWeek)) 948 | print(f"Step 3 - Convert each action to numerical data - done. 
Time (mins): {(time.time()-st)/60:.2f}") 949 | st = time.time() 950 | 951 | #### Step 4: Extract to csv 952 | for mode in ['week','day','session']: 953 | 954 | weekRange = list(range(0, numWeek)) if mode in ['day', 'session'] else list(range(1, numWeek)) 955 | (ul, uf_dict, list_uf) = get_u_features_dicts(users, data= dname) 956 | 957 | Parallel(n_jobs=numCores)(delayed(to_csv)(i, mode, dname, ul, uf_dict, list_uf, subsession_mode) 958 | for i in weekRange) 959 | 960 | all_csv = open('ExtractedData/'+mode+dname+'.csv','a') 961 | 962 | towrite = pd.read_pickle("tmp/"+str(weekRange[0]) + mode+".pickle") 963 | towrite.to_csv(all_csv,header=True, index = False) 964 | for w in weekRange[1:]: 965 | towrite = pd.read_pickle("tmp/"+str(w) + mode+".pickle") 966 | towrite.to_csv(all_csv,header=False, index = False) 967 | 968 | if mode == 'session' and len(subsession_mode) > 0: 969 | for k1 in subsession_mode: 970 | for k2 in subsession_mode[k1]: 971 | all_csv = open('ExtractedData/'+mode+ k1 + str(k2) + dname+'.csv','a') 972 | towrite = pd.read_pickle('tmp/'+str(weekRange[0]) + mode + k1 + str(k2)+".pickle") 973 | towrite.to_csv(all_csv,header=True, index = False) 974 | for w in weekRange[1:]: 975 | towrite = pd.read_pickle('tmp/'+str(w) + mode+ k1 + str(k2)+".pickle") 976 | towrite.to_csv(all_csv,header=False, index = False) 977 | 978 | print(f'Extracted {mode} data. Time (mins): {(time.time()-st)/60:.2f}') 979 | st = time.time() 980 | 981 | [os.system(f"rm -r {x}") for x in ["tmp", "DataByWeek", "NumDataByWeek"]] 982 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | #python3.6 2 | numpy==1.19.5 3 | pandas==1.1.5 4 | joblib==1.1.1 5 | scikit-learn==0.24.2 -------------------------------------------------------------------------------- /temporal_data_representation.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import numpy as np 4 | import pandas as pd 5 | import multiprocessing 6 | from scipy.stats import percentileofscore 7 | import argparse 8 | 9 | def concat_combination(data, window_size = 3, dname = 'cert'): 10 | 11 | if dname == 'cert': 12 | info_cols = ['sessionid','day','week',"starttime", "endtime", 13 | 'user', 'project', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'ITAdmin', 14 | 'O', 'C', 'E', 'A', 'N', 'insider'] 15 | 16 | combining_features = [ f for f in data.columns if f not in info_cols] 17 | info_features = [f for f in data.columns if f in info_cols] 18 | 19 | data_info = data[info_features].values 20 | 21 | data_combining_features = data[combining_features].values 22 | useridx = data['user'].values 23 | 24 | userset = set(data['user']) 25 | 26 | cols = [] 27 | for shiftrange in range(window_size-1,0,-1): 28 | cols += [str(-shiftrange) + '_' + f for f in combining_features] 29 | cols += combining_features + info_features 30 | 31 | combined_data = [] 32 | for u in userset: 33 | data_cf_u = data_combining_features[useridx == u, ] 34 | 35 | data_cf_u_shifted = [] 36 | for shiftrange in range(window_size-1,0,-1): 37 | data_cf_u_shifted.append(np.roll(data_cf_u, shiftrange, axis = 0)) 38 | 39 | data_cf_u_shifted.append(data_cf_u) 40 | data_cf_u_shifted.append(data_info[useridx==u, ]) 41 | 42 | combined_data.append(np.hstack(data_cf_u_shifted)[window_size:,]) 43 | 44 | combined_data = pd.DataFrame(np.vstack(combined_data), columns=cols) 45 | 46 | return combined_data 47 | 48 | 49 
| def subtract_combination_uworker(u, alluserdict, dtype, calc_type, window_size, udayidx, udata, uinfo, uorg): 50 | if u%200==0: 51 | print(u) 52 | 53 | data_out = [] 54 | 55 | if dtype in ['day', 'week']: 56 | 57 | for i in range(len(udayidx)): 58 | t = udayidx[i] 59 | if dtype in ['day','week']: min_idx = min(udayidx)+window_size 60 | 61 | if t>=min_idx: 62 | if calc_type == 'meandiff': 63 | prevdata = udata[(udayidx > t - 1 - window_size) & (udayidx <= t-1),] 64 | if len(prevdata) < 1: continue 65 | window_mean = np.mean(prevdata, axis = 0) 66 | data_out.append(np.concatenate((udata[i] - window_mean, uorg[i,:], uinfo[i,:]))) 67 | 68 | if calc_type == 'meddiff': 69 | prevdata = udata[(udayidx > t - 1 - window_size) & (udayidx <= t-1),] 70 | if len(prevdata) < 1: continue 71 | window_med = np.median(prevdata, axis = 0) 72 | data_out.append(np.concatenate((udata[i] - window_med, uorg[i,:], uinfo[i,:]))) 73 | elif calc_type == 'percentile': 74 | window = udata[(udayidx > t - 1 - window_size) & (udayidx <= t-1),] 75 | if window.shape[0] < 1: continue 76 | percentile_i = [percentileofscore(window[:,j], udata[i,j], 'mean') - 50 for j in range(window.shape[1])] 77 | data_out.append(np.concatenate((percentile_i , uorg[i,:], uinfo[i,:]))) 78 | 79 | if len(data_out) > 0: alluserdict[u] = np.vstack(data_out) 80 | 81 | def subtract_percentile_combination(data, dtype, calc_type = 'percentile', window_size = 7, dname = 'cert', parallel = True): 82 | ''' 83 | Combine data to generate different temporal representations 84 | window_size: window size by days (for CERT data) 85 | ''' 86 | if dname == 'cert': 87 | info_cols = ['sessionid','day','week',"starttime", "endtime", 88 | 'user', 'project', 'role', 'b_unit', 'f_unit', 'dept', 'team', 'ITAdmin', 89 | 'O', 'C', 'E', 'A', 'N', 'insider','subs_ind'] 90 | keep_org_cols = ["pc", "isworkhour", "isafterhour", "isweekday", "isweekend", "isweekendafterhour", "n_days", 91 | "duration", "n_concurrent_sessions", "start_with", "end_with", "ses_start", "ses_end"] 92 | 93 | combining_features = [ f for f in data.columns if f not in info_cols] 94 | info_features = [f for f in data.columns if f in info_cols] 95 | keep_org_features = [f for f in data.columns if f in keep_org_cols] 96 | 97 | data_info = data[info_features].values 98 | data_org = data[keep_org_features].values 99 | data_combining_features = data[combining_features].values 100 | useridx = data['user'].values 101 | if dtype in ['day']: dayidx = data['day'].values 102 | if dname == 'cert': weekidx = data['week'].values 103 | 104 | userset = set(data['user']) 105 | 106 | if dtype == 'week': 107 | window_size = np.floor(window_size/7) 108 | idx = weekidx 109 | elif dtype in ['day']: idx = dayidx 110 | 111 | if parallel: 112 | manager = multiprocessing.Manager() 113 | return_dict = manager.dict() 114 | jobs = [] 115 | for u in userset: 116 | udayidx = idx[useridx==u] 117 | udata = data_combining_features[useridx==u, ] 118 | uinfo = data_info[useridx==u, ] 119 | uorg = data_org[useridx==u, ] 120 | p = multiprocessing.Process(target=subtract_combination_uworker, args=(u, return_dict, dtype, calc_type, 121 | window_size, udayidx, 122 | udata, uinfo, uorg)) 123 | jobs.append(p) 124 | p.start() 125 | 126 | for proc in jobs: 127 | proc.join() 128 | else: 129 | return_dict = {} 130 | for u in userset: 131 | udayidx = idx[useridx==u] 132 | udata = data_combining_features[useridx==u, ] 133 | uinfo = data_info[useridx==u, ] 134 | uorg = data_org[useridx==u, ] 135 | subtract_combination_uworker(u, return_dict, dtype, 
calc_type,
136 |                                                      window_size, udayidx,
137 |                                                      udata, uinfo, uorg)
138 | 
139 |     combined_data = pd.DataFrame(np.vstack([return_dict[ri] for ri in return_dict.keys()]), columns=combining_features+['org_'+f for f in keep_org_features] + info_features)
140 | 
141 |     return combined_data
142 | 
143 | 
144 | if __name__ == "__main__":
145 |     parser=argparse.ArgumentParser()
146 |     parser.add_argument('--representation', help='Data representation to extract (concat, percentile, meandiff, meddiff, or all). Default: percentile',
147 |                         type= str, default = 'percentile')
148 |     parser.add_argument('--file_input', help='CERT input file name. Default: week-r5.2.csv.gz', type= str, default= 'week-r5.2.csv.gz')
149 |     parser.add_argument('--window_size', help='Window size (in days) for percentile or mean/median difference representation. Default: 30',
150 |                         type = int, default=30)
151 |     parser.add_argument('--num_concat', help='Number of data points for concatenation. Default: 3',
152 |                         type = int, default=3)
153 |     args=parser.parse_args()
154 | 
155 |     print('If a "too many open files" or "ForkAwareLocal" error occurs, increase the limit first by running the ulimit command, e.g. "$ ulimit -n 10000"')
156 |     if args.representation == 'all':
157 |         reps = ['concat', 'percentile','meandiff','meddiff']
158 |     elif args.representation in ['concat', 'percentile','meandiff','meddiff']:
159 |         reps = [args.representation]
160 | 
161 |     fileName = (args.file_input).replace('.csv','').replace('.gz','')
162 |     if 'day' in fileName:
163 |         data_type = 'day'
164 |     elif 'week' in fileName:
165 |         data_type = 'week'
166 |     s = pd.read_csv(f'{args.file_input}')
167 | 
168 |     for rep in reps:
169 |         if rep in ['percentile','meandiff','meddiff']:
170 |             s1 = subtract_percentile_combination(s, data_type, calc_type = rep, window_size = args.window_size, dname='cert')
171 |             s1.to_pickle(f'{fileName}-{rep}{args.window_size}.pkl')
172 |         else:
173 |             s1 = concat_combination(s, window_size = args.num_concat, dname = 'cert')
174 |             s1.to_pickle(f'{fileName}-{rep}{args.num_concat}.pkl')
175 | 
176 | 
--------------------------------------------------------------------------------
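
The two representation functions above can also be imported and called directly from Python rather than through the command-line interface. The following is a minimal sketch, not part of the repository: it assumes feature_extraction.py has already produced a week-level file (for dataset r5.2 that would be ExtractedData/weekr5.2.csv) and that temporal_data_representation.py is importable from the current working directory.

# Minimal usage sketch; the input file path and output names below are assumptions.
import pandas as pd
from temporal_data_representation import concat_combination, subtract_percentile_combination

# Week-level features produced by feature_extraction.py (assumed path for r5.2)
week_data = pd.read_csv('ExtractedData/weekr5.2.csv')

# Concatenation: join each user's current week with the two preceding weeks (window_size=3)
concat_rep = concat_combination(week_data, window_size=3, dname='cert')
concat_rep.to_pickle('weekr5.2-concat3.pkl')

# Percentile representation: score each feature against the same user's previous 30 days
# (window_size is given in days; for week-level data it is converted to weeks internally)
percentile_rep = subtract_percentile_combination(week_data, 'week', calc_type='percentile',
                                                 window_size=30, dname='cert', parallel=False)
percentile_rep.to_pickle('weekr5.2-percentile30.pkl')

Passing parallel=False uses the sequential branch of subtract_percentile_combination and so avoids the multiprocessing path (and the ulimit caveat printed by the script); with parallel=True one worker process is spawned per user.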