├── README.md ├── dataset ├── decomposed_data │ ├── seasonal_decomposed.csv │ └── trend_decomposed.csv ├── ecg_mitbih_test.csv ├── imputed_data │ └── ecg_mitbih_test_imputed.csv ├── machine_temperature_system_failure.csv ├── power_voltage.csv └── synchronized_data │ ├── Original plot.png │ ├── Synchronized plot.png │ └── synchronized_dtw.csv ├── dtw.py ├── imputation.py ├── seasonal_trend_decomposition.py ├── synchronization.py ├── 고려대학교_건물별_전력데이터.zip ├── 녹지캠4일전력소비량.png ├── 녹지캠4주전력소비량.png └── 변수설명.docx /README.md: -------------------------------------------------------------------------------- 1 | # KU Data Preprocessing Package 2 | 3 | ## 1. Repository Structure 4 | ```sh 5 | . 6 | ├── dataset 7 | │   ├── ecg_mitbih_test.csv 8 | │   ├── imputed_data 9 | │   │ └── ecg_mitbih_test_imputed.csv 10 | │   ├── decomposed_data 11 | │   │ ├── seasonal_decomposed.csv 12 | │   │ └── trend_decomposed.csv 13 | │   └── synchronized_data 14 | │   └── synchronized_dtw.csv 15 | ├── imputation.py 16 | ├── seasonal_trend_decomposition.py 17 | ├── synchronization.py 18 | └── README.md 19 | ``` 20 | 21 | ## 2. Preprocessing module 22 | ### 2.1 Missing Value (NA) Imputation 23 | #### 2.1.1 Supported Options & Sample Usage 24 | Impute the missing values in a dataset and save the result. 
25 | - Simple Imputation with `mean, median, most_frequent, constant` value [[description]](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) 26 | ```Python 27 | # Sample Usage 28 | python imputation.py --data_path='./dataset/ecg_mitbih_test.csv' \ 29 | --option='simple' \ 30 | --strategy='mean' \ 31 | --output_path='./dataset/imputed_data/ecg_mitbih_test_imputed.csv' 32 | ``` 33 | 34 | - KNN Imputation [[description]](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) 35 | ```Python 36 | # Sample Usage 37 | python imputation.py --data_path='./dataset/ecg_mitbih_test.csv' \ 38 | --option='knn' \ 39 | --n_neighbors=5 \ 40 | --output_path='./dataset/imputed_data/ecg_mitbih_test_imputed.csv' 41 | ``` 42 | - MICE Imputation [[description]](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn-impute-iterativeimputer) 43 | ```Python 44 | # Sample Usage 45 | python imputation.py --data_path='./dataset/ecg_mitbih_test.csv' \ 46 | --option='mice' \ 47 | --strategy='mean' \ 48 | --output_path='./dataset/imputed_data/ecg_mitbih_test_imputed.csv' 49 | ``` 50 | #### 2.1.2 Testing the imputation module by adding random NAs to a temporary dataset 51 | Add the `--test_module` argument to the command line to test the module. 52 | If the `--test_module` argument is given, `imputation.py` automatically adds random NAs to the dataset and then imputes the missing values. 
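The test flow described above (mask random entries, then impute them) can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `imputation.py`, which relies on pandas and scikit-learn's `SimpleImputer`. The helper names `add_random_nas` and `impute_mean` are hypothetical, and the sketch assumes every column keeps at least one observed value.

```python
import math
import random


def add_random_nas(rows, ratio, seed=0):
    """Return a copy of `rows` where int(len(rows) * ratio) entries
    of every column are replaced with NaN (hypothetical helper)."""
    rng = random.Random(seed)
    noisy = [list(row) for row in rows]
    n_rows = len(noisy)
    n_missing = int(n_rows * ratio)
    for col in range(len(noisy[0])):
        # sample() picks distinct row indices, so exactly n_missing cells are masked
        for r in rng.sample(range(n_rows), n_missing):
            noisy[r][col] = math.nan
    return noisy


def impute_mean(rows):
    """Replace each NaN with the mean of the observed values in its column.
    Assumes every column has at least one observed value."""
    filled = [list(row) for row in rows]
    for col in range(len(filled[0])):
        observed = [row[col] for row in filled if not math.isnan(row[col])]
        mean = sum(observed) / len(observed)
        for row in filled:
            if math.isnan(row[col]):
                row[col] = mean
    return filled


# Toy data: 20 rows, 2 numeric columns
data = [[float(i), float(i * 2)] for i in range(20)]
noisy = add_random_nas(data, ratio=0.2)
imputed = impute_mean(noisy)
```

Comparing `imputed` against the untouched `data` gives a rough check of how well a strategy recovers the masked values, which is the purpose of the `--test_module` switch.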
53 | ```Python 54 | # Sample Usage 55 | python imputation.py --data_path='./dataset/ecg_mitbih_test.csv' \ 56 | --option='simple' \ 57 | --strategy='mean' \ 58 | --output_path='./dataset/imputed_data/ecg_mitbih_test_imputed.csv' \ 59 | --test_module 60 | ``` 61 | 62 | ### 2.2 Seasonal Trend Decomposition and Prediction (STL) 63 | #### 2.2.1 Seasonal Trend Detection using Seasonal-Trend LOESS (STL) 64 | - STL Decomposition [[description]](https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.DecomposeResult.html#statsmodels.tsa.seasonal.DecomposeResult) 65 | 66 | #### 2.2.2 Diagnosis of Patterns in Time-Series Data 67 | - Auto Arima [[description]](https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html) 68 | ```Python 69 | 70 | # Sample Usage 71 | python seasonal_trend_decomposition.py --data_path='./dataset/machine_temperature_system_failure.csv' \ 72 | --seasonal_output_path='./dataset/decomposed_data/seasonal_decomposed.csv' \ 73 | --trend_output_path='./dataset/decomposed_data/trend_decomposed.csv' 74 | ``` 75 | 76 | ### 2.3 Synchronization using DTW and soft-DTW 77 | - DTW [[description]](https://tslearn.readthedocs.io/en/stable/gen_modules/metrics/tslearn.metrics.dtw.html) 78 | - soft-DTW [[description]](https://tslearn.readthedocs.io/en/stable/gen_modules/metrics/tslearn.metrics.soft_dtw.html) 79 | ```Python 80 | 81 | # Sample Usage 82 | python synchronization.py --data_path='./dataset/power_voltage.csv' \ 83 | --dtw_output_path='./dataset/synchronized_data/synchronized_dtw.csv' \ 84 | --plot_output_path='./dataset/synchronized_data' \ 85 | --option='dtw' \ 86 | --distance=2 87 | ``` 88 | 89 | -------------------------------------------------------------------------------- /dataset/power_voltage.csv: -------------------------------------------------------------------------------- 1 | DateTime,Power,Voltage 2 | 01-01-20 12:00,31,45 3 | 01-01-20 13:00,98,78 4 | 01-01-20 14:00,57,25 5 | 01-01-20 15:00,1,82 6 | 01-01-20 16:00,30,47 7 | 01-01-20 
17:00,41,0 8 | 01-01-20 18:00,81,25 9 | 01-01-20 19:00,83,35 10 | 01-01-20 20:00,83,67 11 | 01-01-20 21:00,8,70 12 | 01-01-20 22:00,18,70 13 | 01-01-20 23:00,57,7 14 | 02-01-20 0:00,26,15 15 | 02-01-20 1:00,32,47 16 | 02-01-20 2:00,72,22 17 | 02-01-20 3:00,70,27 18 | 02-01-20 4:00,56,60 19 | 02-01-20 5:00,72,57 20 | 02-01-20 6:00,63,47 21 | 02-01-20 7:00,76,60 22 | 02-01-20 8:00,46,52 23 | 02-01-20 9:00,86,62 24 | 02-01-20 10:00,61,37 25 | 02-01-20 11:00,19,72 26 | 02-01-20 12:00,14,50 27 | 02-01-20 13:00,53,15 28 | 02-01-20 14:00,72,12 29 | 02-01-20 15:00,80,45 30 | 02-01-20 16:00,85,60 31 | 02-01-20 17:00,91,67 32 | 02-01-20 18:00,1,70 33 | 02-01-20 19:00,45,75 34 | 02-01-20 20:00,68,0 35 | 02-01-20 21:00,38,37 36 | 02-01-20 22:00,70,57 37 | 02-01-20 23:00,9,32 38 | 03-01-20 0:00,69,57 39 | 03-01-20 1:00,80,7 40 | 03-01-20 2:00,11,57 41 | 03-01-20 3:00,89,67 42 | 03-01-20 4:00,15,10 43 | 03-01-20 5:00,3,75 44 | 03-01-20 6:00,63,12 45 | 03-01-20 7:00,1,2 46 | 03-01-20 8:00,26,52 47 | 03-01-20 9:00,45,0 48 | 03-01-20 10:00,14,22 49 | 03-01-20 11:00,4,37 50 | 03-01-20 12:00,73,12 51 | 03-01-20 13:00,32,2 52 | 03-01-20 14:00,4,60 53 | 03-01-20 15:00,15,27 54 | 03-01-20 16:00,17,2 55 | 03-01-20 17:00,89,12 56 | 03-01-20 18:00,14,15 57 | 03-01-20 19:00,44,75 58 | 03-01-20 20:00,73,12 59 | 03-01-20 21:00,66,37 60 | 03-01-20 22:00,84,60 61 | 03-01-20 23:00,59,55 62 | 04-01-20 0:00,2,70 63 | 04-01-20 1:00,27,50 64 | 04-01-20 2:00,7,2 65 | 04-01-20 3:00,6,22 66 | 04-01-20 4:00,30,5 67 | 04-01-20 5:00,82,5 68 | 04-01-20 6:00,24,25 69 | 04-01-20 7:00,69,67 70 | 04-01-20 8:00,87,20 71 | 04-01-20 9:00,30,57 72 | 04-01-20 10:00,27,72 73 | 04-01-20 11:00,70,25 74 | 04-01-20 12:00,46,22 75 | 04-01-20 13:00,82,57 76 | 04-01-20 14:00,95,37 77 | 04-01-20 15:00,45,67 78 | 04-01-20 16:00,49,80 79 | 04-01-20 17:00,37,37 80 | 04-01-20 18:00,30,40 81 | 04-01-20 19:00,98,30 82 | 04-01-20 20:00,74,25 83 | 04-01-20 21:00,4,82 84 | 04-01-20 22:00,6,62 85 | 04-01-20 23:00,98,2 86 | 05-01-20 
0:00,6,5 87 | 05-01-20 1:00,26,82 88 | 05-01-20 2:00,25,5 89 | 05-01-20 3:00,97,22 90 | 05-01-20 4:00,40,20 91 | 05-01-20 5:00,33,80 92 | 05-01-20 6:00,21,32 93 | 05-01-20 7:00,29,27 94 | 05-01-20 8:00,1,17 95 | 05-01-20 9:00,12,25 96 | 05-01-20 10:00,18,0 97 | 05-01-20 11:00,7,10 98 | 05-01-20 12:00,58,15 99 | 05-01-20 13:00,13,5 100 | 05-01-20 14:00,42,47 101 | 05-01-20 15:00,1,10 102 | 05-01-20 15:59,44,35 103 | 05-01-20 16:59,7,0 104 | 05-01-20 17:59,54,37 105 | 05-01-20 18:59,70,5 106 | 05-01-20 19:59,67,45 107 | 05-01-20 20:59,29,57 108 | 05-01-20 21:59,59,55 109 | 05-01-20 22:59,73,25 110 | 05-01-20 23:59,5,50 111 | 06-01-20 0:59,30,60 112 | 06-01-20 1:59,30,5 113 | 06-01-20 2:59,14,25 114 | 06-01-20 3:59,85,25 115 | 06-01-20 4:59,38,12 116 | 06-01-20 5:59,74,70 117 | 06-01-20 6:59,43,32 118 | 06-01-20 7:59,16,62 119 | 06-01-20 8:59,61,35 120 | 06-01-20 9:59,34,12 121 | 06-01-20 10:59,5,50 122 | 06-01-20 11:59,74,27 123 | 06-01-20 12:59,89,5 124 | 06-01-20 13:59,8,62 125 | 06-01-20 14:59,80,75 126 | 06-01-20 15:59,14,7 127 | 06-01-20 16:59,11,67 128 | 06-01-20 17:59,74,12 129 | 06-01-20 18:59,62,10 130 | 06-01-20 19:59,99,62 131 | 06-01-20 20:59,86,52 132 | 06-01-20 21:59,93,82 133 | 06-01-20 22:59,81,72 134 | 06-01-20 23:59,62,77 135 | 07-01-20 0:59,40,67 136 | 07-01-20 1:59,10,52 137 | 07-01-20 2:59,12,32 138 | 07-01-20 3:59,43,7 139 | 07-01-20 4:59,43,10 140 | 07-01-20 5:59,28,35 141 | 07-01-20 6:59,89,35 142 | 07-01-20 7:59,29,22 143 | 07-01-20 8:59,48,75 144 | 07-01-20 9:59,1,25 145 | 07-01-20 10:59,72,40 146 | 07-01-20 11:59,30,0 147 | 07-01-20 12:59,20,60 148 | 07-01-20 13:59,81,25 149 | 07-01-20 14:59,18,17 150 | 07-01-20 15:59,29,67 151 | 07-01-20 16:59,39,15 152 | 07-01-20 17:59,59,25 153 | 07-01-20 18:59,11,32 154 | 07-01-20 19:59,83,50 155 | 07-01-20 20:59,89,10 156 | 07-01-20 21:59,98,70 157 | 07-01-20 22:59,17,75 158 | 07-01-20 23:59,75,82 159 | 08-01-20 0:59,65,15 160 | 08-01-20 1:59,44,62 161 | 08-01-20 2:59,8,55 162 | 08-01-20 3:59,69,37 163 
| 08-01-20 4:59,53,7 164 | 08-01-20 5:59,28,57 165 | 08-01-20 6:59,68,45 166 | 08-01-20 7:59,33,22 167 | 08-01-20 8:59,58,57 168 | 08-01-20 9:59,52,27 169 | 08-01-20 10:59,86,47 170 | 08-01-20 11:59,66,42 171 | 08-01-20 12:59,63,72 172 | 08-01-20 13:59,85,55 173 | 08-01-20 14:59,45,52 174 | 08-01-20 15:59,36,70 175 | 08-01-20 16:59,8,37 176 | 08-01-20 17:59,63,30 177 | 08-01-20 18:59,47,7 178 | 08-01-20 19:59,73,52 179 | 08-01-20 20:59,62,40 180 | 08-01-20 21:59,61,60 181 | 08-01-20 22:59,68,52 182 | 08-01-20 23:59,19,50 183 | 09-01-20 0:59,24,57 184 | 09-01-20 1:59,43,15 185 | 09-01-20 2:59,21,20 186 | 09-01-20 3:59,51,35 187 | 09-01-20 4:59,97,17 188 | 09-01-20 5:59,97,42 189 | 09-01-20 6:59,70,80 190 | 09-01-20 7:59,58,80 191 | 09-01-20 8:59,38,57 192 | 09-01-20 9:59,85,47 193 | 09-01-20 10:59,16,32 194 | 09-01-20 11:59,53,70 195 | 09-01-20 12:59,99,12 196 | 09-01-20 13:59,42,45 197 | 09-01-20 14:59,38,82 198 | 09-01-20 15:59,35,35 199 | 09-01-20 16:59,80,32 200 | 09-01-20 17:59,54,30 201 | 09-01-20 18:59,19,67 202 | 09-01-20 19:59,60,45 203 | 09-01-20 20:59,77,15 204 | 09-01-20 21:59,95,50 205 | 09-01-20 22:59,32,65 206 | 09-01-20 23:59,30,80 207 | 10-01-20 0:59,16,27 208 | 10-01-20 1:59,21,25 209 | 10-01-20 2:59,73,12 210 | 10-01-20 3:59,54,17 211 | 10-01-20 4:59,69,60 212 | 10-01-20 5:59,86,45 213 | 10-01-20 6:59,17,57 214 | 10-01-20 7:59,18,72 215 | 10-01-20 8:59,32,15 216 | 10-01-20 9:59,30,15 217 | 10-01-20 10:59,39,27 218 | 10-01-20 11:59,51,25 219 | 10-01-20 12:59,91,32 220 | 10-01-20 13:59,99,42 221 | 10-01-20 14:59,15,75 222 | 10-01-20 15:59,86,82 223 | 10-01-20 16:59,85,12 224 | 10-01-20 17:59,83,72 225 | 10-01-20 18:59,4,70 226 | 10-01-20 19:59,60,70 227 | 10-01-20 20:59,34,2 228 | 10-01-20 21:59,95,50 229 | 10-01-20 22:59,49,27 230 | 10-01-20 23:59,86,80 231 | 11-01-20 0:59,64,40 232 | 11-01-20 1:59,24,72 233 | 11-01-20 2:59,9,52 234 | 11-01-20 3:59,59,20 235 | 11-01-20 4:59,64,7 236 | 11-01-20 5:59,59,50 237 | 11-01-20 6:59,6,52 238 | 11-01-20 
7:59,67,50 239 | 11-01-20 8:59,31,5 240 | 11-01-20 9:59,36,55 241 | 11-01-20 10:59,83,25 242 | 11-01-20 11:59,94,30 243 | 11-01-20 12:59,64,70 244 | 11-01-20 13:59,4,77 245 | 11-01-20 14:59,44,52 246 | 11-01-20 15:59,73,2 247 | 11-01-20 16:59,94,37 248 | 11-01-20 17:59,12,60 249 | 11-01-20 18:59,18,77 250 | 11-01-20 19:59,1,10 251 | 11-01-20 20:59,50,15 252 | 11-01-20 21:59,49,0 253 | 11-01-20 22:59,58,42 254 | 11-01-20 23:59,50,40 255 | 12-01-20 0:59,51,47 256 | 12-01-20 1:59,25,42 257 | 12-01-20 2:59,29,42 258 | 12-01-20 3:59,89,20 259 | 12-01-20 4:59,72,25 260 | 12-01-20 5:59,68,75 261 | 12-01-20 6:59,89,60 262 | 12-01-20 7:59,32,57 263 | 12-01-20 8:59,66,75 264 | 12-01-20 9:59,51,27 265 | 12-01-20 10:59,46,55 266 | 12-01-20 11:59,80,42 267 | 12-01-20 12:59,29,37 268 | 12-01-20 13:59,74,67 269 | 12-01-20 14:59,24,25 270 | 12-01-20 15:59,12,62 271 | 12-01-20 16:59,15,20 272 | 12-01-20 17:59,93,10 273 | 12-01-20 18:59,85,12 274 | 12-01-20 19:59,63,77 275 | 12-01-20 20:59,0,70 276 | 12-01-20 21:59,48,52 277 | 12-01-20 22:59,93,0 278 | 12-01-20 23:59,55,40 279 | 13-01-20 0:59,6,77 280 | 13-01-20 1:59,82,45 281 | 13-01-20 2:59,27,5 282 | 13-01-20 3:59,24,67 283 | 13-01-20 4:59,65,22 284 | 13-01-20 5:59,6,20 285 | 13-01-20 6:59,53,55 286 | 13-01-20 7:59,34,5 287 | 13-01-20 8:59,66,45 288 | 13-01-20 9:59,28,27 289 | 13-01-20 10:59,9,55 290 | 13-01-20 11:59,69,22 291 | 13-01-20 12:59,36,7 292 | 13-01-20 13:59,83,57 293 | 13-01-20 14:59,70,30 294 | 13-01-20 15:59,10,70 295 | 13-01-20 16:59,42,57 296 | 13-01-20 17:59,3,7 297 | 13-01-20 18:59,73,35 298 | 13-01-20 19:59,1,2 299 | 13-01-20 20:59,10,60 300 | 13-01-20 21:59,96,0 301 | 13-01-20 22:59,7,7 302 | 13-01-20 23:59,12,80 303 | 14-01-20 0:59,42,5 304 | 14-01-20 1:59,6,10 305 | 14-01-20 2:59,10,35 306 | 14-01-20 3:59,92,5 307 | 14-01-20 4:59,90,7 308 | 14-01-20 5:59,85,77 309 | 14-01-20 6:59,84,75 310 | 14-01-20 7:59,81,70 311 | 14-01-20 8:59,63,70 312 | 14-01-20 9:59,7,67 313 | 14-01-20 10:59,7,52 314 | 14-01-20 
11:59,44,5 315 | 14-01-20 12:59,62,5 316 | 14-01-20 13:59,45,37 317 | 14-01-20 14:59,31,52 318 | 14-01-20 15:59,84,37 319 | 14-01-20 16:59,48,25 320 | 14-01-20 17:59,72,70 321 | 14-01-20 18:59,46,40 322 | 14-01-20 19:59,23,60 323 | 14-01-20 20:59,78,37 324 | 14-01-20 21:59,65,20 325 | 14-01-20 22:59,52,65 326 | 14-01-20 23:59,2,55 327 | 15-01-20 0:59,69,42 328 | 15-01-20 1:59,18,2 329 | 15-01-20 2:59,45,57 330 | 15-01-20 3:59,98,15 331 | 15-01-20 4:59,95,37 332 | 15-01-20 5:59,34,82 333 | 15-01-20 6:59,6,80 334 | 15-01-20 7:59,74,27 335 | 15-01-20 8:59,22,5 336 | 15-01-20 9:59,86,62 337 | 15-01-20 10:59,41,17 338 | 15-01-20 11:59,66,72 339 | 15-01-20 12:59,10,35 340 | 15-01-20 13:59,59,55 341 | 15-01-20 14:59,25,7 342 | 15-01-20 15:59,39,50 343 | 15-01-20 16:59,81,20 344 | 15-01-20 17:59,49,32 345 | 15-01-20 18:59,29,67 346 | 15-01-20 19:59,17,40 347 | 15-01-20 20:59,79,25 348 | 15-01-20 21:59,85,15 349 | 15-01-20 22:59,70,65 350 | 15-01-20 23:59,38,70 351 | 16-01-20 0:59,24,57 352 | 16-01-20 1:59,61,32 353 | 16-01-20 2:59,69,20 354 | 16-01-20 3:59,68,50 355 | 16-01-20 4:59,59,57 356 | 16-01-20 5:59,68,57 357 | 16-01-20 6:59,48,50 358 | 16-01-20 7:59,82,57 359 | 16-01-20 8:59,44,40 360 | 16-01-20 9:59,4,67 361 | 16-01-20 10:59,7,37 362 | 16-01-20 11:59,38,2 363 | 16-01-20 12:59,25,5 364 | 16-01-20 13:59,19,32 365 | 16-01-20 14:59,37,20 366 | 16-01-20 15:59,89,15 367 | 16-01-20 16:59,23,30 368 | ,,75 369 | ,,20 370 | -------------------------------------------------------------------------------- /dataset/synchronized_data/Original plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ClustProject/KUDataPreprocessing/e8804c08107bff23b5092f95149bc0c5c0f88ec3/dataset/synchronized_data/Original plot.png -------------------------------------------------------------------------------- /dataset/synchronized_data/Synchronized plot.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ClustProject/KUDataPreprocessing/e8804c08107bff23b5092f95149bc0c5c0f88ec3/dataset/synchronized_data/Synchronized plot.png -------------------------------------------------------------------------------- /dataset/synchronized_data/synchronized_dtw.csv: -------------------------------------------------------------------------------- 1 | Time,X,Y 2 | 01-01-20 12:00,31.0,45 3 | 01-01-20 13:00,98.0,82 4 | 01-01-20 14:00,57.0,47 5 | 01-01-20 15:00,1.0,0 6 | 01-01-20 16:00,30.0,25 7 | 01-01-20 17:00,41.0,35 8 | 01-01-20 18:00,81.0,67 9 | 01-01-20 19:00,83.0,70 10 | 01-01-20 20:00,83.0,70 11 | 01-01-20 21:00,8.0,7 12 | 01-01-20 22:00,18.0,15 13 | 01-01-20 23:00,57.0,47 14 | 02-01-20 0:00,26.0,22 15 | 02-01-20 10:00,61.0,50 16 | 02-01-20 11:00,19.0,15 17 | 02-01-20 12:00,14.0,12 18 | 02-01-20 13:00,53.0,45 19 | 02-01-20 14:00,72.0,67 20 | 02-01-20 15:00,80.0,75 21 | 02-01-20 16:00,85.0,75 22 | 02-01-20 17:00,91.0,75 23 | 02-01-20 18:00,1.0,0 24 | 02-01-20 19:00,45.0,37 25 | 02-01-20 1:00,32.0,27 26 | 02-01-20 20:00,68.0,57 27 | 02-01-20 21:00,38.0,32 28 | 02-01-20 22:00,70.0,57 29 | 02-01-20 23:00,9.0,7 30 | 02-01-20 2:00,72.0,60 31 | 02-01-20 3:00,70.0,60 32 | 02-01-20 4:00,56.0,57 33 | 02-01-20 5:00,72.0,62 34 | 02-01-20 6:00,63.0,62 35 | 02-01-20 7:00,76.0,62 36 | 02-01-20 8:00,46.0,37 37 | 02-01-20 9:00,86.0,72 38 | 03-01-20 0:00,69.0,57 39 | 03-01-20 10:00,14.0,12 40 | 03-01-20 11:00,4.0,2 41 | 03-01-20 12:00,73.0,60 42 | 03-01-20 13:00,32.0,27 43 | 03-01-20 14:00,4.0,2 44 | 03-01-20 15:00,15.0,12 45 | 03-01-20 16:00,17.0,15 46 | 03-01-20 17:00,89.0,75 47 | 03-01-20 18:00,14.0,12 48 | 03-01-20 19:00,44.0,37 49 | 03-01-20 1:00,80.0,67 50 | 03-01-20 20:00,73.0,60 51 | 03-01-20 21:00,66.0,55 52 | 03-01-20 22:00,84.0,70 53 | 03-01-20 23:00,59.0,50 54 | 03-01-20 2:00,11.0,10 55 | 03-01-20 3:00,89.0,75 56 | 03-01-20 4:00,15.0,12 57 | 03-01-20 5:00,3.0,2 58 | 
03-01-20 6:00,63.0,52 59 | 03-01-20 7:00,1.0,0 60 | 03-01-20 8:00,26.0,22 61 | 03-01-20 9:00,45.0,37 62 | 04-01-20 0:00,2.0,2 63 | 04-01-20 10:00,27.0,22 64 | 04-01-20 11:00,70.0,57 65 | 04-01-20 12:00,46.0,37 66 | 04-01-20 13:00,82.0,67 67 | 04-01-20 14:00,95.0,80 68 | 04-01-20 15:00,45.0,37 69 | 04-01-20 16:00,49.0,40 70 | 04-01-20 17:00,37.0,40 71 | 04-01-20 18:00,30.0,30 72 | 04-01-20 19:00,98.0,82 73 | 04-01-20 1:00,27.0,22 74 | 04-01-20 20:00,74.0,62 75 | 04-01-20 21:00,4.0,2 76 | 04-01-20 22:00,6.0,5 77 | 04-01-20 23:00,98.0,82 78 | 04-01-20 2:00,7.0,5 79 | 04-01-20 3:00,6.0,5 80 | 04-01-20 4:00,30.0,25 81 | 04-01-20 5:00,82.0,67 82 | 04-01-20 6:00,24.0,20 83 | 04-01-20 7:00,69.0,57 84 | 04-01-20 8:00,87.0,72 85 | 04-01-20 9:00,30.0,25 86 | 05-01-20 0:00,6.0,5 87 | 05-01-20 10:00,18.0,15 88 | 05-01-20 11:00,7.0,5 89 | 05-01-20 12:00,58.0,47 90 | 05-01-20 13:00,13.0,10 91 | 05-01-20 14:00,42.0,35 92 | 05-01-20 15:00,1.0,0 93 | 05-01-20 15:59,44.0,37 94 | 05-01-20 16:59,7.0,5 95 | 05-01-20 17:59,54.0,45 96 | 05-01-20 18:59,70.0,57 97 | 05-01-20 19:59,67.0,55 98 | 05-01-20 1:00,26.0,22 99 | 05-01-20 20:59,29.0,25 100 | 05-01-20 21:59,59.0,50 101 | 05-01-20 22:59,73.0,60 102 | 05-01-20 23:59,5.0,5 103 | 05-01-20 2:00,25.0,20 104 | 05-01-20 3:00,97.0,80 105 | 05-01-20 4:00,40.0,32 106 | 05-01-20 5:00,33.0,27 107 | 05-01-20 6:00,21.0,17 108 | 05-01-20 7:00,29.0,25 109 | 05-01-20 8:00,1.0,0 110 | 05-01-20 9:00,12.0,10 111 | 06-01-20 0:59,30.0,25 112 | 06-01-20 10:59,5.0,5 113 | 06-01-20 11:59,74.0,62 114 | 06-01-20 12:59,89.0,75 115 | 06-01-20 13:59,8.0,7 116 | 06-01-20 14:59,80.0,67 117 | 06-01-20 15:59,14.0,12 118 | 06-01-20 16:59,11.0,10 119 | 06-01-20 17:59,74.0,62 120 | 06-01-20 18:59,62.0,62 121 | 06-01-20 19:59,99.0,82 122 | 06-01-20 1:59,30.0,25 123 | 06-01-20 20:59,86.0,82 124 | 06-01-20 21:59,93.0,82 125 | 06-01-20 22:59,81.0,72 126 | 06-01-20 23:59,62.0,67 127 | 06-01-20 2:59,14.0,12 128 | 06-01-20 3:59,85.0,70 129 | 06-01-20 4:59,38.0,32 130 | 06-01-20 
5:59,74.0,62 131 | 06-01-20 6:59,43.0,35 132 | 06-01-20 7:59,16.0,12 133 | 06-01-20 8:59,61.0,50 134 | 06-01-20 9:59,34.0,27 135 | 07-01-20 0:59,40.0,32 136 | 07-01-20 10:59,72.0,60 137 | 07-01-20 11:59,30.0,25 138 | 07-01-20 12:59,20.0,17 139 | 07-01-20 13:59,81.0,67 140 | 07-01-20 14:59,18.0,15 141 | 07-01-20 15:59,29.0,25 142 | 07-01-20 16:59,39.0,32 143 | 07-01-20 17:59,59.0,50 144 | 07-01-20 18:59,11.0,10 145 | 07-01-20 19:59,83.0,70 146 | 07-01-20 1:59,10.0,7 147 | 07-01-20 20:59,89.0,75 148 | 07-01-20 21:59,98.0,82 149 | 07-01-20 22:59,17.0,15 150 | 07-01-20 23:59,75.0,62 151 | 07-01-20 2:59,12.0,10 152 | 07-01-20 3:59,43.0,35 153 | 07-01-20 4:59,43.0,35 154 | 07-01-20 5:59,28.0,22 155 | 07-01-20 6:59,89.0,75 156 | 07-01-20 7:59,29.0,25 157 | 07-01-20 8:59,48.0,40 158 | 07-01-20 9:59,1.0,0 159 | 08-01-20 0:59,65.0,55 160 | 08-01-20 10:59,86.0,72 161 | 08-01-20 11:59,66.0,55 162 | 08-01-20 12:59,63.0,52 163 | 08-01-20 13:59,85.0,70 164 | 08-01-20 14:59,45.0,37 165 | 08-01-20 15:59,36.0,30 166 | 08-01-20 16:59,8.0,7 167 | 08-01-20 17:59,63.0,52 168 | 08-01-20 18:59,47.0,40 169 | 08-01-20 19:59,73.0,60 170 | 08-01-20 1:59,44.0,37 171 | 08-01-20 20:59,62.0,52 172 | 08-01-20 21:59,61.0,50 173 | 08-01-20 22:59,68.0,57 174 | 08-01-20 23:59,19.0,15 175 | 08-01-20 2:59,8.0,7 176 | 08-01-20 3:59,69.0,57 177 | 08-01-20 4:59,53.0,45 178 | 08-01-20 5:59,28.0,22 179 | 08-01-20 6:59,68.0,57 180 | 08-01-20 7:59,33.0,27 181 | 08-01-20 8:59,58.0,47 182 | 08-01-20 9:59,52.0,42 183 | 09-01-20 0:59,24.0,20 184 | 09-01-20 10:59,16.0,12 185 | 09-01-20 11:59,53.0,45 186 | 09-01-20 12:59,99.0,82 187 | 09-01-20 13:59,42.0,35 188 | 09-01-20 14:59,38.0,35 189 | 09-01-20 15:59,35.0,35 190 | 09-01-20 16:59,80.0,67 191 | 09-01-20 17:59,54.0,45 192 | 09-01-20 18:59,19.0,15 193 | 09-01-20 19:59,60.0,50 194 | 09-01-20 1:59,43.0,35 195 | 09-01-20 20:59,77.0,80 196 | 09-01-20 21:59,95.0,80 197 | 09-01-20 22:59,32.0,27 198 | 09-01-20 23:59,30.0,25 199 | 09-01-20 2:59,21.0,17 200 | 09-01-20 
3:59,51.0,42 201 | 09-01-20 4:59,97.0,80 202 | 09-01-20 5:59,97.0,80 203 | 09-01-20 6:59,70.0,80 204 | 09-01-20 7:59,58.0,57 205 | 09-01-20 8:59,38.0,47 206 | 09-01-20 9:59,85.0,70 207 | 10-01-20 0:59,16.0,12 208 | 10-01-20 10:59,39.0,42 209 | 10-01-20 11:59,51.0,42 210 | 10-01-20 12:59,91.0,75 211 | 10-01-20 13:59,99.0,82 212 | 10-01-20 14:59,15.0,12 213 | 10-01-20 15:59,86.0,72 214 | 10-01-20 16:59,85.0,70 215 | 10-01-20 17:59,83.0,70 216 | 10-01-20 18:59,4.0,2 217 | 10-01-20 19:59,60.0,50 218 | 10-01-20 1:59,21.0,17 219 | 10-01-20 20:59,34.0,27 220 | 10-01-20 21:59,95.0,80 221 | 10-01-20 22:59,49.0,40 222 | 10-01-20 23:59,86.0,72 223 | 10-01-20 2:59,73.0,60 224 | 10-01-20 3:59,54.0,45 225 | 10-01-20 4:59,69.0,72 226 | 10-01-20 5:59,86.0,72 227 | 10-01-20 6:59,17.0,15 228 | 10-01-20 7:59,18.0,15 229 | 10-01-20 8:59,32.0,27 230 | 10-01-20 9:59,30.0,25 231 | 11-01-20 0:59,64.0,52 232 | 11-01-20 10:59,83.0,70 233 | 11-01-20 11:59,94.0,77 234 | 11-01-20 12:59,64.0,52 235 | 11-01-20 13:59,4.0,2 236 | 11-01-20 14:59,44.0,37 237 | 11-01-20 15:59,73.0,60 238 | 11-01-20 16:59,94.0,77 239 | 11-01-20 17:59,12.0,10 240 | 11-01-20 18:59,18.0,15 241 | 11-01-20 19:59,1.0,0 242 | 11-01-20 1:59,24.0,20 243 | 11-01-20 20:59,50.0,42 244 | 11-01-20 21:59,49.0,40 245 | 11-01-20 22:59,58.0,47 246 | 11-01-20 23:59,50.0,42 247 | 11-01-20 2:59,9.0,7 248 | 11-01-20 3:59,59.0,50 249 | 11-01-20 4:59,64.0,52 250 | 11-01-20 5:59,59.0,50 251 | 11-01-20 6:59,6.0,5 252 | 11-01-20 7:59,67.0,55 253 | 11-01-20 8:59,31.0,25 254 | 11-01-20 9:59,36.0,30 255 | 12-01-20 0:59,51.0,42 256 | 12-01-20 10:59,46.0,42 257 | 12-01-20 11:59,80.0,67 258 | 12-01-20 12:59,29.0,25 259 | 12-01-20 13:59,74.0,62 260 | 12-01-20 14:59,24.0,20 261 | 12-01-20 15:59,12.0,10 262 | 12-01-20 16:59,15.0,12 263 | 12-01-20 17:59,93.0,77 264 | 12-01-20 18:59,85.0,77 265 | 12-01-20 19:59,63.0,70 266 | 12-01-20 1:59,25.0,20 267 | 12-01-20 20:59,0.0,0 268 | 12-01-20 21:59,48.0,40 269 | 12-01-20 22:59,93.0,77 270 | 12-01-20 
23:59,55.0,45 271 | 12-01-20 2:59,29.0,25 272 | 12-01-20 3:59,89.0,75 273 | 12-01-20 4:59,72.0,75 274 | 12-01-20 5:59,68.0,60 275 | 12-01-20 6:59,89.0,75 276 | 12-01-20 7:59,32.0,27 277 | 12-01-20 8:59,66.0,55 278 | 12-01-20 9:59,51.0,55 279 | 13-01-20 0:59,6.0,5 280 | 13-01-20 10:59,9.0,7 281 | 13-01-20 11:59,69.0,57 282 | 13-01-20 12:59,36.0,30 283 | 13-01-20 13:59,83.0,70 284 | 13-01-20 14:59,70.0,70 285 | 13-01-20 15:59,10.0,7 286 | 13-01-20 16:59,42.0,35 287 | 13-01-20 17:59,3.0,2 288 | 13-01-20 18:59,73.0,60 289 | 13-01-20 19:59,1.0,0 290 | 13-01-20 1:59,82.0,67 291 | 13-01-20 20:59,10.0,7 292 | 13-01-20 21:59,96.0,80 293 | 13-01-20 22:59,7.0,5 294 | 13-01-20 23:59,12.0,10 295 | 13-01-20 2:59,27.0,22 296 | 13-01-20 3:59,24.0,20 297 | 13-01-20 4:59,65.0,55 298 | 13-01-20 5:59,6.0,5 299 | 13-01-20 6:59,53.0,45 300 | 13-01-20 7:59,34.0,27 301 | 13-01-20 8:59,66.0,55 302 | 13-01-20 9:59,28.0,22 303 | 14-01-20 0:59,42.0,35 304 | 14-01-20 10:59,7.0,5 305 | 14-01-20 11:59,44.0,37 306 | 14-01-20 12:59,62.0,52 307 | 14-01-20 13:59,45.0,37 308 | 14-01-20 14:59,31.0,25 309 | 14-01-20 15:59,84.0,70 310 | 14-01-20 16:59,48.0,40 311 | 14-01-20 17:59,72.0,60 312 | 14-01-20 18:59,46.0,37 313 | 14-01-20 19:59,23.0,20 314 | 14-01-20 1:59,6.0,5 315 | 14-01-20 20:59,78.0,65 316 | 14-01-20 21:59,65.0,65 317 | 14-01-20 22:59,52.0,55 318 | 14-01-20 23:59,2.0,2 319 | 14-01-20 2:59,10.0,7 320 | 14-01-20 3:59,92.0,77 321 | 14-01-20 4:59,90.0,77 322 | 14-01-20 5:59,85.0,77 323 | 14-01-20 6:59,84.0,77 324 | 14-01-20 7:59,81.0,75 325 | 14-01-20 8:59,63.0,70 326 | 14-01-20 9:59,7.0,5 327 | 15-01-20 0:59,69.0,57 328 | 15-01-20 10:59,41.0,35 329 | 15-01-20 11:59,66.0,55 330 | 15-01-20 12:59,10.0,7 331 | 15-01-20 13:59,59.0,50 332 | 15-01-20 14:59,25.0,20 333 | 15-01-20 15:59,39.0,32 334 | 15-01-20 16:59,81.0,67 335 | 15-01-20 17:59,49.0,40 336 | 15-01-20 18:59,29.0,25 337 | 15-01-20 19:59,17.0,15 338 | 15-01-20 1:59,18.0,15 339 | 15-01-20 20:59,79.0,65 340 | 15-01-20 21:59,85.0,70 341 | 
15-01-20 22:59,70.0,70 342 | 15-01-20 23:59,38.0,32 343 | 15-01-20 2:59,45.0,37 344 | 15-01-20 3:59,98.0,82 345 | 15-01-20 4:59,95.0,80 346 | 15-01-20 5:59,34.0,27 347 | 15-01-20 6:59,6.0,5 348 | 15-01-20 7:59,74.0,62 349 | 15-01-20 8:59,22.0,17 350 | 15-01-20 9:59,86.0,72 351 | 16-01-20 0:59,24.0,20 352 | 16-01-20 10:59,7.0,5 353 | 16-01-20 11:59,38.0,32 354 | 16-01-20 12:59,25.0,20 355 | 16-01-20 13:59,19.0,15 356 | 16-01-20 14:59,37.0,30 357 | 16-01-20 15:59,89.0,75 358 | 16-01-20 16:59,23.0,20 359 | 16-01-20 1:59,61.0,50 360 | 16-01-20 2:59,69.0,57 361 | 16-01-20 3:59,68.0,57 362 | 16-01-20 4:59,59.0,50 363 | 16-01-20 5:59,68.0,57 364 | 16-01-20 6:59,48.0,40 365 | 16-01-20 7:59,82.0,67 366 | 16-01-20 8:59,44.0,37 367 | 16-01-20 9:59,4.0,2 368 | -------------------------------------------------------------------------------- /dtw.py: -------------------------------------------------------------------------------- 1 | import numbers 2 | import numpy as np 3 | from collections import defaultdict 4 | 5 | def __difference(a, b): 6 | return abs(a - b) 7 | 8 | def __norm(p): 9 | return lambda a, b: np.linalg.norm(np.atleast_1d(a) - np.atleast_1d(b), p) 10 | 11 | def __prep_inputs(x, y, dist): 12 | x = np.asanyarray(x, dtype='float') 13 | y = np.asanyarray(y, dtype='float') 14 | 15 | if x.ndim == y.ndim > 1 and x.shape[1] != y.shape[1]: 16 | raise ValueError('second dimension of x and y must be the same') 17 | if isinstance(dist, numbers.Number) and dist <= 0: 18 | raise ValueError('dist cannot be a negative integer') 19 | 20 | if dist is None: 21 | if x.ndim == 1: 22 | dist = __difference 23 | else: 24 | dist = __norm(p=1) 25 | elif isinstance(dist, numbers.Number): 26 | dist = __norm(p=dist) 27 | 28 | return x, y, dist 29 | 30 | def dtw(x, y, dist=None): 31 | ''' return the distance between 2 time series without approximation 32 | Parameters 33 | ---------- 34 | x : array_like 35 | input array 1 36 | y : array_like 37 | input array 2 38 | dist : function or int 39 | 
The method for calculating the distance between x[i] and y[j]. If 40 | dist is an int of value p > 0, then the p-norm will be used. If 41 | dist is a function then dist(x[i], y[j]) will be used. If dist is 42 | None then abs(x[i] - y[j]) will be used. 43 | Returns 44 | ------- 45 | distance : float 46 | the approximate distance between the 2 time series 47 | path : list 48 | list of indexes for the inputs x and y 49 | 50 | ''' 51 | x, y, dist = __prep_inputs(x, y, dist) 52 | return __dtw(x, y, None, dist) 53 | 54 | 55 | def __dtw(x, y, window, dist): 56 | len_x, len_y = len(x), len(y) 57 | if window is None: 58 | window = [(i, j) for i in range(len_x) for j in range(len_y)] 59 | window = ((i + 1, j + 1) for i, j in window) 60 | D = defaultdict(lambda: (float('inf'),)) 61 | D[0, 0] = (0, 0, 0) 62 | for i, j in window: 63 | dt = dist(x[i-1], y[j-1]) 64 | D[i, j] = min((D[i-1, j][0]+dt, i-1, j), (D[i, j-1][0]+dt, i, j-1), 65 | (D[i-1, j-1][0]+dt, i-1, j-1), key=lambda a: a[0]) 66 | path = [] 67 | i, j = len_x, len_y 68 | while not (i == j == 0): 69 | path.append((i-1, j-1)) 70 | i, j = D[i, j][1], D[i, j][2] 71 | path.reverse() 72 | return (D[len_x, len_y][0], path) 73 | 74 | 75 | def __reduce_by_half(x): 76 | return [(x[i] + x[1+i]) / 2 for i in range(0, len(x) - len(x) % 2, 2)] 77 | 78 | 79 | def __expand_window(path, len_x, len_y, radius): 80 | path_ = set(path) 81 | for i, j in path: 82 | for a, b in ((i + a, j + b) 83 | for a in range(-radius, radius+1) 84 | for b in range(-radius, radius+1)): 85 | path_.add((a, b)) 86 | 87 | window_ = set() 88 | for i, j in path_: 89 | for a, b in ((i * 2, j * 2), (i * 2, j * 2 + 1), 90 | (i * 2 + 1, j * 2), (i * 2 + 1, j * 2 + 1)): 91 | window_.add((a, b)) 92 | 93 | window = [] 94 | start_j = 0 95 | for i in range(0, len_x): 96 | new_start_j = None 97 | for j in range(start_j, len_y): 98 | if (i, j) in window_: 99 | window.append((i, j)) 100 | if new_start_j is None: 101 | new_start_j = j 102 | elif new_start_j is not None: 103 
| break 104 | start_j = new_start_j 105 | 106 | return window -------------------------------------------------------------------------------- /imputation.py: -------------------------------------------------------------------------------- 1 | """Missing Value Imputation module for Numeric dataset. 2 | 3 | ##### 4 | # Sample Usage 5 | ##### 6 | python imputation.py --data_path='./dataset/ecg_mitbih_test.csv' \ 7 | --option='simple' \ 8 | --strategy='mean' \ 9 | --output_path='./dataset/imputed_data/ecg_mitbih_test_imputed.csv' 10 | 11 | python imputation.py --data_path='./dataset/ecg_mitbih_test.csv' \ 12 | --option='simple' \ 13 | --strategy='constant' \ 14 | --fill_value=0 \ 15 | --output_path='./dataset/imputed_data/ecg_mitbih_test_imputed.csv' 16 | 17 | python imputation.py --data_path='./dataset/ecg_mitbih_test.csv' \ 18 | --option='knn' \ 19 | --n_neighbors=5 \ 20 | --output_path='./dataset/imputed_data/ecg_mitbih_test_imputed.csv' 21 | 22 | ##### 23 | # Testing Module by Adding Random Noise 24 | ##### 25 | python imputation.py --data_path='./dataset/ecg_mitbih_test.csv' \ 26 | --option='simple' \ 27 | --strategy='mean' \ 28 | --output_path='./dataset/imputed_data/ecg_mitbih_test_imputed.csv' \ 29 | --test_module 30 | """ 31 | 32 | import numpy as np 33 | import pandas as pd 34 | import argparse 35 | 36 | from sklearn.impute import SimpleImputer 37 | from sklearn.impute import KNNImputer 38 | from sklearn.experimental import enable_iterative_imputer 39 | from sklearn.impute import IterativeImputer 40 | 41 | parser = argparse.ArgumentParser(description='Missing value imputation for numeric datasets') 42 | parser.add_argument('--data_path', type=str, default='./dataset/ecg_mitbih_test.csv') 43 | parser.add_argument('--output_path', type=str, default='./dataset/ecg_mitbih_test_imputed.csv') 44 | parser.add_argument('--header', action="store_true", default=None) 45 | parser.add_argument('--option', type=str, default='simple', help='imputation method [simple, knn, mice]') 46 | 
parser.add_argument('--strategy', type=str, default=None, 47 | help='Strategy for simple imputation [mean, median, most_frequent, constant]; also the initial strategy for MICE.') 48 | parser.add_argument('--fill_value', type=float, default=None, 49 | help='If the strategy is “constant”, then replace missing values with fill_value.') 50 | parser.add_argument('--n_neighbors', type=int, default=5, 51 | help='Number of neighboring samples to use for KNN imputation.') 52 | parser.add_argument('--test_module', action="store_true", default=False, 53 | help='Insert random NA values before imputing, to test the module.') 54 | parser.add_argument('--noise_ratio', type=float, default=0.05, help='Fraction of rows in each numeric column to replace with NA.') 55 | 56 | 57 | class DataImputer(): 58 | def __init__(self, data_path, header): 59 | self.dataset = pd.read_csv(data_path, header=0 if header else None)  # honor --header; default is no header row 60 | 61 | def add_random_noise(self, noise_ratio): 62 | dataset_na = self.dataset.copy() 63 | 64 | # select only numeric columns to apply the missingness to 65 | cols_list = dataset_na.select_dtypes('number').columns.tolist() 66 | 67 | # randomly insert NA values 68 | for col in self.dataset[cols_list]: 69 | dataset_na.loc[self.dataset.sample(frac=noise_ratio).index, col] = np.nan 70 | 71 | return dataset_na 72 | 73 | def SimpleImputer(self, strategy, fill_value): 74 | """ sklearn simple imputer. """ 75 | # https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html 76 | imputer = SimpleImputer(missing_values=np.nan, strategy=strategy, fill_value=fill_value) 77 | imputed_dataset = imputer.fit_transform(self.dataset) 78 | imputed_dataset = pd.DataFrame(imputed_dataset, columns=self.dataset.columns) 79 | 80 | return imputed_dataset 81 | 82 | def KNNImputer(self, n_neighbors): 83 | """ sklearn KNN imputer. 
""" 84 | # https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html 85 | imputer = KNNImputer(n_neighbors=n_neighbors) 86 | imputed_dataset = imputer.fit_transform(self.dataset) 87 | imputed_dataset = pd.DataFrame(imputed_dataset, columns=self.dataset.columns) 88 | 89 | return imputed_dataset 90 | 91 | def MICEImputer(self, initial_strategy): 92 | """ sklearn MICE imputer. """ 93 | # https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn-impute-iterativeimputer 94 | imputer = IterativeImputer(random_state=0, initial_strategy=initial_strategy, sample_posterior=True) 95 | imputed_dataset = imputer.fit_transform(self.dataset) 96 | imputed_dataset = pd.DataFrame(imputed_dataset, columns=self.dataset.columns) 97 | 98 | return imputed_dataset 99 | 100 | def main(args): 101 | dataImputer = DataImputer(args.data_path, args.header) 102 | 103 | if args.test_module: 104 | # Add Noise 105 | print('Add random noise with NA values for testing the module.') 106 | dataImputer.dataset = dataImputer.add_random_noise(args.noise_ratio) 107 | dataImputer.dataset.to_csv('./dataset/noisy_dataset.csv', mode='w', index=False, header=False) 108 | 109 | if args.option == 'simple': 110 | assert(args.strategy is not None) 111 | print(f'Processing simple imputation using the {args.strategy}') 112 | imputed_dataset = dataImputer.SimpleImputer(strategy=args.strategy, fill_value=args.fill_value) 113 | elif args.option == 'knn': 114 | assert(args.n_neighbors > 0) 115 | print(f'Processing KNN imputation') 116 | imputed_dataset = dataImputer.KNNImputer(n_neighbors=args.n_neighbors) 117 | elif args.option == 'mice': 118 | print(f'Processing MICE imputation') 119 | imputed_dataset = dataImputer.MICEImputer(initial_strategy=args.strategy) 120 | 121 | imputed_dataset.to_csv(args.output_path, mode='w', index=False, header=False) 122 | print(f'Done {args.option} imputation') 123 | 124 | if __name__ == '__main__': 125 | args = 
        parser.parse_args()
    main(args)
--------------------------------------------------------------------------------
/seasonal_trend_decomposition.py:
--------------------------------------------------------------------------------

"""Seasonal Trend Decomposition and Prediction module for Numeric dataset.

"""

import statsmodels.api as sm
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import argparse
from pmdarima.arima import auto_arima

parser = argparse.ArgumentParser(description='Seasonal-trend decomposition and SARIMA forecasting')
parser.add_argument('--data_path', type=str, default='./dataset/machine_temperature_system_failure.csv')
parser.add_argument('--seasonal_output_path', type=str, default='./dataset/decomposed_data/seasonal_decomposed.csv')
parser.add_argument('--trend_output_path', type=str, default='./dataset/decomposed_data/trend_decomposed.csv')

class DataDecomposer():
    def __init__(self, data_path):
        self.dataset = pd.read_csv(data_path)
        self.dataset['timestamp'] = pd.to_datetime(self.dataset['timestamp'])
        self.dataset = self.dataset.set_index('timestamp')

    def STL_decomposition(self, model='additive', period = 288):
        # https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.DecomposeResult.html#statsmodels.tsa.seasonal.DecomposeResult
        decomposition = sm.tsa.seasonal_decompose(self.dataset['value'], model=model, period=period)
        fig = decomposition.plot()
        fig.set_size_inches(12,10)
        plt.show()

        # Keep the components on the instance so that AutoArima can use them.
        self.seasonal = decomposition.seasonal
        self.trend = decomposition.trend.dropna()

        return self.seasonal, self.trend

    def AutoArima(self, time_type):
        # STL_decomposition must be called first; the last 288 points are held out for testing.
        if time_type == 'seasonal':
            train_data = self.seasonal.copy().iloc[:-288]
            test_data = self.seasonal.copy().iloc[-288:]
        elif time_type == 'trend':
            train_data = self.trend.copy().iloc[:-288]
            test_data = self.trend.copy().iloc[-288:]
        else:
            raise ValueError(f'Unknown time_type: {time_type}')

        auto_arima_model = auto_arima(train_data, seasonal=True,
                                      trace=True, d=1, D=1,
                                      error_action='ignore',
                                      suppress_warnings=True,
                                      stepwise=False,
                                      n_jobs=8)


        print(auto_arima_model.summary())
        prediction = auto_arima_model.predict(288, return_conf_int=True)

        # prediction[1] holds the interval bounds as columns [lower, upper].
        predicted_value = prediction[0]
        predicted_lb = prediction[1][:,0]
        predicted_ub = prediction[1][:,1]
        predict_index = list(test_data.index)

        fig, ax = plt.subplots(figsize=(12, 6))
        train_data.plot(ax=ax)
        ymin, ymax = ax.get_ylim()
        ax.vlines(predict_index[0], ymin, ymax, linestyle='--', color='r', label='Start of Forecast')
        ax.plot(predict_index, predicted_value, label = 'Prediction')
        ax.fill_between(predict_index, predicted_lb, predicted_ub, color = 'k', alpha = 0.1, label='0.95 Prediction Interval')
        ax.legend(loc='upper left')
        plt.suptitle(f'{time_type} SARIMA {auto_arima_model.order},{auto_arima_model.seasonal_order}')
        plt.show()

        return auto_arima_model, prediction

def main(args):
    dataDecomposer = DataDecomposer(args.data_path)
    seasonal, trend = dataDecomposer.STL_decomposition()

    model_seasonal, pred_seasonal = dataDecomposer.AutoArima(time_type = 'seasonal')
    model_trend, pred_trend = dataDecomposer.AutoArima(time_type = 'trend')

    seasonal.to_csv(args.seasonal_output_path, index=False, header=False)
    trend.to_csv(args.trend_output_path, index=False, header=False)

if __name__ == '__main__':
    args = parser.parse_args()
    main(args)

--------------------------------------------------------------------------------
/synchronization.py:
--------------------------------------------------------------------------------
"""Time-series synchronization module (DTW and soft-DTW) for Numeric dataset.
"""

import pandas as pd
from dtw import dtw
import matplotlib.pyplot as plt
import argparse
from tslearn import metrics


parser = argparse.ArgumentParser(description='Time-series synchronization with DTW / soft-DTW')
parser.add_argument('--data_path', type=str, default='./dataset/power_voltage.csv')
parser.add_argument('--dtw_output_path', type=str, default='./dataset/synchronized_data/synchronized_dtw.csv')
parser.add_argument('--plot_output_path', type=str, default='./dataset/synchronized_data')
parser.add_argument('--soft_dtw_output_path', type=str, default='./dataset/synchronized_data/synchronized_softdtw.csv')
parser.add_argument('--option', type=str, default='dtw', help='synchronization method. [dtw, soft_dtw]')
parser.add_argument('--gamma', type=float, default=0.1, help='gamma for soft dtw')
parser.add_argument('--distance', type=float, default= 2, help='p-norm for distance')
# type=bool would turn any non-empty string (including "False") into True, so parse the text explicitly.
parser.add_argument('--plot', type=lambda s: str(s).lower() in ('true', '1', 'yes'), default= True,
                    help='plot for visualization')

class Sync():
    def __init__(self, data_path):
        self.dataset = pd.read_csv(data_path)
        self.Time = self.dataset.iloc[:,0]
        self.X = self.dataset.iloc[:,1].fillna(0).values
        self.Y = self.dataset.iloc[:,2].fillna(0).values

    def dtw_sync(self, distance):

        # dtw returns the accumulated distance and the warping path; only the path is needed here.
        _, path = dtw(self.X, self.Y, dist=distance)

        result = []

        for i in range(0, len(path)):
            result.append([self.Time[path[i][0]], self.X[path[i][0]], self.Y[path[i][1]]])

        result_df = pd.DataFrame(data=result, columns=['Time', 'X', 'Y']).dropna()
        result_df = result_df.drop_duplicates(subset=['Time'])
        result_df = result_df.sort_values(by='Time')
        result_df = result_df.reset_index(drop=True)

        return result_df

    def soft_dtw_sync(self, gamma):

        # soft_dtw_alignment returns a soft-alignment matrix of shape (len(X), len(Y)), not a path.
        # Assumption: pair each X index with its most strongly aligned Y index.
        alignment, sim = metrics.soft_dtw_alignment(self.X, self.Y, gamma=gamma)
        path = [(i, int(alignment[i].argmax())) for i in range(alignment.shape[0])]

        result = []

        for i in range(0, len(path)):
            result.append([self.Time[path[i][0]], self.X[path[i][0]], self.Y[path[i][1]]])

        result_df = pd.DataFrame(data=result, columns=['Time', 'X', 'Y']).dropna()
        result_df = result_df.drop_duplicates(subset=['Time'])
        result_df = result_df.sort_values(by='Time')
        result_df = result_df.reset_index(drop=True)

        return result_df


    def plot_data(self):

        return self.X, self.Y


def main(args):
    DTW_sync = Sync(args.data_path)

    if args.option == 'dtw':
        print(f'Processing synchronization using the {args.option}')
        result_df = DTW_sync.dtw_sync(distance=args.distance)
        result_df.to_csv(args.dtw_output_path, index=False)

    elif args.option == 'soft_dtw':
        print(f'Processing synchronization using the {args.option}')
        # Use the soft-DTW method with gamma (the original called dtw_sync with gamma as the distance).
        result_df = DTW_sync.soft_dtw_sync(gamma=args.gamma)
        result_df.to_csv(args.soft_dtw_output_path, index=False)

    else:
        raise ValueError(f'Unknown synchronization option: {args.option}')

    if args.plot:

        X, Y = DTW_sync.plot_data()
        plt.plot(X[:100])
        plt.plot(Y[:100])
        plt.title('Original data (0~100 timestep)')
        plt.savefig(args.plot_output_path + '/Original plot.png')
        plt.clf()

        plt.plot(result_df['X'][0:100])
        plt.plot(result_df['Y'][0:100])
        plt.title(f'Synchronized by {args.option} (0~100 timestep)')
        plt.savefig(args.plot_output_path + '/Synchronized plot.png')


    print(f'Done {args.option} synchronization')

if __name__ == '__main__':
    args = parser.parse_args()
    main(args)
--------------------------------------------------------------------------------
/고려대학교_건물별_전력데이터.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ClustProject/KUDataPreprocessing/e8804c08107bff23b5092f95149bc0c5c0f88ec3/고려대학교_건물별_전력데이터.zip

--------------------------------------------------------------------------------
/녹지캠4일전력소비량.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ClustProject/KUDataPreprocessing/e8804c08107bff23b5092f95149bc0c5c0f88ec3/녹지캠4일전력소비량.png

--------------------------------------------------------------------------------
/녹지캠4주전력소비량.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ClustProject/KUDataPreprocessing/e8804c08107bff23b5092f95149bc0c5c0f88ec3/녹지캠4주전력소비량.png

--------------------------------------------------------------------------------
/변수설명.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ClustProject/KUDataPreprocessing/e8804c08107bff23b5092f95149bc0c5c0f88ec3/변수설명.docx

--------------------------------------------------------------------------------
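
The synchronization step in `synchronization.py` (align `X` and `Y` along a warping path, then keep one row per timestamp) can be illustrated with a small pure-Python sketch. This is not the repository's `dtw.py`: the helper names `dtw_path` and `synchronize` are hypothetical, and the cost is assumed to be the absolute difference rather than the p-norm selected with `--distance`.

```python
def dtw_path(x, y):
    """Classic O(len(x) * len(y)) DTW with absolute-difference cost (an assumption).

    Returns (total_cost, path), where path is a list of (i, j) index pairs.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the last cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # step back in x only
            (cost[i][j - 1], i, j - 1),          # step back in y only
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


def synchronize(times, x, y):
    """Keep the first aligned (x, y) pair per timestamp, mirroring
    drop_duplicates(subset=['Time']) in dtw_sync (hypothetical helper)."""
    _, path = dtw_path(x, y)
    seen = set()
    rows = []
    for i, j in path:
        if times[i] not in seen:
            seen.add(times[i])
            rows.append((times[i], x[i], y[j]))
    return rows


if __name__ == "__main__":
    # y is x delayed by one step; DTW re-aligns the two series.
    for row in synchronize([10, 11, 12, 13, 14], [0, 1, 2, 1, 0], [0, 0, 1, 2, 1]):
        print(row)
```

Because every index of `X` appears on the warping path at least once, dropping duplicate timestamps leaves exactly one synchronized `Y` value per original timestamp.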