├── README.md
├── modified_keras_files
│   ├── README
│   └── training.py
├── mycode
│   ├── config.py
│   ├── load_data.py
│   ├── main.py
│   └── model.py
└── 数据集说明.txt
/README.md:
--------------------------------------------------------------------------------
 1 | # zhihu_kanshan_cup_2017
 2 | Part of my code for the 2017 Zhihu Kanshan Cup competition. Only the RCNN+ATTENTION model, my highest-scoring single model, is provided.
 3 | 
 4 | For details, please see my blog post: [Large-scale text classification in practice: a Zhihu Kanshan Cup summary](http://coderskychen.cn/2017/08/20/zhihucup/)
 5 | 
 6 | # Data download and description
 7 | The data is hosted on Baidu Cloud: http://pan.baidu.com/s/1bpnNRQJ
 8 | Data description: [see here](https://github.com/coderSkyChen/zhihu_kanshan_cup_2017/blob/master/%E6%95%B0%E6%8D%AE%E9%9B%86%E8%AF%B4%E6%98%8E.txt)
 9 | 
10 | # Environment
11 | - Python version: Python 3
12 | - Keras version: 2.0.6, with Keras's training.py modified by prozhuchen so that it can handle sparse labels; the change is in lines 375–455 of training.py.
13 |   The idea is to convert the sparse labels to dense arrays one batch at a time, which saves memory. You can also call Keras's `train_on_batch` yourself, but if you use `fit` you must use our modified source file (a short usage sketch is included at the end of this document).
14 | - The modified training.py is provided in modified_keras_files; use it to replace keras/engine/training.py in your Keras installation.
15 | 
16 | # How to run
17 | ## Training:
18 | python3 main.py train
19 | 
20 | ## Prediction:
21 | python3 main.py pred
22 | 
23 | 
--------------------------------------------------------------------------------
/modified_keras_files/README:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/modified_keras_files/training.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*- 2 | from __future__ import print_function 3 | from __future__ import absolute_import 4 | 5 | import warnings 6 | import copy 7 | import numpy as np 8 | import six 9 | 10 | from keras.utils import Sequence 11 | from keras.utils import GeneratorEnqueuer 12 | from keras.utils import OrderedEnqueuer 13 | 14 | try: 15 | import queue 16 | except ImportError: 17 | import Queue as queue 18 | 19 | from .topology import Container 20 | from .. import backend as K 21 | from .. import optimizers 22 | from .. import losses 23 | from .. import metrics as metrics_module 24 | from ..utils.generic_utils import Progbar 25 | from .. import callbacks as cbks 26 | from ..legacy import interfaces 27 | 28 | 29 | def _standardize_input_data(data, names, shapes=None, 30 | check_batch_axis=True, 31 | exception_prefix=''): 32 | """Normalizes inputs and targets provided by users. 33 | 34 | Users may pass data as a list of arrays, dictionary of arrays, 35 | or as a single array. We normalize this to an ordered list of 36 | arrays (same order as `names`), while checking that the provided 37 | arrays have shapes that match the network's expectations. 38 | 39 | # Arguments 40 | data: User-provided input data (polymorphic). 41 | names: List of expected array names. 42 | shapes: Optional list of expected array shapes. 43 | check_batch_axis: Boolean; whether to check that 44 | the batch axis of the arrays matches the expected 45 | value found in `shapes`. 46 | exception_prefix: String prefix used for exception formatting. 47 | 48 | # Returns 49 | List of standardized input arrays (one array per model input). 50 | 51 | # Raises 52 | ValueError: in case of improperly formatted user-provided data. 53 | """ 54 | if not names: 55 | return [] 56 | if data is None: 57 | return [None for _ in range(len(names))] 58 | if isinstance(data, dict): 59 | arrays = [] 60 | for name in names: 61 | if name not in data: 62 | raise ValueError('No data provided for "' + 63 | name + '". 
Need data for each key in: ' + 64 | str(names)) 65 | arrays.append(data[name]) 66 | elif isinstance(data, list): 67 | if len(data) != len(names): 68 | if data and hasattr(data[0], 'shape'): 69 | raise ValueError('Error when checking model ' + 70 | exception_prefix + 71 | ': the list of Numpy arrays ' 72 | 'that you are passing to your model ' 73 | 'is not the size the model expected. ' 74 | 'Expected to see ' + str(len(names)) + 75 | ' arrays but instead got ' 76 | 'the following list of ' + str(len(data)) + 77 | ' arrays: ' + str(data)[:200] + 78 | '...') 79 | else: 80 | if len(names) == 1: 81 | data = [np.asarray(data)] 82 | else: 83 | raise ValueError( 84 | 'Error when checking model ' + 85 | exception_prefix + 86 | ': you are passing a list as ' 87 | 'input to your model, ' 88 | 'but the model expects ' 89 | 'a list of ' + str(len(names)) + 90 | ' Numpy arrays instead. ' 91 | 'The list you passed was: ' + 92 | str(data)[:200]) 93 | arrays = data 94 | else: 95 | if not hasattr(data, 'shape'): 96 | raise TypeError('Error when checking model ' + 97 | exception_prefix + 98 | ': data should be a Numpy array, ' 99 | 'or list/dict of Numpy arrays. ' 100 | 'Found: ' + str(data)[:200] + '...') 101 | if len(names) > 1: 102 | # Case: model expects multiple inputs but only received 103 | # a single Numpy array. 104 | raise ValueError('The model expects ' + str(len(names)) + ' ' + 105 | exception_prefix + 106 | ' arrays, but only received one array. ' 107 | 'Found: array with shape ' + str(data.shape)) 108 | arrays = [data] 109 | 110 | # Make arrays at least 2D. 111 | for i in range(len(names)): 112 | array = arrays[i] 113 | if len(array.shape) == 1: 114 | array = np.expand_dims(array, 1) 115 | arrays[i] = array 116 | 117 | # Check shapes compatibility. 118 | if shapes: 119 | for i in range(len(names)): 120 | if shapes[i] is None: 121 | continue 122 | array = arrays[i] 123 | if len(array.shape) != len(shapes[i]): 124 | raise ValueError('Error when checking ' + exception_prefix + 125 | ': expected ' + names[i] + 126 | ' to have ' + str(len(shapes[i])) + 127 | ' dimensions, but got array with shape ' + 128 | str(array.shape)) 129 | for j, (dim, ref_dim) in enumerate(zip(array.shape, shapes[i])): 130 | if not j and not check_batch_axis: 131 | # skip the first axis 132 | continue 133 | if ref_dim: 134 | if ref_dim != dim: 135 | raise ValueError( 136 | 'Error when checking ' + exception_prefix + 137 | ': expected ' + names[i] + 138 | ' to have shape ' + str(shapes[i]) + 139 | ' but got array with shape ' + 140 | str(array.shape)) 141 | return arrays 142 | 143 | 144 | def _standardize_sample_or_class_weights(x_weight, output_names, weight_type): 145 | """Maps `sample_weight` or `class_weight` to model outputs. 146 | 147 | # Arguments 148 | x_weight: User-provided `sample_weight` or `class_weight` argument. 149 | output_names: List of output names (strings) in the model. 150 | weight_type: A string used purely for exception printing. 151 | 152 | # Returns 153 | A list of `sample_weight` or `class_weight` where there are exactly 154 | one element per model output. 155 | 156 | # Raises 157 | ValueError: In case of invalid user-provided argument. 
158 | """ 159 | if x_weight is None or len(x_weight) == 0: 160 | return [None for _ in output_names] 161 | if len(output_names) == 1: 162 | if isinstance(x_weight, list) and len(x_weight) == 1: 163 | return x_weight 164 | if isinstance(x_weight, dict) and output_names[0] in x_weight: 165 | return [x_weight[output_names[0]]] 166 | else: 167 | return [x_weight] 168 | if isinstance(x_weight, list): 169 | if len(x_weight) != len(output_names): 170 | raise ValueError('Provided `' + weight_type + '` was a list of ' + 171 | str(len(x_weight)) + 172 | ' elements, but the model has ' + 173 | str(len(output_names)) + ' outputs. ' 174 | 'You should provide one `' + weight_type + '`' 175 | 'array per model output.') 176 | return x_weight 177 | if isinstance(x_weight, dict): 178 | x_weights = [] 179 | for name in output_names: 180 | x_weights.append(x_weight.get(name)) 181 | return x_weights 182 | else: 183 | raise TypeError('The model has multiple outputs, so `' + 184 | weight_type + '` ' 185 | 'should be either a list of a dict. ' 186 | 'Provided `' + weight_type + 187 | '` type not understood: ' + 188 | str(x_weight)) 189 | 190 | 191 | def _standardize_class_weights(class_weight, output_names): 192 | return _standardize_sample_or_class_weights(class_weight, 193 | output_names, 194 | 'class_weight') 195 | 196 | 197 | def _standardize_sample_weights(sample_weight, output_names): 198 | return _standardize_sample_or_class_weights(sample_weight, 199 | output_names, 200 | 'sample_weight') 201 | 202 | 203 | def _check_array_lengths(inputs, targets, weights=None): 204 | """Does user input validation for numpy arrays. 205 | 206 | # Arguments 207 | inputs: list of Numpy arrays of inputs. 208 | targets: list of Numpy arrays of targets. 209 | weights: list of Numpy arrays of sample weights. 210 | 211 | # Raises 212 | ValueError: in case of incorrectly formatted data. 213 | """ 214 | def set_of_lengths(x): 215 | # return a set with the variation between 216 | # different shapes, with None => 0 217 | if x is None: 218 | return {0} 219 | else: 220 | return set([0 if y is None else y.shape[0] for y in x]) 221 | 222 | set_x = set_of_lengths(inputs) 223 | set_y = set_of_lengths(targets) 224 | set_w = set_of_lengths(weights) 225 | if len(set_x) > 1: 226 | raise ValueError('All input arrays (x) should have ' 227 | 'the same number of samples. Got array shapes: ' + 228 | str([x.shape for x in inputs])) 229 | if len(set_y) > 1: 230 | raise ValueError('All target arrays (y) should have ' 231 | 'the same number of samples. Got array shapes: ' + 232 | str([y.shape for y in targets])) 233 | if set_x and set_y and list(set_x)[0] != list(set_y)[0]: 234 | raise ValueError('Input arrays should have ' 235 | 'the same number of samples as target arrays. ' 236 | 'Found ' + str(list(set_x)[0]) + ' input samples ' 237 | 'and ' + str(list(set_y)[0]) + ' target samples.') 238 | if len(set_w) > 1: 239 | raise ValueError('All sample_weight arrays should have ' 240 | 'the same number of samples. Got array shapes: ' + 241 | str([w.shape for w in weights])) 242 | if set_y and set_w and list(set_y)[0] != list(set_w)[0]: 243 | raise ValueError('Sample_weight arrays should have ' 244 | 'the same number of samples as target arrays. Got ' + 245 | str(list(set_y)[0]) + ' input samples and ' + 246 | str(list(set_w)[0]) + ' target samples.') 247 | 248 | 249 | def _check_loss_and_target_compatibility(targets, loss_fns, output_shapes): 250 | """Does validation on the compatibility of targets and loss functions. 
251 | 252 | This helps prevent users from using loss functions incorrectly. 253 | 254 | # Arguments 255 | targets: list of Numpy arrays of targets. 256 | loss_fns: list of loss functions. 257 | output_shapes: list of shapes of model outputs. 258 | 259 | # Raises 260 | ValueError: if a loss function or target array 261 | is incompatible with an output. 262 | """ 263 | key_losses = {'mean_squared_error', 264 | 'binary_crossentropy', 265 | 'categorical_crossentropy'} 266 | for y, loss, shape in zip(targets, loss_fns, output_shapes): 267 | if loss is None: 268 | continue 269 | if loss.__name__ == 'categorical_crossentropy': 270 | if y.shape[-1] == 1: 271 | raise ValueError( 272 | 'You are passing a target array of shape ' + str(y.shape) + 273 | ' while using as loss `categorical_crossentropy`. ' 274 | '`categorical_crossentropy` expects ' 275 | 'targets to be binary matrices (1s and 0s) ' 276 | 'of shape (samples, classes). ' 277 | 'If your targets are integer classes, ' 278 | 'you can convert them to the expected format via:\n' 279 | '```\n' 280 | 'from keras.utils.np_utils import to_categorical\n' 281 | 'y_binary = to_categorical(y_int)\n' 282 | '```\n' 283 | '\n' 284 | 'Alternatively, you can use the loss function ' 285 | '`sparse_categorical_crossentropy` instead, ' 286 | 'which does expect integer targets.') 287 | if loss.__name__ in key_losses: 288 | for target_dim, out_dim in zip(y.shape[1:], shape[1:]): 289 | if out_dim is not None and target_dim != out_dim: 290 | raise ValueError( 291 | 'A target array with shape ' + str(y.shape) + 292 | ' was passed for an output of shape ' + str(shape) + 293 | ' while using as loss `' + loss.__name__ + '`. ' 294 | 'This loss expects ' 295 | 'targets to have the same shape ' 296 | 'as the output.') 297 | 298 | 299 | def _collect_metrics(metrics, output_names): 300 | """Maps metric functions to model outputs. 301 | 302 | # Arguments 303 | metrics: a list or dict of metric functions. 304 | output_names: a list of the names (strings) of model outputs. 305 | 306 | # Returns 307 | A list (one entry per model output) of lists of metric functions. 308 | For instance, if the model has 2 outputs, and for the first output 309 | we want to compute "binary_accuracy" and "binary_crossentropy", 310 | and just "binary_accuracy" for the second output, 311 | the list would look like: 312 | `[[binary_accuracy, binary_crossentropy], [binary_accuracy]]` 313 | 314 | # Raises 315 | TypeError: if an incorrect type is passed for the `metrics` argument. 316 | """ 317 | if not metrics: 318 | return [[] for _ in output_names] 319 | if isinstance(metrics, list): 320 | # we then apply all metrics to all outputs. 321 | return [copy.copy(metrics) for _ in output_names] 322 | elif isinstance(metrics, dict): 323 | nested_metrics = [] 324 | for name in output_names: 325 | output_metrics = metrics.get(name, []) 326 | if not isinstance(output_metrics, list): 327 | output_metrics = [output_metrics] 328 | nested_metrics.append(output_metrics) 329 | return nested_metrics 330 | else: 331 | raise TypeError('Type of `metrics` argument not understood. ' 332 | 'Expected a list or dictionary, found: ' + 333 | str(metrics)) 334 | 335 | 336 | def _batch_shuffle(index_array, batch_size): 337 | """Shuffles an array in a batch-wise fashion. 338 | 339 | Useful for shuffling HDF5 arrays 340 | (where one cannot access arbitrary indices). 341 | 342 | # Arguments 343 | index_array: array of indices to be shuffled. 344 | batch_size: integer. 
345 | 346 | # Returns 347 | The `index_array` array, shuffled in a batch-wise fashion. 348 | """ 349 | batch_count = int(len(index_array) / batch_size) 350 | # to reshape we need to be cleanly divisible by batch size 351 | # we stash extra items and reappend them after shuffling 352 | last_batch = index_array[batch_count * batch_size:] 353 | index_array = index_array[:batch_count * batch_size] 354 | index_array = index_array.reshape((batch_count, batch_size)) 355 | np.random.shuffle(index_array) 356 | index_array = index_array.flatten() 357 | return np.append(index_array, last_batch) 358 | 359 | 360 | def _make_batches(size, batch_size): 361 | """Returns a list of batch indices (tuples of indices). 362 | 363 | # Arguments 364 | size: Integer, total size of the data to slice into batches. 365 | batch_size: Integer, batch size. 366 | 367 | # Returns 368 | A list of tuples of array indices. 369 | """ 370 | num_batches = int(np.ceil(size / float(batch_size))) 371 | return [(i * batch_size, min(size, (i + 1) * batch_size)) 372 | for i in range(0, num_batches)] 373 | 374 | 375 | def _slice_arrays(arrays, start=None, stop=None): 376 | """Slice an array or list of arrays. 377 | 378 | This takes an array-like, or a list of 379 | array-likes, and outputs: 380 | - arrays[start:stop] if `arrays` is an array-like 381 | - [x[start:stop] for x in arrays] if `arrays` is a list 382 | 383 | Can also work on list/array of indices: `_slice_arrays(x, indices)` 384 | 385 | # Arguments 386 | arrays: Single array or list of arrays. 387 | start: can be an integer index (start index) 388 | or a list/array of indices 389 | stop: integer (stop index); should be None if 390 | `start` was a list. 391 | 392 | # Returns 393 | A slice of the array(s). 394 | """ 395 | # if arrays is None: 396 | # return [None] 397 | # elif isinstance(arrays, list): 398 | # if hasattr(start, '__len__'): 399 | # # hdf5 datasets only support list objects as indices 400 | # if hasattr(start, 'shape'): 401 | # start = start.tolist() 402 | # return [None if x is None else x[start] for x in arrays] 403 | # else: 404 | # return [None if x is None else x[start:stop] for x in arrays] 405 | # else: 406 | # if hasattr(start, '__len__'): 407 | # if hasattr(start, 'shape'): 408 | # start = start.tolist() 409 | # return arrays[start] 410 | # elif hasattr(start, '__getitem__'): 411 | # return arrays[start:stop] 412 | # else: 413 | # return [None] 414 | 415 | from scipy import sparse as sps 416 | 417 | if isinstance(arrays, list): 418 | if hasattr(start, '__len__'): 419 | # hdf5 datasets only support list objects as indices 420 | if hasattr(start, 'shape'): 421 | start = start.tolist() 422 | # return [x[start] for x in arrays] 423 | res = [] 424 | for x in arrays: 425 | if sps.issparse(x): 426 | res.append(x[start].toarray()) 427 | else: 428 | res.append(x[start]) 429 | return res 430 | else: 431 | # return [x[start:stop] for x in arrays] 432 | res = [] 433 | for x in arrays: 434 | if sps.issparse(x): 435 | res.append(x[start:stop].toarray()) 436 | else: 437 | res.append(x[start:stop]) 438 | return res 439 | else: 440 | if hasattr(start, '__len__'): 441 | if hasattr(start, 'shape'): 442 | start = start.tolist() 443 | # return arrays[start] 444 | if sps.issparse(arrays): 445 | return arrays[start].toarray() 446 | else: 447 | return arrays[start] 448 | else: 449 | # return arrays[start:stop] 450 | if sps.issparse(arrays): 451 | return arrays[start:stop].toarray() 452 | else: 453 | return arrays[start:stop] 454 | 455 | 456 | def 
_weighted_masked_objective(fn): 457 | """Adds support for masking and sample-weighting to an objective function. 458 | 459 | It transforms an objective function `fn(y_true, y_pred)` 460 | into a sample-weighted, cost-masked objective function 461 | `fn(y_true, y_pred, weights, mask)`. 462 | 463 | # Arguments 464 | fn: The objective function to wrap, 465 | with signature `fn(y_true, y_pred)`. 466 | 467 | # Returns 468 | A function with signature `fn(y_true, y_pred, weights, mask)`. 469 | """ 470 | if fn is None: 471 | return None 472 | 473 | def weighted(y_true, y_pred, weights, mask=None): 474 | """Wrapper function. 475 | 476 | # Arguments 477 | y_true: `y_true` argument of `fn`. 478 | y_pred: `y_pred` argument of `fn`. 479 | weights: Weights tensor. 480 | mask: Mask tensor. 481 | 482 | # Returns 483 | Scalar tensor. 484 | """ 485 | # score_array has ndim >= 2 486 | score_array = fn(y_true, y_pred) 487 | if mask is not None: 488 | # Cast the mask to floatX to avoid float64 upcasting in theano 489 | mask = K.cast(mask, K.floatx()) 490 | # mask should have the same shape as score_array 491 | score_array *= mask 492 | # the loss per batch should be proportional 493 | # to the number of unmasked samples. 494 | score_array /= K.mean(mask) 495 | 496 | # apply sample weighting 497 | if weights is not None: 498 | # reduce score_array to same ndim as weight array 499 | ndim = K.ndim(score_array) 500 | weight_ndim = K.ndim(weights) 501 | score_array = K.mean(score_array, axis=list(range(weight_ndim, ndim))) 502 | score_array *= weights 503 | score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx())) 504 | return K.mean(score_array) 505 | return weighted 506 | 507 | 508 | def _masked_objective(fn): 509 | """Adds support for masking to an objective function. 510 | 511 | It transforms an objective function `fn(y_true, y_pred)` 512 | into a cost-masked objective function 513 | `fn(y_true, y_pred, mask)`. 514 | 515 | # Arguments 516 | fn: The objective function to wrap, 517 | with signature `fn(y_true, y_pred)`. 518 | 519 | # Returns 520 | A function with signature `fn(y_true, y_pred, mask)`. 521 | """ 522 | def masked(y_true, y_pred, mask=None): 523 | """Wrapper function. 524 | 525 | # Arguments 526 | y_true: `y_true` argument of `fn`. 527 | y_pred: `y_pred` argument of `fn`. 528 | mask: Mask tensor. 529 | 530 | # Returns 531 | Scalar tensor. 532 | """ 533 | # score_array has ndim >= 2 534 | score_array = fn(y_true, y_pred) 535 | if mask is not None: 536 | # Cast the mask to floatX to avoid float64 upcasting in theano 537 | mask = K.cast(mask, K.floatx()) 538 | # mask should have the same shape as score_array 539 | score_array *= mask 540 | # the loss per batch should be proportional 541 | # to the number of unmasked samples. 542 | score_array /= K.mean(mask) 543 | 544 | return K.mean(score_array) 545 | return masked 546 | 547 | 548 | def _standardize_weights(y, sample_weight=None, class_weight=None, 549 | sample_weight_mode=None): 550 | """Performs sample weight validation and standardization. 551 | 552 | Everything gets normalized to a single sample-wise (or timestep-wise) 553 | weight array. 554 | 555 | # Arguments 556 | y: Numpy array of model targets to be weighted. 557 | sample_weight: User-provided `sample_weight` argument. 558 | class_weight: User-provided `class_weight` argument. 559 | sample_weight_mode: One of `None` or `"temporal"`. 560 | `"temporal"` indicated that we expect 2D weight data 561 | that will be applied to the last 2 dimensions of 562 | the targets (i.e. 
we are weighting timesteps, not samples). 563 | 564 | # Returns 565 | A numpy array of target weights, one entry per sample to weight. 566 | 567 | # Raises 568 | ValueError: In case of invalid user-provided arguments. 569 | """ 570 | if sample_weight_mode is not None: 571 | if sample_weight_mode != 'temporal': 572 | raise ValueError('"sample_weight_mode ' 573 | 'should be None or "temporal". ' 574 | 'Found: ' + str(sample_weight_mode)) 575 | if len(y.shape) < 3: 576 | raise ValueError('Found a sample_weight array for ' 577 | 'an input with shape ' + 578 | str(y.shape) + '. ' 579 | 'Timestep-wise sample weighting (use of ' 580 | 'sample_weight_mode="temporal") is restricted to ' 581 | 'outputs that are at least 3D, i.e. that have ' 582 | 'a time dimension.') 583 | if sample_weight is not None and len(sample_weight.shape) != 2: 584 | raise ValueError('Found a sample_weight array with shape ' + 585 | str(sample_weight.shape) + '. ' 586 | 'In order to use timestep-wise sample weighting, ' 587 | 'you should pass a 2D sample_weight array.') 588 | else: 589 | if sample_weight is not None and len(sample_weight.shape) != 1: 590 | raise ValueError('Found a sample_weight array with shape ' + 591 | str(sample_weight.shape) + '. ' 592 | 'In order to use timestep-wise sample weights, ' 593 | 'you should specify ' 594 | 'sample_weight_mode="temporal" ' 595 | 'in compile(). If you just mean to use ' 596 | 'sample-wise weights, make sure your ' 597 | 'sample_weight array is 1D.') 598 | 599 | if sample_weight is not None: 600 | if len(sample_weight.shape) > len(y.shape): 601 | raise ValueError('Found a sample_weight with shape' + 602 | str(sample_weight.shape) + '.' 603 | 'Expected sample_weight with rank ' 604 | 'less than or equal to ' + str(len(y.shape))) 605 | 606 | if y.shape[:sample_weight.ndim] != sample_weight.shape: 607 | raise ValueError('Found a sample_weight array with shape ' + 608 | str(sample_weight.shape) + ' for an input with shape ' + 609 | str(y.shape) + '. ' 610 | 'sample_weight cannot be broadcast.') 611 | return sample_weight 612 | elif isinstance(class_weight, dict): 613 | if len(y.shape) > 2: 614 | raise ValueError('`class_weight` not supported for ' 615 | '3+ dimensional targets.') 616 | if y.shape[1] > 1: 617 | y_classes = y.argmax(axis=1) 618 | elif y.shape[1] == 1: 619 | y_classes = np.reshape(y, y.shape[0]) 620 | else: 621 | y_classes = y 622 | 623 | weights = np.asarray([class_weight[cls] for cls in y_classes 624 | if cls in class_weight]) 625 | 626 | if len(weights) != len(y_classes): 627 | # subtract the sets to pick all missing classes 628 | existing_classes = set(y_classes) 629 | existing_class_weight = set(class_weight.keys()) 630 | raise ValueError('`class_weight` must contain all classes in the data.' 631 | ' The classes %s exist in the data but not in ' 632 | '`class_weight`.' 633 | % (existing_classes - existing_class_weight)) 634 | return weights 635 | else: 636 | if sample_weight_mode is None: 637 | return np.ones((y.shape[0],), dtype=K.floatx()) 638 | else: 639 | return np.ones((y.shape[0], y.shape[1]), dtype=K.floatx()) 640 | 641 | 642 | class Model(Container): 643 | """The `Model` class adds training & evaluation routines to a `Container`. 644 | """ 645 | 646 | def compile(self, optimizer, loss, metrics=None, loss_weights=None, 647 | sample_weight_mode=None, **kwargs): 648 | """Configures the model for training. 649 | 650 | # Arguments 651 | optimizer: str (name of optimizer) or optimizer object. 652 | See [optimizers](/optimizers). 
653 | loss: str (name of objective function) or objective function. 654 | See [losses](/losses). 655 | If the model has multiple outputs, you can use a different loss 656 | on each output by passing a dictionary or a list of losses. 657 | The loss value that will be minimized by the model 658 | will then be the sum of all individual losses. 659 | metrics: list of metrics to be evaluated by the model 660 | during training and testing. 661 | Typically you will use `metrics=['accuracy']`. 662 | To specify different metrics for different outputs of a 663 | multi-output model, you could also pass a dictionary, 664 | such as `metrics={'output_a': 'accuracy'}`. 665 | loss_weights: Optional list or dictionary specifying scalar 666 | coefficients (Python floats) to weight the loss contributions 667 | of different model outputs. 668 | The loss value that will be minimized by the model 669 | will then be the *weighted sum* of all individual losses, 670 | weighted by the `loss_weights` coefficients. 671 | If a list, it is expected to have a 1:1 mapping 672 | to the model's outputs. If a tensor, it is expected to map 673 | output names (strings) to scalar coefficients. 674 | sample_weight_mode: if you need to do timestep-wise 675 | sample weighting (2D weights), set this to `"temporal"`. 676 | `None` defaults to sample-wise weights (1D). 677 | If the model has multiple outputs, you can use a different 678 | `sample_weight_mode` on each output by passing a 679 | dictionary or a list of modes. 680 | **kwargs: when using the Theano/CNTK backends, these arguments 681 | are passed into K.function. When using the TensorFlow backend, 682 | these arguments are passed into `tf.Session.run`. 683 | 684 | # Raises 685 | ValueError: In case of invalid arguments for 686 | `optimizer`, `loss`, `metrics` or `sample_weight_mode`. 687 | """ 688 | loss = loss or {} 689 | self.optimizer = optimizers.get(optimizer) 690 | self.sample_weight_mode = sample_weight_mode 691 | self.loss = loss 692 | self.loss_weights = loss_weights 693 | 694 | # Prepare loss functions. 695 | if isinstance(loss, dict): 696 | for name in loss: 697 | if name not in self.output_names: 698 | raise ValueError('Unknown entry in loss ' 699 | 'dictionary: "' + name + '". ' 700 | 'Only expected the following keys: ' + 701 | str(self.output_names)) 702 | loss_functions = [] 703 | for name in self.output_names: 704 | if name not in loss: 705 | warnings.warn('Output "' + name + 706 | '" missing from loss dictionary. ' 707 | 'We assume this was done on purpose, ' 708 | 'and we will not be expecting ' 709 | 'any data to be passed to "' + name + 710 | '" during training.', stacklevel=2) 711 | loss_functions.append(losses.get(loss.get(name))) 712 | elif isinstance(loss, list): 713 | if len(loss) != len(self.outputs): 714 | raise ValueError('When passing a list as loss, ' 715 | 'it should have one entry per model outputs. 
' 716 | 'The model has ' + str(len(self.outputs)) + 717 | ' outputs, but you passed loss=' + 718 | str(loss)) 719 | loss_functions = [losses.get(l) for l in loss] 720 | else: 721 | loss_function = losses.get(loss) 722 | loss_functions = [loss_function for _ in range(len(self.outputs))] 723 | self.loss_functions = loss_functions 724 | weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions] 725 | skip_indices = [] 726 | self._feed_outputs = [] 727 | self._feed_output_names = [] 728 | self._feed_output_shapes = [] 729 | self._feed_loss_fns = [] 730 | for i in range(len(weighted_losses)): 731 | if weighted_losses[i] is None: 732 | skip_indices.append(i) 733 | else: 734 | self._feed_outputs.append(self.outputs[i]) 735 | self._feed_output_names.append(self.output_names[i]) 736 | self._feed_output_shapes.append(self.internal_output_shapes[i]) 737 | self._feed_loss_fns.append(self.loss_functions[i]) 738 | 739 | # Prepare output masks. 740 | masks = self.compute_mask(self.inputs, mask=None) 741 | if masks is None: 742 | masks = [None for _ in self.outputs] 743 | if not isinstance(masks, list): 744 | masks = [masks] 745 | 746 | # Prepare loss weights. 747 | if loss_weights is None: 748 | loss_weights_list = [1. for _ in range(len(self.outputs))] 749 | elif isinstance(loss_weights, dict): 750 | for name in loss_weights: 751 | if name not in self.output_names: 752 | raise ValueError('Unknown entry in loss_weights ' 753 | 'dictionary: "' + name + '". ' 754 | 'Only expected the following keys: ' + 755 | str(self.output_names)) 756 | loss_weights_list = [] 757 | for name in self.output_names: 758 | loss_weights_list.append(loss_weights.get(name, 1.)) 759 | elif isinstance(loss_weights, list): 760 | if len(loss_weights) != len(self.outputs): 761 | raise ValueError('When passing a list as loss_weights, ' 762 | 'it should have one entry per model outputs. ' 763 | 'The model has ' + str(len(self.outputs)) + 764 | ' outputs, but you passed loss_weights=' + 765 | str(loss_weights)) 766 | loss_weights_list = loss_weights 767 | else: 768 | raise TypeError('Could not interpret loss_weights argument: ' + 769 | str(loss_weights) + 770 | ' - expected a list of dicts.') 771 | 772 | # Prepare sample weights. 773 | sample_weights = [] 774 | sample_weight_modes = [] 775 | if isinstance(sample_weight_mode, dict): 776 | for name in sample_weight_mode: 777 | if name not in self.output_names: 778 | raise ValueError('Unknown entry in ' 779 | 'sample_weight_mode dictionary: "' + 780 | name + '". ' 781 | 'Only expected the following keys: ' + 782 | str(self.output_names)) 783 | for i, name in enumerate(self.output_names): 784 | if i in skip_indices: 785 | weight = None 786 | sample_weight_modes.append(None) 787 | else: 788 | if name not in sample_weight_mode: 789 | raise ValueError('Output "' + name + 790 | '" missing from sample_weight_modes ' 791 | 'dictionary') 792 | if sample_weight_mode.get(name) == 'temporal': 793 | weight = K.placeholder(ndim=2, 794 | name=name + '_sample_weights') 795 | sample_weight_modes.append('temporal') 796 | else: 797 | weight = K.placeholder(ndim=1, 798 | name=name + '_sample_weights') 799 | sample_weight_modes.append(None) 800 | sample_weights.append(weight) 801 | elif isinstance(sample_weight_mode, list): 802 | if len(sample_weight_mode) != len(self.outputs): 803 | raise ValueError('When passing a list as sample_weight_mode, ' 804 | 'it should have one entry per model outputs. 
' 805 | 'The model has ' + str(len(self.outputs)) + 806 | ' outputs, but you passed ' 807 | 'sample_weight_mode=' + 808 | str(sample_weight_mode)) 809 | for i in range(len(self.output_names)): 810 | if i in skip_indices: 811 | weight = None 812 | sample_weight_modes.append(None) 813 | else: 814 | mode = sample_weight_mode[i] 815 | name = self.output_names[i] 816 | if mode == 'temporal': 817 | weight = K.placeholder(ndim=2, 818 | name=name + '_sample_weights') 819 | sample_weight_modes.append('temporal') 820 | else: 821 | weight = K.placeholder(ndim=1, 822 | name=name + '_sample_weights') 823 | sample_weight_modes.append(None) 824 | sample_weights.append(weight) 825 | else: 826 | for i, name in enumerate(self.output_names): 827 | if i in skip_indices: 828 | sample_weight_modes.append(None) 829 | sample_weights.append(None) 830 | else: 831 | if sample_weight_mode == 'temporal': 832 | sample_weights.append( 833 | K.placeholder(ndim=2, 834 | name=name + '_sample_weights')) 835 | sample_weight_modes.append('temporal') 836 | else: 837 | sample_weights.append( 838 | K.placeholder(ndim=1, 839 | name=name + '_sample_weights')) 840 | sample_weight_modes.append(None) 841 | self.sample_weight_modes = sample_weight_modes 842 | self._feed_sample_weight_modes = [] 843 | for i in range(len(self.outputs)): 844 | if i not in skip_indices: 845 | self._feed_sample_weight_modes.append(self.sample_weight_modes[i]) 846 | 847 | # Prepare targets of model. 848 | self.targets = [] 849 | self._feed_targets = [] 850 | for i in range(len(self.outputs)): 851 | if i in skip_indices: 852 | self.targets.append(None) 853 | else: 854 | shape = self.internal_output_shapes[i] 855 | name = self.output_names[i] 856 | target = K.placeholder(ndim=len(shape), 857 | name=name + '_target', 858 | sparse=K.is_sparse(self.outputs[i]), 859 | dtype=K.dtype(self.outputs[i])) 860 | self.targets.append(target) 861 | self._feed_targets.append(target) 862 | 863 | # Prepare metrics. 864 | self.metrics = metrics 865 | self.metrics_names = ['loss'] 866 | self.metrics_tensors = [] 867 | 868 | # Compute total loss. 869 | total_loss = None 870 | for i in range(len(self.outputs)): 871 | if i in skip_indices: 872 | continue 873 | y_true = self.targets[i] 874 | y_pred = self.outputs[i] 875 | weighted_loss = weighted_losses[i] 876 | sample_weight = sample_weights[i] 877 | mask = masks[i] 878 | loss_weight = loss_weights_list[i] 879 | output_loss = weighted_loss(y_true, y_pred, 880 | sample_weight, mask) 881 | if len(self.outputs) > 1: 882 | self.metrics_tensors.append(output_loss) 883 | self.metrics_names.append(self.output_names[i] + '_loss') 884 | if total_loss is None: 885 | total_loss = loss_weight * output_loss 886 | else: 887 | total_loss += loss_weight * output_loss 888 | if total_loss is None: 889 | if not self.losses: 890 | raise RuntimeError('The model cannot be compiled ' 891 | 'because it has no loss to optimize.') 892 | else: 893 | total_loss = 0. 894 | 895 | # Add regularization penalties 896 | # and other layer-specific losses. 897 | for loss_tensor in self.losses: 898 | total_loss += loss_tensor 899 | 900 | # List of same size as output_names. 901 | # contains tuples (metrics for output, names of metrics). 
902 | nested_metrics = _collect_metrics(metrics, self.output_names) 903 | 904 | def append_metric(layer_num, metric_name, metric_tensor): 905 | """Helper function used in loop below.""" 906 | if len(self.output_names) > 1: 907 | metric_name = self.output_layers[layer_num].name + '_' + metric_name 908 | self.metrics_names.append(metric_name) 909 | self.metrics_tensors.append(metric_tensor) 910 | 911 | for i in range(len(self.outputs)): 912 | if i in skip_indices: 913 | continue 914 | y_true = self.targets[i] 915 | y_pred = self.outputs[i] 916 | output_metrics = nested_metrics[i] 917 | for metric in output_metrics: 918 | if metric == 'accuracy' or metric == 'acc': 919 | # custom handling of accuracy 920 | # (because of class mode duality) 921 | output_shape = self.internal_output_shapes[i] 922 | acc_fn = None 923 | if (output_shape[-1] == 1 or 924 | self.loss_functions[i] == losses.binary_crossentropy): 925 | # case: binary accuracy 926 | acc_fn = metrics_module.binary_accuracy 927 | elif self.loss_functions[i] == losses.sparse_categorical_crossentropy: 928 | # case: categorical accuracy with sparse targets 929 | acc_fn = metrics_module.sparse_categorical_accuracy 930 | else: 931 | acc_fn = metrics_module.categorical_accuracy 932 | 933 | masked_fn = _masked_objective(acc_fn) 934 | append_metric(i, 'acc', masked_fn(y_true, y_pred, mask=masks[i])) 935 | else: 936 | metric_fn = metrics_module.get(metric) 937 | masked_metric_fn = _masked_objective(metric_fn) 938 | metric_result = masked_metric_fn(y_true, y_pred, mask=masks[i]) 939 | metric_result = { 940 | metric_fn.__name__: metric_result 941 | } 942 | for name, tensor in six.iteritems(metric_result): 943 | append_metric(i, name, tensor) 944 | 945 | # Prepare gradient updates and state updates. 946 | self.total_loss = total_loss 947 | self.sample_weights = sample_weights 948 | self._feed_sample_weights = [] 949 | for i in range(len(self.sample_weights)): 950 | if i not in skip_indices: 951 | self._feed_sample_weights.append(sample_weights[i]) 952 | 953 | # Functions for train, test and predict will 954 | # be compiled lazily when required. 955 | # This saves time when the user is not using all functions. 956 | self._function_kwargs = kwargs 957 | 958 | self.train_function = None 959 | self.test_function = None 960 | self.predict_function = None 961 | 962 | # Collected trainable weights, sorted in topological order. 963 | trainable_weights = self.trainable_weights 964 | self._collected_trainable_weights = trainable_weights 965 | 966 | def _make_train_function(self): 967 | if not hasattr(self, 'train_function'): 968 | raise RuntimeError('You must compile your model before using it.') 969 | if self.train_function is None: 970 | inputs = self._feed_inputs + self._feed_targets + self._feed_sample_weights 971 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 972 | inputs += [K.learning_phase()] 973 | 974 | training_updates = self.optimizer.get_updates( 975 | self._collected_trainable_weights, 976 | self.constraints, 977 | self.total_loss) 978 | updates = self.updates + training_updates 979 | # Gets loss and metrics. Updates weights at each call. 
980 | self.train_function = K.function(inputs, 981 | [self.total_loss] + self.metrics_tensors, 982 | updates=updates, 983 | name='train_function', 984 | **self._function_kwargs) 985 | 986 | def _make_test_function(self): 987 | if not hasattr(self, 'test_function'): 988 | raise RuntimeError('You must compile your model before using it.') 989 | if self.test_function is None: 990 | inputs = self._feed_inputs + self._feed_targets + self._feed_sample_weights 991 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 992 | inputs += [K.learning_phase()] 993 | # Return loss and metrics, no gradient updates. 994 | # Does update the network states. 995 | self.test_function = K.function(inputs, 996 | [self.total_loss] + self.metrics_tensors, 997 | updates=self.state_updates, 998 | name='test_function', 999 | **self._function_kwargs) 1000 | 1001 | def _make_predict_function(self): 1002 | if not hasattr(self, 'predict_function'): 1003 | self.predict_function = None 1004 | if self.predict_function is None: 1005 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1006 | inputs = self._feed_inputs + [K.learning_phase()] 1007 | else: 1008 | inputs = self._feed_inputs 1009 | # Gets network outputs. Does not update weights. 1010 | # Does update the network states. 1011 | kwargs = getattr(self, '_function_kwargs', {}) 1012 | self.predict_function = K.function(inputs, 1013 | self.outputs, 1014 | updates=self.state_updates, 1015 | name='predict_function', 1016 | **kwargs) 1017 | 1018 | def _fit_loop(self, f, ins, out_labels=None, batch_size=32, 1019 | epochs=100, verbose=1, callbacks=None, 1020 | val_f=None, val_ins=None, shuffle=True, 1021 | callback_metrics=None, initial_epoch=0): 1022 | """Abstract fit function for `f(ins)`. 1023 | 1024 | Assume that f returns a list, labeled by out_labels. 1025 | 1026 | # Arguments 1027 | f: Keras function returning a list of tensors 1028 | ins: list of tensors to be fed to `f` 1029 | out_labels: list of strings, display names of 1030 | the outputs of `f` 1031 | batch_size: integer batch size 1032 | epochs: number of times to iterate over the data 1033 | verbose: verbosity mode, 0, 1 or 2 1034 | callbacks: list of callbacks to be called during training 1035 | val_f: Keras function to call for validation 1036 | val_ins: list of tensors to be fed to `val_f` 1037 | shuffle: whether to shuffle the data at the beginning of each epoch 1038 | callback_metrics: list of strings, the display names of the metrics 1039 | passed to the callbacks. They should be the 1040 | concatenation of list the display names of the outputs of 1041 | `f` and the list of display names of the outputs of `f_val`. 1042 | initial_epoch: epoch at which to start training 1043 | (useful for resuming a previous training run) 1044 | 1045 | # Returns 1046 | `History` object. 1047 | """ 1048 | do_validation = False 1049 | if val_f and val_ins: 1050 | do_validation = True 1051 | if verbose: 1052 | print('Train on %d samples, validate on %d samples' % 1053 | (ins[0].shape[0], val_ins[0].shape[0])) 1054 | 1055 | if ins and hasattr(ins[0], 'shape'): 1056 | num_train_samples = ins[0].shape[0] 1057 | else: 1058 | # May happen if we are running `fit` without Numpy input data, 1059 | # i.e. if all inputs to the models are data tensors 1060 | # instead of placeholders. 1061 | # In that case we will run `fit` over a single batch. 
1062 | num_train_samples = batch_size 1063 | verbose = 2 1064 | index_array = np.arange(num_train_samples) 1065 | 1066 | self.history = cbks.History() 1067 | callbacks = [cbks.BaseLogger()] + (callbacks or []) + [self.history] 1068 | if verbose: 1069 | callbacks += [cbks.ProgbarLogger()] 1070 | callbacks = cbks.CallbackList(callbacks) 1071 | out_labels = out_labels or [] 1072 | 1073 | # it's possible to callback a different model than self 1074 | # (used by Sequential models) 1075 | if hasattr(self, 'callback_model') and self.callback_model: 1076 | callback_model = self.callback_model 1077 | else: 1078 | callback_model = self 1079 | 1080 | callbacks.set_model(callback_model) 1081 | callbacks.set_params({ 1082 | 'batch_size': batch_size, 1083 | 'epochs': epochs, 1084 | 'samples': num_train_samples, 1085 | 'verbose': verbose, 1086 | 'do_validation': do_validation, 1087 | 'metrics': callback_metrics or [], 1088 | }) 1089 | callbacks.on_train_begin() 1090 | callback_model.stop_training = False 1091 | for cbk in callbacks: 1092 | cbk.validation_data = val_ins 1093 | 1094 | for epoch in range(initial_epoch, epochs): 1095 | callbacks.on_epoch_begin(epoch) 1096 | if shuffle == 'batch': 1097 | index_array = _batch_shuffle(index_array, batch_size) 1098 | elif shuffle: 1099 | np.random.shuffle(index_array) 1100 | 1101 | batches = _make_batches(num_train_samples, batch_size) 1102 | epoch_logs = {} 1103 | for batch_index, (batch_start, batch_end) in enumerate(batches): 1104 | batch_ids = index_array[batch_start:batch_end] 1105 | try: 1106 | if isinstance(ins[-1], float): 1107 | # Do not slice the training phase flag. 1108 | ins_batch = _slice_arrays(ins[:-1], batch_ids) + [ins[-1]] 1109 | else: 1110 | ins_batch = _slice_arrays(ins, batch_ids) 1111 | except TypeError: 1112 | raise TypeError('TypeError while preparing batch. ' 1113 | 'If using HDF5 input data, ' 1114 | 'pass shuffle="batch".') 1115 | batch_logs = {} 1116 | batch_logs['batch'] = batch_index 1117 | batch_logs['size'] = len(batch_ids) 1118 | callbacks.on_batch_begin(batch_index, batch_logs) 1119 | outs = f(ins_batch) 1120 | if not isinstance(outs, list): 1121 | outs = [outs] 1122 | for l, o in zip(out_labels, outs): 1123 | batch_logs[l] = o 1124 | 1125 | callbacks.on_batch_end(batch_index, batch_logs) 1126 | if callback_model.stop_training: 1127 | break 1128 | 1129 | if batch_index == len(batches) - 1: # Last batch. 1130 | if do_validation: 1131 | val_outs = self._test_loop(val_f, val_ins, 1132 | batch_size=batch_size, 1133 | verbose=0) 1134 | if not isinstance(val_outs, list): 1135 | val_outs = [val_outs] 1136 | # Same labels assumed. 1137 | for l, o in zip(out_labels, val_outs): 1138 | epoch_logs['val_' + l] = o 1139 | callbacks.on_epoch_end(epoch, epoch_logs) 1140 | if callback_model.stop_training: 1141 | break 1142 | callbacks.on_train_end() 1143 | return self.history 1144 | 1145 | def _predict_loop(self, f, ins, batch_size=32, verbose=0): 1146 | """Abstract method to loop over some data in batches. 1147 | 1148 | # Arguments 1149 | f: Keras function returning a list of tensors. 1150 | ins: list of tensors to be fed to `f`. 1151 | batch_size: integer batch size. 1152 | verbose: verbosity mode. 1153 | 1154 | # Returns 1155 | Array of predictions (if the model has a single output) 1156 | or list of arrays of predictions 1157 | (if the model has multiple outputs). 
1158 | """ 1159 | if ins and hasattr(ins[0], 'shape'): 1160 | samples = ins[0].shape[0] 1161 | else: 1162 | # May happen if we are running `predict` without Numpy input data, 1163 | # i.e. if all inputs to the models are data tensors 1164 | # instead of placeholders. 1165 | # In that case we will run `predict` over a single batch. 1166 | samples = batch_size 1167 | verbose = 2 1168 | outs = [] 1169 | if verbose == 1: 1170 | progbar = Progbar(target=samples) 1171 | batches = _make_batches(samples, batch_size) 1172 | index_array = np.arange(samples) 1173 | for batch_index, (batch_start, batch_end) in enumerate(batches): 1174 | batch_ids = index_array[batch_start:batch_end] 1175 | if ins and isinstance(ins[-1], float): 1176 | # Do not slice the training phase flag. 1177 | ins_batch = _slice_arrays(ins[:-1], batch_ids) + [ins[-1]] 1178 | else: 1179 | ins_batch = _slice_arrays(ins, batch_ids) 1180 | 1181 | batch_outs = f(ins_batch) 1182 | if not isinstance(batch_outs, list): 1183 | batch_outs = [batch_outs] 1184 | if batch_index == 0: 1185 | for batch_out in batch_outs: 1186 | shape = (samples,) + batch_out.shape[1:] 1187 | outs.append(np.zeros(shape, dtype=batch_out.dtype)) 1188 | 1189 | for i, batch_out in enumerate(batch_outs): 1190 | outs[i][batch_start:batch_end] = batch_out 1191 | if verbose == 1: 1192 | progbar.update(batch_end) 1193 | if len(outs) == 1: 1194 | return outs[0] 1195 | return outs 1196 | 1197 | def _test_loop(self, f, ins, batch_size=32, verbose=0): 1198 | """Abstract method to loop over some data in batches. 1199 | 1200 | # Arguments 1201 | f: Keras function returning a list of tensors. 1202 | ins: list of tensors to be fed to `f`. 1203 | batch_size: integer batch size. 1204 | verbose: verbosity mode. 1205 | 1206 | # Returns 1207 | Scalar loss (if the model has a single output and no metrics) 1208 | or list of scalars (if the model has multiple outputs 1209 | and/or metrics). The attribute `model.metrics_names` will give you 1210 | the display labels for the scalar outputs. 1211 | """ 1212 | if ins and hasattr(ins[0], 'shape'): 1213 | samples = ins[0].shape[0] 1214 | else: 1215 | # May happen if we are running `evaluate` without Numpy input data, 1216 | # i.e. if all inputs to the models are data tensors 1217 | # instead of placeholders. 1218 | # In that case we will run `evaluate` over a single batch. 1219 | samples = batch_size 1220 | verbose = 2 1221 | 1222 | outs = [] 1223 | if verbose == 1: 1224 | progbar = Progbar(target=samples) 1225 | batches = _make_batches(samples, batch_size) 1226 | index_array = np.arange(samples) 1227 | for batch_index, (batch_start, batch_end) in enumerate(batches): 1228 | batch_ids = index_array[batch_start:batch_end] 1229 | if isinstance(ins[-1], float): 1230 | # Do not slice the training phase flag. 1231 | ins_batch = _slice_arrays(ins[:-1], batch_ids) + [ins[-1]] 1232 | else: 1233 | ins_batch = _slice_arrays(ins, batch_ids) 1234 | 1235 | batch_outs = f(ins_batch) 1236 | if isinstance(batch_outs, list): 1237 | if batch_index == 0: 1238 | for batch_out in enumerate(batch_outs): 1239 | outs.append(0.) 1240 | for i, batch_out in enumerate(batch_outs): 1241 | outs[i] += batch_out * len(batch_ids) 1242 | else: 1243 | if batch_index == 0: 1244 | outs.append(0.) 
1245 | outs[0] += batch_outs * len(batch_ids) 1246 | 1247 | if verbose == 1: 1248 | progbar.update(batch_end) 1249 | for i in range(len(outs)): 1250 | outs[i] /= samples 1251 | if len(outs) == 1: 1252 | return outs[0] 1253 | return outs 1254 | 1255 | def _standardize_user_data(self, x, y, 1256 | sample_weight=None, class_weight=None, 1257 | check_batch_axis=True, batch_size=None): 1258 | if not hasattr(self, 'optimizer'): 1259 | raise RuntimeError('You must compile a model before ' 1260 | 'training/testing. ' 1261 | 'Use `model.compile(optimizer, loss)`.') 1262 | 1263 | output_shapes = [] 1264 | for output_shape, loss_fn in zip(self._feed_output_shapes, self._feed_loss_fns): 1265 | if loss_fn.__name__ == 'sparse_categorical_crossentropy': 1266 | output_shapes.append(output_shape[:-1] + (1,)) 1267 | elif getattr(losses, loss_fn.__name__, None) is None: 1268 | output_shapes.append(None) 1269 | else: 1270 | output_shapes.append(output_shape) 1271 | x = _standardize_input_data(x, self._feed_input_names, 1272 | self._feed_input_shapes, 1273 | check_batch_axis=False, 1274 | exception_prefix='input') 1275 | y = _standardize_input_data(y, self._feed_output_names, 1276 | output_shapes, 1277 | check_batch_axis=False, 1278 | exception_prefix='target') 1279 | sample_weights = _standardize_sample_weights(sample_weight, 1280 | self._feed_output_names) 1281 | class_weights = _standardize_class_weights(class_weight, 1282 | self._feed_output_names) 1283 | sample_weights = [_standardize_weights(ref, sw, cw, mode) 1284 | for (ref, sw, cw, mode) 1285 | in zip(y, sample_weights, class_weights, self._feed_sample_weight_modes)] 1286 | _check_array_lengths(x, y, sample_weights) 1287 | _check_loss_and_target_compatibility(y, 1288 | self._feed_loss_fns, 1289 | self._feed_output_shapes) 1290 | if self.stateful and batch_size: 1291 | if x[0].shape[0] % batch_size != 0: 1292 | raise ValueError('In a stateful network, ' 1293 | 'you should only pass inputs with ' 1294 | 'a number of samples that can be ' 1295 | 'divided by the batch size. Found: ' + 1296 | str(x[0].shape[0]) + ' samples') 1297 | return x, y, sample_weights 1298 | 1299 | def _get_deduped_metrics_names(self): 1300 | out_labels = self.metrics_names 1301 | 1302 | # Rename duplicated metrics name 1303 | # (can happen with an output layer shared among multiple dataflows). 1304 | deduped_out_labels = [] 1305 | for i, label in enumerate(out_labels): 1306 | new_label = label 1307 | if out_labels.count(label) > 1: 1308 | dup_idx = out_labels[:i].count(label) 1309 | new_label += '_' + str(dup_idx + 1) 1310 | deduped_out_labels.append(new_label) 1311 | return deduped_out_labels 1312 | 1313 | def fit(self, x=None, 1314 | y=None, 1315 | batch_size=32, 1316 | epochs=1, 1317 | verbose=1, 1318 | callbacks=None, 1319 | validation_split=0., 1320 | validation_data=None, 1321 | shuffle=True, 1322 | class_weight=None, 1323 | sample_weight=None, 1324 | initial_epoch=0, 1325 | **kwargs): 1326 | """Trains the model for a fixed number of epochs (iterations on a dataset). 1327 | 1328 | # Arguments 1329 | x: Numpy array of training data, 1330 | or list of Numpy arrays if the model has multiple inputs. 1331 | If all inputs in the model are named, 1332 | you can also pass a dictionary 1333 | mapping input names to Numpy arrays. 1334 | y: Numpy array of target data, 1335 | or list of Numpy arrays if the model has multiple outputs. 1336 | If all outputs in the model are named, 1337 | you can also pass a dictionary 1338 | mapping output names to Numpy arrays. 
1339 | batch_size: integer. Number of samples per gradient update. 1340 | epochs: integer, the number of times to iterate 1341 | over the training data arrays. 1342 | verbose: 0, 1, or 2. Verbosity mode. 1343 | 0 = silent, 1 = verbose, 2 = one log line per epoch. 1344 | callbacks: list of callbacks to be called during training. 1345 | See [callbacks](/callbacks). 1346 | validation_split: float between 0 and 1: 1347 | fraction of the training data to be used as validation data. 1348 | The model will set apart this fraction of the training data, 1349 | will not train on it, and will evaluate 1350 | the loss and any model metrics 1351 | on this data at the end of each epoch. 1352 | validation_data: data on which to evaluate 1353 | the loss and any model metrics 1354 | at the end of each epoch. The model will not 1355 | be trained on this data. 1356 | This could be a tuple (x_val, y_val) 1357 | or a tuple (x_val, y_val, val_sample_weights). 1358 | shuffle: boolean, whether to shuffle the training data 1359 | before each epoch. 1360 | class_weight: optional dictionary mapping 1361 | class indices (integers) to 1362 | a weight (float) to apply to the model's loss for the samples 1363 | from this class during training. 1364 | This can be useful to tell the model to "pay more attention" to 1365 | samples from an under-represented class. 1366 | sample_weight: optional array of the same length as x, containing 1367 | weights to apply to the model's loss for each sample. 1368 | In the case of temporal data, you can pass a 2D array 1369 | with shape (samples, sequence_length), 1370 | to apply a different weight to every timestep of every sample. 1371 | In this case you should make sure to specify 1372 | sample_weight_mode="temporal" in compile(). 1373 | initial_epoch: epoch at which to start training 1374 | (useful for resuming a previous training run) 1375 | 1376 | # Returns 1377 | A `History` instance. Its `history` attribute contains 1378 | all information collected during training. 1379 | 1380 | # Raises 1381 | ValueError: In case of mismatch between the provided input data 1382 | and what the model expects. 1383 | """ 1384 | # Legacy support 1385 | if 'nb_epoch' in kwargs: 1386 | warnings.warn('The `nb_epoch` argument in `fit` ' 1387 | 'has been renamed `epochs`.', stacklevel=2) 1388 | epochs = kwargs.pop('nb_epoch') 1389 | if kwargs: 1390 | raise TypeError('Unrecognized keyword arguments: ' + str(kwargs)) 1391 | 1392 | # Validate user data. 1393 | x, y, sample_weights = self._standardize_user_data( 1394 | x, y, 1395 | sample_weight=sample_weight, 1396 | class_weight=class_weight, 1397 | check_batch_axis=False, 1398 | batch_size=batch_size) 1399 | # Prepare validation data. 
1400 | if validation_data: 1401 | do_validation = True 1402 | if len(validation_data) == 2: 1403 | val_x, val_y = validation_data 1404 | val_sample_weight = None 1405 | elif len(validation_data) == 3: 1406 | val_x, val_y, val_sample_weight = validation_data 1407 | else: 1408 | raise ValueError('When passing validation_data, ' 1409 | 'it must contain 2 (x_val, y_val) ' 1410 | 'or 3 (x_val, y_val, val_sample_weights) ' 1411 | 'items, however it contains %d items' % 1412 | len(validation_data)) 1413 | 1414 | val_x, val_y, val_sample_weights = self._standardize_user_data( 1415 | val_x, val_y, 1416 | sample_weight=val_sample_weight, 1417 | check_batch_axis=False, 1418 | batch_size=batch_size) 1419 | self._make_test_function() 1420 | val_f = self.test_function 1421 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1422 | val_ins = val_x + val_y + val_sample_weights + [0.] 1423 | else: 1424 | val_ins = val_x + val_y + val_sample_weights 1425 | 1426 | elif validation_split and 0. < validation_split < 1.: 1427 | do_validation = True 1428 | if hasattr(x[0], 'shape'): 1429 | split_at = int(x[0].shape[0] * (1. - validation_split)) 1430 | else: 1431 | split_at = int(len(x[0]) * (1. - validation_split)) 1432 | x, val_x = (_slice_arrays(x, 0, split_at), _slice_arrays(x, split_at)) 1433 | y, val_y = (_slice_arrays(y, 0, split_at), _slice_arrays(y, split_at)) 1434 | sample_weights, val_sample_weights = ( 1435 | _slice_arrays(sample_weights, 0, split_at), 1436 | _slice_arrays(sample_weights, split_at)) 1437 | self._make_test_function() 1438 | val_f = self.test_function 1439 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1440 | val_ins = val_x + val_y + val_sample_weights + [0.] 1441 | else: 1442 | val_ins = val_x + val_y + val_sample_weights 1443 | else: 1444 | do_validation = False 1445 | val_f = None 1446 | val_ins = None 1447 | 1448 | # Prepare input arrays and training function. 1449 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1450 | ins = x + y + sample_weights + [1.] 1451 | else: 1452 | ins = x + y + sample_weights 1453 | self._make_train_function() 1454 | f = self.train_function 1455 | 1456 | # Prepare display labels. 1457 | out_labels = self._get_deduped_metrics_names() 1458 | 1459 | if do_validation: 1460 | callback_metrics = copy.copy(out_labels) + ['val_' + n for n in out_labels] 1461 | else: 1462 | callback_metrics = copy.copy(out_labels) 1463 | 1464 | # Delegate logic to `_fit_loop`. 1465 | return self._fit_loop(f, ins, out_labels=out_labels, 1466 | batch_size=batch_size, epochs=epochs, 1467 | verbose=verbose, callbacks=callbacks, 1468 | val_f=val_f, val_ins=val_ins, shuffle=shuffle, 1469 | callback_metrics=callback_metrics, 1470 | initial_epoch=initial_epoch) 1471 | 1472 | def evaluate(self, x, y, batch_size=32, verbose=1, sample_weight=None): 1473 | """Returns the loss value & metrics values for the model in test mode. 1474 | 1475 | Computation is done in batches. 1476 | 1477 | # Arguments 1478 | x: Numpy array of test data, 1479 | or list of Numpy arrays if the model has multiple inputs. 1480 | If all inputs in the model are named, 1481 | you can also pass a dictionary 1482 | mapping input names to Numpy arrays. 1483 | y: Numpy array of target data, 1484 | or list of Numpy arrays if the model has multiple outputs. 1485 | If all outputs in the model are named, 1486 | you can also pass a dictionary 1487 | mapping output names to Numpy arrays. 1488 | batch_size: integer. Number of samples per gradient update. 
1489 | verbose: verbosity mode, 0 or 1. 1490 | sample_weight: Array of weights to weight the contribution 1491 | of different samples to the loss and metrics. 1492 | 1493 | # Returns 1494 | Scalar test loss (if the model has a single output and no metrics) 1495 | or list of scalars (if the model has multiple outputs 1496 | and/or metrics). The attribute `model.metrics_names` will give you 1497 | the display labels for the scalar outputs. 1498 | """ 1499 | # Validate user data. 1500 | x, y, sample_weights = self._standardize_user_data( 1501 | x, y, 1502 | sample_weight=sample_weight, 1503 | check_batch_axis=False, 1504 | batch_size=batch_size) 1505 | # Prepare inputs, delegate logic to `_test_loop`. 1506 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1507 | ins = x + y + sample_weights + [0.] 1508 | else: 1509 | ins = x + y + sample_weights 1510 | self._make_test_function() 1511 | f = self.test_function 1512 | return self._test_loop(f, ins, 1513 | batch_size=batch_size, 1514 | verbose=verbose) 1515 | 1516 | def predict(self, x, batch_size=32, verbose=0): 1517 | """Generates output predictions for the input samples. 1518 | 1519 | Computation is done in batches. 1520 | 1521 | # Arguments 1522 | x: the input data, as a Numpy array 1523 | (or list of Numpy arrays if the model has multiple outputs). 1524 | batch_size: integer. 1525 | verbose: verbosity mode, 0 or 1. 1526 | 1527 | # Returns 1528 | Numpy array(s) of predictions. 1529 | 1530 | # Raises 1531 | ValueError: In case of mismatch between the provided 1532 | input data and the model's expectations, 1533 | or in case a stateful model receives a number of samples 1534 | that is not a multiple of the batch size. 1535 | """ 1536 | # Validate user data. 1537 | x = _standardize_input_data(x, self._feed_input_names, 1538 | self._feed_input_shapes, 1539 | check_batch_axis=False) 1540 | if self.stateful: 1541 | if x[0].shape[0] > batch_size and x[0].shape[0] % batch_size != 0: 1542 | raise ValueError('In a stateful network, ' 1543 | 'you should only pass inputs with ' 1544 | 'a number of samples that can be ' 1545 | 'divided by the batch size. Found: ' + 1546 | str(x[0].shape[0]) + ' samples. ' 1547 | 'Batch size: ' + str(batch_size) + '.') 1548 | 1549 | # Prepare inputs, delegate logic to `_predict_loop`. 1550 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1551 | ins = x + [0.] 1552 | else: 1553 | ins = x 1554 | self._make_predict_function() 1555 | f = self.predict_function 1556 | return self._predict_loop(f, ins, 1557 | batch_size=batch_size, verbose=verbose) 1558 | 1559 | def train_on_batch(self, x, y, 1560 | sample_weight=None, class_weight=None): 1561 | """Runs a single gradient update on a single batch of data. 1562 | 1563 | # Arguments 1564 | x: Numpy array of training data, 1565 | or list of Numpy arrays if the model has multiple inputs. 1566 | If all inputs in the model are named, 1567 | you can also pass a dictionary 1568 | mapping input names to Numpy arrays. 1569 | y: Numpy array of target data, 1570 | or list of Numpy arrays if the model has multiple outputs. 1571 | If all outputs in the model are named, 1572 | you can also pass a dictionary 1573 | mapping output names to Numpy arrays. 1574 | sample_weight: optional array of the same length as x, containing 1575 | weights to apply to the model's loss for each sample. 
1576 | In the case of temporal data, you can pass a 2D array 1577 | with shape (samples, sequence_length), 1578 | to apply a different weight to every timestep of every sample. 1579 | In this case you should make sure to specify 1580 | sample_weight_mode="temporal" in compile(). 1581 | class_weight: optional dictionary mapping 1582 | class indices (integers) to 1583 | a weight (float) to apply to the model's loss for the samples 1584 | from this class during training. 1585 | This can be useful to tell the model to "pay more attention" to 1586 | samples from an under-represented class. 1587 | 1588 | # Returns 1589 | Scalar training loss 1590 | (if the model has a single output and no metrics) 1591 | or list of scalars (if the model has multiple outputs 1592 | and/or metrics). The attribute `model.metrics_names` will give you 1593 | the display labels for the scalar outputs. 1594 | """ 1595 | x, y, sample_weights = self._standardize_user_data( 1596 | x, y, 1597 | sample_weight=sample_weight, 1598 | class_weight=class_weight, 1599 | check_batch_axis=True) 1600 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1601 | ins = x + y + sample_weights + [1.] 1602 | else: 1603 | ins = x + y + sample_weights 1604 | self._make_train_function() 1605 | outputs = self.train_function(ins) 1606 | if len(outputs) == 1: 1607 | return outputs[0] 1608 | return outputs 1609 | 1610 | def test_on_batch(self, x, y, sample_weight=None): 1611 | """Test the model on a single batch of samples. 1612 | 1613 | # Arguments 1614 | x: Numpy array of test data, 1615 | or list of Numpy arrays if the model has multiple inputs. 1616 | If all inputs in the model are named, 1617 | you can also pass a dictionary 1618 | mapping input names to Numpy arrays. 1619 | y: Numpy array of target data, 1620 | or list of Numpy arrays if the model has multiple outputs. 1621 | If all outputs in the model are named, 1622 | you can also pass a dictionary 1623 | mapping output names to Numpy arrays. 1624 | sample_weight: optional array of the same length as x, containing 1625 | weights to apply to the model's loss for each sample. 1626 | In the case of temporal data, you can pass a 2D array 1627 | with shape (samples, sequence_length), 1628 | to apply a different weight to every timestep of every sample. 1629 | In this case you should make sure to specify 1630 | sample_weight_mode="temporal" in compile(). 1631 | 1632 | # Returns 1633 | Scalar test loss (if the model has a single output and no metrics) 1634 | or list of scalars (if the model has multiple outputs 1635 | and/or metrics). The attribute `model.metrics_names` will give you 1636 | the display labels for the scalar outputs. 1637 | """ 1638 | x, y, sample_weights = self._standardize_user_data( 1639 | x, y, 1640 | sample_weight=sample_weight, 1641 | check_batch_axis=True) 1642 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1643 | ins = x + y + sample_weights + [0.] 1644 | else: 1645 | ins = x + y + sample_weights 1646 | self._make_test_function() 1647 | outputs = self.test_function(ins) 1648 | if len(outputs) == 1: 1649 | return outputs[0] 1650 | return outputs 1651 | 1652 | def predict_on_batch(self, x): 1653 | """Returns predictions for a single batch of samples. 1654 | 1655 | # Arguments 1656 | x: Input samples, as a Numpy array. 1657 | 1658 | # Returns 1659 | Numpy array(s) of predictions. 
1660 | """ 1661 | x = _standardize_input_data(x, self._feed_input_names, 1662 | self._feed_input_shapes) 1663 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1664 | ins = x + [0.] 1665 | else: 1666 | ins = x 1667 | self._make_predict_function() 1668 | outputs = self.predict_function(ins) 1669 | if len(outputs) == 1: 1670 | return outputs[0] 1671 | return outputs 1672 | 1673 | @interfaces.legacy_generator_methods_support 1674 | def fit_generator(self, generator, 1675 | steps_per_epoch, 1676 | epochs=1, 1677 | verbose=1, 1678 | callbacks=None, 1679 | validation_data=None, 1680 | validation_steps=None, 1681 | class_weight=None, 1682 | max_queue_size=10, 1683 | workers=1, 1684 | use_multiprocessing=False, 1685 | initial_epoch=0): 1686 | """Fits the model on data yielded batch-by-batch by a Python generator. 1687 | 1688 | The generator is run in parallel to the model, for efficiency. 1689 | For instance, this allows you to do real-time data augmentation 1690 | on images on CPU in parallel to training your model on GPU. 1691 | 1692 | The use of `keras.utils.Sequence` guarantees the ordering 1693 | and guarantees the single use of every input per epoch when 1694 | using `use_multiprocessing=True`. 1695 | 1696 | # Arguments 1697 | generator: a generator or an instance of Sequence (keras.utils.Sequence) 1698 | object in order to avoid duplicate data 1699 | when using multiprocessing. 1700 | The output of the generator must be either 1701 | - a tuple (inputs, targets) 1702 | - a tuple (inputs, targets, sample_weights). 1703 | All arrays should contain the same number of samples. 1704 | The generator is expected to loop over its data 1705 | indefinitely. An epoch finishes when `steps_per_epoch` 1706 | batches have been seen by the model. 1707 | steps_per_epoch: Total number of steps (batches of samples) 1708 | to yield from `generator` before declaring one epoch 1709 | finished and starting the next epoch. It should typically 1710 | be equal to the number of unique samples if your dataset 1711 | divided by the batch size. 1712 | epochs: integer, total number of iterations on the data. 1713 | verbose: verbosity mode, 0, 1, or 2. 1714 | callbacks: list of callbacks to be called during training. 1715 | validation_data: this can be either 1716 | - a generator for the validation data 1717 | - a tuple (inputs, targets) 1718 | - a tuple (inputs, targets, sample_weights). 1719 | validation_steps: Only relevant if `validation_data` 1720 | is a generator. Total number of steps (batches of samples) 1721 | to yield from `generator` before stopping. 1722 | class_weight: dictionary mapping class indices to a weight 1723 | for the class. 1724 | max_queue_size: maximum size for the generator queue 1725 | workers: maximum number of processes to spin up 1726 | when using process based threading 1727 | use_multiprocessing: if True, use process based threading. 1728 | Note that because 1729 | this implementation relies on multiprocessing, 1730 | you should not pass 1731 | non picklable arguments to the generator 1732 | as they can't be passed 1733 | easily to children processes. 1734 | initial_epoch: epoch at which to start training 1735 | (useful for resuming a previous training run) 1736 | 1737 | # Returns 1738 | A `History` object. 
1739 | 1740 | # Example 1741 | 1742 | ```python 1743 | def generate_arrays_from_file(path): 1744 | while 1: 1745 | f = open(path) 1746 | for line in f: 1747 | # create numpy arrays of input data 1748 | # and labels, from each line in the file 1749 | x1, x2, y = process_line(line) 1750 | yield ({'input_1': x1, 'input_2': x2}, {'output': y}) 1751 | f.close() 1752 | 1753 | model.fit_generator(generate_arrays_from_file('/my_file.txt'), 1754 | steps_per_epoch=10000, epochs=10) 1755 | ``` 1756 | 1757 | # Raises 1758 | ValueError: In case the generator yields 1759 | data in an invalid format. 1760 | """ 1761 | wait_time = 0.01 # in seconds 1762 | epoch = initial_epoch 1763 | 1764 | do_validation = bool(validation_data) 1765 | self._make_train_function() 1766 | if do_validation: 1767 | self._make_test_function() 1768 | 1769 | # python 2 has 'next', 3 has '__next__' 1770 | # avoid any explicit version checks 1771 | val_gen = (hasattr(validation_data, 'next') or 1772 | hasattr(validation_data, '__next__') or 1773 | isinstance(validation_data, Sequence)) 1774 | if val_gen and not validation_steps: 1775 | raise ValueError('When using a generator for validation data, ' 1776 | 'you must specify a value for ' 1777 | '`validation_steps`.') 1778 | 1779 | # Prepare display labels. 1780 | out_labels = self._get_deduped_metrics_names() 1781 | callback_metrics = out_labels + ['val_' + n for n in out_labels] 1782 | 1783 | # prepare callbacks 1784 | self.history = cbks.History() 1785 | callbacks = [cbks.BaseLogger()] + (callbacks or []) + [self.history] 1786 | if verbose: 1787 | callbacks += [cbks.ProgbarLogger(count_mode='steps')] 1788 | callbacks = cbks.CallbackList(callbacks) 1789 | 1790 | # it's possible to callback a different model than self: 1791 | if hasattr(self, 'callback_model') and self.callback_model: 1792 | callback_model = self.callback_model 1793 | else: 1794 | callback_model = self 1795 | callbacks.set_model(callback_model) 1796 | callbacks.set_params({ 1797 | 'epochs': epochs, 1798 | 'steps': steps_per_epoch, 1799 | 'verbose': verbose, 1800 | 'do_validation': do_validation, 1801 | 'metrics': callback_metrics, 1802 | }) 1803 | callbacks.on_train_begin() 1804 | 1805 | if do_validation and not val_gen: 1806 | if len(validation_data) == 2: 1807 | val_x, val_y = validation_data 1808 | val_sample_weight = None 1809 | elif len(validation_data) == 3: 1810 | val_x, val_y, val_sample_weight = validation_data 1811 | else: 1812 | raise ValueError('`validation_data` should be a tuple ' 1813 | '`(val_x, val_y, val_sample_weight)` ' 1814 | 'or `(val_x, val_y)`. Found: ' + 1815 | str(validation_data)) 1816 | val_x, val_y, val_sample_weights = self._standardize_user_data( 1817 | val_x, val_y, val_sample_weight) 1818 | val_data = val_x + val_y + val_sample_weights 1819 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1820 | val_data += [0.] 1821 | for cbk in callbacks: 1822 | cbk.validation_data = val_data 1823 | is_sequence = isinstance(generator, Sequence) 1824 | if not is_sequence and use_multiprocessing and workers > 1: 1825 | warnings.warn( 1826 | UserWarning('Using a generator with `use_multiprocessing=True`' 1827 | ' and multiple workers may duplicate your data.' 
1828 | ' Please consider using the`keras.utils.Sequence' 1829 | ' class.')) 1830 | enqueuer = None 1831 | 1832 | try: 1833 | if is_sequence: 1834 | enqueuer = OrderedEnqueuer(generator, 1835 | use_multiprocessing=use_multiprocessing) 1836 | else: 1837 | enqueuer = GeneratorEnqueuer(generator, 1838 | use_multiprocessing=use_multiprocessing, 1839 | wait_time=wait_time) 1840 | enqueuer.start(workers=workers, max_queue_size=max_queue_size) 1841 | output_generator = enqueuer.get() 1842 | 1843 | callback_model.stop_training = False 1844 | while epoch < epochs: 1845 | callbacks.on_epoch_begin(epoch) 1846 | steps_done = 0 1847 | batch_index = 0 1848 | while steps_done < steps_per_epoch: 1849 | generator_output = next(output_generator) 1850 | 1851 | if not hasattr(generator_output, '__len__'): 1852 | raise ValueError('Output of generator should be ' 1853 | 'a tuple `(x, y, sample_weight)` ' 1854 | 'or `(x, y)`. Found: ' + 1855 | str(generator_output)) 1856 | if len(generator_output) == 2: 1857 | x, y = generator_output 1858 | sample_weight = None 1859 | elif len(generator_output) == 3: 1860 | x, y, sample_weight = generator_output 1861 | else: 1862 | raise ValueError('Output of generator should be ' 1863 | 'a tuple `(x, y, sample_weight)` ' 1864 | 'or `(x, y)`. Found: ' + 1865 | str(generator_output)) 1866 | # build batch logs 1867 | batch_logs = {} 1868 | if isinstance(x, list): 1869 | batch_size = x[0].shape[0] 1870 | elif isinstance(x, dict): 1871 | batch_size = list(x.values())[0].shape[0] 1872 | else: 1873 | batch_size = x.shape[0] 1874 | batch_logs['batch'] = batch_index 1875 | batch_logs['size'] = batch_size 1876 | callbacks.on_batch_begin(batch_index, batch_logs) 1877 | 1878 | outs = self.train_on_batch(x, y, 1879 | sample_weight=sample_weight, 1880 | class_weight=class_weight) 1881 | 1882 | if not isinstance(outs, list): 1883 | outs = [outs] 1884 | for l, o in zip(out_labels, outs): 1885 | batch_logs[l] = o 1886 | 1887 | callbacks.on_batch_end(batch_index, batch_logs) 1888 | 1889 | # Construct epoch logs. 1890 | epoch_logs = {} 1891 | batch_index += 1 1892 | steps_done += 1 1893 | 1894 | # Epoch finished. 1895 | if steps_done >= steps_per_epoch and do_validation: 1896 | if val_gen: 1897 | val_outs = self.evaluate_generator( 1898 | validation_data, 1899 | validation_steps, 1900 | max_queue_size=max_queue_size, 1901 | workers=workers, 1902 | use_multiprocessing=use_multiprocessing) 1903 | else: 1904 | # No need for try/except because 1905 | # data has already been validated. 1906 | val_outs = self.evaluate( 1907 | val_x, val_y, 1908 | batch_size=batch_size, 1909 | sample_weight=val_sample_weights, 1910 | verbose=0) 1911 | if not isinstance(val_outs, list): 1912 | val_outs = [val_outs] 1913 | # Same labels assumed. 1914 | for l, o in zip(out_labels, val_outs): 1915 | epoch_logs['val_' + l] = o 1916 | 1917 | callbacks.on_epoch_end(epoch, epoch_logs) 1918 | epoch += 1 1919 | if callback_model.stop_training: 1920 | break 1921 | 1922 | finally: 1923 | if enqueuer is not None: 1924 | enqueuer.stop() 1925 | 1926 | callbacks.on_train_end() 1927 | return self.history 1928 | 1929 | @interfaces.legacy_generator_methods_support 1930 | def evaluate_generator(self, generator, steps, 1931 | max_queue_size=10, 1932 | workers=1, 1933 | use_multiprocessing=False): 1934 | """Evaluates the model on a data generator. 1935 | 1936 | The generator should return the same kind of data 1937 | as accepted by `test_on_batch`. 
1938 | 1939 | # Arguments 1940 | generator: Generator yielding tuples (inputs, targets) 1941 | or (inputs, targets, sample_weights) 1942 | or an instance of Sequence (keras.utils.Sequence) 1943 | object in order to avoid duplicate data 1944 | when using multiprocessing. 1945 | steps: Total number of steps (batches of samples) 1946 | to yield from `generator` before stopping. 1947 | max_queue_size: maximum size for the generator queue 1948 | workers: maximum number of processes to spin up 1949 | when using process based threading 1950 | use_multiprocessing: if True, use process based threading. 1951 | Note that because 1952 | this implementation relies on multiprocessing, 1953 | you should not pass 1954 | non picklable arguments to the generator 1955 | as they can't be passed 1956 | easily to children processes. 1957 | 1958 | # Returns 1959 | Scalar test loss (if the model has a single output and no metrics) 1960 | or list of scalars (if the model has multiple outputs 1961 | and/or metrics). The attribute `model.metrics_names` will give you 1962 | the display labels for the scalar outputs. 1963 | 1964 | # Raises 1965 | ValueError: In case the generator yields 1966 | data in an invalid format. 1967 | """ 1968 | self._make_test_function() 1969 | 1970 | steps_done = 0 1971 | wait_time = 0.01 1972 | all_outs = [] 1973 | batch_sizes = [] 1974 | is_sequence = isinstance(generator, Sequence) 1975 | if not is_sequence and use_multiprocessing and workers > 1: 1976 | warnings.warn( 1977 | UserWarning('Using a generator with `use_multiprocessing=True`' 1978 | ' and multiple workers may duplicate your data.' 1979 | ' Please consider using the`keras.utils.Sequence' 1980 | ' class.')) 1981 | enqueuer = None 1982 | 1983 | try: 1984 | if is_sequence: 1985 | enqueuer = OrderedEnqueuer(generator, 1986 | use_multiprocessing=use_multiprocessing) 1987 | else: 1988 | enqueuer = GeneratorEnqueuer(generator, 1989 | use_multiprocessing=use_multiprocessing, 1990 | wait_time=wait_time) 1991 | enqueuer.start(workers=workers, max_queue_size=max_queue_size) 1992 | output_generator = enqueuer.get() 1993 | 1994 | while steps_done < steps: 1995 | generator_output = next(output_generator) 1996 | if not hasattr(generator_output, '__len__'): 1997 | raise ValueError('Output of generator should be a tuple ' 1998 | '(x, y, sample_weight) ' 1999 | 'or (x, y). Found: ' + 2000 | str(generator_output)) 2001 | if len(generator_output) == 2: 2002 | x, y = generator_output 2003 | sample_weight = None 2004 | elif len(generator_output) == 3: 2005 | x, y, sample_weight = generator_output 2006 | else: 2007 | raise ValueError('Output of generator should be a tuple ' 2008 | '(x, y, sample_weight) ' 2009 | 'or (x, y). Found: ' + 2010 | str(generator_output)) 2011 | outs = self.test_on_batch(x, y, sample_weight=sample_weight) 2012 | 2013 | if isinstance(x, list): 2014 | batch_size = len(x[0]) 2015 | elif isinstance(x, dict): 2016 | batch_size = len(list(x.values())[0]) 2017 | else: 2018 | batch_size = len(x) 2019 | if batch_size == 0: 2020 | raise ValueError('Received an empty batch. 
' 2021 | 'Batches should at least contain one item.') 2022 | all_outs.append(outs) 2023 | 2024 | steps_done += 1 2025 | batch_sizes.append(batch_size) 2026 | 2027 | finally: 2028 | if enqueuer is not None: 2029 | enqueuer.stop() 2030 | 2031 | if not isinstance(outs, list): 2032 | return np.average(np.asarray(all_outs), 2033 | weights=batch_sizes) 2034 | else: 2035 | averages = [] 2036 | for i in range(len(outs)): 2037 | averages.append(np.average([out[i] for out in all_outs], 2038 | weights=batch_sizes)) 2039 | return averages 2040 | 2041 | @interfaces.legacy_generator_methods_support 2042 | def predict_generator(self, generator, steps, 2043 | max_queue_size=10, 2044 | workers=1, 2045 | use_multiprocessing=False, 2046 | verbose=0): 2047 | """Generates predictions for the input samples from a data generator. 2048 | 2049 | The generator should return the same kind of data as accepted by 2050 | `predict_on_batch`. 2051 | 2052 | # Arguments 2053 | generator: Generator yielding batches of input samples 2054 | or an instance of Sequence (keras.utils.Sequence) 2055 | object in order to avoid duplicate data 2056 | when using multiprocessing. 2057 | steps: Total number of steps (batches of samples) 2058 | to yield from `generator` before stopping. 2059 | max_queue_size: Maximum size for the generator queue. 2060 | workers: Maximum number of processes to spin up 2061 | when using process based threading 2062 | use_multiprocessing: If `True`, use process based threading. 2063 | Note that because 2064 | this implementation relies on multiprocessing, 2065 | you should not pass 2066 | non picklable arguments to the generator 2067 | as they can't be passed 2068 | easily to children processes. 2069 | verbose: verbosity mode, 0 or 1. 2070 | 2071 | # Returns 2072 | Numpy array(s) of predictions. 2073 | 2074 | # Raises 2075 | ValueError: In case the generator yields 2076 | data in an invalid format. 2077 | """ 2078 | self._make_predict_function() 2079 | 2080 | steps_done = 0 2081 | wait_time = 0.01 2082 | all_outs = [] 2083 | is_sequence = isinstance(generator, Sequence) 2084 | if not is_sequence and use_multiprocessing and workers > 1: 2085 | warnings.warn( 2086 | UserWarning('Using a generator with `use_multiprocessing=True`' 2087 | ' and multiple workers may duplicate your data.' 2088 | ' Please consider using the`keras.utils.Sequence' 2089 | ' class.')) 2090 | enqueuer = None 2091 | 2092 | try: 2093 | if is_sequence: 2094 | enqueuer = OrderedEnqueuer(generator, 2095 | use_multiprocessing=use_multiprocessing) 2096 | else: 2097 | enqueuer = GeneratorEnqueuer(generator, 2098 | use_multiprocessing=use_multiprocessing, 2099 | wait_time=wait_time) 2100 | enqueuer.start(workers=workers, max_queue_size=max_queue_size) 2101 | output_generator = enqueuer.get() 2102 | 2103 | if verbose == 1: 2104 | progbar = Progbar(target=steps) 2105 | 2106 | while steps_done < steps: 2107 | generator_output = next(output_generator) 2108 | if isinstance(generator_output, tuple): 2109 | # Compatibility with the generators 2110 | # used for training. 2111 | if len(generator_output) == 2: 2112 | x, _ = generator_output 2113 | elif len(generator_output) == 3: 2114 | x, _, _ = generator_output 2115 | else: 2116 | raise ValueError('Output of generator should be ' 2117 | 'a tuple `(x, y, sample_weight)` ' 2118 | 'or `(x, y)`. Found: ' + 2119 | str(generator_output)) 2120 | else: 2121 | # Assumes a generator that only 2122 | # yields inputs (not targets and sample weights). 
2123 | x = generator_output 2124 | 2125 | outs = self.predict_on_batch(x) 2126 | if not isinstance(outs, list): 2127 | outs = [outs] 2128 | 2129 | if not all_outs: 2130 | for out in outs: 2131 | all_outs.append([]) 2132 | 2133 | for i, out in enumerate(outs): 2134 | all_outs[i].append(out) 2135 | steps_done += 1 2136 | if verbose == 1: 2137 | progbar.update(steps_done) 2138 | 2139 | finally: 2140 | if enqueuer is not None: 2141 | enqueuer.stop() 2142 | 2143 | if len(all_outs) == 1: 2144 | if steps_done == 1: 2145 | return all_outs[0][0] 2146 | else: 2147 | return np.concatenate(all_outs[0]) 2148 | if steps_done == 1: 2149 | return [out for out in all_outs] 2150 | else: 2151 | return [np.concatenate(out) for out in all_outs] 2152 | -------------------------------------------------------------------------------- /mycode/config.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | ''' 3 | config.py 4 | define file path as so forth. 5 | ''' 6 | 7 | import time 8 | 9 | get_current_time = lambda: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + '\t' 10 | 11 | DATAROOT = '/home/cxk/zhihucup/ieee_zhihu_cup_rowdata/' #this should be written as your own data root path 12 | 13 | # embedding files 14 | CHAR_EMBEDDING_DIR = DATAROOT + 'char_embedding.txt' 15 | WORD_EMBEDDING_DIR = DATAROOT + 'word_embedding.txt' 16 | 17 | # topic info 18 | TOPIC_INFO_DIR = DATAROOT + 'topic_info.txt' 19 | 20 | # train and eval text 21 | QUESTION_TRAIN_SET_DIR = DATAROOT + 'question_train_set.txt' 22 | QUESTION_EVAL_SET_DIR = DATAROOT + 'question_eval_set.txt' 23 | 24 | # traindata's label file 25 | QUESTION_TOPIC_TRAIN_DIR = DATAROOT + 'question_topic_train_set.txt' 26 | -------------------------------------------------------------------------------- /mycode/load_data.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | 3 | ''' 4 | loading data 5 | defining a class that process the loading function. 
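Typical usage, mirroring the call sequence in mycode/main.py (a sketch; './model_exp' is the save directory used there and should be adapted to your own setup):

    dl = data_loader('./model_exp')           # savedir for pickled tokenizers / pad lengths
    titlechar, titleword, dspchar, dspword, y_sparse = dl.load_train_data()
    dl.load_charembedding_matrix()            # fills dl.embedchar_matrix
    dl.load_wordembedding_matrix()            # fills dl.embedword_matrix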
6 | ''' 7 | 8 | import pickle 9 | import numpy as np 10 | from keras.preprocessing.sequence import pad_sequences 11 | from keras.preprocessing.text import Tokenizer 12 | from scipy.sparse import csr_matrix 13 | 14 | import config 15 | 16 | 17 | # params 18 | 19 | 20 | class data_loader(): 21 | def __init__(self, savedir=None): 22 | self.embedword_matrix = None 23 | self.embedword_matrix_git100 = None 24 | self.embedchar_matrix = None 25 | self.word_index = None 26 | self.char_index = None 27 | self.input_length = None 28 | self.topic_dict = {} 29 | self.topic_dict_inv = {} 30 | self.MAX_NB_WORDS = 500000 31 | self.tc_len = 180 32 | self.tw_len = 76 33 | self.max_titleword_len = 0 34 | self.max_titlechar_len = 0 35 | self.max_dspword_len = 0 36 | self.max_dspchar_len = 0 37 | self.dsppad_length = 300 38 | self.savedir = savedir 39 | 40 | def load_topic_info(self): 41 | ''' 42 | just get the topic ids 43 | :return: 44 | ''' 45 | print(config.get_current_time(), "loading topic info") 46 | with open(config.TOPIC_INFO_DIR, 'r') as f: 47 | for index, line in enumerate(f.readlines()): 48 | self.topic_dict[line.strip('\n').split('\t')[0]] = index 49 | self.topic_dict_inv[index] = line.strip('\n').split('\t')[0] 50 | 51 | def load_train_data(self): 52 | ''' 53 | title_char+title_word+dsp_char+dsp_word 54 | :param istitle: bool 55 | :param iscontent: bool 56 | :param type_kind: char or word 57 | :return: 58 | ''' 59 | 60 | title_char_list = [] 61 | title_word_list = [] 62 | dsp_char_list = [] 63 | dsp_word_list = [] 64 | question_ids = [] 65 | 66 | print(config.get_current_time(), 'loading question train set file') 67 | with open(config.QUESTION_TRAIN_SET_DIR, 'r') as f: 68 | for index, line in enumerate(f.readlines()): 69 | if index > 500: 70 | break 71 | splitted = line.strip('\n').split('\t') 72 | 73 | if len(splitted) == 1: 74 | continue 75 | elif len(splitted) == 2: 76 | continue 77 | elif len(splitted) == 5: 78 | title_char_list.append(splitted[1].replace(',', ' ')) 79 | title_word_list.append(splitted[2].replace(',', ' ')) 80 | dsp_char_list.append(splitted[3].replace(',', ' ')) 81 | dsp_word_list.append(splitted[4].replace(',', ' ')) 82 | self.max_titlechar_len = max(len(splitted[1].split(',')), self.max_titlechar_len) 83 | self.max_titleword_len = max(len(splitted[2].split(',')), self.max_titleword_len) 84 | self.max_dspchar_len = max(len(splitted[3].split(',')), self.max_dspchar_len) 85 | self.max_dspword_len = max(len(splitted[4].split(',')), self.max_dspword_len) 86 | question_ids.append(splitted[0]) 87 | else: 88 | continue 89 | 90 | # print('max titlecharlength', self.max_titlechar_len) 91 | # print('max titleword length', self.max_titleword_len) 92 | # print('max dspchar length', self.max_dspchar_len) 93 | # print('max dspword length', self.max_dspword_len) 94 | 95 | pickle.dump(self.tw_len, open(self.savedir + '/tw_len.pkl', 'wb')) 96 | pickle.dump(self.tc_len, open(self.savedir + '/tc_len.pkl', 'wb')) 97 | pickle.dump(self.dsppad_length, open(self.savedir + '/dsp_pad_length.pkl', 'wb')) 98 | 99 | # ------titleword-------- 100 | print(config.get_current_time(), 'tokenizer title word working') 101 | tokenizer_word = Tokenizer(num_words=self.MAX_NB_WORDS) 102 | tokenizer_word.fit_on_texts(title_word_list + dsp_word_list) 103 | sequences_titleword = tokenizer_word.texts_to_sequences(title_word_list) 104 | self.word_index = tokenizer_word.word_index 105 | print(config.get_current_time(), 'Found %s unique word tokens.' 
% len(self.word_index)) 106 | titleword_array = pad_sequences(sequences_titleword, maxlen=self.tw_len) # return arrays 107 | pickle.dump(tokenizer_word, open(self.savedir + '/tokenizer_word.pkl', 'wb')) 108 | print('tokenzier is saved as %s/tokenizer_word.pkl' % (self.savedir)) 109 | # -----titlechar--------- 110 | print(config.get_current_time(), 'tokenizer title char working') 111 | tokenizer_char = Tokenizer(num_words=self.MAX_NB_WORDS) 112 | tokenizer_char.fit_on_texts(title_char_list + dsp_char_list) 113 | sequences_titlechar = tokenizer_char.texts_to_sequences(title_char_list) 114 | self.char_index = tokenizer_char.word_index 115 | print(config.get_current_time(), 'Found %s unique char tokens.' % len(self.char_index)) 116 | titlechar_array = pad_sequences(sequences_titlechar, maxlen=self.tc_len) # return arrays 117 | pickle.dump(tokenizer_char, open(self.savedir + '/tokenizer_char.pkl', 'wb')) 118 | print('tokenzier is saved as %s/tokenizer_char.pkl' % (self.savedir)) 119 | # -----dspword-------- 120 | print(config.get_current_time(), 'tokenizer dsp char working') 121 | sequences_dspchar = tokenizer_char.texts_to_sequences(dsp_char_list) 122 | dspchar_array = pad_sequences(sequences_dspchar, maxlen=self.dsppad_length) # return arrays 123 | # ---dspchar--------- 124 | print(config.get_current_time(), 'tokenizer dsp word working') 125 | sequences_dspword = tokenizer_word.texts_to_sequences(dsp_word_list) 126 | dspword_array = pad_sequences(sequences_dspword, maxlen=self.dsppad_length) # return arrays 127 | 128 | self.load_topic_info() 129 | 130 | question_to_label = {} 131 | print(config.get_current_time(), 'loading train labels') 132 | with open(config.QUESTION_TOPIC_TRAIN_DIR, 'r') as f: 133 | for index, line in enumerate(f.readlines()): 134 | # if index>100000: 135 | # break 136 | splitted = line.strip('\n').split('\t') 137 | if len(splitted) != 2: 138 | print('error!') 139 | question_to_label[splitted[0]] = [self.topic_dict[i] for i in splitted[1].split(',')] 140 | 141 | print(config.get_current_time(), 'duiqi traindata and labels') 142 | 143 | row_ = [] 144 | col_ = [] 145 | count_1 = 0 146 | # label_dense = np.zeros((train_titleword_array.shape[0], 1999)) 147 | for row, quesid in enumerate(question_ids): 148 | cols = question_to_label.get(quesid) 149 | if cols is None: 150 | print('error!') 151 | count_1 += len(cols) 152 | for k in cols: 153 | row_.append(row) 154 | col_.extend(cols) 155 | 156 | data_ = [1 for i in row_] 157 | label_sparse = csr_matrix((data_, (row_, col_)), shape=(len(question_ids), 1999)) 158 | # # Shuffle data 159 | # shuffle_indices = np.random.permutation(np.arange(train_titleword_array.shape[0])) 160 | # x_word = train_titleword_array[shuffle_indices] 161 | # x_char = train_titlechar_array[shuffle_indices] 162 | # row_ = [row_[i] for i in shuffle_indices] 163 | # col_ = [col_[i] for i in shuffle_indices] 164 | # 165 | # # label_dense = label_dense[shuffle_indices] 166 | # # label_sparse = csr_matrix(([1 for i in range(count_1))],(row_,col_)),shape = ()) 167 | # 168 | # train_len = int(x_word.shape[0] * 0.9) 169 | # x_word_train = x_word[:train_len] 170 | # x_char_train = x_char[:train_len] 171 | # y_train = label_sparse[:train_len] 172 | # x_word_test = x_word[train_len:] 173 | # x_char_test = x_char[train_len:] 174 | # y_test = label_sparse[train_len:] 175 | 176 | # return (x_word_train, x_char_train, y_train, x_word_test, x_char_test, y_test) 177 | return titlechar_array, titleword_array, dspchar_array, dspword_array, label_sparse 178 | 179 | def 
load_pred_data_4part(self): 180 | ''' 181 | 182 | :return: 183 | ''' 184 | title_char_list = [] 185 | title_word_list = [] 186 | dsp_char_list = [] 187 | dsp_word_list = [] 188 | question_ids = [] 189 | 190 | self.tw_len = pickle.load(open(self.savedir + '/tw_len.pkl', 'rb')) 191 | self.tc_len = pickle.load(open(self.savedir + '/tc_len.pkl', 'rb')) 192 | self.dsppad_length = pickle.load(open(self.savedir + '/dsp_pad_length.pkl', 'rb')) 193 | print('length is loaded!') 194 | 195 | print(config.get_current_time(), 'loading question eval set file') 196 | with open(config.QUESTION_EVAL_SET_DIR, 'r') as f: 197 | for index, line in enumerate(f.readlines()): 198 | # if index>50000: 199 | # break 200 | splitted = line.strip('\n').split('\t') 201 | 202 | if len(splitted) == 1: 203 | print('error!') 204 | exit() 205 | elif len(splitted) == 2: 206 | title_char_list.append(splitted[1].replace(',', ' ')) 207 | title_word_list.append(" ") 208 | dsp_char_list.append(" ") 209 | dsp_word_list.append(" ") 210 | elif len(splitted) == 3: 211 | title_char_list.append(splitted[1].replace(',', ' ')) 212 | title_word_list.append(splitted[2].replace(',', ' ')) 213 | dsp_char_list.append(" ") 214 | dsp_word_list.append(" ") 215 | elif len(splitted) == 4: 216 | title_char_list.append(splitted[1].replace(',', ' ')) 217 | title_word_list.append(splitted[2].replace(',', ' ')) 218 | dsp_char_list.append(splitted[3].replace(',', ' ')) 219 | dsp_word_list.append(" ") 220 | elif len(splitted) == 5: 221 | title_char_list.append(splitted[1].replace(',', ' ')) 222 | title_word_list.append(splitted[2].replace(',', ' ')) 223 | dsp_char_list.append(splitted[3].replace(',', ' ')) 224 | dsp_word_list.append(splitted[4].replace(',', ' ')) 225 | 226 | question_ids.append(splitted[0]) 227 | 228 | tokenizer_word = pickle.load(open(self.savedir + '/tokenizer_word.pkl', 'rb')) 229 | tokenizer_char = pickle.load(open(self.savedir + '/tokenizer_char.pkl', 'rb')) 230 | print('tokenizer word loaded!') 231 | print("") 232 | 233 | print(config.get_current_time(), 'tokenizer working title char') 234 | titlechar_sequences_char = tokenizer_char.texts_to_sequences(title_char_list) 235 | self.char_index = tokenizer_char.word_index 236 | titlechar_array = pad_sequences(titlechar_sequences_char, maxlen=self.tc_len) # return arrays 237 | 238 | print(config.get_current_time(), 'tokenizer working title word') 239 | titleword_sequences_word = tokenizer_word.texts_to_sequences(title_word_list) 240 | self.word_index = tokenizer_word.word_index 241 | titleword_array = pad_sequences(titleword_sequences_word, maxlen=self.tw_len) # return arrays 242 | 243 | print(config.get_current_time(), 'tokenizer working dsp char') 244 | dspchar_sequences_char = tokenizer_char.texts_to_sequences(dsp_char_list) 245 | dspchar_array = pad_sequences(dspchar_sequences_char, maxlen=self.dsppad_length) # return arrays 246 | 247 | print(config.get_current_time(), 'tokenizer working dsp word') 248 | dspword_sequences_word = tokenizer_word.texts_to_sequences(dsp_word_list) 249 | dspword_array = pad_sequences(dspword_sequences_word, maxlen=self.dsppad_length) # return arrays 250 | 251 | self.load_topic_info() 252 | 253 | return titlechar_array, titleword_array, dspchar_array, dspword_array, question_ids 254 | 255 | def get_quesids(self): 256 | ''' 257 | 258 | :return: 259 | ''' 260 | question_ids = [] 261 | 262 | print(config.get_current_time(), 'loading question eval ids') 263 | with open(config.QUESTION_EVAL_SET_DIR, 'r') as f: 264 | for index, line in enumerate(f.readlines()): 
265 | splitted = line.strip('\n').split('\t') 266 | question_ids.append(splitted[0]) 267 | 268 | self.load_topic_info() 269 | return question_ids 270 | 271 | def load_wordembedding_matrix(self): 272 | 273 | embeddings_index = dict() 274 | 275 | embedding_max_value = 0 276 | embedding_min_value = 1 277 | 278 | with open(config.WORD_EMBEDDING_DIR, 'r') as f: 279 | for line in f: 280 | line = line.strip().split(' ') 281 | if len(line) != 257: 282 | continue 283 | 284 | coefs = np.asarray(line[1:], dtype='float32') 285 | 286 | if np.max(coefs) > embedding_max_value: 287 | embedding_max_value = np.max(coefs) 288 | if np.min(coefs) < embedding_min_value: 289 | embedding_min_value = np.min(coefs) 290 | 291 | embeddings_index[line[0]] = coefs 292 | 293 | print(config.get_current_time(), ('Found %s word vectors.' % len(embeddings_index))) 294 | 295 | self.embedword_matrix = np.zeros((len(self.word_index) + 1, 256)) 296 | for word, i in self.word_index.items(): 297 | embedding_vector = embeddings_index.get(word) 298 | if embedding_vector is not None: 299 | # words not found in embedding index will be all-zeros. 300 | self.embedword_matrix[i] = embedding_vector 301 | else: 302 | self.embedword_matrix[i] = np.random.uniform(low=embedding_min_value, high=embedding_max_value, 303 | size=256) 304 | 305 | def load_charembedding_matrix(self): 306 | 307 | embeddings_index = dict() 308 | 309 | embedding_max_value = 0 310 | embedding_min_value = 1 311 | 312 | with open(config.CHAR_EMBEDDING_DIR, 'r') as f: 313 | for line in f: 314 | line = line.strip().split(' ') 315 | if len(line) != 257: 316 | continue 317 | 318 | coefs = np.asarray(line[1:], dtype='float32') 319 | 320 | if np.max(coefs) > embedding_max_value: 321 | embedding_max_value = np.max(coefs) 322 | if np.min(coefs) < embedding_min_value: 323 | embedding_min_value = np.min(coefs) 324 | 325 | embeddings_index[line[0]] = coefs 326 | 327 | print(config.get_current_time(), ('Found %s char vectors.' % len(embeddings_index))) 328 | 329 | self.embedchar_matrix = np.zeros((len(self.char_index) + 1, 256)) 330 | for word, i in self.char_index.items(): 331 | embedding_vector = embeddings_index.get(word) 332 | if embedding_vector is not None: 333 | # words not found in embedding index will be all-zeros. 334 | self.embedchar_matrix[i] = embedding_vector 335 | else: 336 | self.embedchar_matrix[i] = np.random.uniform(low=embedding_min_value, high=embedding_max_value, 337 | size=256) 338 | 339 | -------------------------------------------------------------------------------- /mycode/main.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | from model import * 3 | import config 4 | from load_data import * 5 | import sys 6 | 7 | def results_weight_sum(filenames, topic_dict_inv, ques_ids, weights): 8 | ''' 9 | ensemble model by weighting sum 10 | :param filenames: 11 | :param topic_dict_inv: 12 | :param ques_ids: 13 | :param weights: 14 | :return: 15 | ''' 16 | assert len(weights) == len(filenames) 17 | import time 18 | import numpy as np 19 | cur_time = time.strftime('%Y-%m-%d-%H-%M', time.localtime(time.time())) 20 | from collections import Counter 21 | 22 | def tmpfunc(x): 23 | if len(x) > 5: 24 | c = Counter(x).most_common(5) 25 | res = [] 26 | for num, count in c: 27 | res.append(topic_dict_inv[num]) 28 | else: 29 | res = [] 30 | for i in x: 31 | res.append(topic_dict_inv[i]) 32 | 33 | return res 34 | 35 | predlabels = [] 36 | 37 | for i in range(len(filenames)): 38 | print('process %d th...' 
% (i)) 39 | predlabel = np.load(filenames[i]) 40 | if len(predlabels) == 0: 41 | predlabels = predlabel * weights[0] 42 | else: 43 | predlabels = predlabels + predlabel * weights[i] 44 | print(predlabels.shape) 45 | 46 | predlabels = np.argsort(-predlabels)[:, :5] 47 | 48 | with open("final_423.csv", 'w') as f: 49 | for i in range(predlabels.shape[0]): 50 | # f.write(ques_ids[i] + "," + ','.join([topic_dict_inv[k] for k in predlabels[i]]) + '\n') 51 | f.write(ques_ids[i] + "," + ','.join(tmpfunc(predlabels[i])) + '\n') 52 | 53 | 54 | if __name__ == '__main__': 55 | 56 | if len(sys.argv) < 2: 57 | print('error, give me mode ') 58 | exit() 59 | 60 | mode = sys.argv[1] 61 | 62 | print(config.get_current_time(), 'current mode:', mode) 63 | 64 | if mode == "train": 65 | 66 | save_root_dir = './model_exp' #your own path, to save models,tokenizers... 67 | 68 | dl = data_loader(save_root_dir) 69 | datatuple = dl.load_train_data() 70 | dl.load_charembedding_matrix() 71 | dl.load_wordembedding_matrix() 72 | 73 | mymodel = MultiModel(w_embed_matrix=dl.embedword_matrix, c_embed_matrix=dl.embedchar_matrix, 74 | word_index=dl.word_index, char_index=dl.char_index, titlechar_length=dl.tc_len, 75 | titleword_length=dl.tw_len, dsp_padlen=dl.dsppad_length, data=datatuple, 76 | savedir=save_root_dir) 77 | mymodel.trainmodel(isalldata=True) 78 | 79 | if mode == "pred": 80 | save_root_dir = './model_4rcnn_att_titledsp_lstm512_lr0_0001_nofc_alldata' #your own model path 81 | dl = data_loader(save_root_dir) 82 | datatuple = dl.load_pred_data_4part() 83 | 84 | mymodel = MultiModel() 85 | mymodel.predmodel([save_root_dir + "/2017-08-10-12-20_model-06.hdf5"], datatuple=datatuple, 86 | topic_dict_inv=dl.topic_dict_inv) 87 | 88 | -------------------------------------------------------------------------------- /mycode/model.py: -------------------------------------------------------------------------------- 1 | # encoding: utf-8 2 | 3 | import math 4 | import sys 5 | 6 | import config 7 | import numpy as np 8 | import tensorflow as tf 9 | from keras import backend as K 10 | from keras.backend.tensorflow_backend import set_session 11 | from keras.layers import Embedding, merge, Reshape, Activation, RepeatVector, Permute, Lambda, GlobalMaxPool1D, \ 12 | concatenate 13 | from keras import initializers 14 | from keras import optimizers 15 | from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback 16 | from keras.layers import Dense, Conv1D, MaxPooling1D, Input, Flatten, Dropout, Concatenate, LSTM, Bidirectional, GRU 17 | from keras.metrics import categorical_accuracy 18 | from keras.models import Model 19 | from keras.models import load_model 20 | from keras.layers.normalization import BatchNormalization 21 | 22 | from load_data import data_loader 23 | 24 | 25 | class ZHIHUMetrics(Callback): 26 | ''' 27 | ZHIHU score method 28 | ''' 29 | 30 | def on_epoch_end(self, batch, logs={}): 31 | print('') 32 | y_pred = np.asarray(self.model.predict( 33 | [self.validation_data[0], self.validation_data[1], self.validation_data[2], self.validation_data[3]])) 34 | y_true = self.validation_data[4] 35 | # y_pred = np.asarray(self.model.predict([self.validation_data[0], self.validation_data[1]])) 36 | # y_true = self.validation_data[2] 37 | # y_pred = np.asarray(self.model.predict([self.validation_data[0]])) 38 | # y_true = self.validation_data[1] 39 | 40 | print(y_pred.shape, y_true.shape) 41 | 42 | y_pred = np.argsort(-y_pred)[:, :5] 43 | 44 | y_true_list = [] 45 | for i in range(y_pred.shape[0]): 46 | y_true_list.append([]) 
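        # The block below computes the competition-style score reported during validation:
        #   recall    R = (# correctly predicted labels) / (# ground-truth labels)
        #   precision P = sum over pos = 0..4 of
        #                 (# correct predictions at position pos / # samples) / ln(2 + pos)
        #   final score = P * R / (P + R), a position-discounted, F1-like measure.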
47 | 48 | nozero_row, nozero_col = np.nonzero(y_true) 49 | 50 | for i in range(len(nozero_row)): 51 | y_true_list[nozero_row[i]].append(nozero_col[i]) 52 | 53 | right_label_num = 0 54 | right_label_at_pos_num = [0, 0, 0, 0, 0] 55 | sample_num = 0 56 | all_marked_label_num = 0 57 | 58 | for i in range(len(y_true_list)): 59 | sample_num += 1 60 | marked_label_set = set(y_true_list[i]) 61 | all_marked_label_num += len(marked_label_set) 62 | for pos, label in zip(range(0, min(len(y_pred[i]), 5)), y_pred[i]): 63 | if label in marked_label_set: 64 | right_label_num += 1 65 | right_label_at_pos_num[pos] += 1 66 | 67 | precision = 0.0 68 | for pos, right_num in zip(range(0, 5), right_label_at_pos_num): 69 | precision += ((right_num / float(sample_num))) / math.log(2.0 + pos) 70 | recall = float(right_label_num) / all_marked_label_num 71 | 72 | print('Recall:', recall) 73 | print(' Precision:', precision) 74 | print(' res:', recall * precision / (recall + precision + 0.00000000000001)) 75 | print('') 76 | 77 | 78 | class MultiModel(): 79 | def __init__(self, w_embed_matrix=None, c_embed_matrix=None, word_index=None, char_index=None, 80 | titlechar_length=None, titleword_length=None, dsp_padlen=None, data=(0, 0, 0), savedir=None): 81 | # Model Hyperparameters 82 | self.hidden_dims = 512 83 | self.EMBEDDING_DIM = 256 84 | # Training parameters 85 | self.batch_size = 128 86 | self.num_epochs = 50 87 | 88 | self.w_embed = w_embed_matrix 89 | self.c_embed = c_embed_matrix 90 | self.word_index = word_index 91 | self.char_index = char_index 92 | self.titlechar_length = titlechar_length 93 | self.titleword_length = titleword_length 94 | self.dsp_padlen = dsp_padlen 95 | 96 | # data 97 | if len(data) == 5: 98 | self.titlechar_array, self.titleword_array, self.dspchar_array, self.dspword_array, self.y = data 99 | 100 | self.savedir = savedir 101 | 102 | self.model = None 103 | 104 | def buildmodel_rcnn4_att_titledsp(self): 105 | ''' 106 | 4 RCNN 107 | v2: 4model concat+dense1999 108 | (tw concat tc) + (dw concat dc) 109 | lstm256+lr0.001 :3epoch 0.401 0.9data 110 | lstm512+lr0.0005 :2epoch 0.410 alldata 2,3,4 epoch vote 0.414 with dp 111 | 112 | :return: 113 | ''' 114 | print('building model...') 115 | 116 | # -----titlechar------ 117 | with tf.device('/cpu:%d' % (0)): 118 | tc_embedding_layer = Embedding(len(self.char_index) + 1, 119 | self.EMBEDDING_DIM, 120 | weights=[self.c_embed], 121 | input_length=self.titlechar_length, trainable=True, 122 | embeddings_initializer=initializers.RandomUniform(minval=-0.2, maxval=0.2, 123 | seed=None)) 124 | tc_sequence_input = Input(shape=(self.titlechar_length,), name="titlechar_input") 125 | tc_embedded_sequences = tc_embedding_layer(tc_sequence_input) 126 | with tf.device('/gpu:%d' % (0)): 127 | tc_z_pos = LSTM(512, implementation=2, return_sequences=True, go_backwards=False)(tc_embedded_sequences) 128 | tc_z_neg = LSTM(512, implementation=2, return_sequences=True, go_backwards=True)(tc_embedded_sequences) 129 | tc_z_concat = merge([tc_z_pos, tc_embedded_sequences, tc_z_neg], mode='concat', concat_axis=-1) 130 | 131 | tc_z = Dense(512, activation='tanh')(tc_z_concat) 132 | tc_pool_rnn = Lambda(lambda x: K.max(x, axis=1), output_shape=(512,))(tc_z) 133 | # -----titleword------ 134 | with tf.device('/cpu:%d' % (1)): 135 | tw_embedding_layer = Embedding(len(self.word_index) + 1, 136 | self.EMBEDDING_DIM, 137 | weights=[self.w_embed], 138 | input_length=self.titleword_length, trainable=True, 139 | embeddings_initializer=initializers.RandomUniform(minval=-0.2, maxval=0.2, 
140 | seed=None)) 141 | tw_sequence_input = Input(shape=(self.titleword_length,), name="titleword_input") 142 | tw_embedded_sequences = tw_embedding_layer(tw_sequence_input) 143 | with tf.device('/gpu:%d' % (0)): 144 | tw_z_pos = LSTM(512, implementation=2, return_sequences=True, go_backwards=False)(tw_embedded_sequences) 145 | tw_z_neg = LSTM(512, implementation=2, return_sequences=True, go_backwards=True)(tw_embedded_sequences) 146 | tw_z_concat = merge([tw_z_pos, tw_embedded_sequences, tw_z_neg], mode='concat', concat_axis=-1) 147 | 148 | tw_z = Dense(512, activation='tanh')(tw_z_concat) 149 | tw_pool_rnn = Lambda(lambda x: K.max(x, axis=1), output_shape=(512,))(tw_z) 150 | # -----dspchar------ 151 | with tf.device('/cpu:%d' % (2)): 152 | dc_embedding_layer = Embedding(len(self.char_index) + 1, 153 | self.EMBEDDING_DIM, 154 | weights=[self.c_embed], 155 | input_length=self.dsp_padlen, trainable=True, 156 | embeddings_initializer=initializers.RandomUniform(minval=-0.2, maxval=0.2, 157 | seed=None)) 158 | dc_sequence_input = Input(shape=(self.dsp_padlen,), name="dspchar_input") 159 | dc_embedded_sequences = dc_embedding_layer(dc_sequence_input) 160 | with tf.device('/gpu:%d' % (1)): 161 | dc_z_pos = LSTM(512, implementation=2, return_sequences=True, go_backwards=False)(dc_embedded_sequences) 162 | dc_z_neg = LSTM(512, implementation=2, return_sequences=True, go_backwards=True)(dc_embedded_sequences) 163 | dc_z_concat = merge([dc_z_pos, dc_embedded_sequences, dc_z_neg], mode='concat', concat_axis=-1) 164 | dc_z = Dense(512, activation='tanh')(dc_z_concat) 165 | dc_pool_rnn = Lambda(lambda x: K.max(x, axis=1), output_shape=(512,))(dc_z) 166 | # -----dspword------ 167 | with tf.device('/cpu:%d' % (3)): 168 | dw_embedding_layer = Embedding(len(self.word_index) + 1, 169 | self.EMBEDDING_DIM, 170 | weights=[self.w_embed], 171 | input_length=self.dsp_padlen, trainable=True, 172 | embeddings_initializer=initializers.RandomUniform(minval=-0.2, maxval=0.2, 173 | seed=None)) 174 | dw_sequence_input = Input(shape=(self.dsp_padlen,), name="dspword_input") 175 | dw_embedded_sequences = dw_embedding_layer(dw_sequence_input) 176 | with tf.device('/gpu:%d' % (1)): 177 | dw_z_pos = LSTM(512, implementation=2, return_sequences=True, go_backwards=False)(dw_embedded_sequences) 178 | dw_z_neg = LSTM(512, implementation=2, return_sequences=True, go_backwards=True)(dw_embedded_sequences) 179 | dw_z_concat = merge([dw_z_pos, dw_embedded_sequences, dw_z_neg], mode='concat', concat_axis=-1) 180 | 181 | dw_z = Dense(512, activation='tanh')(dw_z_concat) 182 | dw_pool_rnn = Lambda(lambda x: K.max(x, axis=1), output_shape=(512,))(dw_z) 183 | 184 | # ------att---------- 185 | concat_w_c = merge([tc_pool_rnn, tw_pool_rnn, dc_pool_rnn, dw_pool_rnn], mode='concat') 186 | concat_w_c = Reshape((2, 512 * 2))(concat_w_c) 187 | 188 | attention = Dense(1, activation='tanh')(concat_w_c) 189 | attention = Flatten()(attention) 190 | attention = Activation('softmax')(attention) 191 | attention = RepeatVector(512 * 2)(attention) 192 | attention = Permute([2, 1])(attention) 193 | 194 | sent_representation = merge([concat_w_c, attention], mode='mul') 195 | sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(512 * 2,))(sent_representation) 196 | # --------merge_4models------------------ 197 | model_final_ = Dense(1999, activation='relu')(sent_representation) 198 | model_final_ = Dropout(0.5)(model_final_) 199 | model_final = Dense(1999, activation='softmax')(model_final_) 200 | 201 | self.model = 
Model(input=[tc_sequence_input, tw_sequence_input, dc_sequence_input, dw_sequence_input], 202 | outputs=model_final) 203 | adam = optimizers.adam(lr=0.0001) 204 | self.model.compile(loss='categorical_crossentropy', 205 | optimizer=adam, 206 | metrics=[categorical_accuracy]) 207 | print(self.model.summary()) 208 | 209 | def trainmodel(self, isalldata): 210 | 211 | self.buildmodel_rcnn4_att_titledsp() 212 | 213 | import time 214 | cur_time = time.strftime('%Y-%m-%d-%H-%M', time.localtime(time.time())) 215 | 216 | checkpointer = ModelCheckpoint(filepath=self.savedir + "/" + cur_time + "_model-{epoch:02d}.hdf5", period=1) 217 | zhihuMetrics = ZHIHUMetrics() 218 | 219 | if isalldata: 220 | self.model.fit([self.titlechar_array, self.titleword_array, self.dspchar_array, self.dspword_array], 221 | self.y, 222 | epochs=self.num_epochs, batch_size=self.batch_size, verbose=1, 223 | callbacks=[checkpointer]) 224 | else:#with 9:1 validation 225 | self.model.fit([self.titlechar_array, self.titleword_array, self.dspchar_array, self.dspword_array], 226 | self.y, 227 | validation_split=0.1, 228 | epochs=self.num_epochs, batch_size=self.batch_size, verbose=1, 229 | callbacks=[checkpointer, zhihuMetrics]) 230 | self.save_model() 231 | 232 | def predmodel(self, modelname, datatuple, topic_dict_inv): 233 | 234 | import time 235 | cur_time = time.strftime('%Y-%m-%d-%H-%M', time.localtime(time.time())) 236 | from collections import Counter 237 | 238 | def tmpfunc(x): 239 | if len(x) > 5: 240 | c = Counter(x).most_common(5) 241 | res = [] 242 | for num, count in c: 243 | res.append(topic_dict_inv[num]) 244 | else: 245 | res = [] 246 | for i in x: 247 | res.append(topic_dict_inv[i]) 248 | 249 | return res 250 | 251 | predlabels = [] 252 | # titleword_array, dspword_array, ques_ids= datatuple 253 | titlechar_array, titleword_array, dspchar_array, dspword_array, ques_ids = datatuple 254 | 255 | for i in range(len(modelname)): 256 | self.model = load_model(modelname[i]) 257 | predlabel = self.model.predict([titlechar_array, titleword_array, dspchar_array, dspword_array], 258 | batch_size=512, verbose=1) 259 | # predlabel = self.model.predict([titleword_array, titleword_array, dspword_array, dspword_array], batch_size=512, verbose=1) 260 | # np.savetxt("result/scores/"+cur_time + "scores_4RCNN_gru_dense_nodropout.txt", predlabel, fmt='%s') 261 | np.save("result/scores/" + cur_time + "4RCNN_lstm512_4part_title_dsp_attention_nofc_06epoch", predlabel) 262 | # exit() 263 | predlabel = np.argsort(-predlabel)[:, :5] 264 | if len(predlabels) == 0: 265 | predlabels = predlabel 266 | else: 267 | predlabels = np.column_stack((predlabels, predlabel)) 268 | print(predlabels.shape) 269 | K.clear_session() 270 | 271 | with open("result/" + cur_time + ".csv", 'w') as f: 272 | for i in range(predlabels.shape[0]): 273 | # f.write(ques_ids[i] + "," + ','.join([topic_dict_inv[k] for k in predlabels[i]]) + '\n') 274 | f.write(ques_ids[i] + "," + ','.join(tmpfunc(predlabels[i])) + '\n') 275 | 276 | def save_model(self): 277 | import time 278 | cur_time = time.strftime('%Y-%m-%d-%H', time.localtime(time.time())) 279 | self.model.save(self.savedir + "/latest_twomodel_wordchar_" + str(cur_time) + '.h5') 280 | -------------------------------------------------------------------------------- /数据集说明.txt: -------------------------------------------------------------------------------- 1 | 数据集 2 | 3 | 名称 格式 4 | ieee_zhihu_cup.des3 des3 (1.4 GB) 5 | ieee_zhihu_cup.rar rar (1.2 GB) 6 | 7 | 8 | 9 | 
Note that the two links above contain identical data; please download only one of them. ieee_zhihu_cup.des3 is the archive for Linux/Mac; decrypt and extract it with: dd if=ieee_zhihu_cup.des3 | openssl des3 -d -k Pg5EnkJP7iYyRBt5 | tar zxf - . ieee_zhihu_cup.rar is the rar archive for Windows.
10 |
11 | To protect user privacy, all raw text has been specially encoded. The original question titles, question descriptions, topic names and topic descriptions are encoded as character-ID sequences and word-ID sequences. Characters include single Chinese characters, other CJK characters, English letters, punctuation and spaces; words include segmented Chinese words, English words, punctuation and spaces. Character IDs and word IDs live in two separate namespaces: a single-character word or punctuation mark on the word side does not necessarily share an ID with the same character or punctuation mark on the character side.
12 |
13 | char_embedding.txt and word_embedding.txt
14 |
15 | These files contain the 256-dimensional character-level and word-level embedding vectors, respectively. Both were trained with Google word2vec and saved in txt format, so they can be loaded directly with word2vec. The format is as follows:
16 |
17 | The first line contains two numbers: the vocabulary size and the embedding dimension;
18 |
19 | Every other line has 257 columns: the first column is a char_id or word_id, followed by 256 floats representing the 256-dimensional embedding vector.
20 |
21 | Characters and words that occur fewer than 5 times are omitted from the vocabulary, so some tokens appearing in the training and evaluation corpora may have no corresponding embedding vector.
22 |
23 | question_train_set.txt
24 |
25 | Question information in the training set; 5 columns separated by \t. Format:
26 |
27 | question_id ct1,ct2,ct3,...,ctn wt1,wt2,wt3,...,wtn cd1,cd2,cd3,...cdn wd1,wd2,wd3,...,wdn
28 |
29 | The 2nd column is the character-ID sequence of the title; the 3rd column is the word-ID sequence of the title; the 4th column is the character-ID sequence of the description; the 5th column is the word-ID sequence of the description.
30 |
31 | question_topic_train_set.txt
32 |
33 | The bindings between questions and topic labels. Two columns separated by \t. Note that when a question is bound to multiple topic labels, the labels are unordered. Format:
34 |
35 | question_id topic_id1,topic_id2...topic_idn
36 |
37 | topic_info.txt:
38 |
39 | Topic information; 6 columns separated by \t. Format:
40 |
41 | topic_id pid_1,pid_2,...,pidn cn1,cn2,cn3,...,cnn wn1,wn2,wn3,...,wnn cd1,cd2,cd3,...,cdn wd1,wd2,wd3,...,wdn
42 |
43 | The 2nd column lists the IDs of the topic's parent topics. Topics form a directed acyclic graph, and a topic may have zero or more parents. The 3rd column is the character-ID sequence of the topic name; the 4th column is its word-ID sequence; the 5th column is the character-ID sequence of the topic description; the 6th column is its word-ID sequence.
44 |
45 | question_eval_set.txt
46 |
47 | This file has the same format as question_train_set.txt.
48 |
--------------------------------------------------------------------------------
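The column layouts described above translate directly into a few lines of parsing code. The snippet below is a minimal sketch, not part of the original repository: it assumes the files have been extracted into `DATAROOT` (cf. mycode/config.py) and only illustrates the formats of word_embedding.txt, question_train_set.txt and question_topic_train_set.txt.

```python
# Minimal parsing sketch for the dataset files described above (illustrative only).
DATAROOT = './'  # assumption: directory holding the extracted dataset

# word_embedding.txt: first line is "<vocab_size> <dim>"; every other line has
# 257 columns -- a word_id followed by 256 floats.
embeddings = {}
with open(DATAROOT + 'word_embedding.txt') as f:
    vocab_size, dim = map(int, f.readline().split())
    for line in f:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:   # skip malformed rows, as load_data.py also does
            continue
        embeddings[parts[0]] = [float(v) for v in parts[1:]]

# question_train_set.txt: 5 tab-separated columns --
# question_id, title char ids, title word ids, description char ids, description word ids.
title_words = {}
with open(DATAROOT + 'question_train_set.txt') as f:
    for line in f:
        cols = line.rstrip('\n').split('\t')
        if len(cols) == 5:
            qid, t_chars, t_words, d_chars, d_words = cols
            title_words[qid] = t_words.split(',')

# question_topic_train_set.txt: question_id \t comma-separated (unordered) topic ids.
labels = {}
with open(DATAROOT + 'question_topic_train_set.txt') as f:
    for line in f:
        qid, topics = line.rstrip('\n').split('\t')
        labels[qid] = topics.split(',')
```

Since the embedding files are in word2vec text format, they can alternatively be loaded with gensim's `KeyedVectors.load_word2vec_format` (an external dependency not used by this repository).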