├── README.md
├── modified_keras_files
│   ├── README
│   └── training.py
├── mycode
│   ├── config.py
│   ├── load_data.py
│   ├── main.py
│   └── model.py
└── 数据集说明.txt
/README.md:
--------------------------------------------------------------------------------
 1 | # zhihu_kanshan_cup_2017
 2 | Part of my code for the 2017 Zhihu Kanshan Cup competition. Only the RCNN+ATTENTION model, my highest-scoring single model, is provided.
 3 | 
 4 | For details, please see my blog post: [Large-scale text classification in practice: a Zhihu Kanshan Cup summary](http://coderskychen.cn/2017/08/20/zhihucup/)
 5 | 
 6 | # Data download and description
 7 | The data is hosted on Baidu Cloud: http://pan.baidu.com/s/1bpnNRQJ
 8 | Data description: [see here](https://github.com/coderSkyChen/zhihu_kanshan_cup_2017/blob/master/%E6%95%B0%E6%8D%AE%E9%9B%86%E8%AF%B4%E6%98%8E.txt)
 9 | 
10 | # Environment
11 | - Python version: Python 3
12 | - Keras version: 2.0.6, with Keras's training.py modified by prozhuchen so that it can handle sparse labels; the change is in lines 375–455 of training.py.
13 |   The idea is to convert the sparse labels to dense arrays one batch at a time, which saves memory. You can also call Keras's `train_on_batch` yourself, but if you use `fit` you must use our modified source file (a short usage sketch is included at the end of this document).
14 | - The modified training.py is provided in modified_keras_files; use it to replace keras/engine/training.py in your Keras installation.
15 | 
16 | # How to run
17 | ## Training:
18 | python3 main.py train
19 | 
20 | ## Prediction:
21 | python3 main.py pred
22 | 
23 | 
--------------------------------------------------------------------------------
/modified_keras_files/README:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/modified_keras_files/training.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*- 2 | from __future__ import print_function 3 | from __future__ import absolute_import 4 | 5 | import warnings 6 | import copy 7 | import numpy as np 8 | import six 9 | 10 | from keras.utils import Sequence 11 | from keras.utils import GeneratorEnqueuer 12 | from keras.utils import OrderedEnqueuer 13 | 14 | try: 15 | import queue 16 | except ImportError: 17 | import Queue as queue 18 | 19 | from .topology import Container 20 | from .. import backend as K 21 | from .. import optimizers 22 | from .. import losses 23 | from .. import metrics as metrics_module 24 | from ..utils.generic_utils import Progbar 25 | from .. import callbacks as cbks 26 | from ..legacy import interfaces 27 | 28 | 29 | def _standardize_input_data(data, names, shapes=None, 30 | check_batch_axis=True, 31 | exception_prefix=''): 32 | """Normalizes inputs and targets provided by users. 33 | 34 | Users may pass data as a list of arrays, dictionary of arrays, 35 | or as a single array. We normalize this to an ordered list of 36 | arrays (same order as `names`), while checking that the provided 37 | arrays have shapes that match the network's expectations. 38 | 39 | # Arguments 40 | data: User-provided input data (polymorphic). 41 | names: List of expected array names. 42 | shapes: Optional list of expected array shapes. 43 | check_batch_axis: Boolean; whether to check that 44 | the batch axis of the arrays matches the expected 45 | value found in `shapes`. 46 | exception_prefix: String prefix used for exception formatting. 47 | 48 | # Returns 49 | List of standardized input arrays (one array per model input). 50 | 51 | # Raises 52 | ValueError: in case of improperly formatted user-provided data. 53 | """ 54 | if not names: 55 | return [] 56 | if data is None: 57 | return [None for _ in range(len(names))] 58 | if isinstance(data, dict): 59 | arrays = [] 60 | for name in names: 61 | if name not in data: 62 | raise ValueError('No data provided for "' + 63 | name + '". 
Need data for each key in: ' + 64 | str(names)) 65 | arrays.append(data[name]) 66 | elif isinstance(data, list): 67 | if len(data) != len(names): 68 | if data and hasattr(data[0], 'shape'): 69 | raise ValueError('Error when checking model ' + 70 | exception_prefix + 71 | ': the list of Numpy arrays ' 72 | 'that you are passing to your model ' 73 | 'is not the size the model expected. ' 74 | 'Expected to see ' + str(len(names)) + 75 | ' arrays but instead got ' 76 | 'the following list of ' + str(len(data)) + 77 | ' arrays: ' + str(data)[:200] + 78 | '...') 79 | else: 80 | if len(names) == 1: 81 | data = [np.asarray(data)] 82 | else: 83 | raise ValueError( 84 | 'Error when checking model ' + 85 | exception_prefix + 86 | ': you are passing a list as ' 87 | 'input to your model, ' 88 | 'but the model expects ' 89 | 'a list of ' + str(len(names)) + 90 | ' Numpy arrays instead. ' 91 | 'The list you passed was: ' + 92 | str(data)[:200]) 93 | arrays = data 94 | else: 95 | if not hasattr(data, 'shape'): 96 | raise TypeError('Error when checking model ' + 97 | exception_prefix + 98 | ': data should be a Numpy array, ' 99 | 'or list/dict of Numpy arrays. ' 100 | 'Found: ' + str(data)[:200] + '...') 101 | if len(names) > 1: 102 | # Case: model expects multiple inputs but only received 103 | # a single Numpy array. 104 | raise ValueError('The model expects ' + str(len(names)) + ' ' + 105 | exception_prefix + 106 | ' arrays, but only received one array. ' 107 | 'Found: array with shape ' + str(data.shape)) 108 | arrays = [data] 109 | 110 | # Make arrays at least 2D. 111 | for i in range(len(names)): 112 | array = arrays[i] 113 | if len(array.shape) == 1: 114 | array = np.expand_dims(array, 1) 115 | arrays[i] = array 116 | 117 | # Check shapes compatibility. 118 | if shapes: 119 | for i in range(len(names)): 120 | if shapes[i] is None: 121 | continue 122 | array = arrays[i] 123 | if len(array.shape) != len(shapes[i]): 124 | raise ValueError('Error when checking ' + exception_prefix + 125 | ': expected ' + names[i] + 126 | ' to have ' + str(len(shapes[i])) + 127 | ' dimensions, but got array with shape ' + 128 | str(array.shape)) 129 | for j, (dim, ref_dim) in enumerate(zip(array.shape, shapes[i])): 130 | if not j and not check_batch_axis: 131 | # skip the first axis 132 | continue 133 | if ref_dim: 134 | if ref_dim != dim: 135 | raise ValueError( 136 | 'Error when checking ' + exception_prefix + 137 | ': expected ' + names[i] + 138 | ' to have shape ' + str(shapes[i]) + 139 | ' but got array with shape ' + 140 | str(array.shape)) 141 | return arrays 142 | 143 | 144 | def _standardize_sample_or_class_weights(x_weight, output_names, weight_type): 145 | """Maps `sample_weight` or `class_weight` to model outputs. 146 | 147 | # Arguments 148 | x_weight: User-provided `sample_weight` or `class_weight` argument. 149 | output_names: List of output names (strings) in the model. 150 | weight_type: A string used purely for exception printing. 151 | 152 | # Returns 153 | A list of `sample_weight` or `class_weight` where there are exactly 154 | one element per model output. 155 | 156 | # Raises 157 | ValueError: In case of invalid user-provided argument. 
158 | """ 159 | if x_weight is None or len(x_weight) == 0: 160 | return [None for _ in output_names] 161 | if len(output_names) == 1: 162 | if isinstance(x_weight, list) and len(x_weight) == 1: 163 | return x_weight 164 | if isinstance(x_weight, dict) and output_names[0] in x_weight: 165 | return [x_weight[output_names[0]]] 166 | else: 167 | return [x_weight] 168 | if isinstance(x_weight, list): 169 | if len(x_weight) != len(output_names): 170 | raise ValueError('Provided `' + weight_type + '` was a list of ' + 171 | str(len(x_weight)) + 172 | ' elements, but the model has ' + 173 | str(len(output_names)) + ' outputs. ' 174 | 'You should provide one `' + weight_type + '`' 175 | 'array per model output.') 176 | return x_weight 177 | if isinstance(x_weight, dict): 178 | x_weights = [] 179 | for name in output_names: 180 | x_weights.append(x_weight.get(name)) 181 | return x_weights 182 | else: 183 | raise TypeError('The model has multiple outputs, so `' + 184 | weight_type + '` ' 185 | 'should be either a list of a dict. ' 186 | 'Provided `' + weight_type + 187 | '` type not understood: ' + 188 | str(x_weight)) 189 | 190 | 191 | def _standardize_class_weights(class_weight, output_names): 192 | return _standardize_sample_or_class_weights(class_weight, 193 | output_names, 194 | 'class_weight') 195 | 196 | 197 | def _standardize_sample_weights(sample_weight, output_names): 198 | return _standardize_sample_or_class_weights(sample_weight, 199 | output_names, 200 | 'sample_weight') 201 | 202 | 203 | def _check_array_lengths(inputs, targets, weights=None): 204 | """Does user input validation for numpy arrays. 205 | 206 | # Arguments 207 | inputs: list of Numpy arrays of inputs. 208 | targets: list of Numpy arrays of targets. 209 | weights: list of Numpy arrays of sample weights. 210 | 211 | # Raises 212 | ValueError: in case of incorrectly formatted data. 213 | """ 214 | def set_of_lengths(x): 215 | # return a set with the variation between 216 | # different shapes, with None => 0 217 | if x is None: 218 | return {0} 219 | else: 220 | return set([0 if y is None else y.shape[0] for y in x]) 221 | 222 | set_x = set_of_lengths(inputs) 223 | set_y = set_of_lengths(targets) 224 | set_w = set_of_lengths(weights) 225 | if len(set_x) > 1: 226 | raise ValueError('All input arrays (x) should have ' 227 | 'the same number of samples. Got array shapes: ' + 228 | str([x.shape for x in inputs])) 229 | if len(set_y) > 1: 230 | raise ValueError('All target arrays (y) should have ' 231 | 'the same number of samples. Got array shapes: ' + 232 | str([y.shape for y in targets])) 233 | if set_x and set_y and list(set_x)[0] != list(set_y)[0]: 234 | raise ValueError('Input arrays should have ' 235 | 'the same number of samples as target arrays. ' 236 | 'Found ' + str(list(set_x)[0]) + ' input samples ' 237 | 'and ' + str(list(set_y)[0]) + ' target samples.') 238 | if len(set_w) > 1: 239 | raise ValueError('All sample_weight arrays should have ' 240 | 'the same number of samples. Got array shapes: ' + 241 | str([w.shape for w in weights])) 242 | if set_y and set_w and list(set_y)[0] != list(set_w)[0]: 243 | raise ValueError('Sample_weight arrays should have ' 244 | 'the same number of samples as target arrays. Got ' + 245 | str(list(set_y)[0]) + ' input samples and ' + 246 | str(list(set_w)[0]) + ' target samples.') 247 | 248 | 249 | def _check_loss_and_target_compatibility(targets, loss_fns, output_shapes): 250 | """Does validation on the compatibility of targets and loss functions. 
251 | 252 | This helps prevent users from using loss functions incorrectly. 253 | 254 | # Arguments 255 | targets: list of Numpy arrays of targets. 256 | loss_fns: list of loss functions. 257 | output_shapes: list of shapes of model outputs. 258 | 259 | # Raises 260 | ValueError: if a loss function or target array 261 | is incompatible with an output. 262 | """ 263 | key_losses = {'mean_squared_error', 264 | 'binary_crossentropy', 265 | 'categorical_crossentropy'} 266 | for y, loss, shape in zip(targets, loss_fns, output_shapes): 267 | if loss is None: 268 | continue 269 | if loss.__name__ == 'categorical_crossentropy': 270 | if y.shape[-1] == 1: 271 | raise ValueError( 272 | 'You are passing a target array of shape ' + str(y.shape) + 273 | ' while using as loss `categorical_crossentropy`. ' 274 | '`categorical_crossentropy` expects ' 275 | 'targets to be binary matrices (1s and 0s) ' 276 | 'of shape (samples, classes). ' 277 | 'If your targets are integer classes, ' 278 | 'you can convert them to the expected format via:\n' 279 | '```\n' 280 | 'from keras.utils.np_utils import to_categorical\n' 281 | 'y_binary = to_categorical(y_int)\n' 282 | '```\n' 283 | '\n' 284 | 'Alternatively, you can use the loss function ' 285 | '`sparse_categorical_crossentropy` instead, ' 286 | 'which does expect integer targets.') 287 | if loss.__name__ in key_losses: 288 | for target_dim, out_dim in zip(y.shape[1:], shape[1:]): 289 | if out_dim is not None and target_dim != out_dim: 290 | raise ValueError( 291 | 'A target array with shape ' + str(y.shape) + 292 | ' was passed for an output of shape ' + str(shape) + 293 | ' while using as loss `' + loss.__name__ + '`. ' 294 | 'This loss expects ' 295 | 'targets to have the same shape ' 296 | 'as the output.') 297 | 298 | 299 | def _collect_metrics(metrics, output_names): 300 | """Maps metric functions to model outputs. 301 | 302 | # Arguments 303 | metrics: a list or dict of metric functions. 304 | output_names: a list of the names (strings) of model outputs. 305 | 306 | # Returns 307 | A list (one entry per model output) of lists of metric functions. 308 | For instance, if the model has 2 outputs, and for the first output 309 | we want to compute "binary_accuracy" and "binary_crossentropy", 310 | and just "binary_accuracy" for the second output, 311 | the list would look like: 312 | `[[binary_accuracy, binary_crossentropy], [binary_accuracy]]` 313 | 314 | # Raises 315 | TypeError: if an incorrect type is passed for the `metrics` argument. 316 | """ 317 | if not metrics: 318 | return [[] for _ in output_names] 319 | if isinstance(metrics, list): 320 | # we then apply all metrics to all outputs. 321 | return [copy.copy(metrics) for _ in output_names] 322 | elif isinstance(metrics, dict): 323 | nested_metrics = [] 324 | for name in output_names: 325 | output_metrics = metrics.get(name, []) 326 | if not isinstance(output_metrics, list): 327 | output_metrics = [output_metrics] 328 | nested_metrics.append(output_metrics) 329 | return nested_metrics 330 | else: 331 | raise TypeError('Type of `metrics` argument not understood. ' 332 | 'Expected a list or dictionary, found: ' + 333 | str(metrics)) 334 | 335 | 336 | def _batch_shuffle(index_array, batch_size): 337 | """Shuffles an array in a batch-wise fashion. 338 | 339 | Useful for shuffling HDF5 arrays 340 | (where one cannot access arbitrary indices). 341 | 342 | # Arguments 343 | index_array: array of indices to be shuffled. 344 | batch_size: integer. 
345 | 346 | # Returns 347 | The `index_array` array, shuffled in a batch-wise fashion. 348 | """ 349 | batch_count = int(len(index_array) / batch_size) 350 | # to reshape we need to be cleanly divisible by batch size 351 | # we stash extra items and reappend them after shuffling 352 | last_batch = index_array[batch_count * batch_size:] 353 | index_array = index_array[:batch_count * batch_size] 354 | index_array = index_array.reshape((batch_count, batch_size)) 355 | np.random.shuffle(index_array) 356 | index_array = index_array.flatten() 357 | return np.append(index_array, last_batch) 358 | 359 | 360 | def _make_batches(size, batch_size): 361 | """Returns a list of batch indices (tuples of indices). 362 | 363 | # Arguments 364 | size: Integer, total size of the data to slice into batches. 365 | batch_size: Integer, batch size. 366 | 367 | # Returns 368 | A list of tuples of array indices. 369 | """ 370 | num_batches = int(np.ceil(size / float(batch_size))) 371 | return [(i * batch_size, min(size, (i + 1) * batch_size)) 372 | for i in range(0, num_batches)] 373 | 374 | 375 | def _slice_arrays(arrays, start=None, stop=None): 376 | """Slice an array or list of arrays. 377 | 378 | This takes an array-like, or a list of 379 | array-likes, and outputs: 380 | - arrays[start:stop] if `arrays` is an array-like 381 | - [x[start:stop] for x in arrays] if `arrays` is a list 382 | 383 | Can also work on list/array of indices: `_slice_arrays(x, indices)` 384 | 385 | # Arguments 386 | arrays: Single array or list of arrays. 387 | start: can be an integer index (start index) 388 | or a list/array of indices 389 | stop: integer (stop index); should be None if 390 | `start` was a list. 391 | 392 | # Returns 393 | A slice of the array(s). 394 | """ 395 | # if arrays is None: 396 | # return [None] 397 | # elif isinstance(arrays, list): 398 | # if hasattr(start, '__len__'): 399 | # # hdf5 datasets only support list objects as indices 400 | # if hasattr(start, 'shape'): 401 | # start = start.tolist() 402 | # return [None if x is None else x[start] for x in arrays] 403 | # else: 404 | # return [None if x is None else x[start:stop] for x in arrays] 405 | # else: 406 | # if hasattr(start, '__len__'): 407 | # if hasattr(start, 'shape'): 408 | # start = start.tolist() 409 | # return arrays[start] 410 | # elif hasattr(start, '__getitem__'): 411 | # return arrays[start:stop] 412 | # else: 413 | # return [None] 414 | 415 | from scipy import sparse as sps 416 | 417 | if isinstance(arrays, list): 418 | if hasattr(start, '__len__'): 419 | # hdf5 datasets only support list objects as indices 420 | if hasattr(start, 'shape'): 421 | start = start.tolist() 422 | # return [x[start] for x in arrays] 423 | res = [] 424 | for x in arrays: 425 | if sps.issparse(x): 426 | res.append(x[start].toarray()) 427 | else: 428 | res.append(x[start]) 429 | return res 430 | else: 431 | # return [x[start:stop] for x in arrays] 432 | res = [] 433 | for x in arrays: 434 | if sps.issparse(x): 435 | res.append(x[start:stop].toarray()) 436 | else: 437 | res.append(x[start:stop]) 438 | return res 439 | else: 440 | if hasattr(start, '__len__'): 441 | if hasattr(start, 'shape'): 442 | start = start.tolist() 443 | # return arrays[start] 444 | if sps.issparse(arrays): 445 | return arrays[start].toarray() 446 | else: 447 | return arrays[start] 448 | else: 449 | # return arrays[start:stop] 450 | if sps.issparse(arrays): 451 | return arrays[start:stop].toarray() 452 | else: 453 | return arrays[start:stop] 454 | 455 | 456 | def 
_weighted_masked_objective(fn): 457 | """Adds support for masking and sample-weighting to an objective function. 458 | 459 | It transforms an objective function `fn(y_true, y_pred)` 460 | into a sample-weighted, cost-masked objective function 461 | `fn(y_true, y_pred, weights, mask)`. 462 | 463 | # Arguments 464 | fn: The objective function to wrap, 465 | with signature `fn(y_true, y_pred)`. 466 | 467 | # Returns 468 | A function with signature `fn(y_true, y_pred, weights, mask)`. 469 | """ 470 | if fn is None: 471 | return None 472 | 473 | def weighted(y_true, y_pred, weights, mask=None): 474 | """Wrapper function. 475 | 476 | # Arguments 477 | y_true: `y_true` argument of `fn`. 478 | y_pred: `y_pred` argument of `fn`. 479 | weights: Weights tensor. 480 | mask: Mask tensor. 481 | 482 | # Returns 483 | Scalar tensor. 484 | """ 485 | # score_array has ndim >= 2 486 | score_array = fn(y_true, y_pred) 487 | if mask is not None: 488 | # Cast the mask to floatX to avoid float64 upcasting in theano 489 | mask = K.cast(mask, K.floatx()) 490 | # mask should have the same shape as score_array 491 | score_array *= mask 492 | # the loss per batch should be proportional 493 | # to the number of unmasked samples. 494 | score_array /= K.mean(mask) 495 | 496 | # apply sample weighting 497 | if weights is not None: 498 | # reduce score_array to same ndim as weight array 499 | ndim = K.ndim(score_array) 500 | weight_ndim = K.ndim(weights) 501 | score_array = K.mean(score_array, axis=list(range(weight_ndim, ndim))) 502 | score_array *= weights 503 | score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx())) 504 | return K.mean(score_array) 505 | return weighted 506 | 507 | 508 | def _masked_objective(fn): 509 | """Adds support for masking to an objective function. 510 | 511 | It transforms an objective function `fn(y_true, y_pred)` 512 | into a cost-masked objective function 513 | `fn(y_true, y_pred, mask)`. 514 | 515 | # Arguments 516 | fn: The objective function to wrap, 517 | with signature `fn(y_true, y_pred)`. 518 | 519 | # Returns 520 | A function with signature `fn(y_true, y_pred, mask)`. 521 | """ 522 | def masked(y_true, y_pred, mask=None): 523 | """Wrapper function. 524 | 525 | # Arguments 526 | y_true: `y_true` argument of `fn`. 527 | y_pred: `y_pred` argument of `fn`. 528 | mask: Mask tensor. 529 | 530 | # Returns 531 | Scalar tensor. 532 | """ 533 | # score_array has ndim >= 2 534 | score_array = fn(y_true, y_pred) 535 | if mask is not None: 536 | # Cast the mask to floatX to avoid float64 upcasting in theano 537 | mask = K.cast(mask, K.floatx()) 538 | # mask should have the same shape as score_array 539 | score_array *= mask 540 | # the loss per batch should be proportional 541 | # to the number of unmasked samples. 542 | score_array /= K.mean(mask) 543 | 544 | return K.mean(score_array) 545 | return masked 546 | 547 | 548 | def _standardize_weights(y, sample_weight=None, class_weight=None, 549 | sample_weight_mode=None): 550 | """Performs sample weight validation and standardization. 551 | 552 | Everything gets normalized to a single sample-wise (or timestep-wise) 553 | weight array. 554 | 555 | # Arguments 556 | y: Numpy array of model targets to be weighted. 557 | sample_weight: User-provided `sample_weight` argument. 558 | class_weight: User-provided `class_weight` argument. 559 | sample_weight_mode: One of `None` or `"temporal"`. 560 | `"temporal"` indicated that we expect 2D weight data 561 | that will be applied to the last 2 dimensions of 562 | the targets (i.e. 
we are weighting timesteps, not samples). 563 | 564 | # Returns 565 | A numpy array of target weights, one entry per sample to weight. 566 | 567 | # Raises 568 | ValueError: In case of invalid user-provided arguments. 569 | """ 570 | if sample_weight_mode is not None: 571 | if sample_weight_mode != 'temporal': 572 | raise ValueError('"sample_weight_mode ' 573 | 'should be None or "temporal". ' 574 | 'Found: ' + str(sample_weight_mode)) 575 | if len(y.shape) < 3: 576 | raise ValueError('Found a sample_weight array for ' 577 | 'an input with shape ' + 578 | str(y.shape) + '. ' 579 | 'Timestep-wise sample weighting (use of ' 580 | 'sample_weight_mode="temporal") is restricted to ' 581 | 'outputs that are at least 3D, i.e. that have ' 582 | 'a time dimension.') 583 | if sample_weight is not None and len(sample_weight.shape) != 2: 584 | raise ValueError('Found a sample_weight array with shape ' + 585 | str(sample_weight.shape) + '. ' 586 | 'In order to use timestep-wise sample weighting, ' 587 | 'you should pass a 2D sample_weight array.') 588 | else: 589 | if sample_weight is not None and len(sample_weight.shape) != 1: 590 | raise ValueError('Found a sample_weight array with shape ' + 591 | str(sample_weight.shape) + '. ' 592 | 'In order to use timestep-wise sample weights, ' 593 | 'you should specify ' 594 | 'sample_weight_mode="temporal" ' 595 | 'in compile(). If you just mean to use ' 596 | 'sample-wise weights, make sure your ' 597 | 'sample_weight array is 1D.') 598 | 599 | if sample_weight is not None: 600 | if len(sample_weight.shape) > len(y.shape): 601 | raise ValueError('Found a sample_weight with shape' + 602 | str(sample_weight.shape) + '.' 603 | 'Expected sample_weight with rank ' 604 | 'less than or equal to ' + str(len(y.shape))) 605 | 606 | if y.shape[:sample_weight.ndim] != sample_weight.shape: 607 | raise ValueError('Found a sample_weight array with shape ' + 608 | str(sample_weight.shape) + ' for an input with shape ' + 609 | str(y.shape) + '. ' 610 | 'sample_weight cannot be broadcast.') 611 | return sample_weight 612 | elif isinstance(class_weight, dict): 613 | if len(y.shape) > 2: 614 | raise ValueError('`class_weight` not supported for ' 615 | '3+ dimensional targets.') 616 | if y.shape[1] > 1: 617 | y_classes = y.argmax(axis=1) 618 | elif y.shape[1] == 1: 619 | y_classes = np.reshape(y, y.shape[0]) 620 | else: 621 | y_classes = y 622 | 623 | weights = np.asarray([class_weight[cls] for cls in y_classes 624 | if cls in class_weight]) 625 | 626 | if len(weights) != len(y_classes): 627 | # subtract the sets to pick all missing classes 628 | existing_classes = set(y_classes) 629 | existing_class_weight = set(class_weight.keys()) 630 | raise ValueError('`class_weight` must contain all classes in the data.' 631 | ' The classes %s exist in the data but not in ' 632 | '`class_weight`.' 633 | % (existing_classes - existing_class_weight)) 634 | return weights 635 | else: 636 | if sample_weight_mode is None: 637 | return np.ones((y.shape[0],), dtype=K.floatx()) 638 | else: 639 | return np.ones((y.shape[0], y.shape[1]), dtype=K.floatx()) 640 | 641 | 642 | class Model(Container): 643 | """The `Model` class adds training & evaluation routines to a `Container`. 644 | """ 645 | 646 | def compile(self, optimizer, loss, metrics=None, loss_weights=None, 647 | sample_weight_mode=None, **kwargs): 648 | """Configures the model for training. 649 | 650 | # Arguments 651 | optimizer: str (name of optimizer) or optimizer object. 652 | See [optimizers](/optimizers). 
653 | loss: str (name of objective function) or objective function. 654 | See [losses](/losses). 655 | If the model has multiple outputs, you can use a different loss 656 | on each output by passing a dictionary or a list of losses. 657 | The loss value that will be minimized by the model 658 | will then be the sum of all individual losses. 659 | metrics: list of metrics to be evaluated by the model 660 | during training and testing. 661 | Typically you will use `metrics=['accuracy']`. 662 | To specify different metrics for different outputs of a 663 | multi-output model, you could also pass a dictionary, 664 | such as `metrics={'output_a': 'accuracy'}`. 665 | loss_weights: Optional list or dictionary specifying scalar 666 | coefficients (Python floats) to weight the loss contributions 667 | of different model outputs. 668 | The loss value that will be minimized by the model 669 | will then be the *weighted sum* of all individual losses, 670 | weighted by the `loss_weights` coefficients. 671 | If a list, it is expected to have a 1:1 mapping 672 | to the model's outputs. If a tensor, it is expected to map 673 | output names (strings) to scalar coefficients. 674 | sample_weight_mode: if you need to do timestep-wise 675 | sample weighting (2D weights), set this to `"temporal"`. 676 | `None` defaults to sample-wise weights (1D). 677 | If the model has multiple outputs, you can use a different 678 | `sample_weight_mode` on each output by passing a 679 | dictionary or a list of modes. 680 | **kwargs: when using the Theano/CNTK backends, these arguments 681 | are passed into K.function. When using the TensorFlow backend, 682 | these arguments are passed into `tf.Session.run`. 683 | 684 | # Raises 685 | ValueError: In case of invalid arguments for 686 | `optimizer`, `loss`, `metrics` or `sample_weight_mode`. 687 | """ 688 | loss = loss or {} 689 | self.optimizer = optimizers.get(optimizer) 690 | self.sample_weight_mode = sample_weight_mode 691 | self.loss = loss 692 | self.loss_weights = loss_weights 693 | 694 | # Prepare loss functions. 695 | if isinstance(loss, dict): 696 | for name in loss: 697 | if name not in self.output_names: 698 | raise ValueError('Unknown entry in loss ' 699 | 'dictionary: "' + name + '". ' 700 | 'Only expected the following keys: ' + 701 | str(self.output_names)) 702 | loss_functions = [] 703 | for name in self.output_names: 704 | if name not in loss: 705 | warnings.warn('Output "' + name + 706 | '" missing from loss dictionary. ' 707 | 'We assume this was done on purpose, ' 708 | 'and we will not be expecting ' 709 | 'any data to be passed to "' + name + 710 | '" during training.', stacklevel=2) 711 | loss_functions.append(losses.get(loss.get(name))) 712 | elif isinstance(loss, list): 713 | if len(loss) != len(self.outputs): 714 | raise ValueError('When passing a list as loss, ' 715 | 'it should have one entry per model outputs. 
' 716 | 'The model has ' + str(len(self.outputs)) + 717 | ' outputs, but you passed loss=' + 718 | str(loss)) 719 | loss_functions = [losses.get(l) for l in loss] 720 | else: 721 | loss_function = losses.get(loss) 722 | loss_functions = [loss_function for _ in range(len(self.outputs))] 723 | self.loss_functions = loss_functions 724 | weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions] 725 | skip_indices = [] 726 | self._feed_outputs = [] 727 | self._feed_output_names = [] 728 | self._feed_output_shapes = [] 729 | self._feed_loss_fns = [] 730 | for i in range(len(weighted_losses)): 731 | if weighted_losses[i] is None: 732 | skip_indices.append(i) 733 | else: 734 | self._feed_outputs.append(self.outputs[i]) 735 | self._feed_output_names.append(self.output_names[i]) 736 | self._feed_output_shapes.append(self.internal_output_shapes[i]) 737 | self._feed_loss_fns.append(self.loss_functions[i]) 738 | 739 | # Prepare output masks. 740 | masks = self.compute_mask(self.inputs, mask=None) 741 | if masks is None: 742 | masks = [None for _ in self.outputs] 743 | if not isinstance(masks, list): 744 | masks = [masks] 745 | 746 | # Prepare loss weights. 747 | if loss_weights is None: 748 | loss_weights_list = [1. for _ in range(len(self.outputs))] 749 | elif isinstance(loss_weights, dict): 750 | for name in loss_weights: 751 | if name not in self.output_names: 752 | raise ValueError('Unknown entry in loss_weights ' 753 | 'dictionary: "' + name + '". ' 754 | 'Only expected the following keys: ' + 755 | str(self.output_names)) 756 | loss_weights_list = [] 757 | for name in self.output_names: 758 | loss_weights_list.append(loss_weights.get(name, 1.)) 759 | elif isinstance(loss_weights, list): 760 | if len(loss_weights) != len(self.outputs): 761 | raise ValueError('When passing a list as loss_weights, ' 762 | 'it should have one entry per model outputs. ' 763 | 'The model has ' + str(len(self.outputs)) + 764 | ' outputs, but you passed loss_weights=' + 765 | str(loss_weights)) 766 | loss_weights_list = loss_weights 767 | else: 768 | raise TypeError('Could not interpret loss_weights argument: ' + 769 | str(loss_weights) + 770 | ' - expected a list of dicts.') 771 | 772 | # Prepare sample weights. 773 | sample_weights = [] 774 | sample_weight_modes = [] 775 | if isinstance(sample_weight_mode, dict): 776 | for name in sample_weight_mode: 777 | if name not in self.output_names: 778 | raise ValueError('Unknown entry in ' 779 | 'sample_weight_mode dictionary: "' + 780 | name + '". ' 781 | 'Only expected the following keys: ' + 782 | str(self.output_names)) 783 | for i, name in enumerate(self.output_names): 784 | if i in skip_indices: 785 | weight = None 786 | sample_weight_modes.append(None) 787 | else: 788 | if name not in sample_weight_mode: 789 | raise ValueError('Output "' + name + 790 | '" missing from sample_weight_modes ' 791 | 'dictionary') 792 | if sample_weight_mode.get(name) == 'temporal': 793 | weight = K.placeholder(ndim=2, 794 | name=name + '_sample_weights') 795 | sample_weight_modes.append('temporal') 796 | else: 797 | weight = K.placeholder(ndim=1, 798 | name=name + '_sample_weights') 799 | sample_weight_modes.append(None) 800 | sample_weights.append(weight) 801 | elif isinstance(sample_weight_mode, list): 802 | if len(sample_weight_mode) != len(self.outputs): 803 | raise ValueError('When passing a list as sample_weight_mode, ' 804 | 'it should have one entry per model outputs. 
' 805 | 'The model has ' + str(len(self.outputs)) + 806 | ' outputs, but you passed ' 807 | 'sample_weight_mode=' + 808 | str(sample_weight_mode)) 809 | for i in range(len(self.output_names)): 810 | if i in skip_indices: 811 | weight = None 812 | sample_weight_modes.append(None) 813 | else: 814 | mode = sample_weight_mode[i] 815 | name = self.output_names[i] 816 | if mode == 'temporal': 817 | weight = K.placeholder(ndim=2, 818 | name=name + '_sample_weights') 819 | sample_weight_modes.append('temporal') 820 | else: 821 | weight = K.placeholder(ndim=1, 822 | name=name + '_sample_weights') 823 | sample_weight_modes.append(None) 824 | sample_weights.append(weight) 825 | else: 826 | for i, name in enumerate(self.output_names): 827 | if i in skip_indices: 828 | sample_weight_modes.append(None) 829 | sample_weights.append(None) 830 | else: 831 | if sample_weight_mode == 'temporal': 832 | sample_weights.append( 833 | K.placeholder(ndim=2, 834 | name=name + '_sample_weights')) 835 | sample_weight_modes.append('temporal') 836 | else: 837 | sample_weights.append( 838 | K.placeholder(ndim=1, 839 | name=name + '_sample_weights')) 840 | sample_weight_modes.append(None) 841 | self.sample_weight_modes = sample_weight_modes 842 | self._feed_sample_weight_modes = [] 843 | for i in range(len(self.outputs)): 844 | if i not in skip_indices: 845 | self._feed_sample_weight_modes.append(self.sample_weight_modes[i]) 846 | 847 | # Prepare targets of model. 848 | self.targets = [] 849 | self._feed_targets = [] 850 | for i in range(len(self.outputs)): 851 | if i in skip_indices: 852 | self.targets.append(None) 853 | else: 854 | shape = self.internal_output_shapes[i] 855 | name = self.output_names[i] 856 | target = K.placeholder(ndim=len(shape), 857 | name=name + '_target', 858 | sparse=K.is_sparse(self.outputs[i]), 859 | dtype=K.dtype(self.outputs[i])) 860 | self.targets.append(target) 861 | self._feed_targets.append(target) 862 | 863 | # Prepare metrics. 864 | self.metrics = metrics 865 | self.metrics_names = ['loss'] 866 | self.metrics_tensors = [] 867 | 868 | # Compute total loss. 869 | total_loss = None 870 | for i in range(len(self.outputs)): 871 | if i in skip_indices: 872 | continue 873 | y_true = self.targets[i] 874 | y_pred = self.outputs[i] 875 | weighted_loss = weighted_losses[i] 876 | sample_weight = sample_weights[i] 877 | mask = masks[i] 878 | loss_weight = loss_weights_list[i] 879 | output_loss = weighted_loss(y_true, y_pred, 880 | sample_weight, mask) 881 | if len(self.outputs) > 1: 882 | self.metrics_tensors.append(output_loss) 883 | self.metrics_names.append(self.output_names[i] + '_loss') 884 | if total_loss is None: 885 | total_loss = loss_weight * output_loss 886 | else: 887 | total_loss += loss_weight * output_loss 888 | if total_loss is None: 889 | if not self.losses: 890 | raise RuntimeError('The model cannot be compiled ' 891 | 'because it has no loss to optimize.') 892 | else: 893 | total_loss = 0. 894 | 895 | # Add regularization penalties 896 | # and other layer-specific losses. 897 | for loss_tensor in self.losses: 898 | total_loss += loss_tensor 899 | 900 | # List of same size as output_names. 901 | # contains tuples (metrics for output, names of metrics). 
902 | nested_metrics = _collect_metrics(metrics, self.output_names) 903 | 904 | def append_metric(layer_num, metric_name, metric_tensor): 905 | """Helper function used in loop below.""" 906 | if len(self.output_names) > 1: 907 | metric_name = self.output_layers[layer_num].name + '_' + metric_name 908 | self.metrics_names.append(metric_name) 909 | self.metrics_tensors.append(metric_tensor) 910 | 911 | for i in range(len(self.outputs)): 912 | if i in skip_indices: 913 | continue 914 | y_true = self.targets[i] 915 | y_pred = self.outputs[i] 916 | output_metrics = nested_metrics[i] 917 | for metric in output_metrics: 918 | if metric == 'accuracy' or metric == 'acc': 919 | # custom handling of accuracy 920 | # (because of class mode duality) 921 | output_shape = self.internal_output_shapes[i] 922 | acc_fn = None 923 | if (output_shape[-1] == 1 or 924 | self.loss_functions[i] == losses.binary_crossentropy): 925 | # case: binary accuracy 926 | acc_fn = metrics_module.binary_accuracy 927 | elif self.loss_functions[i] == losses.sparse_categorical_crossentropy: 928 | # case: categorical accuracy with sparse targets 929 | acc_fn = metrics_module.sparse_categorical_accuracy 930 | else: 931 | acc_fn = metrics_module.categorical_accuracy 932 | 933 | masked_fn = _masked_objective(acc_fn) 934 | append_metric(i, 'acc', masked_fn(y_true, y_pred, mask=masks[i])) 935 | else: 936 | metric_fn = metrics_module.get(metric) 937 | masked_metric_fn = _masked_objective(metric_fn) 938 | metric_result = masked_metric_fn(y_true, y_pred, mask=masks[i]) 939 | metric_result = { 940 | metric_fn.__name__: metric_result 941 | } 942 | for name, tensor in six.iteritems(metric_result): 943 | append_metric(i, name, tensor) 944 | 945 | # Prepare gradient updates and state updates. 946 | self.total_loss = total_loss 947 | self.sample_weights = sample_weights 948 | self._feed_sample_weights = [] 949 | for i in range(len(self.sample_weights)): 950 | if i not in skip_indices: 951 | self._feed_sample_weights.append(sample_weights[i]) 952 | 953 | # Functions for train, test and predict will 954 | # be compiled lazily when required. 955 | # This saves time when the user is not using all functions. 956 | self._function_kwargs = kwargs 957 | 958 | self.train_function = None 959 | self.test_function = None 960 | self.predict_function = None 961 | 962 | # Collected trainable weights, sorted in topological order. 963 | trainable_weights = self.trainable_weights 964 | self._collected_trainable_weights = trainable_weights 965 | 966 | def _make_train_function(self): 967 | if not hasattr(self, 'train_function'): 968 | raise RuntimeError('You must compile your model before using it.') 969 | if self.train_function is None: 970 | inputs = self._feed_inputs + self._feed_targets + self._feed_sample_weights 971 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 972 | inputs += [K.learning_phase()] 973 | 974 | training_updates = self.optimizer.get_updates( 975 | self._collected_trainable_weights, 976 | self.constraints, 977 | self.total_loss) 978 | updates = self.updates + training_updates 979 | # Gets loss and metrics. Updates weights at each call. 
980 | self.train_function = K.function(inputs, 981 | [self.total_loss] + self.metrics_tensors, 982 | updates=updates, 983 | name='train_function', 984 | **self._function_kwargs) 985 | 986 | def _make_test_function(self): 987 | if not hasattr(self, 'test_function'): 988 | raise RuntimeError('You must compile your model before using it.') 989 | if self.test_function is None: 990 | inputs = self._feed_inputs + self._feed_targets + self._feed_sample_weights 991 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 992 | inputs += [K.learning_phase()] 993 | # Return loss and metrics, no gradient updates. 994 | # Does update the network states. 995 | self.test_function = K.function(inputs, 996 | [self.total_loss] + self.metrics_tensors, 997 | updates=self.state_updates, 998 | name='test_function', 999 | **self._function_kwargs) 1000 | 1001 | def _make_predict_function(self): 1002 | if not hasattr(self, 'predict_function'): 1003 | self.predict_function = None 1004 | if self.predict_function is None: 1005 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1006 | inputs = self._feed_inputs + [K.learning_phase()] 1007 | else: 1008 | inputs = self._feed_inputs 1009 | # Gets network outputs. Does not update weights. 1010 | # Does update the network states. 1011 | kwargs = getattr(self, '_function_kwargs', {}) 1012 | self.predict_function = K.function(inputs, 1013 | self.outputs, 1014 | updates=self.state_updates, 1015 | name='predict_function', 1016 | **kwargs) 1017 | 1018 | def _fit_loop(self, f, ins, out_labels=None, batch_size=32, 1019 | epochs=100, verbose=1, callbacks=None, 1020 | val_f=None, val_ins=None, shuffle=True, 1021 | callback_metrics=None, initial_epoch=0): 1022 | """Abstract fit function for `f(ins)`. 1023 | 1024 | Assume that f returns a list, labeled by out_labels. 1025 | 1026 | # Arguments 1027 | f: Keras function returning a list of tensors 1028 | ins: list of tensors to be fed to `f` 1029 | out_labels: list of strings, display names of 1030 | the outputs of `f` 1031 | batch_size: integer batch size 1032 | epochs: number of times to iterate over the data 1033 | verbose: verbosity mode, 0, 1 or 2 1034 | callbacks: list of callbacks to be called during training 1035 | val_f: Keras function to call for validation 1036 | val_ins: list of tensors to be fed to `val_f` 1037 | shuffle: whether to shuffle the data at the beginning of each epoch 1038 | callback_metrics: list of strings, the display names of the metrics 1039 | passed to the callbacks. They should be the 1040 | concatenation of list the display names of the outputs of 1041 | `f` and the list of display names of the outputs of `f_val`. 1042 | initial_epoch: epoch at which to start training 1043 | (useful for resuming a previous training run) 1044 | 1045 | # Returns 1046 | `History` object. 1047 | """ 1048 | do_validation = False 1049 | if val_f and val_ins: 1050 | do_validation = True 1051 | if verbose: 1052 | print('Train on %d samples, validate on %d samples' % 1053 | (ins[0].shape[0], val_ins[0].shape[0])) 1054 | 1055 | if ins and hasattr(ins[0], 'shape'): 1056 | num_train_samples = ins[0].shape[0] 1057 | else: 1058 | # May happen if we are running `fit` without Numpy input data, 1059 | # i.e. if all inputs to the models are data tensors 1060 | # instead of placeholders. 1061 | # In that case we will run `fit` over a single batch. 
1062 | num_train_samples = batch_size 1063 | verbose = 2 1064 | index_array = np.arange(num_train_samples) 1065 | 1066 | self.history = cbks.History() 1067 | callbacks = [cbks.BaseLogger()] + (callbacks or []) + [self.history] 1068 | if verbose: 1069 | callbacks += [cbks.ProgbarLogger()] 1070 | callbacks = cbks.CallbackList(callbacks) 1071 | out_labels = out_labels or [] 1072 | 1073 | # it's possible to callback a different model than self 1074 | # (used by Sequential models) 1075 | if hasattr(self, 'callback_model') and self.callback_model: 1076 | callback_model = self.callback_model 1077 | else: 1078 | callback_model = self 1079 | 1080 | callbacks.set_model(callback_model) 1081 | callbacks.set_params({ 1082 | 'batch_size': batch_size, 1083 | 'epochs': epochs, 1084 | 'samples': num_train_samples, 1085 | 'verbose': verbose, 1086 | 'do_validation': do_validation, 1087 | 'metrics': callback_metrics or [], 1088 | }) 1089 | callbacks.on_train_begin() 1090 | callback_model.stop_training = False 1091 | for cbk in callbacks: 1092 | cbk.validation_data = val_ins 1093 | 1094 | for epoch in range(initial_epoch, epochs): 1095 | callbacks.on_epoch_begin(epoch) 1096 | if shuffle == 'batch': 1097 | index_array = _batch_shuffle(index_array, batch_size) 1098 | elif shuffle: 1099 | np.random.shuffle(index_array) 1100 | 1101 | batches = _make_batches(num_train_samples, batch_size) 1102 | epoch_logs = {} 1103 | for batch_index, (batch_start, batch_end) in enumerate(batches): 1104 | batch_ids = index_array[batch_start:batch_end] 1105 | try: 1106 | if isinstance(ins[-1], float): 1107 | # Do not slice the training phase flag. 1108 | ins_batch = _slice_arrays(ins[:-1], batch_ids) + [ins[-1]] 1109 | else: 1110 | ins_batch = _slice_arrays(ins, batch_ids) 1111 | except TypeError: 1112 | raise TypeError('TypeError while preparing batch. ' 1113 | 'If using HDF5 input data, ' 1114 | 'pass shuffle="batch".') 1115 | batch_logs = {} 1116 | batch_logs['batch'] = batch_index 1117 | batch_logs['size'] = len(batch_ids) 1118 | callbacks.on_batch_begin(batch_index, batch_logs) 1119 | outs = f(ins_batch) 1120 | if not isinstance(outs, list): 1121 | outs = [outs] 1122 | for l, o in zip(out_labels, outs): 1123 | batch_logs[l] = o 1124 | 1125 | callbacks.on_batch_end(batch_index, batch_logs) 1126 | if callback_model.stop_training: 1127 | break 1128 | 1129 | if batch_index == len(batches) - 1: # Last batch. 1130 | if do_validation: 1131 | val_outs = self._test_loop(val_f, val_ins, 1132 | batch_size=batch_size, 1133 | verbose=0) 1134 | if not isinstance(val_outs, list): 1135 | val_outs = [val_outs] 1136 | # Same labels assumed. 1137 | for l, o in zip(out_labels, val_outs): 1138 | epoch_logs['val_' + l] = o 1139 | callbacks.on_epoch_end(epoch, epoch_logs) 1140 | if callback_model.stop_training: 1141 | break 1142 | callbacks.on_train_end() 1143 | return self.history 1144 | 1145 | def _predict_loop(self, f, ins, batch_size=32, verbose=0): 1146 | """Abstract method to loop over some data in batches. 1147 | 1148 | # Arguments 1149 | f: Keras function returning a list of tensors. 1150 | ins: list of tensors to be fed to `f`. 1151 | batch_size: integer batch size. 1152 | verbose: verbosity mode. 1153 | 1154 | # Returns 1155 | Array of predictions (if the model has a single output) 1156 | or list of arrays of predictions 1157 | (if the model has multiple outputs). 
1158 | """ 1159 | if ins and hasattr(ins[0], 'shape'): 1160 | samples = ins[0].shape[0] 1161 | else: 1162 | # May happen if we are running `predict` without Numpy input data, 1163 | # i.e. if all inputs to the models are data tensors 1164 | # instead of placeholders. 1165 | # In that case we will run `predict` over a single batch. 1166 | samples = batch_size 1167 | verbose = 2 1168 | outs = [] 1169 | if verbose == 1: 1170 | progbar = Progbar(target=samples) 1171 | batches = _make_batches(samples, batch_size) 1172 | index_array = np.arange(samples) 1173 | for batch_index, (batch_start, batch_end) in enumerate(batches): 1174 | batch_ids = index_array[batch_start:batch_end] 1175 | if ins and isinstance(ins[-1], float): 1176 | # Do not slice the training phase flag. 1177 | ins_batch = _slice_arrays(ins[:-1], batch_ids) + [ins[-1]] 1178 | else: 1179 | ins_batch = _slice_arrays(ins, batch_ids) 1180 | 1181 | batch_outs = f(ins_batch) 1182 | if not isinstance(batch_outs, list): 1183 | batch_outs = [batch_outs] 1184 | if batch_index == 0: 1185 | for batch_out in batch_outs: 1186 | shape = (samples,) + batch_out.shape[1:] 1187 | outs.append(np.zeros(shape, dtype=batch_out.dtype)) 1188 | 1189 | for i, batch_out in enumerate(batch_outs): 1190 | outs[i][batch_start:batch_end] = batch_out 1191 | if verbose == 1: 1192 | progbar.update(batch_end) 1193 | if len(outs) == 1: 1194 | return outs[0] 1195 | return outs 1196 | 1197 | def _test_loop(self, f, ins, batch_size=32, verbose=0): 1198 | """Abstract method to loop over some data in batches. 1199 | 1200 | # Arguments 1201 | f: Keras function returning a list of tensors. 1202 | ins: list of tensors to be fed to `f`. 1203 | batch_size: integer batch size. 1204 | verbose: verbosity mode. 1205 | 1206 | # Returns 1207 | Scalar loss (if the model has a single output and no metrics) 1208 | or list of scalars (if the model has multiple outputs 1209 | and/or metrics). The attribute `model.metrics_names` will give you 1210 | the display labels for the scalar outputs. 1211 | """ 1212 | if ins and hasattr(ins[0], 'shape'): 1213 | samples = ins[0].shape[0] 1214 | else: 1215 | # May happen if we are running `evaluate` without Numpy input data, 1216 | # i.e. if all inputs to the models are data tensors 1217 | # instead of placeholders. 1218 | # In that case we will run `evaluate` over a single batch. 1219 | samples = batch_size 1220 | verbose = 2 1221 | 1222 | outs = [] 1223 | if verbose == 1: 1224 | progbar = Progbar(target=samples) 1225 | batches = _make_batches(samples, batch_size) 1226 | index_array = np.arange(samples) 1227 | for batch_index, (batch_start, batch_end) in enumerate(batches): 1228 | batch_ids = index_array[batch_start:batch_end] 1229 | if isinstance(ins[-1], float): 1230 | # Do not slice the training phase flag. 1231 | ins_batch = _slice_arrays(ins[:-1], batch_ids) + [ins[-1]] 1232 | else: 1233 | ins_batch = _slice_arrays(ins, batch_ids) 1234 | 1235 | batch_outs = f(ins_batch) 1236 | if isinstance(batch_outs, list): 1237 | if batch_index == 0: 1238 | for batch_out in enumerate(batch_outs): 1239 | outs.append(0.) 1240 | for i, batch_out in enumerate(batch_outs): 1241 | outs[i] += batch_out * len(batch_ids) 1242 | else: 1243 | if batch_index == 0: 1244 | outs.append(0.) 
1245 | outs[0] += batch_outs * len(batch_ids) 1246 | 1247 | if verbose == 1: 1248 | progbar.update(batch_end) 1249 | for i in range(len(outs)): 1250 | outs[i] /= samples 1251 | if len(outs) == 1: 1252 | return outs[0] 1253 | return outs 1254 | 1255 | def _standardize_user_data(self, x, y, 1256 | sample_weight=None, class_weight=None, 1257 | check_batch_axis=True, batch_size=None): 1258 | if not hasattr(self, 'optimizer'): 1259 | raise RuntimeError('You must compile a model before ' 1260 | 'training/testing. ' 1261 | 'Use `model.compile(optimizer, loss)`.') 1262 | 1263 | output_shapes = [] 1264 | for output_shape, loss_fn in zip(self._feed_output_shapes, self._feed_loss_fns): 1265 | if loss_fn.__name__ == 'sparse_categorical_crossentropy': 1266 | output_shapes.append(output_shape[:-1] + (1,)) 1267 | elif getattr(losses, loss_fn.__name__, None) is None: 1268 | output_shapes.append(None) 1269 | else: 1270 | output_shapes.append(output_shape) 1271 | x = _standardize_input_data(x, self._feed_input_names, 1272 | self._feed_input_shapes, 1273 | check_batch_axis=False, 1274 | exception_prefix='input') 1275 | y = _standardize_input_data(y, self._feed_output_names, 1276 | output_shapes, 1277 | check_batch_axis=False, 1278 | exception_prefix='target') 1279 | sample_weights = _standardize_sample_weights(sample_weight, 1280 | self._feed_output_names) 1281 | class_weights = _standardize_class_weights(class_weight, 1282 | self._feed_output_names) 1283 | sample_weights = [_standardize_weights(ref, sw, cw, mode) 1284 | for (ref, sw, cw, mode) 1285 | in zip(y, sample_weights, class_weights, self._feed_sample_weight_modes)] 1286 | _check_array_lengths(x, y, sample_weights) 1287 | _check_loss_and_target_compatibility(y, 1288 | self._feed_loss_fns, 1289 | self._feed_output_shapes) 1290 | if self.stateful and batch_size: 1291 | if x[0].shape[0] % batch_size != 0: 1292 | raise ValueError('In a stateful network, ' 1293 | 'you should only pass inputs with ' 1294 | 'a number of samples that can be ' 1295 | 'divided by the batch size. Found: ' + 1296 | str(x[0].shape[0]) + ' samples') 1297 | return x, y, sample_weights 1298 | 1299 | def _get_deduped_metrics_names(self): 1300 | out_labels = self.metrics_names 1301 | 1302 | # Rename duplicated metrics name 1303 | # (can happen with an output layer shared among multiple dataflows). 1304 | deduped_out_labels = [] 1305 | for i, label in enumerate(out_labels): 1306 | new_label = label 1307 | if out_labels.count(label) > 1: 1308 | dup_idx = out_labels[:i].count(label) 1309 | new_label += '_' + str(dup_idx + 1) 1310 | deduped_out_labels.append(new_label) 1311 | return deduped_out_labels 1312 | 1313 | def fit(self, x=None, 1314 | y=None, 1315 | batch_size=32, 1316 | epochs=1, 1317 | verbose=1, 1318 | callbacks=None, 1319 | validation_split=0., 1320 | validation_data=None, 1321 | shuffle=True, 1322 | class_weight=None, 1323 | sample_weight=None, 1324 | initial_epoch=0, 1325 | **kwargs): 1326 | """Trains the model for a fixed number of epochs (iterations on a dataset). 1327 | 1328 | # Arguments 1329 | x: Numpy array of training data, 1330 | or list of Numpy arrays if the model has multiple inputs. 1331 | If all inputs in the model are named, 1332 | you can also pass a dictionary 1333 | mapping input names to Numpy arrays. 1334 | y: Numpy array of target data, 1335 | or list of Numpy arrays if the model has multiple outputs. 1336 | If all outputs in the model are named, 1337 | you can also pass a dictionary 1338 | mapping output names to Numpy arrays. 
1339 | batch_size: integer. Number of samples per gradient update. 1340 | epochs: integer, the number of times to iterate 1341 | over the training data arrays. 1342 | verbose: 0, 1, or 2. Verbosity mode. 1343 | 0 = silent, 1 = verbose, 2 = one log line per epoch. 1344 | callbacks: list of callbacks to be called during training. 1345 | See [callbacks](/callbacks). 1346 | validation_split: float between 0 and 1: 1347 | fraction of the training data to be used as validation data. 1348 | The model will set apart this fraction of the training data, 1349 | will not train on it, and will evaluate 1350 | the loss and any model metrics 1351 | on this data at the end of each epoch. 1352 | validation_data: data on which to evaluate 1353 | the loss and any model metrics 1354 | at the end of each epoch. The model will not 1355 | be trained on this data. 1356 | This could be a tuple (x_val, y_val) 1357 | or a tuple (x_val, y_val, val_sample_weights). 1358 | shuffle: boolean, whether to shuffle the training data 1359 | before each epoch. 1360 | class_weight: optional dictionary mapping 1361 | class indices (integers) to 1362 | a weight (float) to apply to the model's loss for the samples 1363 | from this class during training. 1364 | This can be useful to tell the model to "pay more attention" to 1365 | samples from an under-represented class. 1366 | sample_weight: optional array of the same length as x, containing 1367 | weights to apply to the model's loss for each sample. 1368 | In the case of temporal data, you can pass a 2D array 1369 | with shape (samples, sequence_length), 1370 | to apply a different weight to every timestep of every sample. 1371 | In this case you should make sure to specify 1372 | sample_weight_mode="temporal" in compile(). 1373 | initial_epoch: epoch at which to start training 1374 | (useful for resuming a previous training run) 1375 | 1376 | # Returns 1377 | A `History` instance. Its `history` attribute contains 1378 | all information collected during training. 1379 | 1380 | # Raises 1381 | ValueError: In case of mismatch between the provided input data 1382 | and what the model expects. 1383 | """ 1384 | # Legacy support 1385 | if 'nb_epoch' in kwargs: 1386 | warnings.warn('The `nb_epoch` argument in `fit` ' 1387 | 'has been renamed `epochs`.', stacklevel=2) 1388 | epochs = kwargs.pop('nb_epoch') 1389 | if kwargs: 1390 | raise TypeError('Unrecognized keyword arguments: ' + str(kwargs)) 1391 | 1392 | # Validate user data. 1393 | x, y, sample_weights = self._standardize_user_data( 1394 | x, y, 1395 | sample_weight=sample_weight, 1396 | class_weight=class_weight, 1397 | check_batch_axis=False, 1398 | batch_size=batch_size) 1399 | # Prepare validation data. 
1400 | if validation_data: 1401 | do_validation = True 1402 | if len(validation_data) == 2: 1403 | val_x, val_y = validation_data 1404 | val_sample_weight = None 1405 | elif len(validation_data) == 3: 1406 | val_x, val_y, val_sample_weight = validation_data 1407 | else: 1408 | raise ValueError('When passing validation_data, ' 1409 | 'it must contain 2 (x_val, y_val) ' 1410 | 'or 3 (x_val, y_val, val_sample_weights) ' 1411 | 'items, however it contains %d items' % 1412 | len(validation_data)) 1413 | 1414 | val_x, val_y, val_sample_weights = self._standardize_user_data( 1415 | val_x, val_y, 1416 | sample_weight=val_sample_weight, 1417 | check_batch_axis=False, 1418 | batch_size=batch_size) 1419 | self._make_test_function() 1420 | val_f = self.test_function 1421 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1422 | val_ins = val_x + val_y + val_sample_weights + [0.] 1423 | else: 1424 | val_ins = val_x + val_y + val_sample_weights 1425 | 1426 | elif validation_split and 0. < validation_split < 1.: 1427 | do_validation = True 1428 | if hasattr(x[0], 'shape'): 1429 | split_at = int(x[0].shape[0] * (1. - validation_split)) 1430 | else: 1431 | split_at = int(len(x[0]) * (1. - validation_split)) 1432 | x, val_x = (_slice_arrays(x, 0, split_at), _slice_arrays(x, split_at)) 1433 | y, val_y = (_slice_arrays(y, 0, split_at), _slice_arrays(y, split_at)) 1434 | sample_weights, val_sample_weights = ( 1435 | _slice_arrays(sample_weights, 0, split_at), 1436 | _slice_arrays(sample_weights, split_at)) 1437 | self._make_test_function() 1438 | val_f = self.test_function 1439 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1440 | val_ins = val_x + val_y + val_sample_weights + [0.] 1441 | else: 1442 | val_ins = val_x + val_y + val_sample_weights 1443 | else: 1444 | do_validation = False 1445 | val_f = None 1446 | val_ins = None 1447 | 1448 | # Prepare input arrays and training function. 1449 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1450 | ins = x + y + sample_weights + [1.] 1451 | else: 1452 | ins = x + y + sample_weights 1453 | self._make_train_function() 1454 | f = self.train_function 1455 | 1456 | # Prepare display labels. 1457 | out_labels = self._get_deduped_metrics_names() 1458 | 1459 | if do_validation: 1460 | callback_metrics = copy.copy(out_labels) + ['val_' + n for n in out_labels] 1461 | else: 1462 | callback_metrics = copy.copy(out_labels) 1463 | 1464 | # Delegate logic to `_fit_loop`. 1465 | return self._fit_loop(f, ins, out_labels=out_labels, 1466 | batch_size=batch_size, epochs=epochs, 1467 | verbose=verbose, callbacks=callbacks, 1468 | val_f=val_f, val_ins=val_ins, shuffle=shuffle, 1469 | callback_metrics=callback_metrics, 1470 | initial_epoch=initial_epoch) 1471 | 1472 | def evaluate(self, x, y, batch_size=32, verbose=1, sample_weight=None): 1473 | """Returns the loss value & metrics values for the model in test mode. 1474 | 1475 | Computation is done in batches. 1476 | 1477 | # Arguments 1478 | x: Numpy array of test data, 1479 | or list of Numpy arrays if the model has multiple inputs. 1480 | If all inputs in the model are named, 1481 | you can also pass a dictionary 1482 | mapping input names to Numpy arrays. 1483 | y: Numpy array of target data, 1484 | or list of Numpy arrays if the model has multiple outputs. 1485 | If all outputs in the model are named, 1486 | you can also pass a dictionary 1487 | mapping output names to Numpy arrays. 1488 | batch_size: integer. Number of samples per gradient update. 
1489 | verbose: verbosity mode, 0 or 1. 1490 | sample_weight: Array of weights to weight the contribution 1491 | of different samples to the loss and metrics. 1492 | 1493 | # Returns 1494 | Scalar test loss (if the model has a single output and no metrics) 1495 | or list of scalars (if the model has multiple outputs 1496 | and/or metrics). The attribute `model.metrics_names` will give you 1497 | the display labels for the scalar outputs. 1498 | """ 1499 | # Validate user data. 1500 | x, y, sample_weights = self._standardize_user_data( 1501 | x, y, 1502 | sample_weight=sample_weight, 1503 | check_batch_axis=False, 1504 | batch_size=batch_size) 1505 | # Prepare inputs, delegate logic to `_test_loop`. 1506 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1507 | ins = x + y + sample_weights + [0.] 1508 | else: 1509 | ins = x + y + sample_weights 1510 | self._make_test_function() 1511 | f = self.test_function 1512 | return self._test_loop(f, ins, 1513 | batch_size=batch_size, 1514 | verbose=verbose) 1515 | 1516 | def predict(self, x, batch_size=32, verbose=0): 1517 | """Generates output predictions for the input samples. 1518 | 1519 | Computation is done in batches. 1520 | 1521 | # Arguments 1522 | x: the input data, as a Numpy array 1523 | (or list of Numpy arrays if the model has multiple outputs). 1524 | batch_size: integer. 1525 | verbose: verbosity mode, 0 or 1. 1526 | 1527 | # Returns 1528 | Numpy array(s) of predictions. 1529 | 1530 | # Raises 1531 | ValueError: In case of mismatch between the provided 1532 | input data and the model's expectations, 1533 | or in case a stateful model receives a number of samples 1534 | that is not a multiple of the batch size. 1535 | """ 1536 | # Validate user data. 1537 | x = _standardize_input_data(x, self._feed_input_names, 1538 | self._feed_input_shapes, 1539 | check_batch_axis=False) 1540 | if self.stateful: 1541 | if x[0].shape[0] > batch_size and x[0].shape[0] % batch_size != 0: 1542 | raise ValueError('In a stateful network, ' 1543 | 'you should only pass inputs with ' 1544 | 'a number of samples that can be ' 1545 | 'divided by the batch size. Found: ' + 1546 | str(x[0].shape[0]) + ' samples. ' 1547 | 'Batch size: ' + str(batch_size) + '.') 1548 | 1549 | # Prepare inputs, delegate logic to `_predict_loop`. 1550 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1551 | ins = x + [0.] 1552 | else: 1553 | ins = x 1554 | self._make_predict_function() 1555 | f = self.predict_function 1556 | return self._predict_loop(f, ins, 1557 | batch_size=batch_size, verbose=verbose) 1558 | 1559 | def train_on_batch(self, x, y, 1560 | sample_weight=None, class_weight=None): 1561 | """Runs a single gradient update on a single batch of data. 1562 | 1563 | # Arguments 1564 | x: Numpy array of training data, 1565 | or list of Numpy arrays if the model has multiple inputs. 1566 | If all inputs in the model are named, 1567 | you can also pass a dictionary 1568 | mapping input names to Numpy arrays. 1569 | y: Numpy array of target data, 1570 | or list of Numpy arrays if the model has multiple outputs. 1571 | If all outputs in the model are named, 1572 | you can also pass a dictionary 1573 | mapping output names to Numpy arrays. 1574 | sample_weight: optional array of the same length as x, containing 1575 | weights to apply to the model's loss for each sample. 
1576 | In the case of temporal data, you can pass a 2D array 1577 | with shape (samples, sequence_length), 1578 | to apply a different weight to every timestep of every sample. 1579 | In this case you should make sure to specify 1580 | sample_weight_mode="temporal" in compile(). 1581 | class_weight: optional dictionary mapping 1582 | class indices (integers) to 1583 | a weight (float) to apply to the model's loss for the samples 1584 | from this class during training. 1585 | This can be useful to tell the model to "pay more attention" to 1586 | samples from an under-represented class. 1587 | 1588 | # Returns 1589 | Scalar training loss 1590 | (if the model has a single output and no metrics) 1591 | or list of scalars (if the model has multiple outputs 1592 | and/or metrics). The attribute `model.metrics_names` will give you 1593 | the display labels for the scalar outputs. 1594 | """ 1595 | x, y, sample_weights = self._standardize_user_data( 1596 | x, y, 1597 | sample_weight=sample_weight, 1598 | class_weight=class_weight, 1599 | check_batch_axis=True) 1600 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1601 | ins = x + y + sample_weights + [1.] 1602 | else: 1603 | ins = x + y + sample_weights 1604 | self._make_train_function() 1605 | outputs = self.train_function(ins) 1606 | if len(outputs) == 1: 1607 | return outputs[0] 1608 | return outputs 1609 | 1610 | def test_on_batch(self, x, y, sample_weight=None): 1611 | """Test the model on a single batch of samples. 1612 | 1613 | # Arguments 1614 | x: Numpy array of test data, 1615 | or list of Numpy arrays if the model has multiple inputs. 1616 | If all inputs in the model are named, 1617 | you can also pass a dictionary 1618 | mapping input names to Numpy arrays. 1619 | y: Numpy array of target data, 1620 | or list of Numpy arrays if the model has multiple outputs. 1621 | If all outputs in the model are named, 1622 | you can also pass a dictionary 1623 | mapping output names to Numpy arrays. 1624 | sample_weight: optional array of the same length as x, containing 1625 | weights to apply to the model's loss for each sample. 1626 | In the case of temporal data, you can pass a 2D array 1627 | with shape (samples, sequence_length), 1628 | to apply a different weight to every timestep of every sample. 1629 | In this case you should make sure to specify 1630 | sample_weight_mode="temporal" in compile(). 1631 | 1632 | # Returns 1633 | Scalar test loss (if the model has a single output and no metrics) 1634 | or list of scalars (if the model has multiple outputs 1635 | and/or metrics). The attribute `model.metrics_names` will give you 1636 | the display labels for the scalar outputs. 1637 | """ 1638 | x, y, sample_weights = self._standardize_user_data( 1639 | x, y, 1640 | sample_weight=sample_weight, 1641 | check_batch_axis=True) 1642 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1643 | ins = x + y + sample_weights + [0.] 1644 | else: 1645 | ins = x + y + sample_weights 1646 | self._make_test_function() 1647 | outputs = self.test_function(ins) 1648 | if len(outputs) == 1: 1649 | return outputs[0] 1650 | return outputs 1651 | 1652 | def predict_on_batch(self, x): 1653 | """Returns predictions for a single batch of samples. 1654 | 1655 | # Arguments 1656 | x: Input samples, as a Numpy array. 1657 | 1658 | # Returns 1659 | Numpy array(s) of predictions. 
1660 | """ 1661 | x = _standardize_input_data(x, self._feed_input_names, 1662 | self._feed_input_shapes) 1663 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1664 | ins = x + [0.] 1665 | else: 1666 | ins = x 1667 | self._make_predict_function() 1668 | outputs = self.predict_function(ins) 1669 | if len(outputs) == 1: 1670 | return outputs[0] 1671 | return outputs 1672 | 1673 | @interfaces.legacy_generator_methods_support 1674 | def fit_generator(self, generator, 1675 | steps_per_epoch, 1676 | epochs=1, 1677 | verbose=1, 1678 | callbacks=None, 1679 | validation_data=None, 1680 | validation_steps=None, 1681 | class_weight=None, 1682 | max_queue_size=10, 1683 | workers=1, 1684 | use_multiprocessing=False, 1685 | initial_epoch=0): 1686 | """Fits the model on data yielded batch-by-batch by a Python generator. 1687 | 1688 | The generator is run in parallel to the model, for efficiency. 1689 | For instance, this allows you to do real-time data augmentation 1690 | on images on CPU in parallel to training your model on GPU. 1691 | 1692 | The use of `keras.utils.Sequence` guarantees the ordering 1693 | and guarantees the single use of every input per epoch when 1694 | using `use_multiprocessing=True`. 1695 | 1696 | # Arguments 1697 | generator: a generator or an instance of Sequence (keras.utils.Sequence) 1698 | object in order to avoid duplicate data 1699 | when using multiprocessing. 1700 | The output of the generator must be either 1701 | - a tuple (inputs, targets) 1702 | - a tuple (inputs, targets, sample_weights). 1703 | All arrays should contain the same number of samples. 1704 | The generator is expected to loop over its data 1705 | indefinitely. An epoch finishes when `steps_per_epoch` 1706 | batches have been seen by the model. 1707 | steps_per_epoch: Total number of steps (batches of samples) 1708 | to yield from `generator` before declaring one epoch 1709 | finished and starting the next epoch. It should typically 1710 | be equal to the number of unique samples if your dataset 1711 | divided by the batch size. 1712 | epochs: integer, total number of iterations on the data. 1713 | verbose: verbosity mode, 0, 1, or 2. 1714 | callbacks: list of callbacks to be called during training. 1715 | validation_data: this can be either 1716 | - a generator for the validation data 1717 | - a tuple (inputs, targets) 1718 | - a tuple (inputs, targets, sample_weights). 1719 | validation_steps: Only relevant if `validation_data` 1720 | is a generator. Total number of steps (batches of samples) 1721 | to yield from `generator` before stopping. 1722 | class_weight: dictionary mapping class indices to a weight 1723 | for the class. 1724 | max_queue_size: maximum size for the generator queue 1725 | workers: maximum number of processes to spin up 1726 | when using process based threading 1727 | use_multiprocessing: if True, use process based threading. 1728 | Note that because 1729 | this implementation relies on multiprocessing, 1730 | you should not pass 1731 | non picklable arguments to the generator 1732 | as they can't be passed 1733 | easily to children processes. 1734 | initial_epoch: epoch at which to start training 1735 | (useful for resuming a previous training run) 1736 | 1737 | # Returns 1738 | A `History` object. 
1739 | 1740 | # Example 1741 | 1742 | ```python 1743 | def generate_arrays_from_file(path): 1744 | while 1: 1745 | f = open(path) 1746 | for line in f: 1747 | # create numpy arrays of input data 1748 | # and labels, from each line in the file 1749 | x1, x2, y = process_line(line) 1750 | yield ({'input_1': x1, 'input_2': x2}, {'output': y}) 1751 | f.close() 1752 | 1753 | model.fit_generator(generate_arrays_from_file('/my_file.txt'), 1754 | steps_per_epoch=10000, epochs=10) 1755 | ``` 1756 | 1757 | # Raises 1758 | ValueError: In case the generator yields 1759 | data in an invalid format. 1760 | """ 1761 | wait_time = 0.01 # in seconds 1762 | epoch = initial_epoch 1763 | 1764 | do_validation = bool(validation_data) 1765 | self._make_train_function() 1766 | if do_validation: 1767 | self._make_test_function() 1768 | 1769 | # python 2 has 'next', 3 has '__next__' 1770 | # avoid any explicit version checks 1771 | val_gen = (hasattr(validation_data, 'next') or 1772 | hasattr(validation_data, '__next__') or 1773 | isinstance(validation_data, Sequence)) 1774 | if val_gen and not validation_steps: 1775 | raise ValueError('When using a generator for validation data, ' 1776 | 'you must specify a value for ' 1777 | '`validation_steps`.') 1778 | 1779 | # Prepare display labels. 1780 | out_labels = self._get_deduped_metrics_names() 1781 | callback_metrics = out_labels + ['val_' + n for n in out_labels] 1782 | 1783 | # prepare callbacks 1784 | self.history = cbks.History() 1785 | callbacks = [cbks.BaseLogger()] + (callbacks or []) + [self.history] 1786 | if verbose: 1787 | callbacks += [cbks.ProgbarLogger(count_mode='steps')] 1788 | callbacks = cbks.CallbackList(callbacks) 1789 | 1790 | # it's possible to callback a different model than self: 1791 | if hasattr(self, 'callback_model') and self.callback_model: 1792 | callback_model = self.callback_model 1793 | else: 1794 | callback_model = self 1795 | callbacks.set_model(callback_model) 1796 | callbacks.set_params({ 1797 | 'epochs': epochs, 1798 | 'steps': steps_per_epoch, 1799 | 'verbose': verbose, 1800 | 'do_validation': do_validation, 1801 | 'metrics': callback_metrics, 1802 | }) 1803 | callbacks.on_train_begin() 1804 | 1805 | if do_validation and not val_gen: 1806 | if len(validation_data) == 2: 1807 | val_x, val_y = validation_data 1808 | val_sample_weight = None 1809 | elif len(validation_data) == 3: 1810 | val_x, val_y, val_sample_weight = validation_data 1811 | else: 1812 | raise ValueError('`validation_data` should be a tuple ' 1813 | '`(val_x, val_y, val_sample_weight)` ' 1814 | 'or `(val_x, val_y)`. Found: ' + 1815 | str(validation_data)) 1816 | val_x, val_y, val_sample_weights = self._standardize_user_data( 1817 | val_x, val_y, val_sample_weight) 1818 | val_data = val_x + val_y + val_sample_weights 1819 | if self.uses_learning_phase and not isinstance(K.learning_phase(), int): 1820 | val_data += [0.] 1821 | for cbk in callbacks: 1822 | cbk.validation_data = val_data 1823 | is_sequence = isinstance(generator, Sequence) 1824 | if not is_sequence and use_multiprocessing and workers > 1: 1825 | warnings.warn( 1826 | UserWarning('Using a generator with `use_multiprocessing=True`' 1827 | ' and multiple workers may duplicate your data.' 
1828 | ' Please consider using the`keras.utils.Sequence' 1829 | ' class.')) 1830 | enqueuer = None 1831 | 1832 | try: 1833 | if is_sequence: 1834 | enqueuer = OrderedEnqueuer(generator, 1835 | use_multiprocessing=use_multiprocessing) 1836 | else: 1837 | enqueuer = GeneratorEnqueuer(generator, 1838 | use_multiprocessing=use_multiprocessing, 1839 | wait_time=wait_time) 1840 | enqueuer.start(workers=workers, max_queue_size=max_queue_size) 1841 | output_generator = enqueuer.get() 1842 | 1843 | callback_model.stop_training = False 1844 | while epoch < epochs: 1845 | callbacks.on_epoch_begin(epoch) 1846 | steps_done = 0 1847 | batch_index = 0 1848 | while steps_done < steps_per_epoch: 1849 | generator_output = next(output_generator) 1850 | 1851 | if not hasattr(generator_output, '__len__'): 1852 | raise ValueError('Output of generator should be ' 1853 | 'a tuple `(x, y, sample_weight)` ' 1854 | 'or `(x, y)`. Found: ' + 1855 | str(generator_output)) 1856 | if len(generator_output) == 2: 1857 | x, y = generator_output 1858 | sample_weight = None 1859 | elif len(generator_output) == 3: 1860 | x, y, sample_weight = generator_output 1861 | else: 1862 | raise ValueError('Output of generator should be ' 1863 | 'a tuple `(x, y, sample_weight)` ' 1864 | 'or `(x, y)`. Found: ' + 1865 | str(generator_output)) 1866 | # build batch logs 1867 | batch_logs = {} 1868 | if isinstance(x, list): 1869 | batch_size = x[0].shape[0] 1870 | elif isinstance(x, dict): 1871 | batch_size = list(x.values())[0].shape[0] 1872 | else: 1873 | batch_size = x.shape[0] 1874 | batch_logs['batch'] = batch_index 1875 | batch_logs['size'] = batch_size 1876 | callbacks.on_batch_begin(batch_index, batch_logs) 1877 | 1878 | outs = self.train_on_batch(x, y, 1879 | sample_weight=sample_weight, 1880 | class_weight=class_weight) 1881 | 1882 | if not isinstance(outs, list): 1883 | outs = [outs] 1884 | for l, o in zip(out_labels, outs): 1885 | batch_logs[l] = o 1886 | 1887 | callbacks.on_batch_end(batch_index, batch_logs) 1888 | 1889 | # Construct epoch logs. 1890 | epoch_logs = {} 1891 | batch_index += 1 1892 | steps_done += 1 1893 | 1894 | # Epoch finished. 1895 | if steps_done >= steps_per_epoch and do_validation: 1896 | if val_gen: 1897 | val_outs = self.evaluate_generator( 1898 | validation_data, 1899 | validation_steps, 1900 | max_queue_size=max_queue_size, 1901 | workers=workers, 1902 | use_multiprocessing=use_multiprocessing) 1903 | else: 1904 | # No need for try/except because 1905 | # data has already been validated. 1906 | val_outs = self.evaluate( 1907 | val_x, val_y, 1908 | batch_size=batch_size, 1909 | sample_weight=val_sample_weights, 1910 | verbose=0) 1911 | if not isinstance(val_outs, list): 1912 | val_outs = [val_outs] 1913 | # Same labels assumed. 1914 | for l, o in zip(out_labels, val_outs): 1915 | epoch_logs['val_' + l] = o 1916 | 1917 | callbacks.on_epoch_end(epoch, epoch_logs) 1918 | epoch += 1 1919 | if callback_model.stop_training: 1920 | break 1921 | 1922 | finally: 1923 | if enqueuer is not None: 1924 | enqueuer.stop() 1925 | 1926 | callbacks.on_train_end() 1927 | return self.history 1928 | 1929 | @interfaces.legacy_generator_methods_support 1930 | def evaluate_generator(self, generator, steps, 1931 | max_queue_size=10, 1932 | workers=1, 1933 | use_multiprocessing=False): 1934 | """Evaluates the model on a data generator. 1935 | 1936 | The generator should return the same kind of data 1937 | as accepted by `test_on_batch`. 
1938 | 1939 | # Arguments 1940 | generator: Generator yielding tuples (inputs, targets) 1941 | or (inputs, targets, sample_weights) 1942 | or an instance of Sequence (keras.utils.Sequence) 1943 | object in order to avoid duplicate data 1944 | when using multiprocessing. 1945 | steps: Total number of steps (batches of samples) 1946 | to yield from `generator` before stopping. 1947 | max_queue_size: maximum size for the generator queue 1948 | workers: maximum number of processes to spin up 1949 | when using process based threading 1950 | use_multiprocessing: if True, use process based threading. 1951 | Note that because 1952 | this implementation relies on multiprocessing, 1953 | you should not pass 1954 | non picklable arguments to the generator 1955 | as they can't be passed 1956 | easily to children processes. 1957 | 1958 | # Returns 1959 | Scalar test loss (if the model has a single output and no metrics) 1960 | or list of scalars (if the model has multiple outputs 1961 | and/or metrics). The attribute `model.metrics_names` will give you 1962 | the display labels for the scalar outputs. 1963 | 1964 | # Raises 1965 | ValueError: In case the generator yields 1966 | data in an invalid format. 1967 | """ 1968 | self._make_test_function() 1969 | 1970 | steps_done = 0 1971 | wait_time = 0.01 1972 | all_outs = [] 1973 | batch_sizes = [] 1974 | is_sequence = isinstance(generator, Sequence) 1975 | if not is_sequence and use_multiprocessing and workers > 1: 1976 | warnings.warn( 1977 | UserWarning('Using a generator with `use_multiprocessing=True`' 1978 | ' and multiple workers may duplicate your data.' 1979 | ' Please consider using the`keras.utils.Sequence' 1980 | ' class.')) 1981 | enqueuer = None 1982 | 1983 | try: 1984 | if is_sequence: 1985 | enqueuer = OrderedEnqueuer(generator, 1986 | use_multiprocessing=use_multiprocessing) 1987 | else: 1988 | enqueuer = GeneratorEnqueuer(generator, 1989 | use_multiprocessing=use_multiprocessing, 1990 | wait_time=wait_time) 1991 | enqueuer.start(workers=workers, max_queue_size=max_queue_size) 1992 | output_generator = enqueuer.get() 1993 | 1994 | while steps_done < steps: 1995 | generator_output = next(output_generator) 1996 | if not hasattr(generator_output, '__len__'): 1997 | raise ValueError('Output of generator should be a tuple ' 1998 | '(x, y, sample_weight) ' 1999 | 'or (x, y). Found: ' + 2000 | str(generator_output)) 2001 | if len(generator_output) == 2: 2002 | x, y = generator_output 2003 | sample_weight = None 2004 | elif len(generator_output) == 3: 2005 | x, y, sample_weight = generator_output 2006 | else: 2007 | raise ValueError('Output of generator should be a tuple ' 2008 | '(x, y, sample_weight) ' 2009 | 'or (x, y). Found: ' + 2010 | str(generator_output)) 2011 | outs = self.test_on_batch(x, y, sample_weight=sample_weight) 2012 | 2013 | if isinstance(x, list): 2014 | batch_size = len(x[0]) 2015 | elif isinstance(x, dict): 2016 | batch_size = len(list(x.values())[0]) 2017 | else: 2018 | batch_size = len(x) 2019 | if batch_size == 0: 2020 | raise ValueError('Received an empty batch. 
' 2021 | 'Batches should at least contain one item.') 2022 | all_outs.append(outs) 2023 | 2024 | steps_done += 1 2025 | batch_sizes.append(batch_size) 2026 | 2027 | finally: 2028 | if enqueuer is not None: 2029 | enqueuer.stop() 2030 | 2031 | if not isinstance(outs, list): 2032 | return np.average(np.asarray(all_outs), 2033 | weights=batch_sizes) 2034 | else: 2035 | averages = [] 2036 | for i in range(len(outs)): 2037 | averages.append(np.average([out[i] for out in all_outs], 2038 | weights=batch_sizes)) 2039 | return averages 2040 | 2041 | @interfaces.legacy_generator_methods_support 2042 | def predict_generator(self, generator, steps, 2043 | max_queue_size=10, 2044 | workers=1, 2045 | use_multiprocessing=False, 2046 | verbose=0): 2047 | """Generates predictions for the input samples from a data generator. 2048 | 2049 | The generator should return the same kind of data as accepted by 2050 | `predict_on_batch`. 2051 | 2052 | # Arguments 2053 | generator: Generator yielding batches of input samples 2054 | or an instance of Sequence (keras.utils.Sequence) 2055 | object in order to avoid duplicate data 2056 | when using multiprocessing. 2057 | steps: Total number of steps (batches of samples) 2058 | to yield from `generator` before stopping. 2059 | max_queue_size: Maximum size for the generator queue. 2060 | workers: Maximum number of processes to spin up 2061 | when using process based threading 2062 | use_multiprocessing: If `True`, use process based threading. 2063 | Note that because 2064 | this implementation relies on multiprocessing, 2065 | you should not pass 2066 | non picklable arguments to the generator 2067 | as they can't be passed 2068 | easily to children processes. 2069 | verbose: verbosity mode, 0 or 1. 2070 | 2071 | # Returns 2072 | Numpy array(s) of predictions. 2073 | 2074 | # Raises 2075 | ValueError: In case the generator yields 2076 | data in an invalid format. 2077 | """ 2078 | self._make_predict_function() 2079 | 2080 | steps_done = 0 2081 | wait_time = 0.01 2082 | all_outs = [] 2083 | is_sequence = isinstance(generator, Sequence) 2084 | if not is_sequence and use_multiprocessing and workers > 1: 2085 | warnings.warn( 2086 | UserWarning('Using a generator with `use_multiprocessing=True`' 2087 | ' and multiple workers may duplicate your data.' 2088 | ' Please consider using the`keras.utils.Sequence' 2089 | ' class.')) 2090 | enqueuer = None 2091 | 2092 | try: 2093 | if is_sequence: 2094 | enqueuer = OrderedEnqueuer(generator, 2095 | use_multiprocessing=use_multiprocessing) 2096 | else: 2097 | enqueuer = GeneratorEnqueuer(generator, 2098 | use_multiprocessing=use_multiprocessing, 2099 | wait_time=wait_time) 2100 | enqueuer.start(workers=workers, max_queue_size=max_queue_size) 2101 | output_generator = enqueuer.get() 2102 | 2103 | if verbose == 1: 2104 | progbar = Progbar(target=steps) 2105 | 2106 | while steps_done < steps: 2107 | generator_output = next(output_generator) 2108 | if isinstance(generator_output, tuple): 2109 | # Compatibility with the generators 2110 | # used for training. 2111 | if len(generator_output) == 2: 2112 | x, _ = generator_output 2113 | elif len(generator_output) == 3: 2114 | x, _, _ = generator_output 2115 | else: 2116 | raise ValueError('Output of generator should be ' 2117 | 'a tuple `(x, y, sample_weight)` ' 2118 | 'or `(x, y)`. Found: ' + 2119 | str(generator_output)) 2120 | else: 2121 | # Assumes a generator that only 2122 | # yields inputs (not targets and sample weights). 
2123 | x = generator_output 2124 | 2125 | outs = self.predict_on_batch(x) 2126 | if not isinstance(outs, list): 2127 | outs = [outs] 2128 | 2129 | if not all_outs: 2130 | for out in outs: 2131 | all_outs.append([]) 2132 | 2133 | for i, out in enumerate(outs): 2134 | all_outs[i].append(out) 2135 | steps_done += 1 2136 | if verbose == 1: 2137 | progbar.update(steps_done) 2138 | 2139 | finally: 2140 | if enqueuer is not None: 2141 | enqueuer.stop() 2142 | 2143 | if len(all_outs) == 1: 2144 | if steps_done == 1: 2145 | return all_outs[0][0] 2146 | else: 2147 | return np.concatenate(all_outs[0]) 2148 | if steps_done == 1: 2149 | return [out for out in all_outs] 2150 | else: 2151 | return [np.concatenate(out) for out in all_outs] 2152 | -------------------------------------------------------------------------------- /mycode/config.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | ''' 3 | config.py 4 | define file path as so forth. 5 | ''' 6 | 7 | import time 8 | 9 | get_current_time = lambda: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + '\t' 10 | 11 | DATAROOT = '/home/cxk/zhihucup/ieee_zhihu_cup_rowdata/' #this should be written as your own data root path 12 | 13 | # embedding files 14 | CHAR_EMBEDDING_DIR = DATAROOT + 'char_embedding.txt' 15 | WORD_EMBEDDING_DIR = DATAROOT + 'word_embedding.txt' 16 | 17 | # topic info 18 | TOPIC_INFO_DIR = DATAROOT + 'topic_info.txt' 19 | 20 | # train and eval text 21 | QUESTION_TRAIN_SET_DIR = DATAROOT + 'question_train_set.txt' 22 | QUESTION_EVAL_SET_DIR = DATAROOT + 'question_eval_set.txt' 23 | 24 | # traindata's label file 25 | QUESTION_TOPIC_TRAIN_DIR = DATAROOT + 'question_topic_train_set.txt' 26 | -------------------------------------------------------------------------------- /mycode/load_data.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | 3 | ''' 4 | loading data 5 | defining a class that process the loading function. 
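Typical usage, mirroring the call sequence in mycode/main.py (a sketch; './model_exp' is the save directory used there and should be adapted to your own setup):

    dl = data_loader('./model_exp')           # savedir for pickled tokenizers / pad lengths
    titlechar, titleword, dspchar, dspword, y_sparse = dl.load_train_data()
    dl.load_charembedding_matrix()            # fills dl.embedchar_matrix
    dl.load_wordembedding_matrix()            # fills dl.embedword_matrix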
6 | ''' 7 | 8 | import pickle 9 | import numpy as np 10 | from keras.preprocessing.sequence import pad_sequences 11 | from keras.preprocessing.text import Tokenizer 12 | from scipy.sparse import csr_matrix 13 | 14 | import config 15 | 16 | 17 | # params 18 | 19 | 20 | class data_loader(): 21 | def __init__(self, savedir=None): 22 | self.embedword_matrix = None 23 | self.embedword_matrix_git100 = None 24 | self.embedchar_matrix = None 25 | self.word_index = None 26 | self.char_index = None 27 | self.input_length = None 28 | self.topic_dict = {} 29 | self.topic_dict_inv = {} 30 | self.MAX_NB_WORDS = 500000 31 | self.tc_len = 180 32 | self.tw_len = 76 33 | self.max_titleword_len = 0 34 | self.max_titlechar_len = 0 35 | self.max_dspword_len = 0 36 | self.max_dspchar_len = 0 37 | self.dsppad_length = 300 38 | self.savedir = savedir 39 | 40 | def load_topic_info(self): 41 | ''' 42 | just get the topic ids 43 | :return: 44 | ''' 45 | print(config.get_current_time(), "loading topic info") 46 | with open(config.TOPIC_INFO_DIR, 'r') as f: 47 | for index, line in enumerate(f.readlines()): 48 | self.topic_dict[line.strip('\n').split('\t')[0]] = index 49 | self.topic_dict_inv[index] = line.strip('\n').split('\t')[0] 50 | 51 | def load_train_data(self): 52 | ''' 53 | title_char+title_word+dsp_char+dsp_word 54 | :param istitle: bool 55 | :param iscontent: bool 56 | :param type_kind: char or word 57 | :return: 58 | ''' 59 | 60 | title_char_list = [] 61 | title_word_list = [] 62 | dsp_char_list = [] 63 | dsp_word_list = [] 64 | question_ids = [] 65 | 66 | print(config.get_current_time(), 'loading question train set file') 67 | with open(config.QUESTION_TRAIN_SET_DIR, 'r') as f: 68 | for index, line in enumerate(f.readlines()): 69 | if index > 500: 70 | break 71 | splitted = line.strip('\n').split('\t') 72 | 73 | if len(splitted) == 1: 74 | continue 75 | elif len(splitted) == 2: 76 | continue 77 | elif len(splitted) == 5: 78 | title_char_list.append(splitted[1].replace(',', ' ')) 79 | title_word_list.append(splitted[2].replace(',', ' ')) 80 | dsp_char_list.append(splitted[3].replace(',', ' ')) 81 | dsp_word_list.append(splitted[4].replace(',', ' ')) 82 | self.max_titlechar_len = max(len(splitted[1].split(',')), self.max_titlechar_len) 83 | self.max_titleword_len = max(len(splitted[2].split(',')), self.max_titleword_len) 84 | self.max_dspchar_len = max(len(splitted[3].split(',')), self.max_dspchar_len) 85 | self.max_dspword_len = max(len(splitted[4].split(',')), self.max_dspword_len) 86 | question_ids.append(splitted[0]) 87 | else: 88 | continue 89 | 90 | # print('max titlecharlength', self.max_titlechar_len) 91 | # print('max titleword length', self.max_titleword_len) 92 | # print('max dspchar length', self.max_dspchar_len) 93 | # print('max dspword length', self.max_dspword_len) 94 | 95 | pickle.dump(self.tw_len, open(self.savedir + '/tw_len.pkl', 'wb')) 96 | pickle.dump(self.tc_len, open(self.savedir + '/tc_len.pkl', 'wb')) 97 | pickle.dump(self.dsppad_length, open(self.savedir + '/dsp_pad_length.pkl', 'wb')) 98 | 99 | # ------titleword-------- 100 | print(config.get_current_time(), 'tokenizer title word working') 101 | tokenizer_word = Tokenizer(num_words=self.MAX_NB_WORDS) 102 | tokenizer_word.fit_on_texts(title_word_list + dsp_word_list) 103 | sequences_titleword = tokenizer_word.texts_to_sequences(title_word_list) 104 | self.word_index = tokenizer_word.word_index 105 | print(config.get_current_time(), 'Found %s unique word tokens.' 
% len(self.word_index)) 106 | titleword_array = pad_sequences(sequences_titleword, maxlen=self.tw_len) # return arrays 107 | pickle.dump(tokenizer_word, open(self.savedir + '/tokenizer_word.pkl', 'wb')) 108 | print('tokenzier is saved as %s/tokenizer_word.pkl' % (self.savedir)) 109 | # -----titlechar--------- 110 | print(config.get_current_time(), 'tokenizer title char working') 111 | tokenizer_char = Tokenizer(num_words=self.MAX_NB_WORDS) 112 | tokenizer_char.fit_on_texts(title_char_list + dsp_char_list) 113 | sequences_titlechar = tokenizer_char.texts_to_sequences(title_char_list) 114 | self.char_index = tokenizer_char.word_index 115 | print(config.get_current_time(), 'Found %s unique char tokens.' % len(self.char_index)) 116 | titlechar_array = pad_sequences(sequences_titlechar, maxlen=self.tc_len) # return arrays 117 | pickle.dump(tokenizer_char, open(self.savedir + '/tokenizer_char.pkl', 'wb')) 118 | print('tokenzier is saved as %s/tokenizer_char.pkl' % (self.savedir)) 119 | # -----dspword-------- 120 | print(config.get_current_time(), 'tokenizer dsp char working') 121 | sequences_dspchar = tokenizer_char.texts_to_sequences(dsp_char_list) 122 | dspchar_array = pad_sequences(sequences_dspchar, maxlen=self.dsppad_length) # return arrays 123 | # ---dspchar--------- 124 | print(config.get_current_time(), 'tokenizer dsp word working') 125 | sequences_dspword = tokenizer_word.texts_to_sequences(dsp_word_list) 126 | dspword_array = pad_sequences(sequences_dspword, maxlen=self.dsppad_length) # return arrays 127 | 128 | self.load_topic_info() 129 | 130 | question_to_label = {} 131 | print(config.get_current_time(), 'loading train labels') 132 | with open(config.QUESTION_TOPIC_TRAIN_DIR, 'r') as f: 133 | for index, line in enumerate(f.readlines()): 134 | # if index>100000: 135 | # break 136 | splitted = line.strip('\n').split('\t') 137 | if len(splitted) != 2: 138 | print('error!') 139 | question_to_label[splitted[0]] = [self.topic_dict[i] for i in splitted[1].split(',')] 140 | 141 | print(config.get_current_time(), 'duiqi traindata and labels') 142 | 143 | row_ = [] 144 | col_ = [] 145 | count_1 = 0 146 | # label_dense = np.zeros((train_titleword_array.shape[0], 1999)) 147 | for row, quesid in enumerate(question_ids): 148 | cols = question_to_label.get(quesid) 149 | if cols is None: 150 | print('error!') 151 | count_1 += len(cols) 152 | for k in cols: 153 | row_.append(row) 154 | col_.extend(cols) 155 | 156 | data_ = [1 for i in row_] 157 | label_sparse = csr_matrix((data_, (row_, col_)), shape=(len(question_ids), 1999)) 158 | # # Shuffle data 159 | # shuffle_indices = np.random.permutation(np.arange(train_titleword_array.shape[0])) 160 | # x_word = train_titleword_array[shuffle_indices] 161 | # x_char = train_titlechar_array[shuffle_indices] 162 | # row_ = [row_[i] for i in shuffle_indices] 163 | # col_ = [col_[i] for i in shuffle_indices] 164 | # 165 | # # label_dense = label_dense[shuffle_indices] 166 | # # label_sparse = csr_matrix(([1 for i in range(count_1))],(row_,col_)),shape = ()) 167 | # 168 | # train_len = int(x_word.shape[0] * 0.9) 169 | # x_word_train = x_word[:train_len] 170 | # x_char_train = x_char[:train_len] 171 | # y_train = label_sparse[:train_len] 172 | # x_word_test = x_word[train_len:] 173 | # x_char_test = x_char[train_len:] 174 | # y_test = label_sparse[train_len:] 175 | 176 | # return (x_word_train, x_char_train, y_train, x_word_test, x_char_test, y_test) 177 | return titlechar_array, titleword_array, dspchar_array, dspword_array, label_sparse 178 | 179 | def 
load_pred_data_4part(self): 180 | ''' 181 | 182 | :return: 183 | ''' 184 | title_char_list = [] 185 | title_word_list = [] 186 | dsp_char_list = [] 187 | dsp_word_list = [] 188 | question_ids = [] 189 | 190 | self.tw_len = pickle.load(open(self.savedir + '/tw_len.pkl', 'rb')) 191 | self.tc_len = pickle.load(open(self.savedir + '/tc_len.pkl', 'rb')) 192 | self.dsppad_length = pickle.load(open(self.savedir + '/dsp_pad_length.pkl', 'rb')) 193 | print('length is loaded!') 194 | 195 | print(config.get_current_time(), 'loading question eval set file') 196 | with open(config.QUESTION_EVAL_SET_DIR, 'r') as f: 197 | for index, line in enumerate(f.readlines()): 198 | # if index>50000: 199 | # break 200 | splitted = line.strip('\n').split('\t') 201 | 202 | if len(splitted) == 1: 203 | print('error!') 204 | exit() 205 | elif len(splitted) == 2: 206 | title_char_list.append(splitted[1].replace(',', ' ')) 207 | title_word_list.append(" ") 208 | dsp_char_list.append(" ") 209 | dsp_word_list.append(" ") 210 | elif len(splitted) == 3: 211 | title_char_list.append(splitted[1].replace(',', ' ')) 212 | title_word_list.append(splitted[2].replace(',', ' ')) 213 | dsp_char_list.append(" ") 214 | dsp_word_list.append(" ") 215 | elif len(splitted) == 4: 216 | title_char_list.append(splitted[1].replace(',', ' ')) 217 | title_word_list.append(splitted[2].replace(',', ' ')) 218 | dsp_char_list.append(splitted[3].replace(',', ' ')) 219 | dsp_word_list.append(" ") 220 | elif len(splitted) == 5: 221 | title_char_list.append(splitted[1].replace(',', ' ')) 222 | title_word_list.append(splitted[2].replace(',', ' ')) 223 | dsp_char_list.append(splitted[3].replace(',', ' ')) 224 | dsp_word_list.append(splitted[4].replace(',', ' ')) 225 | 226 | question_ids.append(splitted[0]) 227 | 228 | tokenizer_word = pickle.load(open(self.savedir + '/tokenizer_word.pkl', 'rb')) 229 | tokenizer_char = pickle.load(open(self.savedir + '/tokenizer_char.pkl', 'rb')) 230 | print('tokenizer word loaded!') 231 | print("") 232 | 233 | print(config.get_current_time(), 'tokenizer working title char') 234 | titlechar_sequences_char = tokenizer_char.texts_to_sequences(title_char_list) 235 | self.char_index = tokenizer_char.word_index 236 | titlechar_array = pad_sequences(titlechar_sequences_char, maxlen=self.tc_len) # return arrays 237 | 238 | print(config.get_current_time(), 'tokenizer working title word') 239 | titleword_sequences_word = tokenizer_word.texts_to_sequences(title_word_list) 240 | self.word_index = tokenizer_word.word_index 241 | titleword_array = pad_sequences(titleword_sequences_word, maxlen=self.tw_len) # return arrays 242 | 243 | print(config.get_current_time(), 'tokenizer working dsp char') 244 | dspchar_sequences_char = tokenizer_char.texts_to_sequences(dsp_char_list) 245 | dspchar_array = pad_sequences(dspchar_sequences_char, maxlen=self.dsppad_length) # return arrays 246 | 247 | print(config.get_current_time(), 'tokenizer working dsp word') 248 | dspword_sequences_word = tokenizer_word.texts_to_sequences(dsp_word_list) 249 | dspword_array = pad_sequences(dspword_sequences_word, maxlen=self.dsppad_length) # return arrays 250 | 251 | self.load_topic_info() 252 | 253 | return titlechar_array, titleword_array, dspchar_array, dspword_array, question_ids 254 | 255 | def get_quesids(self): 256 | ''' 257 | 258 | :return: 259 | ''' 260 | question_ids = [] 261 | 262 | print(config.get_current_time(), 'loading question eval ids') 263 | with open(config.QUESTION_EVAL_SET_DIR, 'r') as f: 264 | for index, line in enumerate(f.readlines()): 
265 | splitted = line.strip('\n').split('\t') 266 | question_ids.append(splitted[0]) 267 | 268 | self.load_topic_info() 269 | return question_ids 270 | 271 | def load_wordembedding_matrix(self): 272 | 273 | embeddings_index = dict() 274 | 275 | embedding_max_value = 0 276 | embedding_min_value = 1 277 | 278 | with open(config.WORD_EMBEDDING_DIR, 'r') as f: 279 | for line in f: 280 | line = line.strip().split(' ') 281 | if len(line) != 257: 282 | continue 283 | 284 | coefs = np.asarray(line[1:], dtype='float32') 285 | 286 | if np.max(coefs) > embedding_max_value: 287 | embedding_max_value = np.max(coefs) 288 | if np.min(coefs) < embedding_min_value: 289 | embedding_min_value = np.min(coefs) 290 | 291 | embeddings_index[line[0]] = coefs 292 | 293 | print(config.get_current_time(), ('Found %s word vectors.' % len(embeddings_index))) 294 | 295 | self.embedword_matrix = np.zeros((len(self.word_index) + 1, 256)) 296 | for word, i in self.word_index.items(): 297 | embedding_vector = embeddings_index.get(word) 298 | if embedding_vector is not None: 299 | # words not found in embedding index will be all-zeros. 300 | self.embedword_matrix[i] = embedding_vector 301 | else: 302 | self.embedword_matrix[i] = np.random.uniform(low=embedding_min_value, high=embedding_max_value, 303 | size=256) 304 | 305 | def load_charembedding_matrix(self): 306 | 307 | embeddings_index = dict() 308 | 309 | embedding_max_value = 0 310 | embedding_min_value = 1 311 | 312 | with open(config.CHAR_EMBEDDING_DIR, 'r') as f: 313 | for line in f: 314 | line = line.strip().split(' ') 315 | if len(line) != 257: 316 | continue 317 | 318 | coefs = np.asarray(line[1:], dtype='float32') 319 | 320 | if np.max(coefs) > embedding_max_value: 321 | embedding_max_value = np.max(coefs) 322 | if np.min(coefs) < embedding_min_value: 323 | embedding_min_value = np.min(coefs) 324 | 325 | embeddings_index[line[0]] = coefs 326 | 327 | print(config.get_current_time(), ('Found %s char vectors.' % len(embeddings_index))) 328 | 329 | self.embedchar_matrix = np.zeros((len(self.char_index) + 1, 256)) 330 | for word, i in self.char_index.items(): 331 | embedding_vector = embeddings_index.get(word) 332 | if embedding_vector is not None: 333 | # words not found in embedding index will be all-zeros. 334 | self.embedchar_matrix[i] = embedding_vector 335 | else: 336 | self.embedchar_matrix[i] = np.random.uniform(low=embedding_min_value, high=embedding_max_value, 337 | size=256) 338 | 339 | -------------------------------------------------------------------------------- /mycode/main.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | from model import * 3 | import config 4 | from load_data import * 5 | import sys 6 | 7 | def results_weight_sum(filenames, topic_dict_inv, ques_ids, weights): 8 | ''' 9 | ensemble model by weighting sum 10 | :param filenames: 11 | :param topic_dict_inv: 12 | :param ques_ids: 13 | :param weights: 14 | :return: 15 | ''' 16 | assert len(weights) == len(filenames) 17 | import time 18 | import numpy as np 19 | cur_time = time.strftime('%Y-%m-%d-%H-%M', time.localtime(time.time())) 20 | from collections import Counter 21 | 22 | def tmpfunc(x): 23 | if len(x) > 5: 24 | c = Counter(x).most_common(5) 25 | res = [] 26 | for num, count in c: 27 | res.append(topic_dict_inv[num]) 28 | else: 29 | res = [] 30 | for i in x: 31 | res.append(topic_dict_inv[i]) 32 | 33 | return res 34 | 35 | predlabels = [] 36 | 37 | for i in range(len(filenames)): 38 | print('process %d th...' 
% (i)) 39 | predlabel = np.load(filenames[i]) 40 | if len(predlabels) == 0: 41 | predlabels = predlabel * weights[0] 42 | else: 43 | predlabels = predlabels + predlabel * weights[i] 44 | print(predlabels.shape) 45 | 46 | predlabels = np.argsort(-predlabels)[:, :5] 47 | 48 | with open("final_423.csv", 'w') as f: 49 | for i in range(predlabels.shape[0]): 50 | # f.write(ques_ids[i] + "," + ','.join([topic_dict_inv[k] for k in predlabels[i]]) + '\n') 51 | f.write(ques_ids[i] + "," + ','.join(tmpfunc(predlabels[i])) + '\n') 52 | 53 | 54 | if __name__ == '__main__': 55 | 56 | if len(sys.argv) < 2: 57 | print('error, give me mode ') 58 | exit() 59 | 60 | mode = sys.argv[1] 61 | 62 | print(config.get_current_time(), 'current mode:', mode) 63 | 64 | if mode == "train": 65 | 66 | save_root_dir = './model_exp' #your own path, to save models,tokenizers... 67 | 68 | dl = data_loader(save_root_dir) 69 | datatuple = dl.load_train_data() 70 | dl.load_charembedding_matrix() 71 | dl.load_wordembedding_matrix() 72 | 73 | mymodel = MultiModel(w_embed_matrix=dl.embedword_matrix, c_embed_matrix=dl.embedchar_matrix, 74 | word_index=dl.word_index, char_index=dl.char_index, titlechar_length=dl.tc_len, 75 | titleword_length=dl.tw_len, dsp_padlen=dl.dsppad_length, data=datatuple, 76 | savedir=save_root_dir) 77 | mymodel.trainmodel(isalldata=True) 78 | 79 | if mode == "pred": 80 | save_root_dir = './model_4rcnn_att_titledsp_lstm512_lr0_0001_nofc_alldata' #your own model path 81 | dl = data_loader(save_root_dir) 82 | datatuple = dl.load_pred_data_4part() 83 | 84 | mymodel = MultiModel() 85 | mymodel.predmodel([save_root_dir + "/2017-08-10-12-20_model-06.hdf5"], datatuple=datatuple, 86 | topic_dict_inv=dl.topic_dict_inv) 87 | 88 | -------------------------------------------------------------------------------- /mycode/model.py: -------------------------------------------------------------------------------- 1 | # encoding: utf-8 2 | 3 | import math 4 | import sys 5 | 6 | import config 7 | import numpy as np 8 | import tensorflow as tf 9 | from keras import backend as K 10 | from keras.backend.tensorflow_backend import set_session 11 | from keras.layers import Embedding, merge, Reshape, Activation, RepeatVector, Permute, Lambda, GlobalMaxPool1D, \ 12 | concatenate 13 | from keras import initializers 14 | from keras import optimizers 15 | from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback 16 | from keras.layers import Dense, Conv1D, MaxPooling1D, Input, Flatten, Dropout, Concatenate, LSTM, Bidirectional, GRU 17 | from keras.metrics import categorical_accuracy 18 | from keras.models import Model 19 | from keras.models import load_model 20 | from keras.layers.normalization import BatchNormalization 21 | 22 | from load_data import data_loader 23 | 24 | 25 | class ZHIHUMetrics(Callback): 26 | ''' 27 | ZHIHU score method 28 | ''' 29 | 30 | def on_epoch_end(self, batch, logs={}): 31 | print('') 32 | y_pred = np.asarray(self.model.predict( 33 | [self.validation_data[0], self.validation_data[1], self.validation_data[2], self.validation_data[3]])) 34 | y_true = self.validation_data[4] 35 | # y_pred = np.asarray(self.model.predict([self.validation_data[0], self.validation_data[1]])) 36 | # y_true = self.validation_data[2] 37 | # y_pred = np.asarray(self.model.predict([self.validation_data[0]])) 38 | # y_true = self.validation_data[1] 39 | 40 | print(y_pred.shape, y_true.shape) 41 | 42 | y_pred = np.argsort(-y_pred)[:, :5] 43 | 44 | y_true_list = [] 45 | for i in range(y_pred.shape[0]): 46 | y_true_list.append([]) 
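        # The block below computes the competition-style score reported during validation:
        #   recall    R = (# correctly predicted labels) / (# ground-truth labels)
        #   precision P = sum over pos = 0..4 of
        #                 (# correct predictions at position pos / # samples) / ln(2 + pos)
        #   final score = P * R / (P + R), a position-discounted, F1-like measure.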
47 | 48 | nozero_row, nozero_col = np.nonzero(y_true) 49 | 50 | for i in range(len(nozero_row)): 51 | y_true_list[nozero_row[i]].append(nozero_col[i]) 52 | 53 | right_label_num = 0 54 | right_label_at_pos_num = [0, 0, 0, 0, 0] 55 | sample_num = 0 56 | all_marked_label_num = 0 57 | 58 | for i in range(len(y_true_list)): 59 | sample_num += 1 60 | marked_label_set = set(y_true_list[i]) 61 | all_marked_label_num += len(marked_label_set) 62 | for pos, label in zip(range(0, min(len(y_pred[i]), 5)), y_pred[i]): 63 | if label in marked_label_set: 64 | right_label_num += 1 65 | right_label_at_pos_num[pos] += 1 66 | 67 | precision = 0.0 68 | for pos, right_num in zip(range(0, 5), right_label_at_pos_num): 69 | precision += ((right_num / float(sample_num))) / math.log(2.0 + pos) 70 | recall = float(right_label_num) / all_marked_label_num 71 | 72 | print('Recall:', recall) 73 | print(' Precision:', precision) 74 | print(' res:', recall * precision / (recall + precision + 0.00000000000001)) 75 | print('') 76 | 77 | 78 | class MultiModel(): 79 | def __init__(self, w_embed_matrix=None, c_embed_matrix=None, word_index=None, char_index=None, 80 | titlechar_length=None, titleword_length=None, dsp_padlen=None, data=(0, 0, 0), savedir=None): 81 | # Model Hyperparameters 82 | self.hidden_dims = 512 83 | self.EMBEDDING_DIM = 256 84 | # Training parameters 85 | self.batch_size = 128 86 | self.num_epochs = 50 87 | 88 | self.w_embed = w_embed_matrix 89 | self.c_embed = c_embed_matrix 90 | self.word_index = word_index 91 | self.char_index = char_index 92 | self.titlechar_length = titlechar_length 93 | self.titleword_length = titleword_length 94 | self.dsp_padlen = dsp_padlen 95 | 96 | # data 97 | if len(data) == 5: 98 | self.titlechar_array, self.titleword_array, self.dspchar_array, self.dspword_array, self.y = data 99 | 100 | self.savedir = savedir 101 | 102 | self.model = None 103 | 104 | def buildmodel_rcnn4_att_titledsp(self): 105 | ''' 106 | 4 RCNN 107 | v2: 4model concat+dense1999 108 | (tw concat tc) + (dw concat dc) 109 | lstm256+lr0.001 :3epoch 0.401 0.9data 110 | lstm512+lr0.0005 :2epoch 0.410 alldata 2,3,4 epoch vote 0.414 with dp 111 | 112 | :return: 113 | ''' 114 | print('building model...') 115 | 116 | # -----titlechar------ 117 | with tf.device('/cpu:%d' % (0)): 118 | tc_embedding_layer = Embedding(len(self.char_index) + 1, 119 | self.EMBEDDING_DIM, 120 | weights=[self.c_embed], 121 | input_length=self.titlechar_length, trainable=True, 122 | embeddings_initializer=initializers.RandomUniform(minval=-0.2, maxval=0.2, 123 | seed=None)) 124 | tc_sequence_input = Input(shape=(self.titlechar_length,), name="titlechar_input") 125 | tc_embedded_sequences = tc_embedding_layer(tc_sequence_input) 126 | with tf.device('/gpu:%d' % (0)): 127 | tc_z_pos = LSTM(512, implementation=2, return_sequences=True, go_backwards=False)(tc_embedded_sequences) 128 | tc_z_neg = LSTM(512, implementation=2, return_sequences=True, go_backwards=True)(tc_embedded_sequences) 129 | tc_z_concat = merge([tc_z_pos, tc_embedded_sequences, tc_z_neg], mode='concat', concat_axis=-1) 130 | 131 | tc_z = Dense(512, activation='tanh')(tc_z_concat) 132 | tc_pool_rnn = Lambda(lambda x: K.max(x, axis=1), output_shape=(512,))(tc_z) 133 | # -----titleword------ 134 | with tf.device('/cpu:%d' % (1)): 135 | tw_embedding_layer = Embedding(len(self.word_index) + 1, 136 | self.EMBEDDING_DIM, 137 | weights=[self.w_embed], 138 | input_length=self.titleword_length, trainable=True, 139 | embeddings_initializer=initializers.RandomUniform(minval=-0.2, maxval=0.2, 
140 | seed=None)) 141 | tw_sequence_input = Input(shape=(self.titleword_length,), name="titleword_input") 142 | tw_embedded_sequences = tw_embedding_layer(tw_sequence_input) 143 | with tf.device('/gpu:%d' % (0)): 144 | tw_z_pos = LSTM(512, implementation=2, return_sequences=True, go_backwards=False)(tw_embedded_sequences) 145 | tw_z_neg = LSTM(512, implementation=2, return_sequences=True, go_backwards=True)(tw_embedded_sequences) 146 | tw_z_concat = merge([tw_z_pos, tw_embedded_sequences, tw_z_neg], mode='concat', concat_axis=-1) 147 | 148 | tw_z = Dense(512, activation='tanh')(tw_z_concat) 149 | tw_pool_rnn = Lambda(lambda x: K.max(x, axis=1), output_shape=(512,))(tw_z) 150 | # -----dspchar------ 151 | with tf.device('/cpu:%d' % (2)): 152 | dc_embedding_layer = Embedding(len(self.char_index) + 1, 153 | self.EMBEDDING_DIM, 154 | weights=[self.c_embed], 155 | input_length=self.dsp_padlen, trainable=True, 156 | embeddings_initializer=initializers.RandomUniform(minval=-0.2, maxval=0.2, 157 | seed=None)) 158 | dc_sequence_input = Input(shape=(self.dsp_padlen,), name="dspchar_input") 159 | dc_embedded_sequences = dc_embedding_layer(dc_sequence_input) 160 | with tf.device('/gpu:%d' % (1)): 161 | dc_z_pos = LSTM(512, implementation=2, return_sequences=True, go_backwards=False)(dc_embedded_sequences) 162 | dc_z_neg = LSTM(512, implementation=2, return_sequences=True, go_backwards=True)(dc_embedded_sequences) 163 | dc_z_concat = merge([dc_z_pos, dc_embedded_sequences, dc_z_neg], mode='concat', concat_axis=-1) 164 | dc_z = Dense(512, activation='tanh')(dc_z_concat) 165 | dc_pool_rnn = Lambda(lambda x: K.max(x, axis=1), output_shape=(512,))(dc_z) 166 | # -----dspword------ 167 | with tf.device('/cpu:%d' % (3)): 168 | dw_embedding_layer = Embedding(len(self.word_index) + 1, 169 | self.EMBEDDING_DIM, 170 | weights=[self.w_embed], 171 | input_length=self.dsp_padlen, trainable=True, 172 | embeddings_initializer=initializers.RandomUniform(minval=-0.2, maxval=0.2, 173 | seed=None)) 174 | dw_sequence_input = Input(shape=(self.dsp_padlen,), name="dspword_input") 175 | dw_embedded_sequences = dw_embedding_layer(dw_sequence_input) 176 | with tf.device('/gpu:%d' % (1)): 177 | dw_z_pos = LSTM(512, implementation=2, return_sequences=True, go_backwards=False)(dw_embedded_sequences) 178 | dw_z_neg = LSTM(512, implementation=2, return_sequences=True, go_backwards=True)(dw_embedded_sequences) 179 | dw_z_concat = merge([dw_z_pos, dw_embedded_sequences, dw_z_neg], mode='concat', concat_axis=-1) 180 | 181 | dw_z = Dense(512, activation='tanh')(dw_z_concat) 182 | dw_pool_rnn = Lambda(lambda x: K.max(x, axis=1), output_shape=(512,))(dw_z) 183 | 184 | # ------att---------- 185 | concat_w_c = merge([tc_pool_rnn, tw_pool_rnn, dc_pool_rnn, dw_pool_rnn], mode='concat') 186 | concat_w_c = Reshape((2, 512 * 2))(concat_w_c) 187 | 188 | attention = Dense(1, activation='tanh')(concat_w_c) 189 | attention = Flatten()(attention) 190 | attention = Activation('softmax')(attention) 191 | attention = RepeatVector(512 * 2)(attention) 192 | attention = Permute([2, 1])(attention) 193 | 194 | sent_representation = merge([concat_w_c, attention], mode='mul') 195 | sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(512 * 2,))(sent_representation) 196 | # --------merge_4models------------------ 197 | model_final_ = Dense(1999, activation='relu')(sent_representation) 198 | model_final_ = Dropout(0.5)(model_final_) 199 | model_final = Dense(1999, activation='softmax')(model_final_) 200 | 201 | self.model = 
Model(input=[tc_sequence_input, tw_sequence_input, dc_sequence_input, dw_sequence_input], 202 | outputs=model_final) 203 | adam = optimizers.adam(lr=0.0001) 204 | self.model.compile(loss='categorical_crossentropy', 205 | optimizer=adam, 206 | metrics=[categorical_accuracy]) 207 | print(self.model.summary()) 208 | 209 | def trainmodel(self, isalldata): 210 | 211 | self.buildmodel_rcnn4_att_titledsp() 212 | 213 | import time 214 | cur_time = time.strftime('%Y-%m-%d-%H-%M', time.localtime(time.time())) 215 | 216 | checkpointer = ModelCheckpoint(filepath=self.savedir + "/" + cur_time + "_model-{epoch:02d}.hdf5", period=1) 217 | zhihuMetrics = ZHIHUMetrics() 218 | 219 | if isalldata: 220 | self.model.fit([self.titlechar_array, self.titleword_array, self.dspchar_array, self.dspword_array], 221 | self.y, 222 | epochs=self.num_epochs, batch_size=self.batch_size, verbose=1, 223 | callbacks=[checkpointer]) 224 | else:#with 9:1 validation 225 | self.model.fit([self.titlechar_array, self.titleword_array, self.dspchar_array, self.dspword_array], 226 | self.y, 227 | validation_split=0.1, 228 | epochs=self.num_epochs, batch_size=self.batch_size, verbose=1, 229 | callbacks=[checkpointer, zhihuMetrics]) 230 | self.save_model() 231 | 232 | def predmodel(self, modelname, datatuple, topic_dict_inv): 233 | 234 | import time 235 | cur_time = time.strftime('%Y-%m-%d-%H-%M', time.localtime(time.time())) 236 | from collections import Counter 237 | 238 | def tmpfunc(x): 239 | if len(x) > 5: 240 | c = Counter(x).most_common(5) 241 | res = [] 242 | for num, count in c: 243 | res.append(topic_dict_inv[num]) 244 | else: 245 | res = [] 246 | for i in x: 247 | res.append(topic_dict_inv[i]) 248 | 249 | return res 250 | 251 | predlabels = [] 252 | # titleword_array, dspword_array, ques_ids= datatuple 253 | titlechar_array, titleword_array, dspchar_array, dspword_array, ques_ids = datatuple 254 | 255 | for i in range(len(modelname)): 256 | self.model = load_model(modelname[i]) 257 | predlabel = self.model.predict([titlechar_array, titleword_array, dspchar_array, dspword_array], 258 | batch_size=512, verbose=1) 259 | # predlabel = self.model.predict([titleword_array, titleword_array, dspword_array, dspword_array], batch_size=512, verbose=1) 260 | # np.savetxt("result/scores/"+cur_time + "scores_4RCNN_gru_dense_nodropout.txt", predlabel, fmt='%s') 261 | np.save("result/scores/" + cur_time + "4RCNN_lstm512_4part_title_dsp_attention_nofc_06epoch", predlabel) 262 | # exit() 263 | predlabel = np.argsort(-predlabel)[:, :5] 264 | if len(predlabels) == 0: 265 | predlabels = predlabel 266 | else: 267 | predlabels = np.column_stack((predlabels, predlabel)) 268 | print(predlabels.shape) 269 | K.clear_session() 270 | 271 | with open("result/" + cur_time + ".csv", 'w') as f: 272 | for i in range(predlabels.shape[0]): 273 | # f.write(ques_ids[i] + "," + ','.join([topic_dict_inv[k] for k in predlabels[i]]) + '\n') 274 | f.write(ques_ids[i] + "," + ','.join(tmpfunc(predlabels[i])) + '\n') 275 | 276 | def save_model(self): 277 | import time 278 | cur_time = time.strftime('%Y-%m-%d-%H', time.localtime(time.time())) 279 | self.model.save(self.savedir + "/latest_twomodel_wordchar_" + str(cur_time) + '.h5') 280 | -------------------------------------------------------------------------------- /数据集说明.txt: -------------------------------------------------------------------------------- 1 | 数据集 2 | 3 | 名称 格式 4 | ieee_zhihu_cup.des3 des3 (1.4 GB) 5 | ieee_zhihu_cup.rar rar (1.2 GB) 6 | 7 | 8 | 9 | 
Note that the two links above contain identical data; please download only one of them. ieee_zhihu_cup.des3 is the archive for Linux/Mac; decrypt and extract it with: dd if=ieee_zhihu_cup.des3 | openssl des3 -d -k Pg5EnkJP7iYyRBt5 | tar zxf - . ieee_zhihu_cup.rar is the rar archive for Windows.
10 |
11 | To protect user privacy, all raw text has been specially encoded. The original question titles, question descriptions, topic names and topic descriptions are encoded as character-ID sequences and word-ID sequences. Characters include single Chinese characters, other CJK characters, English letters, punctuation and spaces; words include segmented Chinese words, English words, punctuation and spaces. Character IDs and word IDs live in two separate namespaces: a single-character word or punctuation mark on the word side does not necessarily share an ID with the same character or punctuation mark on the character side.
12 |
13 | char_embedding.txt and word_embedding.txt
14 |
15 | These files contain the 256-dimensional character-level and word-level embedding vectors, respectively. Both were trained with Google word2vec and saved in txt format, so they can be loaded directly with word2vec. The format is as follows:
16 |
17 | The first line contains two numbers: the vocabulary size and the embedding dimension;
18 |
19 | Every other line has 257 columns: the first column is a char_id or word_id, followed by 256 floats representing the 256-dimensional embedding vector.
20 |
21 | Characters and words that occur fewer than 5 times are omitted from the vocabulary, so some tokens appearing in the training and evaluation corpora may have no corresponding embedding vector.
22 |
23 | question_train_set.txt
24 |
25 | Question information in the training set; 5 columns separated by \t. Format:
26 |
27 | question_id ct1,ct2,ct3,...,ctn wt1,wt2,wt3,...,wtn cd1,cd2,cd3,...cdn wd1,wd2,wd3,...,wdn
28 |
29 | The 2nd column is the character-ID sequence of the title; the 3rd column is the word-ID sequence of the title; the 4th column is the character-ID sequence of the description; the 5th column is the word-ID sequence of the description.
30 |
31 | question_topic_train_set.txt
32 |
33 | The bindings between questions and topic labels. Two columns separated by \t. Note that when a question is bound to multiple topic labels, the labels are unordered. Format:
34 |
35 | question_id topic_id1,topic_id2...topic_idn
36 |
37 | topic_info.txt:
38 |
39 | Topic information; 6 columns separated by \t. Format:
40 |
41 | topic_id pid_1,pid_2,...,pidn cn1,cn2,cn3,...,cnn wn1,wn2,wn3,...,wnn cd1,cd2,cd3,...,cdn wd1,wd2,wd3,...,wdn
42 |
43 | The 2nd column lists the IDs of the topic's parent topics. Topics form a directed acyclic graph, and a topic may have zero or more parents. The 3rd column is the character-ID sequence of the topic name; the 4th column is its word-ID sequence; the 5th column is the character-ID sequence of the topic description; the 6th column is its word-ID sequence.
44 |
45 | question_eval_set.txt
46 |
47 | This file has the same format as question_train_set.txt.
48 |
--------------------------------------------------------------------------------
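The column layouts described above translate directly into a few lines of parsing code. The snippet below is a minimal sketch, not part of the original repository: it assumes the files have been extracted into `DATAROOT` (cf. mycode/config.py) and only illustrates the formats of word_embedding.txt, question_train_set.txt and question_topic_train_set.txt.

```python
# Minimal parsing sketch for the dataset files described above (illustrative only).
DATAROOT = './'  # assumption: directory holding the extracted dataset

# word_embedding.txt: first line is "<vocab_size> <dim>"; every other line has
# 257 columns -- a word_id followed by 256 floats.
embeddings = {}
with open(DATAROOT + 'word_embedding.txt') as f:
    vocab_size, dim = map(int, f.readline().split())
    for line in f:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:   # skip malformed rows, as load_data.py also does
            continue
        embeddings[parts[0]] = [float(v) for v in parts[1:]]

# question_train_set.txt: 5 tab-separated columns --
# question_id, title char ids, title word ids, description char ids, description word ids.
title_words = {}
with open(DATAROOT + 'question_train_set.txt') as f:
    for line in f:
        cols = line.rstrip('\n').split('\t')
        if len(cols) == 5:
            qid, t_chars, t_words, d_chars, d_words = cols
            title_words[qid] = t_words.split(',')

# question_topic_train_set.txt: question_id \t comma-separated (unordered) topic ids.
labels = {}
with open(DATAROOT + 'question_topic_train_set.txt') as f:
    for line in f:
        qid, topics = line.rstrip('\n').split('\t')
        labels[qid] = topics.split(',')
```

Since the embedding files are in word2vec text format, they can alternatively be loaded with gensim's `KeyedVectors.load_word2vec_format` (an external dependency not used by this repository).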