├── .gitignore ├── LICENSE ├── README.md ├── app.py └── nlp ├── __init__.py ├── embed_blocks.py ├── encode_blocks.py ├── match_blocks.py ├── mulitask_blocks.py ├── nn.py └── pool_blocks.py /.gitignore: -------------------------------------------------------------------------------- 1 | models/ 2 | preprocessed/ 3 | raw/ 4 | *.pyc 5 | vocab.search 6 | local 7 | .idea/ 8 | add_copyright.py 9 | copyright 10 | .DS_Store 11 | data/vocab 12 | data/results/ 13 | _job*.yaml 14 | toy*.py 15 | tmp.json 16 | summary.json 17 | data/* 18 | .*.yaml -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Han Xiao 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # tf-nlp-blocks 2 | [![Python: 3.6](https://img.shields.io/badge/Python-3.6-brightgreen.svg)](https://opensource.org/licenses/MIT) [![Tensorflow: 1.6](https://img.shields.io/badge/Tensorflow-1.6-brightgreen.svg)](https://opensource.org/licenses/MIT) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 3 | 4 | 5 | Author: Han Xiao https://hanxiao.github.io 6 | 7 | 8 | A collection of frequently-used deep learning blocks I have implemented in Tensorflow. It covers the core tasks in NLP such as embedding, encoding, matching and pooling. All implementations follow a modularized design pattern which I called the "block-design". More details can be found [in my blog post](https://hanxiao.github.io/2018/06/25/4-Encoding-Blocks-You-Need-to-Know-Besides-LSTM-RNN-in-Tensorflow/). 9 | 10 | * [Requirements](#requirements) 11 | * [Contents](#contents) 12 | + [`encode_blocks.py`](#encode_blockspy) 13 | + [`match_blocks.py`](#match_blockspy) 14 | + [`pool_blocks.py`](#pool_blockspy) 15 | + [`embed_blocks.py`](#embed_blockspy) 16 | + [`mulitask_blocks.py`](#mulitask_blockspy) 17 | + [`nn.py`](#nnpy) 18 | * [Run](#run) 19 | 20 | 21 | ## Requirements 22 | 23 | - Python >= 3.6 24 | - Tensorflow >= 1.6 25 | 26 | ## Contents 27 | 28 | ### `encode_blocks.py` 29 | A collection of sequence encoding blocks. 
Input is a sequence with shape `[B, L, D]`; output is another sequence with shape `[B, L, D']`, where `B` is the batch size, `L` is the sequence length, and `D` and `D'` are the dimensions. 30 | 31 | | Name | Dependencies| Description | Reference | 32 | | --- | --- |--- |--- | 33 | | `LSTM_encode`| | a fast multi-layer bidirectional LSTM implementation based on [`CudnnLSTM`](https://www.tensorflow.org/api_docs/python/tf/contrib/cudnn_rnn/CudnnLSTM#call). Expect it to be 5~10x faster than the standard tf `LSTMCell`. However, it can only run on GPU. | [Tensorflow doc on `CudnnLSTM`](https://www.tensorflow.org/api_docs/python/tf/contrib/cudnn_rnn/CudnnLSTM#call)| 34 | | `TCN_encode` | `Res_DualCNN_encode`| a temporal convolution network described in the paper, essentially a multi-layer dilated CNN with special padding to ensure causality| [An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling](https://arxiv.org/pdf/1803.01271)| 35 | | `Res_DualCNN_encode` |`CNN_encode`| a sub-block used by `TCN_encode`. It is a two-layer CNN with spatial dropout in between, followed by a residual connection and a layer-norm.| [An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling](https://arxiv.org/pdf/1803.01271)| 36 | | `CNN_encode` | | a standard `conv1d` implementation on the `L` axis, with the possibility to set different paddings | [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)| 37 | 38 | ### `match_blocks.py` 39 | A collection of sequence matching blocks, a.k.a. attention. Inputs are two sequences: `context` in the shape of `[B, L_c, D]` and `query` in the shape of `[B, L_q, D]`. The output is a sequence of the same length as `context`, i.e. with shape `[B, L_c, D]`. Each position in the output encodes the relevance of that position in `context` to the complete `query`. 40 | 41 | | Name | Dependencies | Description | Reference | 42 | | --- | --- |--- |--- | 43 | |`Attentive_match`| |basic attention mechanism with different scoring functions, also supports future blinding.| `additive`: [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473); `scaled`: [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)| 44 | |`Transformer_match`| |a multi-head attention block from ["Attention is all you need"](https://arxiv.org/pdf/1706.03762.pdf)| [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)| 45 | |`AttentiveCNN_match`| `Attentive_match`|the light version of attentive convolution, with the possibility of future blinding to ensure causality. | [Attentive Convolution](https://arxiv.org/pdf/1710.00519)| 46 | |`BiDaf_match`| |the attention flow layer used in the BiDAF model. | [Bidirectional Attention Flow for Machine Comprehension](https://arxiv.org/abs/1611.01603)| 47 | 48 | ### `pool_blocks.py` 49 | A collection of pooling blocks. They fuse/reduce along the time axis `L`. Input is a sequence with shape `[B, L, D]`; output is in `[B, D]`. 50 | 51 | | Name | Dependencies | Description | Reference | 52 | | --- | --- |--- |--- | 53 | |`SWEM_pool`| | pools over the input sequence; supports max/avg. pooling and hierarchical avg.-max pooling. | [Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms](https://arxiv.org/abs/1805.09843) | 54 | 55 | There are also some convolution-based pooling blocks built on `SWEM_pool`, but they are for experimental purposes.
Thus, I will not list them here. 56 | 57 | ### `embed_blocks.py` 58 | A collection of positional encoding blocks for sequences. 59 | 60 | | Name | Dependencies | Description | Reference | 61 | | --- | --- |--- |--- | 62 | |`SinusPositional_embed`| | generates a sinusoid signal that has the same length as the input sequence | [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)| 63 | |`Positional_embed`| |parameterizes the absolute positions of the tokens in the input sequence | [A Convolutional Encoder Model for Neural Machine Translation](https://arxiv.org/pdf/1611.02344.pdf)| 64 | 65 | ### `mulitask_blocks.py` 66 | A collection of multi-task learning blocks. So far only the "cross-stitch block" is available. 67 | 68 | | Name | Dependencies | Description | Reference | 69 | | --- | --- |--- |--- | 70 | |`CrossStitch`||a cross-stitch block, modeling the correlation & self-correlation of two tasks| [Cross-stitch Networks for Multi-task Learning](https://arxiv.org/pdf/1604.03539)| 71 | |`Stack_CrossStitch`|`CrossStitch`|stacking multiple cross-stitch blocks together with shared/separated input| [Cross-stitch Networks for Multi-task Learning](https://arxiv.org/pdf/1604.03539)| 72 | 73 | 74 | ### `nn.py` 75 | A collection of auxiliary functions, e.g. masking, normalizing, slicing. 76 | 77 | 78 | ## Run 79 | 80 | Run `app.py` for a simple test on toy data. -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | from nlp.match_blocks import AttentiveCNN_match 9 | 10 | if __name__ == '__main__': 11 | # build two batch sequences 12 | # context batch has 2 seqs, with length 4 and 3, padded to 5 13 | # query batch has 2 seqs, with length 2 and 3, padded to 3 14 | # dimension of each symbol is 4 15 | context = tf.constant(np.random.random([2, 5, 4]), dtype=tf.float32) # B,Lc,D 16 | query = tf.constant(np.random.random([2, 3, 4]), dtype=tf.float32) # B,Lq,D 17 | # build mask 18 | context_mask = tf.sequence_mask([4, 3], 5, dtype=tf.float32, name='context_mask') 19 | query_mask = tf.sequence_mask([2, 3], 3, dtype=tf.float32, name='query_mask') 20 | 21 | # standard att-cnn 22 | output1 = AttentiveCNN_match(context, query, context_mask, query_mask, scope='attcnn1') 23 | # with res and layernorm 24 | output2 = AttentiveCNN_match(context, query, context_mask, query_mask, 25 | residual=True, normalize_output=True, scope='attcnn2') 26 | # ensure causality 27 | output3 = AttentiveCNN_match(context, query, context_mask, query_mask, 28 | causality=True, scope='attcnn3') 29 | 30 | with tf.Session() as sess: 31 | sess.run(tf.global_variables_initializer()) 32 | for v in sess.run([output1, output2, output3]): 33 | print(v) 34 | -------------------------------------------------------------------------------- /nlp/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hanxiao/tf-nlp-blocks/935a0cd3345b4ddc421f3eacd212d81570b071b2/nlp/__init__.py -------------------------------------------------------------------------------- /nlp/embed_blocks.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | 5 | def Positional_embed(batch_size, 6 | seq_length, 7 | max_seq_len, 8 | num_units, 9 | zero_pad=True, 10 |
scale=True, 11 | scope='Pos_Block', 12 | reuse=None): 13 | """Embeds a given tensor. 14 | Args: 15 | seqs: A `Tensor` with type `int32` or `int64` containing the ids 16 | to be looked up in `lookup table`. 17 | max_seq_len: An int. Vocabulary size. 18 | num_units: An int. Number of embedding hidden units. 19 | zero_pad: A boolean. If True, all the values of the fist row (id 0) 20 | should be constant zeros. 21 | scale: A boolean. If True. the outputs is multiplied by sqrt num_units. 22 | scope: Optional scope for `variable_scope`. 23 | reuse: Boolean, whether to reuse the weights of a previous layer 24 | by the same name. 25 | Returns: 26 | A `Tensor` with one more rank than inputs's. The last dimensionality 27 | should be `num_units`. 28 | """ 29 | with tf.variable_scope(scope, reuse=reuse): 30 | seqs = tf.tile(tf.expand_dims(tf.range(seq_length), 0), [batch_size, 1]) 31 | 32 | lookup_table = tf.get_variable('lookup_table', 33 | dtype=tf.float32, 34 | shape=[max_seq_len, num_units], 35 | initializer=tf.contrib.layers.xavier_initializer()) 36 | if zero_pad: 37 | lookup_table = tf.concat((tf.zeros(shape=[1, num_units]), 38 | lookup_table[1:, :]), 0) 39 | outputs = tf.nn.embedding_lookup(lookup_table, seqs) 40 | 41 | if scale: 42 | outputs = outputs * (num_units ** 0.5) 43 | 44 | return outputs 45 | 46 | 47 | def SinusPositional_embed(batch_size, 48 | seq_length, 49 | num_units, 50 | zero_pad=True, 51 | scale=True, 52 | scope='Sinus_Pos_Block', 53 | reuse=None): 54 | """Sinusoidal Positional Encoding. 55 | Args: 56 | inputs: A 2d Tensor with shape of (N, T). 57 | num_units: Output dimensionality 58 | zero_pad: Boolean. If True, all the values of the first row (id = 0) should be constant zero 59 | scale: Boolean. If True, the output will be multiplied by sqrt num_units(check details from paper) 60 | scope: Optional scope for `variable_scope`. 61 | reuse: Boolean, whether to reuse the weights of a previous layer 62 | by the same name. 63 | Returns: 64 | A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units' 65 | """ 66 | 67 | with tf.variable_scope(scope, reuse=reuse): 68 | seq = tf.tile(tf.expand_dims(tf.range(seq_length), 0), [batch_size, 1]) 69 | 70 | # First part of the PE function: sin and cos argument 71 | position_enc = np.array([ 72 | [pos / np.power(10000, 2. * i / num_units) for i in range(num_units)] 73 | for pos in range(seq_length)]) 74 | 75 | # Second part, apply the cosine to even columns and sin to odds. 
76 | position_enc[:, 0::2] = np.sin(position_enc[:, 0::2]) # dim 2i 77 | position_enc[:, 1::2] = np.cos(position_enc[:, 1::2]) # dim 2i+1 78 | 79 | # Convert to a float32 tensor so it matches the zero-padding row below 80 | lookup_table = tf.convert_to_tensor(position_enc, dtype=tf.float32) 81 | 82 | if zero_pad: 83 | lookup_table = tf.concat((tf.zeros(shape=[1, num_units]), 84 | lookup_table[1:, :]), 0) 85 | outputs = tf.nn.embedding_lookup(lookup_table, seq) 86 | 87 | if scale: 88 | outputs = outputs * num_units ** 0.5 89 | 90 | return outputs 91 | -------------------------------------------------------------------------------- /nlp/encode_blocks.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | 6 | import tensorflow as tf 7 | 8 | from nlp.nn import initializer, regularizer, spatial_dropout, get_lstm_init_state, dropout_res_layernorm 9 | 10 | 11 | def LSTM_encode(seqs, causality=False, scope='lstm_encode_block', reuse=None, **kwargs): 12 | with tf.variable_scope(scope, reuse=reuse): 13 | if causality: 14 | kwargs['direction'] = 'unidirectional' 15 | if 'num_units' not in kwargs or kwargs['num_units'] is None: 16 | kwargs['num_units'] = seqs.get_shape().as_list()[-1] 17 | batch_size = tf.shape(seqs)[0] 18 | _seqs = tf.transpose(seqs, [1, 0, 2]) # to T, B, D 19 | lstm = tf.contrib.cudnn_rnn.CudnnLSTM(**kwargs) 20 | init_state = get_lstm_init_state(batch_size, **kwargs) 21 | output = lstm(_seqs, init_state)[0] # 2nd return is state, ignore 22 | return tf.transpose(output, [1, 0, 2]) # back to B, T, D 23 | 24 | 25 | def TCN_encode(seqs, num_layers, scope='tcn_encode_block', reuse=None, **kwargs): 26 | with tf.variable_scope(scope, reuse=reuse): 27 | outputs = [seqs] 28 | for i in range(num_layers): 29 | dilation_size = 2 ** i 30 | out = Res_DualCNN_encode(outputs[-1], dilation=dilation_size, scope='res_biconv_%d' % i, **kwargs) 31 | outputs.append(out) 32 | return outputs[-1] 33 | 34 | 35 | def Res_DualCNN_encode(seqs, use_spatial_dropout=True, scope='res_biconv_block', reuse=None, **kwargs): 36 | with tf.variable_scope(scope, reuse=reuse): 37 | out1 = CNN_encode(seqs, scope='first_conv1d', **kwargs) 38 | if use_spatial_dropout: 39 | out1 = spatial_dropout(out1) 40 | out2 = CNN_encode(out1, scope='second_conv1d', **kwargs) 41 | if use_spatial_dropout: 42 | out2 = spatial_dropout(out2) 43 | return dropout_res_layernorm(seqs, out2, **kwargs) 44 | 45 | 46 | def CNN_encode(seqs, filter_size=3, dilation=1, 47 | num_filters=None, direction='forward', 48 | causality=False, 49 | act_fn=tf.nn.relu, 50 | scope=None, 51 | reuse=None, **kwargs): 52 | input_dim = seqs.get_shape().as_list()[-1] 53 | num_filters = num_filters if num_filters else input_dim 54 | 55 | # add causality: shift the whole seq to the right 56 | if causality: 57 | direction = 'forward' 58 | padding = (filter_size - 1) * dilation 59 | if direction == 'forward': 60 | pad_seqs = tf.pad(seqs, [[0, 0], [padding, 0], [0, 0]]) 61 | padding_scheme = 'VALID' 62 | elif direction == 'backward': 63 | pad_seqs = tf.pad(seqs, [[0, 0], [0, padding], [0, 0]]) 64 | padding_scheme = 'VALID' 65 | elif direction == 'none': 66 | pad_seqs = seqs # no padding, must set to SAME so that we have same length 67 | padding_scheme = 'SAME' 68 | else: 69 | raise NotImplementedError 70 | 71 | with tf.variable_scope(scope or 'causal_conv_%s_%s' % (filter_size, direction), reuse=reuse): 72 | return tf.layers.conv1d( 73 | pad_seqs, 74 | num_filters, 75 | filter_size, 76 | activation=act_fn, 77 | padding=padding_scheme, 78 |
dilation_rate=dilation, 79 | kernel_initializer=initializer, 80 | bias_initializer=tf.zeros_initializer(), 81 | kernel_regularizer=regularizer) 82 | -------------------------------------------------------------------------------- /nlp/match_blocks.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | 6 | import tensorflow as tf 7 | 8 | from nlp.encode_blocks import CNN_encode 9 | from nlp.nn import linear_logit, layer_norm, dropout_res_layernorm 10 | 11 | 12 | def AttentiveCNN_match(context, query, context_mask, 13 | query_mask, causality=False, 14 | scope='AttentiveCNN_Block', 15 | reuse=None, **kwargs): 16 | with tf.variable_scope(scope, reuse=reuse): 17 | direction = 'forward' if causality else 'none' 18 | cnn_wo_att = CNN_encode(context, filter_size=3, direction=direction, act_fn=None) 19 | att_context, _ = Attentive_match(context, query, context_mask, query_mask, causality=causality) 20 | cnn_att = CNN_encode(att_context, filter_size=1, direction=direction, act_fn=None) 21 | output = tf.nn.tanh(cnn_wo_att + cnn_att) 22 | return dropout_res_layernorm(context, output, **kwargs) 23 | 24 | 25 | def Attentive_match(context, query, context_mask, query_mask, 26 | num_units=None, 27 | score_func='scaled', causality=False, 28 | scope='attention_match_block', reuse=None, **kwargs): 29 | with tf.variable_scope(scope, reuse=reuse): 30 | batch_size, context_length, _ = context.get_shape().as_list() 31 | if num_units is None: 32 | num_units = context.get_shape().as_list()[-1] 33 | _, query_length, _ = query.get_shape().as_list() 34 | 35 | context = linear_logit(context, num_units, act_fn=tf.nn.relu, scope='context_mapping') 36 | query = linear_logit(query, num_units, act_fn=tf.nn.relu, scope='query_mapping') 37 | 38 | if score_func == 'dot': 39 | score = tf.matmul(context, query, transpose_b=True) 40 | elif score_func == 'bilinear': 41 | score = tf.matmul(linear_logit(context, num_units, scope='context_x_We'), query, transpose_b=True) 42 | elif score_func == 'scaled': 43 | score = tf.matmul(linear_logit(context, num_units, scope='context_x_We'), query, transpose_b=True) / \ 44 | (num_units ** 0.5) 45 | elif score_func == 'additive': 46 | score = tf.squeeze(linear_logit( 47 | tf.tanh(tf.tile(tf.expand_dims(linear_logit(context, num_units, scope='context_x_We'), axis=2), 48 | [1, 1, query_length, 1]) + 49 | tf.tile(tf.expand_dims(linear_logit(query, num_units, scope='query_x_We'), axis=1), 50 | [1, context_length, 1, 1])), 1, scope='x_ve'), axis=3) 51 | else: 52 | raise NotImplementedError 53 | 54 | mask = tf.matmul(tf.expand_dims(context_mask, -1), tf.expand_dims(query_mask, -1), transpose_b=True) 55 | paddings = tf.ones_like(mask) * (-2 ** 32 + 1) 56 | masked_score = tf.where(tf.equal(mask, 0), paddings, score) # B, Lc, Lq 57 | 58 | # Causality = Future blinding 59 | if causality: 60 | diag_vals = tf.ones_like(masked_score[0, :, :]) # (Lc, Lq) 61 | tril = tf.contrib.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense() # (Lc, Lq) 62 | masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(masked_score)[0], 1, 1]) # B, Lc, Lq 63 | 64 | paddings = tf.ones_like(masks) * (-2 ** 32 + 1) 65 | masked_score = tf.where(tf.equal(masks, 0), paddings, masked_score) # B, Lc, Lq 66 | 67 | query2context_score = tf.nn.softmax(masked_score, axis=2) * mask # B, Lc, Lq 68 | query2context_attention = tf.matmul(query2context_score, query) # B, Lc, D 69 | 70 | context2query_score = tf.nn.softmax(masked_score, 
axis=1) * mask # B, Lc, Lq 71 | context2query_attention = tf.matmul(context2query_score, context, transpose_a=True) # B, Lq, D 72 | 73 | return (query2context_attention, # B, Lc, D 74 | context2query_attention) # B, Lq, D 75 | 76 | 77 | def Transformer_match(context, 78 | query, 79 | context_mask, 80 | query_mask, 81 | num_units=None, 82 | num_heads=4, 83 | dropout_keep_rate=1.0, 84 | causality=False, 85 | scope='Transformer_Block', 86 | reuse=None, 87 | residual=False, 88 | normalize_output=False, 89 | **kwargs): 90 | """Applies multihead attention. 91 | 92 | Args: 93 | context: A 3d tensor with shape of [N, T_q, C_q]. 94 | query: A 3d tensor with shape of [N, T_k, C_k]. 95 | num_units: A scalar. Attention size. 96 | dropout_rate: A floating point number. 97 | is_training: Boolean. Controller of mechanism for dropout. 98 | causality: Boolean. If true, units that reference the future are masked. 99 | num_heads: An int. Number of heads. 100 | scope: Optional scope for `variable_scope`. 101 | reuse: Boolean, whether to reuse the weights of a previous layer 102 | by the same name. 103 | 104 | Returns 105 | A 3d tensor with shape of (N, T_q, C) 106 | """ 107 | if num_units is None or residual: 108 | num_units = context.get_shape().as_list()[-1] 109 | with tf.variable_scope(scope, reuse=reuse): 110 | # Set the fall back option for num_units 111 | 112 | # Linear projections 113 | Q = tf.layers.dense(context, num_units, activation=tf.nn.relu) # (N, T_q, C) 114 | K = tf.layers.dense(query, num_units, activation=tf.nn.relu) # (N, T_k, C) 115 | V = tf.layers.dense(query, num_units, activation=tf.nn.relu) # (N, T_k, C) 116 | 117 | # Split and concat 118 | Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, C/h) 119 | K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 120 | V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 121 | 122 | # Multiplication 123 | outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1])) # (h*N, T_q, T_k) 124 | 125 | # Scale 126 | outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5) 127 | 128 | # Key Masking, aka query 129 | if query_mask is None: 130 | query_mask = tf.sign(tf.abs(tf.reduce_sum(query, axis=-1))) # (N, T_k) 131 | 132 | mask1 = tf.tile(query_mask, [num_heads, 1]) # (h*N, T_k) 133 | mask1 = tf.tile(tf.expand_dims(mask1, 1), [1, tf.shape(context)[1], 1]) # (h*N, T_q, T_k) 134 | 135 | paddings = tf.ones_like(outputs) * (-2 ** 32 + 1) 136 | outputs = tf.where(tf.equal(mask1, 0), paddings, outputs) # (h*N, T_q, T_k) 137 | 138 | # Causality = Future blinding 139 | if causality: 140 | diag_vals = tf.ones_like(outputs[0, :, :]) # (T_q, T_k) 141 | tril = tf.contrib.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense() # (T_q, T_k) 142 | masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1]) # (h*N, T_q, T_k) 143 | 144 | paddings = tf.ones_like(masks) * (-2 ** 32 + 1) 145 | outputs = tf.where(tf.equal(masks, 0), paddings, outputs) # (h*N, T_q, T_k) 146 | 147 | # Activation 148 | outputs = tf.nn.softmax(outputs) # (h*N, T_q, T_k) 149 | 150 | # Query Masking aka, context 151 | if context_mask is None: 152 | context_mask = tf.sign(tf.abs(tf.reduce_sum(context, axis=-1))) # (N, T_q) 153 | 154 | mask2 = tf.tile(context_mask, [num_heads, 1]) # (h*N, T_q) 155 | mask2 = tf.tile(tf.expand_dims(mask2, -1), [1, 1, tf.shape(query)[1]]) # (h*N, T_q, T_k) 156 | outputs *= mask2 # (h*N, T_q, T_k) 157 | 158 | # Dropouts 159 | outputs = tf.nn.dropout(outputs, keep_prob=dropout_keep_rate) 160 | 161 | # 
Weighted sum 162 | outputs = tf.matmul(outputs, V_) # ( h*N, T_q, C/h) 163 | 164 | # Restore shape 165 | outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2) # (N, T_q, C) 166 | 167 | if residual: 168 | # Residual connection 169 | outputs += context 170 | 171 | if normalize_output: 172 | # Normalize 173 | outputs = layer_norm(outputs) # (N, T_q, C) 174 | 175 | return outputs 176 | 177 | 178 | def BiDaf_match(context, query, context_mask, query_mask, scope=None, reuse=None, **kwargs): 179 | # context: [batch, l, d] 180 | # question: [batch, l2, d] 181 | with tf.variable_scope(scope, reuse=reuse): 182 | n_a = tf.tile(tf.expand_dims(context, 2), [1, 1, tf.shape(query)[1], 1]) 183 | n_b = tf.tile(tf.expand_dims(query, 1), [1, tf.shape(context)[1], 1, 1]) 184 | 185 | n_ab = n_a * n_b 186 | n_abab = tf.concat([n_ab, n_a, n_b], -1) 187 | 188 | kernel = tf.squeeze(tf.layers.dense(n_abab, units=1), -1) 189 | 190 | context_mask = tf.expand_dims(context_mask, -1) 191 | query_mask = tf.expand_dims(query_mask, -1) 192 | kernel_mask = tf.matmul(context_mask, query_mask, transpose_b=True) 193 | kernel += (kernel_mask - 1) * 1e5 194 | 195 | con_query = tf.matmul(tf.nn.softmax(kernel, 1), query) 196 | con_query = con_query * context_mask 197 | 198 | query_con = tf.matmul(tf.transpose( 199 | tf.reduce_max(kernel, 2, keepdims=True), [0, 2, 1]), context * context_mask) 200 | query_con = tf.tile(query_con, [1, tf.shape(context)[1], 1]) 201 | return tf.concat([context * query_con, context * con_query, context, query_con], 2) 202 | 203 | 204 | def Stacked_Self_match(context, context_mask, self_match, 205 | num_layers=3, 206 | scope='Stacked_Self_match_Block', 207 | reuse=None, **kwargs): 208 | output = context 209 | with tf.variable_scope(scope, reuse=reuse): 210 | for j in range(num_layers): 211 | output = self_match(output, output, context_mask, context_mask, 212 | scope=scope + '_layer_%d' % j, **kwargs) 213 | return output 214 | -------------------------------------------------------------------------------- /nlp/mulitask_blocks.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | 6 | import tensorflow as tf 7 | 8 | from nlp.nn import get_var, linear_logit, get_cross_correlated_mat, get_self_correlated_mat, gate_filter 9 | 10 | 11 | def CrossStitch(logit_A, logit_B, 12 | pA_given_B, pB_given_A, 13 | pA_corr=None, pB_corr=None, 14 | reduce_transfer='LEARNED', 15 | scope=None, reuse=None, 16 | **kwargs): 17 | with tf.variable_scope(scope or 'Cross_Stitch_Block', reuse=reuse): 18 | trans_B = tf.matmul(logit_A, pB_given_A) # batch x num B 19 | trans_A = tf.matmul(logit_B, pA_given_B, transpose_b=True) # batch x num A 20 | 21 | if pA_corr is not None: 22 | logit_A = tf.matmul(logit_A, pA_corr) 23 | if pB_corr is not None: 24 | logit_B = tf.matmul(logit_B, pB_corr) 25 | 26 | if reduce_transfer == 'LEARNED': 27 | res_rate_A = tf.sigmoid(get_var('taskA_weight', shape=[1, logit_A.get_shape().as_list()[-1]])) 28 | res_rate_B = tf.sigmoid(get_var('taskB_weight', shape=[1, logit_B.get_shape().as_list()[-1]])) 29 | output_A = res_rate_A * logit_A + (1 - res_rate_A) * trans_A # batch x num A 30 | output_B = res_rate_B * logit_B + (1 - res_rate_B) * trans_B # batch x num B 31 | elif reduce_transfer == 'MEAN': 32 | output_A = (logit_A + trans_A) / 2 33 | output_B = (logit_B + trans_B) / 2 34 | elif reduce_transfer == 'GATE': 35 | output_A = logit_A * tf.sigmoid(trans_A) 36 | output_B = logit_B * 
tf.sigmoid(trans_B) 37 | elif reduce_transfer == 'RES_GATE': 38 | output_A = logit_A + logit_A * tf.sigmoid(trans_A) 39 | output_B = logit_B + logit_B * tf.sigmoid(trans_B) 40 | else: 41 | raise NotImplementedError 42 | 43 | return output_A, output_B 44 | 45 | 46 | def Stack_CrossStitch_old(shared_input, cooc_AB=None, num_layers=3, 47 | num_out_A=None, num_out_B=None, 48 | transform=linear_logit, 49 | learn_cooc='FIXED', 50 | gated=False, 51 | self_correlation=False, 52 | scope=None, 53 | reuse=None, 54 | **kwargs): 55 | with tf.variable_scope(scope or 'Stacked_Cross_Stitch_Block', reuse=reuse): 56 | pA_given_B, pB_given_A = get_cross_correlated_mat(num_out_A, num_out_B, learn_cooc, cooc_AB) 57 | 58 | pA_corr = get_self_correlated_mat(num_out_A, scope='self_corr_a') if self_correlation else None 59 | pB_corr = get_self_correlated_mat(num_out_B, scope='self_corr_b') if self_correlation else None 60 | 61 | iter_shared = shared_input 62 | out_A, out_B = None, None 63 | for j in range(num_layers): 64 | with tf.variable_scope('cross_stitch_block_%d' % j): 65 | iter_inputA = gate_filter(iter_shared, 'threshold_a') if gated else iter_shared 66 | iter_inputB = gate_filter(iter_shared, 'threshold_b') if gated else iter_shared 67 | logit_A = transform(iter_inputA, num_out_A, scope='dense_A', **kwargs) 68 | logit_B = transform(iter_inputB, num_out_B, scope='dense_B', **kwargs) 69 | out_A, out_B = CrossStitch(logit_A, logit_B, 70 | pA_given_B, pB_given_A, 71 | pA_corr, pB_corr, 72 | **kwargs) 73 | iter_shared = tf.concat([out_A, out_B], axis=-1) 74 | return out_A, out_B 75 | 76 | 77 | def Stack_CrossStitch(shared_input=None, init_input_A=None, init_input_B=None, 78 | cooc_AB=None, num_layers=3, 79 | num_out_A=None, num_out_B=None, 80 | transform=linear_logit, 81 | learn_cooc='FIXED', 82 | self_correlation=False, 83 | scope=None, 84 | reuse=None, 85 | **kwargs): 86 | if shared_input is None and (init_input_A is None or init_input_B is None): 87 | raise AttributeError('Must specify shared input or init_input') 88 | 89 | with tf.variable_scope(scope or 'Stacked_Cross_Stitch_Block', reuse=reuse): 90 | pA_given_B, pB_given_A = get_cross_correlated_mat(num_out_A, num_out_B, learn_cooc, cooc_AB) 91 | 92 | pA_corr = get_self_correlated_mat(num_out_A, scope='self_corr_a') if self_correlation else None 93 | pB_corr = get_self_correlated_mat(num_out_B, scope='self_corr_b') if self_correlation else None 94 | 95 | iter_inputA = init_input_A if init_input_A is not None else shared_input 96 | iter_inputB = init_input_B if init_input_B is not None else shared_input 97 | 98 | out_A, out_B = None, None 99 | for j in range(num_layers): 100 | with tf.variable_scope('cross_stitch_block_%d' % j): 101 | logit_A = transform(iter_inputA, num_out_A, scope='dense_A', **kwargs) 102 | logit_B = transform(iter_inputB, num_out_B, scope='dense_B', **kwargs) 103 | out_A, out_B = CrossStitch(logit_A, logit_B, 104 | pA_given_B, pB_given_A, 105 | pA_corr, pB_corr, 106 | **kwargs) 107 | iter_inputA = tf.concat([out_A, out_B], axis=-1) 108 | iter_inputB = iter_inputA 109 | return out_A, out_B 110 | -------------------------------------------------------------------------------- /nlp/nn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | 6 | import tensorflow as tf 7 | 8 | initializer = tf.contrib.layers.variance_scaling_initializer(factor=1.0, 9 | mode='FAN_AVG', 10 | uniform=True, 11 | dtype=tf.float32) 12 | initializer_relu = 
tf.contrib.layers.variance_scaling_initializer(factor=2.0, 13 | mode='FAN_IN', 14 | uniform=False, 15 | dtype=tf.float32) 16 | regularizer = tf.contrib.layers.l2_regularizer(scale=3e-7) 17 | 18 | 19 | def minus_mask(x, mask, offset=1e30): 20 | """ 21 | masking by subtract a very large number 22 | :param x: sequence data in the shape of [B, L, D] 23 | :param mask: 0-1 mask in the shape of [B, L] 24 | :param offset: very large negative number 25 | :return: masked x 26 | """ 27 | return x - tf.expand_dims(1.0 - mask, axis=-1) * offset 28 | 29 | 30 | def mul_mask(x, mask): 31 | """ 32 | masking by multiply zero 33 | :param x: sequence data in the shape of [B, L, D] 34 | :param mask: 0-1 mask in the shape of [B, L] 35 | :return: masked x 36 | """ 37 | return x * tf.expand_dims(mask, axis=-1) 38 | 39 | 40 | def masked_reduce_mean(x, mask): 41 | return tf.reduce_sum(mul_mask(x, mask), axis=1) / tf.reduce_sum(mask, axis=1, keepdims=True) 42 | 43 | 44 | def masked_reduce_max(x, mask): 45 | return tf.reduce_max(minus_mask(x, mask), axis=1) 46 | 47 | 48 | def weighted_sparse_softmax_cross_entropy(labels, preds, weights): 49 | """ 50 | computing sparse softmax cross entropy by weighting differently on classes 51 | :param labels: sparse label in the shape of [B], size of label is L 52 | :param preds: logit in the shape of [B, L] 53 | :param weights: weight in the shape of [L] 54 | :return: weighted sparse softmax cross entropy in the shape of [B] 55 | """ 56 | 57 | return tf.losses.sparse_softmax_cross_entropy(labels, 58 | logits=preds, 59 | weights=get_bounded_class_weight(labels, weights)) 60 | 61 | 62 | def get_bounded_class_weight(labels, weights, ub=None): 63 | if weights is None: 64 | return 1.0 65 | else: 66 | w = tf.gather(weights, labels) 67 | w = w / tf.reduce_min(w) 68 | w = tf.clip_by_value(1.0 + tf.log1p(w), 69 | clip_value_min=1.0, 70 | clip_value_max=ub if ub is not None else tf.cast(tf.shape(weights)[0], tf.float32) / 2.0) 71 | return w 72 | 73 | 74 | def weighted_smooth_softmax_cross_entropy(labels, num_labels, preds, weights, 75 | epsilon=0.1): 76 | """ 77 | computing smoothed softmax cross entropy by weighting differently on classes 78 | :param epsilon: smoothing factor 79 | :param num_labels: maximum number of labels 80 | :param labels: sparse label in the shape of [B], size of label is L 81 | :param preds: logit in the shape of [B, L] 82 | :param weights: weight in the shape of [L] 83 | :return: weighted sparse softmax cross entropy in the shape of [B] 84 | """ 85 | 86 | return tf.losses.softmax_cross_entropy(tf.one_hot(labels, num_labels), 87 | logits=preds, 88 | label_smoothing=epsilon, 89 | weights=get_bounded_class_weight(labels, weights)) 90 | 91 | 92 | def get_var(name, shape, dtype=tf.float32, 93 | initializer_fn=initializer, 94 | regularizer_fn=regularizer, **kwargs): 95 | return tf.get_variable(name, shape, 96 | initializer=initializer_fn, 97 | dtype=dtype, 98 | regularizer=regularizer_fn, **kwargs) 99 | 100 | 101 | def layer_norm(inputs, 102 | epsilon=1e-8, 103 | scope=None, 104 | reuse=None): 105 | """Applies layer normalization. 106 | 107 | Args: 108 | inputs: A tensor with 2 or more dimensions, where the first dimension has 109 | `batch_size`. 110 | epsilon: A floating number. A very small number for preventing ZeroDivision Error. 111 | scope: Optional scope for `variable_scope`. 112 | reuse: Boolean, whether to reuse the weights of a previous layer 113 | by the same name. 114 | 115 | Returns: 116 | A tensor with the same shape and data dtype as `inputs`. 
117 | """ 118 | with tf.variable_scope(scope or 'Layer_Normalize', reuse=reuse): 119 | inputs_shape = inputs.get_shape() 120 | params_shape = inputs_shape[-1:] 121 | 122 | mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True) 123 | beta = tf.Variable(tf.zeros(params_shape)) 124 | gamma = tf.Variable(tf.ones(params_shape)) 125 | normalized = (inputs - mean) / ((variance + epsilon) ** .5) 126 | outputs = gamma * normalized + beta 127 | 128 | return outputs 129 | 130 | 131 | def linear_logit(x, units, act_fn=None, dropout_keep=1., use_layer_norm=False, scope=None, reuse=None, **kwargs): 132 | with tf.variable_scope(scope or 'linear_logit', reuse=reuse): 133 | logit = tf.layers.dense(x, units=units, activation=act_fn, 134 | kernel_initializer=initializer, 135 | kernel_regularizer=regularizer) 136 | # do dropout 137 | logit = tf.nn.dropout(logit, keep_prob=dropout_keep) 138 | if use_layer_norm: 139 | logit = tf.contrib.layers.layer_norm(logit) 140 | return logit 141 | 142 | 143 | def bilinear_logit(x, units, act_fn=None, 144 | first_units=256, 145 | first_act_fn=tf.nn.relu, scope=None, **kwargs): 146 | with tf.variable_scope(scope or 'bilinear_logit'): 147 | first = linear_logit(x, first_units, act_fn=first_act_fn, scope='first', **kwargs) 148 | return linear_logit(first, units, scope='second', act_fn=act_fn, **kwargs) 149 | 150 | 151 | def label_smoothing(inputs, epsilon=0.1): 152 | """Applies label smoothing. See https://arxiv.org/abs/1512.00567. 153 | 154 | Args: 155 | inputs: A 3d tensor with shape of [N, T, V], where V is the number of vocabulary. 156 | epsilon: Smoothing rate. 157 | 158 | For example, 159 | 160 | ``` 161 | import tensorflow as tf 162 | inputs = tf.convert_to_tensor([[[0, 0, 1], 163 | [0, 1, 0], 164 | [1, 0, 0]], 165 | [[1, 0, 0], 166 | [1, 0, 0], 167 | [0, 1, 0]]], tf.float32) 168 | 169 | outputs = label_smoothing(inputs) 170 | 171 | with tf.Session() as sess: 172 | print(sess.run([outputs])) 173 | 174 | >> 175 | [array([[[ 0.03333334, 0.03333334, 0.93333334], 176 | [ 0.03333334, 0.93333334, 0.03333334], 177 | [ 0.93333334, 0.03333334, 0.03333334]], 178 | [[ 0.93333334, 0.03333334, 0.03333334], 179 | [ 0.93333334, 0.03333334, 0.03333334], 180 | [ 0.03333334, 0.93333334, 0.03333334]]], dtype=float32)] 181 | ``` 182 | """ 183 | K = inputs.get_shape().as_list()[-1] # number of channels 184 | return ((1 - epsilon) * inputs) + (epsilon / K) 185 | 186 | 187 | def normalize_by_axis(x, axis, smooth_factor=1e-5): 188 | x += smooth_factor 189 | return x / tf.reduce_sum(x, axis, keepdims=True) # num A x num B 190 | 191 | 192 | def get_cross_correlated_mat(num_out_A, num_out_B, learn_cooc='FIXED', cooc_AB=None, scope=None, reuse=None): 193 | with tf.variable_scope(scope or 'CrossCorrlated_Mat', reuse=reuse): 194 | if learn_cooc == 'FIXED' and cooc_AB is not None: 195 | pB_given_A = normalize_by_axis(cooc_AB, 1) 196 | pA_given_B = normalize_by_axis(cooc_AB, 0) 197 | elif learn_cooc == 'JOINT': 198 | share_cooc = tf.nn.relu(get_var('cooc_ab', shape=[num_out_A, num_out_B])) 199 | pB_given_A = normalize_by_axis(share_cooc, 1) 200 | pA_given_B = normalize_by_axis(share_cooc, 0) 201 | elif learn_cooc == 'DISJOINT': 202 | cooc1 = tf.nn.relu(get_var('pb_given_a', shape=[num_out_A, num_out_B])) 203 | cooc2 = tf.nn.relu(get_var('pa_given_b', shape=[num_out_A, num_out_B])) 204 | pB_given_A = normalize_by_axis(cooc1, 1) 205 | pA_given_B = normalize_by_axis(cooc2, 0) 206 | else: 207 | raise NotImplementedError 208 | 209 | return pA_given_B, pB_given_A 210 | 211 | 212 | def 
get_self_correlated_mat(num_out_A, scope=None, reuse=None): 213 | with tf.variable_scope(scope or 'Self_Correlated_mat', reuse=reuse): 214 | cooc1 = get_var('pa_corr', shape=[num_out_A, num_out_A], 215 | initializer_fn=tf.contrib.layers.variance_scaling_initializer(factor=0.1, 216 | mode='FAN_AVG', 217 | uniform=True, 218 | dtype=tf.float32), 219 | regularizer_fn=tf.contrib.layers.l2_regularizer(scale=3e-4)) 220 | return tf.matmul(cooc1, cooc1, transpose_b=True) + tf.eye(num_out_A) 221 | 222 | 223 | def gate_filter(x, scope=None, reuse=None): 224 | with tf.variable_scope(scope or 'Gate', reuse=reuse): 225 | threshold = get_var('threshold', shape=[]) 226 | gate = tf.cast(tf.greater(x, threshold), tf.float32) 227 | return x * gate 228 | 229 | 230 | from tensorflow.python.ops import array_ops 231 | 232 | 233 | def focal_loss(prediction_tensor, target_tensor, weights=None, alpha=0.25, gamma=2): 234 | r"""Compute focal loss for predictions. 235 | Multi-labels Focal loss formula: 236 | FL = -alpha * (z-p)^gamma * log(p) -(1-alpha) * p^gamma * log(1-p) 237 | ,which alpha = 0.25, gamma = 2, p = sigmoid(x), z = target_tensor. 238 | Args: 239 | prediction_tensor: A float tensor of shape [batch_size, num_anchors, 240 | num_classes] representing the predicted logits for each class 241 | target_tensor: A float tensor of shape [batch_size, num_anchors, 242 | num_classes] representing one-hot encoded classification targets 243 | weights: A float tensor of shape [batch_size, num_anchors] 244 | alpha: A scalar tensor for focal loss alpha hyper-parameter 245 | gamma: A scalar tensor for focal loss gamma hyper-parameter 246 | Returns: 247 | loss: A (scalar) tensor representing the value of the loss function 248 | """ 249 | sigmoid_p = tf.nn.sigmoid(prediction_tensor) 250 | zeros = array_ops.zeros_like(sigmoid_p, dtype=sigmoid_p.dtype) 251 | 252 | # For poitive prediction, only need consider front part loss, back part is 0; 253 | # target_tensor > zeros <=> z=1, so poitive coefficient = z - p. 254 | pos_p_sub = array_ops.where(target_tensor > zeros, target_tensor - sigmoid_p, zeros) 255 | 256 | # For negative prediction, only need consider back part loss, front part is 0; 257 | # target_tensor > zeros <=> z=1, so negative coefficient = 0. 258 | neg_p_sub = array_ops.where(target_tensor > zeros, zeros, sigmoid_p) 259 | per_entry_cross_ent = - alpha * (pos_p_sub ** gamma) * tf.log(tf.clip_by_value(sigmoid_p, 1e-8, 1.0)) \ 260 | - (1 - alpha) * (neg_p_sub ** gamma) * tf.log(tf.clip_by_value(1.0 - sigmoid_p, 1e-8, 1.0)) 261 | return tf.reduce_sum(per_entry_cross_ent) 262 | 263 | 264 | def spatial_dropout(x, scope=None, reuse=None): 265 | input_dim = x.get_shape().as_list()[-1] 266 | with tf.variable_scope(scope or 'spatial_dropout', reuse=reuse): 267 | d = tf.random_uniform(shape=[1], minval=0, maxval=input_dim, dtype=tf.int32) 268 | f = tf.one_hot(d, on_value=0., off_value=1., depth=input_dim) 269 | g = x * f # do dropout 270 | g *= (1. + 1. / input_dim) # do rescale 271 | return g 272 | 273 | 274 | def get_last_output(output, seq_length, scope=None, reuse=None): 275 | """Get the last value of the returned output of an RNN. 276 | http://disq.us/p/1gjkgdr 277 | output: [batch x number of steps x ... ] Output of the dynamic lstm. 278 | sequence_length: [batch] Length of each of the sequence. 
279 | """ 280 | with tf.variable_scope(scope or 'gather_nd', reuse=reuse): 281 | rng = tf.range(0, tf.shape(seq_length)[0]) 282 | indexes = tf.stack([rng, seq_length - 1], 1) 283 | return tf.gather_nd(output, indexes) 284 | 285 | 286 | def get_lstm_init_state(batch_size, num_layers, num_units, direction, scope=None, reuse=None, **kwargs): 287 | with tf.variable_scope(scope or 'lstm_init_state', reuse=reuse): 288 | num_dir = 2 if direction.startswith('bi') else 1 289 | c = get_var('lstm_init_c', shape=[num_layers * num_dir, num_units]) 290 | c = tf.tile(tf.expand_dims(c, axis=1), [1, batch_size, 1]) 291 | h = get_var('lstm_init_h', shape=[num_layers * num_dir, num_units]) 292 | h = tf.tile(tf.expand_dims(h, axis=1), [1, batch_size, 1]) 293 | return c, h 294 | 295 | 296 | def dropout_res_layernorm(x, fx, act_fn=tf.nn.relu, 297 | dropout_keep_rate=1.0, 298 | residual=True, 299 | normalize_output=True, 300 | scope='rnd_block', 301 | reuse=None, **kwargs): 302 | with tf.variable_scope(scope, reuse=reuse): 303 | input_dim = x.get_shape().as_list()[-1] 304 | output_dim = fx.get_shape().as_list()[-1] 305 | 306 | # do dropout 307 | fx = tf.nn.dropout(fx, keep_prob=dropout_keep_rate) 308 | 309 | if residual and input_dim != output_dim: 310 | res_x = tf.layers.conv1d(x, 311 | filters=output_dim, 312 | kernel_size=1, 313 | activation=None, 314 | name='res_1x1conv') 315 | else: 316 | res_x = x 317 | 318 | if residual and act_fn is None: 319 | output = fx + res_x 320 | elif residual and act_fn is not None: 321 | output = act_fn(fx + res_x) 322 | else: 323 | output = fx 324 | 325 | if normalize_output: 326 | output = layer_norm(output) 327 | 328 | return output 329 | -------------------------------------------------------------------------------- /nlp/pool_blocks.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | 6 | import tensorflow as tf 7 | 8 | from nlp.encode_blocks import CNN_encode 9 | from nlp.match_blocks import Transformer_match 10 | from nlp.nn import masked_reduce_max, masked_reduce_mean, get_var, minus_mask 11 | 12 | 13 | def SWEM_pool(seqs, mask, reduce='CONCAT_MEAN_MAX', scope=None, reuse=None, 14 | task_name=None, norm_by_layer=False, dropout_keep=1., **kwargs): 15 | """ 16 | Simple word embedding fusion 17 | """ 18 | with tf.variable_scope(scope or 'SWEM_Block', reuse=reuse): 19 | if reduce == 'MAX': 20 | pooled = masked_reduce_max(seqs, mask) 21 | elif reduce == 'CONCAT_MEAN_MAX': # often best 22 | pooled = tf.concat([masked_reduce_mean(seqs, mask), 23 | masked_reduce_max(seqs, mask)], axis=1) 24 | elif reduce == 'MEAN': # often worst 25 | pooled = masked_reduce_mean(seqs, mask) 26 | elif reduce == 'LEVEL_MEAN_MAX': 27 | # 1. segment sequence in a small window 28 | # 2. do average pooling on each window 29 | # 3. 
do max pooling across all windows 30 | def reshape_seqs(x, avg_window_size=3, **kwargs): 31 | B = tf.shape(x)[0] 32 | L = tf.cast(tf.shape(x)[1], tf.float32) 33 | D = x.get_shape().as_list()[-1] 34 | b = tf.transpose(x, [0, 2, 1]) 35 | extra_pads = tf.cast(tf.ceil(L / avg_window_size) * avg_window_size - L, tf.int32) 36 | c = tf.pad(b, tf.concat([tf.zeros([2, 2], dtype=tf.int32), [[0, extra_pads]]], axis=0)) 37 | return tf.reshape(c, [B, D, avg_window_size, -1]) 38 | 39 | # avg pooling with mask, be careful with all zero mask 40 | d = tf.reduce_sum(reshape_seqs(seqs, **kwargs), axis=2) / \ 41 | tf.reduce_sum(reshape_seqs(tf.expand_dims(mask + 1e-10, axis=-1), **kwargs), axis=2) 42 | 43 | # max pooling 44 | pooled = tf.reduce_max(d, axis=2) 45 | elif reduce == 'SELF_ATTENTION': 46 | D = seqs.get_shape().as_list()[-1] 47 | query = tf.tile(get_var('%s_query' % (task_name if task_name else ''), 48 | shape=[1, D, 1]), [tf.shape(seqs)[0], 1, 1]) 49 | score = tf.nn.softmax(minus_mask(tf.matmul(seqs, query) / (D ** 0.5), mask), axis=1) # [B,L,1] 50 | pooled = tf.reduce_sum(score * seqs, axis=1) # [B, D] 51 | else: 52 | raise NotImplementedError 53 | 54 | pooled = tf.nn.dropout(pooled, keep_prob=dropout_keep) 55 | if norm_by_layer: 56 | pooled = tf.contrib.layers.layer_norm(pooled) 57 | 58 | return pooled 59 | 60 | 61 | # the following blocks are for experimental purpose!!! 62 | # i just keep them here in case i need them 63 | 64 | def MH_Att_pool(seqs, mask, scope=None, reuse=None, **kwargs): 65 | with tf.variable_scope(scope or 'MultiHead_Attention_Pool_Block', reuse=reuse): 66 | rep = Transformer_match(seqs, seqs, **kwargs) 67 | return SWEM_pool(rep, mask, **kwargs) 68 | 69 | 70 | def Stack_MH_Att_pool(seqs, mask, num_layers=5, num_units=None, 71 | concat_output=True, 72 | residual=True, scope=None, reuse=None, **kwargs): 73 | if num_units is None or residual: 74 | num_units = seqs.get_shape().as_list()[-1] 75 | pooled_outputs = [] 76 | with tf.variable_scope(scope or 'Stack_MultiHead_Attention_Pool_Block', reuse=reuse): 77 | res_rep = None 78 | iter_rep = seqs 79 | for idx in range(num_layers): 80 | with tf.variable_scope("multi_head_pool_%s" % idx): 81 | iter_rep = Transformer_match(iter_rep, iter_rep, num_units, **kwargs) 82 | # res 83 | if residual and (res_rep is not None): 84 | iter_rep = iter_rep + res_rep 85 | res_rep = iter_rep 86 | pooled = SWEM_pool(iter_rep, mask, **kwargs) 87 | pooled_outputs.append(pooled) 88 | return tf.concat(pooled_outputs, axis=1) if concat_output else pooled_outputs[-1] 89 | 90 | 91 | def Stack_CNN_pool(seqs, mask, filter_sizes=(3, 4, 5), num_units=None, 92 | residual=True, concat_output=True, scope=None, reuse=None, 93 | **kwargs): 94 | if num_units is None or residual: 95 | num_units = seqs.get_shape().as_list()[-1] 96 | pooled_outputs = [] 97 | with tf.variable_scope(scope or 'Deep_CNN_Residual_Block', reuse=reuse): 98 | res_rep = None 99 | iter_rep = seqs 100 | for fz in filter_sizes: 101 | with tf.variable_scope("conv_pool_%s" % fz): 102 | conv_relu = CNN_encode(iter_rep, fz, 2 * num_units) 103 | # gate 104 | map_res_a, map_res_b = tf.split(conv_relu, num_or_size_splits=2, axis=2) 105 | iter_rep = map_res_a * tf.nn.sigmoid(map_res_b) 106 | # res 107 | if residual and (res_rep is not None): 108 | iter_rep = iter_rep + res_rep 109 | res_rep = iter_rep 110 | pooled = SWEM_pool(iter_rep, mask, **kwargs) 111 | pooled_outputs.append(pooled) 112 | return tf.concat(pooled_outputs, axis=1) if concat_output else pooled_outputs[-1] 113 | 114 | 115 | def 
CNN_pool(seqs, mask, filter_size, num_units=None, scope=None, reuse=None, **kwargs): 116 | if num_units is None: 117 | num_units = seqs.get_shape().as_list()[-1] 118 | with tf.variable_scope(scope or 'CNN_Pool_Block', reuse=reuse): 119 | conv_relu = CNN_encode(seqs, filter_size, num_units) 120 | pooled = SWEM_pool(conv_relu, mask, **kwargs) 121 | return pooled 122 | 123 | 124 | def MR_CNN_pool(seqs, mask, filter_sizes=(3, 4, 5), 125 | highway=False, scope=None, reuse=None, **kwargs): 126 | """ 127 | Multi-resolution CNN fusion 128 | """ 129 | with tf.variable_scope(scope or 'MR_CNN_Block', reuse=reuse): 130 | outputs = [SWEM_pool(seqs, mask, **kwargs)] if highway else [] 131 | for fz in filter_sizes: 132 | outputs.append(CNN_pool(seqs, mask, fz, scope='cnn_pool_%d' % fz, **kwargs)) 133 | return tf.concat(outputs, axis=1) 134 | 135 | 136 | def MH_MR_CNN_pool_v2(seqs, mask, num_heads=4, num_units=256, 137 | scope=None, reuse=None, **kwargs): 138 | with tf.variable_scope(scope or 'MultiHead_MCNN_Block', reuse=reuse): 139 | outputs = [] 140 | part_seqs = tf.split(seqs, num_heads, axis=2) 141 | for idx, part_seq in enumerate(part_seqs): 142 | outputs.append(MR_CNN_pool(part_seq, mask, 143 | num_units=int(num_units / num_heads), 144 | scope='mr_cnn_head_%d' % idx, **kwargs)) 145 | return tf.concat(outputs, axis=1) 146 | 147 | 148 | def MH_MR_CNN_pool(seqs, mask, num_heads=3, 149 | scope=None, reuse=None, **kwargs): 150 | """ 151 | Multi-head multi-resolution CNN fusion 152 | """ 153 | with tf.variable_scope(scope or 'MultiHead_MCNN_Block', reuse=reuse): 154 | outputs = [] 155 | for j in range(num_heads): 156 | outputs.append(MR_CNN_pool(seqs, mask, scope='mr_cnn_block_%d' % j, **kwargs)) 157 | return tf.add_n(outputs) 158 | --------------------------------------------------------------------------------
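As a closing note (not part of the repository files above): the sketch below shows one way the blocks could be chained end-to-end — encode, then match, then pool — reusing the toy shapes from `app.py`. It is a minimal, hypothetical usage sketch under TF 1.x; the wiring and hyper-parameters (`num_layers=2`, `num_heads=4`, `reduce='CONCAT_MEAN_MAX'`) are illustrative assumptions, and only the imported block names come from the files above.

```python
import numpy as np
import tensorflow as tf

from nlp.encode_blocks import TCN_encode
from nlp.match_blocks import Transformer_match
from nlp.pool_blocks import SWEM_pool

# toy batches, same shapes as in app.py: B=2, Lc=5, Lq=3, D=4
context = tf.constant(np.random.random([2, 5, 4]), dtype=tf.float32)  # B, Lc, D
query = tf.constant(np.random.random([2, 3, 4]), dtype=tf.float32)    # B, Lq, D
context_mask = tf.sequence_mask([4, 3], 5, dtype=tf.float32)          # B, Lc
query_mask = tf.sequence_mask([2, 3], 3, dtype=tf.float32)            # B, Lq

# 1. encode the context with a 2-layer TCN (shape stays B, Lc, D)
encoded = TCN_encode(context, num_layers=2)
# 2. match the encoded context against the query with multi-head attention
matched = Transformer_match(encoded, query, context_mask, query_mask,
                            num_heads=4, residual=True)               # B, Lc, D
# 3. pool over the time axis to get one fixed-size vector per example
pooled = SWEM_pool(matched, context_mask, reduce='CONCAT_MEAN_MAX')   # B, 2*D

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(pooled).shape)  # expected: (2, 8)
```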