├── .gitignore ├── LICENSE ├── README.md ├── app.py └── nlp ├── __init__.py ├── embed_blocks.py ├── encode_blocks.py ├── match_blocks.py ├── mulitask_blocks.py ├── nn.py └── pool_blocks.py /.gitignore: -------------------------------------------------------------------------------- 1 | models/ 2 | preprocessed/ 3 | raw/ 4 | *.pyc 5 | vocab.search 6 | local 7 | .idea/ 8 | add_copyright.py 9 | copyright 10 | .DS_Store 11 | data/vocab 12 | data/results/ 13 | _job*.yaml 14 | toy*.py 15 | tmp.json 16 | summary.json 17 | data/* 18 | .*.yaml -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Han Xiao 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # tf-nlp-blocks 2 | [![Python: 3.6](https://img.shields.io/badge/Python-3.6-brightgreen.svg)](https://opensource.org/licenses/MIT) [![Tensorflow: 1.6](https://img.shields.io/badge/Tensorflow-1.6-brightgreen.svg)](https://opensource.org/licenses/MIT) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 3 | 4 | 5 | Author: Han Xiao https://hanxiao.github.io 6 | 7 | 8 | A collection of frequently-used deep learning blocks I have implemented in Tensorflow. It covers the core tasks in NLP such as embedding, encoding, matching and pooling. All implementations follow a modularized design pattern which I called the "block-design". More details can be found [in my blog post](https://hanxiao.github.io/2018/06/25/4-Encoding-Blocks-You-Need-to-Know-Besides-LSTM-RNN-in-Tensorflow/). 9 | 10 | * [Requirements](#requirements) 11 | * [Contents](#contents) 12 | + [`encode_blocks.py`](#encode_blockspy) 13 | + [`match_blocks.py`](#match_blockspy) 14 | + [`pool_blocks.py`](#pool_blockspy) 15 | + [`embed_blocks.py`](#embed_blockspy) 16 | + [`mulitask_blocks.py`](#mulitask_blockspy) 17 | + [`nn.py`](#nnpy) 18 | * [Run](#run) 19 | 20 | 21 | ## Requirements 22 | 23 | - Python >= 3.6 24 | - Tensorflow >= 1.6 25 | 26 | ## Contents 27 | 28 | ### `encode_blocks.py` 29 | A collection of sequence encoding blocks. 
Input is a sequence with shape `[B, L, D]`; output is another sequence with shape `[B, L, D']`, where `B` is the batch size, `L` is the sequence length, and `D` and `D'` are the dimensions. 30 | 31 | | Name | Dependencies| Description | Reference | 32 | | --- | --- |--- |--- | 33 | | `LSTM_encode`| | a fast multi-layer bidirectional LSTM implementation based on [`CudnnLSTM`](https://www.tensorflow.org/api_docs/python/tf/contrib/cudnn_rnn/CudnnLSTM#call). Expect it to be 5~10x faster than the standard tf `LSTMCell`. However, it can only run on GPU. | [Tensorflow doc on `CudnnLSTM`](https://www.tensorflow.org/api_docs/python/tf/contrib/cudnn_rnn/CudnnLSTM#call)| 34 | | `TCN_encode` | `Res_DualCNN_encode`| a temporal convolution network described in the paper, essentially a multi-layer dilated CNN with special padding to ensure causality| [An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling](https://arxiv.org/pdf/1803.01271)| 35 | | `Res_DualCNN_encode` |`CNN_encode`| a sub-block used by `TCN_encode`. It is a two-layer CNN with spatial dropout in between, followed by a residual connection and a layer-norm.| [An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling](https://arxiv.org/pdf/1803.01271)| 36 | | `CNN_encode` | | a standard `conv1d` implementation on the `L` axis, with the possibility to set different paddings | [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)| 37 | 38 | ### `match_blocks.py` 39 | A collection of sequence matching blocks, a.k.a. attention. Inputs are two sequences: `context` in the shape of `[B, L_c, D]` and `query` in the shape of `[B, L_q, D]`. The output is a sequence of the same length as `context`, i.e. with shape `[B, L_c, D]`. Each position in the output encodes the relevance of that position in `context` to the complete `query`. 40 | 41 | | Name | Dependencies | Description | Reference | 42 | | --- | --- |--- |--- | 43 | |`Attentive_match`| |basic attention mechanism with different scoring functions, also supports future blinding.| `additive`: [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473); `scaled`: [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)| 44 | |`Transformer_match`| |a multi-head attention block from ["Attention is all you need"](https://arxiv.org/pdf/1706.03762.pdf)| [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)| 45 | |`AttentiveCNN_match`| `Attentive_match`|the light version of attentive convolution, with the possibility of future blinding to ensure causality. | [Attentive Convolution](https://arxiv.org/pdf/1710.00519)| 46 | |`BiDaf_match`| |the attention flow layer used in the BiDAF model. | [Bidirectional Attention Flow for Machine Comprehension](https://arxiv.org/abs/1611.01603)| 47 | 48 | ### `pool_blocks.py` 49 | A collection of pooling blocks. They fuse/reduce along the time axis `L`. Input is a sequence with shape `[B, L, D]`; output is in `[B, D]`. 50 | 51 | | Name | Dependencies | Description | Reference | 52 | | --- | --- |--- |--- | 53 | |`SWEM_pool`| | pools over the input sequence; supports max/avg. pooling and hierarchical avg.-max pooling. | [Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms](https://arxiv.org/abs/1805.09843) | 54 | 55 | There are also some convolution-based pooling blocks built on `SWEM_pool`, but they are for experimental purposes.
Thus, I will not list them here. 56 | 57 | ### `embed_blocks.py` 58 | A collection of positional encoding blocks for sequences. 59 | 60 | | Name | Dependencies | Description | Reference | 61 | | --- | --- |--- |--- | 62 | |`SinusPositional_embed`| | generates a sinusoid signal that has the same length as the input sequence | [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)| 63 | |`Positional_embed`| |parameterizes the absolute positions of the tokens in the input sequence | [A Convolutional Encoder Model for Neural Machine Translation](https://arxiv.org/pdf/1611.02344.pdf)| 64 | 65 | ### `mulitask_blocks.py` 66 | A collection of multi-task learning blocks. So far only the "cross-stitch block" is available. 67 | 68 | | Name | Dependencies | Description | Reference | 69 | | --- | --- |--- |--- | 70 | |`CrossStitch`||a cross-stitch block, modeling the correlation & self-correlation of two tasks| [Cross-stitch Networks for Multi-task Learning](https://arxiv.org/pdf/1604.03539)| 71 | |`Stack_CrossStitch`|`CrossStitch`|stacking multiple cross-stitch blocks together with shared/separated input| [Cross-stitch Networks for Multi-task Learning](https://arxiv.org/pdf/1604.03539)| 72 | 73 | 74 | ### `nn.py` 75 | A collection of auxiliary functions, e.g. masking, normalizing, slicing. 76 | 77 | 78 | ## Run 79 | 80 | Run `app.py` for a simple test on toy data. -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | from nlp.match_blocks import AttentiveCNN_match 9 | 10 | if __name__ == '__main__': 11 | # build two batch sequences 12 | # context batch has 2 seqs, with length 4 and 3, padded to 5 13 | # query batch has 2 seqs, with length 2 and 3, padded to 3 14 | # dimension of each symbol is 4 15 | context = tf.constant(np.random.random([2, 5, 4]), dtype=tf.float32) # B,Lc,D 16 | query = tf.constant(np.random.random([2, 3, 4]), dtype=tf.float32) # B,Lq,D 17 | # build mask 18 | context_mask = tf.sequence_mask([4, 3], 5, dtype=tf.float32, name='context_mask') 19 | query_mask = tf.sequence_mask([2, 3], 3, dtype=tf.float32, name='query_mask') 20 | 21 | # standard att-cnn 22 | output1 = AttentiveCNN_match(context, query, context_mask, query_mask, scope='attcnn1') 23 | # with res and layernorm 24 | output2 = AttentiveCNN_match(context, query, context_mask, query_mask, 25 | residual=True, normalize_output=True, scope='attcnn2') 26 | # ensure causality 27 | output3 = AttentiveCNN_match(context, query, context_mask, query_mask, 28 | causality=True, scope='attcnn3') 29 | 30 | with tf.Session() as sess: 31 | sess.run(tf.global_variables_initializer()) 32 | for v in sess.run([output1, output2, output3]): 33 | print(v) 34 | -------------------------------------------------------------------------------- /nlp/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hanxiao/tf-nlp-blocks/935a0cd3345b4ddc421f3eacd212d81570b071b2/nlp/__init__.py -------------------------------------------------------------------------------- /nlp/embed_blocks.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | 5 | def Positional_embed(batch_size, 6 | seq_length, 7 | max_seq_len, 8 | num_units, 9 | zero_pad=True, 10 |
scale=True, 11 | scope='Pos_Block', 12 | reuse=None): 13 | """Embeds a given tensor. 14 | Args: 15 | seqs: A `Tensor` with type `int32` or `int64` containing the ids 16 | to be looked up in `lookup table`. 17 | max_seq_len: An int. Vocabulary size. 18 | num_units: An int. Number of embedding hidden units. 19 | zero_pad: A boolean. If True, all the values of the fist row (id 0) 20 | should be constant zeros. 21 | scale: A boolean. If True. the outputs is multiplied by sqrt num_units. 22 | scope: Optional scope for `variable_scope`. 23 | reuse: Boolean, whether to reuse the weights of a previous layer 24 | by the same name. 25 | Returns: 26 | A `Tensor` with one more rank than inputs's. The last dimensionality 27 | should be `num_units`. 28 | """ 29 | with tf.variable_scope(scope, reuse=reuse): 30 | seqs = tf.tile(tf.expand_dims(tf.range(seq_length), 0), [batch_size, 1]) 31 | 32 | lookup_table = tf.get_variable('lookup_table', 33 | dtype=tf.float32, 34 | shape=[max_seq_len, num_units], 35 | initializer=tf.contrib.layers.xavier_initializer()) 36 | if zero_pad: 37 | lookup_table = tf.concat((tf.zeros(shape=[1, num_units]), 38 | lookup_table[1:, :]), 0) 39 | outputs = tf.nn.embedding_lookup(lookup_table, seqs) 40 | 41 | if scale: 42 | outputs = outputs * (num_units ** 0.5) 43 | 44 | return outputs 45 | 46 | 47 | def SinusPositional_embed(batch_size, 48 | seq_length, 49 | num_units, 50 | zero_pad=True, 51 | scale=True, 52 | scope='Sinus_Pos_Block', 53 | reuse=None): 54 | """Sinusoidal Positional Encoding. 55 | Args: 56 | inputs: A 2d Tensor with shape of (N, T). 57 | num_units: Output dimensionality 58 | zero_pad: Boolean. If True, all the values of the first row (id = 0) should be constant zero 59 | scale: Boolean. If True, the output will be multiplied by sqrt num_units(check details from paper) 60 | scope: Optional scope for `variable_scope`. 61 | reuse: Boolean, whether to reuse the weights of a previous layer 62 | by the same name. 63 | Returns: 64 | A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units' 65 | """ 66 | 67 | with tf.variable_scope(scope, reuse=reuse): 68 | seq = tf.tile(tf.expand_dims(tf.range(seq_length), 0), [batch_size, 1]) 69 | 70 | # First part of the PE function: sin and cos argument 71 | position_enc = np.array([ 72 | [pos / np.power(10000, 2. * i / num_units) for i in range(num_units)] 73 | for pos in range(seq_length)]) 74 | 75 | # Second part, apply the cosine to even columns and sin to odds. 
76 | position_enc[:, 0::2] = np.sin(position_enc[:, 0::2]) # dim 2i 77 | position_enc[:, 1::2] = np.cos(position_enc[:, 1::2]) # dim 2i+1 78 | 79 | # Convert to a float32 tensor so it matches the zero-padding row below 80 | lookup_table = tf.convert_to_tensor(position_enc, dtype=tf.float32) 81 | 82 | if zero_pad: 83 | lookup_table = tf.concat((tf.zeros(shape=[1, num_units]), 84 | lookup_table[1:, :]), 0) 85 | outputs = tf.nn.embedding_lookup(lookup_table, seq) 86 | 87 | if scale: 88 | outputs = outputs * num_units ** 0.5 89 | 90 | return outputs 91 | -------------------------------------------------------------------------------- /nlp/encode_blocks.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | 6 | import tensorflow as tf 7 | 8 | from nlp.nn import initializer, regularizer, spatial_dropout, get_lstm_init_state, dropout_res_layernorm 9 | 10 | 11 | def LSTM_encode(seqs, causality=False, scope='lstm_encode_block', reuse=None, **kwargs): 12 | with tf.variable_scope(scope, reuse=reuse): 13 | if causality: 14 | kwargs['direction'] = 'unidirectional' 15 | if 'num_units' not in kwargs or kwargs['num_units'] is None: 16 | kwargs['num_units'] = seqs.get_shape().as_list()[-1] 17 | batch_size = tf.shape(seqs)[0] 18 | _seqs = tf.transpose(seqs, [1, 0, 2]) # to T, B, D 19 | lstm = tf.contrib.cudnn_rnn.CudnnLSTM(**kwargs) 20 | init_state = get_lstm_init_state(batch_size, **kwargs) 21 | output = lstm(_seqs, init_state)[0] # 2nd return is state, ignore 22 | return tf.transpose(output, [1, 0, 2]) # back to B, T, D 23 | 24 | 25 | def TCN_encode(seqs, num_layers, scope='tcn_encode_block', reuse=None, **kwargs): 26 | with tf.variable_scope(scope, reuse=reuse): 27 | outputs = [seqs] 28 | for i in range(num_layers): 29 | dilation_size = 2 ** i 30 | out = Res_DualCNN_encode(outputs[-1], dilation=dilation_size, scope='res_biconv_%d' % i, **kwargs) 31 | outputs.append(out) 32 | return outputs[-1] 33 | 34 | 35 | def Res_DualCNN_encode(seqs, use_spatial_dropout=True, scope='res_biconv_block', reuse=None, **kwargs): 36 | with tf.variable_scope(scope, reuse=reuse): 37 | out1 = CNN_encode(seqs, scope='first_conv1d', **kwargs) 38 | if use_spatial_dropout: 39 | out1 = spatial_dropout(out1) 40 | out2 = CNN_encode(out1, scope='second_conv1d', **kwargs) 41 | if use_spatial_dropout: 42 | out2 = spatial_dropout(out2) 43 | return dropout_res_layernorm(seqs, out2, **kwargs) 44 | 45 | 46 | def CNN_encode(seqs, filter_size=3, dilation=1, 47 | num_filters=None, direction='forward', 48 | causality=False, 49 | act_fn=tf.nn.relu, 50 | scope=None, 51 | reuse=None, **kwargs): 52 | input_dim = seqs.get_shape().as_list()[-1] 53 | num_filters = num_filters if num_filters else input_dim 54 | 55 | # add causality: shift the whole seq to the right 56 | if causality: 57 | direction = 'forward' 58 | padding = (filter_size - 1) * dilation 59 | if direction == 'forward': 60 | pad_seqs = tf.pad(seqs, [[0, 0], [padding, 0], [0, 0]]) 61 | padding_scheme = 'VALID' 62 | elif direction == 'backward': 63 | pad_seqs = tf.pad(seqs, [[0, 0], [0, padding], [0, 0]]) 64 | padding_scheme = 'VALID' 65 | elif direction == 'none': 66 | pad_seqs = seqs # no padding, must set to SAME so that we have same length 67 | padding_scheme = 'SAME' 68 | else: 69 | raise NotImplementedError 70 | 71 | with tf.variable_scope(scope or 'causal_conv_%s_%s' % (filter_size, direction), reuse=reuse): 72 | return tf.layers.conv1d( 73 | pad_seqs, 74 | num_filters, 75 | filter_size, 76 | activation=act_fn, 77 | padding=padding_scheme, 78 |
dilation_rate=dilation, 79 | kernel_initializer=initializer, 80 | bias_initializer=tf.zeros_initializer(), 81 | kernel_regularizer=regularizer) 82 | -------------------------------------------------------------------------------- /nlp/match_blocks.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | 6 | import tensorflow as tf 7 | 8 | from nlp.encode_blocks import CNN_encode 9 | from nlp.nn import linear_logit, layer_norm, dropout_res_layernorm 10 | 11 | 12 | def AttentiveCNN_match(context, query, context_mask, 13 | query_mask, causality=False, 14 | scope='AttentiveCNN_Block', 15 | reuse=None, **kwargs): 16 | with tf.variable_scope(scope, reuse=reuse): 17 | direction = 'forward' if causality else 'none' 18 | cnn_wo_att = CNN_encode(context, filter_size=3, direction=direction, act_fn=None) 19 | att_context, _ = Attentive_match(context, query, context_mask, query_mask, causality=causality) 20 | cnn_att = CNN_encode(att_context, filter_size=1, direction=direction, act_fn=None) 21 | output = tf.nn.tanh(cnn_wo_att + cnn_att) 22 | return dropout_res_layernorm(context, output, **kwargs) 23 | 24 | 25 | def Attentive_match(context, query, context_mask, query_mask, 26 | num_units=None, 27 | score_func='scaled', causality=False, 28 | scope='attention_match_block', reuse=None, **kwargs): 29 | with tf.variable_scope(scope, reuse=reuse): 30 | batch_size, context_length, _ = context.get_shape().as_list() 31 | if num_units is None: 32 | num_units = context.get_shape().as_list()[-1] 33 | _, query_length, _ = query.get_shape().as_list() 34 | 35 | context = linear_logit(context, num_units, act_fn=tf.nn.relu, scope='context_mapping') 36 | query = linear_logit(query, num_units, act_fn=tf.nn.relu, scope='query_mapping') 37 | 38 | if score_func == 'dot': 39 | score = tf.matmul(context, query, transpose_b=True) 40 | elif score_func == 'bilinear': 41 | score = tf.matmul(linear_logit(context, num_units, scope='context_x_We'), query, transpose_b=True) 42 | elif score_func == 'scaled': 43 | score = tf.matmul(linear_logit(context, num_units, scope='context_x_We'), query, transpose_b=True) / \ 44 | (num_units ** 0.5) 45 | elif score_func == 'additive': 46 | score = tf.squeeze(linear_logit( 47 | tf.tanh(tf.tile(tf.expand_dims(linear_logit(context, num_units, scope='context_x_We'), axis=2), 48 | [1, 1, query_length, 1]) + 49 | tf.tile(tf.expand_dims(linear_logit(query, num_units, scope='query_x_We'), axis=1), 50 | [1, context_length, 1, 1])), 1, scope='x_ve'), axis=3) 51 | else: 52 | raise NotImplementedError 53 | 54 | mask = tf.matmul(tf.expand_dims(context_mask, -1), tf.expand_dims(query_mask, -1), transpose_b=True) 55 | paddings = tf.ones_like(mask) * (-2 ** 32 + 1) 56 | masked_score = tf.where(tf.equal(mask, 0), paddings, score) # B, Lc, Lq 57 | 58 | # Causality = Future blinding 59 | if causality: 60 | diag_vals = tf.ones_like(masked_score[0, :, :]) # (Lc, Lq) 61 | tril = tf.contrib.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense() # (Lc, Lq) 62 | masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(masked_score)[0], 1, 1]) # B, Lc, Lq 63 | 64 | paddings = tf.ones_like(masks) * (-2 ** 32 + 1) 65 | masked_score = tf.where(tf.equal(masks, 0), paddings, masked_score) # B, Lc, Lq 66 | 67 | query2context_score = tf.nn.softmax(masked_score, axis=2) * mask # B, Lc, Lq 68 | query2context_attention = tf.matmul(query2context_score, query) # B, Lc, D 69 | 70 | context2query_score = tf.nn.softmax(masked_score, 
axis=1) * mask # B, Lc, Lq 71 | context2query_attention = tf.matmul(context2query_score, context, transpose_a=True) # B, Lq, D 72 | 73 | return (query2context_attention, # B, Lc, D 74 | context2query_attention) # B, Lq, D 75 | 76 | 77 | def Transformer_match(context, 78 | query, 79 | context_mask, 80 | query_mask, 81 | num_units=None, 82 | num_heads=4, 83 | dropout_keep_rate=1.0, 84 | causality=False, 85 | scope='Transformer_Block', 86 | reuse=None, 87 | residual=False, 88 | normalize_output=False, 89 | **kwargs): 90 | """Applies multihead attention. 91 | 92 | Args: 93 | context: A 3d tensor with shape of [N, T_q, C_q]. 94 | query: A 3d tensor with shape of [N, T_k, C_k]. 95 | num_units: A scalar. Attention size. 96 | dropout_rate: A floating point number. 97 | is_training: Boolean. Controller of mechanism for dropout. 98 | causality: Boolean. If true, units that reference the future are masked. 99 | num_heads: An int. Number of heads. 100 | scope: Optional scope for `variable_scope`. 101 | reuse: Boolean, whether to reuse the weights of a previous layer 102 | by the same name. 103 | 104 | Returns 105 | A 3d tensor with shape of (N, T_q, C) 106 | """ 107 | if num_units is None or residual: 108 | num_units = context.get_shape().as_list()[-1] 109 | with tf.variable_scope(scope, reuse=reuse): 110 | # Set the fall back option for num_units 111 | 112 | # Linear projections 113 | Q = tf.layers.dense(context, num_units, activation=tf.nn.relu) # (N, T_q, C) 114 | K = tf.layers.dense(query, num_units, activation=tf.nn.relu) # (N, T_k, C) 115 | V = tf.layers.dense(query, num_units, activation=tf.nn.relu) # (N, T_k, C) 116 | 117 | # Split and concat 118 | Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, C/h) 119 | K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 120 | V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 121 | 122 | # Multiplication 123 | outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1])) # (h*N, T_q, T_k) 124 | 125 | # Scale 126 | outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5) 127 | 128 | # Key Masking, aka query 129 | if query_mask is None: 130 | query_mask = tf.sign(tf.abs(tf.reduce_sum(query, axis=-1))) # (N, T_k) 131 | 132 | mask1 = tf.tile(query_mask, [num_heads, 1]) # (h*N, T_k) 133 | mask1 = tf.tile(tf.expand_dims(mask1, 1), [1, tf.shape(context)[1], 1]) # (h*N, T_q, T_k) 134 | 135 | paddings = tf.ones_like(outputs) * (-2 ** 32 + 1) 136 | outputs = tf.where(tf.equal(mask1, 0), paddings, outputs) # (h*N, T_q, T_k) 137 | 138 | # Causality = Future blinding 139 | if causality: 140 | diag_vals = tf.ones_like(outputs[0, :, :]) # (T_q, T_k) 141 | tril = tf.contrib.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense() # (T_q, T_k) 142 | masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1]) # (h*N, T_q, T_k) 143 | 144 | paddings = tf.ones_like(masks) * (-2 ** 32 + 1) 145 | outputs = tf.where(tf.equal(masks, 0), paddings, outputs) # (h*N, T_q, T_k) 146 | 147 | # Activation 148 | outputs = tf.nn.softmax(outputs) # (h*N, T_q, T_k) 149 | 150 | # Query Masking aka, context 151 | if context_mask is None: 152 | context_mask = tf.sign(tf.abs(tf.reduce_sum(context, axis=-1))) # (N, T_q) 153 | 154 | mask2 = tf.tile(context_mask, [num_heads, 1]) # (h*N, T_q) 155 | mask2 = tf.tile(tf.expand_dims(mask2, -1), [1, 1, tf.shape(query)[1]]) # (h*N, T_q, T_k) 156 | outputs *= mask2 # (h*N, T_q, T_k) 157 | 158 | # Dropouts 159 | outputs = tf.nn.dropout(outputs, keep_prob=dropout_keep_rate) 160 | 161 | # 
Weighted sum 162 | outputs = tf.matmul(outputs, V_) # ( h*N, T_q, C/h) 163 | 164 | # Restore shape 165 | outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2) # (N, T_q, C) 166 | 167 | if residual: 168 | # Residual connection 169 | outputs += context 170 | 171 | if normalize_output: 172 | # Normalize 173 | outputs = layer_norm(outputs) # (N, T_q, C) 174 | 175 | return outputs 176 | 177 | 178 | def BiDaf_match(context, query, context_mask, query_mask, scope=None, reuse=None, **kwargs): 179 | # context: [batch, l, d] 180 | # question: [batch, l2, d] 181 | with tf.variable_scope(scope, reuse=reuse): 182 | n_a = tf.tile(tf.expand_dims(context, 2), [1, 1, tf.shape(query)[1], 1]) 183 | n_b = tf.tile(tf.expand_dims(query, 1), [1, tf.shape(context)[1], 1, 1]) 184 | 185 | n_ab = n_a * n_b 186 | n_abab = tf.concat([n_ab, n_a, n_b], -1) 187 | 188 | kernel = tf.squeeze(tf.layers.dense(n_abab, units=1), -1) 189 | 190 | context_mask = tf.expand_dims(context_mask, -1) 191 | query_mask = tf.expand_dims(query_mask, -1) 192 | kernel_mask = tf.matmul(context_mask, query_mask, transpose_b=True) 193 | kernel += (kernel_mask - 1) * 1e5 194 | 195 | con_query = tf.matmul(tf.nn.softmax(kernel, 1), query) 196 | con_query = con_query * context_mask 197 | 198 | query_con = tf.matmul(tf.transpose( 199 | tf.reduce_max(kernel, 2, keepdims=True), [0, 2, 1]), context * context_mask) 200 | query_con = tf.tile(query_con, [1, tf.shape(context)[1], 1]) 201 | return tf.concat([context * query_con, context * con_query, context, query_con], 2) 202 | 203 | 204 | def Stacked_Self_match(context, context_mask, self_match, 205 | num_layers=3, 206 | scope='Stacked_Self_match_Block', 207 | reuse=None, **kwargs): 208 | output = context 209 | with tf.variable_scope(scope, reuse=reuse): 210 | for j in range(num_layers): 211 | output = self_match(output, output, context_mask, context_mask, 212 | scope=scope + '_layer_%d' % j, **kwargs) 213 | return output 214 | -------------------------------------------------------------------------------- /nlp/mulitask_blocks.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | 6 | import tensorflow as tf 7 | 8 | from nlp.nn import get_var, linear_logit, get_cross_correlated_mat, get_self_correlated_mat, gate_filter 9 | 10 | 11 | def CrossStitch(logit_A, logit_B, 12 | pA_given_B, pB_given_A, 13 | pA_corr=None, pB_corr=None, 14 | reduce_transfer='LEARNED', 15 | scope=None, reuse=None, 16 | **kwargs): 17 | with tf.variable_scope(scope or 'Cross_Stitch_Block', reuse=reuse): 18 | trans_B = tf.matmul(logit_A, pB_given_A) # batch x num B 19 | trans_A = tf.matmul(logit_B, pA_given_B, transpose_b=True) # batch x num A 20 | 21 | if pA_corr is not None: 22 | logit_A = tf.matmul(logit_A, pA_corr) 23 | if pB_corr is not None: 24 | logit_B = tf.matmul(logit_B, pB_corr) 25 | 26 | if reduce_transfer == 'LEARNED': 27 | res_rate_A = tf.sigmoid(get_var('taskA_weight', shape=[1, logit_A.get_shape().as_list()[-1]])) 28 | res_rate_B = tf.sigmoid(get_var('taskB_weight', shape=[1, logit_B.get_shape().as_list()[-1]])) 29 | output_A = res_rate_A * logit_A + (1 - res_rate_A) * trans_A # batch x num A 30 | output_B = res_rate_B * logit_B + (1 - res_rate_B) * trans_B # batch x num B 31 | elif reduce_transfer == 'MEAN': 32 | output_A = (logit_A + trans_A) / 2 33 | output_B = (logit_B + trans_B) / 2 34 | elif reduce_transfer == 'GATE': 35 | output_A = logit_A * tf.sigmoid(trans_A) 36 | output_B = logit_B * 
tf.sigmoid(trans_B) 37 | elif reduce_transfer == 'RES_GATE': 38 | output_A = logit_A + logit_A * tf.sigmoid(trans_A) 39 | output_B = logit_B + logit_B * tf.sigmoid(trans_B) 40 | else: 41 | raise NotImplementedError 42 | 43 | return output_A, output_B 44 | 45 | 46 | def Stack_CrossStitch_old(shared_input, cooc_AB=None, num_layers=3, 47 | num_out_A=None, num_out_B=None, 48 | transform=linear_logit, 49 | learn_cooc='FIXED', 50 | gated=False, 51 | self_correlation=False, 52 | scope=None, 53 | reuse=None, 54 | **kwargs): 55 | with tf.variable_scope(scope or 'Stacked_Cross_Stitch_Block', reuse=reuse): 56 | pA_given_B, pB_given_A = get_cross_correlated_mat(num_out_A, num_out_B, learn_cooc, cooc_AB) 57 | 58 | pA_corr = get_self_correlated_mat(num_out_A, scope='self_corr_a') if self_correlation else None 59 | pB_corr = get_self_correlated_mat(num_out_B, scope='self_corr_b') if self_correlation else None 60 | 61 | iter_shared = shared_input 62 | out_A, out_B = None, None 63 | for j in range(num_layers): 64 | with tf.variable_scope('cross_stitch_block_%d' % j): 65 | iter_inputA = gate_filter(iter_shared, 'threshold_a') if gated else iter_shared 66 | iter_inputB = gate_filter(iter_shared, 'threshold_b') if gated else iter_shared 67 | logit_A = transform(iter_inputA, num_out_A, scope='dense_A', **kwargs) 68 | logit_B = transform(iter_inputB, num_out_B, scope='dense_B', **kwargs) 69 | out_A, out_B = CrossStitch(logit_A, logit_B, 70 | pA_given_B, pB_given_A, 71 | pA_corr, pB_corr, 72 | **kwargs) 73 | iter_shared = tf.concat([out_A, out_B], axis=-1) 74 | return out_A, out_B 75 | 76 | 77 | def Stack_CrossStitch(shared_input=None, init_input_A=None, init_input_B=None, 78 | cooc_AB=None, num_layers=3, 79 | num_out_A=None, num_out_B=None, 80 | transform=linear_logit, 81 | learn_cooc='FIXED', 82 | self_correlation=False, 83 | scope=None, 84 | reuse=None, 85 | **kwargs): 86 | if shared_input is None and (init_input_A is None or init_input_B is None): 87 | raise AttributeError('Must specify shared input or init_input') 88 | 89 | with tf.variable_scope(scope or 'Stacked_Cross_Stitch_Block', reuse=reuse): 90 | pA_given_B, pB_given_A = get_cross_correlated_mat(num_out_A, num_out_B, learn_cooc, cooc_AB) 91 | 92 | pA_corr = get_self_correlated_mat(num_out_A, scope='self_corr_a') if self_correlation else None 93 | pB_corr = get_self_correlated_mat(num_out_B, scope='self_corr_b') if self_correlation else None 94 | 95 | iter_inputA = init_input_A if init_input_A is not None else shared_input 96 | iter_inputB = init_input_B if init_input_B is not None else shared_input 97 | 98 | out_A, out_B = None, None 99 | for j in range(num_layers): 100 | with tf.variable_scope('cross_stitch_block_%d' % j): 101 | logit_A = transform(iter_inputA, num_out_A, scope='dense_A', **kwargs) 102 | logit_B = transform(iter_inputB, num_out_B, scope='dense_B', **kwargs) 103 | out_A, out_B = CrossStitch(logit_A, logit_B, 104 | pA_given_B, pB_given_A, 105 | pA_corr, pB_corr, 106 | **kwargs) 107 | iter_inputA = tf.concat([out_A, out_B], axis=-1) 108 | iter_inputB = iter_inputA 109 | return out_A, out_B 110 | -------------------------------------------------------------------------------- /nlp/nn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | 6 | import tensorflow as tf 7 | 8 | initializer = tf.contrib.layers.variance_scaling_initializer(factor=1.0, 9 | mode='FAN_AVG', 10 | uniform=True, 11 | dtype=tf.float32) 12 | initializer_relu = 
tf.contrib.layers.variance_scaling_initializer(factor=2.0, 13 | mode='FAN_IN', 14 | uniform=False, 15 | dtype=tf.float32) 16 | regularizer = tf.contrib.layers.l2_regularizer(scale=3e-7) 17 | 18 | 19 | def minus_mask(x, mask, offset=1e30): 20 | """ 21 | masking by subtract a very large number 22 | :param x: sequence data in the shape of [B, L, D] 23 | :param mask: 0-1 mask in the shape of [B, L] 24 | :param offset: very large negative number 25 | :return: masked x 26 | """ 27 | return x - tf.expand_dims(1.0 - mask, axis=-1) * offset 28 | 29 | 30 | def mul_mask(x, mask): 31 | """ 32 | masking by multiply zero 33 | :param x: sequence data in the shape of [B, L, D] 34 | :param mask: 0-1 mask in the shape of [B, L] 35 | :return: masked x 36 | """ 37 | return x * tf.expand_dims(mask, axis=-1) 38 | 39 | 40 | def masked_reduce_mean(x, mask): 41 | return tf.reduce_sum(mul_mask(x, mask), axis=1) / tf.reduce_sum(mask, axis=1, keepdims=True) 42 | 43 | 44 | def masked_reduce_max(x, mask): 45 | return tf.reduce_max(minus_mask(x, mask), axis=1) 46 | 47 | 48 | def weighted_sparse_softmax_cross_entropy(labels, preds, weights): 49 | """ 50 | computing sparse softmax cross entropy by weighting differently on classes 51 | :param labels: sparse label in the shape of [B], size of label is L 52 | :param preds: logit in the shape of [B, L] 53 | :param weights: weight in the shape of [L] 54 | :return: weighted sparse softmax cross entropy in the shape of [B] 55 | """ 56 | 57 | return tf.losses.sparse_softmax_cross_entropy(labels, 58 | logits=preds, 59 | weights=get_bounded_class_weight(labels, weights)) 60 | 61 | 62 | def get_bounded_class_weight(labels, weights, ub=None): 63 | if weights is None: 64 | return 1.0 65 | else: 66 | w = tf.gather(weights, labels) 67 | w = w / tf.reduce_min(w) 68 | w = tf.clip_by_value(1.0 + tf.log1p(w), 69 | clip_value_min=1.0, 70 | clip_value_max=ub if ub is not None else tf.cast(tf.shape(weights)[0], tf.float32) / 2.0) 71 | return w 72 | 73 | 74 | def weighted_smooth_softmax_cross_entropy(labels, num_labels, preds, weights, 75 | epsilon=0.1): 76 | """ 77 | computing smoothed softmax cross entropy by weighting differently on classes 78 | :param epsilon: smoothing factor 79 | :param num_labels: maximum number of labels 80 | :param labels: sparse label in the shape of [B], size of label is L 81 | :param preds: logit in the shape of [B, L] 82 | :param weights: weight in the shape of [L] 83 | :return: weighted sparse softmax cross entropy in the shape of [B] 84 | """ 85 | 86 | return tf.losses.softmax_cross_entropy(tf.one_hot(labels, num_labels), 87 | logits=preds, 88 | label_smoothing=epsilon, 89 | weights=get_bounded_class_weight(labels, weights)) 90 | 91 | 92 | def get_var(name, shape, dtype=tf.float32, 93 | initializer_fn=initializer, 94 | regularizer_fn=regularizer, **kwargs): 95 | return tf.get_variable(name, shape, 96 | initializer=initializer_fn, 97 | dtype=dtype, 98 | regularizer=regularizer_fn, **kwargs) 99 | 100 | 101 | def layer_norm(inputs, 102 | epsilon=1e-8, 103 | scope=None, 104 | reuse=None): 105 | """Applies layer normalization. 106 | 107 | Args: 108 | inputs: A tensor with 2 or more dimensions, where the first dimension has 109 | `batch_size`. 110 | epsilon: A floating number. A very small number for preventing ZeroDivision Error. 111 | scope: Optional scope for `variable_scope`. 112 | reuse: Boolean, whether to reuse the weights of a previous layer 113 | by the same name. 114 | 115 | Returns: 116 | A tensor with the same shape and data dtype as `inputs`. 
117 | """ 118 | with tf.variable_scope(scope or 'Layer_Normalize', reuse=reuse): 119 | inputs_shape = inputs.get_shape() 120 | params_shape = inputs_shape[-1:] 121 | 122 | mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True) 123 | beta = tf.Variable(tf.zeros(params_shape)) 124 | gamma = tf.Variable(tf.ones(params_shape)) 125 | normalized = (inputs - mean) / ((variance + epsilon) ** .5) 126 | outputs = gamma * normalized + beta 127 | 128 | return outputs 129 | 130 | 131 | def linear_logit(x, units, act_fn=None, dropout_keep=1., use_layer_norm=False, scope=None, reuse=None, **kwargs): 132 | with tf.variable_scope(scope or 'linear_logit', reuse=reuse): 133 | logit = tf.layers.dense(x, units=units, activation=act_fn, 134 | kernel_initializer=initializer, 135 | kernel_regularizer=regularizer) 136 | # do dropout 137 | logit = tf.nn.dropout(logit, keep_prob=dropout_keep) 138 | if use_layer_norm: 139 | logit = tf.contrib.layers.layer_norm(logit) 140 | return logit 141 | 142 | 143 | def bilinear_logit(x, units, act_fn=None, 144 | first_units=256, 145 | first_act_fn=tf.nn.relu, scope=None, **kwargs): 146 | with tf.variable_scope(scope or 'bilinear_logit'): 147 | first = linear_logit(x, first_units, act_fn=first_act_fn, scope='first', **kwargs) 148 | return linear_logit(first, units, scope='second', act_fn=act_fn, **kwargs) 149 | 150 | 151 | def label_smoothing(inputs, epsilon=0.1): 152 | """Applies label smoothing. See https://arxiv.org/abs/1512.00567. 153 | 154 | Args: 155 | inputs: A 3d tensor with shape of [N, T, V], where V is the number of vocabulary. 156 | epsilon: Smoothing rate. 157 | 158 | For example, 159 | 160 | ``` 161 | import tensorflow as tf 162 | inputs = tf.convert_to_tensor([[[0, 0, 1], 163 | [0, 1, 0], 164 | [1, 0, 0]], 165 | [[1, 0, 0], 166 | [1, 0, 0], 167 | [0, 1, 0]]], tf.float32) 168 | 169 | outputs = label_smoothing(inputs) 170 | 171 | with tf.Session() as sess: 172 | print(sess.run([outputs])) 173 | 174 | >> 175 | [array([[[ 0.03333334, 0.03333334, 0.93333334], 176 | [ 0.03333334, 0.93333334, 0.03333334], 177 | [ 0.93333334, 0.03333334, 0.03333334]], 178 | [[ 0.93333334, 0.03333334, 0.03333334], 179 | [ 0.93333334, 0.03333334, 0.03333334], 180 | [ 0.03333334, 0.93333334, 0.03333334]]], dtype=float32)] 181 | ``` 182 | """ 183 | K = inputs.get_shape().as_list()[-1] # number of channels 184 | return ((1 - epsilon) * inputs) + (epsilon / K) 185 | 186 | 187 | def normalize_by_axis(x, axis, smooth_factor=1e-5): 188 | x += smooth_factor 189 | return x / tf.reduce_sum(x, axis, keepdims=True) # num A x num B 190 | 191 | 192 | def get_cross_correlated_mat(num_out_A, num_out_B, learn_cooc='FIXED', cooc_AB=None, scope=None, reuse=None): 193 | with tf.variable_scope(scope or 'CrossCorrlated_Mat', reuse=reuse): 194 | if learn_cooc == 'FIXED' and cooc_AB is not None: 195 | pB_given_A = normalize_by_axis(cooc_AB, 1) 196 | pA_given_B = normalize_by_axis(cooc_AB, 0) 197 | elif learn_cooc == 'JOINT': 198 | share_cooc = tf.nn.relu(get_var('cooc_ab', shape=[num_out_A, num_out_B])) 199 | pB_given_A = normalize_by_axis(share_cooc, 1) 200 | pA_given_B = normalize_by_axis(share_cooc, 0) 201 | elif learn_cooc == 'DISJOINT': 202 | cooc1 = tf.nn.relu(get_var('pb_given_a', shape=[num_out_A, num_out_B])) 203 | cooc2 = tf.nn.relu(get_var('pa_given_b', shape=[num_out_A, num_out_B])) 204 | pB_given_A = normalize_by_axis(cooc1, 1) 205 | pA_given_B = normalize_by_axis(cooc2, 0) 206 | else: 207 | raise NotImplementedError 208 | 209 | return pA_given_B, pB_given_A 210 | 211 | 212 | def 
get_self_correlated_mat(num_out_A, scope=None, reuse=None): 213 | with tf.variable_scope(scope or 'Self_Correlated_mat', reuse=reuse): 214 | cooc1 = get_var('pa_corr', shape=[num_out_A, num_out_A], 215 | initializer_fn=tf.contrib.layers.variance_scaling_initializer(factor=0.1, 216 | mode='FAN_AVG', 217 | uniform=True, 218 | dtype=tf.float32), 219 | regularizer_fn=tf.contrib.layers.l2_regularizer(scale=3e-4)) 220 | return tf.matmul(cooc1, cooc1, transpose_b=True) + tf.eye(num_out_A) 221 | 222 | 223 | def gate_filter(x, scope=None, reuse=None): 224 | with tf.variable_scope(scope or 'Gate', reuse=reuse): 225 | threshold = get_var('threshold', shape=[]) 226 | gate = tf.cast(tf.greater(x, threshold), tf.float32) 227 | return x * gate 228 | 229 | 230 | from tensorflow.python.ops import array_ops 231 | 232 | 233 | def focal_loss(prediction_tensor, target_tensor, weights=None, alpha=0.25, gamma=2): 234 | r"""Compute focal loss for predictions. 235 | Multi-labels Focal loss formula: 236 | FL = -alpha * (z-p)^gamma * log(p) -(1-alpha) * p^gamma * log(1-p) 237 | ,which alpha = 0.25, gamma = 2, p = sigmoid(x), z = target_tensor. 238 | Args: 239 | prediction_tensor: A float tensor of shape [batch_size, num_anchors, 240 | num_classes] representing the predicted logits for each class 241 | target_tensor: A float tensor of shape [batch_size, num_anchors, 242 | num_classes] representing one-hot encoded classification targets 243 | weights: A float tensor of shape [batch_size, num_anchors] 244 | alpha: A scalar tensor for focal loss alpha hyper-parameter 245 | gamma: A scalar tensor for focal loss gamma hyper-parameter 246 | Returns: 247 | loss: A (scalar) tensor representing the value of the loss function 248 | """ 249 | sigmoid_p = tf.nn.sigmoid(prediction_tensor) 250 | zeros = array_ops.zeros_like(sigmoid_p, dtype=sigmoid_p.dtype) 251 | 252 | # For poitive prediction, only need consider front part loss, back part is 0; 253 | # target_tensor > zeros <=> z=1, so poitive coefficient = z - p. 254 | pos_p_sub = array_ops.where(target_tensor > zeros, target_tensor - sigmoid_p, zeros) 255 | 256 | # For negative prediction, only need consider back part loss, front part is 0; 257 | # target_tensor > zeros <=> z=1, so negative coefficient = 0. 258 | neg_p_sub = array_ops.where(target_tensor > zeros, zeros, sigmoid_p) 259 | per_entry_cross_ent = - alpha * (pos_p_sub ** gamma) * tf.log(tf.clip_by_value(sigmoid_p, 1e-8, 1.0)) \ 260 | - (1 - alpha) * (neg_p_sub ** gamma) * tf.log(tf.clip_by_value(1.0 - sigmoid_p, 1e-8, 1.0)) 261 | return tf.reduce_sum(per_entry_cross_ent) 262 | 263 | 264 | def spatial_dropout(x, scope=None, reuse=None): 265 | input_dim = x.get_shape().as_list()[-1] 266 | with tf.variable_scope(scope or 'spatial_dropout', reuse=reuse): 267 | d = tf.random_uniform(shape=[1], minval=0, maxval=input_dim, dtype=tf.int32) 268 | f = tf.one_hot(d, on_value=0., off_value=1., depth=input_dim) 269 | g = x * f # do dropout 270 | g *= (1. + 1. / input_dim) # do rescale 271 | return g 272 | 273 | 274 | def get_last_output(output, seq_length, scope=None, reuse=None): 275 | """Get the last value of the returned output of an RNN. 276 | http://disq.us/p/1gjkgdr 277 | output: [batch x number of steps x ... ] Output of the dynamic lstm. 278 | sequence_length: [batch] Length of each of the sequence. 
279 | """ 280 | with tf.variable_scope(scope or 'gather_nd', reuse=reuse): 281 | rng = tf.range(0, tf.shape(seq_length)[0]) 282 | indexes = tf.stack([rng, seq_length - 1], 1) 283 | return tf.gather_nd(output, indexes) 284 | 285 | 286 | def get_lstm_init_state(batch_size, num_layers, num_units, direction, scope=None, reuse=None, **kwargs): 287 | with tf.variable_scope(scope or 'lstm_init_state', reuse=reuse): 288 | num_dir = 2 if direction.startswith('bi') else 1 289 | c = get_var('lstm_init_c', shape=[num_layers * num_dir, num_units]) 290 | c = tf.tile(tf.expand_dims(c, axis=1), [1, batch_size, 1]) 291 | h = get_var('lstm_init_h', shape=[num_layers * num_dir, num_units]) 292 | h = tf.tile(tf.expand_dims(h, axis=1), [1, batch_size, 1]) 293 | return c, h 294 | 295 | 296 | def dropout_res_layernorm(x, fx, act_fn=tf.nn.relu, 297 | dropout_keep_rate=1.0, 298 | residual=True, 299 | normalize_output=True, 300 | scope='rnd_block', 301 | reuse=None, **kwargs): 302 | with tf.variable_scope(scope, reuse=reuse): 303 | input_dim = x.get_shape().as_list()[-1] 304 | output_dim = fx.get_shape().as_list()[-1] 305 | 306 | # do dropout 307 | fx = tf.nn.dropout(fx, keep_prob=dropout_keep_rate) 308 | 309 | if residual and input_dim != output_dim: 310 | res_x = tf.layers.conv1d(x, 311 | filters=output_dim, 312 | kernel_size=1, 313 | activation=None, 314 | name='res_1x1conv') 315 | else: 316 | res_x = x 317 | 318 | if residual and act_fn is None: 319 | output = fx + res_x 320 | elif residual and act_fn is not None: 321 | output = act_fn(fx + res_x) 322 | else: 323 | output = fx 324 | 325 | if normalize_output: 326 | output = layer_norm(output) 327 | 328 | return output 329 | -------------------------------------------------------------------------------- /nlp/pool_blocks.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Han Xiao 4 | 5 | 6 | import tensorflow as tf 7 | 8 | from nlp.encode_blocks import CNN_encode 9 | from nlp.match_blocks import Transformer_match 10 | from nlp.nn import masked_reduce_max, masked_reduce_mean, get_var, minus_mask 11 | 12 | 13 | def SWEM_pool(seqs, mask, reduce='CONCAT_MEAN_MAX', scope=None, reuse=None, 14 | task_name=None, norm_by_layer=False, dropout_keep=1., **kwargs): 15 | """ 16 | Simple word embedding fusion 17 | """ 18 | with tf.variable_scope(scope or 'SWEM_Block', reuse=reuse): 19 | if reduce == 'MAX': 20 | pooled = masked_reduce_max(seqs, mask) 21 | elif reduce == 'CONCAT_MEAN_MAX': # often best 22 | pooled = tf.concat([masked_reduce_mean(seqs, mask), 23 | masked_reduce_max(seqs, mask)], axis=1) 24 | elif reduce == 'MEAN': # often worst 25 | pooled = masked_reduce_mean(seqs, mask) 26 | elif reduce == 'LEVEL_MEAN_MAX': 27 | # 1. segment sequence in a small window 28 | # 2. do average pooling on each window 29 | # 3. 
do max pooling across all windows 30 | def reshape_seqs(x, avg_window_size=3, **kwargs): 31 | B = tf.shape(x)[0] 32 | L = tf.cast(tf.shape(x)[1], tf.float32) 33 | D = x.get_shape().as_list()[-1] 34 | b = tf.transpose(x, [0, 2, 1]) 35 | extra_pads = tf.cast(tf.ceil(L / avg_window_size) * avg_window_size - L, tf.int32) 36 | c = tf.pad(b, tf.concat([tf.zeros([2, 2], dtype=tf.int32), [[0, extra_pads]]], axis=0)) 37 | return tf.reshape(c, [B, D, avg_window_size, -1]) 38 | 39 | # avg pooling with mask, be careful with all zero mask 40 | d = tf.reduce_sum(reshape_seqs(seqs, **kwargs), axis=2) / \ 41 | tf.reduce_sum(reshape_seqs(tf.expand_dims(mask + 1e-10, axis=-1), **kwargs), axis=2) 42 | 43 | # max pooling 44 | pooled = tf.reduce_max(d, axis=2) 45 | elif reduce == 'SELF_ATTENTION': 46 | D = seqs.get_shape().as_list()[-1] 47 | query = tf.tile(get_var('%s_query' % (task_name if task_name else ''), 48 | shape=[1, D, 1]), [tf.shape(seqs)[0], 1, 1]) 49 | score = tf.nn.softmax(minus_mask(tf.matmul(seqs, query) / (D ** 0.5), mask), axis=1) # [B,L,1] 50 | pooled = tf.reduce_sum(score * seqs, axis=1) # [B, D] 51 | else: 52 | raise NotImplementedError 53 | 54 | pooled = tf.nn.dropout(pooled, keep_prob=dropout_keep) 55 | if norm_by_layer: 56 | pooled = tf.contrib.layers.layer_norm(pooled) 57 | 58 | return pooled 59 | 60 | 61 | # the following blocks are for experimental purpose!!! 62 | # i just keep them here in case i need them 63 | 64 | def MH_Att_pool(seqs, mask, scope=None, reuse=None, **kwargs): 65 | with tf.variable_scope(scope or 'MultiHead_Attention_Pool_Block', reuse=reuse): 66 | rep = Transformer_match(seqs, seqs, **kwargs) 67 | return SWEM_pool(rep, mask, **kwargs) 68 | 69 | 70 | def Stack_MH_Att_pool(seqs, mask, num_layers=5, num_units=None, 71 | concat_output=True, 72 | residual=True, scope=None, reuse=None, **kwargs): 73 | if num_units is None or residual: 74 | num_units = seqs.get_shape().as_list()[-1] 75 | pooled_outputs = [] 76 | with tf.variable_scope(scope or 'Stack_MultiHead_Attention_Pool_Block', reuse=reuse): 77 | res_rep = None 78 | iter_rep = seqs 79 | for idx in range(num_layers): 80 | with tf.variable_scope("multi_head_pool_%s" % idx): 81 | iter_rep = Transformer_match(iter_rep, iter_rep, num_units, **kwargs) 82 | # res 83 | if residual and (res_rep is not None): 84 | iter_rep = iter_rep + res_rep 85 | res_rep = iter_rep 86 | pooled = SWEM_pool(iter_rep, mask, **kwargs) 87 | pooled_outputs.append(pooled) 88 | return tf.concat(pooled_outputs, axis=1) if concat_output else pooled_outputs[-1] 89 | 90 | 91 | def Stack_CNN_pool(seqs, mask, filter_sizes=(3, 4, 5), num_units=None, 92 | residual=True, concat_output=True, scope=None, reuse=None, 93 | **kwargs): 94 | if num_units is None or residual: 95 | num_units = seqs.get_shape().as_list()[-1] 96 | pooled_outputs = [] 97 | with tf.variable_scope(scope or 'Deep_CNN_Residual_Block', reuse=reuse): 98 | res_rep = None 99 | iter_rep = seqs 100 | for fz in filter_sizes: 101 | with tf.variable_scope("conv_pool_%s" % fz): 102 | conv_relu = CNN_encode(iter_rep, fz, 2 * num_units) 103 | # gate 104 | map_res_a, map_res_b = tf.split(conv_relu, num_or_size_splits=2, axis=2) 105 | iter_rep = map_res_a * tf.nn.sigmoid(map_res_b) 106 | # res 107 | if residual and (res_rep is not None): 108 | iter_rep = iter_rep + res_rep 109 | res_rep = iter_rep 110 | pooled = SWEM_pool(iter_rep, mask, **kwargs) 111 | pooled_outputs.append(pooled) 112 | return tf.concat(pooled_outputs, axis=1) if concat_output else pooled_outputs[-1] 113 | 114 | 115 | def 
CNN_pool(seqs, mask, filter_size, num_units=None, scope=None, reuse=None, **kwargs): 116 | if num_units is None: 117 | num_units = seqs.get_shape().as_list()[-1] 118 | with tf.variable_scope(scope or 'CNN_Pool_Block', reuse=reuse): 119 | conv_relu = CNN_encode(seqs, filter_size, num_units) 120 | pooled = SWEM_pool(conv_relu, mask, **kwargs) 121 | return pooled 122 | 123 | 124 | def MR_CNN_pool(seqs, mask, filter_sizes=(3, 4, 5), 125 | highway=False, scope=None, reuse=None, **kwargs): 126 | """ 127 | Multi-resolution CNN fusion 128 | """ 129 | with tf.variable_scope(scope or 'MR_CNN_Block', reuse=reuse): 130 | outputs = [SWEM_pool(seqs, mask, **kwargs)] if highway else [] 131 | for fz in filter_sizes: 132 | outputs.append(CNN_pool(seqs, mask, fz, scope='cnn_pool_%d' % fz, **kwargs)) 133 | return tf.concat(outputs, axis=1) 134 | 135 | 136 | def MH_MR_CNN_pool_v2(seqs, mask, num_heads=4, num_units=256, 137 | scope=None, reuse=None, **kwargs): 138 | with tf.variable_scope(scope or 'MultiHead_MCNN_Block', reuse=reuse): 139 | outputs = [] 140 | part_seqs = tf.split(seqs, num_heads, axis=2) 141 | for idx, part_seq in enumerate(part_seqs): 142 | outputs.append(MR_CNN_pool(part_seq, mask, 143 | num_units=int(num_units / num_heads), 144 | scope='mr_cnn_head_%d' % idx, **kwargs)) 145 | return tf.concat(outputs, axis=1) 146 | 147 | 148 | def MH_MR_CNN_pool(seqs, mask, num_heads=3, 149 | scope=None, reuse=None, **kwargs): 150 | """ 151 | Multi-head multi-resolution CNN fusion 152 | """ 153 | with tf.variable_scope(scope or 'MultiHead_MCNN_Block', reuse=reuse): 154 | outputs = [] 155 | for j in range(num_heads): 156 | outputs.append(MR_CNN_pool(seqs, mask, scope='mr_cnn_block_%d' % j, **kwargs)) 157 | return tf.add_n(outputs) 158 | --------------------------------------------------------------------------------
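As a closing note (not part of the repository files above): the sketch below shows one way the blocks could be chained end-to-end — encode, then match, then pool — reusing the toy shapes from `app.py`. It is a minimal, hypothetical usage sketch under TF 1.x; the wiring and hyper-parameters (`num_layers=2`, `num_heads=4`, `reduce='CONCAT_MEAN_MAX'`) are illustrative assumptions, and only the imported block names come from the files above.

```python
import numpy as np
import tensorflow as tf

from nlp.encode_blocks import TCN_encode
from nlp.match_blocks import Transformer_match
from nlp.pool_blocks import SWEM_pool

# toy batches, same shapes as in app.py: B=2, Lc=5, Lq=3, D=4
context = tf.constant(np.random.random([2, 5, 4]), dtype=tf.float32)  # B, Lc, D
query = tf.constant(np.random.random([2, 3, 4]), dtype=tf.float32)    # B, Lq, D
context_mask = tf.sequence_mask([4, 3], 5, dtype=tf.float32)          # B, Lc
query_mask = tf.sequence_mask([2, 3], 3, dtype=tf.float32)            # B, Lq

# 1. encode the context with a 2-layer TCN (shape stays B, Lc, D)
encoded = TCN_encode(context, num_layers=2)
# 2. match the encoded context against the query with multi-head attention
matched = Transformer_match(encoded, query, context_mask, query_mask,
                            num_heads=4, residual=True)               # B, Lc, D
# 3. pool over the time axis to get one fixed-size vector per example
pooled = SWEM_pool(matched, context_mask, reduce='CONCAT_MEAN_MAX')   # B, 2*D

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(pooled).shape)  # expected: (2, 8)
```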