├── 0.0- Hierarchical Attention.py ├── 0.1 -Hierarchical Attention.py ├── 0.2 Hierarchical Attention.py ├── 1.0- addictive attention.py ├── 2.0- Bahdanau_attention.py ├── 3.0- Soft_attention.py ├── 4.0 -Luong_attention.py ├── 5.0-recognizing_entailment.py ├── Images ├── Bahdanau_attention.png ├── alignments.png ├── attention-mechanisms.png ├── demo.txt ├── diff.png ├── ml.png └── white.png ├── Keras_Multi-head_attention.py ├── Keras_Multihead_attention_1.py ├── Multi-Head_attention.py ├── Multiple_Multi_head_attention.py ├── README.md ├── Sentence_level_Hierarchical_Attention.py ├── Tensorflow_Attention_apis ├── Bahdanau_attention.ipynb └── Luong_Attention.ipynb ├── Word_level_Hierarchical_Attention.py ├── scaled_dot_product_attention.py └── simplest_self_attention.py /0.0- Hierarchical Attention.py: -------------------------------------------------------------------------------- 1 | #From https://github.com/ilivans/tf-rnn-attention/blob/master/attention.py 2 | 3 | import tensorflow as tf 4 | 5 | 6 | def attention(inputs, attention_size, time_major=False, return_alphas=False): 7 | """ 8 | Attention mechanism layer which reduces RNN/Bi-RNN outputs with Attention vector. 9 | The idea was proposed in the article by Z. Yang et al., "Hierarchical Attention Networks 10 | for Document Classification", 2016: http://www.aclweb.org/anthology/N16-1174. 11 | Variables notation is also inherited from the article 12 | 13 | Args: 14 | inputs: The Attention inputs. 15 | Matches outputs of RNN/Bi-RNN layer (not final state): 16 | In case of RNN, this must be RNN outputs `Tensor`: 17 | If time_major == False (default), this must be a tensor of shape: 18 | `[batch_size, max_time, cell.output_size]`. 19 | If time_major == True, this must be a tensor of shape: 20 | `[max_time, batch_size, cell.output_size]`. 21 | In case of Bidirectional RNN, this must be a tuple (outputs_fw, outputs_bw) containing the forward and 22 | the backward RNN outputs `Tensor`. 23 | If time_major == False (default), 24 | outputs_fw is a `Tensor` shaped: 25 | `[batch_size, max_time, cell_fw.output_size]` 26 | and outputs_bw is a `Tensor` shaped: 27 | `[batch_size, max_time, cell_bw.output_size]`. 28 | If time_major == True, 29 | outputs_fw is a `Tensor` shaped: 30 | `[max_time, batch_size, cell_fw.output_size]` 31 | and outputs_bw is a `Tensor` shaped: 32 | `[max_time, batch_size, cell_bw.output_size]`. 33 | attention_size: Linear size of the Attention weights. 34 | time_major: The shape format of the `inputs` Tensors. 35 | If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`. 36 | If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`. 37 | Using `time_major = True` is a bit more efficient because it avoids 38 | transposes at the beginning and end of the RNN calculation. However, 39 | most TensorFlow data is batch-major, so by default this function 40 | accepts input and emits output in batch-major form. 41 | return_alphas: Whether to return attention coefficients variable along with layer's output. 42 | Used for visualization purpose. 43 | Returns: 44 | The Attention output `Tensor`. 45 | In case of RNN, this will be a `Tensor` shaped: 46 | `[batch_size, cell.output_size]`. 47 | In case of Bidirectional RNN, this will be a `Tensor` shaped: 48 | `[batch_size, cell_fw.output_size + cell_bw.output_size]`. 49 | """ 50 | 51 | if isinstance(inputs, tuple): 52 | # In case of Bi-RNN, concatenate the forward and the backward RNN outputs. 
53 | inputs = tf.concat(inputs, 2) 54 | 55 | if time_major: 56 | # (T,B,D) => (B,T,D) 57 | inputs = tf.array_ops.transpose(inputs, [1, 0, 2]) 58 | 59 | hidden_size = inputs.shape[2].value # D value - hidden size of the RNN layer 60 | 61 | # Trainable parameters 62 | w_omega = tf.Variable(tf.random_normal([hidden_size, attention_size], stddev=0.1)) 63 | b_omega = tf.Variable(tf.random_normal([attention_size], stddev=0.1)) 64 | u_omega = tf.Variable(tf.random_normal([attention_size], stddev=0.1)) 65 | 66 | with tf.name_scope('v'): 67 | # Applying fully connected layer with non-linear activation to each of the B*T timestamps; 68 | # the shape of `v` is (B,T,D)*(D,A)=(B,T,A), where A=attention_size 69 | v = tf.tanh(tf.tensordot(inputs, w_omega, axes=1) + b_omega) 70 | 71 | # For each of the timestamps its vector of size A from `v` is reduced with `u` vector 72 | vu = tf.tensordot(v, u_omega, axes=1, name='vu') # (B,T) shape 73 | alphas = tf.nn.softmax(vu, name='alphas') # (B,T) shape 74 | 75 | # Output of (Bi-)RNN is reduced with attention vector; the result has (B,D) shape 76 | output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1) 77 | 78 | if not return_alphas: 79 | return output 80 | else: 81 | return output, alphas 82 | -------------------------------------------------------------------------------- /0.1 -Hierarchical Attention.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | def attention(inputs, attention_size, time_major=False, return_alphas=False): 5 | 6 | if isinstance(inputs, tuple): 7 | inputs = tf.concat(inputs, 2) 8 | 9 | if time_major: 10 | inputs = tf.array_ops.transpose(inputs, [1, 0, 2]) 11 | 12 | inputs = tf.transpose(inputs, [1, 0, 2]) 13 | sequence_length = inputs.shape[1].value # the length of sequences processed in the antecedent RNN layer 14 | hidden_size = inputs.shape[2].value # hidden size of the RNN layer 15 | 16 | # Attention mechanism 17 | W_omega = tf.Variable(tf.random_normal([hidden_size, attention_size], stddev=0.1)) 18 | b_omega = tf.Variable(tf.random_normal([attention_size], stddev=0.1)) 19 | u_omega = tf.Variable(tf.random_normal([attention_size], stddev=0.1)) 20 | 21 | v = tf.tanh(tf.matmul(tf.reshape(inputs, [-1, hidden_size]), W_omega) + tf.reshape(b_omega, [1, -1])) 22 | vu = tf.matmul(v, tf.reshape(u_omega, [-1, 1])) 23 | exps = tf.reshape(tf.exp(vu), [-1, sequence_length]) 24 | alphas = exps / tf.reshape(tf.reduce_sum(exps, 1), [-1, 1]) 25 | 26 | # Output of Bi-RNN is reduced with attention vector 27 | output = tf.reduce_sum(inputs * tf.reshape(alphas, [-1, sequence_length, 1]), 1) 28 | 29 | if not return_alphas: 30 | return output 31 | else: 32 | return output, alphas 33 | -------------------------------------------------------------------------------- /0.2 Hierarchical Attention.py: -------------------------------------------------------------------------------- 1 | import datetime, pickle, os 2 | import numpy as np 3 | import keras 4 | from keras.models import * 5 | from keras.layers import * 6 | from keras.optimizers import * 7 | from keras.callbacks import * 8 | from keras import regularizers 9 | from keras.preprocessing.text import Tokenizer 10 | from keras.preprocessing.sequence import pad_sequences 11 | from keras import backend as K 12 | from keras.utils import CustomObjectScope 13 | from keras.engine.topology import Layer 14 | from keras import initializers 15 | 16 | from util.text_util import normalize 17 | from util.glove import 
load_glove_embedding 18 | 19 | # Uncomment below for debugging 20 | # from tensorflow.python import debug as tf_debug 21 | # sess = K.get_session() 22 | # sess = tf_debug.LocalCLIDebugWrapperSession(sess) 23 | # K.set_session(sess) 24 | 25 | TOKENIZER_STATE_PATH = 'saved_models/tokenizer.p' 26 | GLOVE_EMBEDDING_PATH = 'saved_models/glove.6B.100d.txt' 27 | 28 | class Attention(Layer): 29 | def __init__(self, regularizer=None, **kwargs): 30 | super(Attention, self).__init__(**kwargs) 31 | self.regularizer = regularizer 32 | self.supports_masking = True 33 | 34 | def build(self, input_shape): 35 | # Create a trainable weight variable for this layer. 36 | self.context = self.add_weight(name='context', 37 | shape=(input_shape[-1], 1), 38 | initializer=initializers.RandomNormal( 39 | mean=0.0, stddev=0.05, seed=None), 40 | regularizer=self.regularizer, 41 | trainable=True) 42 | super(Attention, self).build(input_shape) 43 | 44 | def call(self, x, mask=None): 45 | attention_in = K.exp(K.squeeze(K.dot(x, self.context), axis=-1)) 46 | attention = attention_in/K.expand_dims(K.sum(attention_in, axis=-1), -1) 47 | 48 | if mask is not None: 49 | # use only the inputs specified by the mask 50 | # import pdb; pdb.set_trace() 51 | attention = attention*K.cast(mask, 'float32') 52 | 53 | weighted_sum = K.batch_dot(K.permute_dimensions(x, [0, 2, 1]), attention) 54 | return weighted_sum 55 | 56 | def compute_output_shape(self, input_shape): 57 | print(input_shape) 58 | return (input_shape[0], input_shape[-1]) 59 | 60 | class HNATT(): 61 | def __init__(self): 62 | self.model = None 63 | self.MAX_SENTENCE_LENGTH = 0 64 | self.MAX_SENTENCE_COUNT = 0 65 | self.VOCABULARY_SIZE = 0 66 | self.word_embedding = None 67 | self.model = None 68 | self.word_attention_model = None 69 | self.tokenizer = None 70 | self.class_count = 2 71 | 72 | def _generate_embedding(self, path, dim): 73 | return load_glove_embedding(path, dim, self.tokenizer.word_index) 74 | 75 | def _build_model(self, n_classes=2, embedding_dim=100, embeddings_path=False): 76 | l2_reg = regularizers.l2(1e-8) 77 | # embedding_weights = np.random.normal(0, 1, (len(self.tokenizer.word_index) + 1, embedding_dim)) 78 | # embedding_weights = np.zeros((len(self.tokenizer.word_index) + 1, embedding_dim)) 79 | embedding_weights = np.random.normal(0, 1, (len(self.tokenizer.word_index) + 1, embedding_dim)) 80 | if embeddings_path: 81 | embedding_weights = self._generate_embedding(embeddings_path, embedding_dim) 82 | 83 | # Generate word-attention-weighted sentence scores 84 | sentence_in = Input(shape=(self.MAX_SENTENCE_LENGTH,), dtype='int32') 85 | embedded_word_seq = Embedding( 86 | self.VOCABULARY_SIZE, 87 | embedding_dim, 88 | weights=[embedding_weights], 89 | input_length=self.MAX_SENTENCE_LENGTH, 90 | trainable=True, 91 | mask_zero=True, 92 | name='word_embeddings',)(sentence_in) 93 | word_encoder = Bidirectional( 94 | GRU(50, return_sequences=True, kernel_regularizer=l2_reg))(embedded_word_seq) 95 | dense_transform_w = Dense( 96 | 100, 97 | activation='relu', 98 | name='dense_transform_w', 99 | kernel_regularizer=l2_reg)(word_encoder) 100 | attention_weighted_sentence = Model( 101 | sentence_in, Attention(name='word_attention', regularizer=l2_reg)(dense_transform_w)) 102 | self.word_attention_model = attention_weighted_sentence 103 | attention_weighted_sentence.summary() 104 | 105 | # Generate sentence-attention-weighted document scores 106 | texts_in = Input(shape=(self.MAX_SENTENCE_COUNT, self.MAX_SENTENCE_LENGTH), dtype='int32') 107 | 
attention_weighted_sentences = TimeDistributed(attention_weighted_sentence)(texts_in) 108 | sentence_encoder = Bidirectional( 109 | GRU(50, return_sequences=True, kernel_regularizer=l2_reg))(attention_weighted_sentences) 110 | dense_transform_s = Dense( 111 | 100, 112 | activation='relu', 113 | name='dense_transform_s', 114 | kernel_regularizer=l2_reg)(sentence_encoder) 115 | attention_weighted_text = Attention(name='sentence_attention', regularizer=l2_reg)(dense_transform_s) 116 | prediction = Dense(n_classes, activation='softmax')(attention_weighted_text) 117 | model = Model(texts_in, prediction) 118 | model.summary() 119 | 120 | model.compile(#optimizer=RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=0.0), 121 | #optimizer=SGD(lr=0.01, decay=1e-6, nesterov=True), 122 | optimizer=Adam(lr=0.001), 123 | loss='categorical_crossentropy', 124 | metrics=['acc']) 125 | 126 | return model 127 | 128 | def load_weights(self, saved_model_dir, saved_model_filename): 129 | with CustomObjectScope({'Attention': Attention}): 130 | self.model = load_model(os.path.join(saved_model_dir, saved_model_filename)) 131 | self.word_attention_model = self.model.get_layer('time_distributed_1').layer 132 | tokenizer_path = os.path.join( 133 | saved_model_dir, self._get_tokenizer_filename(saved_model_filename)) 134 | tokenizer_state = pickle.load(open(tokenizer_path, "rb" )) 135 | self.tokenizer = tokenizer_state['tokenizer'] 136 | self.MAX_SENTENCE_COUNT = tokenizer_state['maxSentenceCount'] 137 | self.MAX_SENTENCE_LENGTH = tokenizer_state['maxSentenceLength'] 138 | self.VOCABULARY_SIZE = tokenizer_state['vocabularySize'] 139 | self._create_reverse_word_index() 140 | 141 | def _get_tokenizer_filename(self, saved_model_filename): 142 | return saved_model_filename + '.tokenizer' 143 | 144 | def _fit_on_texts(self, texts): 145 | self.tokenizer = Tokenizer(filters='"()*,-/;[\]^_`{|}~', oov_token='UNK'); 146 | all_sentences = [] 147 | max_sentence_count = 0 148 | max_sentence_length = 0 149 | for text in texts: 150 | sentence_count = len(text) 151 | if sentence_count > max_sentence_count: 152 | max_sentence_count = sentence_count 153 | for sentence in text: 154 | sentence_length = len(sentence) 155 | if sentence_length > max_sentence_length: 156 | max_sentence_length = sentence_length 157 | all_sentences.append(sentence) 158 | 159 | self.MAX_SENTENCE_COUNT = min(max_sentence_count, 20) 160 | self.MAX_SENTENCE_LENGTH = min(max_sentence_length, 50) 161 | self.tokenizer.fit_on_texts(all_sentences) 162 | self.VOCABULARY_SIZE = len(self.tokenizer.word_index) + 1 163 | self._create_reverse_word_index() 164 | 165 | def _create_reverse_word_index(self): 166 | self.reverse_word_index = {value:key for key,value in self.tokenizer.word_index.items()} 167 | 168 | def _encode_texts(self, texts): 169 | encoded_texts = np.zeros((len(texts), self.MAX_SENTENCE_COUNT, self.MAX_SENTENCE_LENGTH)) 170 | for i, text in enumerate(texts): 171 | encoded_text = np.array(pad_sequences( 172 | self.tokenizer.texts_to_sequences(text), 173 | maxlen=self.MAX_SENTENCE_LENGTH))[:self.MAX_SENTENCE_COUNT] 174 | encoded_texts[i][-len(encoded_text):] = encoded_text 175 | return encoded_texts 176 | 177 | def _save_tokenizer_on_epoch_end(self, path, epoch): 178 | if epoch == 0: 179 | tokenizer_state = { 180 | 'tokenizer': self.tokenizer, 181 | 'maxSentenceCount': self.MAX_SENTENCE_COUNT, 182 | 'maxSentenceLength': self.MAX_SENTENCE_LENGTH, 183 | 'vocabularySize': self.VOCABULARY_SIZE 184 | } 185 | pickle.dump(tokenizer_state, open(path, "wb" ) ) 186 | 187 | def 
train(self, train_x, train_y, 188 | batch_size=16, epochs=1, 189 | embedding_dim=100, 190 | embeddings_path=False, 191 | saved_model_dir='saved_models', saved_model_filename=None,): 192 | # fit tokenizer 193 | self._fit_on_texts(train_x) 194 | self.model = self._build_model( 195 | n_classes=train_y.shape[-1], 196 | embedding_dim=100, 197 | embeddings_path=embeddings_path) 198 | encoded_train_x = self._encode_texts(train_x) 199 | callbacks = [ 200 | # EarlyStopping( 201 | # monitor='acc', 202 | # patience=2, 203 | # ), 204 | ReduceLROnPlateau(), 205 | # keras.callbacks.TensorBoard( 206 | # log_dir="logs/final/{}".format(datetime.datetime.now()), 207 | # histogram_freq=1, 208 | # write_graph=True, 209 | # write_images=True 210 | # ) 211 | LambdaCallback( 212 | on_epoch_end=lambda epoch, logs: self._save_tokenizer_on_epoch_end( 213 | os.path.join(saved_model_dir, 214 | self._get_tokenizer_filename(saved_model_filename)), epoch)) 215 | ] 216 | 217 | if saved_model_filename: 218 | callbacks.append( 219 | ModelCheckpoint( 220 | filepath=os.path.join(saved_model_dir, saved_model_filename), 221 | monitor='val_acc', 222 | save_best_only=True, 223 | save_weights_only=False, 224 | ) 225 | ) 226 | self.model.fit(x=encoded_train_x, y=train_y, 227 | batch_size=batch_size, 228 | epochs=epochs, 229 | verbose=1, 230 | callbacks=callbacks, 231 | validation_split=0.1, 232 | shuffle=True) 233 | 234 | def _encode_input(self, x, log=False): 235 | x = np.array(x) 236 | if not x.shape: 237 | x = np.expand_dims(x, 0) 238 | texts = np.array([normalize(text) for text in x]) 239 | return self._encode_texts(texts) 240 | 241 | def predict(self, x): 242 | encoded_x = self._encode_texts(x) 243 | return self.model.predict(encoded_x) 244 | 245 | def activation_maps(self, text, websafe=False): 246 | normalized_text = normalize(text) 247 | encoded_text = self._encode_input(text)[0] 248 | 249 | # get word activations 250 | hidden_word_encoding_out = Model(inputs=self.word_attention_model.input, 251 | outputs=self.word_attention_model.get_layer('dense_transform_w').output) 252 | hidden_word_encodings = hidden_word_encoding_out.predict(encoded_text) 253 | word_context = self.word_attention_model.get_layer('word_attention').get_weights()[0] 254 | u_wattention = encoded_text*np.exp(np.squeeze(np.dot(hidden_word_encodings, word_context))) 255 | if websafe: 256 | u_wattention = u_wattention.astype(float) 257 | 258 | # generate word, activation pairs 259 | nopad_encoded_text = encoded_text[-len(normalized_text):] 260 | nopad_encoded_text = [list(filter(lambda x: x > 0, sentence)) for sentence in nopad_encoded_text] 261 | reconstructed_texts = [[self.reverse_word_index[int(i)] 262 | for i in sentence] for sentence in nopad_encoded_text] 263 | nopad_wattention = u_wattention[-len(normalized_text):] 264 | nopad_wattention = nopad_wattention/np.expand_dims(np.sum(nopad_wattention, -1), -1) 265 | nopad_wattention = np.array([attention_seq[-len(sentence):] 266 | for attention_seq, sentence in zip(nopad_wattention, nopad_encoded_text)]) 267 | word_activation_maps = [] 268 | for i, text in enumerate(reconstructed_texts): 269 | word_activation_maps.append(list(zip(text, nopad_wattention[i]))) 270 | 271 | # get sentence activations 272 | hidden_sentence_encoding_out = Model(inputs=self.model.input, 273 | outputs=self.model.get_layer('dense_transform_s').output) 274 | hidden_sentence_encodings = np.squeeze( 275 | hidden_sentence_encoding_out.predict(np.expand_dims(encoded_text, 0)), 0) 276 | sentence_context = 
self.model.get_layer('sentence_attention').get_weights()[0] 277 | u_sattention = np.exp(np.squeeze(np.dot(hidden_sentence_encodings, sentence_context), -1)) 278 | if websafe: 279 | u_sattention = u_sattention.astype(float) 280 | nopad_sattention = u_sattention[-len(normalized_text):] 281 | 282 | nopad_sattention = nopad_sattention/np.expand_dims(np.sum(nopad_sattention, -1), -1) 283 | 284 | activation_map = list(zip(word_activation_maps, nopad_sattention)) 285 | 286 | return activation_map 287 | 288 | # source https://github.com/minqi/hnatt/blob/master/hnatt.py 289 | -------------------------------------------------------------------------------- /1.0- addictive attention.py: -------------------------------------------------------------------------------- 1 | #From suriyadeepan code library 2 | 3 | def additive_attention(ref, query, ref_dim, qdim, 4 | normalize=False, blend=False): 5 | # infer timesteps 6 | timesteps = tf.shape(ref)[1] 7 | 8 | U = tf.get_variable('U', 9 | shape=[ref_dim, qdim], 10 | dtype=tf.float32, 11 | initializer=tf.random_uniform_initializer(-0.01, 0.01)) 12 | V = tf.get_variable('V', 13 | shape=[qdim, qdim], 14 | dtype=tf.float32, 15 | initializer=tf.random_uniform_initializer(-0.01, 0.01)) 16 | Av = tf.get_variable('Av', 17 | shape=[qdim, 1], 18 | dtype=tf.float32, 19 | initializer=tf.random_uniform_initializer(-0.01, 0.01)) 20 | # NOTE : reference should be in batch_major format 21 | ref_proj = tf.reshape( 22 | tf.matmul(tf.reshape(ref, [-1, ref_dim]), U), # collapse dims to matmul 23 | [-1, timesteps, qdim]) # expand again 24 | hi = tf.expand_dims(tf.matmul(query, V), 25 | axis=1) # expand time dim to add to reference 26 | 27 | # sum up ref, query 28 | blended = (ref_proj + hi) 29 | scores = tf.reshape(tf.matmul( 30 | tf.reshape(blended, [-1, qdim]), # collapse dims 31 | Av), # matmul with attention vector 32 | [-1, timesteps]) # attention weights across timesteps 33 | 34 | # normalize scores 35 | probs = tf.nn.softmax(scores) 36 | if normalize: 37 | return probs 38 | if blend: # reduce reference based on attention weights 39 | return tf.reduce_sum(ref * tf.expand_dims(probs, axis=-1), 40 | axis=1) # reduce across time dimension 41 | return scores # return score 42 | -------------------------------------------------------------------------------- /2.0- Bahdanau_attention.py: -------------------------------------------------------------------------------- 1 | #from pytorch 2 | 3 | class BahdanauAttnDecoderRNN(nn.Module): 4 | def __init__(self, hidden_size, output_size, n_layers=1, dropout_p=0.1): 5 | super(AttnDecoderRNN, self).__init__() 6 | 7 | # Define parameters 8 | self.hidden_size = hidden_size 9 | self.output_size = output_size 10 | self.n_layers = n_layers 11 | self.dropout_p = dropout_p 12 | self.max_length = max_length 13 | 14 | # Define layers 15 | self.embedding = nn.Embedding(output_size, hidden_size) 16 | self.dropout = nn.Dropout(dropout_p) 17 | self.attn = GeneralAttn(hidden_size) 18 | self.gru = nn.GRU(hidden_size * 2, hidden_size, n_layers, dropout=dropout_p) 19 | self.out = nn.Linear(hidden_size, output_size) 20 | 21 | def forward(self, word_input, last_hidden, encoder_outputs): 22 | # Note that we will only be running forward for a single decoder time step, but will use all encoder outputs 23 | 24 | # Get the embedding of the current input word (last output word) 25 | word_embedded = self.embedding(word_input).view(1, 1, -1) # S=1 x B x N 26 | word_embedded = self.dropout(word_embedded) 27 | 28 | # Calculate attention weights and apply to 
encoder outputs 29 | attn_weights = self.attn(last_hidden[-1], encoder_outputs) 30 | context = attn_weights.bmm(encoder_outputs.transpose(0, 1)) # B x 1 x N 31 | 32 | # Combine embedded input word and attended context, run through RNN 33 | rnn_input = torch.cat((word_embedded, context), 2) 34 | output, hidden = self.gru(rnn_input, last_hidden) 35 | 36 | # Final output layer 37 | output = output.squeeze(0) # B x N 38 | output = F.log_softmax(self.out(torch.cat((output, context), 1))) 39 | 40 | # Return final output, hidden state, and attention weights (for visualization) 41 | return output, hidden, attn_weights 42 | -------------------------------------------------------------------------------- /3.0- Soft_attention.py: -------------------------------------------------------------------------------- 1 | #after getting output from bidirectional rnn 2 | 3 | #Attention_layer 4 | 5 | x_attention = tf.reshape(transpose,[-1,rnn_num_units*2]) 6 | attention_size=tf.get_variable(name='attention',shape=[rnn_num_units*2,1],dtype=tf.float32,initializer=tf.random_uniform_initializer(-0.01,0.01)) 7 | bias_ = tf.get_variable(name='bias_',shape=[1],dtype=tf.float32,initializer=tf.random_uniform_initializer(-0.01,0.01)) 8 | linear_projection = tf.add(tf.matmul(x_attention,attention_size),bias_) 9 | # print(sentence_input.shape[0]) 10 | reshape_ = tf.reshape(linear_projection,[tf.shape(sentence_input)[0],tf.shape(sentence_input)[1],-1]) 11 | attention_output=tf.nn.softmax(reshape_,dim=1) 12 | 13 | atten_visualize=tf.reshape(attention_output,[tf.shape(sentence_input)[0],tf.shape(sentence_input)[1]],name='plot_dis') 14 | 15 | multi = tf.multiply(attention_output,transpose) 16 | 17 | 18 | atten_out_s = tf.reduce_sum(multi,1) 19 | 20 | # attention_visualize = tf.reshape(atten_out,[tf.shape(sentence_input)[0],tf.shape(sentence_input)[1]]) 21 | -------------------------------------------------------------------------------- /4.0 -Luong_attention.py: -------------------------------------------------------------------------------- 1 | class Attn(nn.Module): 2 | def __init__(self, method, hidden_size, max_length=MAX_LENGTH): 3 | super(Attn, self).__init__() 4 | 5 | self.method = method 6 | self.hidden_size = hidden_size 7 | 8 | if self.method == 'general': 9 | self.attn = nn.Linear(self.hidden_size, hidden_size) 10 | 11 | elif self.method == 'concat': 12 | self.attn = nn.Linear(self.hidden_size * 2, hidden_size) 13 | self.other = nn.Parameter(torch.FloatTensor(1, hidden_size)) 14 | 15 | def forward(self, hidden, encoder_outputs): 16 | seq_len = len(encoder_outputs) 17 | 18 | # Create variable to store attention energies 19 | attn_energies = Variable(torch.zeros(seq_len)) # B x 1 x S 20 | if USE_CUDA: attn_energies = attn_energies.cuda() 21 | 22 | # Calculate energies for each encoder output 23 | for i in range(seq_len): 24 | attn_energies[i] = self.score(hidden, encoder_outputs[i]) 25 | 26 | # Normalize energies to weights in range 0 to 1, resize to 1 x 1 x seq_len 27 | return F.softmax(attn_energies).unsqueeze(0).unsqueeze(0) 28 | 29 | def score(self, hidden, encoder_output): 30 | 31 | if self.method == 'dot': 32 | energy = hidden.dot(encoder_output) 33 | return energy 34 | 35 | elif self.method == 'general': 36 | energy = self.attn(encoder_output) 37 | energy = hidden.dot(energy) 38 | return energy 39 | 40 | elif self.method == 'concat': 41 | energy = self.attn(torch.cat((hidden, encoder_output), 1)) 42 | energy = self.other.dot(energy) 43 | return energy 44 | 
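The `score` method above implements Luong et al.'s three alignment functions (`dot`, `general`, `concat`). As a quick, framework-free illustration, here is a minimal NumPy sketch of the same three scoring rules applied to one decoder state against all encoder outputs at once; the names `luong_scores`, `W_a`, and `v_a` are placeholders introduced for this sketch (biases and batching are omitted), not part of the file above.

import numpy as np

def luong_scores(hidden, encoder_outputs, method="dot", W_a=None, v_a=None):
    # hidden:          (hidden_size,)           current decoder state
    # encoder_outputs: (seq_len, hidden_size)   encoder states
    # returns softmax-normalized attention weights of shape (seq_len,)
    if method == "dot":
        # e_i = h_dec . h_enc_i
        energies = encoder_outputs @ hidden
    elif method == "general":
        # e_i = h_dec . (W_a h_enc_i)
        energies = (encoder_outputs @ W_a.T) @ hidden
    elif method == "concat":
        # e_i = v_a . tanh(W_a [h_dec ; h_enc_i])
        tiled = np.tile(hidden, (encoder_outputs.shape[0], 1))
        energies = np.tanh(np.concatenate([tiled, encoder_outputs], axis=1) @ W_a.T) @ v_a
    else:
        raise ValueError(method)
    exp = np.exp(energies - energies.max())  # numerically stable softmax
    return exp / exp.sum()

# toy check with hidden_size=4 and seq_len=6
rng = np.random.default_rng(0)
h = rng.standard_normal(4)
enc = rng.standard_normal((6, 4))
print(luong_scores(h, enc, "dot"))
print(luong_scores(h, enc, "general", W_a=rng.standard_normal((4, 4))))
print(luong_scores(h, enc, "concat", W_a=rng.standard_normal((4, 8)), v_a=rng.standard_normal(4)))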
-------------------------------------------------------------------------------- /5.0-recognizing_entailment.py: -------------------------------------------------------------------------------- 1 | #paper 2 | #Reasoning about Entailment with Neural Attention 3 | #https://arxiv.org/pdf/1509.06664v1.pdf 4 | 5 | 6 | import tensorflow as tf 7 | import numpy as np 8 | 9 | batch_size = 3 10 | seq_len = 5 11 | dim = 2 12 | # [batch_size x seq_len x dim] -- hidden states 13 | Y = tf.constant(np.random.randn(batch_size, seq_len, dim), tf.float32) 14 | # [batch_size x dim] -- h_N 15 | h = tf.constant(np.random.randn(batch_size, dim), tf.float32) 16 | 17 | initializer = tf.random_uniform_initializer() 18 | W = tf.get_variable("weights_Y", [dim, dim], initializer=initializer) 19 | w = tf.get_variable("weights_w", [dim], initializer=initializer) 20 | 21 | # [batch_size x seq_len x dim] -- tanh(W^{Y}Y) 22 | M = tf.tanh(tf.einsum("aij,jk->aik", Y, W)) 23 | # [batch_size x seq_len] -- softmax(Y w^T) 24 | a = tf.nn.softmax(tf.einsum("aij,j->ai", M, w)) 25 | # [batch_size x dim] -- Ya^T 26 | r = tf.einsum("aij,ai->aj", Y, a) 27 | 28 | with tf.Session() as sess: 29 | sess.run(tf.global_variables_initializer()) 30 | a_val, r_val = sess.run([a, r]) 31 | print("a:", a_val, "\nr:", r_val) 32 | 33 | 34 | 35 | 36 | #I came across this here https://stackoverflow.com/questions/42507030/implementing-attention-in-tensorflow 37 | -------------------------------------------------------------------------------- /Images/ Bahdanau_attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/monk1337/Various-Attention-mechanisms/b4462102dcecb05c544a31aae8973a5477cc838d/Images/ Bahdanau_attention.png -------------------------------------------------------------------------------- /Images/alignments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/monk1337/Various-Attention-mechanisms/b4462102dcecb05c544a31aae8973a5477cc838d/Images/alignments.png -------------------------------------------------------------------------------- /Images/attention-mechanisms.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/monk1337/Various-Attention-mechanisms/b4462102dcecb05c544a31aae8973a5477cc838d/Images/attention-mechanisms.png -------------------------------------------------------------------------------- /Images/demo.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Images/diff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/monk1337/Various-Attention-mechanisms/b4462102dcecb05c544a31aae8973a5477cc838d/Images/diff.png -------------------------------------------------------------------------------- /Images/ml.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/monk1337/Various-Attention-mechanisms/b4462102dcecb05c544a31aae8973a5477cc838d/Images/ml.png -------------------------------------------------------------------------------- /Images/white.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/monk1337/Various-Attention-mechanisms/b4462102dcecb05c544a31aae8973a5477cc838d/Images/white.png 
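The einsum calls in 5.0-recognizing_entailment.py above map one-to-one onto batched matrix products. The sketch below re-expresses the same three steps in NumPy with the shapes noted in that file's comments; the wrapper name `entailment_attention` is introduced here for illustration, and, like the snippet above, it applies only the W^{Y}Y term (the h_N projection from the full paper is not part of that example).

import numpy as np

def entailment_attention(Y, W, w):
    # Y: [batch, seq_len, dim], W: [dim, dim], w: [dim]
    M = np.tanh(np.einsum("aij,jk->aik", Y, W))       # [batch, seq_len, dim] -- tanh(W^{Y}Y)
    a = np.einsum("aij,j->ai", M, w)                  # [batch, seq_len] -- unnormalized scores
    a = np.exp(a - a.max(axis=1, keepdims=True))
    a = a / a.sum(axis=1, keepdims=True)              # softmax over seq_len
    r = np.einsum("aij,ai->aj", Y, a)                 # [batch, dim] -- attention-weighted sum
    return a, r

batch_size, seq_len, dim = 3, 5, 2
rng = np.random.default_rng(0)
a_val, r_val = entailment_attention(
    rng.standard_normal((batch_size, seq_len, dim)),
    rng.standard_normal((dim, dim)),
    rng.standard_normal(dim))
print(a_val.shape, r_val.shape)  # (3, 5) (3, 2)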
-------------------------------------------------------------------------------- /Keras_Multi-head_attention.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow import keras 3 | from tensorflow.keras import layers 4 | 5 | 6 | class MultiHeadSelfAttention(layers.Layer): 7 | def __init__(self, embed_dim, num_heads=8): 8 | super(MultiHeadSelfAttention, self).__init__() 9 | self.embed_dim = embed_dim 10 | self.num_heads = num_heads 11 | if embed_dim % num_heads != 0: 12 | raise ValueError( 13 | f"embedding dimension = {embed_dim} should be divisible by number of heads = {num_heads}" 14 | ) 15 | self.projection_dim = embed_dim // num_heads 16 | self.query_dense = layers.Dense(embed_dim) 17 | self.key_dense = layers.Dense(embed_dim) 18 | self.value_dense = layers.Dense(embed_dim) 19 | self.combine_heads = layers.Dense(embed_dim) 20 | 21 | def attention(self, query, key, value): 22 | score = tf.matmul(query, key, transpose_b=True) 23 | dim_key = tf.cast(tf.shape(key)[-1], tf.float32) 24 | scaled_score = score / tf.math.sqrt(dim_key) 25 | weights = tf.nn.softmax(scaled_score, axis=-1) 26 | output = tf.matmul(weights, value) 27 | return output, weights 28 | 29 | def separate_heads(self, x, batch_size): 30 | x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim)) 31 | return tf.transpose(x, perm=[0, 2, 1, 3]) 32 | 33 | def call(self, inputs): 34 | # x.shape = [batch_size, seq_len, embedding_dim] 35 | batch_size = tf.shape(inputs)[0] 36 | query = self.query_dense(inputs) # (batch_size, seq_len, embed_dim) 37 | key = self.key_dense(inputs) # (batch_size, seq_len, embed_dim) 38 | value = self.value_dense(inputs) # (batch_size, seq_len, embed_dim) 39 | query = self.separate_heads( 40 | query, batch_size 41 | ) # (batch_size, num_heads, seq_len, projection_dim) 42 | key = self.separate_heads( 43 | key, batch_size 44 | ) # (batch_size, num_heads, seq_len, projection_dim) 45 | value = self.separate_heads( 46 | value, batch_size 47 | ) # (batch_size, num_heads, seq_len, projection_dim) 48 | attention, weights = self.attention(query, key, value) 49 | attention = tf.transpose( 50 | attention, perm=[0, 2, 1, 3] 51 | ) # (batch_size, seq_len, num_heads, projection_dim) 52 | concat_attention = tf.reshape( 53 | attention, (batch_size, -1, self.embed_dim) 54 | ) # (batch_size, seq_len, embed_dim) 55 | output = self.combine_heads( 56 | concat_attention 57 | ) # (batch_size, seq_len, embed_dim) 58 | return output 59 | -------------------------------------------------------------------------------- /Keras_Multihead_attention_1.py: -------------------------------------------------------------------------------- 1 | 2 | import tensorflow as tf 3 | 4 | import time 5 | import numpy as np 6 | 7 | def scaled_dot_product_attention(q, k, v, mask): 8 | """Calculate the attention weights. 9 | q, k, v must have matching leading dimensions. 10 | k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v. 11 | The mask has different shapes depending on its type(padding or look ahead) 12 | but it must be broadcastable for addition. 13 | 14 | Args: 15 | q: query shape == (..., seq_len_q, depth) 16 | k: key shape == (..., seq_len_k, depth) 17 | v: value shape == (..., seq_len_v, depth_v) 18 | mask: Float tensor with shape broadcastable 19 | to (..., seq_len_q, seq_len_k). Defaults to None. 
20 | 21 | Returns: 22 | output, attention_weights 23 | """ 24 | 25 | matmul_qk = tf.matmul(q, k, transpose_b=True) # (..., seq_len_q, seq_len_k) 26 | 27 | # scale matmul_qk 28 | dk = tf.cast(tf.shape(k)[-1], tf.float32) 29 | scaled_attention_logits = matmul_qk / tf.math.sqrt(dk) 30 | 31 | # add the mask to the scaled tensor. 32 | if mask is not None: 33 | scaled_attention_logits += (mask * -1e9) 34 | 35 | # softmax is normalized on the last axis (seq_len_k) so that the scores 36 | # add up to 1. 37 | attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1) # (..., seq_len_q, seq_len_k) 38 | 39 | output = tf.matmul(attention_weights, v) # (..., seq_len_q, depth_v) 40 | 41 | return output, attention_weights 42 | 43 | 44 | 45 | class MultiHeadAttention(tf.keras.layers.Layer): 46 | def __init__(self, d_model, num_heads): 47 | super(MultiHeadAttention, self).__init__() 48 | self.num_heads = num_heads 49 | self.d_model = d_model 50 | 51 | assert d_model % self.num_heads == 0 52 | 53 | self.depth = d_model // self.num_heads 54 | 55 | self.wq = tf.keras.layers.Dense(d_model) 56 | self.wk = tf.keras.layers.Dense(d_model) 57 | self.wv = tf.keras.layers.Dense(d_model) 58 | 59 | self.dense = tf.keras.layers.Dense(d_model) 60 | 61 | def split_heads(self, x, batch_size): 62 | """Split the last dimension into (num_heads, depth). 63 | Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth) 64 | """ 65 | x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth)) 66 | return tf.transpose(x, perm=[0, 2, 1, 3]) 67 | 68 | def call(self, v, k, q, mask): 69 | batch_size = tf.shape(q)[0] 70 | 71 | q = self.wq(q) # (batch_size, seq_len, d_model) 72 | k = self.wk(k) # (batch_size, seq_len, d_model) 73 | v = self.wv(v) # (batch_size, seq_len, d_model) 74 | 75 | q = self.split_heads(q, batch_size) # (batch_size, num_heads, seq_len_q, depth) 76 | k = self.split_heads(k, batch_size) # (batch_size, num_heads, seq_len_k, depth) 77 | v = self.split_heads(v, batch_size) # (batch_size, num_heads, seq_len_v, depth) 78 | 79 | # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth) 80 | # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k) 81 | scaled_attention, attention_weights = scaled_dot_product_attention( 82 | q, k, v, mask) 83 | 84 | scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3]) # (batch_size, seq_len_q, num_heads, depth) 85 | 86 | concat_attention = tf.reshape(scaled_attention, 87 | (batch_size, -1, self.d_model)) # (batch_size, seq_len_q, d_model) 88 | 89 | output = self.dense(concat_attention) # (batch_size, seq_len_q, d_model) 90 | 91 | return output, attention_weights 92 | -------------------------------------------------------------------------------- /Multi-Head_attention.py: -------------------------------------------------------------------------------- 1 | def multihead_attention(queries, 2 | keys, 3 | num_units=None, 4 | num_heads=8, 5 | dropout_rate=0, 6 | is_training=True, 7 | causality=False, 8 | scope="multihead_attention", 9 | reuse=None): 10 | with tf.variable_scope(scope, reuse=reuse): 11 | if num_units is None: # set default size for attention size C 12 | num_units = queries.get_shape().as_list()[-1] 13 | 14 | # Linear Projections 15 | Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu) # [N, T_q, C] 16 | K = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # [N, T_k, C] 17 | V = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # [N, T_k, C] 18 | 19 | # Split and concat 20 | 
Q_ = tf.concat(tf.split(Q, num_heads, axis=-1), axis=0) # [num_heads * N, T_q, C/num_heads] 21 | K_ = tf.concat(tf.split(K, num_heads, axis=-1), axis=0) # [num_heads * N, T_k, C/num_heads] 22 | V_ = tf.concat(tf.split(V, num_heads, axis=-1), axis=0) # [num_heads * N, T_k, C/num_heads] 23 | 24 | # Attention 25 | outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1])) # (num_heads * N, T_q, T_k) 26 | 27 | # Scale : outputs = outputs / sqrt( d_k) 28 | outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5) 29 | 30 | # Key Masking 31 | # see : https://github.com/Kyubyong/transformer/issues/3 32 | key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1))) # (N, T_k) 33 | key_masks = tf.tile(key_masks, [num_heads, 1]) # (h*N, T_k) 34 | key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1]) # (h*N, T_q, T_k) 35 | 36 | paddings = tf.ones_like(outputs) * (-2 ** 32 + 1) # -infinity 37 | outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs) # (h*N, T_q, T_k) 38 | 39 | # Causality = Future blinding 40 | if causality: 41 | diag_vals = tf.ones_like(outputs[0, :, :]) # (T_q, T_k) 42 | tril = tf.contrib.linalg.LinearOperatorTriL(diag_vals).to_dense() # (T_q, T_k) 43 | masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1]) # (h*N, T_q, T_k) 44 | 45 | paddings = tf.ones_like(masks) * (-2 ** 32 + 1) 46 | outputs = tf.where(tf.equal(masks, 0), paddings, outputs) # (h*N, T_q, T_k) 47 | 48 | # Activation: outputs is a weight matrix 49 | outputs = tf.nn.softmax(outputs) # (h*N, T_q, T_k) 50 | 51 | # Query Masking 52 | query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis=-1))) # (N, T_q) 53 | query_masks = tf.tile(query_masks, [num_heads, 1]) # (h*N, T_q) 54 | query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]]) # (h*N, T_q, T_k) 55 | outputs *= query_masks # broadcasting. (N, T_q, C) 56 | 57 | # dropouts 58 | outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training)) 59 | 60 | # weighted sum 61 | outputs = tf.matmul(outputs, V_) # ( h*N, T_q, C/h) 62 | 63 | # reshape 64 | outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2) # (N, T_q, C) 65 | 66 | # residual connection 67 | outputs += queries 68 | 69 | # layer normaliztion 70 | outputs = layer_normalization(outputs) 71 | return outputs 72 | # https://github.com/TobiasLee/Text-Classification/blob/master/models/modules/multihead.py 73 | -------------------------------------------------------------------------------- /Multiple_Multi_head_attention.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #test self-attention 3 | import tensorflow as tf 4 | import time 5 | """ 6 | multi head attention. 7 | 1.linearly project the queries,keys and values h times(with different,learned linear projections to d_k,d_k,d_v dimensions) 8 | 2.scaled dot product attention for each projected version of Q,K,V 9 | 3.concatenated result 10 | 4.linear projection to get final result 11 | three kinds of usage: 12 | 1. attention for encoder 13 | 2. attention for decoder(need a mask to pay attention for only known position) 14 | 3. 
attention as bridge of encoder and decoder 15 | """ 16 | class MultiHeadAttention(object): 17 | """ multi head attention""" 18 | def __init__(self,Q,K_s,V_s,d_model,d_k,d_v,sequence_length,h,type=None,is_training=None,mask=None,dropout_rate=0.1): 19 | self.d_model=d_model 20 | self.d_k=d_k 21 | self.d_v=d_v 22 | self.sequence_length=sequence_length 23 | self.h=h 24 | self.Q=Q 25 | self.K_s=K_s 26 | self.V_s=V_s 27 | self.type=type 28 | self.is_training=is_training 29 | self.mask=mask 30 | self.dropout_rate=dropout_rate 31 | print("MultiHeadAttention.self.dropout_rate:",self.dropout_rate) 32 | 33 | def multi_head_attention_fn(self): 34 | """ 35 | multi head attention 36 | :param Q: query. shape:[batch,sequence_length,d_model] 37 | :param K_s: keys. shape:[batch,sequence_length,d_model]. 38 | :param V_s:values.shape:[batch,sequence_length,d_model]. 39 | :param h: h times 40 | :return: result of scaled dot product attention. shape:[sequence_length,d_model] 41 | """ 42 | # 1. linearly project the queries,keys and values h times(with different,learned linear projections to d_k,d_k,d_v dimensions) 43 | Q_projected = tf.layers.dense(self.Q,units=self.d_model) # [batch,sequence_length,d_model] 44 | K_s_projected = tf.layers.dense(self.K_s, units=self.d_model) # [batch,sequence_length,d_model] 45 | V_s_projected = tf.layers.dense(self.V_s, units=self.d_model) # [batch,sequence_length,d_model] 46 | # 2. scaled dot product attention for each projected version of Q,K,V 47 | dot_product=self.scaled_dot_product_attention_batch(Q_projected,K_s_projected,V_s_projected) # [batch,h,sequence_length,d_k] 48 | # 3. concatenated 49 | print("dot_product:====================================================================================>",dot_product) #dot_product:(128, 8, 6, 64) 50 | batch_size,h,length,d_k=dot_product.get_shape().as_list() 51 | print("self.sequence_length:",self.sequence_length) #5 52 | dot_product=tf.reshape(dot_product,shape=(-1,length,self.d_model)) 53 | # 4. linear projection 54 | output=tf.layers.dense(dot_product,units=self.d_model) # [batch,sequence_length,d_model] 55 | return output #[batch,sequence_length,d_model] 56 | 57 | def scaled_dot_product_attention_batch_mine(self,Q,K_s,V_s): #my own implementation of scaled dot product attention. 58 | """ 59 | scaled dot product attention 60 | :param Q: query. shape:[batch,sequence_length,d_model] 61 | :param K_s: keys. shape:[batch,sequence_length,d_model] 62 | :param V_s:values. shape:[batch,sequence_length,d_model] 63 | :param mask: shape:[batch,sequence_length] 64 | :return: result of scaled dot product attention. shape:[batch,h,sequence_length,d_k] 65 | """ 66 | # 1. split Q,K,V 67 | Q_heads = tf.stack(tf.split(Q,self.h,axis=2),axis=1) # [batch,h,sequence_length,d_k] 68 | K_heads = tf.stack(tf.split(K_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_k] 69 | V_heads = tf.stack(tf.split(V_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_k] 70 | dot_product=tf.multiply(Q_heads,K_heads) # [batch,h,sequence_length,d_k] 71 | # 2. dot product 72 | dot_product=dot_product*(1.0/tf.sqrt(tf.cast(self.d_model,tf.float32))) # [batch,h,sequence_length,d_k] 73 | dot_product=tf.reduce_sum(dot_product,axis=-1,keep_dims=True) # [batch,h,sequence_length,1] 74 | # 3. add mask if it is none 75 | if self.mask is not None: 76 | mask = tf.expand_dims(self.mask, axis=-1) # [batch,sequence_length,1] 77 | mask = tf.expand_dims(mask, axis=1) # [batch,1,sequence_length,1] 78 | dot_product=dot_product+mask # [batch,h,sequence_length,1] 79 | # 4. 
get possibility 80 | p=tf.nn.softmax(dot_product) # [batch,h,sequence_length,1] 81 | # 5. final output 82 | output=tf.multiply(p,V_heads) # [batch,h,sequence_length,d_k] 83 | return output # [batch,h,sequence_length,d_k] 84 | 85 | def scaled_dot_product_attention_batch(self, Q, K_s, V_s):# scaled dot product attention: implementation style like tensor2tensor from google 86 | """ 87 | scaled dot product attention 88 | :param Q: query. shape:[batch,sequence_length,d_model] 89 | :param K_s: keys. shape:[batch,sequence_length,d_model] 90 | :param V_s:values. shape:[batch,sequence_length,d_model] 91 | :param mask: shape:[sequence_length,sequence_length] 92 | :return: result of scaled dot product attention. shape:[batch,h,sequence_length,d_k] 93 | """ 94 | # 1. split Q,K,V 95 | Q_heads = tf.stack(tf.split(Q,self.h,axis=2),axis=1) # [batch,h,sequence_length,d_k] 96 | K_heads = tf.stack(tf.split(K_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_k] 97 | V_heads = tf.stack(tf.split(V_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_k] 98 | # 2. dot product of Q,K 99 | dot_product=tf.matmul(Q_heads,K_heads,transpose_b=True) # [batch,h,sequence_length,sequence_length] 100 | dot_product=dot_product*(1.0/tf.sqrt(tf.cast(self.d_model,tf.float32))) # [batch,h,sequence_length,sequence_length] 101 | # 3. add mask if it is none 102 | print("scaled_dot_product_attention_batch.===============================================================>mask is not none?",self.mask is not None) 103 | if self.mask is not None: 104 | mask_expand=tf.expand_dims(tf.expand_dims(self.mask,axis=0),axis=0) # [1,1,sequence_length,sequence_length] 105 | #dot_product:(128, 8, 6, 6);mask_expand:(1, 1, 5, 5) 106 | print("scaled_dot_product_attention_batch.===============================================================>dot_product:",dot_product,";mask_expand:",mask_expand) 107 | dot_product=dot_product+mask_expand # [batch,h,sequence_length,sequence_length] 108 | # 4.get possibility 109 | weights=tf.nn.softmax(dot_product) # [batch,h,sequence_length,sequence_length] 110 | # drop out weights 111 | weights=tf.nn.dropout(weights,1.0-self.dropout_rate) # [batch,h,sequence_length,sequence_length] 112 | # 5. 
final output 113 | output=tf.matmul(weights,V_heads) # [batch,h,sequence_length,d_model] 114 | return output 115 | 116 | 117 | #vectorized implementation of multi head attention for sentences with batch 118 | def multi_head_attention_for_sentence_vectorized(layer_number): 119 | print("started...") 120 | start = time.time() 121 | # 1.set parameter 122 | d_model = 512 123 | d_k = 64 124 | d_v = 64 125 | sequence_length = 1000 126 | h = 8 127 | batch_size=128 128 | initializer = tf.random_normal_initializer(stddev=0.1) 129 | # 2.set Q,K,V 130 | vocab_size=1000 131 | embed_size=d_model 132 | type='decoder' 133 | Embedding = tf.get_variable("Embedding_", shape=[vocab_size, embed_size],initializer=initializer) 134 | input_x = tf.placeholder(tf.int32, [batch_size,sequence_length], name="input_x") 135 | embedded_words = tf.nn.embedding_lookup(Embedding, input_x) #[batch_size,sequence_length,embed_size] 136 | mask=get_mask(batch_size,sequence_length) #tf.ones((batch_size,sequence_length))*-1e8 #[batch,sequence_length] 137 | with tf.variable_scope("query_at_each_sentence"+str(layer_number)): 138 | Q = embedded_words # [batch_size*sequence_length,embed_size] 139 | K_s=embedded_words #[batch_size*sequence_length,embed_size] 140 | #V_s=tf.get_variable("V_s_original_", shape=embedded_words.get_shape().as_list(),initializer=initializer) #[batch_size,sequence_length,embed_size] 141 | V_s=K_s 142 | # 3.call method to get result 143 | multi_head_attention_class = MultiHeadAttention(Q, K_s, V_s, d_model, d_k, d_v, sequence_length, h,type='decoder',mask=mask) 144 | encoder_output=multi_head_attention_class.multi_head_attention_fn() #shape:[sequence_length,d_model] 145 | encoder_output=tf.reshape(encoder_output,shape=(batch_size,sequence_length,d_model)) 146 | end = time.time() 147 | print("input_x:",input_x) 148 | print("encoder_output:",encoder_output,";time_spent:",(end-start)) 149 | 150 | def get_mask(batch_size,sequence_length): 151 | lower_triangle=tf.matrix_band_part(tf.ones([sequence_length,sequence_length]),-1,0) 152 | result=-1e9*(1.0-lower_triangle) 153 | print("get_mask==>result:",result) 154 | return result 155 | 156 | #multi_head_attention_for_sentence_vectorized(0) 157 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |