├── 021519-pic2.png
├── Screenshot 2020-10-16 at 9.43.26 AM.png
├── README.md
└── nmt_advance_fra_eng (2).py

/021519-pic2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nageshsinghc4/Neural-machine-translation-NMT/HEAD/021519-pic2.png
--------------------------------------------------------------------------------

/Screenshot 2020-10-16 at 9.43.26 AM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nageshsinghc4/Neural-machine-translation-NMT/HEAD/Screenshot 2020-10-16 at 9.43.26 AM.png
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Neural-machine-translation-NMT
![NMT](https://github.com/nageshsinghc4/Neural-machine-translation-NMT/blob/master/021519-pic2.png)

Machine translation is a subfield of computational linguistics focused on automatically converting source text in one language into text in another language.

In machine translation, the input already consists of a sequence of symbols in some language, and the computer program must convert it into a sequence of symbols in a different language.

Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the probability of a sequence of words, typically modeling whole sentences in a single integrated model.

With the power of neural networks, NMT has emerged as the most powerful approach to this task. This state-of-the-art technique is an application of deep learning in which massive datasets of translated sentences are used to train a model capable of translating between any two languages.

Here, we will create an LSTM encoder-decoder model that translates English sentences into their French-language counterparts using Keras and Python.

The dataset can be downloaded from [here](http://www.manythings.org/anki/).

For more information and a step-by-step explanation, please read the article on [www.theaidream.com](https://www.theaidream.com/post/how-ai-is-changing-personal-data-tracking).

## Predictions
To test the performance, we randomly choose a sentence from the `input_sentences` list, retrieve the corresponding padded sequence, and pass it to the `translate_sentence()` method. The method returns the translated sentence.
```
i = np.random.choice(len(input_sentences))
input_seq = encoder_input_sequences[i:i+1]
translation = translate_sentence(input_seq)
print('Input Language : ', input_sentences[i])
print('Actual translation : ', output_sentences[i])
print('French translation : ', translation)
```

**Results:**

![Results](https://github.com/nageshsinghc4/Neural-machine-translation-NMT/blob/master/Screenshot%202020-10-16%20at%209.43.26%20AM.png)

You can follow my Kaggle kernel for an [NMT with attention mechanism](https://www.kaggle.com/nageshsingh/neural-machine-translation-attention-mechanism) implementation.
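
## Trained model

The training script saves the trained model as `seq2seq_eng-fra.h5`. A minimal sketch for reloading it later (assuming the same Keras version that was used for training):
```
from keras.models import load_model

model = load_model('seq2seq_eng-fra.h5')
model.summary()
```
Note that for step-by-step decoding, the script rebuilds separate encoder and decoder inference models from the trained layers, as shown in `nmt_advance_fra_eng (2).py`.
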
--------------------------------------------------------------------------------

/nmt_advance_fra_eng (2).py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""NMT_advance_fra-eng.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1ICjuU5xpv8rpMJ-5RMufu2mdxMRdmcNn
"""

# Import the required libraries:
import os, sys
from keras.models import Model
from keras.layers import Input, LSTM, GRU, Dense, Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np
import pickle
import matplotlib.pyplot as plt

# Set the values of the different hyperparameters:
BATCH_SIZE = 64
EPOCHS = 20
LSTM_NODES = 256
NUM_SENTENCES = 20000
MAX_SENTENCE_LENGTH = 50
MAX_NUM_WORDS = 20000
EMBEDDING_SIZE = 200

"""The language translation model that we are going to develop will translate English sentences into their French-language counterparts. To develop such a model, we need a dataset that contains English sentences and their French translations.

# Data Preprocessing

We need to generate two copies of each translated sentence: one with the end-of-sentence token (`<eos>`) appended and the other with the start-of-sentence token (`<sos>`) prepended.
"""

input_sentences = []
output_sentences = []
output_sentences_inputs = []

count = 0
for line in open('./drive/My Drive/fra.txt', encoding="utf-8"):
    count += 1
    if count > NUM_SENTENCES:
        break
    if '\t' not in line:
        continue
    input_sentence = line.rstrip().split('\t')[0]
    output = line.rstrip().split('\t')[1]

    output_sentence = output + ' <eos>'
    output_sentence_input = '<sos> ' + output

    input_sentences.append(input_sentence)
    output_sentences.append(output_sentence)
    output_sentences_inputs.append(output_sentence_input)

print("Number of sample inputs:", len(input_sentences))
print("Number of sample outputs:", len(output_sentences))
print("Number of sample output inputs:", len(output_sentences_inputs))

"""Now randomly print a sentence to analyse the dataset."""

print("English sentence: ", input_sentences[180])
print("French translation: ", output_sentences[180])

"""You can see the original sentence, i.e. **Join us**, and its corresponding translation in the output, i.e. **Joignez-vous à nous. <eos>**. Notice that here we have the `<eos>` token at the end of the sentence. Similarly, for the input to the decoder, we have **<sos> Joignez-vous à nous.**

# Tokenization and Padding

The next step is tokenizing the original and translated sentences and padding the sentences that are longer or shorter than a certain length, which in the case of the inputs will be the length of the longest input sentence, and for the outputs the length of the longest output sentence.
"""
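
# Illustrative sketch only (not part of the translation pipeline): what Tokenizer
# and pad_sequences do on a tiny, made-up corpus.
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(["join us", "join us now"])
toy_seqs = toy_tokenizer.texts_to_sequences(["join us", "join us now"])
print(toy_seqs)                                 # e.g. [[1, 2], [1, 2, 3]]
print(pad_sequences(toy_seqs))                  # pre-padding (default): [[0 1 2] [1 2 3]]
print(pad_sequences(toy_seqs, padding='post'))  # post-padding: [[1 2 0] [1 2 3]]
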
# Let's visualise the lengths of the sentences.
import pandas as pd

eng_len = []
fren_len = []

# populate the lists with sentence lengths
for i in input_sentences:
    eng_len.append(len(i.split()))

for i in output_sentences:
    fren_len.append(len(i.split()))

length_df = pd.DataFrame({'english': eng_len, 'french': fren_len})

length_df.hist(bins=20)
plt.show()

"""The histogram above shows that the maximum length of the French sentences is 12, while that of the English sentences is 6.

For tokenization, the Tokenizer class from the keras.preprocessing.text library can be used. The Tokenizer class performs two tasks:

1. It divides a sentence into the corresponding list of words

2. Then it converts the words to integers

Also, the **word_index** attribute of the Tokenizer class returns a word-to-index dictionary where the words are the keys and the corresponding integers are the values.
"""

# tokenize the input sentences (input language)
input_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
input_tokenizer.fit_on_texts(input_sentences)
input_integer_seq = input_tokenizer.texts_to_sequences(input_sentences)
print(input_integer_seq)

word2idx_inputs = input_tokenizer.word_index
print('Total unique words in the input: %s' % len(word2idx_inputs))

max_input_len = max(len(sen) for sen in input_integer_seq)
print("Length of longest sentence in input: %d" % max_input_len)

#with open('input_tokenizer_NMT.pickle', 'wb') as handle:
#    pickle.dump(input_tokenizer, handle, protocol=4)

# tokenize the output sentences (output language)
output_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, filters='')
output_tokenizer.fit_on_texts(output_sentences + output_sentences_inputs)
output_integer_seq = output_tokenizer.texts_to_sequences(output_sentences)
output_input_integer_seq = output_tokenizer.texts_to_sequences(output_sentences_inputs)
print(output_input_integer_seq)

word2idx_outputs = output_tokenizer.word_index
print('Total unique words in the output: %s' % len(word2idx_outputs))

num_words_output = len(word2idx_outputs) + 1
max_out_len = max(len(sen) for sen in output_integer_seq)
print("Length of longest sentence in the output: %d" % max_out_len)

#with open('output_tokenizer_NMT.pickle', 'wb') as handle:
#    pickle.dump(output_tokenizer, handle, protocol=4)
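
# Quick check (illustrative): because the output tokenizer was created with filters='',
# the <sos> and <eos> markers are kept as ordinary tokens and receive their own indices.
print("<sos> index:", word2idx_outputs['<sos>'])
print("<eos> index:", word2idx_outputs['<eos>'])
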
"""The length of the longest sentences can also be verified from the histogram above. It can also be concluded that English sentences are normally shorter and contain fewer words on average than the translated French sentences.

Next, we need to pad the input. The reason for padding the input and the output is that text sentences can be of varying lengths, but the LSTM expects input instances of the same length. Therefore, we need to convert our sentences into fixed-length vectors. One way to do this is via padding.
"""

encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=max_input_len)
print("encoder_input_sequences.shape:", encoder_input_sequences.shape)
print("encoder_input_sequences[180]:", encoder_input_sequences[180])

"""Since there are 20,000 sentences in the input and each input sentence is of length 6, the shape of the input is now (20000, 6).

You may recall that the original sentence at index 180 is **join us**. The tokenizer divided the sentence into two words, ***join*** and ***us***, converted them to integers, and then applied pre-padding by adding four zeros at the start of the corresponding integer sequence for the sentence at index 180 of the input list.

To verify that the integer values for ***join*** and ***us*** are 464 and 59 respectively, you can pass the words to the word2idx_inputs dictionary, as shown below:
"""

print(word2idx_inputs["join"])
print(word2idx_inputs["us"])

"""In the same way, the decoder outputs and the decoder inputs are padded."""

decoder_input_sequences = pad_sequences(output_input_integer_seq, maxlen=max_out_len, padding='post')
print("decoder_input_sequences.shape:", decoder_input_sequences.shape)
print("decoder_input_sequences[180]:", decoder_input_sequences[180])

"""The sentence at index 180 of the decoder input is **<sos> Joignez-vous à nous.** If you print the corresponding integers from the word2idx_outputs dictionary, you should see 2, 2028, 20, and 228 printed on the console."""

print(word2idx_outputs["<sos>"])
print(word2idx_outputs["joignez-vous"])
print(word2idx_outputs["à"])
print(word2idx_outputs["nous."])

decoder_output_sequences = pad_sequences(output_integer_seq, maxlen=max_out_len, padding='post')
print("decoder_output_sequences.shape:", decoder_output_sequences.shape)

"""# Word Embeddings

We have already converted our words into integers. So what's the difference between an integer representation and word embeddings?

There are two main differences between a single-integer representation and word embeddings. With an integer representation, a word is represented by only a single integer. With a vector representation, a word is represented by a vector of 50, 100, 200, or however many dimensions you like. Hence, word embeddings capture a lot more information about words. Secondly, the single-integer representation doesn't capture the relationships between different words. On the contrary, word embeddings retain relationships between the words. You can either use custom word embeddings or pretrained word embeddings.

For the English sentences, i.e. the inputs, we will use the GloVe word embeddings. For the translated French sentences in the output, we will use custom word embeddings.

Let's create word embeddings for the inputs first. To do so, we need to load the GloVe word vectors into memory. We will then create a dictionary where the words are the keys and the corresponding vectors are the values.
"""

from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

glove_file = open(r'./drive/My Drive/kaggle_sarcasm/glove.twitter.27B.200d.txt', encoding="utf8")

for line in glove_file:
    rec = line.split()
    word = rec[0]
    vector_dimensions = asarray(rec[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
glove_file.close()
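
# Optional check (illustrative): count how many input words actually have a GloVe
# vector. Words that are missing keep an all-zero row in the embedding matrix built below.
covered = sum(1 for w in word2idx_inputs if w in embeddings_dictionary)
print("GloVe coverage: %d / %d input words" % (covered, len(word2idx_inputs)))
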
"""Recall that we have 2150 unique words in the input. We will create a matrix where the row number represents the integer value of the word and the columns correspond to the dimensions of the word embedding. This matrix will contain the word embeddings for the words in our input sentences."""

num_words = min(MAX_NUM_WORDS, len(word2idx_inputs) + 1)
embedding_matrix = zeros((num_words, EMBEDDING_SIZE))
for word, index in word2idx_inputs.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

print(embeddings_dictionary["join"])

"""In the previous section, we saw that the integer representation for the word **join** is 464. Let's now check the 464th row of the word embedding matrix."""

print(embedding_matrix[464])

"""You can see that the values in the 464th row of the embedding matrix are similar to the vector representation of the word **join** in the GloVe dictionary, which confirms that the rows in the embedding matrix represent the corresponding word embeddings from the GloVe word embedding dictionary. This word embedding matrix will be used to create the embedding layer for our LSTM model.

**Create the embedding layer for the input:**
"""

embedding_layer = Embedding(num_words, EMBEDDING_SIZE, weights=[embedding_matrix], input_length=max_input_len)

"""# Creating the Model

The first thing we need to do is define our outputs, since we know that the output will be a sequence of words. Recall that the total number of unique words in the output is 9,511. Therefore, each word in the output can be any of those 9,511 words. The length of an output sentence is 12. And for each input sentence, we need a corresponding output sentence. Therefore, the final shape of the output will be:
"""

# (number of inputs, length of the output sentence, number of words in the output)

decoder_targets_one_hot = np.zeros(
    (len(input_sentences), max_out_len, num_words_output),
    dtype='float32'
)
print(decoder_targets_one_hot.shape)

"""To make predictions, the final layer of the model will be a dense layer with a softmax activation function, therefore we need the outputs in the form of one-hot encoded vectors. To create such a one-hot encoded output, the next step is to assign 1 to the column number that corresponds to the integer representation of the word."""

for i, d in enumerate(decoder_output_sequences):
    for t, word in enumerate(d):
        decoder_targets_one_hot[i, t, word] = 1
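
# Note (illustrative alternative): to_categorical, imported at the top of this script,
# builds the same (num_sentences, max_out_len, num_words_output) one-hot array in one call:
#   decoder_targets_one_hot = to_categorical(decoder_output_sequences, num_classes=num_words_output)
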
"""Next, we need to create the encoder and the decoder. The input to the encoder will be the sentence in English, and the output will be the hidden state and cell state of the LSTM."""

encoder_inputs = Input(shape=(max_input_len,))
x = embedding_layer(encoder_inputs)
encoder = LSTM(LSTM_NODES, return_state=True)

encoder_outputs, h, c = encoder(x)
encoder_states = [h, c]

"""The next step is to define the decoder. The decoder will have two inputs: the hidden state and cell state from the encoder, and the input sentence, which will actually be the output sentence with the `<sos>` token appended at the beginning."""

decoder_inputs = Input(shape=(max_out_len,))

decoder_embedding = Embedding(num_words_output, LSTM_NODES)
decoder_inputs_x = decoder_embedding(decoder_inputs)

decoder_lstm = LSTM(LSTM_NODES, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states)

"""Finally, the output from the decoder LSTM is passed through a dense layer to predict the decoder outputs."""

decoder_dense = Dense(num_words_output, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Compile
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.summary()

"""Let's plot our model to see how it looks."""

from keras.utils import plot_model
plot_model(model, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)

"""From the output, you can see that we have two types of input. input_1 is the input placeholder for the encoder, which is embedded and passed through the lstm_1 layer, which is the encoder LSTM. There are three outputs from the lstm_1 layer: the output, the hidden state and the cell state. However, only the hidden state and the cell state are passed to the decoder.

Here the lstm_2 layer is the decoder LSTM. input_2 contains the output sentences with the `<sos>` token appended at the start. input_2 is also passed through an embedding layer and is used as input to the decoder LSTM, lstm_2. Finally, the output from the decoder LSTM is passed through the dense layer to make predictions.
"""

from keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
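
# Note (optional): EarlyStopping defaults to patience=0, so training stops after the first
# epoch in which val_loss fails to improve. A common alternative is to allow a few
# non-improving epochs, e.g.:
#   es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=3)
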
history = model.fit([encoder_input_sequences, decoder_input_sequences], decoder_targets_one_hot,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    callbacks=[es],
    validation_split=0.1,
)

model.save('seq2seq_eng-fra.h5')

import matplotlib.pyplot as plt
# %matplotlib inline
plt.title('Model Loss')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()


encoder_model = Model(encoder_inputs, encoder_states)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.load_weights('seq2seq_eng-fra.h5')

decoder_state_input_h = Input(shape=(LSTM_NODES,))
decoder_state_input_c = Input(shape=(LSTM_NODES,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_inputs_single = Input(shape=(1,))
decoder_inputs_single_x = decoder_embedding(decoder_inputs_single)

decoder_outputs, h, c = decoder_lstm(decoder_inputs_single_x, initial_state=decoder_states_inputs)

decoder_states = [h, c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
    [decoder_inputs_single] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

from keras.utils import plot_model
plot_model(decoder_model, to_file='model_plot_dec.png', show_shapes=True, show_layer_names=True)

"""# Making Predictions

We want our output to be a sequence of words in the French language. To do so, we need to convert the integers back to words. We will create new dictionaries for both the inputs and the outputs, where the keys will be the integers and the corresponding values will be the words.
"""

idx2word_input = {v: k for k, v in word2idx_inputs.items()}
idx2word_target = {v: k for k, v in word2idx_outputs.items()}

"""The method below accepts a padded input sequence for an English sentence (in integer form) and returns the translated French sentence."""

def translate_sentence(input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = word2idx_outputs['<sos>']
    eos = word2idx_outputs['<eos>']
    output_sentence = []

    for _ in range(max_out_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        idx = np.argmax(output_tokens[0, 0, :])

        if eos == idx:
            break

        word = ''

        if idx > 0:
            word = idx2word_target[idx]
            output_sentence.append(word)

        target_seq[0, 0] = idx
        states_value = [h, c]

    return ' '.join(output_sentence)

i = np.random.choice(len(input_sentences))
input_seq = encoder_input_sequences[i:i+1]
translation = translate_sentence(input_seq)
print('Input Language : ', input_sentences[i])
print('Actual translation : ', output_sentences[i])
print('French translation : ', translation)

--------------------------------------------------------------------------------