├── README.md ├── language_identification.py ├── requirements.txt └── train_files.json /README.md: -------------------------------------------------------------------------------- 1 | # Spoken Language Identification 2 | 3 | ## Objective 4 | Spoken Language Identification (LID) is broadly defined as recognizing the language of a given speech utterance. It has numerous applications in automated speech and language recognition, multilingual machine translation, speech-to-speech translation, and emergency call routing. In this project, we classify three languages (English, Hindi, and Mandarin) from crowd-sourced spoken utterances. We implement a GRU/LSTM model and train it with Keras to classify the languages. We use MFCC features, as they are widely employed in speech processing applications, including LID. 5 | 6 | ## Environment Setup 7 | Download the codebase and open a terminal in the root directory. Make sure Python 3.6 is installed in the current environment. Then execute 8 | 9 | pip install -r requirements.txt 10 | 11 | This should install all the packages necessary for the code to run. 12 | 13 | ## Dataset 14 | The dataset consists of a number of wav files and a JSON file containing labels. The wav file names are anonymized, and class labels are provided as integers. Training is done with the provided integer class labels. The following mapping is used to convert language IDs to integer labels: 15 | mapping = {'english': 0, 'hindi': 1, 'mandarin': 2} 16 | 17 | I have not uploaded the audio files here due to a size constraint. The `train_files.json` file maps each audio file to the language spoken in it. 18 | 19 | ## Sample length 20 | The full audio files are ~10 minutes long, which might be too long to train an RNN on. Multiple 10-second samples are therefore created from every utterance, and each sample is assigned the same label as the original utterance. 
The sequence length can be changed to experiment with samples of different lengths. 21 | 22 | ## Audio Format 23 | The wav files have a 16 kHz sampling rate, a single channel, and 16-bit signed integer PCM encoding. 24 | 25 | ## Notes about the code 26 | The code is divided into 6 blocks. Refer to the following notes to comment/uncomment the blocks as needed: 27 | 28 | - The code in Block 1 extracts the provided MFCC features and writes them into a dataset “mfcc_dataset.hdf5”. This part of the code can be commented out if the hdf5 file already exists. 29 | 30 | - The code in Block 2 reads the “mfcc_dataset.hdf5” dataset. Do not comment it out. 31 | 32 | - The code in Block 3 trains the model. Comment it out after the model has been trained and saved as “sld.hdf5”. 33 | 34 | - The code in Block 4 sets up the inference mode. 35 | 36 | - The code in Block 5 runs the streaming model in inference mode, predicting the label for a single random sequence from the validation dataset. 37 | 38 | - The code in Block 6 runs the streaming model in inference mode, predicting the labels for all sequences in the validation dataset. Comment it out since it can take a long time to run. 
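The chunking described above can be sketched as follows. `make_sequences` is an illustrative helper, not a function from this repository: it splits an utterance's MFCC matrix into fixed-length windows and assigns per-frame one-hot labels, mirroring what Block 1 of `language_identification.py` does.

```python
import numpy as np

def make_sequences(mfcc, label_index, sequence_length=1000, num_classes=3):
    """Split an (n_frames x 64) MFCC matrix into fixed-length windows.

    Each window inherits the utterance's label as a per-frame one-hot
    vector, matching the (num_sequences, sequence_length, 3) label
    arrays built in Block 1. Illustrative helper only.
    """
    # Drop the trailing partial window, as Block 1 does via floor division.
    num_sequences = len(mfcc) // sequence_length
    sequences = np.stack([mfcc[i * sequence_length:(i + 1) * sequence_length]
                          for i in range(num_sequences)])
    one_hot = np.eye(num_classes)[label_index]
    # np.full broadcasts the one-hot vector over every frame of every window.
    labels = np.full((num_sequences, sequence_length, num_classes), one_hot)
    return sequences, labels

# A fake 25-second utterance: 2500 frames of 64 MFCCs, labelled Hindi (1).
X, Y = make_sequences(np.zeros((2500, 64)), 1)
# X.shape == (2, 1000, 64); Y.shape == (2, 1000, 3)
```

At 1000 frames per 10-second window, the last 500 frames of this fake utterance are discarded, just as any trailing partial window is discarded in Block 1.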
39 | -------------------------------------------------------------------------------- /language_identification.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import glob 3 | import os 4 | from keras.models import Model 5 | from keras.layers import Input, Dense, GRU, CuDNNGRU, CuDNNLSTM 6 | from keras import optimizers 7 | import h5py 8 | from sklearn.model_selection import train_test_split 9 | from keras.models import load_model 10 | 11 | 12 | def language_name(index): 13 | if index == 0: 14 | return "English" 15 | elif index == 1: 16 | return "Hindi" 17 | elif index == 2: 18 | return "Mandarin" 19 | 20 | # ---------------------------BLOCK 1------------------------------------ 21 | # COMMENT/UNCOMMENT BELOW CODE BLOCK - 22 | # Below code extracts mfcc features from the files provided into a dataset 23 | codePath = './train/' 24 | num_mfcc_features = 64 25 | 26 | english_mfcc = np.array([]).reshape(0, num_mfcc_features) 27 | for file in glob.glob(codePath + 'english/*.npy'): 28 | current_data = np.load(file).T 29 | english_mfcc = np.vstack((english_mfcc, current_data)) 30 | 31 | hindi_mfcc = np.array([]).reshape(0, num_mfcc_features) 32 | for file in glob.glob(codePath + 'hindi/*.npy'): 33 | current_data = np.load(file).T 34 | hindi_mfcc = np.vstack((hindi_mfcc, current_data)) 35 | 36 | mandarin_mfcc = np.array([]).reshape(0, num_mfcc_features) 37 | for file in glob.glob(codePath + 'mandarin/*.npy'): 38 | current_data = np.load(file).T 39 | mandarin_mfcc = np.vstack((mandarin_mfcc, current_data)) 40 | 41 | # Sequence length is 10 seconds 42 | sequence_length = 1000 43 | list_english_mfcc = [] 44 | num_english_sequence = int(np.floor(len(english_mfcc)/sequence_length)) 45 | for i in range(num_english_sequence): 46 | list_english_mfcc.append(english_mfcc[sequence_length*i:sequence_length*(i+1)]) 47 | list_english_mfcc = np.array(list_english_mfcc) 48 | english_labels = np.full((num_english_sequence, 
1000, 3), np.array([1, 0, 0])) 49 | 50 | list_hindi_mfcc = [] 51 | num_hindi_sequence = int(np.floor(len(hindi_mfcc)/sequence_length)) 52 | for i in range(num_hindi_sequence): 53 | list_hindi_mfcc.append(hindi_mfcc[sequence_length*i:sequence_length*(i+1)]) 54 | list_hindi_mfcc = np.array(list_hindi_mfcc) 55 | hindi_labels = np.full((num_hindi_sequence, 1000, 3), np.array([0, 1, 0])) 56 | 57 | list_mandarin_mfcc = [] 58 | num_mandarin_sequence = int(np.floor(len(mandarin_mfcc)/sequence_length)) 59 | for i in range(num_mandarin_sequence): 60 | list_mandarin_mfcc.append(mandarin_mfcc[sequence_length*i:sequence_length*(i+1)]) 61 | list_mandarin_mfcc = np.array(list_mandarin_mfcc) 62 | mandarin_labels = np.full((num_mandarin_sequence, 1000, 3), np.array([0, 0, 1])) 63 | 64 | del english_mfcc 65 | del hindi_mfcc 66 | del mandarin_mfcc 67 | 68 | total_sequence_length = num_english_sequence + num_hindi_sequence + num_mandarin_sequence 69 | Y_train = np.vstack((english_labels, hindi_labels)) 70 | Y_train = np.vstack((Y_train, mandarin_labels)) 71 | 72 | X_train = np.vstack((list_english_mfcc, list_hindi_mfcc)) 73 | X_train = np.vstack((X_train, list_mandarin_mfcc)) 74 | 75 | del list_english_mfcc 76 | del list_hindi_mfcc 77 | del list_mandarin_mfcc 78 | 79 | X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2) 80 | 81 | with h5py.File("mfcc_dataset.hdf5", 'w') as hf: 82 | hf.create_dataset('X_train', data=X_train) 83 | hf.create_dataset('Y_train', data=Y_train) 84 | hf.create_dataset('X_val', data=X_val) 85 | hf.create_dataset('Y_val', data=Y_val) 86 | # --------------------------------------------------------------- 87 | 88 | 89 | # --------------------------BLOCK 2------------------------------------- 90 | # Load MFCC Dataset created by the code in the previous steps 91 | with h5py.File("mfcc_dataset.hdf5", 'r') as hf: 92 | X_train = hf['X_train'][:] 93 | Y_train = hf['Y_train'][:] 94 | X_val = hf['X_val'][:] 95 | Y_val = hf['Y_val'][:] 96 | 
# --------------------------------------------------------------- 97 | 98 | 99 | # ---------------------------BLOCK 3------------------------------------ 100 | # Setting up the model for training 101 | DROPOUT = 0.3 102 | RECURRENT_DROP_OUT = 0.2 103 | optimizer = optimizers.Adam(decay=1e-4) 104 | main_input = Input(shape=(X_train.shape[1], 64), name='main_input')  # sequence length read from the loaded dataset, so Block 1 can stay commented out 105 | 106 | # ### main_input = Input(shape=(None, 64), name='main_input') 107 | # ### pred_gru = GRU(4, return_sequences=True, name='pred_gru')(main_input) 108 | # ### rnn_output = Dense(3, activation='softmax', name='rnn_output')(pred_gru) 109 | 110 | layer1 = CuDNNLSTM(64, return_sequences=True, name='layer1')(main_input) 111 | layer2 = CuDNNLSTM(32, return_sequences=True, name='layer2')(layer1) 112 | layer3 = Dense(100, activation='tanh', name='layer3')(layer2) 113 | rnn_output = Dense(3, activation='softmax', name='rnn_output')(layer3) 114 | 115 | model = Model(inputs=main_input, outputs=rnn_output) 116 | print('\nCompiling model...') 117 | model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['acc']) 118 | model.summary() 119 | history = model.fit(X_train, Y_train, batch_size=32, epochs=75, validation_data=(X_val, Y_val), shuffle=True, verbose=1) 120 | model.save('sld.hdf5') 121 | # --------------------------------------------------------------- 122 | 123 | # --------------------------BLOCK 4------------------------------------- 124 | # Inference Mode Setup 125 | streaming_input = Input(name='streaming_input', batch_shape=(1, 1, 64)) 126 | pred_layer1 = CuDNNLSTM(64, return_sequences=True, name='layer1', stateful=True)(streaming_input) 127 | pred_layer2 = CuDNNLSTM(32, return_sequences=True, name='layer2', stateful=True)(pred_layer1)  # stateful here too, so this layer also carries state across one-frame predictions 128 | pred_layer3 = Dense(100, activation='tanh', name='layer3')(pred_layer2) 129 | pred_output = Dense(3, activation='softmax', name='rnn_output')(pred_layer3) 130 | streaming_model = Model(inputs=streaming_input, outputs=pred_output) 131 | 
streaming_model.load_weights('sld.hdf5') 132 | # streaming_model.summary() 133 | # --------------------------------------------------------------- 134 | 135 | # ---------------------------BLOCK 5------------------------------------ 136 | # Language Prediction for a random sequence from the validation data set 137 | random_val_sample = np.random.randint(0, X_val.shape[0]) 138 | random_sequence_num = np.random.randint(0, len(X_val[random_val_sample])) 139 | test_single = X_val[random_val_sample][random_sequence_num].reshape(1, 1, 64) 140 | val_label = Y_val[random_val_sample][random_sequence_num] 141 | true_label = language_name(np.argmax(val_label)) 142 | print("***********************") 143 | print("True label is ", true_label) 144 | single_test_pred_prob = streaming_model.predict(test_single) 145 | pred_label = language_name(np.argmax(single_test_pred_prob)) 146 | print("Predicted label is ", pred_label) 147 | print("***********************") 148 | # --------------------------------------------------------------- 149 | 150 | # ---------------------------BLOCK 6------------------------------------ 151 | ## COMMENT/UNCOMMENT BELOW 152 | # Prediction for all sequences in the validation set - Takes very long to run 153 | print("Predicting labels for all sequences - (Will take a lot of time)") 154 | list_pred_labels = [] 155 | for i in range(X_val.shape[0]): 156 | for j in range(X_val.shape[1]): 157 | test = X_val[i][j].reshape(1, 1, 64) 158 | seq_predictions_prob = streaming_model.predict(test) 159 | predicted_language_index = np.argmax(seq_predictions_prob) 160 | list_pred_labels.append(predicted_language_index) 161 | pred_english = list_pred_labels.count(0) 162 | pred_hindi = list_pred_labels.count(1) 163 | pred_mandarin = list_pred_labels.count(2) 164 | print("Number of English labels = ", pred_english) 165 | print("Number of Hindi labels = ", pred_hindi) 166 | print("Number of Mandarin labels = ", pred_mandarin) 167 | # 
--------------------------------------------------------------- 168 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.13.3 2 | keras==2.2.4 3 | pandas==0.24.2 4 | matplotlib==2.0.2 5 | scikit-learn==0.20.1 6 | h5py==2.9.0 7 | -------------------------------------------------------------------------------- /train_files.json: -------------------------------------------------------------------------------- 1 | { 2 | "english/speaker-01-file-00.wav": 0, 3 | "english/speaker-02-file-00.wav": 0, 4 | "english/speaker-03-file-00.wav": 0, 5 | "english/speaker-04-file-00.wav": 0, 6 | "english/speaker-06-file-00.wav": 0, 7 | "english/speaker-07-file-00.wav": 0, 8 | "english/speaker-08-file-00.wav": 0, 9 | "english/speaker-09-file-00.wav": 0, 10 | "english/speaker-11-file-00.wav": 0, 11 | "english/speaker-12-file-00.wav": 0, 12 | "english/speaker-13-file-00.wav": 0, 13 | "english/speaker-14-file-00.wav": 0, 14 | "english/speaker-16-file-00.wav": 0, 15 | "english/speaker-17-file-00.wav": 0, 16 | "english/speaker-18-file-00.wav": 0, 17 | "english/speaker-19-file-00.wav": 0, 18 | "english/speaker-21-file-00.wav": 0, 19 | "english/speaker-22-file-00.wav": 0, 20 | "english/speaker-23-file-00.wav": 0, 21 | "english/speaker-24-file-00.wav": 0, 22 | "english/speaker-26-file-00.wav": 0, 23 | "english/speaker-27-file-00.wav": 0, 24 | "english/speaker-28-file-00.wav": 0, 25 | "english/speaker-29-file-00.wav": 0, 26 | "english/speaker-31-file-00.wav": 0, 27 | "english/speaker-32-file-00.wav": 0, 28 | "english/speaker-33-file-00.wav": 0, 29 | "english/speaker-34-file-00.wav": 0, 30 | "english/speaker-36-file-00.wav": 0, 31 | "english/speaker-37-file-00.wav": 0, 32 | "english/speaker-38-file-00.wav": 0, 33 | "english/speaker-39-file-00.wav": 0, 34 | "english/speaker-41-file-00.wav": 0, 35 | "english/speaker-42-file-00.wav": 0, 36 | 
"english/speaker-43-file-00.wav": 0, 37 | "english/speaker-44-file-00.wav": 0, 38 | "english/speaker-46-file-00.wav": 0, 39 | "english/speaker-47-file-00.wav": 0, 40 | "english/speaker-48-file-00.wav": 0, 41 | "english/speaker-49-file-00.wav": 0, 42 | "english/speaker-51-file-00.wav": 0, 43 | "english/speaker-52-file-00.wav": 0, 44 | "english/speaker-53-file-00.wav": 0, 45 | "english/speaker-54-file-00.wav": 0, 46 | "english/speaker-56-file-00.wav": 0, 47 | "english/speaker-57-file-00.wav": 0, 48 | "english/speaker-58-file-00.wav": 0, 49 | "english/speaker-59-file-00.wav": 0, 50 | "english/speaker-61-file-00.wav": 0, 51 | "english/speaker-62-file-00.wav": 0, 52 | "english/speaker-63-file-00.wav": 0, 53 | "english/speaker-64-file-00.wav": 0, 54 | "english/speaker-66-file-00.wav": 0, 55 | "english/speaker-67-file-00.wav": 0, 56 | "english/speaker-68-file-00.wav": 0, 57 | "english/speaker-69-file-00.wav": 0, 58 | "english/speaker-71-file-00.wav": 0, 59 | "english/speaker-72-file-00.wav": 0, 60 | "english/speaker-73-file-00.wav": 0, 61 | "english/speaker-74-file-00.wav": 0, 62 | "english/speaker-76-file-00.wav": 0, 63 | "english/speaker-77-file-00.wav": 0, 64 | "english/speaker-78-file-00.wav": 0, 65 | "english/speaker-79-file-00.wav": 0, 66 | "english/speaker-80-file-00.wav": 0, 67 | "hindi/speaker-01-file-00.wav": 1, 68 | "hindi/speaker-03-file-00.wav": 1, 69 | "hindi/speaker-07-file-00.wav": 1, 70 | "hindi/speaker-08-file-00.wav": 1, 71 | "hindi/speaker-11-file-00.wav": 1, 72 | "hindi/speaker-12-file-00.wav": 1, 73 | "hindi/speaker-14-file-00.wav": 1, 74 | "hindi/speaker-23-file-00.wav": 1, 75 | "hindi/speaker-24-file-00.wav": 1, 76 | "hindi/speaker-27-file-00.wav": 1, 77 | "hindi/speaker-36-file-00.wav": 1, 78 | "hindi/speaker-37-file-00.wav": 1, 79 | "hindi/speaker-38-file-00.wav": 1, 80 | "hindi/speaker-42-file-00.wav": 1, 81 | "hindi/speaker-43-file-00.wav": 1, 82 | "hindi/speaker-47-file-00.wav": 1, 83 | "hindi/speaker-51-file-00.wav": 1, 84 | 
"hindi/speaker-61-file-00.wav": 1, 85 | "hindi/speaker-66-file-00.wav": 1, 86 | "hindi/speaker-72-file-00.wav": 1, 87 | "hindi/speaker-77-file-00.wav": 1, 88 | "mandarin/speaker-02-file-00.wav": 2, 89 | "mandarin/speaker-04-file-00.wav": 2, 90 | "mandarin/speaker-06-file-00.wav": 2, 91 | "mandarin/speaker-09-file-00.wav": 2, 92 | "mandarin/speaker-13-file-00.wav": 2, 93 | "mandarin/speaker-16-file-00.wav": 2, 94 | "mandarin/speaker-17-file-00.wav": 2, 95 | "mandarin/speaker-18-file-00.wav": 2, 96 | 97 | "mandarin/speaker-19-file-00.wav": 2, 98 | "mandarin/speaker-21-file-00.wav": 2, 99 | "mandarin/speaker-22-file-00.wav": 2, 100 | "mandarin/speaker-26-file-00.wav": 2, 101 | "mandarin/speaker-28-file-00.wav": 2, 102 | "mandarin/speaker-29-file-00.wav": 2, 103 | "mandarin/speaker-31-file-00.wav": 2, 104 | "mandarin/speaker-32-file-00.wav": 2, 105 | "mandarin/speaker-33-file-00.wav": 2, 106 | "mandarin/speaker-34-file-00.wav": 2, 107 | "mandarin/speaker-39-file-00.wav": 2, 108 | "mandarin/speaker-41-file-00.wav": 2, 109 | "mandarin/speaker-44-file-00.wav": 2, 110 | "mandarin/speaker-46-file-00.wav": 2, 111 | "mandarin/speaker-48-file-00.wav": 2, 112 | "mandarin/speaker-49-file-00.wav": 2, 113 | "mandarin/speaker-52-file-00.wav": 2, 114 | "mandarin/speaker-53-file-00.wav": 2, 115 | "mandarin/speaker-54-file-00.wav": 2, 116 | "mandarin/speaker-56-file-00.wav": 2, 117 | "mandarin/speaker-57-file-00.wav": 2, 118 | "mandarin/speaker-58-file-00.wav": 2, 119 | "mandarin/speaker-59-file-00.wav": 2, 120 | "mandarin/speaker-62-file-00.wav": 2, 121 | "mandarin/speaker-63-file-00.wav": 2, 122 | "mandarin/speaker-64-file-00.wav": 2, 123 | "mandarin/speaker-67-file-00.wav": 2, 124 | "mandarin/speaker-68-file-00.wav": 2, 125 | "mandarin/speaker-69-file-00.wav": 2, 126 | "mandarin/speaker-71-file-00.wav": 2, 127 | "mandarin/speaker-73-file-00.wav": 2, 128 | "mandarin/speaker-74-file-00.wav": 2, 129 | "mandarin/speaker-76-file-00.wav": 2, 130 | "mandarin/speaker-78-file-00.wav": 2, 
131 | "mandarin/speaker-79-file-00.wav": 2, 132 | "mandarin/speaker-80-file-00.wav": 2 133 | } 134 | --------------------------------------------------------------------------------