├── .gitignore
├── AttRight.png
├── LICENSE
├── README.md
├── SpeechDownloader.py
├── SpeechGenerator.py
├── SpeechModels.py
├── Speech_Recog_Demo.ipynb
├── audioUtils.py
├── model-KWS-attRNN.tflite
├── model-attRNN.h5
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | *.wav
2 | *.zip
3 | *.npy
4 | .ipynb_checkpoints/
5 | resources/.ipynb_checkpoints/
6 | sd_*/
7 | __pycache__/
--------------------------------------------------------------------------------
/AttRight.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/douglas125/SpeechCmdRecognition/4c1ccf3c2663db32dafa22acba635bc4d05bbd03/AttRight.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2018 Douglas Coimbra de Andrade
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Speech Command Recognition
2 | 
3 | ## A Keras implementation of a neural attention model for speech command recognition
4 | 
5 | This repository presents a recurrent attention model designed to identify keywords in short segments of audio. It has been tested using the Google Speech Command Datasets (v1 and v2).
6 | For a complete description of the architecture, please refer to [our paper](https://arxiv.org/abs/1808.08929).
7 | 
8 | Our main contributions are:
9 | 
10 | - A small footprint model (201K trainable parameters) that outperforms convolutional architectures for speech command recognition (AKA keyword spotting);
11 | - A SpeechGenerator.py script that loads WAV files saved in .npy format from disk (like a Keras image generator, but for audio files);
12 | - Attention outputs that make the model explainable (i.e., it is possible to identify what part of the audio was important to reach a conclusion).
13 | 
14 | # Attention Model
15 | 
16 | A common problem with deep learning models is that they tend to be "black boxes": it is very difficult to explain why the model reaches a certain decision.
Attention is a powerful tool to make deep neural network models explainable: the picture below demonstrates that the transition from phoneme /a/ to phoneme /i/ is the most relevant part of the audio that the model used to decide (correctly) that the word is "right". Please refer to [our paper](https://arxiv.org/abs/1808.08929) for the confusion matrix and more attention plots.
17 | 
18 | ![Attention for word Right](AttRight.png)
19 | 
20 | # How to use this code
21 | 
22 | The Demo notebook is preconfigured with a set of tasks: ```['12cmd', 'leftright', '35word', '20cmd']```. Each of these refers to how many commands the model should recognize. When loading the Google Speech Dataset, the user should also select which version to download and use by adjusting the following line:
23 | 
24 | ```gscInfo, nCategs = SpeechDownloader.PrepareGoogleSpeechCmd(version=1, task='35word')```
25 | 
26 | If you want a pretrained model, `model-attRNN.h5` contains pre-trained weights for task `35word`, `version=2`.
27 | 
28 | ## Cloning this repository
29 | 
30 | - Download or clone this repository;
31 | - Open the Demo notebook;
32 | - Choose how many words should be recognized and the Google Speech Dataset version to use;
33 | - Run training and tests.
34 | 
35 | ## Using Google Colab
36 | 
37 | Google Colaboratory is an amazing tool for experimentation using a Jupyter Notebook environment with GPUs.
38 | 
39 | - Open Colab: https://colab.research.google.com/;
40 | - Download and upload the notebook Speech_Recog_Demo.ipynb to Colab, then open it;
41 | - Enable GPU acceleration in the menu Edit -> Notebook settings;
42 | - Set useColab = True;
43 | - Choose how many words should be recognized and the Google Speech Dataset version to use;
44 | - Run training and tests.
45 | 
46 | ## Train with your own data
47 | 
48 | If you want to train with your own data:
49 | 
50 | - Use the ```WAV2Numpy``` function from ```audioUtils.py``` to save your WAV files in numpy format. This speeds up loading considerably;
51 | - Create a ```list_IDs``` array containing the paths to all the numpy files and a ```labels``` array with corresponding labels (already converted to integers);
52 | - Instantiate the ```SpeechGen``` class from ```SpeechGenerator.py```;
53 | - Create your own Keras model for audio classification or use one provided in ```SpeechModels.py```;
54 | - Train the model (a minimal end-to-end sketch of these steps is included at the end of this listing).
55 | 
56 | # Final Words
57 | 
58 | We would like to thank Google for making such a great speech dataset available for public use, for making Colab available, and for hosting the Kaggle TensorFlow Speech Recognition Challenge competition.
59 | 
60 | If you find this code useful, please cite our work:
61 | 
62 | ```
63 | @ARTICLE{2018arXiv180808929C,
64 | author = {{Coimbra de Andrade}, D. and {Leo}, S. and {Loesener Da Silva Viana}, M.
and 65 | {Bernkopf}, C.}, 66 | title = "{A neural attention model for speech command recognition}", 67 | journal = {ArXiv e-prints}, 68 | archivePrefix = "arXiv", 69 | eprint = {1808.08929}, 70 | keywords = {Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound}, 71 | year = 2018, 72 | month = aug, 73 | adsurl = {http://adsabs.harvard.edu/abs/2018arXiv180808929C}, 74 | adsnote = {Provided by the SAO/NASA Astrophysics Data System} 75 | } 76 | ``` 77 | -------------------------------------------------------------------------------- /SpeechDownloader.py: -------------------------------------------------------------------------------- 1 | """ 2 | File containing scripts to download audio from various datasets 3 | 4 | Also has tools to convert audio into numpy 5 | """ 6 | from tqdm import tqdm 7 | import requests 8 | import math 9 | import os 10 | import tarfile 11 | import numpy as np 12 | import pandas as pd 13 | 14 | import audioUtils 15 | 16 | 17 | # ################## 18 | # Google Speech Commands Dataset V2 19 | # ################## 20 | 21 | # GSCmdV2Categs = {'unknown' : 0, 'silence' : 1, '_unknown_' : 0,'_silence_' : 1, '_background_noise_' : 1, 'yes' : 2, 22 | # 'no' : 3, 'up' : 4, 'down' : 5, 'left' : 6, 'right' : 7, 'on' : 8, 'off' : 9, 'stop' : 10, 'go' : 11} 23 | # numGSCmdV2Categs = 12 24 | 25 | # "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "Zero", 26 | # "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", and "Nine" 27 | 28 | GSCmdV2Categs = { 29 | 'unknown': 0, 30 | 'silence': 0, 31 | '_unknown_': 0, 32 | '_silence_': 0, 33 | '_background_noise_': 0, 34 | 'yes': 2, 35 | 'no': 3, 36 | 'up': 4, 37 | 'down': 5, 38 | 'left': 6, 39 | 'right': 7, 40 | 'on': 8, 41 | 'off': 9, 42 | 'stop': 10, 43 | 'go': 11, 44 | 'zero': 12, 45 | 'one': 13, 46 | 'two': 14, 47 | 'three': 15, 48 | 'four': 16, 49 | 'five': 17, 50 | 'six': 18, 51 | 'seven': 19, 52 | 'eight': 20, 53 | 'nine': 1} 54 | numGSCmdV2Categs = 21 55 | 56 | 57 | def PrepareGoogleSpeechCmd(version=2, forceDownload=False, task='20cmd'): 58 | """ 59 | Prepares Google Speech commands dataset version 2 for use 60 | 61 | tasks: 20cmd, 12cmd, leftright or 35word 62 | 63 | Returns full path to training, validation and test file list and file categories 64 | """ 65 | allowedTasks = ['12cmd', 'leftright', '35word', '20cmd'] 66 | if task not in allowedTasks: 67 | raise Exception('Task must be one of: {}'.format(allowedTasks)) 68 | 69 | basePath = None 70 | if version == 2: 71 | _DownloadGoogleSpeechCmdV2(forceDownload) 72 | basePath = 'sd_GSCmdV2' 73 | elif version == 1: 74 | _DownloadGoogleSpeechCmdV1(forceDownload) 75 | basePath = 'sd_GSCmdV1' 76 | else: 77 | raise Exception('Version must be 1 or 2') 78 | 79 | if task == '12cmd': 80 | GSCmdV2Categs = { 81 | 'unknown': 0, 82 | 'silence': 1, 83 | '_unknown_': 0, 84 | '_silence_': 1, 85 | '_background_noise_': 1, 86 | 'yes': 2, 87 | 'no': 3, 88 | 'up': 4, 89 | 'down': 5, 90 | 'left': 6, 91 | 'right': 7, 92 | 'on': 8, 93 | 'off': 9, 94 | 'stop': 10, 95 | 'go': 11} 96 | numGSCmdV2Categs = 12 97 | elif task == 'leftright': 98 | GSCmdV2Categs = { 99 | 'unknown': 0, 100 | 'silence': 0, 101 | '_unknown_': 0, 102 | '_silence_': 0, 103 | '_background_noise_': 0, 104 | 'left': 1, 105 | 'right': 2} 106 | numGSCmdV2Categs = 3 107 | elif task == '35word': 108 | GSCmdV2Categs = { 109 | 'unknown': 0, 110 | 'silence': 0, 111 | '_unknown_': 0, 112 | '_silence_': 0, 113 | '_background_noise_': 0, 114 | 'yes': 2, 115 | 'no': 3, 
116 | 'up': 4, 117 | 'down': 5, 118 | 'left': 6, 119 | 'right': 7, 120 | 'on': 8, 121 | 'off': 9, 122 | 'stop': 10, 123 | 'go': 11, 124 | 'zero': 12, 125 | 'one': 13, 126 | 'two': 14, 127 | 'three': 15, 128 | 'four': 16, 129 | 'five': 17, 130 | 'six': 18, 131 | 'seven': 19, 132 | 'eight': 20, 133 | 'nine': 1, 134 | 'backward': 21, 135 | 'bed': 22, 136 | 'bird': 23, 137 | 'cat': 24, 138 | 'dog': 25, 139 | 'follow': 26, 140 | 'forward': 27, 141 | 'happy': 28, 142 | 'house': 29, 143 | 'learn': 30, 144 | 'marvin': 31, 145 | 'sheila': 32, 146 | 'tree': 33, 147 | 'visual': 34, 148 | 'wow': 35} 149 | numGSCmdV2Categs = 36 150 | elif task == '20cmd': 151 | GSCmdV2Categs = { 152 | 'unknown': 0, 153 | 'silence': 0, 154 | '_unknown_': 0, 155 | '_silence_': 0, 156 | '_background_noise_': 0, 157 | 'yes': 2, 158 | 'no': 3, 159 | 'up': 4, 160 | 'down': 5, 161 | 'left': 6, 162 | 'right': 7, 163 | 'on': 8, 164 | 'off': 9, 165 | 'stop': 10, 166 | 'go': 11, 167 | 'zero': 12, 168 | 'one': 13, 169 | 'two': 14, 170 | 'three': 15, 171 | 'four': 16, 172 | 'five': 17, 173 | 'six': 18, 174 | 'seven': 19, 175 | 'eight': 20, 176 | 'nine': 1} 177 | numGSCmdV2Categs = 21 178 | 179 | print('Converting test set WAVs to numpy files') 180 | audioUtils.WAV2Numpy(basePath + '/test/') 181 | print('Converting training set WAVs to numpy files') 182 | audioUtils.WAV2Numpy(basePath + '/train/') 183 | 184 | # read split from files and all files in folders 185 | testWAVs = pd.read_csv(basePath + '/train/testing_list.txt', 186 | sep=" ", header=None)[0].tolist() 187 | valWAVs = pd.read_csv(basePath + '/train/validation_list.txt', 188 | sep=" ", header=None)[0].tolist() 189 | 190 | testWAVs = [os.path.join(basePath + '/train/', f + '.npy') 191 | for f in testWAVs if f.endswith('.wav')] 192 | valWAVs = [os.path.join(basePath + '/train/', f + '.npy') 193 | for f in valWAVs if f.endswith('.wav')] 194 | allWAVs = [] 195 | for root, dirs, files in os.walk(basePath + '/train/'): 196 | allWAVs += [root + '/' + f for f in files if f.endswith('.wav.npy')] 197 | trainWAVs = list(set(allWAVs) - set(valWAVs) - set(testWAVs)) 198 | 199 | testWAVsREAL = [] 200 | for root, dirs, files in os.walk(basePath + '/test/'): 201 | testWAVsREAL += [root + '/' + 202 | f for f in files if f.endswith('.wav.npy')] 203 | 204 | # get categories 205 | testWAVlabels = [_getFileCategory(f, GSCmdV2Categs) for f in testWAVs] 206 | valWAVlabels = [_getFileCategory(f, GSCmdV2Categs) for f in valWAVs] 207 | trainWAVlabels = [_getFileCategory(f, GSCmdV2Categs) for f in trainWAVs] 208 | testWAVREALlabels = [_getFileCategory(f, GSCmdV2Categs) 209 | for f in testWAVsREAL] 210 | 211 | # background noise should be used for validation as well 212 | backNoiseFiles = [trainWAVs[i] for i in range(len(trainWAVlabels)) 213 | if trainWAVlabels[i] == GSCmdV2Categs['silence']] 214 | backNoiseCats = [GSCmdV2Categs['silence'] 215 | for i in range(len(backNoiseFiles))] 216 | if numGSCmdV2Categs == 12: 217 | valWAVs += backNoiseFiles 218 | valWAVlabels += backNoiseCats 219 | 220 | # build dictionaries 221 | testWAVlabelsDict = dict(zip(testWAVs, testWAVlabels)) 222 | valWAVlabelsDict = dict(zip(valWAVs, valWAVlabels)) 223 | trainWAVlabelsDict = dict(zip(trainWAVs, trainWAVlabels)) 224 | testWAVREALlabelsDict = dict(zip(testWAVsREAL, testWAVREALlabels)) 225 | 226 | # a tweak here: we will heavily underuse silence samples because there are few files. 
227 |     # we can add them to the training list to reuse them multiple times
228 |     # note that since we already added the files to the label dicts we don't
229 |     # need to do it again
230 | 
231 |     # for i in range(200):
232 |     #     trainWAVs = trainWAVs + backNoiseFiles
233 | 
234 |     # info dictionary
235 |     trainInfo = {'files': trainWAVs, 'labels': trainWAVlabelsDict}
236 |     valInfo = {'files': valWAVs, 'labels': valWAVlabelsDict}
237 |     testInfo = {'files': testWAVs, 'labels': testWAVlabelsDict}
238 |     testREALInfo = {'files': testWAVsREAL, 'labels': testWAVREALlabelsDict}
239 |     gscInfo = {'train': trainInfo,
240 |                'test': testInfo,
241 |                'val': valInfo,
242 |                'testREAL': testREALInfo}
243 | 
244 |     print('Done preparing Google Speech commands dataset version {}'.format(version))
245 | 
246 |     return gscInfo, numGSCmdV2Categs
247 | 
248 | 
249 | def _getFileCategory(file, catDict):
250 |     """
251 |     Receives a file with name sd_GSCmdV2/train/<cat>/<fileName> and returns an integer that is catDict[cat]
252 |     """
253 |     categ = os.path.basename(os.path.dirname(file))
254 |     return catDict.get(categ, 0)
255 | 
256 | 
257 | def _DownloadGoogleSpeechCmdV2(forceDownload=False):
258 |     """
259 |     Downloads Google Speech commands dataset version 2
260 |     """
261 |     if os.path.isdir("sd_GSCmdV2/") and not forceDownload:
262 |         print('Google Speech commands dataset version 2 already exists. Skipping download.')
263 |     else:
264 |         if not os.path.exists("sd_GSCmdV2/"):
265 |             os.makedirs("sd_GSCmdV2/")
266 |         trainFiles = 'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz'
267 |         testFiles = 'http://download.tensorflow.org/data/speech_commands_test_set_v0.02.tar.gz'
268 |         _downloadFile(testFiles, 'sd_GSCmdV2/test.tar.gz')
269 |         _downloadFile(trainFiles, 'sd_GSCmdV2/train.tar.gz')
270 | 
271 |     # extract files
272 |     if not os.path.isdir("sd_GSCmdV2/test/"):
273 |         _extractTar('sd_GSCmdV2/test.tar.gz', 'sd_GSCmdV2/test/')
274 | 
275 |     if not os.path.isdir("sd_GSCmdV2/train/"):
276 |         _extractTar('sd_GSCmdV2/train.tar.gz', 'sd_GSCmdV2/train/')
277 | 
278 | 
279 | def _DownloadGoogleSpeechCmdV1(forceDownload=False):
280 |     """
281 |     Downloads Google Speech commands dataset version 1
282 |     """
283 |     if os.path.isdir("sd_GSCmdV1/") and not forceDownload:
284 |         print('Google Speech commands dataset version 1 already exists. Skipping download.')
285 |     else:
286 |         if not os.path.exists("sd_GSCmdV1/"):
287 |             os.makedirs("sd_GSCmdV1/")
288 |         trainFiles = 'http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz'
289 |         testFiles = 'http://download.tensorflow.org/data/speech_commands_test_set_v0.01.tar.gz'
290 |         _downloadFile(testFiles, 'sd_GSCmdV1/test.tar.gz')
291 |         _downloadFile(trainFiles, 'sd_GSCmdV1/train.tar.gz')
292 | 
293 |     # extract files
294 |     if not os.path.isdir("sd_GSCmdV1/test/"):
295 |         _extractTar('sd_GSCmdV1/test.tar.gz', 'sd_GSCmdV1/test/')
296 | 
297 |     if not os.path.isdir("sd_GSCmdV1/train/"):
298 |         _extractTar('sd_GSCmdV1/train.tar.gz', 'sd_GSCmdV1/train/')
299 | 
300 | ##############
301 | # Utilities
302 | ##############
303 | 
304 | 
305 | def _downloadFile(url, fName):
306 |     # Streaming, so we can iterate over the response.
307 |     r = requests.get(url, stream=True)
308 | 
309 |     # Total size in bytes.
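    # (falls back to 0 if the server does not send a content-length header)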
310 |     total_size = int(r.headers.get('content-length', 0))
311 |     block_size = 1024
312 |     wrote = 0
313 |     print('Downloading {} into {}'.format(url, fName))
314 |     with open(fName, 'wb') as f:
315 |         for data in tqdm(r.iter_content(block_size),
316 |                          total=math.ceil(total_size / block_size),
317 |                          unit='KB',
318 |                          unit_scale=True):
319 |             wrote = wrote + len(data)
320 |             f.write(data)
321 |     if total_size != 0 and wrote != total_size:
322 |         print("ERROR, something went wrong")
323 | 
324 | 
325 | def _extractTar(fname, folder):
326 |     print('Extracting {} into {}'.format(fname, folder))
327 |     if (fname.endswith("tar.gz")):
328 |         tar = tarfile.open(fname, "r:gz")
329 |         tar.extractall(path=folder)
330 |         tar.close()
331 |     elif (fname.endswith("tar")):
332 |         tar = tarfile.open(fname, "r:")
333 |         tar.extractall(path=folder)
334 |         tar.close()
335 | 
--------------------------------------------------------------------------------
/SpeechGenerator.py:
--------------------------------------------------------------------------------
1 | """
2 | A generator for reading and serving audio files
3 | 
4 | https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly.html
5 | 
6 | Remember to use multiprocessing:
7 | # Train model on dataset
8 | model.fit(training_generator,
9 |           validation_data=validation_generator,
10 |           use_multiprocessing=True,
11 |           workers=6)
12 | 
13 | """
14 | 
15 | import numpy as np
16 | import tensorflow.keras
17 | 
18 | 
19 | class SpeechGen(tensorflow.keras.utils.Sequence):
20 |     """
21 |     'Generates data for Keras'
22 | 
23 |     list_IDs - list of files that this generator should load
24 |     labels - dictionary of corresponding (integer) category
25 |              to each file in list_IDs
26 | 
27 |     Expects list_IDs and labels to be of the same length
28 |     """
29 |     def __init__(self, list_IDs, labels, batch_size=32,
30 |                  dim=16000, shuffle=True):
31 |         'Initialization'
32 |         self.dim = dim
33 |         self.batch_size = batch_size
34 |         self.labels = labels
35 |         self.list_IDs = list_IDs
36 |         self.shuffle = shuffle
37 |         self.on_epoch_end()
38 | 
39 |     def __len__(self):
40 |         'Denotes the number of batches per epoch'
41 |         return int(np.floor(len(self.list_IDs) / self.batch_size))
42 | 
43 |     def __getitem__(self, index):
44 |         'Generate one batch of data'
45 |         # Generate indexes of the batch
46 |         indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
47 | 
48 |         # Find list of IDs
49 |         list_IDs_temp = [self.list_IDs[k] for k in indexes]
50 | 
51 |         # Generate data
52 |         X, y = self.__data_generation(list_IDs_temp)
53 | 
54 |         return X, y
55 | 
56 |     def on_epoch_end(self):
57 |         'Updates indexes after each epoch'
58 |         self.indexes = np.arange(len(self.list_IDs))
59 |         if self.shuffle:
60 |             np.random.shuffle(self.indexes)
61 | 
62 |     def __data_generation(self, list_IDs_temp):
63 |         'Generates data containing batch_size samples'
64 |         # X : (n_samples, *dim, n_channels)
65 |         # Initialization
66 |         X = np.zeros((self.batch_size, self.dim))  # zeros (not empty) so shorter clips are zero-padded
67 |         y = np.empty((self.batch_size), dtype=int)
68 | 
69 |         # Generate data
70 |         for i, ID in enumerate(list_IDs_temp):
71 | 
72 |             # load data from file, saved as numpy array on disk
73 |             curX = np.load(ID)[:, 0]
74 | 
75 |             # normalize
76 |             # invMax = 1/(np.max(np.abs(curX))+1e-3)
77 |             # curX *= invMax
78 | 
79 |             # curX could be bigger or smaller than self.dim
80 |             if curX.shape[0] == self.dim:
81 |                 X[i] = curX
82 |             elif curX.shape[0] > self.dim:  # bigger
83 |                 # we can choose any position in curX-self.dim
84 |                 randPos = np.random.randint(curX.shape[0]-self.dim)
85 |                 X[i]
= curX[randPos:randPos+self.dim] 86 | else: # smaller 87 | randPos = np.random.randint(self.dim-curX.shape[0]) 88 | X[i, randPos:randPos + curX.shape[0]] = curX 89 | # print('File dim smaller') 90 | 91 | # Store class 92 | y[i] = self.labels[ID] 93 | 94 | return X, y 95 | -------------------------------------------------------------------------------- /SpeechModels.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow.keras.models import Model, load_model 3 | 4 | from tensorflow.keras import layers as L 5 | from tensorflow.keras import backend as K 6 | from tensorflow.keras.utils import to_categorical 7 | from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler 8 | from tensorflow.keras import backend as K 9 | from tensorflow.keras import optimizers 10 | import audioUtils 11 | 12 | def get_melspec_model(iLen=None): 13 | inp = L.Input((iLen,), name='input') 14 | mel_spec = audioUtils.normalized_mel_spectrogram(inp) 15 | melspecModel = Model(inputs=inp, outputs=mel_spec, name='normalized_spectrogram_model') 16 | return melspecModel 17 | 18 | def AttRNNSpeechModel(nCategories, samplingrate=16000, 19 | inputLength=16000, rnn_func=L.LSTM): 20 | # simple LSTM 21 | sr = samplingrate 22 | iLen = inputLength 23 | 24 | inputs = L.Input((inputLength,), name='input') 25 | 26 | m = get_melspec_model(iLen=inputLength) 27 | m.trainable = False 28 | 29 | x = m(inputs) 30 | x = tf.expand_dims(x, axis=-1, name='mel_stft') 31 | 32 | x = L.Conv2D(10, (5, 1), activation='relu', padding='same')(x) 33 | x = L.BatchNormalization()(x) 34 | x = L.Conv2D(1, (5, 1), activation='relu', padding='same')(x) 35 | x = L.BatchNormalization()(x) 36 | 37 | # x = Reshape((125, 80)) (x) 38 | # keras.backend.squeeze(x, axis) 39 | x = L.Lambda(lambda q: K.squeeze(q, -1), name='squeeze_last_dim')(x) 40 | 41 | x = L.Bidirectional(rnn_func(64, return_sequences=True) 42 | )(x) # [b_s, seq_len, vec_dim] 43 | x = L.Bidirectional(rnn_func(64, return_sequences=True) 44 | )(x) # [b_s, seq_len, vec_dim] 45 | 46 | xFirst = L.Lambda(lambda q: q[:, -1])(x) # [b_s, vec_dim] 47 | query = L.Dense(128)(xFirst) 48 | 49 | # dot product attention 50 | attScores = L.Dot(axes=[1, 2])([query, x]) 51 | attScores = L.Softmax(name='attSoftmax')(attScores) # [b_s, seq_len] 52 | 53 | # rescale sequence 54 | attVector = L.Dot(axes=[1, 1])([attScores, x]) # [b_s, vec_dim] 55 | 56 | x = L.Dense(64, activation='relu')(attVector) 57 | x = L.Dense(32)(x) 58 | 59 | output = L.Dense(nCategories, activation='softmax', name='output')(x) 60 | 61 | model = Model(inputs=[inputs], outputs=[output]) 62 | 63 | return model 64 | -------------------------------------------------------------------------------- /audioUtils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Utility functions for audio files 3 | """ 4 | import os 5 | import itertools 6 | import numpy as np 7 | from tqdm import tqdm 8 | import tensorflow as tf 9 | import matplotlib.pyplot as plt 10 | 11 | 12 | def plot_confusion_matrix(cm, classes, 13 | normalize=False, 14 | title='Confusion matrix', 15 | cmap=plt.cm.Blues): 16 | """ 17 | This function prints and plots the confusion matrix. 18 | Normalization can be applied by setting `normalize=True`. 
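    cm - confusion matrix (integer counts, e.g. from sklearn.metrics.confusion_matrix)
    classes - list of class names used as axis tick labels
    The figure is also saved to picConfMatrix.png.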
19 | """ 20 | if normalize: 21 | cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 22 | print("Normalized confusion matrix") 23 | else: 24 | print('Confusion matrix, without normalization') 25 | 26 | plt.figure(figsize=(15, 15)) 27 | plt.imshow(cm, interpolation='nearest', cmap=cmap) 28 | plt.title(title, fontsize=30) 29 | plt.colorbar() 30 | tick_marks = np.arange(len(classes)) 31 | plt.xticks(tick_marks, classes, rotation=45, fontsize=15) 32 | plt.yticks(tick_marks, classes, fontsize=15) 33 | 34 | fmt = '.3f' if normalize else 'd' 35 | thresh = cm.max() / 2. 36 | for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): 37 | plt.text(j, i, format(cm[i, j], fmt), size=11, 38 | horizontalalignment="center", 39 | color="white" if cm[i, j] > thresh else "black") 40 | 41 | plt.ylabel('True label', fontsize=30) 42 | plt.xlabel('Predicted label', fontsize=30) 43 | plt.savefig('picConfMatrix.png', dpi=400) 44 | plt.tight_layout() 45 | 46 | 47 | def WAV2Numpy(folder, sr=None): 48 | """ 49 | Recursively converts WAV to numpy arrays. 50 | Deletes the WAV files in the process 51 | 52 | folder - folder to convert. 53 | """ 54 | allFiles = [] 55 | for root, dirs, files in os.walk(folder): 56 | allFiles += [os.path.join(root, f) for f in files 57 | if f.endswith('.wav')] 58 | 59 | for file in tqdm(allFiles): 60 | x = tf.io.read_file(str(file)) 61 | y, sample_rate = tf.audio.decode_wav(x, desired_channels=1, desired_samples=16000,) 62 | 63 | # if we want to write the file later 64 | np.save(file + '.npy', y.numpy()) 65 | os.remove(file) 66 | 67 | # this was supposed to be tfio.audio.spectrogram 68 | def spectrogram_fn(input_signal, nfft, window, stride, name=None): 69 | """ 70 | Create spectrogram from audio. 71 | Args: 72 | input: An 1-D audio signal Tensor. 73 | nfft: Size of FFT. 74 | window: Size of window. 75 | stride: Size of hops between windows. 76 | name: A name for the operation (optional). 77 | Returns: 78 | A tensor of spectrogram. 79 | """ 80 | 81 | # TODO: Support audio with channel > 1. 82 | return tf.math.abs( 83 | tf.signal.stft( 84 | input_signal, 85 | frame_length=window, 86 | frame_step=stride, 87 | fft_length=nfft, 88 | window_fn=tf.signal.hann_window, 89 | pad_end=True, 90 | ) 91 | ) 92 | 93 | def normalized_mel_spectrogram(x, sr=16000, n_mel_bins=80): 94 | spec_stride = 128 95 | spec_len = 1024 96 | 97 | spectrogram = spectrogram_fn( 98 | x, nfft=spec_len, window=spec_len, stride=spec_stride 99 | ) 100 | 101 | num_spectrogram_bins = spec_len // 2 + 1 # spectrogram.shape[-1] 102 | lower_edge_hertz, upper_edge_hertz = 40.0, 8000.0 103 | num_mel_bins = n_mel_bins 104 | linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix( 105 | num_mel_bins, num_spectrogram_bins, sr, lower_edge_hertz, 106 | upper_edge_hertz) 107 | mel_spectrograms = tf.tensordot(spectrogram, linear_to_mel_weight_matrix, 1) 108 | mel_spectrograms.set_shape(spectrogram.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:])) 109 | 110 | # Compute a stabilized log to get log-magnitude mel-scale spectrograms. 
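    # (the 1e-6 offset avoids log(0); the result is then standardized to zero mean and unit variance)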
111 | log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6) 112 | avg = tf.math.reduce_mean(log_mel_spectrograms) 113 | std = tf.math.reduce_std(log_mel_spectrograms) 114 | 115 | return (log_mel_spectrograms - avg) / std -------------------------------------------------------------------------------- /model-KWS-attRNN.tflite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/douglas125/SpeechCmdRecognition/4c1ccf3c2663db32dafa22acba635bc4d05bbd03/model-KWS-attRNN.tflite -------------------------------------------------------------------------------- /model-attRNN.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/douglas125/SpeechCmdRecognition/4c1ccf3c2663db32dafa22acba635bc4d05bbd03/model-attRNN.h5 -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow==2.10.1 2 | pandas>=0.25 3 | tqdm 4 | librosa 5 | matplotlib 6 | --------------------------------------------------------------------------------
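
The README's "Train with your own data" steps can be strung together as in the minimal sketch below, which combines `audioUtils.WAV2Numpy`, `SpeechGenerator.SpeechGen` and `SpeechModels.AttRNNSpeechModel` from the files above. The `my_audio/` folder layout, the two-class `catDict`, the optimizer, the epoch count and the output file name are illustrative assumptions, not part of the repository.

```
import os

import audioUtils
import SpeechGenerator
import SpeechModels

# 1. Convert the WAV files to .npy once (note: WAV2Numpy deletes the original .wav files).
audioUtils.WAV2Numpy('my_audio/')

# 2. Build list_IDs (paths to the .npy files) and a labels dict mapping each path
#    to an integer category. Here the category comes from the parent folder name,
#    e.g. my_audio/left/example.wav.npy -> 'left'.
catDict = {'left': 0, 'right': 1}  # hypothetical categories
list_IDs = []
labels = {}
for root, dirs, files in os.walk('my_audio/'):
    for f in files:
        if f.endswith('.wav.npy'):
            path = os.path.join(root, f)
            list_IDs.append(path)
            labels[path] = catDict[os.path.basename(root)]

# 3. Wrap the files in the generator (1 s of 16 kHz audio -> dim=16000 samples).
trainGen = SpeechGenerator.SpeechGen(list_IDs, labels, batch_size=32, dim=16000)

# 4. Build, compile and train the attention model, then save the weights.
model = SpeechModels.AttRNNSpeechModel(len(catDict), samplingrate=16000, inputLength=16000)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])
model.fit(trainGen, epochs=20)
model.save('my-attRNN.h5')
```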