├── .gitignore
├── AttRight.png
├── LICENSE
├── README.md
├── SpeechDownloader.py
├── SpeechGenerator.py
├── SpeechModels.py
├── Speech_Recog_Demo.ipynb
├── audioUtils.py
├── model-KWS-attRNN.tflite
├── model-attRNN.h5
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | *.wav
2 | *.zip
3 | *.npy
4 | .ipynb_checkpoints/
5 | resources/.ipynb_checkpoints/
6 | sd_*/
7 | __pycache__/
--------------------------------------------------------------------------------
/AttRight.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/douglas125/SpeechCmdRecognition/4c1ccf3c2663db32dafa22acba635bc4d05bbd03/AttRight.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2018 Douglas Coimbra de Andrade
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Speech Command Recognition
2 | 
3 | ## A Keras implementation of a neural attention model for speech command recognition
4 | 
5 | This repository presents a recurrent attention model designed to identify keywords in short segments of audio. It has been tested using the Google Speech Command Datasets (v1 and v2).
6 | For a complete description of the architecture, please refer to [our paper](https://arxiv.org/abs/1808.08929).
7 | 
8 | Our main contributions are:
9 | 
10 | - A small footprint model (201K trainable parameters) that outperforms convolutional architectures for speech command recognition (AKA keyword spotting);
11 | - A SpeechGenerator.py script that loads WAV files saved in .npy format from disk (like a Keras image generator, but for audio files);
12 | - Attention outputs that make the model explainable (i.e., it is possible to identify what part of the audio was important to reach a conclusion).
13 | 
14 | # Attention Model
15 | 
16 | A common problem with deep learning models is that they tend to be "black boxes": it is very difficult to explain why the model reaches a certain decision.
Attention is a powerful tool to make deep neural network models explainable: the picture below demonstrates that the transition from phoneme /a/ to phoneme /i/ is the most relevant part of the audio that the model used to decide (correctly) that the word is "right". Please refer to [our paper](https://arxiv.org/abs/1808.08929) for the confusion matrix and more attention plots.
17 | 
18 | ![Attention for word Right](AttRight.png)
19 | 
20 | # How to use this code
21 | 
22 | The Demo notebook is preconfigured with a set of tasks: ```['12cmd', 'leftright', '35word', '20cmd']```. Each of these refers to how many commands the model should recognize. When loading the Google Speech Dataset, the user should also select which version to download and use by adjusting the following line:
23 | 
24 | ```gscInfo, nCategs = SpeechDownloader.PrepareGoogleSpeechCmd(version=1, task='35word')```
25 | 
26 | If you want a pretrained model, `model-attRNN.h5` contains pre-trained weights for task `35word`, `version=2`.
27 | 
28 | ## Cloning this repository
29 | 
30 | - Download or clone this repository;
31 | - Open the Demo notebook;
32 | - Choose how many words should be recognized and the Google Speech Dataset version to use;
33 | - Run training and tests.
34 | 
35 | ## Using Google Colab
36 | 
37 | Google Colaboratory is an amazing tool for experimentation using a Jupyter Notebook environment with GPUs.
38 | 
39 | - Open Colab: https://colab.research.google.com/;
40 | - Download and upload the notebook Speech_Recog_Demo.ipynb to Colab, then open it;
41 | - Enable GPU acceleration in the menu Edit -> Notebook settings;
42 | - Set useColab = True;
43 | - Choose how many words should be recognized and the Google Speech Dataset version to use;
44 | - Run training and tests.
45 | 
46 | ## Train with your own data
47 | 
48 | If you want to train with your own data:
49 | 
50 | - Use the ```WAV2Numpy``` function from ```audioUtils.py``` to save your WAV files in numpy format. This speeds up loading considerably;
51 | - Create a ```list_IDs``` array containing the paths to all the numpy files and a ```labels``` array with corresponding labels (already converted to integers);
52 | - Instantiate the ```SpeechGen``` class from ```SpeechGenerator.py```;
53 | - Create your own Keras model for audio classification or use one provided in ```SpeechModels.py```;
54 | - Train the model (a minimal end-to-end sketch of these steps is included at the end of this listing).
55 | 
56 | # Final Words
57 | 
58 | We would like to thank Google for making such a great speech dataset available for public use, for making Colab available, and for hosting the Kaggle TensorFlow Speech Recognition Challenge competition.
59 | 
60 | If you find this code useful, please cite our work:
61 | 
62 | ```
63 | @ARTICLE{2018arXiv180808929C,
64 | author = {{Coimbra de Andrade}, D. and {Leo}, S. and {Loesener Da Silva Viana}, M.
and 65 | {Bernkopf}, C.}, 66 | title = "{A neural attention model for speech command recognition}", 67 | journal = {ArXiv e-prints}, 68 | archivePrefix = "arXiv", 69 | eprint = {1808.08929}, 70 | keywords = {Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound}, 71 | year = 2018, 72 | month = aug, 73 | adsurl = {http://adsabs.harvard.edu/abs/2018arXiv180808929C}, 74 | adsnote = {Provided by the SAO/NASA Astrophysics Data System} 75 | } 76 | ``` 77 | -------------------------------------------------------------------------------- /SpeechDownloader.py: -------------------------------------------------------------------------------- 1 | """ 2 | File containing scripts to download audio from various datasets 3 | 4 | Also has tools to convert audio into numpy 5 | """ 6 | from tqdm import tqdm 7 | import requests 8 | import math 9 | import os 10 | import tarfile 11 | import numpy as np 12 | import pandas as pd 13 | 14 | import audioUtils 15 | 16 | 17 | # ################## 18 | # Google Speech Commands Dataset V2 19 | # ################## 20 | 21 | # GSCmdV2Categs = {'unknown' : 0, 'silence' : 1, '_unknown_' : 0,'_silence_' : 1, '_background_noise_' : 1, 'yes' : 2, 22 | # 'no' : 3, 'up' : 4, 'down' : 5, 'left' : 6, 'right' : 7, 'on' : 8, 'off' : 9, 'stop' : 10, 'go' : 11} 23 | # numGSCmdV2Categs = 12 24 | 25 | # "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "Zero", 26 | # "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", and "Nine" 27 | 28 | GSCmdV2Categs = { 29 | 'unknown': 0, 30 | 'silence': 0, 31 | '_unknown_': 0, 32 | '_silence_': 0, 33 | '_background_noise_': 0, 34 | 'yes': 2, 35 | 'no': 3, 36 | 'up': 4, 37 | 'down': 5, 38 | 'left': 6, 39 | 'right': 7, 40 | 'on': 8, 41 | 'off': 9, 42 | 'stop': 10, 43 | 'go': 11, 44 | 'zero': 12, 45 | 'one': 13, 46 | 'two': 14, 47 | 'three': 15, 48 | 'four': 16, 49 | 'five': 17, 50 | 'six': 18, 51 | 'seven': 19, 52 | 'eight': 20, 53 | 'nine': 1} 54 | numGSCmdV2Categs = 21 55 | 56 | 57 | def PrepareGoogleSpeechCmd(version=2, forceDownload=False, task='20cmd'): 58 | """ 59 | Prepares Google Speech commands dataset version 2 for use 60 | 61 | tasks: 20cmd, 12cmd, leftright or 35word 62 | 63 | Returns full path to training, validation and test file list and file categories 64 | """ 65 | allowedTasks = ['12cmd', 'leftright', '35word', '20cmd'] 66 | if task not in allowedTasks: 67 | raise Exception('Task must be one of: {}'.format(allowedTasks)) 68 | 69 | basePath = None 70 | if version == 2: 71 | _DownloadGoogleSpeechCmdV2(forceDownload) 72 | basePath = 'sd_GSCmdV2' 73 | elif version == 1: 74 | _DownloadGoogleSpeechCmdV1(forceDownload) 75 | basePath = 'sd_GSCmdV1' 76 | else: 77 | raise Exception('Version must be 1 or 2') 78 | 79 | if task == '12cmd': 80 | GSCmdV2Categs = { 81 | 'unknown': 0, 82 | 'silence': 1, 83 | '_unknown_': 0, 84 | '_silence_': 1, 85 | '_background_noise_': 1, 86 | 'yes': 2, 87 | 'no': 3, 88 | 'up': 4, 89 | 'down': 5, 90 | 'left': 6, 91 | 'right': 7, 92 | 'on': 8, 93 | 'off': 9, 94 | 'stop': 10, 95 | 'go': 11} 96 | numGSCmdV2Categs = 12 97 | elif task == 'leftright': 98 | GSCmdV2Categs = { 99 | 'unknown': 0, 100 | 'silence': 0, 101 | '_unknown_': 0, 102 | '_silence_': 0, 103 | '_background_noise_': 0, 104 | 'left': 1, 105 | 'right': 2} 106 | numGSCmdV2Categs = 3 107 | elif task == '35word': 108 | GSCmdV2Categs = { 109 | 'unknown': 0, 110 | 'silence': 0, 111 | '_unknown_': 0, 112 | '_silence_': 0, 113 | '_background_noise_': 0, 114 | 'yes': 2, 115 | 'no': 3, 
116 | 'up': 4, 117 | 'down': 5, 118 | 'left': 6, 119 | 'right': 7, 120 | 'on': 8, 121 | 'off': 9, 122 | 'stop': 10, 123 | 'go': 11, 124 | 'zero': 12, 125 | 'one': 13, 126 | 'two': 14, 127 | 'three': 15, 128 | 'four': 16, 129 | 'five': 17, 130 | 'six': 18, 131 | 'seven': 19, 132 | 'eight': 20, 133 | 'nine': 1, 134 | 'backward': 21, 135 | 'bed': 22, 136 | 'bird': 23, 137 | 'cat': 24, 138 | 'dog': 25, 139 | 'follow': 26, 140 | 'forward': 27, 141 | 'happy': 28, 142 | 'house': 29, 143 | 'learn': 30, 144 | 'marvin': 31, 145 | 'sheila': 32, 146 | 'tree': 33, 147 | 'visual': 34, 148 | 'wow': 35} 149 | numGSCmdV2Categs = 36 150 | elif task == '20cmd': 151 | GSCmdV2Categs = { 152 | 'unknown': 0, 153 | 'silence': 0, 154 | '_unknown_': 0, 155 | '_silence_': 0, 156 | '_background_noise_': 0, 157 | 'yes': 2, 158 | 'no': 3, 159 | 'up': 4, 160 | 'down': 5, 161 | 'left': 6, 162 | 'right': 7, 163 | 'on': 8, 164 | 'off': 9, 165 | 'stop': 10, 166 | 'go': 11, 167 | 'zero': 12, 168 | 'one': 13, 169 | 'two': 14, 170 | 'three': 15, 171 | 'four': 16, 172 | 'five': 17, 173 | 'six': 18, 174 | 'seven': 19, 175 | 'eight': 20, 176 | 'nine': 1} 177 | numGSCmdV2Categs = 21 178 | 179 | print('Converting test set WAVs to numpy files') 180 | audioUtils.WAV2Numpy(basePath + '/test/') 181 | print('Converting training set WAVs to numpy files') 182 | audioUtils.WAV2Numpy(basePath + '/train/') 183 | 184 | # read split from files and all files in folders 185 | testWAVs = pd.read_csv(basePath + '/train/testing_list.txt', 186 | sep=" ", header=None)[0].tolist() 187 | valWAVs = pd.read_csv(basePath + '/train/validation_list.txt', 188 | sep=" ", header=None)[0].tolist() 189 | 190 | testWAVs = [os.path.join(basePath + '/train/', f + '.npy') 191 | for f in testWAVs if f.endswith('.wav')] 192 | valWAVs = [os.path.join(basePath + '/train/', f + '.npy') 193 | for f in valWAVs if f.endswith('.wav')] 194 | allWAVs = [] 195 | for root, dirs, files in os.walk(basePath + '/train/'): 196 | allWAVs += [root + '/' + f for f in files if f.endswith('.wav.npy')] 197 | trainWAVs = list(set(allWAVs) - set(valWAVs) - set(testWAVs)) 198 | 199 | testWAVsREAL = [] 200 | for root, dirs, files in os.walk(basePath + '/test/'): 201 | testWAVsREAL += [root + '/' + 202 | f for f in files if f.endswith('.wav.npy')] 203 | 204 | # get categories 205 | testWAVlabels = [_getFileCategory(f, GSCmdV2Categs) for f in testWAVs] 206 | valWAVlabels = [_getFileCategory(f, GSCmdV2Categs) for f in valWAVs] 207 | trainWAVlabels = [_getFileCategory(f, GSCmdV2Categs) for f in trainWAVs] 208 | testWAVREALlabels = [_getFileCategory(f, GSCmdV2Categs) 209 | for f in testWAVsREAL] 210 | 211 | # background noise should be used for validation as well 212 | backNoiseFiles = [trainWAVs[i] for i in range(len(trainWAVlabels)) 213 | if trainWAVlabels[i] == GSCmdV2Categs['silence']] 214 | backNoiseCats = [GSCmdV2Categs['silence'] 215 | for i in range(len(backNoiseFiles))] 216 | if numGSCmdV2Categs == 12: 217 | valWAVs += backNoiseFiles 218 | valWAVlabels += backNoiseCats 219 | 220 | # build dictionaries 221 | testWAVlabelsDict = dict(zip(testWAVs, testWAVlabels)) 222 | valWAVlabelsDict = dict(zip(valWAVs, valWAVlabels)) 223 | trainWAVlabelsDict = dict(zip(trainWAVs, trainWAVlabels)) 224 | testWAVREALlabelsDict = dict(zip(testWAVsREAL, testWAVREALlabels)) 225 | 226 | # a tweak here: we will heavily underuse silence samples because there are few files. 
227 |     # we can add them to the training list to reuse them multiple times
228 |     # note that since we already added the files to the label dicts we don't
229 |     # need to do it again
230 | 
231 |     # for i in range(200):
232 |     #     trainWAVs = trainWAVs + backNoiseFiles
233 | 
234 |     # info dictionary
235 |     trainInfo = {'files': trainWAVs, 'labels': trainWAVlabelsDict}
236 |     valInfo = {'files': valWAVs, 'labels': valWAVlabelsDict}
237 |     testInfo = {'files': testWAVs, 'labels': testWAVlabelsDict}
238 |     testREALInfo = {'files': testWAVsREAL, 'labels': testWAVREALlabelsDict}
239 |     gscInfo = {'train': trainInfo,
240 |                'test': testInfo,
241 |                'val': valInfo,
242 |                'testREAL': testREALInfo}
243 | 
244 |     print('Done preparing Google Speech commands dataset version {}'.format(version))
245 | 
246 |     return gscInfo, numGSCmdV2Categs
247 | 
248 | 
249 | def _getFileCategory(file, catDict):
250 |     """
251 |     Receives a file with name sd_GSCmdV2/train/<cat>/<fileName> and returns an integer that is catDict[cat]
252 |     """
253 |     categ = os.path.basename(os.path.dirname(file))
254 |     return catDict.get(categ, 0)
255 | 
256 | 
257 | def _DownloadGoogleSpeechCmdV2(forceDownload=False):
258 |     """
259 |     Downloads Google Speech commands dataset version 2
260 |     """
261 |     if os.path.isdir("sd_GSCmdV2/") and not forceDownload:
262 |         print('Google Speech commands dataset version 2 already exists. Skipping download.')
263 |     else:
264 |         if not os.path.exists("sd_GSCmdV2/"):
265 |             os.makedirs("sd_GSCmdV2/")
266 |         trainFiles = 'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz'
267 |         testFiles = 'http://download.tensorflow.org/data/speech_commands_test_set_v0.02.tar.gz'
268 |         _downloadFile(testFiles, 'sd_GSCmdV2/test.tar.gz')
269 |         _downloadFile(trainFiles, 'sd_GSCmdV2/train.tar.gz')
270 | 
271 |     # extract files
272 |     if not os.path.isdir("sd_GSCmdV2/test/"):
273 |         _extractTar('sd_GSCmdV2/test.tar.gz', 'sd_GSCmdV2/test/')
274 | 
275 |     if not os.path.isdir("sd_GSCmdV2/train/"):
276 |         _extractTar('sd_GSCmdV2/train.tar.gz', 'sd_GSCmdV2/train/')
277 | 
278 | 
279 | def _DownloadGoogleSpeechCmdV1(forceDownload=False):
280 |     """
281 |     Downloads Google Speech commands dataset version 1
282 |     """
283 |     if os.path.isdir("sd_GSCmdV1/") and not forceDownload:
284 |         print('Google Speech commands dataset version 1 already exists. Skipping download.')
285 |     else:
286 |         if not os.path.exists("sd_GSCmdV1/"):
287 |             os.makedirs("sd_GSCmdV1/")
288 |         trainFiles = 'http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz'
289 |         testFiles = 'http://download.tensorflow.org/data/speech_commands_test_set_v0.01.tar.gz'
290 |         _downloadFile(testFiles, 'sd_GSCmdV1/test.tar.gz')
291 |         _downloadFile(trainFiles, 'sd_GSCmdV1/train.tar.gz')
292 | 
293 |     # extract files
294 |     if not os.path.isdir("sd_GSCmdV1/test/"):
295 |         _extractTar('sd_GSCmdV1/test.tar.gz', 'sd_GSCmdV1/test/')
296 | 
297 |     if not os.path.isdir("sd_GSCmdV1/train/"):
298 |         _extractTar('sd_GSCmdV1/train.tar.gz', 'sd_GSCmdV1/train/')
299 | 
300 | ##############
301 | # Utilities
302 | ##############
303 | 
304 | 
305 | def _downloadFile(url, fName):
306 |     # Streaming, so we can iterate over the response.
307 |     r = requests.get(url, stream=True)
308 | 
309 |     # Total size in bytes.
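    # (falls back to 0 if the server does not send a content-length header)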
310 |     total_size = int(r.headers.get('content-length', 0))
311 |     block_size = 1024
312 |     wrote = 0
313 |     print('Downloading {} into {}'.format(url, fName))
314 |     with open(fName, 'wb') as f:
315 |         for data in tqdm(r.iter_content(block_size),
316 |                          total=math.ceil(total_size / block_size),
317 |                          unit='KB',
318 |                          unit_scale=True):
319 |             wrote = wrote + len(data)
320 |             f.write(data)
321 |     if total_size != 0 and wrote != total_size:
322 |         print("ERROR, something went wrong")
323 | 
324 | 
325 | def _extractTar(fname, folder):
326 |     print('Extracting {} into {}'.format(fname, folder))
327 |     if (fname.endswith("tar.gz")):
328 |         tar = tarfile.open(fname, "r:gz")
329 |         tar.extractall(path=folder)
330 |         tar.close()
331 |     elif (fname.endswith("tar")):
332 |         tar = tarfile.open(fname, "r:")
333 |         tar.extractall(path=folder)
334 |         tar.close()
335 | 
--------------------------------------------------------------------------------
/SpeechGenerator.py:
--------------------------------------------------------------------------------
1 | """
2 | A generator for reading and serving audio files
3 | 
4 | https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly.html
5 | 
6 | Remember to use multiprocessing:
7 | # Train model on dataset
8 | model.fit(training_generator,
9 |           validation_data=validation_generator,
10 |           use_multiprocessing=True,
11 |           workers=6)
12 | 
13 | """
14 | 
15 | import numpy as np
16 | import tensorflow.keras
17 | 
18 | 
19 | class SpeechGen(tensorflow.keras.utils.Sequence):
20 |     """
21 |     'Generates data for Keras'
22 | 
23 |     list_IDs - list of files that this generator should load
24 |     labels - dictionary of corresponding (integer) category
25 |              to each file in list_IDs
26 | 
27 |     Expects list_IDs and labels to be of the same length
28 |     """
29 |     def __init__(self, list_IDs, labels, batch_size=32,
30 |                  dim=16000, shuffle=True):
31 |         'Initialization'
32 |         self.dim = dim
33 |         self.batch_size = batch_size
34 |         self.labels = labels
35 |         self.list_IDs = list_IDs
36 |         self.shuffle = shuffle
37 |         self.on_epoch_end()
38 | 
39 |     def __len__(self):
40 |         'Denotes the number of batches per epoch'
41 |         return int(np.floor(len(self.list_IDs) / self.batch_size))
42 | 
43 |     def __getitem__(self, index):
44 |         'Generate one batch of data'
45 |         # Generate indexes of the batch
46 |         indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
47 | 
48 |         # Find list of IDs
49 |         list_IDs_temp = [self.list_IDs[k] for k in indexes]
50 | 
51 |         # Generate data
52 |         X, y = self.__data_generation(list_IDs_temp)
53 | 
54 |         return X, y
55 | 
56 |     def on_epoch_end(self):
57 |         'Updates indexes after each epoch'
58 |         self.indexes = np.arange(len(self.list_IDs))
59 |         if self.shuffle:
60 |             np.random.shuffle(self.indexes)
61 | 
62 |     def __data_generation(self, list_IDs_temp):
63 |         'Generates data containing batch_size samples'
64 |         # X : (n_samples, *dim, n_channels)
65 |         # Initialization
66 |         X = np.zeros((self.batch_size, self.dim))  # zeros (not empty) so shorter clips are zero-padded
67 |         y = np.empty((self.batch_size), dtype=int)
68 | 
69 |         # Generate data
70 |         for i, ID in enumerate(list_IDs_temp):
71 | 
72 |             # load data from file, saved as numpy array on disk
73 |             curX = np.load(ID)[:, 0]
74 | 
75 |             # normalize
76 |             # invMax = 1/(np.max(np.abs(curX))+1e-3)
77 |             # curX *= invMax
78 | 
79 |             # curX could be bigger or smaller than self.dim
80 |             if curX.shape[0] == self.dim:
81 |                 X[i] = curX
82 |             elif curX.shape[0] > self.dim:  # bigger
83 |                 # we can choose any position in curX-self.dim
84 |                 randPos = np.random.randint(curX.shape[0]-self.dim)
85 |                 X[i]
= curX[randPos:randPos+self.dim] 86 | else: # smaller 87 | randPos = np.random.randint(self.dim-curX.shape[0]) 88 | X[i, randPos:randPos + curX.shape[0]] = curX 89 | # print('File dim smaller') 90 | 91 | # Store class 92 | y[i] = self.labels[ID] 93 | 94 | return X, y 95 | -------------------------------------------------------------------------------- /SpeechModels.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow.keras.models import Model, load_model 3 | 4 | from tensorflow.keras import layers as L 5 | from tensorflow.keras import backend as K 6 | from tensorflow.keras.utils import to_categorical 7 | from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler 8 | from tensorflow.keras import backend as K 9 | from tensorflow.keras import optimizers 10 | import audioUtils 11 | 12 | def get_melspec_model(iLen=None): 13 | inp = L.Input((iLen,), name='input') 14 | mel_spec = audioUtils.normalized_mel_spectrogram(inp) 15 | melspecModel = Model(inputs=inp, outputs=mel_spec, name='normalized_spectrogram_model') 16 | return melspecModel 17 | 18 | def AttRNNSpeechModel(nCategories, samplingrate=16000, 19 | inputLength=16000, rnn_func=L.LSTM): 20 | # simple LSTM 21 | sr = samplingrate 22 | iLen = inputLength 23 | 24 | inputs = L.Input((inputLength,), name='input') 25 | 26 | m = get_melspec_model(iLen=inputLength) 27 | m.trainable = False 28 | 29 | x = m(inputs) 30 | x = tf.expand_dims(x, axis=-1, name='mel_stft') 31 | 32 | x = L.Conv2D(10, (5, 1), activation='relu', padding='same')(x) 33 | x = L.BatchNormalization()(x) 34 | x = L.Conv2D(1, (5, 1), activation='relu', padding='same')(x) 35 | x = L.BatchNormalization()(x) 36 | 37 | # x = Reshape((125, 80)) (x) 38 | # keras.backend.squeeze(x, axis) 39 | x = L.Lambda(lambda q: K.squeeze(q, -1), name='squeeze_last_dim')(x) 40 | 41 | x = L.Bidirectional(rnn_func(64, return_sequences=True) 42 | )(x) # [b_s, seq_len, vec_dim] 43 | x = L.Bidirectional(rnn_func(64, return_sequences=True) 44 | )(x) # [b_s, seq_len, vec_dim] 45 | 46 | xFirst = L.Lambda(lambda q: q[:, -1])(x) # [b_s, vec_dim] 47 | query = L.Dense(128)(xFirst) 48 | 49 | # dot product attention 50 | attScores = L.Dot(axes=[1, 2])([query, x]) 51 | attScores = L.Softmax(name='attSoftmax')(attScores) # [b_s, seq_len] 52 | 53 | # rescale sequence 54 | attVector = L.Dot(axes=[1, 1])([attScores, x]) # [b_s, vec_dim] 55 | 56 | x = L.Dense(64, activation='relu')(attVector) 57 | x = L.Dense(32)(x) 58 | 59 | output = L.Dense(nCategories, activation='softmax', name='output')(x) 60 | 61 | model = Model(inputs=[inputs], outputs=[output]) 62 | 63 | return model 64 | -------------------------------------------------------------------------------- /audioUtils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Utility functions for audio files 3 | """ 4 | import os 5 | import itertools 6 | import numpy as np 7 | from tqdm import tqdm 8 | import tensorflow as tf 9 | import matplotlib.pyplot as plt 10 | 11 | 12 | def plot_confusion_matrix(cm, classes, 13 | normalize=False, 14 | title='Confusion matrix', 15 | cmap=plt.cm.Blues): 16 | """ 17 | This function prints and plots the confusion matrix. 18 | Normalization can be applied by setting `normalize=True`. 
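    cm - confusion matrix (integer counts, e.g. from sklearn.metrics.confusion_matrix)
    classes - list of class names used as axis tick labels
    The figure is also saved to picConfMatrix.png.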
19 | """ 20 | if normalize: 21 | cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 22 | print("Normalized confusion matrix") 23 | else: 24 | print('Confusion matrix, without normalization') 25 | 26 | plt.figure(figsize=(15, 15)) 27 | plt.imshow(cm, interpolation='nearest', cmap=cmap) 28 | plt.title(title, fontsize=30) 29 | plt.colorbar() 30 | tick_marks = np.arange(len(classes)) 31 | plt.xticks(tick_marks, classes, rotation=45, fontsize=15) 32 | plt.yticks(tick_marks, classes, fontsize=15) 33 | 34 | fmt = '.3f' if normalize else 'd' 35 | thresh = cm.max() / 2. 36 | for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): 37 | plt.text(j, i, format(cm[i, j], fmt), size=11, 38 | horizontalalignment="center", 39 | color="white" if cm[i, j] > thresh else "black") 40 | 41 | plt.ylabel('True label', fontsize=30) 42 | plt.xlabel('Predicted label', fontsize=30) 43 | plt.savefig('picConfMatrix.png', dpi=400) 44 | plt.tight_layout() 45 | 46 | 47 | def WAV2Numpy(folder, sr=None): 48 | """ 49 | Recursively converts WAV to numpy arrays. 50 | Deletes the WAV files in the process 51 | 52 | folder - folder to convert. 53 | """ 54 | allFiles = [] 55 | for root, dirs, files in os.walk(folder): 56 | allFiles += [os.path.join(root, f) for f in files 57 | if f.endswith('.wav')] 58 | 59 | for file in tqdm(allFiles): 60 | x = tf.io.read_file(str(file)) 61 | y, sample_rate = tf.audio.decode_wav(x, desired_channels=1, desired_samples=16000,) 62 | 63 | # if we want to write the file later 64 | np.save(file + '.npy', y.numpy()) 65 | os.remove(file) 66 | 67 | # this was supposed to be tfio.audio.spectrogram 68 | def spectrogram_fn(input_signal, nfft, window, stride, name=None): 69 | """ 70 | Create spectrogram from audio. 71 | Args: 72 | input: An 1-D audio signal Tensor. 73 | nfft: Size of FFT. 74 | window: Size of window. 75 | stride: Size of hops between windows. 76 | name: A name for the operation (optional). 77 | Returns: 78 | A tensor of spectrogram. 79 | """ 80 | 81 | # TODO: Support audio with channel > 1. 82 | return tf.math.abs( 83 | tf.signal.stft( 84 | input_signal, 85 | frame_length=window, 86 | frame_step=stride, 87 | fft_length=nfft, 88 | window_fn=tf.signal.hann_window, 89 | pad_end=True, 90 | ) 91 | ) 92 | 93 | def normalized_mel_spectrogram(x, sr=16000, n_mel_bins=80): 94 | spec_stride = 128 95 | spec_len = 1024 96 | 97 | spectrogram = spectrogram_fn( 98 | x, nfft=spec_len, window=spec_len, stride=spec_stride 99 | ) 100 | 101 | num_spectrogram_bins = spec_len // 2 + 1 # spectrogram.shape[-1] 102 | lower_edge_hertz, upper_edge_hertz = 40.0, 8000.0 103 | num_mel_bins = n_mel_bins 104 | linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix( 105 | num_mel_bins, num_spectrogram_bins, sr, lower_edge_hertz, 106 | upper_edge_hertz) 107 | mel_spectrograms = tf.tensordot(spectrogram, linear_to_mel_weight_matrix, 1) 108 | mel_spectrograms.set_shape(spectrogram.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:])) 109 | 110 | # Compute a stabilized log to get log-magnitude mel-scale spectrograms. 
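    # (the 1e-6 offset avoids log(0); the result is then standardized to zero mean and unit variance)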
111 | log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6) 112 | avg = tf.math.reduce_mean(log_mel_spectrograms) 113 | std = tf.math.reduce_std(log_mel_spectrograms) 114 | 115 | return (log_mel_spectrograms - avg) / std -------------------------------------------------------------------------------- /model-KWS-attRNN.tflite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/douglas125/SpeechCmdRecognition/4c1ccf3c2663db32dafa22acba635bc4d05bbd03/model-KWS-attRNN.tflite -------------------------------------------------------------------------------- /model-attRNN.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/douglas125/SpeechCmdRecognition/4c1ccf3c2663db32dafa22acba635bc4d05bbd03/model-attRNN.h5 -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow==2.10.1 2 | pandas>=0.25 3 | tqdm 4 | librosa 5 | matplotlib 6 | --------------------------------------------------------------------------------
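
The README's "Train with your own data" steps can be strung together as in the minimal sketch below, which combines `audioUtils.WAV2Numpy`, `SpeechGenerator.SpeechGen` and `SpeechModels.AttRNNSpeechModel` from the files above. The `my_audio/` folder layout, the two-class `catDict`, the optimizer, the epoch count and the output file name are illustrative assumptions, not part of the repository.

```
import os

import audioUtils
import SpeechGenerator
import SpeechModels

# 1. Convert the WAV files to .npy once (note: WAV2Numpy deletes the original .wav files).
audioUtils.WAV2Numpy('my_audio/')

# 2. Build list_IDs (paths to the .npy files) and a labels dict mapping each path
#    to an integer category. Here the category comes from the parent folder name,
#    e.g. my_audio/left/example.wav.npy -> 'left'.
catDict = {'left': 0, 'right': 1}  # hypothetical categories
list_IDs = []
labels = {}
for root, dirs, files in os.walk('my_audio/'):
    for f in files:
        if f.endswith('.wav.npy'):
            path = os.path.join(root, f)
            list_IDs.append(path)
            labels[path] = catDict[os.path.basename(root)]

# 3. Wrap the files in the generator (1 s of 16 kHz audio -> dim=16000 samples).
trainGen = SpeechGenerator.SpeechGen(list_IDs, labels, batch_size=32, dim=16000)

# 4. Build, compile and train the attention model, then save the weights.
model = SpeechModels.AttRNNSpeechModel(len(catDict), samplingrate=16000, inputLength=16000)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])
model.fit(trainGen, epochs=20)
model.save('my-attRNN.h5')
```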