├── .gitignore ├── 1-Audio-Processing.ipynb ├── 2-Model.ipynb ├── 3-Training.ipynb ├── 4-Audio Device Streaming.ipynb ├── 5-Audio Sampling Rates.ipynb ├── 6-Prototype Speaker Recognizer.ipynb ├── README.md ├── analysis.py ├── application.py ├── audiolib.py ├── audiostream.py ├── charts.py ├── checkpoint.py ├── checkpoints └── voice-embeddings.csv ├── config.py ├── images ├── model.png └── walleclipse-loss.png ├── minibatch.py ├── model.drawio ├── models.py ├── requirements.txt ├── sound └── sf1_cln.wav ├── speakerdb.py ├── train.py └── triplet_loss.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | # We'll add the checkpoints back into the repo later, but for now these udpates are too huge. 6 | checkpoints/* 7 | 8 | # C extensions 9 | *.so 10 | 11 | # Distribution / packaging 12 | .Python 13 | build/ 14 | develop-eggs/ 15 | dist/ 16 | downloads/ 17 | eggs/ 18 | .eggs/ 19 | lib/ 20 | lib64/ 21 | parts/ 22 | sdist/ 23 | var/ 24 | wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | .hypothesis/ 50 | .pytest_cache/ 51 | 52 | # Translations 53 | *.mo 54 | *.pot 55 | 56 | # Django stuff: 57 | *.log 58 | local_settings.py 59 | db.sqlite3 60 | 61 | # Flask stuff: 62 | instance/ 63 | .webassets-cache 64 | 65 | # Scrapy stuff: 66 | .scrapy 67 | 68 | # Sphinx documentation 69 | docs/_build/ 70 | 71 | # PyBuilder 72 | target/ 73 | 74 | # Jupyter Notebook 75 | .ipynb_checkpoints 76 | 77 | # pyenv 78 | .python-version 79 | 80 | # celery beat schedule file 81 | celerybeat-schedule 82 | 83 | # SageMath parsed files 84 | *.sage.py 85 | 86 | # Environments 87 | .env 88 | .venv 89 | env/ 90 | venv/ 91 | ENV/ 92 | env.bak/ 93 | venv.bak/ 94 | 95 | # Spyder project settings 96 | .spyderproject 97 | .spyproject 98 | 99 | # Rope project settings 100 | .ropeproject 101 | 102 | # mkdocs documentation 103 | /site 104 | 105 | # mypy 106 | .mypy_cache/ 107 | -------------------------------------------------------------------------------- /2-Model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Model\n", 8 | "In this notebook we'll put together the ideas for building a model for training our voice embeddings." 
9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 2, 14 | "metadata": {}, 15 | "outputs": [ 16 | { 17 | "name": "stderr", 18 | "output_type": "stream", 19 | "text": [ 20 | "Using TensorFlow backend.\n" 21 | ] 22 | } 23 | ], 24 | "source": [ 25 | "from keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization\n", 26 | "from keras.models import Sequential, Model\n", 27 | "from keras.layers import Conv2D, MaxPooling2D\n", 28 | "from keras import regularizers, optimizers\n", 29 | "import numpy as np" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## Audio Trigger Model\n", 37 | "Here's the model I used in a class assignment building a model for detetecting a trigger word (\"activate\") from audio samples. This model was used to monitor an audio channel for the trigger word, and output $\\hat{y}$ at each time step with 1 if the trigger word had just finished being spoken, and 0 otherwise.\n" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "We won't use this exact model. This is because the $\\hat{y}$ predictions in that model were one per time step.\n", 45 | "\n", 46 | "Instead what we want is a single $\\hat{y}$ of shape ${n}$ where ${n}$ is the length of the desired word embedding." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Model\n", 54 | "Here is the model we'll use for producing an embedding for an audio clip:\n", 55 | "\n", 56 | "![Model](images/model.png)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "## Resources\n", 64 | " * Architecture diagram drawn with draw.io\n", 65 | " * Trigger Word Network architecture from Course 4 (sequential nets) week 4 Deep Learning Specialization, Sequential Models, Deep Learning.ai, Coursera" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "## Triplet Loss on Images\n", 73 | "\n", 74 | "For an image $x$, we denote its encoding $f(x)$, where $f$ is the function computed by the neural network.\n", 75 | "\n", 76 | "\n", 79 | "\n", 80 | "Training will use triplets of images $(A, P, N)$: \n", 81 | "\n", 82 | "- A is an \"Anchor\" image--a picture of a person. \n", 83 | "- P is a \"Positive\" image--a picture of the same person as the Anchor image.\n", 84 | "- N is a \"Negative\" image--a picture of a different person than the Anchor image.\n", 85 | "\n", 86 | "These triplets are picked from our training dataset. We will write $(A^{(i)}, P^{(i)}, N^{(i)})$ to denote the $i$-th training example. \n", 87 | "\n", 88 | "You'd like to make sure that an image $A^{(i)}$ of an individual is closer to the Positive $P^{(i)}$ than to the Negative image $N^{(i)}$) by at least a margin $\\alpha$:\n", 89 | "\n", 90 | "$$\\mid \\mid f(A^{(i)}) - f(P^{(i)}) \\mid \\mid_2^2 + \\alpha < \\mid \\mid f(A^{(i)}) - f(N^{(i)}) \\mid \\mid_2^2$$\n", 91 | "\n", 92 | "You would thus like to minimize the following \"triplet cost\":\n", 93 | "\n", 94 | "$$\\mathcal{J} = \\sum^{m}_{i=1} \\large[ \\small \\underbrace{\\mid \\mid f(A^{(i)}) - f(P^{(i)}) \\mid \\mid_2^2}_\\text{(1)} - \\underbrace{\\mid \\mid f(A^{(i)}) - f(N^{(i)}) \\mid \\mid_2^2}_\\text{(2)} + \\alpha \\large ] \\small_+ \\tag{3}$$\n", 95 | "\n", 96 | "Here, we are using the notation \"$[z]_+$\" to denote $max(z,0)$. 
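As a concrete reference for the exercise below, here is a minimal sketch of this triplet cost following the four steps and the hinted TensorFlow ops. It assumes `anchor`, `positive` and `negative` are tensors of shape (m, 128) as described in the hints; the repo keeps its own version in triplet_loss.py, which may differ in detail.

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet cost J from formula (3); anchor/positive/negative are (m, 128) encodings."""
    # Step 1: squared distance between anchor and positive, summed over the 128 encoding values.
    pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), axis=-1)
    # Step 2: squared distance between anchor and negative.
    neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), axis=-1)
    # Step 3: per-example term, including the margin alpha.
    basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), alpha)
    # Step 4: clamp at zero ([z]_+ = max(z, 0)) and sum over the m training examples.
    return tf.reduce_sum(tf.maximum(basic_loss, 0.0))
```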
\n", 97 | "\n", 98 | "Notes:\n", 99 | "- The term (1) is the squared distance between the anchor \"A\" and the positive \"P\" for a given triplet; you want this to be small. \n", 100 | "- The term (2) is the squared distance between the anchor \"A\" and the negative \"N\" for a given triplet, you want this to be relatively large. It has a minus sign preceding it because minimizing the negative of the term is the same as maximizing that term.\n", 101 | "- $\\alpha$ is called the margin. It is a hyperparameter that you pick manually. We will use $\\alpha = 0.2$. \n", 102 | "\n", 103 | "Most implementations also rescale the encoding vectors to haven L2 norm equal to one (i.e., $\\mid \\mid f(img)\\mid \\mid_2$=1); you won't have to worry about that in this assignment.\n", 104 | "\n", 105 | "**Exercise**: Implement the triplet loss as defined by formula (3). Here are the 4 steps:\n", 106 | "1. Compute the distance between the encodings of \"anchor\" and \"positive\": $\\mid \\mid f(A^{(i)}) - f(P^{(i)}) \\mid \\mid_2^2$\n", 107 | "2. Compute the distance between the encodings of \"anchor\" and \"negative\": $\\mid \\mid f(A^{(i)}) - f(N^{(i)}) \\mid \\mid_2^2$\n", 108 | "3. Compute the formula per training example: $ \\mid \\mid f(A^{(i)}) - f(P^{(i)}) \\mid \\mid_2^2 - \\mid \\mid f(A^{(i)}) - f(N^{(i)}) \\mid \\mid_2^2 + \\alpha$\n", 109 | "3. Compute the full formula by taking the max with zero and summing over the training examples:\n", 110 | "$$\\mathcal{J} = \\sum^{m}_{i=1} \\large[ \\small \\mid \\mid f(A^{(i)}) - f(P^{(i)}) \\mid \\mid_2^2 - \\mid \\mid f(A^{(i)}) - f(N^{(i)}) \\mid \\mid_2^2+ \\alpha \\large ] \\small_+ \\tag{3}$$\n", 111 | "\n", 112 | "#### Hints\n", 113 | "* Useful functions: `tf.reduce_sum()`, `tf.square()`, `tf.subtract()`, `tf.add()`, `tf.maximum()`.\n", 114 | "* For steps 1 and 2, you will sum over the entries of $\\mid \\mid f(A^{(i)}) - f(P^{(i)}) \\mid \\mid_2^2$ and $\\mid \\mid f(A^{(i)}) - f(N^{(i)}) \\mid \\mid_2^2$. \n", 115 | "* For step 4 you will sum over the training examples.\n", 116 | "\n", 117 | "#### Additional Hints\n", 118 | "* Recall that the square of the L2 norm is the sum of the squared differences: $||x - y||_{2}^{2} = \\sum_{i=1}^{N}(x_{i} - y_{i})^{2}$\n", 119 | "* Note that the `anchor`, `positive` and `negative` encodings are of shape `(m,128)`, where m is the number of training examples and 128 is the number of elements used to encode a single example.\n", 120 | "* For steps 1 and 2, you will maintain the number of `m` training examples and sum along the 128 values of each encoding. \n", 121 | "[tf.reduce_sum](https://www.tensorflow.org/api_docs/python/tf/math/reduce_sum) has an `axis` parameter. This chooses along which axis the sums are applied. \n", 122 | "* Note that one way to choose the last axis in a tensor is to use negative indexing (`axis=-1`).\n", 123 | "* In step 4, when summing over training examples, the result will be a single scalar value.\n", 124 | "* For `tf.reduce_sum` to sum across all axes, keep the default value `axis=None`." 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "## Simpler CNN Model\n", 132 | "\n", 133 | "My research indicates that convolutional neural networks can perform many tasks pretty well if the samples are small. CNN's train faster than RNN's using GRU's or LSTMs. So maybe we can succeed with a simpler CNN model. 
We'll be using a sliding window of about 1 second of audio, so maybe we can succeed with a CNN model.\n", 134 | "\n", 135 | "Let's try it." 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "First we need to determine the shape we want for the inputs." 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 4, 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "name": "stdout", 152 | "output_type": "stream", 153 | "text": [ 154 | "_________________________________________________________________\n", 155 | "Layer (type) Output Shape Param # \n", 156 | "=================================================================\n", 157 | "conv2d_1 (Conv2D) (None, 64, 64, 32) 896 \n", 158 | "_________________________________________________________________\n", 159 | "activation_1 (Activation) (None, 64, 64, 32) 0 \n", 160 | "_________________________________________________________________\n", 161 | "conv2d_2 (Conv2D) (None, 62, 62, 64) 18496 \n", 162 | "_________________________________________________________________\n", 163 | "activation_2 (Activation) (None, 62, 62, 64) 0 \n", 164 | "_________________________________________________________________\n", 165 | "max_pooling2d_1 (MaxPooling2 (None, 31, 31, 64) 0 \n", 166 | "_________________________________________________________________\n", 167 | "dropout_1 (Dropout) (None, 31, 31, 64) 0 \n", 168 | "_________________________________________________________________\n", 169 | "conv2d_3 (Conv2D) (None, 31, 31, 64) 36928 \n", 170 | "_________________________________________________________________\n", 171 | "activation_3 (Activation) (None, 31, 31, 64) 0 \n", 172 | "_________________________________________________________________\n", 173 | "conv2d_4 (Conv2D) (None, 29, 29, 64) 36928 \n", 174 | "_________________________________________________________________\n", 175 | "activation_4 (Activation) (None, 29, 29, 64) 0 \n", 176 | "_________________________________________________________________\n", 177 | "max_pooling2d_2 (MaxPooling2 (None, 14, 14, 64) 0 \n", 178 | "_________________________________________________________________\n", 179 | "dropout_2 (Dropout) (None, 14, 14, 64) 0 \n", 180 | "_________________________________________________________________\n", 181 | "conv2d_5 (Conv2D) (None, 14, 14, 128) 73856 \n", 182 | "_________________________________________________________________\n", 183 | "activation_5 (Activation) (None, 14, 14, 128) 0 \n", 184 | "_________________________________________________________________\n", 185 | "conv2d_6 (Conv2D) (None, 12, 12, 128) 147584 \n", 186 | "_________________________________________________________________\n", 187 | "activation_6 (Activation) (None, 12, 12, 128) 0 \n", 188 | "_________________________________________________________________\n", 189 | "max_pooling2d_3 (MaxPooling2 (None, 6, 6, 128) 0 \n", 190 | "_________________________________________________________________\n", 191 | "dropout_3 (Dropout) (None, 6, 6, 128) 0 \n", 192 | "_________________________________________________________________\n", 193 | "flatten_1 (Flatten) (None, 4608) 0 \n", 194 | "_________________________________________________________________\n", 195 | "dense_1 (Dense) (None, 512) 2359808 \n", 196 | "_________________________________________________________________\n", 197 | "activation_7 (Activation) (None, 512) 0 \n", 198 | "_________________________________________________________________\n", 199 | "dropout_4 (Dropout) (None, 512) 0 \n", 200 | 
"_________________________________________________________________\n", 201 | "dense_2 (Dense) (None, 10) 5130 \n", 202 | "=================================================================\n", 203 | "Total params: 2,679,626\n", 204 | "Trainable params: 2,679,626\n", 205 | "Non-trainable params: 0\n", 206 | "_________________________________________________________________\n" 207 | ] 208 | } 209 | ], 210 | "source": [ 211 | "def create_model():\n", 212 | " model = Sequential()\n", 213 | " model.add(Conv2D(32, (3, 3), padding='same',\n", 214 | " input_shape=(64,64,3)))\n", 215 | " model.add(Activation('relu'))\n", 216 | " model.add(Conv2D(64, (3, 3)))\n", 217 | " model.add(Activation('relu'))\n", 218 | " model.add(MaxPooling2D(pool_size=(2, 2)))\n", 219 | " model.add(Dropout(0.25))\n", 220 | " model.add(Conv2D(64, (3, 3), padding='same'))\n", 221 | " model.add(Activation('relu'))\n", 222 | " model.add(Conv2D(64, (3, 3)))\n", 223 | " model.add(Activation('relu'))\n", 224 | " model.add(MaxPooling2D(pool_size=(2, 2)))\n", 225 | " model.add(Dropout(0.5))\n", 226 | " model.add(Conv2D(128, (3, 3), padding='same'))\n", 227 | " model.add(Activation('relu'))\n", 228 | " model.add(Conv2D(128, (3, 3)))\n", 229 | " model.add(Activation('relu'))\n", 230 | " model.add(MaxPooling2D(pool_size=(2, 2)))\n", 231 | " model.add(Dropout(0.5))\n", 232 | " model.add(Flatten())\n", 233 | " model.add(Dense(512))\n", 234 | " model.add(Activation('relu'))\n", 235 | " model.add(Dropout(0.5))\n", 236 | " model.add(Dense(10, activation='softmax'))\n", 237 | " model.compile(optimizers.rmsprop(lr=0.0005, decay=1e-6),loss=\"categorical_crossentropy\",metrics=[\"accuracy\"])\n", 238 | " model.summary()\n", 239 | "\n", 240 | "model = create_model()" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "I want to just get a 1 second window connected to a neural net and have it do computations.\n", 248 | "\n", 249 | "Then after that I'll get an RNN connected to it." 
250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 7, 255 | "metadata": {}, 256 | "outputs": [ 257 | { 258 | "data": { 259 | "text/plain": [ 260 | "array([[[1, 2, 3],\n", 261 | " [4, 5, 6]]])" 262 | ] 263 | }, 264 | "execution_count": 7, 265 | "metadata": {}, 266 | "output_type": "execute_result" 267 | } 268 | ], 269 | "source": [ 270 | "a=np.array([[[1,2,3],[4,5,6]]])\n", 271 | "a" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 8, 277 | "metadata": {}, 278 | "outputs": [ 279 | { 280 | "data": { 281 | "text/plain": [ 282 | "" 283 | ] 284 | }, 285 | "execution_count": 8, 286 | "metadata": {}, 287 | "output_type": "execute_result" 288 | } 289 | ], 290 | "source": [ 291 | "import keras.backend as K\n", 292 | "K.squeeze(a, axis=0)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [] 301 | } 302 | ], 303 | "metadata": { 304 | "kernelspec": { 305 | "display_name": "Python 3", 306 | "language": "python", 307 | "name": "python3" 308 | }, 309 | "language_info": { 310 | "codemirror_mode": { 311 | "name": "ipython", 312 | "version": 3 313 | }, 314 | "file_extension": ".py", 315 | "mimetype": "text/x-python", 316 | "name": "python", 317 | "nbconvert_exporter": "python", 318 | "pygments_lexer": "ipython3", 319 | "version": "3.7.4" 320 | } 321 | }, 322 | "nbformat": 4, 323 | "nbformat_minor": 2 324 | } 325 | -------------------------------------------------------------------------------- /5-Audio Sampling Rates.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Prototype Recognizer\n", 8 | "In this notebook we'll be putting together something that listens to the audio and tries to tell who is talking." 
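Before digging into the details, here is roughly the shape of the recognizer this notebook is working toward, expressed with the repo's own helpers (audiostream, audiolib, models, config) as they are used further down. It is only a sketch: it assumes reference embeddings already exist for each known speaker, and the crude features[:config.NUM_FRAMES] trim stands in for the fixed-length windowing that gets sorted out below.

```python
import numpy as np
import audiostream, audiolib, models, config

def identify_speaker(model, known_embeddings, seconds=4, sample_rate=16000):
    """Record a short clip, embed it, and return the closest known speaker.
    known_embeddings: dict mapping speaker name -> reference embedding vector."""
    clip = audiostream.record(seconds, rate=sample_rate)
    features = audiolib.extract_features(clip, sample_rate=sample_rate,
                                         num_filters=config.NUM_FILTERS)
    # The model expects a fixed number of frames (config.NUM_FRAMES), so trim here.
    features = features[:config.NUM_FRAMES]
    emb = models.get_embedding(model, features)
    distances = {name: float(np.linalg.norm(emb - ref))
                 for name, ref in known_embeddings.items()}
    return min(distances, key=distances.get)
```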
9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [ 16 | { 17 | "name": "stderr", 18 | "output_type": "stream", 19 | "text": [ 20 | "Using TensorFlow backend.\n" 21 | ] 22 | } 23 | ], 24 | "source": [ 25 | "import audiostream\n", 26 | "import audiolib\n", 27 | "import application\n", 28 | "import config\n", 29 | "import models\n", 30 | "import numpy as np\n", 31 | "\n", 32 | "application.init()" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "name": "stdout", 42 | "output_type": "stream", 43 | "text": [ 44 | "Resuming at batch 29731\n", 45 | "Loading model from checkpoints\\voice-embeddings.h5\n", 46 | "Preloaded model from checkpoints\\voice-embeddings.h5\n" 47 | ] 48 | } 49 | ], 50 | "source": [ 51 | "model = application.load_model()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 3, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "name": "stdout", 61 | "output_type": "stream", 62 | "text": [ 63 | "Recording\n", 64 | "4 seconds remaining\n", 65 | "3 seconds remaining\n", 66 | "2 seconds remaining\n", 67 | "1 seconds remaining\n", 68 | "Recording finished\n" 69 | ] 70 | } 71 | ], 72 | "source": [ 73 | "clip1 = audiostream.record(4)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "Recording\n", 86 | "4 seconds remaining\n", 87 | "3 seconds remaining\n", 88 | "2 seconds remaining\n", 89 | "1 seconds remaining\n", 90 | "Recording finished\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "clip2 = audiostream.record(4)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "How long are these sound samples?" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 13, 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "4.0" 114 | ] 115 | }, 116 | "execution_count": 13, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "len(clip1) / 44100" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 3, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "def calc_embedding(model, sound):\n", 132 | " \"\"\"\n", 133 | " sound: a numpy array of 16-bit signed sound samples.\n", 134 | " \"\"\"\n", 135 | " print('extract soundlen=',len(sound))\n", 136 | " features = audiolib.extract_features(sound, sample_rate=44100, num_filters=config.NUM_FILTERS)\n", 137 | " emb = models.get_embedding(model, features)\n", 138 | " return emb " 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 19, 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "data": { 148 | "text/plain": [ 149 | "2.49375" 150 | ] 151 | }, 152 | "execution_count": 19, 153 | "metadata": {}, 154 | "output_type": "execute_result" 155 | } 156 | ], 157 | "source": [ 158 | "399/160" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "The key to what's going wrong here is ```minibatch._clipped_audio``` and ```config.NUM_FRAMES```. I don't quite understand the details yet (actually I figured it out; read on). 
I thought that 160 frames would be 4 seconds, but I suspect that somehow it is not.\n", 166 | "\n", 167 | "Under the covers we're using [python_speech_features.fbank()](https://github.com/jameslyons/python_speech_features) with the default winstep parameter of 0.010 = 10ms. At 160 frames, my theory is that we're actually processing not 4 seconds, but 1.6 seconds. In particular, a randomly selected 1.6 seconds from the sample.\n", 168 | "\n", 169 | "This theory is consistent with the fact that I was getting exactly 2.5x the amount of time I expected. This makes sense, because winlen=0.025 and winstep=0.010. That's exactly a ratio of 2.5:1.\n", 170 | "\n", 171 | "I can determine what I want to do about this long term, but for now I think if I set up my sound buffer to have 1.6 seconds of sound this will work well enough to see if it generally works." 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 4, 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "def sound_chunks(sound, chunk_seconds=1.6, step_seconds=0.5, sample_rate=44100):\n", 181 | " \"\"\"Return a sequence of sound chunks from a sound clip.\n", 182 | " Each chunk will be 1.6 seconds of the sound, and each\n", 183 | " successive chunk will be advanced by the specified number of seconds.\n", 184 | " sound: a numpy array of 16-bit signed integers representing a sound sample.\n", 185 | " \"\"\"\n", 186 | " chunk_len = int(chunk_seconds * sample_rate)\n", 187 | " chunk_step = int(step_seconds * sample_rate)\n", 188 | " chunk_count = int(len(sound) / chunk_step)\n", 189 | " for i in range(chunk_count):\n", 190 | " start = i * chunk_step\n", 191 | " end = start + chunk_len\n", 192 | " yield sound[start:end]" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 89, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "chunks = list(sound_chunks(clip1, chunk_seconds=1.61))" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 86, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "data": { 211 | "text/plain": [ 212 | "8" 213 | ] 214 | }, 215 | "execution_count": 86, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "len(chunks)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 90, 227 | "metadata": {}, 228 | "outputs": [ 229 | { 230 | "data": { 231 | "text/plain": [ 232 | "1.61" 233 | ] 234 | }, 235 | "execution_count": 90, 236 | "metadata": {}, 237 | "output_type": "execute_result" 238 | } 239 | ], 240 | "source": [ 241 | "len(chunks[0]) / 44100" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 17, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "def embeddings_from_sound(model, sound, sample_rate=44100):\n", 251 | " \"\"\"Return a sequence of embeddings from the different time slices\n", 252 | " in the sound clip.\n", 253 | " sound: a numpy array of 16-bit signed integers representing a sound sample.\n", 254 | " \"\"\"\n", 255 | " # The 1.601 is a hack to make sure we end up with a shape of 160 instead of 159.\n", 256 | " # What we actually want is 1.6.\n", 257 | " #*TODO: Figure out a better way to fix the 159->160 off by one error than adding .001.\n", 258 | " chunk_seconds=1.61\n", 259 | " for chunk in sound_chunks(sound, chunk_seconds=chunk_seconds, sample_rate=sample_rate):\n", 260 | " # The last portion of the sound may be less than our desired length.\n", 261 | " # We can safely skip it because 
we'll process it later as it shifts down the time window.\n", 262 | " lc = len(chunk)\n", 263 | " print('lc=%d sec=%f delta=%f' % (lc, lc/sample_rate, lc/sample_rate - chunk_seconds))\n", 264 | " if len(chunk)/sample_rate - chunk_seconds < -0.1:\n", 265 | " continue\n", 266 | " print('calculating embedding for chunk len=%d' % len(chunk))\n", 267 | " yield calc_embedding(model, chunk)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 184, 273 | "metadata": {}, 274 | "outputs": [ 275 | { 276 | "name": "stderr", 277 | "output_type": "stream", 278 | "text": [ 279 | "WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.\n", 280 | "WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.\n" 281 | ] 282 | }, 283 | { 284 | "name": "stdout", 285 | "output_type": "stream", 286 | "text": [ 287 | "extract soundlen= 71001\n", 288 | "extract soundlen= 71001\n" 289 | ] 290 | }, 291 | { 292 | "name": "stderr", 293 | "output_type": "stream", 294 | "text": [ 295 | "WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.\n", 296 | "WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.\n", 297 | "WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.\n" 298 | ] 299 | }, 300 | { 301 | "name": "stdout", 302 | "output_type": "stream", 303 | "text": [ 304 | "extract soundlen= 71001\n", 305 | "extract soundlen= 71001\n", 306 | "extract soundlen= 71001\n" 307 | ] 308 | } 309 | ], 310 | "source": [ 311 | "embs1 = list(embeddings_from_sound(model, clip1))" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 185, 317 | "metadata": {}, 318 | "outputs": [ 319 | { 320 | "data": { 321 | "text/plain": [ 322 | "1.61" 323 | ] 324 | }, 325 | "execution_count": 185, 326 | "metadata": {}, 327 | "output_type": "execute_result" 328 | } 329 | ], 330 | "source": [ 331 | "71001/44100" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 150, 337 | "metadata": {}, 338 | "outputs": [ 339 | { 340 | "data": { 341 | "text/plain": [ 342 | "5" 343 | ] 344 | }, 345 | "execution_count": 150, 346 | "metadata": {}, 347 | "output_type": "execute_result" 348 | } 349 | ], 350 | "source": [ 351 | "len(embs1)" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "## Frame Length and FFT size\n", 359 | "I've been having this problem while processing sound samples into filter banks:\n", 360 | "> WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.\n", 361 | "\n", 362 | "This problem is discussed in this [github issue on Python Speech Features](https://github.com/jameslyons/python_speech_features).\n", 363 | "\n", 364 | "The recommendation there is to read this [Practical Cryptography tutorial on Mel Frequency Cepstral Coefficients mfccs](http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/)\n", 365 | "\n", 366 | "The recommendation is:\n", 367 | "> Please read this [tutorial](http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/) to better understand the relationship between FFT size and frame length. The FFT size should be longer than the frame length. 
either increase your fft size, or decrease your frame length, or you can ignore the warning and see how it performs.\n", 368 | "\n", 369 | "Investigating this further, I found this [pull request](https://github.com/jameslyons/python_speech_features/pull/76/commits/9ab32879b1fb31a38c1a70392fd21370b8fdc30f) that makes NFFT a power of two and simply increases it until it fits. The comments in the PR say:\n", 370 | "> Having an FFT less than the window length loses precision by dropping\n", 371 | " many of the samples; a longer FFT than the window allows zero-padding\n", 372 | " of the FFT buffer which is neutral in terms of frequency domain conversion.\n", 373 | "\n", 374 | "So it sounds like I could increase the numfft parameter on ```fbank()``` from its default of 512 to 2048. But why am I getting this warning in the first place? Am I giving it the wrong sample rate? I didn't have this problem when running the training data through this code. So I need to figure out how to run my real-time captured sound data through the same code without it breaking. I want to use the same processing code for training and inference, so I'm not liking the idea of changing the ```numfft``` parameter in ```fbank()```.\n", 375 | "\n", 376 | "Let's compare some training data to my realtime captured audio and try to zero in on what's going wrong." 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "## Comparing Realtime Audio to Training Data" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": 181, 389 | "metadata": {}, 390 | "outputs": [ 391 | { 392 | "data": { 393 | "text/plain": [ 394 | "176400" 395 | ] 396 | }, 397 | "execution_count": 181, 398 | "metadata": {}, 399 | "output_type": "execute_result" 400 | } 401 | ], 402 | "source": [ 403 | "len(clip1)" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 182, 409 | "metadata": {}, 410 | "outputs": [ 411 | { 412 | "data": { 413 | "text/plain": [ 414 | "4.0" 415 | ] 416 | }, 417 | "execution_count": 182, 418 | "metadata": {}, 419 | "output_type": "execute_result" 420 | } 421 | ], 422 | "source": [ 423 | "len(clip1)/44100" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 187, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "db = application.make_speaker_db()\n", 433 | "triplet = db.random_triplet()" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 188, 439 | "metadata": {}, 440 | "outputs": [ 441 | { 442 | "data": { 443 | "text/plain": [ 444 | "('d:\\\\datasets\\\\voxceleb1\\\\vox1\\\\wav\\\\id10716\\\\yKO2BD79hQ0\\\\00012.wav',\n", 445 | " 'd:\\\\datasets\\\\voxceleb1\\\\vox1\\\\wav\\\\id10716\\\\Pvm_Dv1P3-M\\\\00001.wav',\n", 446 | " 'd:\\\\datasets\\\\voxceleb1\\\\vox1\\\\wav\\\\id10161\\\\6KdOSVcTQNc\\\\00001.wav',\n", 447 | " 'id10716',\n", 448 | " 'id10161')" 449 | ] 450 | }, 451 | "execution_count": 188, 452 | "metadata": {}, 453 | "output_type": "execute_result" 454 | } 455 | ], 456 | "source": [ 457 | "triplet" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 189, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "rate, sample = audiolib.load_wav(triplet[0])" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 190, 472 | "metadata": {}, 473 | "outputs": [ 474 | { 475 | "data": { 476 | "text/plain": [ 477 | "16000" 478 | ] 479 | }, 480 | "execution_count": 190, 481 | "metadata": {}, 482 | "output_type": 
"execute_result" 483 | } 484 | ], 485 | "source": [ 486 | "rate" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "That's an awfully low sample rate, isn't it? Is that actually correct? I would expect 44100 not 16000. Let's see if these numbers check out." 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 197, 499 | "metadata": {}, 500 | "outputs": [ 501 | { 502 | "data": { 503 | "text/plain": [ 504 | "9.6400625" 505 | ] 506 | }, 507 | "execution_count": 197, 508 | "metadata": {}, 509 | "output_type": "execute_result" 510 | } 511 | ], 512 | "source": [ 513 | "len(sample)/rate # Expected number of seconds" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "I opened this sound clip up in Audacity and verified that it is in fact 9.64 seconds long.That could explain why I'm getting such a big difference in working with my own sound clips. My sound clips are sampled at a significantly higher rate." 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": 194, 526 | "metadata": {}, 527 | "outputs": [], 528 | "source": [ 529 | "features = audiolib.extract_features(sample, sample_rate=rate)" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": {}, 535 | "source": [ 536 | "Notice there was no warning. I've created a resampled version of this same clip, resampled from a rate of 16000 to 44100. Let's see how that fares." 537 | ] 538 | }, 539 | { 540 | "cell_type": "code", 541 | "execution_count": 201, 542 | "metadata": {}, 543 | "outputs": [ 544 | { 545 | "data": { 546 | "text/plain": [ 547 | "44100" 548 | ] 549 | }, 550 | "execution_count": 201, 551 | "metadata": {}, 552 | "output_type": "execute_result" 553 | } 554 | ], 555 | "source": [ 556 | "rate2, sample2 = audiolib.load_wav(r'd:\\tmp\\00012-441.wav')\n", 557 | "rate2" 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": 202, 563 | "metadata": {}, 564 | "outputs": [ 565 | { 566 | "data": { 567 | "text/plain": [ 568 | "9.640068027210884" 569 | ] 570 | }, 571 | "execution_count": 202, 572 | "metadata": {}, 573 | "output_type": "execute_result" 574 | } 575 | ], 576 | "source": [ 577 | "len(sample2)/rate2" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 203, 583 | "metadata": {}, 584 | "outputs": [ 585 | { 586 | "name": "stderr", 587 | "output_type": "stream", 588 | "text": [ 589 | "WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.\n" 590 | ] 591 | } 592 | ], 593 | "source": [ 594 | "features2 = audiolib.extract_features(sample2, sample_rate=rate2)" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "Aha! So the same cound clip, upsampled from 16000 to 44100 gives me this warning.\n", 602 | "\n", 603 | "I either need to pre-downsample my sound clips or up the number of FFT's in the feature extraction." 
604 | ] 605 | }, 606 | { 607 | "cell_type": "markdown", 608 | "metadata": {}, 609 | "source": [ 610 | "## Downsampling\n", 611 | "Here are a couple of options for downsampling that I found in this [thread](https://stackoverflow.com/questions/30619740/downsampling-wav-audio-file):\n", 612 | " * [audioop](https://docs.python.org/2/library/audioop.html) ```ratecv``` In the python standard library.\n", 613 | " * [scipi signal resample](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html#scipy.signal.resample)\n", 614 | " \n", 615 | " I'm going to try the standard library approach. To verify that I've got it downsampling properly I want to save out a wav file. I won't need to save wav files for the actual application, but I don't want to blindly use the sound data I've downsampled; I want to verify it's really doing what I think it is.\n", 616 | "\n", 617 | "From [audioop Docs](https://docs.python.org/2/library/audioop.html):\n", 618 | "```audioop.ratecv(fragment, width, nchannels, inrate, outrate, state[, weightA[, weightB]])```\n", 619 | " > state is a tuple containing the state of the converter. The converter returns a tuple (newfragment, newstate), and newstate should be passed to the next call of ratecv(). The initial call should pass None as the state." 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": 204, 625 | "metadata": {}, 626 | "outputs": [ 627 | { 628 | "data": { 629 | "text/plain": [ 630 | "176400" 631 | ] 632 | }, 633 | "execution_count": 204, 634 | "metadata": {}, 635 | "output_type": "execute_result" 636 | } 637 | ], 638 | "source": [ 639 | "len(clip1)" 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "execution_count": 234, 645 | "metadata": {}, 646 | "outputs": [], 647 | "source": [ 648 | "import audioop\n", 649 | "def downsample(samples, from_rate=44100, to_rate=16000):\n", 650 | " width = 2\n", 651 | " nchannels = 1\n", 652 | " state = None\n", 653 | " fragment, new_state = audioop.ratecv(samples, width, nchannels, from_rate, to_rate, state)\n", 654 | " return fragment" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": 235, 660 | "metadata": {}, 661 | "outputs": [], 662 | "source": [ 663 | "clip1_d = downsample(clip1)" 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": 238, 669 | "metadata": {}, 670 | "outputs": [ 671 | { 672 | "data": { 673 | "text/plain": [ 674 | "(176400, 128000, 4.0, 8.0)" 675 | ] 676 | }, 677 | "execution_count": 238, 678 | "metadata": {}, 679 | "output_type": "execute_result" 680 | } 681 | ], 682 | "source": [ 683 | "len(clip1), len(clip1_d), len(clip1)/44100, len(clip1_d)/16000" 684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": 237, 689 | "metadata": {}, 690 | "outputs": [ 691 | { 692 | "data": { 693 | "text/plain": [ 694 | "8.0" 695 | ] 696 | }, 697 | "execution_count": 237, 698 | "metadata": {}, 699 | "output_type": "execute_result" 700 | } 701 | ], 702 | "source": [ 703 | "128000 / 16000" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "The problem here is that each entry in my samples is 16 bits, and this thing is expecting a byte-like object. I've through of 3 options:\n", 711 | " 1. Pass ```ratecv``` width=1\n", 712 | " 2. Do the conversion inside of AudioStream\n", 713 | " 3. 
Tell AudioStream that I want samples at 16000 instead of 44100.\n", 714 | " \n", 715 | "Let's try option 3 first, because if that works, it will be cleaner and eliminate needless conversions." 716 | ] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "execution_count": 271, 721 | "metadata": {}, 722 | "outputs": [ 723 | { 724 | "name": "stdout", 725 | "output_type": "stream", 726 | "text": [ 727 | "Recording\n", 728 | "6 seconds remaining\n", 729 | "5 seconds remaining\n", 730 | "4 seconds remaining\n", 731 | "3 seconds remaining\n", 732 | "2 seconds remaining\n", 733 | "1 seconds remaining\n" 734 | ] 735 | } 736 | ], 737 | "source": [ 738 | "import time\n", 739 | "as2 = audiostream.AudioStream(seconds=4, rate=16000)\n", 740 | "as2.start()\n", 741 | "seconds=6\n", 742 | "print('Recording')\n", 743 | "for i in range(seconds):\n", 744 | " print('%d seconds remaining' % (seconds - i))\n", 745 | " time.sleep(1)\n", 746 | "as2.stop()\n", 747 | "clip3 = as2.sound_array()" 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": 245, 753 | "metadata": {}, 754 | "outputs": [ 755 | { 756 | "data": { 757 | "text/plain": [ 758 | "4.0" 759 | ] 760 | }, 761 | "execution_count": 245, 762 | "metadata": {}, 763 | "output_type": "execute_result" 764 | } 765 | ], 766 | "source": [ 767 | "len(clip3) / 16000" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": 255, 773 | "metadata": {}, 774 | "outputs": [], 775 | "source": [ 776 | "features3 = audiolib.extract_features(clip3, sample_rate=16000)" 777 | ] 778 | }, 779 | { 780 | "cell_type": "markdown", 781 | "metadata": {}, 782 | "source": [ 783 | "Ok that worked with no warnings. Let's try saving that to a wav file to make sure it's actually producing sound the way we expect and not just crazy data.\n", 784 | "\n", 785 | "We'll use the Python standard library [Wave module](https://docs.python.org/2/library/wave.html)." 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": 256, 791 | "metadata": {}, 792 | "outputs": [], 793 | "source": [ 794 | "import wave\n", 795 | "def save_wav(filename, samples, rate=16000, width=2, channels=1):\n", 796 | " wav = wave.open(filename, 'wb')\n", 797 | " wav.setnchannels(channels)\n", 798 | " wav.setsampwidth(width)\n", 799 | " wav.setframerate(rate)\n", 800 | " wav.writeframes(samples)\n", 801 | " wav.close()" 802 | ] 803 | }, 804 | { 805 | "cell_type": "code", 806 | "execution_count": 272, 807 | "metadata": {}, 808 | "outputs": [], 809 | "source": [ 810 | "save_wav(r'd:\\tmp\\clip3.wav', clip3, rate=16000)" 811 | ] 812 | }, 813 | { 814 | "cell_type": "markdown", 815 | "metadata": {}, 816 | "source": [ 817 | "This does produce a wav file as I expect, so this seems to be working from the realtime audio streaming all the way to producing sound at the desired sampling rate." 818 | ] 819 | }, 820 | { 821 | "cell_type": "markdown", 822 | "metadata": {}, 823 | "source": [ 824 | "## Testing with some clips\n", 825 | "Now that we've got the issues with the sampling rate all sorted out (the issue is I need to use a rate of 16000 not 44100), let's try this with some live recorded clips. We'll do that in the next notebook, \"6-Prototype Speaker Recognizer\"." 
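For reference, option 2 above (doing the conversion on the captured samples) mainly needs the int16 numpy array round-tripped through raw bytes, since audioop.ratecv works on byte strings; a minimal sketch, assuming mono 16-bit samples:

```python
import audioop
import numpy as np

def downsample_int16(samples, from_rate=44100, to_rate=16000):
    """Resample a mono int16 numpy array via audioop.ratecv, which expects raw bytes."""
    width, nchannels, state = 2, 1, None
    raw = np.asarray(samples, dtype=np.int16).tobytes()
    converted, state = audioop.ratecv(raw, width, nchannels, from_rate, to_rate, state)
    return np.frombuffer(converted, dtype=np.int16)

# e.g. clip1_16k = downsample_int16(clip1); len(clip1_16k) -> about 64000 samples (4.0 s)
```

This also explains the 8.0 that showed up earlier: ratecv returns raw bytes, so len() counts bytes rather than samples, and 128000 bytes is 64000 16-bit samples, i.e. the expected 4.0 seconds.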
826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": 6, 831 | "metadata": {}, 832 | "outputs": [ 833 | { 834 | "name": "stdout", 835 | "output_type": "stream", 836 | "text": [ 837 | "Recording\n", 838 | "4 seconds remaining\n", 839 | "3 seconds remaining\n", 840 | "2 seconds remaining\n", 841 | "1 seconds remaining\n", 842 | "Recording finished\n" 843 | ] 844 | } 845 | ], 846 | "source": [ 847 | "clip1 = audiostream.record(4, rate=16000)" 848 | ] 849 | }, 850 | { 851 | "cell_type": "code", 852 | "execution_count": 7, 853 | "metadata": {}, 854 | "outputs": [ 855 | { 856 | "name": "stdout", 857 | "output_type": "stream", 858 | "text": [ 859 | "Recording\n", 860 | "4 seconds remaining\n", 861 | "3 seconds remaining\n", 862 | "2 seconds remaining\n", 863 | "1 seconds remaining\n", 864 | "Recording finished\n" 865 | ] 866 | } 867 | ], 868 | "source": [ 869 | "clip2 = audiostream.record(4, rate=16000)" 870 | ] 871 | }, 872 | { 873 | "cell_type": "code", 874 | "execution_count": 13, 875 | "metadata": {}, 876 | "outputs": [], 877 | "source": [ 878 | "audiolib.save_wav(r'd:\\tmp\\clip1.wav', clip1, rate=16000)\n", 879 | "audiolib.save_wav(r'd:\\tmp\\clip2.wav', clip2, rate=16000)" 880 | ] 881 | }, 882 | { 883 | "cell_type": "code", 884 | "execution_count": 18, 885 | "metadata": {}, 886 | "outputs": [ 887 | { 888 | "name": "stderr", 889 | "output_type": "stream", 890 | "text": [ 891 | "WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.\n" 892 | ] 893 | }, 894 | { 895 | "name": "stdout", 896 | "output_type": "stream", 897 | "text": [ 898 | "lc=25760 sec=1.610000 delta=0.000000\n", 899 | "calculating embedding for chunk len=25760\n", 900 | "extract soundlen= 25760\n" 901 | ] 902 | }, 903 | { 904 | "ename": "ValueError", 905 | "evalue": "Error when checking input: expected input_1 to have shape (160, 64, 1) but got array with shape (57, 64, 1)", 906 | "output_type": "error", 907 | "traceback": [ 908 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 909 | "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", 910 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0membs1\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0membeddings_from_sound\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mclip1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0msample_rate\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m16000\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 911 | "\u001b[1;32m\u001b[0m in \u001b[0;36membeddings_from_sound\u001b[1;34m(model, sound, sample_rate)\u001b[0m\n\u001b[0;32m 16\u001b[0m \u001b[1;32mcontinue\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 17\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'calculating embedding for chunk len=%d'\u001b[0m \u001b[1;33m%\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mchunk\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 18\u001b[1;33m \u001b[1;32myield\u001b[0m \u001b[0mcalc_embedding\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mchunk\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 912 | "\u001b[1;32m\u001b[0m in 
\u001b[0;36mcalc_embedding\u001b[1;34m(model, sound)\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'extract soundlen='\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msound\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[0mfeatures\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0maudiolib\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mextract_features\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msound\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0msample_rate\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m44100\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnum_filters\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mconfig\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mNUM_FILTERS\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 7\u001b[1;33m \u001b[0memb\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mmodels\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_embedding\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfeatures\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 8\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0memb\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 913 | "\u001b[1;32m~\\Documents\\notebooks\\voice-embeddings\\models.py\u001b[0m in \u001b[0;36mget_embedding\u001b[1;34m(model, sample)\u001b[0m\n\u001b[0;32m 137\u001b[0m \"\"\"\n\u001b[0;32m 138\u001b[0m \u001b[0minput_batch\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexpand_dims\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msample\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;31m# .predict() wants a batch, not a single entry.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 139\u001b[1;33m \u001b[0memb_rows\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0minput_batch\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;31m# Predict for this batch of 1.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 140\u001b[0m \u001b[0membedding\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msqueeze\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0memb_rows\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;31m# Predict() returns a batch of results. 
Extract the single row.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 141\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0membedding\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 914 | "\u001b[1;32md:\\prg\\Anaconda3\\lib\\site-packages\\keras\\engine\\training.py\u001b[0m in \u001b[0;36mpredict\u001b[1;34m(self, x, batch_size, verbose, steps)\u001b[0m\n\u001b[0;32m 1147\u001b[0m 'argument.')\n\u001b[0;32m 1148\u001b[0m \u001b[1;31m# Validate user data.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1149\u001b[1;33m \u001b[0mx\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0m_\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0m_\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_standardize_user_data\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mx\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1150\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstateful\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1151\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mx\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m>\u001b[0m \u001b[0mbatch_size\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0mx\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m%\u001b[0m \u001b[0mbatch_size\u001b[0m \u001b[1;33m!=\u001b[0m \u001b[1;36m0\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 915 | "\u001b[1;32md:\\prg\\Anaconda3\\lib\\site-packages\\keras\\engine\\training.py\u001b[0m in \u001b[0;36m_standardize_user_data\u001b[1;34m(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size)\u001b[0m\n\u001b[0;32m 749\u001b[0m \u001b[0mfeed_input_shapes\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 750\u001b[0m \u001b[0mcheck_batch_axis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;31m# Don't enforce the batch size.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 751\u001b[1;33m exception_prefix='input')\n\u001b[0m\u001b[0;32m 752\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 753\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0my\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 916 | "\u001b[1;32md:\\prg\\Anaconda3\\lib\\site-packages\\keras\\engine\\training_utils.py\u001b[0m in \u001b[0;36mstandardize_input_data\u001b[1;34m(data, names, shapes, check_batch_axis, exception_prefix)\u001b[0m\n\u001b[0;32m 136\u001b[0m \u001b[1;34m': expected '\u001b[0m \u001b[1;33m+\u001b[0m \u001b[0mnames\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m+\u001b[0m \u001b[1;34m' to have shape '\u001b[0m \u001b[1;33m+\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 137\u001b[0m \u001b[0mstr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mshape\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m+\u001b[0m \u001b[1;34m' but got array with shape '\u001b[0m 
\u001b[1;33m+\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 138\u001b[1;33m str(data_shape))\n\u001b[0m\u001b[0;32m 139\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mdata\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 140\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", 917 | "\u001b[1;31mValueError\u001b[0m: Error when checking input: expected input_1 to have shape (160, 64, 1) but got array with shape (57, 64, 1)" 918 | ] 919 | } 920 | ], 921 | "source": [ 922 | "embs1 = list(embeddings_from_sound(model, clip1, sample_rate=16000))" 923 | ] 924 | }, 925 | { 926 | "cell_type": "code", 927 | "execution_count": 10, 928 | "metadata": {}, 929 | "outputs": [ 930 | { 931 | "ename": "TypeError", 932 | "evalue": "object of type 'generator' has no len()", 933 | "output_type": "error", 934 | "traceback": [ 935 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 936 | "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", 937 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0membs1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 938 | "\u001b[1;31mTypeError\u001b[0m: object of type 'generator' has no len()" 939 | ] 940 | } 941 | ], 942 | "source": [ 943 | "len(embs1)" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": null, 949 | "metadata": {}, 950 | "outputs": [], 951 | "source": [] 952 | } 953 | ], 954 | "metadata": { 955 | "kernelspec": { 956 | "display_name": "Python 3", 957 | "language": "python", 958 | "name": "python3" 959 | }, 960 | "language_info": { 961 | "codemirror_mode": { 962 | "name": "ipython", 963 | "version": 3 964 | }, 965 | "file_extension": ".py", 966 | "mimetype": "text/x-python", 967 | "name": "python", 968 | "nbconvert_exporter": "python", 969 | "pygments_lexer": "ipython3", 970 | "version": "3.7.4" 971 | } 972 | }, 973 | "nbformat": 4, 974 | "nbformat_minor": 2 975 | } 976 | -------------------------------------------------------------------------------- /6-Prototype Speaker Recognizer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "Using TensorFlow backend.\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "import application\n", 18 | "import audiostream\n", 19 | "import audiolib\n", 20 | "import config\n", 21 | "import models\n", 22 | "import charts\n", 23 | "\n", 24 | "import time\n", 25 | "import numpy as np" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "name": "stdout", 35 | "output_type": "stream", 36 | "text": [ 37 | "Resuming at batch 42548\n", 38 | "Loading model from checkpoints\\voice-embeddings.h5\n", 39 | "Preloaded model from checkpoints\\voice-embeddings.h5\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "application.init()\n", 45 | "model = application.load_model()" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 4, 51 | "metadata": {}, 52 | "outputs": [ 53 | { 54 | "ename": "AttributeError", 55 | "evalue": "'RefVariable' object has no attribute 'get_value'", 56 | "output_type": "error", 57 | "traceback": [ 58 | 
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 59 | "\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)", 60 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mmodel\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0moptimizer\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mlr\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_value\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 61 | "\u001b[1;31mAttributeError\u001b[0m: 'RefVariable' object has no attribute 'get_value'" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "model.optimizer.lr.get_value()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 2, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "name": "stdout", 76 | "output_type": "stream", 77 | "text": [ 78 | "Recording\n", 79 | "4 seconds remaining\n", 80 | "3 seconds remaining\n", 81 | "2 seconds remaining\n", 82 | "1 seconds remaining\n", 83 | "Recording finished\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "clip1 = audiostream.record(4, rate=16000)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 213, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "def sound_chunks(sound, chunk_seconds=1.6, step_seconds=0.5, sample_rate=44100):\n", 98 | " \"\"\"Return a sequence of sound chunks from a sound clip.\n", 99 | " Each chunk will be 1.6 seconds of the sound, and each\n", 100 | " successive chunk will be advanced by the specified number of seconds.\n", 101 | " sound: a numpy array of 16-bit signed integers representing a sound sample.\n", 102 | " \"\"\"\n", 103 | " chunk_len = int(chunk_seconds * sample_rate)\n", 104 | " chunk_step = int(step_seconds * sample_rate)\n", 105 | " chunk_count = int(len(sound) / chunk_step)\n", 106 | " for i in range(chunk_count):\n", 107 | " start = i * chunk_step\n", 108 | " end = start + chunk_len\n", 109 | " yield sound[start:end]\n", 110 | "\n", 111 | "def embeddings_from_sound(model, sound, sample_rate=16000):\n", 112 | " \"\"\"Return a sequence of embeddings from the different time slices\n", 113 | " in the sound clip.\n", 114 | " sound: a numpy array of 16-bit signed integers representing a sound sample.\n", 115 | " \"\"\"\n", 116 | " # The 1.601 is a hack to make sure we end up with a shape of 160 instead of 159.\n", 117 | " # What we actually want is 1.6.\n", 118 | " #*TODO: Figure out a better way to fix the 159->160 off by one error than adding .001.\n", 119 | " chunk_seconds=1.61\n", 120 | " for chunk in sound_chunks(sound, chunk_seconds=chunk_seconds, sample_rate=sample_rate):\n", 121 | " # The last portion of the sound may be less than our desired length.\n", 122 | " # We can safely skip it because we'll process it later as it shifts down the time window.\n", 123 | " lc = len(chunk)\n", 124 | " #print('lc=%d sec=%f delta=%f' % (lc, lc/sample_rate, lc/sample_rate - chunk_seconds))\n", 125 | " if len(chunk)/sample_rate - chunk_seconds < -0.009:\n", 126 | " continue\n", 127 | " yield calc_embedding(model, chunk)\n", 128 | "\n", 129 | "def calc_embedding(model, sound, sample_rate=16000):\n", 130 | " \"\"\"\n", 131 | " sound: a numpy array of 16-bit signed sound samples.\n", 132 | " \"\"\"\n", 133 | " features = audiolib.extract_features(sound, sample_rate=sample_rate, num_filters=config.NUM_FILTERS)\n", 134 | " if len(features) < 160:\n", 135 | " raise Exception('need exactly 160 features to calculate an embedding, but got 
%d' % len(features))\n", 136 | " emb = models.get_embedding(model, features)\n", 137 | " return emb\n", 138 | "\n", 139 | "def compare_embeddings(emb1, emb2):\n", 140 | " \"\"\"Returns a scalar indicating the difference between 2 embeddings.\n", 141 | " Smaller numbers indicate closer.\n", 142 | " \"\"\"\n", 143 | " dist = np.linalg.norm(emb1 - emb2)\n", 144 | " return dist" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 7, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "data": { 154 | "text/plain": [ 155 | "(64000, 4.0)" 156 | ] 157 | }, 158 | "execution_count": 7, 159 | "metadata": {}, 160 | "output_type": "execute_result" 161 | } 162 | ], 163 | "source": [ 164 | "len(clip1), len(clip1)/16000" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 65, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "import itertools\n", 174 | "\n", 175 | "def compare_embedding_permutations(embeddings):\n", 176 | " \"\"\"Compare multiple embeddings against each other.\n", 177 | " Given a list of multiple embeddings, compare all the combinations of them taken 2 at a time.\n", 178 | " Returns a list of scalars representing the various comparisons.\n", 179 | " \"\"\"\n", 180 | " for emb1, emb2 in itertools.permutations(embeddings, 2):\n", 181 | " yield compare_embeddings(emb1, emb2)" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 69, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "rico1_embs = list(embeddings_from_sound(model, clip1, sample_rate=16000))" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 70, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "rico1_comparisons = list(compare_embedding_permutations(rico1_embs))" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 71, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "data": { 209 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXgAAAEWCAYAAABsY4yMAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAUFElEQVR4nO3df5RndX3f8ecLVgXkVxNWu2FZVgxiVk8MMLVSWiNo0w1FjClNoZoeUnUbNSYm6QnRmMQ2xzSkCUqKaV0tjaKiuIlEadRoI3r0gDorRH4aE1wKIjLS0oWIIPDuH987YXZ2mLnz48539rPPxznfM/d+v/d7P+/PfGdec+fzvd/PTVUhSWrPAeMuQJI0DANekhplwEtSowx4SWqUAS9JjTLgJalRBryakuTGJC8Ydx3SWmDAa5+SZFeSF82677wknwOoqmdV1VUL7GNzkkqybsBSpbEz4KUV5h8OrRUGvJoy8wg/yXOTTCbZneRbSS7sNvts9/XeJPcnOSXJAUnelOS2JHcneU+SI2bs9990j92T5NdntfPmJDuSvDfJbuC8ru2rk9yb5JtJLk7yxBn7qySvSfK1JPcl+a0kT++eszvJ5TO3l5bCgFfLLgIuqqrDgacDl3f3P7/7emRVHVpVVwPndbfTgOOAQ4GLAZJsAf4QeBmwATgCOHpWWy8BdgBHAu8DHgF+ETgKOAV4IfCaWc/ZCpwMPA/4FWB718YxwLOBc5fRd8mA1z7piu7I+N4k9zIK37l8D/jBJEdV1f1Vdc08+3wZcGFV3VpV9wNvAM7phlvOBj5aVZ+rqoeA3wBmT+J0dVVdUVWPVtUDVbWzqq6pqoerahfwDuBHZz3ngqraXVU3AjcAf961//+AjwEn9v+WSHsz4LUv+omqOnL6xt5HxtNeATwDuCXJl5KcOc8+fwC4bcb6bcA64KndY7dPP1BV3wHumfX822euJHlGkiuT3NUN2/w2o6P5mb41Y/mBOdYPnadeaUEGvJpVVV+rqnOBpwAXADuSPJm9j74B7gSOnbG+CXiYUeh+E9g4/UCSg4Hvn93crPX/CtwCHN8NEb0RyNJ7Iy2eAa9mJXl5kvVV9Shwb3f3I8AU8CijsfZplwG/mORpSQ5ldMT9wap6mNHY+ouT/KPujc//wMJhfRiwG7g/yTOBV69Yx6SeDHi1bCtwY5L7Gb3hek5VfbcbYnkL8PluHP95wCXApYzOsPk68F3gdQDdGPnrgA8wOpq/D7gbeHCetv898K+7bd8JfHDluyfNL17wQ1qc7gj/XkbDL18fdz3S4/EIXuohyYuTHNKN4f8ecD2wa7xVSfMz4KV+XsLojdg7geMZDff476/WNIdoJKlRHsFLUqPW1KRIRx11VG3evHncZUjSPmPnzp3frqr1cz22pgJ+8+bNTE5OjrsMSdpnJLnt8R5ziEaSGmXAS1KjDHhJapQBL0mNMuAlqVEGvCQ1atCAT3Jkd63KW5LcnOSUIduTJD1m6PPgLwI+XlVnd/NoHzJwe5KkzmABn+RwRhc3Pg+gu5blQ0O1J0na05BDNMcxunLO/0hybZJ3dVOt7iHJtiSTSSanpqYGLEeS5rdh4yaSrPptw8ZNg/RnsNkkk0wA1wCnVtUXklwE7K6qX3+850xMTJRTFUgalyQce/6Vq97ubRecyVKzOMnOqpqY67Ehj+DvAO6oqi906zuAkwZsT5I0w2ABX1V3AbcnOaG764XATUO1J0na09Bn0bwOeF93Bs2twM8M3J4kqTNowFfVdcCcY0OSpGH5SVZJapQBL0mNMuAlqVEGvCQ1yoCXpEYZ8JLUKANekhplwEtSowx4SWqUAS9JjTLgJalRBrwkNcqAl6RGGfCS1CgDXpIaZcBLUqMMeElqlAEvSY0y4CWpUQa8JDXKgJekRhnwktQoA16SGmXAS1KjDHhJatS6IXeeZBdwH/AI8HBVTQzZniTpMYMGfOe0qvr2KrQjSZrBIRpJatTQAV/AnyfZmWTbXBsk2ZZkMsnk1NTUwOVI0v5j6IA/tapOAn4ceG2S58/eoKq2V9VEVU2sX79+4HIkaf8xaMBX1Z3d17uBDwPPHbI9SdJjBgv4JE9Octj0MvBjwA1DtSdJ2tOQZ9E8Ffhwkul23l9VHx+wPUnSDIMFfFXdCjxnqP1LkubnaZKS1CgDXpIaZcBLUqMMeElqlAEvSY0y4CWpUQa8JDXKgJekRhnwktQoA16SGmXAS1KjDHhJapQBL0mNMuAlqVEGvCQ1yoCXpEYZ8JLUKANekhplwEtSowx4SWqUAS9JjTLgJalRBrwkNcqAl6RGGfCS1CgDXpIaNXjAJzkwybVJrhy6LUnSY1bjCP4XgJtXoR1J0gyDBnySjcA/B941ZDuSpL0NfQT/NuBXgEcfb4Mk25JMJpmcmpoauJyVt2HjJpKM5bZh46Zxd1/SGrZuqB0nORO4u6p2JnnB421XVduB7QATExM1VD1Duesbt3Ps+eN5e+G2C84cS7uS9g1DHsGfCpyVZBfwAeD0JO8dsD1J0gyDBXxVvaGqNlbVZuAc4C+q6uVDtSdJ2pPnwUtSo3oFfJJnL6eRqrqqqhwwlqRV1PcI/r8l+WKS1yQ5ctCKJEkrolfAV9U/Bl4GHANMJnl/kn86aGWSpGXpPQZfVV8D3gScD/wo8AdJbknyk0MVJ0laur5j8D+c5K2Mphw4HXhxVf1Qt/zWAeuTJC1R3w86XQy8E3hjVT0wfWdV3ZnkTYNUJklalr4BfwbwQFU9ApDkAOCgqvpOVV06WHWSpCXrOwb/KeDgGeuHdPdJktaovgF/UFXdP73SLR8yTEmSpJXQN+D/NslJ0ytJTgYemGd7SdKY9R2Dfz3woSR3dusbgH81TEmSpJXQK+Cr6ktJngmcAAS4paq+N2hlkqRlWcx88P8A2Nw958QkVNV7BqlKkrRsvQI+yaXA04HrgEe6uwsw4CVpjep7BD8BbKmqfe6KS5K0v+p7Fs0NwN8fshBJ0srqewR/FHBTki8CD07fWVVnDVKVJGnZ+gb8m4csQpK08vqeJvmZJMcCx1fVp5IcAhw4bGmSpOXoO13wq4AdwDu6u44GrhiqKEnS8vV9k/W1wKnAbvi7i388ZaiiJEnL1zfgH6yqh6ZXkqxjdB68JGmN6hvwn0nyRuDg7lqsHwI+OlxZkqTl6hvwvwpMAdcD/w74M0bXZ5UkrVF9z6J5lNEl+945bDmSpJXSdy6arzPHmHtVHbfiFUmSVsRi5qKZdhDwL4Hvm+8JSQ4CPgs8qWtnR1X95lKKlCQtXq8x+Kq6Z8btG1X1NuD0BZ72IHB6VT0H+BFga5LnLbNeSVJPfYdoTpqxegCjI/rD5ntON/Pk9HVcn9DdPLVSklZJ3yGa35+x/DCwC/iphZ6U5EBgJ/CDwNur6gtzbLMN2AawadOmnuVIkhbS9yya05ay86p6BPiRJEcCH07y7Kq6YdY224HtABMTEx7hS9IK6TtE80vzPV5VFy7w+L1JrgK2MppbXpI0sL4fdJoAXs1okrGjgZ8Ftj
Aah59zLD7J+u7InSQHAy8CblluwZKkfhZzwY+Tquo+gCRvBj5UVa+c5zkbgHd34/AHAJdX1ZXLKVaS1F/fgN8EPDRj/SFg83xPqKqvACcurSxJ0nL1DfhLgS8m+TCjUx1fCrxnsKokScvW9yyatyT5GPBPurt+pqquHa4sSdJy9X2TFeAQYHdVXQTckeRpA9UkSVoBfS/Z95vA+cAburueALx3qKIkScvX9wj+pcBZwN8CVNWdLDBVgSRpvPoG/EPd3DIFkOTJw5UkSVoJfQP+8iTvAI5M8irgU3jxD0la0/qeRfN73bVYdwMnAL9RVZ8ctDJJ0rIsGPDdJ1E/UVUvAgx1SdpHLDhE080I+Z0kR6xCPZKkFdL3k6zfBa5P8km6M2kAqurnB6lKkrRsfQP+f3Y3SdI+Yt6AT7Kpqv53Vb17tQqSJK2Mhcbgr5heSPLHA9ciSVpBCwV8ZiwfN2QhkqSVtVDA1+MsS5LWuIXeZH1Okt2MjuQP7pbp1quqDh+0OknSks0b8FV14GoVIklaWYuZD16StA8x4CWpUQa8JDXKgJekRhnwktQoA16SGmXAS1KjDHhJatRgAZ/kmCSfTnJzkhuT/MJQbUmS9tZ3PvileBj45ar6cpLDgJ1JPllVNw3YpiSpM9gRfFV9s6q+3C3fB9wMHD1Ue5KkPa3KGHySzcCJwBfmeGxbkskkk1NTU0tuY8PGTSRZ9dv+aFzf6yRs2Lhpv+rzuicd7Pfa3+clG3KIBoAkhwJ/DLy+qnbPfryqtgPbASYmJpY8JfFd37idY8+/csl1LtVtF5y56m2O27i+1zC+7/c4f778Xq+e1n6fBz2CT/IERuH+vqr6kyHbkiTtacizaAL8d+DmqrpwqHYkSXMb8gj+VOCngdOTXNfdzhiwPUnSDIONwVfV59jzmq6SpFXkJ1klqVEGvCQ1yoCXpEYZ8JLUKANekhplwEtSowx4SWqUAS9JjTLgJalRBrwkNcqAl6RGGfCS1CgDXpIaZcBLUqMMeElqlAEvSY0y4CWpUQa8JDXKgJekRhnwktQoA16SGmXAS1KjDHhJapQBL0mNMuAlqVGDBXySS5LcneSGodqQJD2+IY/g/wjYOuD+JUnzGCzgq+qzwP8Zav+SpPmNfQw+ybYkk0kmp6amxl2OJDVj7AFfVduraqKqJtavXz/uciSpGWMPeEnSMAx4SWrUkKdJXgZcDZyQ5I4krxiqLUnS3tYNteOqOneofUuSFuYQjSQ1yoCXpEYZ8JLUKANekhplwEtSowx4SWqUAS9JjTLgJalRBrwkNcqAl6RGGfCS1CgDXpIaZcBLUqMMeElqlAEvSY0y4CWpUQa8JDXKgJekRhnwktQoA16SGmXAS1KjDHhJapQBL0mNMuAlqVEGvCQ1yoCXpEYNGvBJtib5apK/TvKrQ7YlSdrTYAGf5EDg7cCPA1uAc5NsGao9SdKehjyCfy7w11V1a1U9BHwAeMmA7UmSZkhVDbPj5Gxga1W9slv/aeAfVtXPzdpuG7CtWz0B+OogBcFRwLcH2vdqa6Uv9mNtaaUf0E5f+vTj2KpaP9cD61a+nr+TOe7b669JVW0Htg9Yx6iYZLKqJoZuZzW00hf7sba00g9opy/L7ceQQzR3AMfMWN8I3Dlge5KkGYYM+C8Bxyd5WpInAucAHxmwPUnSDIMN0VTVw0l+DvgEcCBwSVXdOFR7PQw+DLSKWumL/VhbWukHtNOXZfVjsDdZJUnj5SdZJalRBrwkNaq5gF9oeoQkv5TkpiRfSfK/khw7jjoX0qMfP5vk+iTXJfncWv6UcN8pK5KcnaSSrMnT23q8Juclmepek+uSvHIcdS6kz+uR5Ke635Mbk7x/tWvso8fr8dYZr8VfJbl3HHX20aMvm5J8Osm1XXad0WvHVdXMjdGbuX8DHAc8EfhLYMusbU4DDumWXw18cNx1L7Efh89YPgv4+LjrXmpfuu0OAz4LXANMjLvuJb4m5wEXj7vWFejH8cC1wN/r1p8y7rqX+nM1Y/vXMTrRY+y1L/E12Q68ulveAuzqs+/WjuAXnB6hqj5dVd/pVq9hdH7+WtOnH7tnrD6ZOT5Etkb0nbLit4DfBb67msUtQitTb/Tpx6uAt1fV/wWoqrtXucY+Fvt6nAtctiqVLV6fvhRweLd8BD0/U9RawB8N3D5j/Y7uvsfzCuBjg1a0NL36keS1Sf6GUTD+/CrVtlgL9iXJicAxVXXlaha2SH1/tv5F9y/0jiTHzPH4uPXpxzOAZyT5fJJrkmxdter66/273g3DPg34i1Woayn69OXNwMuT3AH8GaP/SBbUWsD3mh4BIMnLgQngPw9a0dL0nebh7VX1dOB84E2DV7U08/YlyQHAW4FfXrWKlqbPa/JRYHNV/TDwKeDdg1e1eH36sY7RMM0LGB35vivJkQPXtVi9f9cZfchyR1U9MmA9y9GnL+cCf1RVG4EzgEu73515tRbwvaZHSPIi4NeAs6rqwVWqbTEWO83DB4CfGLSipVuoL4cBzwauSrILeB7wkTX4RuuCr0lV3TPj5+mdwMmrVNti9PnZugP406r6XlV9ndEEgMevUn19LeZ35BzW7vAM9OvLK4DLAarqauAgRhORzW/cbzCs8JsV64BbGf07Nv1mxbNmbXMiozc0jh93vcvsx/Ezll8MTI677qX2Zdb2V7E232Tt85psmLH8UuCacde9xH5sBd7dLR/FaPjg+8dd+1J+rhjNULuL7kOda/HW8zX5GHBet/xDjP4ALNinsXdugG/WGcBfdSH+a919/5HR0TqM/nX+FnBdd/vIuGteYj8uAm7s+vDp+UJz3LeF+jJr2zUZ8D1fk//UvSZ/2b0mzxx3zUvsR4ALgZuA64Fzxl3zUn+uGI1d/864a12B12QL8PnuZ+s64Mf67NepCiSpUa2NwUuSOga8JDXKgJekRhnwktQoA16SGmXAa7+S5Kok/2zWfa9P8ofzPOf+4SuTVp4Br/3NZYw+2TjTWv+ko7QkBrz2NzuAM5M8CSDJZuAHgOu66wN8uZtnf6+ZCZO8IMmVM9YvTnJet3xyks8k2ZnkE0k2rEZnpPkY8NqvVNU9wBcZfRwfRkfvHwQeAF5aVScxumbA7yeZaxKovSR5AvBfgLOr6mTgEuAtK127tFjrxl2ANAbTwzR/2n39t4w+nv/bSZ4PPMpoutanAnf12N8JjCZM+2T3N+FA4JsrX7a0OAa89kdXABcmOQk4uKq+3A21rAdOrqrvdTNbHjTreQ+z53+9048HuLGqThm2bGlxHKLRfqeq7mc0qdklPPbm6hHA3V24nwbMda3e24AtSZ6U5Ajghd39XwXWJzkFRkM2SZ41ZB+kPjyC1/7qMuBPeOyMmvcBH00yyWi2vltmP6Gqbk9yOfAV4GuMrltKVT2U5GzgD7rgXwe8jdHMktLYOJukJDXKIRpJapQBL0mNMuAlqVEGvCQ1yoCXpEYZ8JLUKANekhr1/wHSrjwSS5Gj2QAAAABJRU5ErkJggg==\n", 210 | "text/plain": [ 211 | "
" 212 | ] 213 | }, 214 | "metadata": { 215 | "needs_background": "light" 216 | }, 217 | "output_type": "display_data" 218 | } 219 | ], 220 | "source": [ 221 | "charts.histogram(rico1_comparisons)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 64, 227 | "metadata": {}, 228 | "outputs": [ 229 | { 230 | "data": { 231 | "text/plain": [ 232 | "0.40245754" 233 | ] 234 | }, 235 | "execution_count": 64, 236 | "metadata": {}, 237 | "output_type": "execute_result" 238 | } 239 | ], 240 | "source": [ 241 | "compare_embeddings(rico1_embs[0], emb1[0])" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 72, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "name": "stdout", 251 | "output_type": "stream", 252 | "text": [ 253 | "Recording\n", 254 | "4 seconds remaining\n", 255 | "3 seconds remaining\n", 256 | "2 seconds remaining\n", 257 | "1 seconds remaining\n", 258 | "Recording finished\n" 259 | ] 260 | } 261 | ], 262 | "source": [ 263 | "rico_clip2 = audiostream.record(4, rate=16000)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 74, 269 | "metadata": {}, 270 | "outputs": [], 271 | "source": [ 272 | "rico2_embs = list(embeddings_from_sound(model, rico_clip2, sample_rate=16000))" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 77, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "rico2_comparisons = list(compare_embedding_permutations(rico2_embs))" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 78, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "data": { 291 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEWCAYAAABrDZDcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAYcUlEQVR4nO3dfbBkdX3n8ffHGZ4UBA03kcDAoKIGLBW4y4JuElRSQRchrrg7xicMho3RGKNbUYyLSspskU18xIjjSon4BKLrDiysgVU0bAl6QR4F1xFhGUG5gjBOeHLgu3/0mdD29Nzbd2ZO950571fVqTkPv9Pn2113+tPn6XdSVUiSuusxky5AkjRZBoEkdZxBIEkdZxBIUscZBJLUcQaBJHWcQaDOSXJDkiMnXYe0WBgE2u4kuSXJUQPzTkhyGUBVHVRVl87zGsuTVJKlLZYqLQoGgTQBBowWE4NAndO/x5DksCQzSdYm+WmS9zfNvtn8e0+SdUmOSPKYJO9KcmuSO5N8Osnufa/7mmbZXUn+88B23pPkvCSfSbIWOKHZ9reS3JPkjiSnJ9mx7/UqyZ8m+UGSXyT56yRPadZZm+Tc/vbS5jII1HUfAj5UVY8HngKc28z/nebfPapq16r6FnBCMzwfeDKwK3A6QJIDgX8AXgnsBewO7D2wreOA84A9gM8CDwN/AewJHAG8EPjTgXWOBg4FDgf+EljZbGMZ8EzgFVvw3iXAIND26yvNL+17ktxD70t6mF8CT02yZ1Wtq6rL53jNVwLvr6qbq2odcDKwojnMczxwflVdVlUPAacAgx15fauqvlJVj1TV/VV1ZVVdXlXrq+oW4OPA7w6sc1pVra2qG4DrgX9stn8vcBFw8OgfiTScQaDt1R9U1R4bBjb+pb3BicDTgJuSfCfJMXO85m8Ct/ZN3wosBX6jWXbbhgVVdR9w18D6t/VPJHlakguS/KQ5XPQ39PYO+v20b/z+IdO7zlGvNBKDQJ1WVT+oqlcAvw6cBpyX5HFs/Gse4HZgv77pfYH19L6c7wD22bAgyS7Arw1ubmD6Y8BNwAHNoal3Atn8dyNtHoNAnZbkVUmmquoR4J5m9sPALPAIvXMBG3we+Isk+yfZld4v+HOqaj29Y/8vSfLc5gTue5n/S303YC2wLskzgDdstTcmLYBBoK47GrghyTp6J45XVNUDzaGd9wH/pznPcDhwJnA2vSuKfgQ8APwZQHMM/8+AL9DbO/gFcCfw4Bzb/k/AHzZtPwGcs/XfnjS/+GAaaetr9hjuoXfY50eTrkeai3sE0laS5CVJHtucY/g74DrglslWJc3PIJC2nuPonVC+HTiA3mEmd7m16HloSJI6zj0CSeq4ba7jqz333LOWL18+6TIkaZty5ZVX/qyqpoYt2+aCYPny5czMzEy6DEnapiS5dVPLPDQkSR1nEEhSxxkEktRxBoEkdZxBIEkdZxBIUse1HgRJliT5bpILhizbKck5SVYnuSLJ8rbrkST9qnHsEfw5cOMmlp0I/Lyqngp8gN6DQSRJY9RqECTZB/i3wH/bRJPjgLOa8fOAFybxCU2SNEZt7xF8EPhLek96GmZvmue4Nk95upeNH+9HkpOSzCSZmZ2d3exi9tpnX5JMZNhrn303u25tOyb1N7Z0p138ux6T7fF7pLUuJtJ7CPidVXVlkiM31WzIvI26Q62qlcBKgOnp6c3uLvUnP76N/d6+0amKsbj1tLmeia7txaT+xm497ZiJbbdrtsfvkTb3CJ4HHJvkFnqP73tBks8MtFkDLANIshTYHbi7xZokSQNaC4KqOrmq9qmq5cAK4GtV9aqBZquA1zbjxzdtfECCJI3R2HsfTXIqMFNVq4BPAmcnWU1v
T2DFuOuRpK4bSxBU1aXApc34KX3zHwBePo4aJEnDeWexJHWcQSBJHWcQSFLHGQSS1HEGgSR1nEEgSR1nEEhSxxkEktRxBoEkdZxBIEkdZxBIUscZBJLUcQaBJHWcQSBJHWcQSFLHGQSS1HGtBUGSnZN8O8k1SW5I8t4hbU5IMpvk6mZ4fVv1SJKGa/MJZQ8CL6iqdUl2AC5LclFVXT7Q7pyqelOLdUiS5tBaEDQPoV/XTO7QDD6YXpIWmVbPESRZkuRq4E7g4qq6YkizlyW5Nsl5SZa1WY8kaWOtBkFVPVxVzwH2AQ5L8syBJucDy6vqWcAlwFnDXifJSUlmkszMzs62WbIkdc5YrhqqqnuAS4GjB+bfVVUPNpOfAA7dxPorq2q6qqanpqZarVWSuqbNq4amkuzRjO8CHAXcNNBmr77JY4Eb26pHkjRcm1cN7QWclWQJvcA5t6ouSHIqMFNVq4A3JzkWWA/cDZzQYj2SpCHavGroWuDgIfNP6Rs/GTi5rRokSfPzzmJJ6jiDQJI6ziCQpI4zCCSp4wwCSeo4g0CSOs4gkKSOMwgkqeMMAknqOINAkjrOIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4wwCSeq4Np9ZvHOSbye5JskNSd47pM1OSc5JsjrJFUmWt1WPJGm4NvcIHgReUFXPBp4DHJ3k8IE2JwI/r6qnAh8ATmuxHknSEK0FQfWsayZ3aIYaaHYccFYzfh7wwiRpqyZJ0sZaPUeQZEmSq4E7gYur6oqBJnsDtwFU1XrgXuDXhrzOSUlmkszMzs62WbIkdU6rQVBVD1fVc4B9gMOSPHOgybBf/4N7DVTVyqqarqrpqampNkqVpM4ay1VDVXUPcClw9MCiNcAygCRLgd2Bu8dRkySpp82rhqaS7NGM7wIcBdw00GwV8Npm/Hjga1W10R6BJKk9S1t87b2As5IsoRc451bVBUlOBWaqahXwSeDsJKvp7QmsaLEeSdIQrQVBVV0LHDxk/il94w8AL2+rBknS/LyzWJI6ziCQpI4zCCSp4wwCSeo4g0CSOs4gkKSOMwgkqeMMAknqOINAkjrOIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4wwCSeo4g0CSOm6kIEjyzIW+cJJlSb6e5MYkNyT58yFtjkxyb5Krm+GUYa8lSWrPqI+qPCPJjsCngM9V1T0jrLMeeFtVXZVkN+DKJBdX1fcG2v1TVR0zesmSpK1ppD2Cqvo3wCuBZcBMks8l+b151rmjqq5qxn8B3AjsvYX1SpK2spHPEVTVD4B3AW8Hfhf4cJKbkvy7+dZNspzeg+yvGLL4iCTXJLkoyUGbWP+kJDNJZmZnZ0ctWZI0glHPETwryQfo/ap/AfCSqvqtZvwD86y7K/Al4C1VtXZg8VXAflX1bOAjwFeGvUZVrayq6aqanpqaGqVkSdKIRt0jOJ3el/azq+qNfYd8bqe3lzBUkh3ohcBnq+rLg8uram1VrWvGLwR2SLLnAt+DJGkLjHqy+MXA/VX1MECSxwA7V9V9VXX2sBWSBPgkcGNVvX8TbZ4E/LSqKslh9ILproW+CUnS5hs1CC4BjgLWNdOPBf4ReO4c6zwPeDVwXZKrm3nvBPYFqKozgOOBNyRZD9wPrKiqWtA7kCRtkVGDYOcNh3AAqmpdksfOtUJVXQZknjan0zvsJEmakFHPEfxzkkM2TCQ5lN4veEnSNm7UPYK3AF9McnszvRfwH9opSZI0TiMFQVV9J8kzgKfTO9xzU1X9stXKJEljMeoeAcC/ApY36xychKr6dCtVSZLGZqQgSHI28BTgauDhZnYBBoEkbeNG3SOYBg700k5J2v6MetXQ9cCT2ixEkjQZo+4R7Al8L8m3gQc3zKyqY1upSpI0NqMGwXvaLEKSNDmjXj76jST7AQdU1SXNXcVL2i1NkjQOo3ZD/cfAecDHm1l7s4kuoyVJ25ZRTxa/kV4ncmvhXx5S8+ttFSVJGp9Rg+DBqnpow0SSpfTuI5AkbeNGDYJvJHknsEvzrOIvAue3V5YkaVxGDYJ3ALPAdcB/BC5kjieTSZK2HaNeNfQI8IlmkCRtR0bta+hHDDknUFVP3uoVSZLGaiF9DW2wM/By4IlzrZBkGb1O6Z4EPAKsrKoPDbQJ8CF6z0S+Dzihqq4asSZJ0lYw0jmCqrqrb/hxVX0QeME8q60H3lZVvwUcDrwxyYEDbV4EHNAMJwEfW1j5kqQtNeqhoUP6Jh9Dbw9ht7nWqao7gDua8V8kuZHejWjf62t2HPDpplfTy5PskWSvZl1J0hiMemjo7/vG1wO3AP9+1I0kWQ4cDFwxsGhv4La+6TXNvF8JgiQn0dtjYN999x11s5LatmQHekd4J7DpHXfm4YcemMi2tzejXjX0/M3dQJJdgS8Bb6mqtYOLh21uyPZXAisBpqenvZFNWiwe/iX7vf2CiWz61tOOmci2bz3tmLFvs22jHhp661zLq+r9m1hvB3oh8Nmq+vKQJmuAZX3T+wC3j1KTJGnrGPWGsmngDfQO2+wN/AlwIL3zBEPPFTRXBH0SuHFTQQGsAl6TnsOBez0/IEnjtZAH0xxSVb8ASPIe4ItV9fo51nke8GrguiRXN/PeCewLUFVn0LtD+cXAanqXj75uoW9AkrRlRg2CfYGH+qYfApbPtUJVXcbwcwD9bYpez6aSpAkZNQjOBr6d5L/TO5n7Uno3i0mStnGjXjX0viQXAb/dzHpdVX23vbIkSeMy6sligMcCa5tuItYk2b+lmiRJYzTqoyrfDbwdOLmZtQPwmbaKkiSNz6h7BC8FjgX+GaCqbmeeLiYkSduGUYPgoeYKnwJI8rj2SpIkjdOoQXBuko8DeyT5Y+ASfEiNJG0XRr1q6O+aZxWvBZ4OnFJVF7damSRpLOYNgiRLgK9W1VGAX/6StJ2Z99BQVT0M3Jdk9zHUI0kas1HvLH6AXp9BF9NcOQRQVW9upSpJ0tiMGgT/sxkkSduZOYMgyb5V9f+q6qxxFSRJGq/5zhF8ZcNIki+1XIskaQLmC4L+bqSf3GYhkqTJmC8IahPjkqTtxHwni5+dZC29PYNdmnGa6aqqx7danSSpdXPuEVTVkqp6fFXtVlVLm/EN03OGQJIzk9yZ5PpNLD8yyb1Jrm6GU7bkjUiSNs+ol49ujk8BpzP3k8z+qaqOabEGSdI8FvJgmgWpqm8Cd7f1+pKkraO1IBjREUmuSXJRkoM21SjJSUlmkszMzs6Osz5J2u5NMgiuAvarqmcDH6HvnoVBVbWyqqaranpqampsBUpSF0wsCKpqbVWta8YvBHZIsuek6pGkrppYECR5UpI044c1tdw1qXokqatau2ooyeeBI4E9k6wB3k3vofdU1RnA8cAbkqwH7gdWNI/DlCSNUWtBUFWvmGf56fQuL5UkTdCkrxqSJE2YQSBJHWcQSFLHGQSS1HEGgSR1nEEgSR1nEEhSxxkEktRxBoEkdZxBIEkdZxBIUscZBJLUcQaBJHWcQSBJHWcQSFLHGQSS1HEGgSR1XGtBkOTMJHcmuX4
Ty5Pkw0lWJ7k2ySFt1SJJ2rQ29wg+BRw9x/IXAQc0w0nAx1qsRZK0Ca0FQVV9E7h7jibHAZ+unsuBPZLs1VY9kqThJnmOYG/gtr7pNc28jSQ5KclMkpnZ2dmxFCdJXTHJIMiQeTWsYVWtrKrpqpqemppquSxJ6pZJBsEaYFnf9D7A7ROqRZI6a5JBsAp4TXP10OHAvVV1xwTrkaROWtrWCyf5PHAksGeSNcC7gR0AquoM4ELgxcBq4D7gdW3VIknatNaCoKpeMc/yAt7Y1vYlSaPxzmJJ6jiDQJI6ziCQpI4zCCSp4wwCSeo4g0CSOs4gkKSOMwgkqeMMAknqOINAkjrOIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4wwCSeq4VoMgydFJvp9kdZJ3DFl+QpLZJFc3w+vbrEeStLE2n1m8BPgo8HvAGuA7SVZV1fcGmp5TVW9qqw5J0tza3CM4DFhdVTdX1UPAF4DjWtyeJGkztBkEewO39U2vaeYNelmSa5Ocl2TZsBdKclKSmSQzs7OzbdQqSZ3VZhBkyLwamD4fWF5VzwIuAc4a9kJVtbKqpqtqempqaiuXKUnd1mYQrAH6f+HvA9ze36Cq7qqqB5vJTwCHtliPJGmINoPgO8ABSfZPsiOwAljV3yDJXn2TxwI3tliPJGmI1q4aqqr1Sd4EfBVYApxZVTckORWYqapVwJuTHAusB+4GTmirHknScK0FAUBVXQhcODDvlL7xk4GT26xBkjQ37yyWpI4zCCSp4wwCSeo4g0CSOs4gkKSOMwgkqeMMAknqOINAkjrOIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4wwCSeo4g0CSOs4gkKSOazUIkhyd5PtJVid5x5DlOyU5p1l+RZLlbdYjSdpYa0GQZAnwUeBFwIHAK5IcONDsRODnVfVU4APAaW3VI0kars09gsOA1VV1c1U9BHwBOG6gzXHAWc34ecALk6TFmiRJA1JV7bxwcjxwdFW9vpl+NfCvq+pNfW2ub9qsaaZ/2LT52cBrnQSc1Ew+Hfh+K0VvXXsCP5u31eKwLdUK1tumbalWsN6F2K+qpoYtWNriRof9sh9MnVHaUFUrgZVbo6hxSTJTVdOTrmMU21KtYL1t2pZqBevdWto8NLQGWNY3vQ9w+6baJFkK7A7c3WJNkqQBbQbBd4ADkuyfZEdgBbBqoM0q4LXN+PHA16qtY1WSpKFaOzRUVeuTvAn4KrAEOLOqbkhyKjBTVauATwJnJ1lNb09gRVv1TMC2dChrW6oVrLdN21KtYL1bRWsniyVJ2wbvLJakjjMIJKnjDIIFGqHbjLcm+V6Sa5P87yT79S17OMnVzTB44nxS9f5Jkuuami7rv/s7ycnNet9P8vuLud4ky5Pc3/f5njHpWvvaHZ+kkkz3zVt0n+2m6p3EZztKvUlOSDLbV9fr+5a9NskPmuG1g+suslrH/r2wkapyGHGgd9L7h8CTgR2Ba4ADB9o8H3hsM/4G4Jy+ZesWYb2P7xs/FvhfzfiBTfudgP2b11myiOtdDly/mD7bpt1uwDeBy4HpxfzZzlHvWD/bBfwtnACcPmTdJwI3N/8+oRl/wmKstVk21u+FYYN7BAszb7cZVfX1qrqvmbyc3v0TkzJKvWv7Jh/Hozf0HQd8oaoerKofAaub11us9Y7bKF2oAPw18LfAA33zFuVnO0e9kzBqvcP8PnBxVd1dVT8HLgaObqlO2LJaFwWDYGH2Bm7rm17TzNuUE4GL+qZ3TjKT5PIkf9BGgQNGqjfJG5vuPf4WePNC1t3KtqRegP2TfDfJN5L8drulzl9rkoOBZVV1wULXbcGW1Avj/Wxh9M/oZc1h2POSbLiBddyf75bUCuP/XtiIQbAwI3WJAZDkVcA08F/7Zu9bvdvL/xD4YJKnbP0Sf7WMIfOGdeHx0ap6CvB24F0LWXcr25J676D3+R4MvBX4XJLHt1bpPLUmeQy9HnXfttB1W7Il9Y77s4XRPqPzgeVV9SzgEh7twHLcn++W1Arj/17YiEGwMKN0m0GSo4C/Ao6tqgc3zK+q25t/bwYuBQ5us1hGrLfPF4ANv0gWuu7WsNn1NodZ7mrGr6R3zPZpLdUJ89e6G/BM4NIktwCHA6uaE7CL8bPdZL0T+GxHqZequqvv/9cngENHXXcr25JaJ/G9sLFJn6TYlgZ6d2LfTO8E34aTQgcNtDmY3n+UAwbmPwHYqRnfE/gBQ07WTaDeA/rGX0Lvrm+Ag/jVE5o30/4JzS2pd2pDffRO2v0YeOIkax1ofymPnnxdlJ/tHPWO9bNdwN/CXn3jLwUub8afCPyo+T/3hGZ8on8Lc9Q69u+Foe9h3Bvc1gfgxcD/bb7s/6qZdyq9X//Q2+37KXB1M6xq5j8XuK75I7kOOHGR1Psh4Iam1q/3/wHT26v5Ib1uv1+0mOsFXtbMvwa4CnjJpGsdaPsvX6yL9bPdVL2T+GxH/Fv4L311fR14Rt+6f0TvJPxq4HWLtdZJfS8MDnYxIUkd5zkCSeo4g0CSOs4gkKSOMwgkqeMMAknqOINAGpDk0sEeQZO8Jck/zLHOuvYrk9phEEgb+zwbPzZ1RTNf2u4YBNLGzgOOSbIT9PrjB34TuDq9Z0xc1TwTYaMeJpMcmeSCvunTk5zQjB/adNp2ZZKvJtlrHG9Gmo9BIA2oXr863+bRrotXAOcA9wMvrapD6D134u+TDOtwbCNJdgA+AhxfVYcCZwLv29q1S5tj6aQLkBapDYeH/kfz7x/R62Xyb5L8DvAIva6GfwP4yQiv93R6nbpd3GTHEnq9ekoTZxBIw30FeH+SQ4Bdquqq5hDPFHBoVf2y6aVz54H11vOre9oblge4oaqOaLdsaeE8NCQNUVXr6HW8diaPniTeHbizCYHnA/sNWfVW4MAkOyXZHXhhM//7wFSSI6B3qCjJQW2+B2lU7hFIm/Z54Ms8egXRZ4Hzk8zQ6/30psEVquq2JOcC19LrUvi7zfyHkhwPfLgJiKXAB+n1SClNlL2PSlLHeWhIkjrOIJCkjjMIJKnjDAJJ6jiDQJI6ziCQpI4zCCSp4/4/fB8brt1qHWAAAAAASUVORK5CYII=\n", 292 | "text/plain": [ 293 | "
" 294 | ] 295 | }, 296 | "metadata": { 297 | "needs_background": "light" 298 | }, 299 | "output_type": "display_data" 300 | } 301 | ], 302 | "source": [ 303 | "charts.histogram(rico2_comparisons)" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 85, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "0.60666585" 315 | ] 316 | }, 317 | "execution_count": 85, 318 | "metadata": {}, 319 | "output_type": "execute_result" 320 | } 321 | ], 322 | "source": [ 323 | "compare_embeddings(rico1_embs[3], rico2_embs[3])" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 88, 329 | "metadata": {}, 330 | "outputs": [ 331 | { 332 | "name": "stdout", 333 | "output_type": "stream", 334 | "text": [ 335 | "Recording\n", 336 | "4 seconds remaining\n", 337 | "3 seconds remaining\n", 338 | "2 seconds remaining\n", 339 | "1 seconds remaining\n", 340 | "Recording finished\n" 341 | ] 342 | } 343 | ], 344 | "source": [ 345 | "cor_clip1 = audiostream.record(4, rate=16000)" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 89, 351 | "metadata": {}, 352 | "outputs": [ 353 | { 354 | "name": "stdout", 355 | "output_type": "stream", 356 | "text": [ 357 | "Recording\n", 358 | "4 seconds remaining\n", 359 | "3 seconds remaining\n", 360 | "2 seconds remaining\n", 361 | "1 seconds remaining\n", 362 | "Recording finished\n" 363 | ] 364 | } 365 | ], 366 | "source": [ 367 | "cor_clip2 = audiostream.record(4, rate=16000)" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 90, 373 | "metadata": {}, 374 | "outputs": [ 375 | { 376 | "name": "stdout", 377 | "output_type": "stream", 378 | "text": [ 379 | "Recording\n", 380 | "4 seconds remaining\n", 381 | "3 seconds remaining\n", 382 | "2 seconds remaining\n", 383 | "1 seconds remaining\n", 384 | "Recording finished\n" 385 | ] 386 | } 387 | ], 388 | "source": [ 389 | "cor_clip3 = audiostream.record(4, rate=16000)" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 135, 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "cor1_embs = list(embeddings_from_sound(model, cor_clip1, sample_rate=16000))\n", 399 | "cor2_embs = list(embeddings_from_sound(model, cor_clip1, sample_rate=16000))\n", 400 | "cor3_embs = list(embeddings_from_sound(model, cor_clip1, sample_rate=16000))" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 99, 406 | "metadata": {}, 407 | "outputs": [ 408 | { 409 | "data": { 410 | "text/plain": [ 411 | "1.5730244" 412 | ] 413 | }, 414 | "execution_count": 99, 415 | "metadata": {}, 416 | "output_type": "execute_result" 417 | } 418 | ], 419 | "source": [ 420 | "compare_embeddings(rico1_embs[2], cor1_embs[3])" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 100, 426 | "metadata": {}, 427 | "outputs": [ 428 | { 429 | "name": "stdout", 430 | "output_type": "stream", 431 | "text": [ 432 | "Recording\n", 433 | "4 seconds remaining\n", 434 | "3 seconds remaining\n", 435 | "2 seconds remaining\n", 436 | "1 seconds remaining\n", 437 | "Recording finished\n" 438 | ] 439 | } 440 | ], 441 | "source": [ 442 | "jecka_clip1 = audiostream.record(4, rate=16000)" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": 101, 448 | "metadata": {}, 449 | "outputs": [ 450 | { 451 | "name": "stdout", 452 | "output_type": "stream", 453 | "text": [ 454 | "Recording\n", 455 | "4 seconds remaining\n", 456 | "3 seconds remaining\n", 457 | "2 
seconds remaining\n", 458 | "1 seconds remaining\n", 459 | "Recording finished\n" 460 | ] 461 | } 462 | ], 463 | "source": [ 464 | "jecka_clip2 = audiostream.record(4, rate=16000)" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 102, 470 | "metadata": {}, 471 | "outputs": [ 472 | { 473 | "name": "stdout", 474 | "output_type": "stream", 475 | "text": [ 476 | "Recording\n", 477 | "4 seconds remaining\n", 478 | "3 seconds remaining\n", 479 | "2 seconds remaining\n", 480 | "1 seconds remaining\n", 481 | "Recording finished\n" 482 | ] 483 | } 484 | ], 485 | "source": [ 486 | "jecka_clip3 = audiostream.record(4, rate=16000)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": 103, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "audiolib.save_wav(r'd:\\tmp\\rico1.wav', clip1, rate=16000)\n", 496 | "audiolib.save_wav(r'd:\\tmp\\rico2.wav', rico_clip2, rate=16000)\n", 497 | "audiolib.save_wav(r'd:\\tmp\\cor_clip1.wav', cor_clip1, rate=16000)\n", 498 | "audiolib.save_wav(r'd:\\tmp\\cor_clip2.wav', cor_clip2, rate=16000)\n", 499 | "audiolib.save_wav(r'd:\\tmp\\cor_clip3.wav', cor_clip3, rate=16000)\n", 500 | "audiolib.save_wav(r'd:\\tmp\\jecka_clip1.wav', jecka_clip1, rate=16000)\n", 501 | "audiolib.save_wav(r'd:\\tmp\\jecka_clip2.wav', jecka_clip2, rate=16000)\n", 502 | "audiolib.save_wav(r'd:\\tmp\\jecka_clip3.wav', jecka_clip3, rate=16000)" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 110, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "jecka1_embs = list(embeddings_from_sound(model, jecka_clip1, sample_rate=16000))\n", 512 | "jecka2_embs = list(embeddings_from_sound(model, jecka_clip2, sample_rate=16000))\n", 513 | "jecka3_embs = list(embeddings_from_sound(model, jecka_clip3, sample_rate=16000))" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": 124, 519 | "metadata": {}, 520 | "outputs": [ 521 | { 522 | "data": { 523 | "text/plain": [ 524 | "1.2861824" 525 | ] 526 | }, 527 | "execution_count": 124, 528 | "metadata": {}, 529 | "output_type": "execute_result" 530 | } 531 | ], 532 | "source": [ 533 | "compare_embeddings(jecka1_embs[0],rico1_embs[0])" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 125, 539 | "metadata": {}, 540 | "outputs": [ 541 | { 542 | "data": { 543 | "text/plain": [ 544 | "0.8714248" 545 | ] 546 | }, 547 | "execution_count": 125, 548 | "metadata": {}, 549 | "output_type": "execute_result" 550 | } 551 | ], 552 | "source": [ 553 | "compare_embeddings(jecka1_embs[1],jecka2_embs[0])" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 126, 559 | "metadata": {}, 560 | "outputs": [ 561 | { 562 | "data": { 563 | "text/plain": [ 564 | "1.1345081" 565 | ] 566 | }, 567 | "execution_count": 126, 568 | "metadata": {}, 569 | "output_type": "execute_result" 570 | } 571 | ], 572 | "source": [ 573 | "compare_embeddings(cor1_embs[1],jecka2_embs[0])" 574 | ] 575 | }, 576 | { 577 | "cell_type": "code", 578 | "execution_count": 155, 579 | "metadata": {}, 580 | "outputs": [], 581 | "source": [ 582 | "class KnownSpeaker:\n", 583 | " def __init__(self, name):\n", 584 | " self.name = name\n", 585 | " self.embeddings = [] # A list of embeddings of known utterances by this speaker.\n", 586 | " \n", 587 | " def add_embeddings(self, embeddings):\n", 588 | " self.embeddings.extend(embeddings)\n", 589 | " \n", 590 | " def distance(self, anchor_embedding):\n", 591 | " \"\"\"Returns the average distance 
of the embedding to known\n", 592 | " utterances by this speaker.\n", 593 | " \"\"\"\n", 594 | " distances = [compare_embeddings(anchor_embedding, emb) for emb in self.embeddings]\n", 595 | " return np.mean(distances)" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": 157, 601 | "metadata": {}, 602 | "outputs": [], 603 | "source": [ 604 | "rico = KnownSpeaker('Rico')\n", 605 | "rico.add_embeddings(rico1_embs)\n", 606 | "rico.add_embeddings(rico2_embs)\n", 607 | "cor = KnownSpeaker('Cor')\n", 608 | "cor.add_embeddings(cor1_embs)\n", 609 | "cor.add_embeddings(cor2_embs)\n", 610 | "cor.add_embeddings(cor3_embs)\n", 611 | "jecka = KnownSpeaker('Jecka')\n", 612 | "jecka.add_embeddings(jecka1_embs)\n", 613 | "jecka.add_embeddings(jecka2_embs)\n", 614 | "jecka.add_embeddings(jecka2_embs)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 169, 620 | "metadata": {}, 621 | "outputs": [ 622 | { 623 | "data": { 624 | "text/plain": [ 625 | "(0.49930662, 1.5563623, 1.2464644)" 626 | ] 627 | }, 628 | "execution_count": 169, 629 | "metadata": {}, 630 | "output_type": "execute_result" 631 | } 632 | ], 633 | "source": [ 634 | "emb = rico2_embs[3]\n", 635 | "#emb = cor1_embs[3]\n", 636 | "#emb = jecka2_embs[3]\n", 637 | "rico.distance(emb), cor.distance(emb), jecka.distance(emb)" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 231, 643 | "metadata": {}, 644 | "outputs": [], 645 | "source": [ 646 | "class RecognizerPrototype:\n", 647 | " def __init__(self, known_speakers, model):\n", 648 | " self.known_speakers = known_speakers\n", 649 | " self.model = model\n", 650 | " \n", 651 | " def run(self, duration_seconds=20):\n", 652 | " rate = 16000\n", 653 | " stream = audiostream.AudioStream(seconds=4, rate=rate)\n", 654 | " stream.start()\n", 655 | " seconds_remaining = duration_seconds\n", 656 | " speaker_name = '-'\n", 657 | " speaker = None\n", 658 | " step_duration = 0.5\n", 659 | " dist = 999\n", 660 | " while seconds_remaining > 0:\n", 661 | " print('%d remaining. 
dist=%f Speaker=%s' % (seconds_remaining, dist, speaker_name))\n", 662 | " time.sleep(step_duration)\n", 663 | " \n", 664 | " sound = stream.sound_array()\n", 665 | " if len(sound)/rate < 1.61:\n", 666 | " continue\n", 667 | " embs = list(embeddings_from_sound(self.model, sound, sample_rate=rate))\n", 668 | " # To avoid the rounding error of 159 frames instead of 160, we need at least 2 full chunks.\n", 669 | " if len(embs) <1:\n", 670 | " continue\n", 671 | " # Later, we could change this to use multiple embeddings form the last few seconds.\n", 672 | " # For now we'll just get the last full chunk.\n", 673 | " embedding = embs[-1]\n", 674 | " dist, speaker = self.determine_speaker(embedding)\n", 675 | " if dist < 0.99:\n", 676 | " speaker_name = speaker.name\n", 677 | " else:\n", 678 | " speaker_name = '-'\n", 679 | " \n", 680 | " seconds_remaining -= step_duration\n", 681 | " \n", 682 | " def determine_speaker(self, embedding):\n", 683 | " best_speaker = None\n", 684 | " best_dist = 999\n", 685 | " for speaker in self.known_speakers:\n", 686 | " dist = speaker.distance(embedding)\n", 687 | " if dist < best_dist:\n", 688 | " best_dist = dist\n", 689 | " best_speaker = speaker\n", 690 | " return best_dist, best_speaker" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": 235, 696 | "metadata": { 697 | "scrolled": false 698 | }, 699 | "outputs": [ 700 | { 701 | "name": "stdout", 702 | "output_type": "stream", 703 | "text": [ 704 | "15 remaining. dist=999.000000 Speaker=-\n", 705 | "15 remaining. dist=999.000000 Speaker=-\n", 706 | "15 remaining. dist=999.000000 Speaker=-\n", 707 | "15 remaining. dist=999.000000 Speaker=-\n", 708 | "14 remaining. dist=1.161672 Speaker=-\n", 709 | "14 remaining. dist=0.963942 Speaker=Cor\n", 710 | "13 remaining. dist=0.787207 Speaker=Cor\n", 711 | "13 remaining. dist=0.941732 Speaker=Cor\n", 712 | "12 remaining. dist=0.850635 Speaker=Cor\n", 713 | "12 remaining. dist=0.895779 Speaker=Cor\n", 714 | "11 remaining. dist=1.052427 Speaker=-\n", 715 | "11 remaining. dist=0.857238 Speaker=Cor\n", 716 | "10 remaining. dist=1.042673 Speaker=-\n", 717 | "10 remaining. dist=1.156600 Speaker=-\n", 718 | "9 remaining. dist=1.305294 Speaker=-\n", 719 | "9 remaining. dist=1.246918 Speaker=-\n", 720 | "8 remaining. dist=1.161736 Speaker=-\n", 721 | "8 remaining. dist=0.900886 Speaker=Rico\n", 722 | "7 remaining. dist=1.151994 Speaker=-\n", 723 | "7 remaining. dist=0.781099 Speaker=Rico\n", 724 | "6 remaining. dist=0.743346 Speaker=Rico\n", 725 | "6 remaining. dist=1.173527 Speaker=-\n", 726 | "5 remaining. dist=1.402816 Speaker=-\n", 727 | "5 remaining. dist=1.385853 Speaker=-\n", 728 | "4 remaining. dist=1.395822 Speaker=-\n", 729 | "4 remaining. dist=1.351745 Speaker=-\n", 730 | "3 remaining. dist=1.267563 Speaker=-\n", 731 | "3 remaining. dist=1.295461 Speaker=-\n", 732 | "2 remaining. dist=1.383031 Speaker=-\n", 733 | "2 remaining. dist=1.380281 Speaker=-\n", 734 | "1 remaining. dist=1.378038 Speaker=-\n", 735 | "1 remaining. dist=1.373088 Speaker=-\n", 736 | "0 remaining. 
dist=1.425356 Speaker=-\n" 737 | ] 738 | } 739 | ], 740 | "source": [ 741 | "r = RecognizerPrototype([rico, cor, jecka], model)\n", 742 | "r.run(duration_seconds=15)" 743 | ] 744 | }, 745 | { 746 | "cell_type": "code", 747 | "execution_count": null, 748 | "metadata": {}, 749 | "outputs": [], 750 | "source": [] 751 | } 752 | ], 753 | "metadata": { 754 | "kernelspec": { 755 | "display_name": "Python 3", 756 | "language": "python", 757 | "name": "python3" 758 | }, 759 | "language_info": { 760 | "codemirror_mode": { 761 | "name": "ipython", 762 | "version": 3 763 | }, 764 | "file_extension": ".py", 765 | "mimetype": "text/x-python", 766 | "name": "python", 767 | "nbconvert_exporter": "python", 768 | "pygments_lexer": "ipython3", 769 | "version": "3.7.4" 770 | } 771 | }, 772 | "nbformat": 4, 773 | "nbformat_minor": 2 774 | } 775 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # voice-embeddings 2 | Audio processing using deep neural networks. Speaker identification using voice embeddings. 3 | 4 | 5 | ## References 6 | * [Deep Speaker Paper](https://arxiv.org/pdf/1705.02304.pdf) 7 | * [Vox Celeb Dataset](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) 8 | * [philipperemy/deep-speaker](https://github.com/philipperemy/deep-speaker) 9 | * [Walleclipse/Deep_Speaker-speaker_recognition_system](https://github.com/Walleclipse/Deep_Speaker-speaker_recognition_system) 10 | * [python_speech_features](https://github.com/jameslyons/python_speech_features) 11 | * ```python_speech_features.fbank()``` Produces Mel-filterbank energy features from an audio signal 12 | 13 | ## Structure of a Batch 14 | * ```config.BATCH_SIZE=32```. A batch has 32 triplets in it. Each triplet is composed of an anchor, a positive, and a negative. 15 | * ```config.EMBEDDING_LENGTH=512```. The embeddings are of length 512, so a speaker is characterized by 512 numbers. 16 | * **Batch sequencing**. The total number of entries in a batch is ```config.BATCH_SIZE * 3``` because there are 3 samples in a triplet (anchor, positive, negative). Each of these entries is embedded into 512 numbers. The sequence is as follows: 17 | * All the anchor samples 18 | * Then all the positive samples 19 | * Then all the negative samples 20 | 21 | | anchor | positive | negative | 22 | |--------|----------|----------| 23 | | 0-31 | 32-63 | 64-95 | 24 | 25 | * **The loss function**. The loss function defined in **triplet_loss**.py optimizes in both of these ways (sketched in the example below): 26 | * *maximize* the cosine similarity between the **anchor** examples and the **positive** examples 27 | * *minimize* the cosine similarity between the **anchor** examples and the **negative** examples 28 | 29 | ## Model 30 | * The shape of the input is ```(NUM_FRAMES, 64, 1)```. 31 | * ```config.NUM_FRAMES=160```. Each frame covers a 25ms window advanced in 10ms steps, so by default this is about 1.6 seconds of audio. 32 | * The shape of the output is ```config.EMBEDDING_LENGTH=512```. 33 | 
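To make the batch layout and the triplet loss above concrete, here is a minimal NumPy sketch. It is illustrative only: the array names and the ```.mean()``` reduction are assumptions made for the example, and the real code lives in ```minibatch.py``` and ```triplet_loss.py``` (which works on the model's 512-number embeddings and may reduce the per-triplet losses differently).

```python
import numpy as np

BATCH_SIZE = 32        # as in config.BATCH_SIZE
EMBEDDING_LENGTH = 512 # as in config.EMBEDDING_LENGTH
ALPHA = 0.2            # margin, as in config.ALPHA

# Stand-in for the model output: one embedding per batch entry,
# sequenced as [all anchors, all positives, all negatives].
embeddings = np.random.randn(BATCH_SIZE * 3, EMBEDDING_LENGTH)
# L2-normalize each embedding so a dot product is cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

anchors   = embeddings[0:BATCH_SIZE]                   # rows 0-31
positives = embeddings[BATCH_SIZE:2 * BATCH_SIZE]      # rows 32-63
negatives = embeddings[2 * BATCH_SIZE:3 * BATCH_SIZE]  # rows 64-95

pos_sim = np.sum(anchors * positives, axis=1)  # cosine(anchor, positive), shape (32,)
neg_sim = np.sum(anchors * negatives, axis=1)  # cosine(anchor, negative), shape (32,)

# Hinge on the similarity gap: the positive must beat the negative by at least ALPHA.
per_triplet_loss = np.maximum(neg_sim - pos_sim + ALPHA, 0.0)
total_loss = per_triplet_loss.mean()  # single scalar for the whole batch
print(total_loss)
```

With random embeddings the two similarities are both near zero, so the loss starts out around ```ALPHA```; training drives it toward zero by widening the gap between the anchor-positive and anchor-negative similarities.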
34 | 35 | ## Recently Completed Tasks 36 | **Get the model built**. Become very certain about what the inputs are to the model, and how the model trains. 37 | Right now I'm a bit hazy on how the loss function works, and what the shapes of the inputs to the model are. 38 | Looking at the loss function, it looks to me like it returns a single number. But it looks to me like the model 39 | contains many triplets. So I need to gain crystal clarity on the following issues: 40 | * what are the shapes of the inputs in the model? *done, and documented* 41 | * what's the purpose of that resize with a 2048, and the /16 in there? *done, and documented* 42 | * **How does the loss function work with multiple samples?** 43 | It computes the loss among all the triplets, then reduces it down to a single scalar value reflecting the 44 | total loss across the entire batch. That scalar is what the optimizer minimizes. 45 | **Train and test the model** 46 | 47 | ## Next Steps 48 | * **Install a voice activity detector**. Detect when there's voice activity. Don't fill the buffer when there's no activity. 49 | * **Create a timeline** of whether there's voice activity, and the embedding for that time slice. 50 | * **Speaker diarization** will probably be done in a separate repository. 51 | * **Speaker diarization UI** will probably be done in a separate repository. 52 | * **Clean up requirements.txt**. Had a problem trying to use pre-trained weights on another machine that had a different version of Keras and Tensorflow, so I added requirements.txt. Now GitHub is registering a security vulnerability on a version of PyCrypto that I'm not actually using. Need to pare down requirements.txt to just the things actually in use and install on a clean virtualenv. 53 | 54 | -------------------------------------------------------------------------------- /analysis.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import models 3 | 4 | def difference_emb(emb1, emb2): 5 | """Returns a scalar indicating the difference between 2 embeddings. 6 | Smaller numbers indicate closer. 7 | """ 8 | dist = np.linalg.norm(emb1 - emb2) 9 | return dist 10 | 11 | def difference_sample(model, sample1, sample2): 12 | """Returns a scalar indicating the difference between 2 sound samples. 13 | Smaller numbers indicate closer. 14 | """ 15 | emb1 = models.get_embedding(model, sample1) 16 | emb2 = models.get_embedding(model, sample2) 17 | return difference_emb(emb1, emb2) 18 | 19 | def comparison_matrix(model, batch, qty=3): 20 | """Compares the embeddings for the samples in a batch. 21 | All the embeddings in the batch will be placed along the rows and the columns, 22 | and the difference between each pair of embeddings will be placed in the matrix cells. 23 | """ 24 | # It wouldn't surprise me if there exists a better, vectorized way to do this. 25 | embeddings = model.predict(batch.X) 26 | num_emb = len(embeddings) 27 | qty = min(qty, num_emb) 28 | dists = np.zeros((qty, qty)) 29 | # Compare all the embeddings to each other 30 | for row in range(qty): 31 | row_emb = embeddings[row] 32 | for col in range(qty): 33 | col_emb = embeddings[col] 34 | dists[row][col] = difference_emb(row_emb, col_emb) 35 | return dists -------------------------------------------------------------------------------- /application.py: -------------------------------------------------------------------------------- 1 | # Factory methods for producing things the application needs. 2 | # Generally the strategy is for this module to read the configuration 3 | # and configure things as appropriate for the application and produce 4 | # the configured objects needed for the application to run. 5 | # So we should move any references to config into here. 6 | # This isn't completely done yet, so that's aspirational. 
7 | 8 | import config 9 | from checkpoint import CheckpointMonitor 10 | import minibatch 11 | import models 12 | from speakerdb import SpeakerDatabase 13 | 14 | import keras.models 15 | import os 16 | from tensorflow.python.util import deprecation 17 | import tensorflow as tf 18 | 19 | # Singletons 20 | _speaker_db = None 21 | 22 | def make_speaker_db() -> SpeakerDatabase: 23 | global _speaker_db 24 | if not _speaker_db: 25 | _speaker_db = SpeakerDatabase(config.DATASET_TRAINING_DIR) 26 | return _speaker_db 27 | 28 | def make_model() -> keras.models.Model: 29 | """Returns an untrained model.""" 30 | model = models.make_model(config.BATCH_SIZE, config.EMBEDDING_LENGTH, 31 | config.NUM_FRAMES, config.NUM_FILTERS, config.LEARNING_RATE) 32 | return model 33 | 34 | def load_model() -> keras.models.Model: 35 | model = make_model() 36 | checkpoint_monitor = make_checkpoint_monitor(model) 37 | checkpoint_monitor.load_most_recent() 38 | return model 39 | 40 | def make_checkpoint_monitor(model: keras.models.Model) -> CheckpointMonitor: 41 | return CheckpointMonitor(model, directory=config.CHECKPOINT_DIRECTORY, base_name='voice-embeddings', 42 | seconds_between_saves=config.CHECKPOINT_SECONDS, log_fn=log) 43 | 44 | def make_batch() -> minibatch.MiniBatch: 45 | make_speaker_db() 46 | return minibatch.create_batch(_speaker_db) 47 | 48 | def log(*items): 49 | print(*items) 50 | 51 | def init(): 52 | """Call this once at application startup.""" 53 | # I hate to suppress warnings, but there are so many layers of software printing so many warnings that 54 | # I struggle to even make my application code work, because it gets swamped in all this bullshit. 55 | # Make it stop! 56 | # Layers that are fucking spamming me include: Intel KMP, keras, and tensorflow. 57 | 58 | # Stop console spam. https://stackoverflow.com/questions/56224689/reduce-console-verbosity 59 | os.environ['KMP_WARNINGS'] = 'off' 60 | deprecation._PRINT_DEPRECATION_WARNINGS = False 61 | tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) -------------------------------------------------------------------------------- /audiolib.py: -------------------------------------------------------------------------------- 1 | from scipy.io import wavfile 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | import os 5 | import random 6 | import wave 7 | from python_speech_features import fbank 8 | 9 | def load_wav(wav_file): 10 | """Load a wav file. 11 | Returns: 12 | rate: number of samples per second. 13 | data: an array of samples as signed 16-bit integers. 14 | """ 15 | rate, data = wavfile.read(wav_file) 16 | return rate, data 17 | 18 | def save_wav(filename, samples, rate=16000, width=2, channels=1): 19 | """Save a wav file. 20 | samples: an iterable such as a numpy array where each item is a number of size width bytes. 
21 | """ 22 | wav = wave.open(filename, 'wb') 23 | wav.setnchannels(channels) 24 | wav.setsampwidth(width) 25 | wav.setframerate(rate) 26 | wav.writeframes(samples) 27 | wav.close() 28 | 29 | def graph_spectrogram(wav_file): 30 | """Plots a spectrogram for a wav audio file.""" 31 | rate, data = get_wav_info(wav_file) 32 | nfft = 200 # Length of each window segment 33 | fs = 8000 # Sampling frequencies 34 | noverlap = 120 # Overlap between windows 35 | nchannels = data.ndim 36 | if nchannels == 1: 37 | pxx, freqs, bins, im = plt.specgram(data, nfft, fs, noverlap = noverlap) 38 | elif nchannels == 2: 39 | pxx, freqs, bins, im = plt.specgram(data[:,0], nfft, fs, noverlap = noverlap) 40 | return pxx 41 | 42 | def graph_audio(wav_file): 43 | """Plots 2 graphs for an audio file: an amplitude graph, and a spectrogram.""" 44 | rate, samples = load_wav(wav_file) 45 | graph_raw_audio(samples) 46 | 47 | def graph_raw_audio(samples): 48 | plt.figure(1) 49 | a = plt.subplot(211) 50 | a.set_xlabel('time [s]') 51 | a.set_ylabel('value [-]') 52 | plt.plot(samples) 53 | c = plt.subplot(212) 54 | Pxx, freqs, bins, im = c.specgram(samples, NFFT=1024, Fs=16000, noverlap=900) 55 | c.set_xlabel('Time') 56 | c.set_ylabel('Frequency') 57 | plt.show() 58 | 59 | def normalize_frames(m,epsilon=1e-12): 60 | return [(v - np.mean(v)) / max(np.std(v),epsilon) for v in m] 61 | 62 | def extract_features(signal=np.random.uniform(size=48000), sample_rate=16000, num_filters=64): 63 | """ 64 | Returns: np.array of shape (num_frames, num_filters, 1). Each frame is 25 ms. 65 | """ 66 | filter_banks, energies = fbank(signal, samplerate=sample_rate, nfilt=num_filters, winlen=0.025) #filter_bank (num_frames , 64),energies (num_frames ,) 67 | #delta_1 = delta(filter_banks, N=1) 68 | #delta_2 = delta(delta_1, N=1) 69 | 70 | filter_banks = normalize_frames(filter_banks) 71 | #delta_1 = normalize_frames(delta_1) 72 | #delta_2 = normalize_frames(delta_2) 73 | 74 | #frames_features = np.hstack([filter_banks, delta_1, delta_2]) # (num_frames , 192) 75 | frames_features = filter_banks # (num_frames , 64) 76 | num_frames = len(frames_features) 77 | return np.reshape(np.array(frames_features),(num_frames, num_filters, 1)) #(num_frames,64, 1) 78 | -------------------------------------------------------------------------------- /audiostream.py: -------------------------------------------------------------------------------- 1 | from collections import deque 2 | import pyaudio 3 | import time 4 | import numpy as np 5 | 6 | def record(seconds, rate=16000): 7 | """Record a few seconds of audio. 8 | Normally AudioStream is used for continuous processing of an ongoing audio stream, 9 | but you can use this method to record a few seconds of audio and stop. 10 | Returns a numpy array of 16-bit signed integers. 11 | """ 12 | s = AudioStream(seconds=seconds, rate=rate) 13 | s.start() 14 | print('Recording') 15 | for i in range(seconds): 16 | print('%d seconds remaining' % (seconds - i)) 17 | time.sleep(1) 18 | s.stop() 19 | print('Recording finished') 20 | return s.sound_array() 21 | 22 | # This implementation uses a deque. It might be more efficient to use io.BytesIO 23 | # like this does: https://github.com/hbock/byte-fifo/blob/master/fifo.py 24 | # What I've written here does some unknown amount of byte copying. 25 | # I'm going to just use what I've written here for now and optimize later if needed. 26 | class AudioStream: 27 | """ 28 | Streams audio recording in real time, providing raw sound data from the last few seconds. 
29 | """ 30 | def __init__(self, seconds=4, rate=44100, bytes_per_sample=2): 31 | self.buffer_seconds = seconds 32 | self.rate = rate 33 | self.bytes_per_sample = bytes_per_sample 34 | buffer_size = seconds * rate * bytes_per_sample 35 | self._buffer = deque(maxlen=buffer_size) 36 | self._pyaudio = None # pyaudio.PyAudio object will be initialized in self.start(). 37 | self._stream = None # Stream object will be initialized in self.start(). 38 | 39 | seconds_per_buffer = 0.2 # How much audio we want included in each callback. 40 | self._frames_per_buffer = int(rate * seconds_per_buffer) # How many samples we get per callback. 41 | self._stop_requested = False 42 | 43 | def start(self): 44 | """Start recording the audio stream.""" 45 | self._pyaudio = pyaudio.PyAudio() # (1) Instantiate PyAdio. Sets up the portaudio system. 46 | pyaudio_format = self._pyaudio.get_format_from_width(width=2) 47 | self._stream = self._pyaudio.open(format = pyaudio_format, 48 | channels = 1, 49 | rate = self.rate, 50 | input = True, 51 | frames_per_buffer = self._frames_per_buffer, 52 | stream_callback = self._pyaudio_callback) 53 | self._stream.start_stream() 54 | 55 | def stop(self): 56 | """Stop recording the audio stream.""" 57 | self._stop_requested = True 58 | while self._stream.is_active(): 59 | time.sleep(0.1) 60 | self._stream.stop_stream() 61 | self._stream.close() 62 | self._pyaudio.terminate() 63 | 64 | def sound_data(self): 65 | """Returns an iterable sequence of bytes in the buffer representing the most recent buffered audio.""" 66 | return self._buffer 67 | 68 | def sound_array(self): 69 | """Returns a numpy array with 16-bit signed integers representing 70 | samples of the most recent buffered audio. 71 | """ 72 | b = bytearray(self._buffer) 73 | return np.frombuffer(b, dtype=np.int16) 74 | 75 | def _pyaudio_callback(self, in_data, frame_count, time_info, status): 76 | """Called by pyaudio to process a chunk of data. 77 | This is called in a separate thread. 
78 | """ 79 | self._buffer.extend(in_data) 80 | if self._stop_requested: 81 | flag = pyaudio.paComplete 82 | else: 83 | flag = pyaudio.paContinue 84 | return (None, flag) -------------------------------------------------------------------------------- /charts.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import analysis 3 | 4 | def histogram(seq, bins=10, title='Histogram'): 5 | plt.hist(seq, edgecolor = 'black', bins = bins) 6 | plt.title(title) 7 | plt.xlabel('Value') 8 | plt.ylabel('Frequency') 9 | plt.show() 10 | 11 | def plot_comparison(model, batch, qty=96): 12 | comparison = analysis.comparison_matrix(model, batch, qty) 13 | plt.figure(figsize=(12,8)) 14 | plt.imshow(comparison) 15 | plt.colorbar() -------------------------------------------------------------------------------- /checkpoint.py: -------------------------------------------------------------------------------- 1 | # Save checkpoints of models periodically 2 | 3 | import os 4 | import os.path 5 | import time 6 | 7 | class CheckpointMonitor: 8 | """Periodically saves models, and manages rolling checkpoint files.""" 9 | def __init__(self, model, directory, base_name, seconds_between_saves=120, log_fn=None): 10 | self.model = model 11 | self.directory = directory 12 | self.base_name = base_name 13 | self.seconds_between_saves = seconds_between_saves 14 | self.csv_filename = os.path.join(directory, base_name + '.csv') 15 | self._log_fn = log_fn 16 | self.batch_num = 0 # Might be updated in self._init_csv() 17 | 18 | self._next_checkpoint_time = 0 # Will be set below. 19 | self._csv_file = self._init_csv() 20 | self._csv_updates = [] # We'll write these to the csv file when we save the model. 21 | self._reset_checkpoint_time() 22 | 23 | def train_step_done(self, training_loss, test_loss=None): 24 | """Call this periodically during the training process. 25 | The CheckpointMonitor will save the model if appropriate. 26 | Returns true if the model was saved, or false if we determined 27 | now isn't an appropriate time to save. 28 | """ 29 | self.batch_num += 1 30 | self._append_csv(training_loss, test_loss) 31 | if self.is_save_needed(): 32 | self.save() 33 | return True 34 | else: 35 | return False 36 | 37 | def save(self): 38 | """Definitely save the model without regard for whether we've 39 | saved recently. You can call this at the end of your training 40 | to make sure the latest model is saved. 41 | """ 42 | filename = self._make_filename() 43 | self._log('Saving', filename) 44 | # https://www.tensorflow.org/guide/keras/save_and_serialize 45 | self.model.save_weights(filename) 46 | 47 | for line in self._csv_updates: 48 | self._csv_file.write(line + '\n') 49 | self._csv_updates = [] 50 | self._csv_file.flush() 51 | 52 | self._reset_checkpoint_time() 53 | self._log('Saved', filename) 54 | 55 | def load_most_recent(self): 56 | """Load the most recent model. Returns the model loaded, or None if no model found. 57 | """ 58 | # TODO: find most recent. 59 | # TODO: keep most recent N files. Maybe keep several of different ages (one week, one day, one hour) 60 | filename = self._make_filename() 61 | loaded = False 62 | if os.path.exists(filename): 63 | self._log('Loading model from', filename) 64 | # Don't use model.load_model() because we use lambda layers in our model. Use load_weights instead. 
65 | # See https://github.com/keras-team/keras/issues/5298 66 | self.model.load_weights(filename) 67 | self._log('Preloaded model from', filename) 68 | loaded = True 69 | return loaded 70 | 71 | def is_save_needed(self): 72 | return time.time() >= self._next_checkpoint_time 73 | 74 | def _reset_checkpoint_time(self): 75 | self._next_checkpoint_time = time.time() + self.seconds_between_saves 76 | 77 | def _make_filename(self): 78 | filename = os.path.join(self.directory, self.base_name) + '.h5' 79 | return filename 80 | 81 | def _log(self, *params): 82 | if not self._log_fn: 83 | return 84 | self._log_fn(*params) 85 | 86 | def _init_csv(self): 87 | if os.path.exists(self.csv_filename): 88 | last_line = read_last_line(self.csv_filename) 89 | self._csv_file = open(self.csv_filename, 'a') 90 | parts = last_line.split(',') 91 | try: 92 | last_batch = int(parts[0]) 93 | except ValueError: 94 | last_batch = 0 95 | self._log('Could not extract last batch number from csv file. Using 0.') 96 | self.batch_num = last_batch 97 | self._log('Resuming at batch', self.batch_num) 98 | else: 99 | self._csv_file = open(self.csv_filename, 'w') 100 | self._csv_file.write('batch_num,loss,test_loss\n') 101 | return self._csv_file 102 | 103 | def _append_csv(self, training_loss, test_loss): 104 | if test_loss==None: 105 | test_loss='' 106 | parts = [self.batch_num, training_loss, test_loss] 107 | line = ','.join([str(n) for n in parts]) 108 | self._csv_updates.append(line) 109 | # We will update the csv file next time we write the model. 110 | # This is so that the information in the csv file stays consistent with the last saved model. 111 | 112 | def read_last_line(filename): 113 | # https://stackoverflow.com/questions/3346430/what-is-the-most-efficient-way-to-get-first-and-last-line-of-a-text-file/3346788 114 | # max_line_len = 1024 115 | # with open(filename, 'rb') as f: 116 | # first = f.readline() 117 | # f.seek(-2, os.SEEK_END) # Jump to the second last byte. 118 | # while f.tell() > 1 and f.read(1) != b'\n': # Is this byte an eol? 119 | # f.seek(-2, os.SEEK_CUR) # Jump back 2 bytes (because previous line went forward 1). 120 | # f.seek(-1, os.SEEK_CUR) # This is probably not needed on unix-like line ending systems 121 | # last = f.readline().decode() 122 | # return last 123 | # I struggled so much with the above stuff that I just decided to do this the naive way: 124 | with open(filename, 'r') as f: 125 | last_line = None 126 | for line in f: 127 | last_line = line 128 | last_line = last_line.strip() 129 | return last_line -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | 2 | # Things you definitely should change 3 | 4 | # Dataset format: one directory for each speaker where the speaker id is the name of the directory. 5 | # Under that is many folders containing wav files for that speaker. 6 | DATASET_TRAINING_DIR = r'd:\datasets\voxceleb1\vox1\wav' # Training dataset. 7 | # You should start the learning rate at 0.001. 8 | # I'm setting it down once we're levelling off around epoch 42000. 9 | LEARNING_RATE = 0.0001 10 | 11 | # ---------- 12 | # Things you could safely change 13 | CHECKPOINT_SECONDS = 600 # Save the model after processing every n seconds. 14 | CHECKPOINT_DIRECTORY = 'checkpoints' # Directory to save models during training. 15 | 16 | # ---------- 17 | # Things you probably shouldn't change 18 | 19 | BATCH_SIZE = 32 # Must be even. 
20 | # Alpha, as used in FaceNet https://arxiv.org/pdf/1503.03832.pdf . 21 | # Alpha is how close the embeddings of two samples need to be to be considered the same person. 22 | ALPHA = 0.2 # as used in FaceNet https://arxiv.org/pdf/1503.03832.pdf . 23 | NUM_FRAMES = 160 # Each frame is advanced by 10ms, so so 160 frames is 1.6 seconds 24 | EMBEDDING_LENGTH = 512 # How many features are in a speaker embedding. 25 | NUM_FILTERS = 64 # Number of FFT frequency filter bands used to create embeddings. 26 | -------------------------------------------------------------------------------- /images/model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/unoti/voice-embeddings/665c283a281d3c49f6ff9436877b02dd3c3cc3fd/images/model.png -------------------------------------------------------------------------------- /images/walleclipse-loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/unoti/voice-embeddings/665c283a281d3c49f6ff9436877b02dd3c3cc3fd/images/walleclipse-loss.png -------------------------------------------------------------------------------- /minibatch.py: -------------------------------------------------------------------------------- 1 | from speakerdb import SpeakerDatabase 2 | import audiolib 3 | import config 4 | 5 | import numpy as np 6 | 7 | class MiniBatch: 8 | """ 9 | A minibatch is a group of training triplets in matrix form assembled from audio samples. 10 | """ 11 | def __init__(self, samples, speaker_ids): 12 | """ 13 | The samples will be sequenced as all of the anchors first, then all of the positives, then 14 | all of the negatives. 15 | samples: A list of audio samples. Each sample is of shape (num_frames, num_filters, 1). 16 | speaker_ids: The speaker id for each sample in **samples**. 17 | """ 18 | self.X = np.array(samples) 19 | self.Y = np.array(speaker_ids) 20 | 21 | def inputs(self): 22 | """ 23 | Returns (X, Y) 24 | X: A tensor of shape (batch_size * 3, num_frames=160, num_filter_banks=64, 1) 25 | With our default config: batch_size=32, num_frames=160, num_filter_banks=64. 26 | So the default shape of X is (96, 160, 64, 1). 27 | Y: The speaker id of each sample in X. 28 | """ 29 | return self.X, self.Y 30 | 31 | def _clipped_audio(x, num_frames): 32 | """ 33 | Truncate an audio clip to be at most num_frames. 34 | If the input is longer than num_frames then select a random subsection. 35 | """ 36 | if x.shape[0] > num_frames: 37 | bias = np.random.randint(0, x.shape[0] - num_frames) 38 | clipped_x = x[bias: num_frames + bias] 39 | else: 40 | clipped_x = x 41 | 42 | return clipped_x 43 | 44 | def _get_audio_features(filename): 45 | """Load audio from a file and process it into filter bank frequencies and time slices.""" 46 | rate, data = audiolib.load_wav(filename) 47 | features = audiolib.extract_features(data, sample_rate=rate, num_filters=config.NUM_FILTERS) 48 | return features 49 | 50 | def _add_samples(batch, samples): 51 | """Add samples into a batch, truncating each sample as needed.""" 52 | for sample in samples: 53 | batch.append(_clipped_audio(sample, config.NUM_FRAMES)) 54 | 55 | def create_batch(db: SpeakerDatabase) -> MiniBatch: 56 | # Select and preprocess all of the triplets we'll use in this batch. 
57 | anchors = [] 58 | positives = [] 59 | negatives = [] 60 | anchor_ids = [] 61 | negative_ids = [] 62 | for _ in range(config.BATCH_SIZE): 63 | anchor_fnam, positive_fnam, negative_fnam, anchor_id, negative_id = db.random_triplet() 64 | anchors.append(_get_audio_features(anchor_fnam)) 65 | positives.append(_get_audio_features(positive_fnam)) 66 | negatives.append(_get_audio_features(negative_fnam)) 67 | anchor_ids.append(anchor_id) 68 | negative_ids.append(negative_id) 69 | 70 | # Assemble the batch. Sequencing is all of the anchors, then the positives, then the negatives. 71 | batch = [] 72 | _add_samples(batch, anchors) 73 | _add_samples(batch, positives) 74 | _add_samples(batch, negatives) 75 | 76 | # Assemble the speaker ids of the samples in the batch in the same sequence. 77 | batch_ids = [] 78 | batch_ids.extend(anchor_ids) # Anchor ids. 79 | batch_ids.extend(anchor_ids) # Positive ids, which are the same as the anchor ids. 80 | batch_ids.extend(negative_ids) # Negative ids 81 | return MiniBatch(batch, batch_ids) -------------------------------------------------------------------------------- /model.drawio: -------------------------------------------------------------------------------- 1 | 7R1dc5s48Nd4pn1IBiQE+DG2e72bueZ6Sdtr+0aNYtPDlg/jxO6vP2EkA5JwMObLCTOdBhaxEvul3WUXD+B4sX0fOKv5B+JifwA0dzuAkwEAOtAR/RNBdjHE0s0YMAs8lw1KAPfeL8yAGoNuPBevMwNDQvzQW2WBU7Jc4mmYgTlBQJ6ywx6In5115cywBLifOr4M/cdzw3kMtZGWwH/H3mzOZ9Y1dmXh8MEMxXruuOQpBu3HwHcDOA4ICeOjxXaM/Yh4nC4xot9yrh4WFuBlWOSGG/DVHi+cn1eTu+8fHr/fftHv/76CMZZHx9+wBx4A06f4Rj8CejSLjtjywx2nCV66NxFp6dnUd9Zrb0pHzcOFTwH64Yb0ythisZshNlvne0wWOAx2dMBTQmJO4XmKuhwWYN8JvccsixzG6dkB3WGGj8SjKwEak0qLoTmIJLhGWRxrsgmmmN2WpulxTCayREyhE8xwKGGiB6kHT0B7np3APyOXfw9kv9AHZxpfMP/bRJI2+uQtqD4B7RY/0f/vyMJZJhcTlsdI1qvoaor3fGCE/YqBb6JFho5PxYCKYHSzFl/eK3N0VbdWW3mSLZ+FPnk8kTD5ZiWC2EPlrecw4VA1H/3rLFb0YI9KP5xmZqBriScxRbBiOdJgQVFCvA2zqrEOA/IvHhOfBBSyJEscLd7zfQFEqTlbRvpFVQhT+OgRB6FHzdINu7DwXDeaZvQ090J8v4rZ/ESNMIUFZLN0cSQ3GqMNM6w6OKae0Rx4e1RBc+R+yM5T+mso9JdZKKWqZjTjVDVAvRqUVAPQq0F5NQBQ4/aea4J92Eva0oVhrwsldQEhvd8VzlAHZIrK0PK2wIOI1+XeUqsk+LcCfYt6tyIiEzXr3Or53u0LZh8yhgL7SgYnIiLT1Jpln9nqTvTMVsOxeBxwfX2dsuCeytYrdqRXa+yhaOytlv0eXZK2P24/fv4kcSRLGRXtUsxK82UAoIuw7RoSE+kVG/yAplkNafUc+522P0MFaWFtpJV9ypETTueRppJgcSaJU7IJKyJgVjSRwg/R9SZlE8iOyB32N50nnNk64UC3XABGpyvtWhtqQ1ktG0tamkZFKUutmFOQ4OEDycPDGtfiOID8rHQXHQenCrehyserOkhGI32AJtJzF82nqmjSR9PP2hg7q6hI4QU06l+BjkVjnTHFSPS7Sppiwy4WXhcxxZTmzi41YhXdsS7+SEg7ZeWcAol4xfNXuyvIDv4kICuyidb1Rru233benTJad6dsiYbv7z7LavxGB90npwEKklPMe1RHTjkuujCRNLS2RRLKat07e606e6B39tp29iBq2dmDfdytdnWgVY2zB0FnnD04PMnZ4xSo1dnjLvWFZhxh6xlHaMvaSZXsnp2SIJyTGVk6/rsEKpi5ZMyfhKwY7X7iMNwx4jmbkGQpi7de+DV1/C1CdY3Y2WTLMO9PdgprepQbsaYde2ZY0JqcaRJ0QSPAUOBizS/WoJwfu1wvHhTNMdfmxcPh61MVHgg2rSoGKqYqMiIoWFgxb52zD+fvjZWX4qqKTarUPddZz/f3HnWlTlBEq2uaaKhiwUsiYeF9XxfFtzoawhdnzeo2UhCW28/lgikbHUeUY6QqM0FGz/szeV96gxJ5LyFqf4Oyeuk4UTqMYUWWwTBatgwvLxxsmvelLYPI++5ZBiS7WQ1IR3XRDC90jTly7EEbyhDw96cHdRfzPHW3BaqaQLrsSBvDrsUi6NLCOZGEHYhFkJyoqvT9aCVUEzRV9UIUKKhWXy+jXGdTXeq7BorpKjlrlmKtJvASh+Zb2p9ROjeU6sEudVN0+o3ji06S2/Zn1Sf+eAvP81ulWfVWeR6P5RbfrpsSXbWHNasYcgfOBC/XOCLZ+e8GaiCZoo+8WYqZl56BLFxjVpvjZKqqt7tMQtFx6gAJ879n8MqqL6hJEDvsyzY+KFB1sPfBtHrWczUU2lfLMl5C1EG26/lsL1rOpyNVPd/4r9svV/ok2vEfPD8qfaOomDGl9BzvMQeeGwGM+Hy5oUGGFo+O6hD3k2r60HxbtEyvtHmvqJpOaKA0oWzQm22gNOW66y4Xufat0CeJGxTbISxZ3hqt3jTb/QZML26Nipuh8FcbFTfrskr4e3E7S9xA21/1sUDvIvOQRkj5m6VjIwFRwfdVdVSnS2thhRWF125mxtdTn2517MOpLYZpwqvO0jIoIWpRBqW1PCODeUSoVwbb/epl3zlXpnNORQSQf0/fXleJDyN+ya71/jrrsr6N9gp0t8AnTnrdbUV3hfK51j+EYr3CfiGzqCN6bjm2yGyx3bZwObbwUs8QX9adXHR52oJ
r9fx4BN5uvWZSzDJIlbKkKltyillypfDZmhSzqfJNTTsuPGVLfyVxbr30187P3b2yQFb0EREo/bMgIirDKvq7ILUkVMQHM/iDFV1/Qop6zVqf2MvJJyDxaxFlkyqG3Z2kCjJOS6pwItQrg/23J19GYMZ+zKGPzWr9xQfUsdjMlovne/W9xJxor74tqC9su5KA91B3v5OjM22qopdUsLH4WTzia7yaOxbt/Aq5k2zuXqsl62ZE/9B47oQDa7QbWBMGSkUXZ+HPMdjP2pG0FFkV6bSQclDsyPBI4HKCStPT5NdLYzlIfgMWvvsf7V1Rc5s4EP41nmkfkgEkgf1Yx2mvN22v0yS9tm/UKDY9bPkwbuz79SeMMCCJBGNAONFMpzGLtBKrb1er1QoG4GqxfRe6q/lH4uFgYBnedgAmA8syLRPRPzFll1Ac004Is9D3WKGMcOP/hxnRYNSN7+F1oWBESBD5qyJxSpZLPI0KNDcMyUOx2D0Jiq2u3BkWCDdTNxCpf/teNE+oQ2Rk9D+wP5unLZsGu7Nw08KMxXrueuQhIe3LgOsBuAoJiZJfi+0VDmLhpXK5ef/rOnj742Zi3P7aep49JIu7i4T722OqHB4hxMuoNmu8+/rne9rduw/fvOHkx/DdNfnIqhi/3WDD5DWw7IA2Mv4Z0l+z+Bd7+miXihQvvTfxyNCraeCu1/6UlppHi4ASzEMF7AkD80T/zYNQKRoxWeAo3NF6D9mwpaM2z41YSgtx4Eb+72KbLkPP7MDu0MJn4tPeWAZDusPYHGBuXaIijzXZhFPMquUF/TgnGzk8p8gNZzgSONEfuQfPSPuBPGJQzdJBvSf7jt670+SG/e8mRu/41l9QHbWMT/iB/v+FLNxldjPDQcJkvYrv5gCRFoy5XzDym7iTkRtQbFBliSsbye29gYjvms5qKzayTVuhT540xDW+WfEk9lBl/Tk0OJK1R/+6ixX9sWdlHi4LLdC+JI3YPFnSHaEwpz0R3kZFfVlHIfkHX5GAhJSyJEscd94PAo5EpTlbxkpH1QhT+vg3DiOfmro37MbC97y4mfHD3I/wzSoZ5gdq2CktJJulh2PcGEw2zFib1qGbMUO8bURrS5RhxK5zSg0lSm0Z5fpbUJdjdcPSulFTNyytGw3rhgWMdGZI1WN4mHVUKQjQClJTQRAy9fzRtI4gm9cQxRMI1B5zZr44l5mTeVWHmWdko279ZaSHNNU2OOKGtOYaiGdk20anQ2orncWemKZSLn5KuLy8zFl/XzZPSGYzPVHkJwrATxSOYkfKESD4/tPnu1thmIrikgk0N4L5wRpYwEN46EFhZOmdofUT2HaL8jZLjH/eUI0k8gZtyXsoyHvsRtN5rNIkXJwo9hyIQZtCLWIYSZwd0+wSxCNBqF9wsDlPYdqqhZk21mtHgwnvwrg0RsZI1OnOIrA2bCj+alRzPTI+aUFyf7/G7URo1YZoj/VP3Ca8kyYfr+l1PBqbAzQRnrtqcFgmE73gr2d4hkXtRRK/olM3ziyPGGujLRhbxLt3NY02HFYLAVQx2nQg3F2uxCqusa7+SMg4puepBDLMJe03O3+IUdpJSFZkE/frlXE5fH2eLhpU7qKJ4b13X+5EhX9lWmcqY2hVlDEfsWlOxmK87TmAFxrKwXteUa8X4FVa2qvspVcJkGqvUowOaq+y1KcCTjNeJbB641WC0VFeZSqBdr3KZxhCBcpDqOZI1GOqoTfskoTRnMzI0g2uMypnJbMyHwhZMYH+wlG0YxJ1NxEpihtv/ehb7vf3mNUlYleTLeO8v9hJjDE41ZzQB94r6SOSyWXRFkbyRONhcrpjjbihbXnvMUXS811DWFVD6a2tIQ452VqppCvY1pUKompKJTICnIHmw/Mlc3v5fNu4/soifk1qqeeu5/u6Zpsq6/ROZ2WphWcv18oOhskDvTnBwhdpDBu3cQDUcxzEhLYhepxRiY1rzIIhDYg2AFF70uMBITBSP+nZGjJNQAaOGrIhECq2IY4GRBuAqG1DeED00IYMVUCm86gFi+ck4/ZIQdAOoNL96INh4CNXbYc3xJxAYdj778bDUd+WR0AMGz0DufZgeQTEZLtG95vbEyWn6LINZksiyvbOBoqRkeY2CLoSoylDZLdiBCpdq8yd+p73pqSuFZV7uMtVii+/p/zii6za/kpV0BPAqnMyOnFOPm3gxeyis7REpmyy7FaFJDlEeLnGsRxP31bpSo6SlyF0LEZZxtDZ+Rx8SLZyGmF7vpwsvePs5Mr7cj2Qq5ihoNNmsiapReFfKFH3EI2EVQ/P0QDZWlTjofQUdl00CIx6iAUoWz8fl8ZpIlke59Vfn75emJPYtbj3gzjlkbJitpjK82rPOfS9mACT6+WGLoaMpHScf7pv1DBH9uuq6Zm1Z4c2syi5M782EOeDbs/8wvM6R6fP+Z+OQcCft3FEEHb7QhgxKKIx+LIwCCWOcbcYVPvWLo1B9Ri0lL8Y6xzejKVybcZtjNi1V2Yco4r7f20cahD6wvJcKvfdLpRv51gDFGOFGpiD0v3k2sAUGCkEptCXJ4BZJoR2ganPa57feU2ZEKzyOvpQZ3suEP/eSOWnOmH5qU6t0EoUusJrfbRC90ehuRRH5S//gUoyGpUnwVbJuLDlI3lqxj2PAP7kd+WMe27vEvJ7kkdnyx7X4XYdRyVHinmIZclBg1xqUC5TqCQ56ERoPp3j47QCTWQYjyOqbiK3gHHlidypH6VXzHLbwvmdyKr9rSCeFXSqfiyolXAO/2AwfbCq/c9E0aoBROX7fhqgYjQD8e89qRvSgcP+hHQQPC6kkwqhXWBapcDUK8BzWgGyj7noRWD3X3xBPVsEovPaXH0BOl0zTKt1ui86DVQnTCClr8w45ghOv083805WxUPqT/Lh9yDb/gJU+RbxUdZ5r/+CHYTxP3Q1d6OBM94NnAkj5VYsJ/EvMe1PWpw8tJw2tZ8LeEgmdPDIYugI5aeX2feYE3BkX7UG1/8D -------------------------------------------------------------------------------- /models.py: -------------------------------------------------------------------------------- 1 | from triplet_loss import deep_speaker_loss 2 | 3 | import numpy as np 4 | import math 5 | import keras.backend as K 6 | from keras import layers 7 | from keras import regularizers 8 | import keras.optimizers 9 | from keras.layers import Input 10 | from keras.layers.normalization import BatchNormalization 11 | from keras.layers.convolutional import Conv2D 12 | from 
keras.layers.core import Lambda, Dense 13 | from keras.models import Model 14 | 15 | TRIPLETS_PER_BATCH = 3 16 | 17 | def clipped_relu(inputs): 18 | return Lambda(lambda y: K.minimum(K.maximum(y, 0), 20))(inputs) 19 | 20 | def identity_block2(input_tensor, kernel_size, filters, stage, block): # next step try full-pre activation 21 | conv_name_base = 'res{}_{}_branch'.format(stage, block) 22 | 23 | x = Conv2D(filters, 24 | kernel_size=1, 25 | strides=1, 26 | activation=None, 27 | padding='same', 28 | kernel_initializer='glorot_uniform', 29 | kernel_regularizer=regularizers.l2(l=0.00001), 30 | name=conv_name_base + '_conv1_1')(input_tensor) 31 | x = BatchNormalization(name=conv_name_base + '_conv1.1_bn')(x) 32 | x = clipped_relu(x) 33 | 34 | x = Conv2D(filters, 35 | kernel_size=kernel_size, 36 | strides=1, 37 | activation=None, 38 | padding='same', 39 | kernel_initializer='glorot_uniform', 40 | kernel_regularizer=regularizers.l2(l=0.00001), 41 | name=conv_name_base + '_conv3')(x) 42 | x = BatchNormalization(name=conv_name_base + '_conv3_bn')(x) 43 | x = clipped_relu(x) 44 | 45 | x = Conv2D(filters, 46 | kernel_size=1, 47 | strides=1, 48 | activation=None, 49 | padding='same', 50 | kernel_initializer='glorot_uniform', 51 | kernel_regularizer=regularizers.l2(l=0.00001), 52 | name=conv_name_base + '_conv1_2')(x) 53 | x = BatchNormalization(name=conv_name_base + '_conv1.2_bn')(x) 54 | 55 | x = layers.add([x, input_tensor]) 56 | x = clipped_relu(x) 57 | return x 58 | 59 | def conv_and_res_block(inp, filters, stage): 60 | conv_name = 'conv{}-s'.format(filters) 61 | o = Conv2D(filters, 62 | kernel_size=5, 63 | strides=2, 64 | padding='same', 65 | kernel_initializer='glorot_uniform', 66 | kernel_regularizer=regularizers.l2(l=0.00001), name=conv_name)(inp) 67 | o = BatchNormalization(name=conv_name + '_bn')(o) 68 | o = clipped_relu(o) 69 | for i in range(3): 70 | o = identity_block2(o, kernel_size=3, filters=filters, stage=stage, block=i) 71 | return o 72 | 73 | def cnn_component(inp): 74 | x_ = conv_and_res_block(inp, 64, stage=1) 75 | x_ = conv_and_res_block(x_, 128, stage=2) 76 | x_ = conv_and_res_block(x_, 256, stage=3) 77 | #x_ = conv_and_res_block(x_, 512, stage=4) # This is the difference between the simple and complex model. 78 | return x_ 79 | 80 | def convolutional_model_simple(input_shape, batch_size, num_frames, embedding_length): 81 | """ 82 | Builds a convolutional model that holds and entire batch of processed sound samples. 83 | input_shape: (NUM_FRAMES, NUM_FILTERS, 1) 84 | batch_size: (BATCH_SIZE * TRIPLETS_PER_BATCH) 85 | num_frames: Number of audio frames from config = 160 (=4sec because a frame is 25ms) 86 | embedding_length: number of features (floating point numbers) per output embedding. 87 | Returns: An uncompiled keras model. 88 | """ 89 | # 90 | # http://cs231n.github.io/convolutional-networks/ 91 | # conv weights 92 | # #params = ks * ks * nb_filters * num_channels_input 93 | 94 | # Conv128-s 95 | # 5*5*128*128/2+128 96 | # ks*ks*nb_filters*channels/strides+bias(=nb_filters) 97 | 98 | # take 100 ms -> 4 frames. 99 | # if signal is 3 seconds, then take 100ms per 100ms and average out this network. 100 | # 8*8 = 64 features. 101 | 102 | # used to share all the layers across the inputs 103 | 104 | # num_frames = K.shape() - do it dynamically after. 105 | 106 | 107 | inputs = Input(shape=input_shape) 108 | x = cnn_component(inputs) # .shape = (BATCH_SIZE , num_frames/8, 64/8, 512) 109 | # -1 in the target size means that the number of dimensions in that axis will be inferred. 
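    # Added note (editorial): in this simplified model the 2048 below comes from the cnn_component
    # output: three stride-2 stages reduce the 64 filter-bank bins to 64/8 = 8, and the last
    # conv_and_res_block uses 256 filters, so 8 * 256 = 2048 features per remaining time step.
    # (The "512" in the shape comment above appears to refer to the fuller model with the
    # stage-4 block enabled, where 4 * 512 gives the same 2048.)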
110 |     # So this resize means: (n, num_frames/8, 2048).
111 |     x = Lambda(lambda y: K.reshape(y, (-1, math.ceil(num_frames / 8), 2048)), name='reshape')(x)
112 |     # Compute the average over all of the frames within a sample.
113 |     x = Lambda(lambda y: K.mean(y, axis=1), name='average')(x)  # .shape = (BATCH_SIZE, 2048)
114 |     x = Dense(embedding_length, name='affine')(x)  # .shape = (BATCH_SIZE, embedding_length)
115 |     x = Lambda(lambda y: K.l2_normalize(y, axis=1), name='ln')(x)
116 | 
117 |     model = Model(inputs, x, name='convolutional')
118 |     return model
119 | 
120 | def make_model(batch_size, embedding_length, num_frames, num_filters, learning_rate):
121 |     batch_size = batch_size * TRIPLETS_PER_BATCH
122 |     # Shape notes: batch.inputs() returns X with shape (batch_size * 3, num_frames, num_filters, 1),
123 |     # so each individual sample has shape (num_frames, num_filters, 1); that per-sample shape is
124 |     # what the convolutional model takes as its input_shape.
125 |     #
126 |     # Shape of a single sample:
127 |     input_shape = (num_frames, num_filters, 1)
128 |     model = convolutional_model_simple(input_shape, batch_size, num_frames, embedding_length)
129 |     adam = keras.optimizers.Adam(lr=learning_rate)
130 |     model.compile(optimizer=adam, loss=deep_speaker_loss)
131 |     return model
132 | 
133 | def get_embedding(model, sample):
134 |     """
135 |     sample: A sample of shape (NUM_FRAMES, NUM_FILTERS, 1), which is (160, 64, 1).
136 |             You can get this by using something like batch.X[n].
137 |     model: The trained Keras model.
138 |     returns: An array of shape (EMBEDDING_LENGTH,) with the embedding for this speaker.
139 |     """
140 |     input_batch = np.expand_dims(sample, axis=0)  # .predict() wants a batch, not a single entry.
141 |     emb_rows = model.predict(input_batch)  # Predict for this batch of 1.
142 |     embedding = np.squeeze(emb_rows)  # predict() returns a batch of results. Extract the single row.
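    # Added usage sketch (editorial; assumes a trained model and a batch built by
    # minibatch.create_batch). Because the 'ln' layer L2-normalizes each embedding,
    # a plain dot product gives the cosine similarity between two samples:
    #   emb_a = get_embedding(model, batch.X[0])
    #   emb_b = get_embedding(model, batch.X[1])
    #   similarity = float(np.dot(emb_a, emb_b))  # Near 1.0 suggests the same speaker.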
143 | return embedding -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.8.1 2 | alabaster==0.7.12 3 | anaconda-client==1.7.2 4 | anaconda-navigator==1.9.7 5 | anaconda-project==0.8.3 6 | -e git+git@github.com:unoti/aigym-antnest.git@6dfcb2ddc2118edd64ba8b474510f2a20a67d6f9#egg=antnest 7 | asn1crypto==1.2.0 8 | astor==0.8.0 9 | astroid==2.3.3 10 | astropy==3.2.3 11 | atari-py==0.2.6 12 | atomicwrites==1.3.0 13 | attrs==19.3.0 14 | audioread==2.1.8 15 | Babel==2.7.0 16 | backcall==0.1.0 17 | backports.functools-lru-cache==1.6.1 18 | backports.os==0.1.1 19 | backports.shutil-get-terminal-size==1.0.0 20 | backports.tempfile==1.0 21 | backports.weakref==1.0.post1 22 | beautifulsoup4==4.8.1 23 | bitarray==1.1.0 24 | bkcharts==0.2 25 | bleach==3.1.4 26 | bokeh==1.4.0 27 | boto==2.49.0 28 | Bottleneck==1.3.1 29 | box2d-py==2.3.8 30 | cachetools==3.1.1 31 | certifi==2019.11.28 32 | cffi==1.13.2 33 | chardet==3.0.4 34 | Click==7.0 35 | cloudpickle==1.2.2 36 | clyent==1.2.2 37 | colorama==0.4.1 38 | comtypes==1.1.7 39 | conda==4.7.12 40 | conda-build==3.18.9 41 | conda-package-handling==1.6.0 42 | conda-verify==3.4.2 43 | contextlib2==0.6.0.post1 44 | cryptography==2.8 45 | cycler==0.10.0 46 | Cython==0.29.14 47 | cytoolz==0.10.1 48 | dask==2.8.1 49 | decorator==4.4.1 50 | defusedxml==0.6.0 51 | distributed==2.8.1 52 | docutils==0.15.2 53 | entrypoints==0.3 54 | et-xmlfile==1.0.1 55 | fastcache==1.1.0 56 | filelock==3.0.12 57 | Flask==1.1.1 58 | fsspec==0.6.0 59 | future==0.18.2 60 | gast==0.2.2 61 | gevent==1.4.0 62 | glfw==1.8.3 63 | glob2==0.7 64 | google-auth==1.7.0 65 | google-auth-oauthlib==0.4.1 66 | google-pasta==0.1.8 67 | greenlet==0.4.15 68 | grpcio==1.25.0 69 | gym==0.14.0 70 | h5py==2.9.0 71 | HeapDict==1.0.1 72 | html5lib==1.0.1 73 | idna==2.8 74 | imageio==2.6.1 75 | imagesize==1.1.0 76 | importlib-metadata==1.1.0 77 | ipykernel==5.1.3 78 | ipython==7.10.1 79 | ipython-genutils==0.2.0 80 | ipywidgets==7.5.1 81 | isort==4.3.21 82 | itsdangerous==1.1.0 83 | jdcal==1.4.1 84 | jedi==0.15.1 85 | Jinja2==2.10.3 86 | joblib==0.14.0 87 | json5==0.8.5 88 | jsonschema==3.2.0 89 | jupyter==1.0.0 90 | jupyter-client==5.3.4 91 | jupyter-console==5.2.0 92 | jupyter-core==4.6.1 93 | jupyterlab==1.2.3 94 | jupyterlab-server==1.0.6 95 | Keras==2.2.4 96 | Keras-Applications==1.0.8 97 | Keras-Preprocessing==1.1.0 98 | keyring==19.2.0 99 | kiwisolver==1.1.0 100 | lazy-object-proxy==1.4.3 101 | libarchive-c==2.8 102 | librosa==0.6.3 103 | llvmlite==0.30.0 104 | locket==0.2.0 105 | lockfile==0.12.2 106 | lxml==4.4.2 107 | Markdown==3.1.1 108 | MarkupSafe==1.1.1 109 | matplotlib==3.1.1 110 | mccabe==0.6.1 111 | menuinst==1.4.16 112 | mistune==0.8.4 113 | mkl-fft==1.0.15 114 | mkl-random==1.1.0 115 | mkl-service==2.3.0 116 | mock==3.0.5 117 | more-itertools==7.2.0 118 | mpmath==1.1.0 119 | msgpack==0.6.1 120 | multipledispatch==0.6.0 121 | navigator-updater==0.2.1 122 | nbconvert==5.6.1 123 | nbformat==4.4.0 124 | networkx==2.4 125 | nltk==3.4.5 126 | nose==1.3.7 127 | notebook==6.0.2 128 | numba==0.46.0 129 | numexpr==2.7.0 130 | numpy==1.17.4 131 | numpydoc==0.9.1 132 | oauthlib==3.1.0 133 | olefile==0.46 134 | opencv-python==4.1.1.26 135 | openpyxl==3.0.2 136 | opt-einsum==3.1.0 137 | packaging==19.2 138 | pandas==0.25.3 139 | pandocfilters==1.4.2 140 | parso==0.5.1 141 | partd==1.1.0 142 | path.py==12.0.2 143 | pathlib2==2.3.5 144 | patsy==0.5.1 145 | 
pep8==1.7.1 146 | pickleshare==0.7.5 147 | Pillow==6.2.1 148 | pkginfo==1.5.0.1 149 | pluggy==0.13.1 150 | ply==3.11 151 | prometheus-client==0.7.1 152 | prompt-toolkit==3.0.2 153 | protobuf==3.10.1 154 | psutil==5.6.7 155 | py==1.8.0 156 | pyasn1==0.4.7 157 | pyasn1-modules==0.2.7 158 | PyAudio==0.2.11 159 | pycodestyle==2.5.0 160 | pycosat==0.6.3 161 | pycparser==2.19 162 | pycrypto==2.6.1 163 | pycurl==7.43.0.3 164 | pydub==0.23.1 165 | pyflakes==2.1.1 166 | pyglet==1.3.2 167 | Pygments==2.5.2 168 | pylint==2.4.4 169 | pyodbc==4.0.27 170 | pyOpenSSL==19.1.0 171 | pyparsing==2.4.5 172 | pyreadline==2.1 173 | pyrsistent==0.15.6 174 | PySocks==1.7.1 175 | pytest==5.3.1 176 | pytest-arraydiff==0.3 177 | pytest-astropy==0.6.0 178 | pytest-astropy-header==0.1.1 179 | pytest-doctestplus==0.5.0 180 | pytest-openfiles==0.4.0 181 | pytest-remotedata==0.3.2 182 | python-dateutil==2.8.1 183 | python-speech-features==0.6 184 | pytz==2019.3 185 | PyWavelets==1.1.1 186 | pywin32==223 187 | pywin32-ctypes==0.2.0 188 | pywinpty==0.5.5 189 | PyYAML==5.1.2 190 | pyzmq==18.1.0 191 | QtAwesome==0.6.0 192 | qtconsole==4.6.0 193 | QtPy==1.9.0 194 | requests==2.22.0 195 | requests-oauthlib==1.3.0 196 | resampy==0.2.2 197 | rope==0.14.0 198 | rsa==4.0 199 | ruamel-yaml==0.15.46 200 | scikit-image==0.15.0 201 | scikit-learn==0.21.3 202 | scipy==1.3.1 203 | seaborn==0.9.0 204 | Send2Trash==1.5.0 205 | simplegeneric==0.8.1 206 | singledispatch==3.4.0.3 207 | six==1.13.0 208 | snowballstemmer==2.0.0 209 | sortedcollections==1.1.2 210 | sortedcontainers==2.1.0 211 | soupsieve==1.9.5 212 | Sphinx==2.2.2 213 | sphinxcontrib-applehelp==1.0.1 214 | sphinxcontrib-devhelp==1.0.1 215 | sphinxcontrib-htmlhelp==1.0.2 216 | sphinxcontrib-jsmath==1.0.1 217 | sphinxcontrib-qthelp==1.0.2 218 | sphinxcontrib-serializinghtml==1.1.3 219 | sphinxcontrib-websupport==1.1.2 220 | spyder==3.3.6 221 | spyder-kernels==0.5.2 222 | SQLAlchemy==1.3.11 223 | statsmodels==0.10.1 224 | sympy==1.4 225 | tables==3.6.1 226 | tblib==1.5.0 227 | tensorboard==2.0.1 228 | tensorflow==2.0.1 229 | tensorflow-estimator==2.0.1 230 | termcolor==1.1.0 231 | terminado==0.8.3 232 | testpath==0.4.4 233 | toolz==0.10.0 234 | tornado==6.0.3 235 | tqdm==4.40.0 236 | traitlets==4.3.3 237 | unicodecsv==0.14.1 238 | urllib3==1.25.7 239 | wcwidth==0.1.7 240 | webencodings==0.5.1 241 | Werkzeug==0.16.0 242 | widgetsnbextension==3.5.1 243 | win-inet-pton==1.1.0 244 | win-unicode-console==0.5 245 | wincertstore==0.2 246 | wrapt==1.11.2 247 | xlrd==1.2.0 248 | XlsxWriter==1.2.6 249 | xlwings==0.16.1 250 | xlwt==1.3.0 251 | zict==1.0.0 252 | zipp==0.6.0 253 | -------------------------------------------------------------------------------- /sound/sf1_cln.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/unoti/voice-embeddings/665c283a281d3c49f6ff9436877b02dd3c3cc3fd/sound/sf1_cln.wav -------------------------------------------------------------------------------- /speakerdb.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | 4 | 5 | class SpeakerDatabase: 6 | """A dataset of audio files from various speakers. 7 | The directory structure of the data is one directory per speaker, with the speaker id as the name 8 | of the directory. 9 | Under the speaker directory are multiple directories, each containing audio clips from that speaker. 10 | """ 11 | def __init__(self, directory): 12 | """ 13 | directory: path name of the root directory. 
14 |         This directory should contain directories named by the speaker ids.
15 |         """
16 |         self.directory = directory
17 |         self.speaker_ids = os.listdir(self.directory)
18 |         if not self.speaker_ids:
19 |             raise Exception('Speaker database at %s should contain folders named by speaker id' % self.directory)
20 | 
21 |     def random_speaker(self):
22 |         """Returns the id of a random speaker."""
23 |         return random.choice(self.speaker_ids)
24 | 
25 |     def random_wav(self):
26 |         """Selects a random speaker, then returns a random utterance from that speaker.
27 |         returns speaker_id, wav_filename.
28 |         """
29 |         speaker_id = self.random_speaker()
30 |         wav_filename = self.random_wav_for_speaker(speaker_id)
31 |         return speaker_id, wav_filename
32 | 
33 |     def random_wav_for_speaker(self, speaker_id):
34 |         speaker_root = os.path.join(self.directory, speaker_id)
35 |         speaker_files = []
36 |         for root, dirs, filenames in os.walk(speaker_root):
37 |             for partial_fnam in filenames:
38 |                 filename = os.path.join(root, partial_fnam)
39 |                 speaker_files.append(filename)
40 |         return random.choice(speaker_files)
41 | 
42 |     def random_triplet(self):
43 |         """Selects a triplet of audio samples: two from the same speaker, and one from a different speaker.
44 |         Returns (anchor_fnam, positive_fnam, negative_fnam, anchor_id, negative_id).
45 |         anchor_fnam: A wav filename with an audio sample for the "anchor" speaker.
46 |         positive_fnam: A wav filename with a different audio sample from the same speaker as the anchor.
47 |         negative_fnam: A wav filename with an audio sample from a different speaker than the anchor.
48 |         anchor_id, negative_id: The speaker ids for the anchor/positive speaker and the negative speaker.
49 |         """
50 |         anchor_id, anchor_wav = self.random_wav()
51 |         negative_id, negative_wav = None, None  # Will be set in the loop below.
52 |         while True:
53 |             positive_wav = self.random_wav_for_speaker(anchor_id)
54 |             if positive_wav != anchor_wav:
55 |                 break
56 |         while True:
57 |             negative_id, negative_wav = self.random_wav()
58 |             if negative_id != anchor_id:
59 |                 break
60 |         return anchor_wav, positive_wav, negative_wav, anchor_id, negative_id
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
1 | import minibatch
2 | import application
3 | from application import log
4 | 
5 | import keras.models
6 | import numpy as np
7 | 
8 | def _train_batch(model: keras.models.Model, batch: minibatch.MiniBatch):
9 |     X, _ = batch.inputs()
10 |     # Y isn't used because it's the speaker ids as strings.
11 |     # We're not actually trying to predict Y in this model; we're just optimizing the embeddings.
12 |     # So we just generate placeholder numbers for Y. Zeros or ones would work just as well,
13 |     # because deep_speaker_loss never looks at y_true.
14 |     Y = np.random.uniform(size=(X.shape[0], 1))
15 |     loss = model.train_on_batch(X, Y)
16 |     return loss
17 | 
18 | def train():
19 |     application.init()
20 | 
21 |     log('Building model')
22 |     #*TODO: Detect if the pre-saved model doesn't exist.
23 |     model = application.make_model()
24 |     checkpoint_monitor = application.make_checkpoint_monitor(model)
25 |     checkpoint_monitor.load_most_recent()
26 |     batch_num = checkpoint_monitor.batch_num  # Restored from the csv file.
27 |     #log('Model learning rate set to %f' % model.lr.get_value())
28 | 
29 |     log('Building speaker db')
30 |     speaker_db = application.make_speaker_db()
31 | 
32 |     while True:
33 |         batch_num += 1
34 |         log('Building batch {0}'.format(batch_num))
35 |         batch = minibatch.create_batch(speaker_db)
36 | 
37 |         log('Training')
38 |         loss = _train_batch(model, batch)
39 |         log('batch {0} loss={1}'.format(batch_num, loss))
40 | 
41 |         #*TODO: log test_loss
42 |         #if checkpoint_monitor.is_save_needed():
43 |         #    test_loss = ...
44 |         checkpoint_monitor.train_step_done(loss, test_loss=None)
45 | 
46 | 
47 | if __name__ == '__main__':
48 |     train()
--------------------------------------------------------------------------------
/triplet_loss.py:
--------------------------------------------------------------------------------
1 | import config
2 | 
3 | import keras.backend as K
4 | 
5 | def batch_cosine_similarity(x1, x2):
6 |     # https://en.wikipedia.org/wiki/Cosine_similarity
7 |     # 1 = equal direction ; -1 = opposite direction
8 |     # https://keras.io/backend/#squeeze
9 |     # https://keras.io/backend/#batch_dot
10 |     # Calculate the dot product for the entire batch, where the input batch
11 |     # is shaped like (batch_size, :). The result has fewer dimensions than the
12 |     # input but is expanded to have at least 2 dimensions, so we squeeze it,
13 |     # removing one dimension.
14 |     dot = K.squeeze(K.batch_dot(x1, x2, axes=1), axis=1)
15 |     return dot
16 | 
17 | def deep_speaker_loss(y_true, y_pred):
18 |     # y_true is ignored; it's just a placeholder that Keras requires.
19 |     # y_pred.shape = (batch_size * 3, embedding_size)
20 |     # CONVENTION: y_pred is ordered as all of the anchors, then all of the positives,
21 |     # then all of the negatives (see minibatch.create_batch).
22 |     # EXAMPLE:
23 |     # BATCH_SIZE = 3, EMBEDDING_LENGTH = 512
24 |     # _____________________________________________________
25 |     # ANCHOR 1 (512,)
26 |     # ANCHOR 2 (512,)
27 |     # ANCHOR 3 (512,)
28 |     # POS EX 1 (512,)
29 |     # POS EX 2 (512,)
30 |     # POS EX 3 (512,)
31 |     # NEG EX 1 (512,)
32 |     # NEG EX 2 (512,)
33 |     # NEG EX 3 (512,)
34 |     # _____________________________________________________
35 | 
36 |     #elements = int(y_pred.shape.as_list()[0] / 3)
37 |     elements = config.BATCH_SIZE
38 | 
39 |     anchor = y_pred[0:elements]
40 |     positive_ex = y_pred[elements:2 * elements]
41 |     negative_ex = y_pred[2 * elements:]
42 | 
43 |     sap = batch_cosine_similarity(anchor, positive_ex)
44 |     san = batch_cosine_similarity(anchor, negative_ex)
45 |     loss = K.maximum(san - sap + config.ALPHA, 0.0)
46 |     total_loss = K.sum(loss)
47 |     return total_loss
48 | 
--------------------------------------------------------------------------------
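To make the loss concrete, here is a minimal NumPy sketch (an editorial addition, not a file from the repository) of what deep_speaker_loss computes. It uses a toy batch of 2 triplets with 4-dimensional embeddings instead of the real BATCH_SIZE=32 and EMBEDDING_LENGTH=512, and the embedding values are made up purely for illustration:

import numpy as np

ALPHA = 0.2     # Margin, mirroring config.ALPHA.
BATCH_SIZE = 2  # Toy value; the real config uses 32.

def l2_normalize(m):
    """Normalize each row to unit length, as the model's 'ln' layer does."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Fabricated embeddings: rows 0-1 are anchors, 2-3 positives, 4-5 negatives,
# matching the ordering convention produced by minibatch.create_batch().
y_pred = l2_normalize(np.array([
    [1.0, 0.1, 0.0, 0.0],   # anchor 1
    [0.0, 1.0, 0.1, 0.0],   # anchor 2
    [0.9, 0.2, 0.0, 0.0],   # positive 1 (same speaker as anchor 1)
    [0.1, 0.9, 0.0, 0.0],   # positive 2 (same speaker as anchor 2)
    [0.0, 0.0, 1.0, 0.2],   # negative 1 (a different speaker)
    [0.2, 0.0, 0.0, 1.0],   # negative 2 (a different speaker)
]))

anchor = y_pred[0:BATCH_SIZE]
positive = y_pred[BATCH_SIZE:2 * BATCH_SIZE]
negative = y_pred[2 * BATCH_SIZE:]

# Rows are unit length, so the row-wise dot product is the cosine similarity.
sap = np.sum(anchor * positive, axis=1)  # anchor/positive similarity
san = np.sum(anchor * negative, axis=1)  # anchor/negative similarity
loss = np.maximum(san - sap + ALPHA, 0.0)
print('per-triplet loss:', loss, 'total loss:', loss.sum())

A triplet contributes zero loss once its anchor is already at least ALPHA more similar to its positive than to its negative; that is the hinge that K.maximum implements in deep_speaker_loss above.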