├── slides
│   ├── ml-4-audio-session1.pdf
│   ├── ml-4-audio-session2.pdf
│   ├── ml-4-audio-session3.pdf
│   ├── ml-4-audio-paper-reading-2-hubert.pdf
│   ├── ml-4-audio-paper-reading-1-wav2vec2.pdf
│   ├── ml-4-audio-paper-reading-3-data2vec.pdf
│   └── ml-4-audio-paper-reading-4-speecht5.pdf
├── README.md
└── notebooks
    ├── session2
    │   ├── speech_recognition.ipynb
    │   └── audio_data.ipynb
    └── session3
        └── text_to_speech.ipynb
/slides/ml-4-audio-session1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavs10/ml-with-audio/HEAD/slides/ml-4-audio-session1.pdf
--------------------------------------------------------------------------------
/slides/ml-4-audio-session2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavs10/ml-with-audio/HEAD/slides/ml-4-audio-session2.pdf
--------------------------------------------------------------------------------
/slides/ml-4-audio-session3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavs10/ml-with-audio/HEAD/slides/ml-4-audio-session3.pdf
--------------------------------------------------------------------------------
/slides/ml-4-audio-paper-reading-2-hubert.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavs10/ml-with-audio/HEAD/slides/ml-4-audio-paper-reading-2-hubert.pdf
--------------------------------------------------------------------------------
/slides/ml-4-audio-paper-reading-1-wav2vec2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavs10/ml-with-audio/HEAD/slides/ml-4-audio-paper-reading-1-wav2vec2.pdf
--------------------------------------------------------------------------------
/slides/ml-4-audio-paper-reading-3-data2vec.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavs10/ml-with-audio/HEAD/slides/ml-4-audio-paper-reading-3-data2vec.pdf
--------------------------------------------------------------------------------
/slides/ml-4-audio-paper-reading-4-speecht5.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vaibhavs10/ml-with-audio/HEAD/slides/ml-4-audio-paper-reading-4-speecht5.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Hugging Face Machine Learning for Audio Study Group
2 |
3 | Welcome to the ML for Audio Study Group. Through a series of presentations, paper readings, and discussions, we'll explore the field of applying Machine Learning in the audio domain. Some examples of this are:
4 | * Generating synthetic speech from a given text (think of conversational assistants)
5 | * Transcribing audio signals to text.
6 | * Removing noise from an audio signal.
7 | * Separating different sources of audio.
8 | * Identifying which speaker is talking.
9 | * And much more!
10 |
11 | We suggest you join the community Discord at http://hf.co/join/discord, and we look forward to meeting you in the #ml-4-audio-study-group channel 🤗. Remember, this is a community effort, so make this space your own!
12 |
13 | ## Organisation
14 |
15 | We'll kick off with some basics and then collaboratively decide the further direction of the group.
16 |
17 | Before each session:
18 | * Read/watch related resources
19 |
20 | During each session, you can:
21 | * Ask questions in the forum
22 | * Give a short (~10-15 min) presentation on the topic (agreed upon beforehand)
23 |
24 | Before/after:
25 | * Keep discussing/asking questions about the topic (#ml-4-audio-study-group channel on Discord)
26 | * Share interesting resources
27 |
28 | ## Schedule
29 |
30 | | Date | Topics | Resources (To read before) |
31 | |--------------|-----------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
32 | | Dec 14, 2021 | Kickoff + Overview of Audio related usecases ([video](https://www.youtube.com/watch?v=cAviRhkqdnc&ab_channel=HuggingFace), [questions](https://discuss.huggingface.co/t/ml-for-audio-study-group-kick-off-dec-14/12745))| [The 3 DL Frameworks for e2e Speech Recognition that power your devices](https://heartbeat.comet.ml/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices-37b891ddc380) |
33 | | Dec 21, 2021 | Intro to Audio + Automatic Speech Recognition Deep Dive ([video](https://www.youtube.com/watch?v=D-MH6YjuIlE&ab_channel=HuggingFace), [questions](https://discuss.huggingface.co/t/ml-for-audio-study-group-intro-to-audio-and-asr-dec-21/12890)) | [Intro to Audio for FastAI Sections 1 and 2](https://nbviewer.org/github/fastaudio/fastaudio/blob/master/docs/Introduction%20to%20Audio.ipynb) [Speech and Language Processing 26.1-26.5](https://web.stanford.edu/~jurafsky/slp3/) |
34 | | Jan 4, 2022 | Text to Speech Deep Dive ([video](https://www.youtube.com/watch?v=aLBedWj-5CQ&ab_channel=HuggingFace), [questions](https://discuss.huggingface.co/t/ml-for-audio-study-group-text-to-speech-deep-dive-jan-4/13315)) | [Intro to Audio & ASR Notebooks](https://github.com/Vaibhavs10/ml-with-audio/tree/master/notebooks/session2) [Speech and Language Processing 26.6](https://web.stanford.edu/~jurafsky/slp3/) |
35 | | Jan 18, 2022 | pyctcdecode: A simple & fast STT prediction decoding algorithm ([demo](https://github.com/rhgrossman/pyctcdecode_demo), [slides](https://docs.google.com/presentation/d/1pjp8kTGChsr58D7Z2eVo9S7CsppMXNgZOApJo-rJ1As/edit#slide=id.g10e9c4afc9e_0_984), [questions](https://discuss.huggingface.co/t/ml-for-audio-study-group-pyctcdecode-jan-18/13561)) | [Beam search CTC decoding](https://towardsdatascience.com/beam-search-decoding-in-ctc-trained-neural-networks-5a889a3d85a7) [pyctcdecode](https://blog.kensho.com/pyctcdecode-a-new-beam-search-decoder-for-ctc-speech-recognition-2be3863afa96) |
36 |
37 | ## Supplementary Resources
38 |
39 | In case you want to solidify a concept, or just want to go further down the speech-processing rabbit hole.
40 | ### General Resources
41 | * Slides from LSA352: [Slides](https://nlp.stanford.edu/courses/lsa352/) (no videos available)
42 | * Slides from CS224S (Latest): [Slides](http://web.stanford.edu/class/cs224s/syllabus/) (no videos available)
43 | * Speech & Language Processing Book (Chapters 25 & 26) - [E-book](https://web.stanford.edu/~jurafsky/slp3/)
44 |
45 | ### Research Papers
46 | * Speech Recognition Papers: [Github repo](https://github.com/wenet-e2e/speech-recognition-papers)
47 | * Speech Synthesis Papers: [Github repo](https://github.com/xcmyz/speech-synthesis-paper)
48 |
49 | ### Toolkits
50 | * SpeechBrain - [Github repo](https://github.com/speechbrain/speechbrain)
51 | * Toucan - [Github repo](https://github.com/DigitalPhonetics/IMS-Toucan)
52 | * ESPnet - [Github repo](https://github.com/espnet/espnet)
53 |
54 | ## Demos
55 | * Add interesting effects to your audio files - [Hugging Face Spaces](https://huggingface.co/spaces/akhaliq/steerable-nafx)
56 | * Generate speech from text (TTS) - [Hugging Face Spaces](https://huggingface.co/spaces/akhaliq/coqui-ai-tts)
57 | * Generate text from speech (ASR) - [Hugging Face Spaces](https://huggingface.co/spaces/facebook/XLS-R-2B-22-16)
58 |
--------------------------------------------------------------------------------
/notebooks/session2/speech_recognition.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Speech Recognition [Tutorial].ipynb",
7 | "provenance": [],
8 | "collapsed_sections": []
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "name": "python"
16 | }
17 | },
18 | "cells": [
19 | {
20 | "cell_type": "markdown",
21 | "source": [
22 | "# SpeechBrain + HuggingFace for Speech Recognition tasks\n",
23 | "\n",
24 | "compiled by: [Vaibhav Srivastav](https://twitter.com/reach_vb)\n",
25 | "\n",
26 | "for pre-reads + further materials, head over to the [ml-with-audio repo](https://github.com/Vaibhavs10/ml-with-audio)"
27 | ],
28 | "metadata": {
29 | "id": "quYLA2D9Os79"
30 | }
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "source": [
35 | "Some important speech processing tasks:\n",
36 | "- **Speech Recognition**: Speech-to-text ([see this tutorial](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing))\n",
37 | "- **Speaker Recognition**: Speaker verification/ID ([see this tutorial](https://colab.research.google.com/drive/1UwisnAjr8nQF3UnrkIJ4abBMAWzVwBMh?usp=sharing)).\n",
38 | "- **Speaker Diarization**: Detect who spoke when.\n",
39 | "- **Speech Enhancement**: Noisy to clean speech ([see this tutorial](https://colab.research.google.com/drive/18RyiuKupAhwWX7fh3LCatwQGU5eIS3TR?usp=sharing)).\n",
40 | "- **Speech Separation**: Separate overlapped speech ([see this tutorial](https://colab.research.google.com/drive/1YxsMW1KNqP1YihNUcfrjy0zUp7FhNNhN?usp=sharing)). \n",
41 | "- **Spoken Language Understanding**: Speech to intent/slots. \n",
42 | "- **Multi-microphone processing**: Combining input signals ([see this tutorial](https://colab.research.google.com/drive/1UVoYDUiIrwMpBTghQPbA6rC1mc9IBzi6?usp=sharing))."
43 | ],
44 | "metadata": {
45 | "id": "yWARjsRhOVU_"
46 | }
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {
52 | "id": "1xKGzpdbOE_D"
53 | },
54 | "outputs": [],
55 | "source": [
56 | "%%capture\n",
57 | "!pip install speechbrain\n",
58 | "!pip install transformers"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "source": [
64 | "import speechbrain as sb\n",
65 | "from speechbrain.dataio.dataio import read_audio\n",
66 | "from IPython.display import Audio"
67 | ],
68 | "metadata": {
69 | "id": "XFUnd8qMO5Ky"
70 | },
71 | "execution_count": null,
72 | "outputs": []
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "source": [
77 | "## Let's use a pre-trained model from the HF Hub and transcribe some audio"
78 | ],
79 | "metadata": {
80 | "id": "AbNqC-f6PeQm"
81 | }
82 | },
83 | {
84 | "cell_type": "code",
85 | "source": [
86 | "from speechbrain.pretrained import EncoderDecoderASR\n",
87 | "\n",
88 | "asr_model = EncoderDecoderASR.from_hparams(source=\"speechbrain/asr-crdnn-rnnlm-librispeech\", savedir=\"pretrained_models/asr-crdnn-rnnlm-librispeech\")\n",
89 | "asr_model.transcribe_file('speechbrain/asr-crdnn-rnnlm-librispeech/example.wav')"
90 | ],
91 | "metadata": {
92 | "id": "8rdpr9hVPCTf"
93 | },
94 | "execution_count": null,
95 | "outputs": []
96 | },
97 | {
98 | "cell_type": "code",
99 | "source": [
100 | "signal = read_audio(\"example.wav\").squeeze()\n",
101 | "Audio(signal, rate=16000)"
102 | ],
103 | "metadata": {
104 | "id": "E_CuLNbBPKSM"
105 | },
106 | "execution_count": null,
107 | "outputs": []
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "source": [
112 | "## Your turn: find a model from [HF Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) and transcribe the wav file\n",
113 | "\n",
114 | "Try both types of pretrained ASR models:\n",
115 | "\n",
116 | "1. EncoderDecoderASR\n",
117 | "2. EncoderASR"
118 | ],
119 | "metadata": {
120 | "id": "NKd0iKqMP_ib"
121 | }
122 | },
123 | {
124 | "cell_type": "code",
125 | "source": [
126 | "from speechbrain.pretrained import EncoderDecoderASR\n",
127 | "\n",
128 | "asr_model = EncoderDecoderASR.from_hparams(source=\"\", savedir=\"pretrained_models/\")\n",
129 | "asr_model.transcribe_file('speechbrain/asr-crdnn-rnnlm-librispeech/example.wav')"
130 | ],
131 | "metadata": {
132 | "id": "OQ25aB_nPW42"
133 | },
134 | "execution_count": null,
135 | "outputs": []
136 | },
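{
"cell_type": "markdown",
"source": [
"One possible solution, as a sketch: the checkpoint below (`speechbrain/asr-wav2vec2-commonvoice-en`) is just one assumed choice of `EncoderASR`-compatible model; any other `EncoderASR` model from the Hub should work the same way."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch of a possible answer; the model name is one assumed choice, not the only one.\n",
"from speechbrain.pretrained import EncoderASR\n",
"\n",
"encoder_asr_model = EncoderASR.from_hparams(source=\"speechbrain/asr-wav2vec2-commonvoice-en\", savedir=\"pretrained_models/asr-wav2vec2-commonvoice-en\")\n",
"encoder_asr_model.transcribe_file('speechbrain/asr-crdnn-rnnlm-librispeech/example.wav')"
],
"metadata": {},
"execution_count": null,
"outputs": []
},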
137 | {
138 | "cell_type": "markdown",
139 | "source": [
140 | "### Let's take it up a notch: if we are given a sound file with multiple speakers, how do we separate their individual voices?"
141 | ],
142 | "metadata": {
143 | "id": "ub9UrXjDRugx"
144 | }
145 | },
146 | {
147 | "cell_type": "code",
148 | "source": [
149 | "from speechbrain.pretrained import SepformerSeparation as separator\n",
150 | "\n",
151 | "model = separator.from_hparams(source=\"speechbrain/sepformer-wsj02mix\", savedir='pretrained_models/sepformer-wsj02mix')\n",
152 | "est_sources = model.separate_file(path='speechbrain/sepformer-wsj02mix/test_mixture.wav') "
153 | ],
154 | "metadata": {
155 | "id": "Idbe73EFRr0z"
156 | },
157 | "execution_count": null,
158 | "outputs": []
159 | },
160 | {
161 | "cell_type": "code",
162 | "source": [
163 | "signal = read_audio(\"test_mixture.wav\").squeeze()\n",
164 | "Audio(signal, rate=8000)"
165 | ],
166 | "metadata": {
167 | "id": "AoUd9YaaSkd8"
168 | },
169 | "execution_count": null,
170 | "outputs": []
171 | },
172 | {
173 | "cell_type": "code",
174 | "source": [
175 | "Audio(est_sources[:, :, 0].detach().cpu().squeeze(), rate=8000)"
176 | ],
177 | "metadata": {
178 | "id": "fyV2TzwXSopl"
179 | },
180 | "execution_count": null,
181 | "outputs": []
182 | },
183 | {
184 | "cell_type": "code",
185 | "source": [
186 | "Audio(est_sources[:, :, 1].detach().cpu().squeeze(), rate=8000)"
187 | ],
188 | "metadata": {
189 | "id": "nTKwAilUSvNZ"
190 | },
191 | "execution_count": null,
192 | "outputs": []
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "source": [
197 | "## Your turn: find a model from [HF Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) and separate the sources\n",
198 | "\n",
199 | "Look for Sepformer :)"
200 | ],
201 | "metadata": {
202 | "id": "Slfem2H9S91E"
203 | }
204 | },
205 | {
206 | "cell_type": "code",
207 | "source": [
208 | "from speechbrain.pretrained import SepformerSeparation as separator\n",
209 | "\n",
210 | "model = separator.from_hparams(source=\"\", savedir='pretrained_models/')\n",
211 | "est_sources = model.separate_file(path='speechbrain/sepformer-wsj02mix/test_mixture.wav') "
212 | ],
213 | "metadata": {
214 | "id": "hyESMfysSzYa"
215 | },
216 | "execution_count": null,
217 | "outputs": []
218 | },
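{
"cell_type": "markdown",
"source": [
"One possible solution, as a sketch: `speechbrain/sepformer-wham` is just one assumed choice of Sepformer checkpoint; other Sepformer models on the Hub follow the same interface."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch of a possible answer; sepformer-wham is one assumed choice of separation model.\n",
"from speechbrain.pretrained import SepformerSeparation as separator\n",
"\n",
"wham_model = separator.from_hparams(source=\"speechbrain/sepformer-wham\", savedir='pretrained_models/sepformer-wham')\n",
"est_sources = wham_model.separate_file(path='speechbrain/sepformer-wsj02mix/test_mixture.wav')"
],
"metadata": {},
"execution_count": null,
"outputs": []
},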
219 | {
220 | "cell_type": "markdown",
221 | "source": [
222 | "## Alright, so far so good. Let's now see if we can verify whether two audio files are from the same speaker"
223 | ],
224 | "metadata": {
225 | "id": "o_AwHq33URv3"
226 | }
227 | },
228 | {
229 | "cell_type": "code",
230 | "source": [
231 | "from speechbrain.pretrained import SpeakerRecognition\n",
232 | "verification = SpeakerRecognition.from_hparams(source=\"speechbrain/spkrec-ecapa-voxceleb\", savedir=\"pretrained_models/spkrec-ecapa-voxceleb\")\n",
233 | "score, prediction = verification.verify_files(\"speechbrain/spkrec-ecapa-voxceleb/example1.wav\", \"speechbrain/spkrec-ecapa-voxceleb/example2.flac\")\n",
234 | "\n",
235 | "print(prediction, score)"
236 | ],
237 | "metadata": {
238 | "id": "G46db0fgTofZ"
239 | },
240 | "execution_count": null,
241 | "outputs": []
242 | },
243 | {
244 | "cell_type": "code",
245 | "source": [
246 | "signal = read_audio(\"example1.wav\").squeeze()\n",
247 | "Audio(signal, rate=16000)"
248 | ],
249 | "metadata": {
250 | "id": "UpGb-xI7VS48"
251 | },
252 | "execution_count": null,
253 | "outputs": []
254 | },
255 | {
256 | "cell_type": "code",
257 | "source": [
258 | "signal = read_audio(\"example2.flac\").squeeze()\n",
259 | "Audio(signal, rate=16000)"
260 | ],
261 | "metadata": {
262 | "id": "YyRRqNkkVXD2"
263 | },
264 | "execution_count": null,
265 | "outputs": []
266 | },
267 | {
268 | "cell_type": "markdown",
269 | "source": [
270 | "If you want to have more fun with pre-trained models and out-of-the-box tasks, head over to the [SpeechBrain documentation](https://speechbrain.readthedocs.io/en/latest/API/speechbrain.pretrained.interfaces.html)\n",
271 | "\n",
272 | "Some suggestions:\n",
273 | "\n",
274 | "- [Speech Enhancement](https://huggingface.co/speechbrain/metricgan-plus-voicebank)\n",
275 | "- [Command Recognition](https://huggingface.co/speechbrain/google_speech_command_xvector)\n",
276 | "- [Spoken Language Understanding](https://huggingface.co/speechbrain/slu-timers-and-such-direct-librispeech-asr)\n",
277 | "- [Urban Sound Classification](https://huggingface.co/speechbrain/urbansound8k_ecapa)\n",
278 | "\n",
279 | "Send us your experiments on Twitter or Discord ;)"
280 | ],
281 | "metadata": {
282 | "id": "zmsfDpRMVkuK"
283 | }
284 | },
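{
"cell_type": "markdown",
"source": [
"As a taste of the first suggestion, here is a rough sketch of speech enhancement with MetricGAN+, based on the `speechbrain/metricgan-plus-voicebank` model card; double-check the card for the exact interface."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch based on the metricgan-plus-voicebank model card; verify the API against the card/docs.\n",
"import torch\n",
"import torchaudio\n",
"from speechbrain.pretrained import SpectralMaskEnhancement\n",
"\n",
"enhance_model = SpectralMaskEnhancement.from_hparams(source=\"speechbrain/metricgan-plus-voicebank\", savedir=\"pretrained_models/metricgan-plus-voicebank\")\n",
"\n",
"# Load a noisy file, add a batch dimension, enhance it, and save the result\n",
"noisy = enhance_model.load_audio(\"speechbrain/metricgan-plus-voicebank/example.wav\").unsqueeze(0)\n",
"enhanced = enhance_model.enhance_batch(noisy, lengths=torch.tensor([1.]))\n",
"torchaudio.save(\"enhanced.wav\", enhanced.cpu(), 16000)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},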
285 | {
286 | "cell_type": "markdown",
287 | "source": [
288 | "## Let's train an ASR model on some sample files!"
289 | ],
290 | "metadata": {
291 | "id": "1QUmFY_cWh5V"
292 | }
293 | },
294 | {
295 | "cell_type": "code",
296 | "source": [
297 | "%%capture\n",
298 | "!git clone https://github.com/speechbrain/speechbrain.git"
299 | ],
300 | "metadata": {
301 | "id": "Op0aX_09VZTQ"
302 | },
303 | "execution_count": null,
304 | "outputs": []
305 | },
306 | {
307 | "cell_type": "code",
308 | "source": [
309 | "%cd speechbrain/tests/integration/neural_networks/ASR_CTC/\n",
310 | "!python example_asr_ctc_experiment.py hyperparams.yaml "
311 | ],
312 | "metadata": {
313 | "id": "W14hnYWKX002"
314 | },
315 | "execution_count": null,
316 | "outputs": []
317 | },
318 | {
319 | "cell_type": "code",
320 | "source": [
321 | "%cd speechbrain/tests/integration/neural_networks/ASR_CTC/\n",
322 | "!cat example_asr_ctc_experiment.py"
323 | ],
324 | "metadata": {
325 | "id": "S6XAoSFPYMbE"
326 | },
327 | "execution_count": null,
328 | "outputs": []
329 | },
330 | {
331 | "cell_type": "code",
332 | "source": [
333 | "%cd speechbrain/tests/integration/neural_networks/ASR_CTC/\n",
334 | "!cat hyperparams.yaml"
335 | ],
336 | "metadata": {
337 | "id": "vVL9_Mg3boKF"
338 | },
339 | "execution_count": null,
340 | "outputs": []
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "source": [
345 | "## Your turn: take the sample data and train a Seq2Seq model next.\n",
346 | "\n",
347 | "Hint: Look at the [integrations folder](https://github.com/speechbrain/speechbrain/tree/develop/tests/integration/neural_networks) ;)"
348 | ],
349 | "metadata": {
350 | "id": "KMrC3LBBcHWt"
351 | }
352 | }
353 | ]
354 | }
--------------------------------------------------------------------------------
/notebooks/session2/audio_data.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "audio_data.ipynb",
7 | "private_outputs": true,
8 | "provenance": [],
9 | "collapsed_sections": [],
10 | "authorship_tag": "ABX9TyN2hvLoALDmLeRkMcbNI929",
11 | "include_colab_link": true
12 | },
13 | "kernelspec": {
14 | "name": "python3",
15 | "display_name": "Python 3"
16 | },
17 | "language_info": {
18 | "name": "python"
19 | }
20 | },
21 | "cells": [
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {
25 | "id": "view-in-github",
26 | "colab_type": "text"
27 | },
28 | "source": [
29 | " "
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "source": [
35 | "# Intro to Audio data\n",
36 | "\n",
37 | "Hello! This quick notebook will show you how to:\n",
38 | "* load audio data\n",
39 | "* plot an audio's waveform\n",
40 | "* create a spectrogram\n",
41 | "* do quick automatic speech recognition\n",
42 | "\n",
43 | "This notebook should take about 10 minutes to run."
44 | ],
45 | "metadata": {
46 | "id": "66IztFTLWfRY"
47 | }
48 | },
49 | {
50 | "cell_type": "code",
51 | "source": [
52 | "!pip install transformers"
53 | ],
54 | "metadata": {
55 | "id": "57rV9LgqdojT"
56 | },
57 | "execution_count": null,
58 | "outputs": []
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": null,
63 | "metadata": {
64 | "id": "GP-iDjnJbPxM"
65 | },
66 | "outputs": [],
67 | "source": [
68 | "import librosa\n",
69 | "import librosa.display\n",
70 | "import numpy as np\n",
71 | "import matplotlib.pyplot as plt\n",
72 | "\n",
73 | "from IPython.display import Audio"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "source": [
79 | "!wget https://cdn-media.huggingface.co/speech_samples/LibriSpeech_61-70968-0000.flac"
80 | ],
81 | "metadata": {
82 | "id": "hptqnOvvbsZ6"
83 | },
84 | "execution_count": null,
85 | "outputs": []
86 | },
87 | {
88 | "cell_type": "code",
89 | "source": [
90 | "sample = \"/content/LibriSpeech_61-70968-0000.flac\""
91 | ],
92 | "metadata": {
93 | "id": "HGTZy6EsbV2A"
94 | },
95 | "execution_count": null,
96 | "outputs": []
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "source": [
101 | "The following cell uses `Audio` from `IPython.display` to play an audio file\n",
102 | "directly in the notebook."
103 | ],
104 | "metadata": {
105 | "id": "uQsXGOBlXIo4"
106 | }
107 | },
108 | {
109 | "cell_type": "code",
110 | "source": [
111 | "Audio(sample)"
112 | ],
113 | "metadata": {
114 | "id": "VIm6XLRTcE5K"
115 | },
116 | "execution_count": null,
117 | "outputs": []
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "source": [
122 | "[Librosa](https://librosa.org/doc/latest/index.html) is a very common Python\n",
123 | "library for audio analysis. It lets you easily load audio files, create\n",
124 | "spectrograms, add effects, extract features and much more! Let's plot the\n",
125 | "waveform of our audio. A quick question to reflect on:\n",
126 | "\n",
127 | "* When is the quietest moment?"
128 | ],
129 | "metadata": {
130 | "id": "Kda8q6KUWw74"
131 | }
132 | },
133 | {
134 | "cell_type": "code",
135 | "source": [
136 | "y, sr = librosa.load(sample)\n",
137 | "\n",
138 | "plt.plot(y);\n",
139 | "plt.title('Signal');\n",
140 | "plt.xlabel('Time (samples)');\n",
141 | "plt.ylabel('Amplitude');"
142 | ],
143 | "metadata": {
144 | "id": "ZNH3DZYPbwi8"
145 | },
146 | "execution_count": null,
147 | "outputs": []
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "source": [
152 | "Librosa also has a `waveplot` method for the same thing :)"
153 | ],
154 | "metadata": {
155 | "id": "popkL26jdzMO"
156 | }
157 | },
158 | {
159 | "cell_type": "code",
160 | "source": [
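"# Note: newer librosa versions replace waveplot with librosa.display.waveshow\n",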
161 | "librosa.display.waveplot(y, sr=sr);"
162 | ],
163 | "metadata": {
164 | "id": "LpLXq3sgdw7h"
165 | },
166 | "execution_count": null,
167 | "outputs": []
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "source": [
172 | "When we load a sample with librosa, we get the sample rate back as well."
173 | ],
174 | "metadata": {
175 | "id": "CV6Uu3SfX-mZ"
176 | }
177 | },
178 | {
179 | "cell_type": "code",
180 | "source": [
181 | "sr"
182 | ],
183 | "metadata": {
184 | "id": "L7yDaOeeb0Kn"
185 | },
186 | "execution_count": null,
187 | "outputs": []
188 | },
189 | {
190 | "cell_type": "markdown",
191 | "source": [
192 | "A sample rate of 22,050 means we get 22,050 samples per second of audio.\n",
193 | "Let's see how many samples we have in total."
194 | ],
195 | "metadata": {
196 | "id": "7ETQ8A7OYCYC"
197 | }
198 | },
199 | {
200 | "cell_type": "code",
201 | "source": [
202 | "len(y)"
203 | ],
204 | "metadata": {
205 | "id": "BBu2ADM-b51y"
206 | },
207 | "execution_count": null,
208 | "outputs": []
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "source": [
213 | "Alright! So if we have 108,156 samples and divide that by the\n",
214 | "sample rate, we should get the length of the audio in seconds. Let's see if\n",
215 | "that confirms our intuition."
216 | ],
217 | "metadata": {
218 | "id": "sX2yI-COYIph"
219 | }
220 | },
221 | {
222 | "cell_type": "code",
223 | "source": [
224 | "len(y)/sr"
225 | ],
226 | "metadata": {
227 | "id": "MvZZLZFGb8z9"
228 | },
229 | "execution_count": null,
230 | "outputs": []
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "source": [
235 | "Nice! It does!"
236 | ],
237 | "metadata": {
238 | "id": "8cZ87nZKYTTU"
239 | }
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "source": [
244 | "Alright, on to the next thing. What happens if we play the audio back at a different\n",
245 | "sample rate? Let's listen."
246 | ],
247 | "metadata": {
248 | "id": "-qdA7hJuYUmo"
249 | }
250 | },
251 | {
252 | "cell_type": "code",
253 | "source": [
254 | "Audio(y, rate=sr*1.5)"
255 | ],
256 | "metadata": {
257 | "id": "Mhsdl8KZb-Y3"
258 | },
259 | "execution_count": null,
260 | "outputs": []
261 | },
262 | {
263 | "cell_type": "code",
264 | "source": [
265 | "Audio(y, rate=sr*0.75)"
266 | ],
267 | "metadata": {
268 | "id": "Iu_7Llt1chYq"
269 | },
270 | "execution_count": null,
271 | "outputs": []
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "source": [
276 | "The voice is completely distorted now! Interesting."
277 | ],
278 | "metadata": {
279 | "id": "RJbkHZ9CZZ6c"
280 | }
281 | },
282 | {
283 | "cell_type": "markdown",
284 | "source": [
285 | "## Spectrograms\n",
286 | "\n",
287 | "Cool, it's now time to build a spectrogram. We'll be using the Short-Time Fourier\n",
288 | "Transform (STFT), which means applying a series of Fourier Transforms (FT), since our frequencies change over time. Just as a\n",
289 | "reminder, the FT is useful for decomposing a signal into its frequencies. The STFT is useful for a signal\n",
290 | "that changes over time: it divides a long signal into shorter segments of equal length and applies a Fourier transform to each segment.\n",
291 | "\n"
292 | ],
293 | "metadata": {
294 | "id": "T4Z7-TtWZdYo"
295 | }
296 | },
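{
"cell_type": "markdown",
"source": [
"Before plotting one, a quick optional look at what the STFT actually returns, using librosa's default frame settings (`n_fft=2048`, `hop_length=512`): each column of the result is the Fourier transform of one short segment of the signal."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Each frame covers n_fft = 2048 samples (~93 ms at 22,050 Hz), and\n",
"# consecutive frames start hop_length = 512 samples apart.\n",
"D = librosa.stft(y, n_fft=2048, hop_length=512)\n",
"\n",
"# Shape: (1 + n_fft/2 frequency bins, number of frames)\n",
"D.shape"
],
"metadata": {},
"execution_count": null,
"outputs": []
},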
297 | {
298 | "cell_type": "markdown",
299 | "source": [
300 | "Let's then build a spectrogram! Note that there are different types of\n",
301 | "spectrograms and many variables you can play with. Let's compute the Short-Time Fourier Transform using `librosa.stft` ([docs](http://librosa.org/doc/main/generated/librosa.stft.html)) and see what we get out of it.\n"
302 | ],
303 | "metadata": {
304 | "id": "0CiR5SGFd_9J"
305 | }
306 | },
307 | {
308 | "cell_type": "code",
309 | "source": [
310 | "spec = np.abs(librosa.stft(y))\n",
311 | "librosa.display.specshow(spec, sr=sr, x_axis='time')\n",
312 | "plt.colorbar(format='%+2.0f amplitude')\n",
313 | "plt.title('Almost Spectrogram')"
314 | ],
315 | "metadata": {
316 | "id": "P6dnulFscz0f"
317 | },
318 | "execution_count": null,
319 | "outputs": []
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "source": [
324 | "Well, I cannot see anything here. What is going on? The sounds we (humans) hear are concentrated in very small frequency and amplitude ranges, so plotting the raw amplitudes is not very informative.\n",
325 | "\n",
326 | ""
327 | ],
328 | "metadata": {
329 | "id": "wT14YDVwem13"
330 | }
331 | },
332 | {
333 | "cell_type": "markdown",
334 | "source": [
335 | "What we can do is put the y axis on a log scale and convert the amplitude to decibels. Working on a log scale gives us a much more informative plot."
336 | ],
337 | "metadata": {
338 | "id": "mIsX9auifGDr"
339 | }
340 | },
341 | {
342 | "cell_type": "code",
343 | "source": [
344 | "dec_spec = librosa.amplitude_to_db(spec, ref=np.max)\n",
345 | "librosa.display.specshow(dec_spec, sr=sr, x_axis='time', y_axis='log')\n",
346 | "plt.colorbar(format='%+2.0f dB')\n",
347 | "plt.title('Spectrogram')"
348 | ],
349 | "metadata": {
350 | "id": "lNkwCdJTeWM-"
351 | },
352 | "execution_count": null,
353 | "outputs": []
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "source": [
358 | "Librosa allows you to create a mel-spectrogram in two ways using the `librosa.feature.melspectrogram` method:\n",
359 | "* By providing the raw data, as we did before (you set the `y` param).\n",
360 | "* By providing a pre-computed power spectrogram (you set the `S` param)."
361 | ],
362 | "metadata": {
363 | "id": "7KD8n9LccWOI"
364 | }
365 | },
366 | {
367 | "cell_type": "code",
368 | "source": [
369 | "sg = librosa.feature.melspectrogram(y, sr=sr)\n",
370 | "db_spec = librosa.power_to_db(sg, ref=np.max)\n",
371 | "librosa.display.specshow(db_spec, x_axis='time', y_axis='mel', fmax=8000)\n",
372 | "plt.colorbar(format='%+2.0f dB')"
373 | ],
374 | "metadata": {
375 | "id": "C3-3zK5HhJ7J"
376 | },
377 | "execution_count": null,
378 | "outputs": []
379 | },
380 | {
381 | "cell_type": "code",
382 | "source": [
383 | "sg = librosa.feature.melspectrogram(S=spec, sr=sr)\n",
384 | "db_spec = librosa.amplitude_to_db(sg, ref=1.0, amin=1e-05, top_db=80.0)\n",
385 | "librosa.display.specshow(db_spec, x_axis='time', y_axis='mel', fmax=8000)\n",
386 | "plt.colorbar(format='%+2.0f dB')"
387 | ],
388 | "metadata": {
389 | "id": "ptw3Qlpic2mZ"
390 | },
391 | "execution_count": null,
392 | "outputs": []
393 | },
394 | {
395 | "cell_type": "markdown",
396 | "source": [
397 | "# Automatic Speech Recognition demo\n",
398 | "\n",
399 | "Let's very quickly show how to run ASR on our audio using the `pipeline` API from the `transformers` library. Loading the model the first time can be a bit slow, but it's much faster afterwards."
400 | ],
401 | "metadata": {
402 | "id": "vAtRMqSJhxpb"
403 | }
404 | },
405 | {
406 | "cell_type": "code",
407 | "source": [
408 | "from transformers import pipeline"
409 | ],
410 | "metadata": {
411 | "id": "KCtDBJ6mdas3"
412 | },
413 | "execution_count": null,
414 | "outputs": []
415 | },
416 | {
417 | "cell_type": "code",
418 | "source": [
419 | "pipe = pipeline(\"automatic-speech-recognition\")"
420 | ],
421 | "metadata": {
422 | "id": "s8Y4w3mrdbyz"
423 | },
424 | "execution_count": null,
425 | "outputs": []
426 | },
427 | {
428 | "cell_type": "code",
429 | "source": [
430 | "pipe(y)"
431 | ],
432 | "metadata": {
433 | "id": "q_52d2ihdkVI"
434 | },
435 | "execution_count": null,
436 | "outputs": []
437 | },
438 | {
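{
"cell_type": "markdown",
"source": [
"One caveat: `y` was loaded by librosa at 22,050 Hz, but a bare NumPy array carries no sample-rate information, and the pipeline's default English model expects 16 kHz audio. Depending on your `transformers` version, you can instead pass the file path, or the array together with its true sampling rate, so the pipeline can resample for you. A hedged sketch:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Alternative ways to call the pipeline; exact input formats depend on your transformers version.\n",
"pipe(sample)  # let the pipeline load (and resample) the file itself\n",
"\n",
"# or pass the raw array together with its true sampling rate\n",
"pipe({\"raw\": y, \"sampling_rate\": sr})"
],
"metadata": {},
"execution_count": null,
"outputs": []
},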
439 | "cell_type": "markdown",
440 | "source": [
441 | "Some useful resources worth checking if you want to reinforce this content:\n",
442 | "* https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53\n",
443 | "* https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0\n",
444 | "* https://ch.mathworks.com/matlabcentral/answers/387458-why-my-spectrogram-have-negative-values"
445 | ],
446 | "metadata": {
447 | "id": "Qo_AMDXciIyd"
448 | }
449 | }
450 | ]
451 | }
--------------------------------------------------------------------------------
/notebooks/session3/text_to_speech.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Text To Speech [Tutorial].ipynb",
7 | "provenance": [],
8 | "collapsed_sections": []
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "name": "python"
16 | },
17 | "accelerator": "GPU"
18 | },
19 | "cells": [
20 | {
21 | "cell_type": "markdown",
22 | "source": [
23 | "# Text To Speech\n",
24 | "\n",
25 | "compiled by: [Vaibhav Srivastav](https://twitter.com/reach_vb)\n",
26 | "\n",
27 | "for pre-reads + further materials, head over to the [ml-with-audio repo](https://github.com/Vaibhavs10/ml-with-audio)"
28 | ],
29 | "metadata": {
30 | "id": "9phIws3ekvoY"
31 | }
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "metadata": {
37 | "id": "jAtE54VkeYAT"
38 | },
39 | "outputs": [],
40 | "source": [
41 | "%%capture\n",
42 | "!pip install tts"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "source": [
48 | "import torch\n",
49 | "import librosa\n",
50 | "\n",
51 | "import IPython.display as ipd"
52 | ],
53 | "metadata": {
54 | "id": "uXhLJ46dlWv2"
55 | },
56 | "execution_count": 1,
57 | "outputs": []
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "source": [
62 | "## Let's look at all the pre-trained models provided by Coqui-ai TTS"
63 | ],
64 | "metadata": {
65 | "id": "mFoQXekuk7MR"
66 | }
67 | },
68 | {
69 | "cell_type": "code",
70 | "source": [
71 | "!tts --list_models"
72 | ],
73 | "metadata": {
74 | "colab": {
75 | "base_uri": "https://localhost:8080/"
76 | },
77 | "id": "5T5CrBHzgqTX",
78 | "outputId": "2039e06d-cb83-41d0-8387-93edf894884f"
79 | },
80 | "execution_count": null,
81 | "outputs": [
82 | {
83 | "output_type": "stream",
84 | "name": "stdout",
85 | "text": [
86 | " Name format: type/language/dataset/model\n",
87 | " 1: tts_models/multilingual/multi-dataset/your_tts\n",
88 | " 2: tts_models/en/ek1/tacotron2\n",
89 | " 3: tts_models/en/ljspeech/tacotron2-DDC\n",
90 | " 4: tts_models/en/ljspeech/tacotron2-DDC_ph\n",
91 | " 5: tts_models/en/ljspeech/glow-tts\n",
92 | " 6: tts_models/en/ljspeech/speedy-speech\n",
93 | " 7: tts_models/en/ljspeech/tacotron2-DCA\n",
94 | " 8: tts_models/en/ljspeech/vits\n",
95 | " 9: tts_models/en/ljspeech/fast_pitch\n",
96 | " 10: tts_models/en/vctk/sc-glow-tts\n",
97 | " 11: tts_models/en/vctk/vits\n",
98 | " 12: tts_models/en/vctk/fast_pitch\n",
99 | " 13: tts_models/en/sam/tacotron-DDC\n",
100 | " 14: tts_models/es/mai/tacotron2-DDC\n",
101 | " 15: tts_models/fr/mai/tacotron2-DDC\n",
102 | " 16: tts_models/uk/mai/glow-tts\n",
103 | " 17: tts_models/zh-CN/baker/tacotron2-DDC-GST\n",
104 | " 18: tts_models/nl/mai/tacotron2-DDC\n",
105 | " 19: tts_models/de/thorsten/tacotron2-DCA\n",
106 | " 20: tts_models/ja/kokoro/tacotron2-DDC\n",
107 | " 1: vocoder_models/universal/libri-tts/wavegrad\n",
108 | " 2: vocoder_models/universal/libri-tts/fullband-melgan\n",
109 | " 3: vocoder_models/en/ek1/wavegrad\n",
110 | " 4: vocoder_models/en/ljspeech/multiband-melgan\n",
111 | " 5: vocoder_models/en/ljspeech/hifigan_v2\n",
112 | " 6: vocoder_models/en/ljspeech/univnet\n",
113 | " 7: vocoder_models/en/vctk/hifigan_v2\n",
114 | " 8: vocoder_models/en/sam/hifigan_v2\n",
115 | " 9: vocoder_models/nl/mai/parallel-wavegan\n",
116 | " 10: vocoder_models/de/thorsten/wavegrad\n",
117 | " 11: vocoder_models/de/thorsten/fullband-melgan\n",
118 | " 12: vocoder_models/ja/kokoro/hifigan_v1\n",
119 | " 13: vocoder_models/uk/mai/multiband-melgan\n"
120 | ]
121 | }
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "source": [
127 | "## Let's test the Tacotron model we looked at in the presentation"
128 | ],
129 | "metadata": {
130 | "id": "IIWFRuAqlKAn"
131 | }
132 | },
133 | {
134 | "cell_type": "code",
135 | "source": [
136 | "!tts --text \"Today is a good day!\" \\\n",
137 | " --model_name \"tts_models/en/ek1/tacotron2\" \\\n",
138 | " --out_path output.wav"
139 | ],
140 | "metadata": {
141 | "colab": {
142 | "base_uri": "https://localhost:8080/"
143 | },
144 | "id": "wvHQKPRkhizM",
145 | "outputId": "2b23b108-e512-4974-ab57-568c577a5dad"
146 | },
147 | "execution_count": null,
148 | "outputs": [
149 | {
150 | "output_type": "stream",
151 | "name": "stdout",
152 | "text": [
153 | " > Downloading model to /root/.local/share/tts/tts_models--en--ek1--tacotron2\n",
154 | " > Downloading model to /root/.local/share/tts/vocoder_models--en--ek1--wavegrad\n",
155 | " > Using model: Tacotron2\n",
156 | " > Model's reduction rate `r` is set to: 2\n",
157 | " > Vocoder Model: wavegrad\n",
158 | " > Text: Today is a good day!\n",
159 | " > Text splitted to sentences.\n",
160 | "['Today is a good day!']\n",
161 | " > Processing time: 75.69261932373047\n",
162 | " > Real-time factor: 44.94351185071782\n",
163 | " > Saving output to output.wav\n"
164 | ]
165 | }
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "source": [
171 | "Load the audio with librosa and listen to it"
172 | ],
173 | "metadata": {
174 | "id": "ALcwxlRgldSn"
175 | }
176 | },
177 | {
178 | "cell_type": "code",
179 | "source": [
180 | "x, sr = librosa.load('output.wav')\n",
181 | "ipd.Audio(x, rate=sr)"
182 | ],
183 | "metadata": {
184 | "colab": {
185 | "base_uri": "https://localhost:8080/",
186 | "height": 75
187 | },
188 | "id": "QZmdbs0EizwY",
189 | "outputId": "40153196-ef47-4c04-e1b4-2b4fa8047297"
190 | },
191 | "execution_count": null,
192 | "outputs": [
193 | {
194 | "output_type": "execute_result",
195 | "data": {
196 | "text/html": [
197 | "\n",
198 | " \n",
199 | " \n",
200 | " Your browser does not support the audio element.\n",
201 | " \n",
202 | " "
203 | ],
204 | "text/plain": [
205 | ""
206 | ]
207 | },
208 | "metadata": {},
209 | "execution_count": 4
210 | }
211 | ]
212 | },
213 | {
214 | "cell_type": "markdown",
215 | "source": [
216 | "## Your turn: find a model from the list above and synthesize the sentences below\n",
217 | "\n",
218 | "1. It’s no use to ask to use the telephone.\n",
219 | "2. Do you live near a zoo with live animals?\n",
220 | "3. I prefer bass fishing to playing the bass guitar.\n",
221 | "\n",
222 | "Which model performs the best?"
223 | ],
224 | "metadata": {
225 | "id": "2rUYqtXnqUeU"
226 | }
227 | },
228 | {
229 | "cell_type": "code",
230 | "source": [
231 | "!tts --text \"\" \\\n",
232 | " --model_name \"\" \\\n",
233 | " --out_path output.wav"
234 | ],
235 | "metadata": {
236 | "id": "IRyfLoY5jkAP"
237 | },
238 | "execution_count": null,
239 | "outputs": []
240 | },
241 | {
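{
"cell_type": "markdown",
"source": [
"For example, here is one possible pick from the list above (any of the English models should work); reuse the librosa + IPython cell above to listen to the result."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Example pick: GlowTTS trained on LJSpeech (number 5 in the list above)\n",
"!tts --text \"Do you live near a zoo with live animals?\" \\\n",
" --model_name \"tts_models/en/ljspeech/glow-tts\" \\\n",
" --out_path output.wav"
],
"metadata": {},
"execution_count": null,
"outputs": []
},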
242 | "cell_type": "markdown",
243 | "source": [
244 | "## Next up, let's experiment with different available vocoders\n",
245 | "\n",
246 | "Which vocoder sounded the most natural to you?"
247 | ],
248 | "metadata": {
249 | "id": "JMoZHcRSrlRN"
250 | }
251 | },
252 | {
253 | "cell_type": "code",
254 | "source": [
255 | "!tts --text \"\" \\\n",
256 | " --model_name \"\" \\\n",
257 | " --vocoder_name \"\" \\\n",
258 | " --out_path output.wav"
259 | ],
260 | "metadata": {
261 | "id": "EE_cEDQEs80K"
262 | },
263 | "execution_count": null,
264 | "outputs": []
265 | },
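{
"cell_type": "markdown",
"source": [
"For example, pairing the LJSpeech Tacotron2 model with the HiFi-GAN vocoder (both appear in the list above); swap in other vocoders from the list to compare how they sound."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Example pairing: Tacotron2-DDC acoustic model + HiFiGAN v2 vocoder\n",
"!tts --text \"I prefer bass fishing to playing the bass guitar.\" \\\n",
" --model_name \"tts_models/en/ljspeech/tacotron2-DDC\" \\\n",
" --vocoder_name \"vocoder_models/en/ljspeech/hifigan_v2\" \\\n",
" --out_path output.wav"
],
"metadata": {},
"execution_count": null,
"outputs": []
},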
266 | {
267 | "cell_type": "markdown",
268 | "source": [
269 | "## Alright, time to look into the [HF Hub](https://huggingface.co/models?pipeline_tag=text-to-speech&sort=downloads): let's use a pre-trained FastSpeech model from the Hub"
270 | ],
271 | "metadata": {
272 | "id": "W0HC7m1SFqkS"
273 | }
274 | }
275 | ]
276 | }
--------------------------------------------------------------------------------