├── .gitignore
├── README.md
├── audio_file.wav
├── examples.ipynb
├── images
│   ├── cqtchromagram.png
│   ├── cqtkernel.png
│   ├── cqtspectrogram.png
│   ├── dct.png
│   ├── dst.png
│   ├── imdct.png
│   ├── istft.png
│   ├── mdct.png
│   ├── melfilterbank.png
│   ├── melspectrogram.png
│   ├── mfcc.png
│   └── stft.png
└── zaf.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.ipynb_checkpoints
__pycache__
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Zaf-Python

Zafar's Audio Functions in **Python** for audio signal analysis.

Files:
- [`zaf.py`](#zafpy): Python module with the audio functions.
- [`examples.ipynb`](#examplesipynb): Jupyter notebook with some examples.
- [`audio_file.wav`](#audio_filewav): audio file used for the examples.

See also:
- [Zaf-Matlab](https://github.com/zafarrafii/Zaf-Matlab): Zafar's Audio Functions in **Matlab** for audio signal analysis.
- [Zaf-Julia](https://github.com/zafarrafii/Zaf-Julia): Zafar's Audio Functions in **Julia** for audio signal analysis.

## zaf.py

This Python module implements a number of functions for audio signal analysis.

Simply copy the file `zaf.py` into your working directory and you are good to go. Make sure you have Python 3, NumPy, and SciPy installed.

Functions:
- [`stft`](#stft) - Compute the short-time Fourier transform (STFT).
- [`istft`](#istft) - Compute the inverse STFT.
- [`melfilterbank`](#melfilterbank) - Compute the mel filterbank.
- [`melspectrogram`](#melspectrogram) - Compute the mel spectrogram using a mel filterbank.
- [`mfcc`](#mfcc) - Compute the mel-frequency cepstral coefficients (MFCCs) using a mel filterbank.
- [`cqtkernel`](#cqtkernel) - Compute the constant-Q transform (CQT) kernel.
- [`cqtspectrogram`](#cqtspectrogram) - Compute the CQT spectrogram using a CQT kernel.
- [`cqtchromagram`](#cqtchromagram) - Compute the CQT chromagram using a CQT kernel.
- [`dct`](#dct) - Compute the discrete cosine transform (DCT) using the fast Fourier transform (FFT).
- [`dst`](#dst) - Compute the discrete sine transform (DST) using the FFT.
- [`mdct`](#mdct) - Compute the modified discrete cosine transform (MDCT) using the FFT.
- [`imdct`](#imdct) - Compute the inverse MDCT using the FFT.

Other:
- `wavread` - Read a WAVE file (using SciPy).
- `wavwrite` - Write a WAVE file (using SciPy).
- `sigplot` - Plot a signal in seconds.
- `specshow` - Display a spectrogram in dB, seconds, and Hz.
- `melspecshow` - Display a mel spectrogram in dB, seconds, and Hz.
- `mfccshow` - Display MFCCs in seconds.
- `cqtspecshow` - Display a CQT spectrogram in dB, seconds, and Hz.
- `cqtchromshow` - Display a CQT chromagram in seconds.


### stft

Compute the short-time Fourier transform (STFT).

```
audio_stft = zaf.stft(audio_signal, window_function, step_length)

Inputs:
    audio_signal: audio signal (number_samples,)
    window_function: window function (window_length,)
    step_length: step length in samples
Output:
    audio_stft: audio STFT (window_length, number_frames)
```

#### Example: Compute and display the spectrogram from an audio file.
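Before the full example, the output shape can be sanity-checked on a synthetic signal (a minimal sketch; the test-tone values are arbitrary):

```
# Quick shape check on a synthetic 1-second tone (illustrative values)
import numpy as np
import scipy.signal
import zaf

sampling_frequency = 8000
audio_signal = np.sin(2*np.pi*440*np.arange(sampling_frequency)/sampling_frequency)

# 256-sample periodic Hamming window with half-window steps
window_function = scipy.signal.hamming(256, sym=False)
step_length = 128

# One row per FFT bin, roughly one column per step_length samples
audio_stft = zaf.stft(audio_signal, window_function, step_length)
print(np.shape(audio_stft))  # (256, 64) with this module's centering
```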
61 | 62 | ``` 63 | # Import the needed modules 64 | import numpy as np 65 | import scipy.signal 66 | import zaf 67 | import matplotlib.pyplot as plt 68 | 69 | # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels 70 | audio_signal, sampling_frequency = zaf.wavread("audio_file.wav") 71 | audio_signal = np.mean(audio_signal, 1) 72 | 73 | # Set the window duration in seconds (audio is stationary around 40 milliseconds) 74 | window_duration = 0.04 75 | 76 | # Derive the window length in samples (use powers of 2 for faster FFT and constant overlap-add (COLA)) 77 | window_length = pow(2, int(np.ceil(np.log2(window_duration*sampling_frequency)))) 78 | 79 | # Compute the window function (use SciPy's periodic Hamming window for COLA as NumPy's Hamming window is symmetric) 80 | window_function = scipy.signal.hamming(window_length, sym=False) 81 | 82 | # Set the step length in samples (half of the window length for COLA) 83 | step_length = int(window_length/2) 84 | 85 | # Compute the STFT 86 | audio_stft = zaf.stft(audio_signal, window_function, step_length) 87 | 88 | # Derive the magnitude spectrogram (without the DC component and the mirrored frequencies) 89 | audio_spectrogram = np.absolute(audio_stft[1:int(window_length/2)+1, :]) 90 | 91 | # Display the spectrogram in dB, seconds, and Hz 92 | number_samples = len(audio_signal) 93 | plt.figure(figsize=(14, 7)) 94 | zaf.specshow(audio_spectrogram, number_samples, sampling_frequency, xtick_step=1, ytick_step=1000) 95 | plt.title("Spectrogram (dB)") 96 | plt.tight_layout() 97 | plt.show() 98 | ``` 99 | 100 | 101 | 102 | 103 | ### istft 104 | 105 | Compute the inverse short-time Fourier transform (STFT). 106 | 107 | ``` 108 | audio_signal = zaf.istft(audio_stft, window_function, step_length) 109 | 110 | Inputs: 111 | audio_stft: audio STFT (window_length, number_frames) 112 | window_function: window function (window_length,) 113 | step_length: step length in samples 114 | Output: 115 | audio_signal: audio signal (number_samples,) 116 | ``` 117 | 118 | #### Example: Estimate the center and the sides from a stereo audio file. 
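Since this example relies on inverting modified STFTs, it helps to first confirm that the analysis-synthesis chain alone is lossless; a quick round-trip check (minimal sketch, arbitrary test values):

```
# Round-trip check: istft(stft(x)) should return x up to machine precision
import numpy as np
import scipy.signal
import zaf

sampling_frequency = 8000
audio_signal = np.sin(2*np.pi*440*np.arange(sampling_frequency)/sampling_frequency)

# Periodic Hamming window with half-window steps satisfies COLA
window_function = scipy.signal.hamming(256, sym=False)
step_length = 128

audio_stft = zaf.stft(audio_signal, window_function, step_length)
audio_signal2 = zaf.istft(audio_stft, window_function, step_length)

# The resynthesis can be slightly longer; compare over the original length
print(np.max(np.abs(audio_signal-audio_signal2[0:len(audio_signal)])))  # ~1e-16
```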
119 | 120 | ``` 121 | # Import the needed modules 122 | import numpy as np 123 | import scipy.signal 124 | import zaf 125 | import matplotlib.pyplot as plt 126 | 127 | # Read the (stereo) audio signal with its sampling frequency in Hz 128 | audio_signal, sampling_frequency = zaf.wavread("audio_file.wav") 129 | 130 | # Set the parameters for the STFT 131 | window_length = pow(2, int(np.ceil(np.log2(0.04*sampling_frequency)))) 132 | window_function = scipy.signal.hamming(window_length, sym=False) 133 | step_length = int(window_length/2) 134 | 135 | # Compute the STFTs for the left and right channels 136 | audio_stft1 = zaf.stft(audio_signal[:, 0], window_function, step_length) 137 | audio_stft2 = zaf.stft(audio_signal[:, 1], window_function, step_length) 138 | 139 | # Derive the magnitude spectrograms (with DC component) for the left and right channels 140 | number_frequencies = int(window_length/2)+1 141 | audio_spectrogram1 = abs(audio_stft1[0:number_frequencies, :]) 142 | audio_spectrogram2 = abs(audio_stft2[0:number_frequencies, :]) 143 | 144 | # Estimate the time-frequency masks for the left and right channels for the center 145 | center_mask1 = np.minimum(audio_spectrogram1, audio_spectrogram2)/audio_spectrogram1 146 | center_mask2 = np.minimum(audio_spectrogram1, audio_spectrogram2)/audio_spectrogram2 147 | 148 | # Derive the STFTs for the left and right channels for the center (with mirrored frequencies) 149 | center_stft1 = np.multiply(np.concatenate((center_mask1, center_mask1[-2:0:-1, :])), audio_stft1) 150 | center_stft2 = np.multiply(np.concatenate((center_mask2, center_mask2[-2:0:-1, :])), audio_stft2) 151 | 152 | # Synthesize the signals for the left and right channels for the center 153 | center_signal1 = zaf.istft(center_stft1, window_function, step_length) 154 | center_signal2 = zaf.istft(center_stft2, window_function, step_length) 155 | 156 | # Derive the final stereo center and sides signals 157 | center_signal = np.stack((center_signal1, center_signal2), axis=1) 158 | center_signal = center_signal[0:np.shape(audio_signal)[0], :] 159 | sides_signal = audio_signal-center_signal 160 | 161 | # Write the center and sides signals 162 | zaf.wavwrite(center_signal, sampling_frequency, "center_file.wav") 163 | zaf.wavwrite(sides_signal, sampling_frequency, "sides_file.wav") 164 | 165 | # Display the original, center, and sides signals in seconds 166 | xtick_step = 1 167 | plt.figure(figsize=(14, 7)) 168 | plt.subplot(3, 1, 1), zaf.sigplot(audio_signal, sampling_frequency, xtick_step) 169 | plt.ylim(-1, 1), plt.title("Original signal") 170 | plt.subplot(3, 1, 2), zaf.sigplot(center_signal, sampling_frequency, xtick_step) 171 | plt.ylim(-1, 1), plt.title("Center signal") 172 | plt.subplot(3, 1, 3), zaf.sigplot(sides_signal, sampling_frequency, xtick_step) 173 | plt.ylim(-1, 1), plt.title("Sides signal") 174 | plt.tight_layout() 175 | plt.show() 176 | ``` 177 | 178 | 179 | 180 | 181 | ### melfilterbank 182 | 183 | Compute the mel filterbank. 184 | 185 | ``` 186 | mel_filterbank = zaf.melfilterbank(sampling_frequency, window_length, number_mels) 187 | 188 | Inputs: 189 | sampling_frequency: sampling frequency in Hz 190 | window_length: window length for the Fourier analysis in samples 191 | number_mels: number of mel filters 192 | 193 | Output: 194 | mel_filterbank: mel filterbank (sparse) (number_mels, number_frequencies) 195 | ``` 196 | 197 | #### Example: Compute and display the mel filterbank. 
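The filters are half-overlapping triangles spaced uniformly on the mel scale, using the common correspondence mel = 2595*log10(1 + f/700) (the same one used inside the module); a minimal sketch of the conversion (hypothetical helper names):

```
# Hertz-to-mel correspondence used by the filterbank, and its inverse
import numpy as np

def hertz_to_mel(frequency_value):
    return 2595*np.log10(1+frequency_value/700)

def mel_to_hertz(mel_value):
    return 700*(np.power(10, mel_value/2595)-1)

# 1 kHz sits near 1000 mels by construction of the scale
print(hertz_to_mel(1000))  # ~1000
```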

```
# Import the needed modules
import numpy as np
import zaf
import matplotlib.pyplot as plt

# Compute the mel filterbank using some parameters
sampling_frequency = 44100
window_length = pow(2, int(np.ceil(np.log2(0.04 * sampling_frequency))))
number_mels = 128
mel_filterbank = zaf.melfilterbank(sampling_frequency, window_length, number_mels)

# Display the mel filterbank
plt.figure(figsize=(14, 5))
plt.imshow(mel_filterbank.toarray(), aspect="auto", cmap="jet", origin="lower")
plt.title("Mel filterbank")
plt.xlabel("Frequency index")
plt.ylabel("Mel index")
plt.tight_layout()
plt.show()
```


### melspectrogram

Compute the mel spectrogram using a mel filterbank.

```
mel_spectrogram = zaf.melspectrogram(audio_signal, window_function, step_length, mel_filterbank)

Inputs:
    audio_signal: audio signal (number_samples,)
    window_function: window function (window_length,)
    step_length: step length in samples
    mel_filterbank: mel filterbank (number_mels, number_frequencies)
Output:
    mel_spectrogram: mel spectrogram (number_mels, number_times)
```

#### Example: Compute and display the mel spectrogram.

```
# Import the needed modules
import numpy as np
import scipy.signal
import zaf
import matplotlib.pyplot as plt

# Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
audio_signal = np.mean(audio_signal, 1)

# Set the parameters for the Fourier analysis
window_length = pow(2, int(np.ceil(np.log2(0.04*sampling_frequency))))
window_function = scipy.signal.hamming(window_length, sym=False)
step_length = int(window_length/2)

# Compute the mel filterbank
number_mels = 128
mel_filterbank = zaf.melfilterbank(sampling_frequency, window_length, number_mels)

# Compute the mel spectrogram using the filterbank
mel_spectrogram = zaf.melspectrogram(audio_signal, window_function, step_length, mel_filterbank)

# Display the mel spectrogram in dB, seconds, and Hz
number_samples = len(audio_signal)
plt.figure(figsize=(14, 5))
zaf.melspecshow(mel_spectrogram, number_samples, sampling_frequency, window_length, xtick_step=1)
plt.title("Mel spectrogram (dB)")
plt.tight_layout()
plt.show()
```


### mfcc

Compute the mel-frequency cepstral coefficients (MFCCs) using a mel filterbank.

```
audio_mfcc = zaf.mfcc(audio_signal, window_function, step_length, mel_filterbank, number_coefficients)

Inputs:
    audio_signal: audio signal (number_samples,)
    window_function: window function (window_length,)
    step_length: step length in samples
    mel_filterbank: mel filterbank (number_mels, number_frequencies)
    number_coefficients: number of coefficients (without the 0th coefficient)
Output:
    audio_mfcc: audio MFCCs (number_coefficients, number_times)
```

#### Example: Compute and display the MFCCs, delta MFCCs, and delta-delta MFCCs.
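The example below computes delta features with np.diff, which shortens the time axis by one frame per order; if frame-aligned features are needed, the differences can be padded back (a minimal sketch, hypothetical helper name):

```
# Pad first-order differences so all feature matrices keep the same number of frames
import numpy as np

def delta_features(feature_matrix):
    difference_matrix = np.diff(feature_matrix, n=1, axis=1)
    return np.pad(difference_matrix, ((0, 0), (1, 0)), "edge")
```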
294 | 295 | ``` 296 | # Import the needed modules 297 | import numpy as np 298 | import scipy.signal 299 | import zaf 300 | import matplotlib.pyplot as plt 301 | 302 | # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels 303 | audio_signal, sampling_frequency = zaf.wavread("audio_file.wav") 304 | audio_signal = np.mean(audio_signal, 1) 305 | 306 | # Set the parameters for the Fourier analysis 307 | window_length = pow(2, int(np.ceil(np.log2(0.04*sampling_frequency)))) 308 | window_function = scipy.signal.hamming(window_length, sym=False) 309 | step_length = int(window_length/2) 310 | 311 | # Compute the mel filterbank 312 | number_mels = 40 313 | mel_filterbank = zaf.melfilterbank(sampling_frequency, window_length, number_mels) 314 | 315 | # Compute the MFCCs using the filterbank 316 | number_coefficients = 20 317 | audio_mfcc = zaf.mfcc(audio_signal, window_function, step_length, mel_filterbank, number_coefficients) 318 | 319 | # Compute the delta and delta-delta MFCCs 320 | audio_dmfcc = np.diff(audio_mfcc, n=1, axis=1) 321 | audio_ddmfcc = np.diff(audio_dmfcc, n=1, axis=1) 322 | 323 | # Display the MFCCs, delta MFCCs, and delta-delta MFCCs in seconds 324 | number_samples = len(audio_signal) 325 | xtick_step = 1 326 | plt.figure(figsize=(14, 7)) 327 | plt.subplot(3, 1, 1) 328 | zaf.mfccshow(audio_mfcc, number_samples, sampling_frequency, xtick_step), plt.title("MFCCs") 329 | plt.subplot(3, 1, 2) 330 | zaf.mfccshow(audio_dmfcc, number_samples, sampling_frequency, xtick_step), plt.title("Delta MFCCs") 331 | plt.subplot(3, 1, 3) 332 | zaf.mfccshow(audio_ddmfcc, number_samples, sampling_frequency, xtick_step), plt.title("Delta-delta MFCCs") 333 | plt.tight_layout() 334 | plt.show() 335 | ``` 336 | 337 | 338 | 339 | 340 | ### cqtkernel 341 | 342 | Compute the constant-Q transform (CQT) kernel. 343 | 344 | ``` 345 | cqt_kernel = zaf.cqtkernel(sampling_frequency, octave_resolution, minimum_frequency, maximum_frequency) 346 | 347 | Inputs: 348 | sampling_frequency: sampling frequency in Hz 349 | octave_resolution: number of frequency channels per octave 350 | minimum_frequency: minimum frequency in Hz 351 | maximum_frequency: maximum frequency in Hz 352 | Output: 353 | cqt_kernel: CQT kernel (sparse) (number_frequencies, fft_length) 354 | ``` 355 | 356 | #### Example: Compute and display the CQT kernel. 357 | 358 | ``` 359 | # Import the needed modules 360 | import numpy as np 361 | import zaf 362 | import matplotlib.pyplot as plt 363 | 364 | # Set the parameters for the CQT kernel 365 | sampling_frequency = 44100 366 | octave_resolution = 24 367 | minimum_frequency = 55 368 | maximum_frequency = sampling_frequency/2 369 | 370 | # Compute the CQT kernel 371 | cqt_kernel = zaf.cqtkernel(sampling_frequency, octave_resolution, minimum_frequency, maximum_frequency) 372 | 373 | # Display the magnitude CQT kernel 374 | plt.figure(figsize=(14, 5)) 375 | plt.imshow(np.absolute(cqt_kernel).toarray(), aspect="auto", cmap="jet", origin="lower") 376 | plt.title("Magnitude CQT kernel") 377 | plt.xlabel("FFT index") 378 | plt.ylabel("CQT index") 379 | plt.tight_layout() 380 | plt.show() 381 | ``` 382 | 383 | 384 | 385 | 386 | ### cqtspectrogram 387 | 388 | Compute the constant-Q transform (CQT) spectrogram using a CQT kernel. 
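(With, for example, time_resolution = 25 time frames per second at a sampling frequency of 44100 Hz, successive frames are round(44100/25) = 1764 samples apart.)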

```
cqt_spectrogram = zaf.cqtspectrogram(audio_signal, sampling_frequency, time_resolution, cqt_kernel)

Inputs:
    audio_signal: audio signal (number_samples,)
    sampling_frequency: sampling frequency in Hz
    time_resolution: number of time frames per second
    cqt_kernel: CQT kernel (number_frequencies, fft_length)
Output:
    cqt_spectrogram: CQT spectrogram (number_frequencies, number_times)
```

#### Example: Compute and display the CQT spectrogram.

```
# Import the needed modules
import numpy as np
import zaf
import matplotlib.pyplot as plt

# Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
audio_signal = np.mean(audio_signal, 1)

# Compute the CQT kernel
octave_resolution = 24
minimum_frequency = 55
maximum_frequency = 3520
cqt_kernel = zaf.cqtkernel(sampling_frequency, octave_resolution, minimum_frequency, maximum_frequency)

# Compute the CQT spectrogram using the kernel
time_resolution = 25
cqt_spectrogram = zaf.cqtspectrogram(audio_signal, sampling_frequency, time_resolution, cqt_kernel)

# Display the CQT spectrogram in dB, seconds, and Hz
plt.figure(figsize=(14, 5))
zaf.cqtspecshow(cqt_spectrogram, time_resolution, octave_resolution, minimum_frequency, xtick_step=1)
plt.title("CQT spectrogram (dB)")
plt.tight_layout()
plt.show()
```


### cqtchromagram

Compute the constant-Q transform (CQT) chromagram using a CQT kernel.

```
cqt_chromagram = zaf.cqtchromagram(audio_signal, sampling_frequency, time_resolution, octave_resolution, cqt_kernel)

Inputs:
    audio_signal: audio signal (number_samples,)
    sampling_frequency: sampling frequency in Hz
    time_resolution: number of time frames per second
    octave_resolution: number of frequency channels per octave
    cqt_kernel: CQT kernel (number_frequencies, fft_length)
Output:
    cqt_chromagram: CQT chromagram (number_chromas, number_times)
```

#### Example: Compute and display the CQT chromagram.

```
# Import the needed modules
import numpy as np
import zaf
import matplotlib.pyplot as plt

# Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
audio_signal = np.mean(audio_signal, 1)

# Compute the CQT kernel
octave_resolution = 24
minimum_frequency = 55
maximum_frequency = 3520
cqt_kernel = zaf.cqtkernel(sampling_frequency, octave_resolution, minimum_frequency, maximum_frequency)

# Compute the CQT chromagram using the kernel
time_resolution = 25
cqt_chromagram = zaf.cqtchromagram(audio_signal, sampling_frequency, time_resolution, octave_resolution, cqt_kernel)

# Display the CQT chromagram in seconds
plt.figure(figsize=(14, 3))
zaf.cqtchromshow(cqt_chromagram, time_resolution, xtick_step=1)
plt.title("CQT chromagram")
plt.tight_layout()
plt.show()
```


### dct

Compute the discrete cosine transform (DCT) using the fast Fourier transform (FFT).
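For instance, the DCT-II of a length-N frame can be obtained from a same-length FFT after an even-odd reordering of the samples (one standard construction, shown here for intuition; not necessarily the exact mapping used internally):

```
# DCT-II via a same-length FFT (even-odd reordering), in the "ortho" convention
import numpy as np
import scipy.fftpack

def dct2_via_fft(audio_segment):
    window_length = len(audio_segment)
    # Even-indexed samples first, then the odd-indexed samples reversed
    reordered_segment = np.concatenate((audio_segment[0::2], audio_segment[1::2][::-1]))
    phase_factor = np.exp(-1j*np.pi*np.arange(window_length)/(2*window_length))
    audio_dct = np.real(phase_factor*np.fft.fft(reordered_segment))
    # Orthonormal scaling
    audio_dct[0] = audio_dct[0]/np.sqrt(window_length)
    audio_dct[1:] = audio_dct[1:]*np.sqrt(2/window_length)
    return audio_dct

# Should agree with SciPy to machine precision
audio_segment = np.random.randn(1024)
print(np.max(np.abs(dct2_via_fft(audio_segment)-scipy.fftpack.dct(audio_segment, type=2, norm="ortho"))))
```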

```
audio_dct = zaf.dct(audio_signal, dct_type)

Inputs:
    audio_signal: audio signal (window_length,)
    dct_type: DCT type (1, 2, 3, or 4)
Output:
    audio_dct: audio DCT (number_frequencies,)
```

#### Example: Compute the 4 different DCTs and compare them to SciPy's DCTs.

```
# Import the needed modules
import numpy as np
import zaf
import scipy.fftpack
import matplotlib.pyplot as plt

# Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
audio_signal = np.mean(audio_signal, 1)

# Get an audio segment for a given window length
window_length = 1024
audio_segment = audio_signal[0:window_length]

# Compute the DCT-I, II, III, and IV
audio_dct1 = zaf.dct(audio_segment, 1)
audio_dct2 = zaf.dct(audio_segment, 2)
audio_dct3 = zaf.dct(audio_segment, 3)
audio_dct4 = zaf.dct(audio_segment, 4)

# Compute SciPy's DCT-I, II, III, and IV (orthogonalized)
scipy_dct1 = scipy.fftpack.dct(audio_segment, type=1, norm="ortho")
scipy_dct2 = scipy.fftpack.dct(audio_segment, type=2, norm="ortho")
scipy_dct3 = scipy.fftpack.dct(audio_segment, type=3, norm="ortho")
scipy_dct4 = scipy.fftpack.dct(audio_segment, type=4, norm="ortho")

# Plot the DCT-I, II, III, and IV, SciPy's versions, and their differences
plt.figure(figsize=(14, 7))
plt.subplot(3, 4, 1), plt.plot(audio_dct1), plt.autoscale(tight=True), plt.title("DCT-I")
plt.subplot(3, 4, 2), plt.plot(audio_dct2), plt.autoscale(tight=True), plt.title("DCT-II")
plt.subplot(3, 4, 3), plt.plot(audio_dct3), plt.autoscale(tight=True), plt.title("DCT-III")
plt.subplot(3, 4, 4), plt.plot(audio_dct4), plt.autoscale(tight=True), plt.title("DCT-IV")
plt.subplot(3, 4, 5), plt.plot(scipy_dct1), plt.autoscale(tight=True), plt.title("SciPy's DCT-I")
plt.subplot(3, 4, 6), plt.plot(scipy_dct2), plt.autoscale(tight=True), plt.title("SciPy's DCT-II")
plt.subplot(3, 4, 7), plt.plot(scipy_dct3), plt.autoscale(tight=True), plt.title("SciPy's DCT-III")
plt.subplot(3, 4, 8), plt.plot(scipy_dct4), plt.autoscale(tight=True), plt.title("SciPy's DCT-IV")
plt.subplot(3, 4, 9), plt.plot(audio_dct1-scipy_dct1), plt.autoscale(tight=True), plt.title("DCT-I - SciPy's DCT-I")
plt.subplot(3, 4, 10), plt.plot(audio_dct2-scipy_dct2), plt.autoscale(tight=True), plt.title("DCT-II - SciPy's DCT-II")
plt.subplot(3, 4, 11), plt.plot(audio_dct3-scipy_dct3), plt.autoscale(tight=True), plt.title("DCT-III - SciPy's DCT-III")
plt.subplot(3, 4, 12), plt.plot(audio_dct4-scipy_dct4), plt.autoscale(tight=True), plt.title("DCT-IV - SciPy's DCT-IV")
plt.tight_layout()
plt.show()
```


### dst

Compute the discrete sine transform (DST) using the fast Fourier transform (FFT).

```
audio_dst = zaf.dst(audio_signal, dst_type)

Inputs:
    audio_signal: audio signal (window_length,)
    dst_type: DST type (1, 2, 3, or 4)
Output:
    audio_dst: audio DST (number_frequencies,)
```

#### Example: Compute the 4 different DSTs and compare their respective inverses with the original audio.
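In the orthonormal convention used here, DST-I and DST-IV are their own inverses while DST-II and DST-III invert each other, which is what the example below exploits; a compact statement of the pairing (minimal sketch, hypothetical helper):

```
# Inverse DST types: I and IV are self-inverse, II and III invert each other
import zaf

inverse_dst_type = {1: 1, 2: 3, 3: 2, 4: 4}

def idst(audio_dst, dst_type):
    return zaf.dst(audio_dst, inverse_dst_type[dst_type])
```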

```
# Import the needed modules
import numpy as np
import zaf
import matplotlib.pyplot as plt

# Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
audio_signal = np.mean(audio_signal, 1)

# Get an audio segment for a given window length
window_length = 1024
audio_segment = audio_signal[0:window_length]

# Compute the DST-I, II, III, and IV
audio_dst1 = zaf.dst(audio_segment, 1)
audio_dst2 = zaf.dst(audio_segment, 2)
audio_dst3 = zaf.dst(audio_segment, 3)
audio_dst4 = zaf.dst(audio_segment, 4)

# Compute their respective inverses, i.e., DST-I, III, II, and IV
audio_idst1 = zaf.dst(audio_dst1, 1)
audio_idst2 = zaf.dst(audio_dst2, 3)
audio_idst3 = zaf.dst(audio_dst3, 2)
audio_idst4 = zaf.dst(audio_dst4, 4)

# Plot the DST-I, II, III, and IV, their respective inverses, and their differences with the original audio segment
plt.figure(figsize=(14, 7))
plt.subplot(3, 4, 1), plt.plot(audio_dst1), plt.autoscale(tight=True), plt.title("DST-I")
plt.subplot(3, 4, 2), plt.plot(audio_dst2), plt.autoscale(tight=True), plt.title("DST-II")
plt.subplot(3, 4, 3), plt.plot(audio_dst3), plt.autoscale(tight=True), plt.title("DST-III")
plt.subplot(3, 4, 4), plt.plot(audio_dst4), plt.autoscale(tight=True), plt.title("DST-IV")
plt.subplot(3, 4, 5), plt.plot(audio_idst1), plt.autoscale(tight=True), plt.title("Inverse DST-I (DST-I)")
plt.subplot(3, 4, 6), plt.plot(audio_idst2), plt.autoscale(tight=True), plt.title("Inverse DST-II (DST-III)")
plt.subplot(3, 4, 7), plt.plot(audio_idst3), plt.autoscale(tight=True), plt.title("Inverse DST-III (DST-II)")
plt.subplot(3, 4, 8), plt.plot(audio_idst4), plt.autoscale(tight=True), plt.title("Inverse DST-IV (DST-IV)")
plt.subplot(3, 4, 9), plt.plot(audio_idst1-audio_segment), plt.autoscale(tight=True)
plt.title("Inverse DST-I - audio segment")
plt.subplot(3, 4, 10), plt.plot(audio_idst2-audio_segment), plt.autoscale(tight=True)
plt.title("Inverse DST-II - audio segment")
plt.subplot(3, 4, 11), plt.plot(audio_idst3-audio_segment), plt.autoscale(tight=True)
plt.title("Inverse DST-III - audio segment")
plt.subplot(3, 4, 12), plt.plot(audio_idst4-audio_segment), plt.autoscale(tight=True)
plt.title("Inverse DST-IV - audio segment")
plt.tight_layout()
plt.show()
```


### mdct

Compute the modified discrete cosine transform (MDCT) using the fast Fourier transform (FFT).

```
audio_mdct = zaf.mdct(audio_signal, window_function)

Inputs:
    audio_signal: audio signal (number_samples,)
    window_function: window function (window_length,)
Output:
    audio_mdct: audio MDCT (number_frequencies, number_times)
```

#### Example: Compute and display the MDCT as used in the AC-3 audio coding format.
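AC-3 uses a Kaiser-Bessel-derived (KBD) window; for perfect reconstruction, an MDCT window must satisfy the Princen-Bradley condition w(n)^2 + w(n+N/2)^2 = 1. The sine (slope) window used in the inverse MDCT example further below satisfies it exactly, which is easy to verify (minimal sketch):

```
# Princen-Bradley check for the sine (slope) window
import numpy as np

window_length = 512
window_function = np.sin(np.pi/2*pow(np.sin(np.pi/window_length*np.arange(0.5, window_length+0.5)), 2))

half_length = int(window_length/2)
condition = window_function[0:half_length]**2 + window_function[half_length:window_length]**2
print(np.max(np.abs(condition-1)))  # ~1e-16
```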

```
# Import the needed modules
import numpy as np
import zaf
import matplotlib.pyplot as plt

# Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
audio_signal = np.mean(audio_signal, 1)

# Compute the Kaiser-Bessel-derived (KBD) window as used in the AC-3 audio coding format
window_length = 512
alpha_value = 5
window_function = np.kaiser(int(window_length/2)+1, alpha_value*np.pi)
window_function2 = np.cumsum(window_function[1:int(window_length/2)])
window_function = np.sqrt(np.concatenate((window_function2, window_function2[int(window_length/2)::-1]))
                          /np.sum(window_function))

# Compute the MDCT
audio_mdct = zaf.mdct(audio_signal, window_function)

# Display the MDCT in dB, seconds, and Hz
number_samples = len(audio_signal)
plt.figure(figsize=(14, 7))
zaf.specshow(np.absolute(audio_mdct), number_samples, sampling_frequency, xtick_step=1, ytick_step=1000)
plt.title("MDCT (dB)")
plt.tight_layout()
plt.show()
```


### imdct

Compute the inverse modified discrete cosine transform (MDCT) using the fast Fourier transform (FFT).

```
audio_signal = zaf.imdct(audio_mdct, window_function)

Inputs:
    audio_mdct: audio MDCT (number_frequencies, number_times)
    window_function: window function (window_length,)
Output:
    audio_signal: audio signal (number_samples,)
```

#### Example: Verify that the MDCT is perfectly invertible.

```
# Import the needed modules
import numpy as np
import zaf
import matplotlib.pyplot as plt

# Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
audio_signal = np.mean(audio_signal, 1)

# Compute the MDCT with a slope function as used in the Vorbis audio coding format
window_length = 2048
window_function = np.sin(np.pi/2*pow(np.sin(np.pi/window_length*np.arange(0.5, window_length+0.5)), 2))
audio_mdct = zaf.mdct(audio_signal, window_function)

# Compute the inverse MDCT
audio_signal2 = zaf.imdct(audio_mdct, window_function)
audio_signal2 = audio_signal2[0:len(audio_signal)]

# Compute the differences between the original signal and the resynthesized one
audio_differences = audio_signal-audio_signal2
y_max = np.max(np.absolute(audio_differences))

# Display the original and resynthesized signals, and their differences in seconds
xtick_step = 1
plt.figure(figsize=(14, 7))
plt.subplot(3, 1, 1), zaf.sigplot(audio_signal, sampling_frequency, xtick_step)
plt.ylim(-1, 1), plt.title("Original signal")
plt.subplot(3, 1, 2), zaf.sigplot(audio_signal2, sampling_frequency, xtick_step)
plt.ylim(-1, 1), plt.title("Resynthesized signal")
plt.subplot(3, 1, 3), zaf.sigplot(audio_differences, sampling_frequency, xtick_step)
plt.ylim(-y_max, y_max), plt.title("Original - resynthesized signal")
plt.tight_layout()
plt.show()
```


## examples.ipynb

This Jupyter notebook shows some examples for the different functions of the Python module `zaf`.
723 | 724 | See [Jupyter notebook viewer](https://nbviewer.jupyter.org/github/zafarrafii/Zaf-Python/blob/master/examples.ipynb). 725 | 726 | 727 | ## audio_file.wav 728 | 729 | 23 second audio excerpt from the song *Que Pena Tanto Faz* performed by *Tamy*. 730 | 731 | 732 | # Author 733 | 734 | - Zafar Rafii 735 | - http://zafarrafii.com/ 736 | - [CV](http://zafarrafii.com/Zafar%20Rafii%20-%20C.V..pdf) 737 | - [GitHub](https://github.com/zafarrafii) 738 | - [LinkedIn](https://www.linkedin.com/in/zafarrafii/) 739 | - [Google Scholar](https://scholar.google.com/citations?user=8wbS2EsAAAAJ&hl=en) 740 | -------------------------------------------------------------------------------- /audio_file.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/audio_file.wav -------------------------------------------------------------------------------- /images/cqtchromagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/cqtchromagram.png -------------------------------------------------------------------------------- /images/cqtkernel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/cqtkernel.png -------------------------------------------------------------------------------- /images/cqtspectrogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/cqtspectrogram.png -------------------------------------------------------------------------------- /images/dct.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/dct.png -------------------------------------------------------------------------------- /images/dst.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/dst.png -------------------------------------------------------------------------------- /images/imdct.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/imdct.png -------------------------------------------------------------------------------- /images/istft.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/istft.png -------------------------------------------------------------------------------- /images/mdct.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/mdct.png -------------------------------------------------------------------------------- /images/melfilterbank.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/melfilterbank.png
--------------------------------------------------------------------------------
/images/melspectrogram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/melspectrogram.png
--------------------------------------------------------------------------------
/images/mfcc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/mfcc.png
--------------------------------------------------------------------------------
/images/stft.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zafarrafii/Zaf-Python/9589f61f8da499ba53bed3465215b8cc54d287a7/images/stft.png
--------------------------------------------------------------------------------
/zaf.py:
--------------------------------------------------------------------------------
"""
This Python module implements a number of functions for audio signal analysis.

Functions:
    stft - Compute the short-time Fourier transform (STFT).
    istft - Compute the inverse STFT.
    melfilterbank - Compute the mel filterbank.
    melspectrogram - Compute the mel spectrogram using a mel filterbank.
    mfcc - Compute the mel-frequency cepstral coefficients (MFCCs) using a mel filterbank.
    cqtkernel - Compute the constant-Q transform (CQT) kernel.
    cqtspectrogram - Compute the CQT spectrogram using a CQT kernel.
    cqtchromagram - Compute the CQT chromagram using a CQT kernel.
    dct - Compute the discrete cosine transform (DCT) using the fast Fourier transform (FFT).
    dst - Compute the discrete sine transform (DST) using the FFT.
    mdct - Compute the modified discrete cosine transform (MDCT) using the FFT.
    imdct - Compute the inverse MDCT using the FFT.

Other:
    wavread - Read a WAVE file (using SciPy).
    wavwrite - Write a WAVE file (using SciPy).
    sigplot - Plot a signal in seconds.
    specshow - Display a spectrogram in dB, seconds, and Hz.
    melspecshow - Display a mel spectrogram in dB, seconds, and Hz.
    mfccshow - Display MFCCs in seconds.
    cqtspecshow - Display a CQT spectrogram in dB, seconds, and Hz.
    cqtchromshow - Display a CQT chromagram in seconds.

Author:
    Zafar Rafii
    zafarrafii@gmail.com
    http://zafarrafii.com
    https://github.com/zafarrafii
    https://www.linkedin.com/in/zafarrafii/
    08/24/21
"""

import numpy as np
import scipy.sparse
import scipy.signal
import scipy.fftpack
import scipy.io.wavfile
import matplotlib.pyplot as plt


def stft(audio_signal, window_function, step_length):
    """
    Compute the short-time Fourier transform (STFT).

    Inputs:
        audio_signal: audio signal (number_samples,)
        window_function: window function (window_length,)
        step_length: step length in samples
    Output:
        audio_stft: audio STFT (window_length, number_frames)

    Example: Compute and display the spectrogram from an audio file.
57 | # Import the needed modules 58 | import numpy as np 59 | import scipy.signal 60 | import zaf 61 | import matplotlib.pyplot as plt 62 | 63 | # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels 64 | audio_signal, sampling_frequency = zaf.wavread("audio_file.wav") 65 | audio_signal = np.mean(audio_signal, 1) 66 | 67 | # Set the window duration in seconds (audio is stationary around 40 milliseconds) 68 | window_duration = 0.04 69 | 70 | # Derive the window length in samples (use powers of 2 for faster FFT and constant overlap-add (COLA)) 71 | window_length = pow(2, int(np.ceil(np.log2(window_duration*sampling_frequency)))) 72 | 73 | # Compute the window function (use SciPy's periodic Hamming window for COLA as NumPy's Hamming window is symmetric) 74 | window_function = scipy.signal.hamming(window_length, sym=False) 75 | 76 | # Set the step length in samples (half of the window length for COLA) 77 | step_length = int(window_length/2) 78 | 79 | # Compute the STFT 80 | audio_stft = zaf.stft(audio_signal, window_function, step_length) 81 | 82 | # Derive the magnitude spectrogram (without the DC component and the mirrored frequencies) 83 | audio_spectrogram = np.absolute(audio_stft[1:int(window_length/2)+1, :]) 84 | 85 | # Display the spectrogram in dB, seconds, and Hz 86 | number_samples = len(audio_signal) 87 | plt.figure(figsize=(14, 7)) 88 | zaf.specshow(audio_spectrogram, number_samples, sampling_frequency, xtick_step=1, ytick_step=1000) 89 | plt.title("Spectrogram (dB)") 90 | plt.tight_layout() 91 | plt.show() 92 | """ 93 | 94 | # Get the number of samples and the window length in samples 95 | number_samples = len(audio_signal) 96 | window_length = len(window_function) 97 | 98 | # Derive the zero-padding length at the start and at the end of the signal to center the windows 99 | padding_length = int(np.floor(window_length / 2)) 100 | 101 | # Compute the number of time frames given the zero-padding at the start and at the end of the signal 102 | number_times = ( 103 | int( 104 | np.ceil( 105 | ((number_samples + 2 * padding_length) - window_length) / step_length 106 | ) 107 | ) 108 | + 1 109 | ) 110 | 111 | # Zero-pad the start and the end of the signal to center the windows 112 | audio_signal = np.pad( 113 | audio_signal, 114 | ( 115 | padding_length, 116 | ( 117 | number_times * step_length 118 | + (window_length - step_length) 119 | - padding_length 120 | ) 121 | - number_samples, 122 | ), 123 | "constant", 124 | constant_values=0, 125 | ) 126 | 127 | # Initialize the STFT 128 | audio_stft = np.zeros((window_length, number_times)) 129 | 130 | # Loop over the time frames 131 | i = 0 132 | for j in range(number_times): 133 | 134 | # Window the signal 135 | audio_stft[:, j] = audio_signal[i : i + window_length] * window_function 136 | i = i + step_length 137 | 138 | # Compute the Fourier transform of the frames using the FFT 139 | audio_stft = np.fft.fft(audio_stft, axis=0) 140 | 141 | return audio_stft 142 | 143 | 144 | def istft(audio_stft, window_function, step_length): 145 | """ 146 | Compute the inverse short-time Fourier transform (STFT). 147 | 148 | Inputs: 149 | audio_stft: audio STFT (window_length, number_frames) 150 | window_function: window function (window_length,) 151 | step_length: step length in samples 152 | Output: 153 | audio_signal: audio signal (number_samples,) 154 | 155 | Example: Estimate the center and the sides from a stereo audio file. 
156 | # Import the needed modules 157 | import numpy as np 158 | import scipy.signal 159 | import zaf 160 | import matplotlib.pyplot as plt 161 | 162 | # Read the (stereo) audio signal with its sampling frequency in Hz 163 | audio_signal, sampling_frequency = zaf.wavread("audio_file.wav") 164 | 165 | # Set the parameters for the STFT 166 | window_length = pow(2, int(np.ceil(np.log2(0.04*sampling_frequency)))) 167 | window_function = scipy.signal.hamming(window_length, sym=False) 168 | step_length = int(window_length/2) 169 | 170 | # Compute the STFTs for the left and right channels 171 | audio_stft1 = zaf.stft(audio_signal[:, 0], window_function, step_length) 172 | audio_stft2 = zaf.stft(audio_signal[:, 1], window_function, step_length) 173 | 174 | # Derive the magnitude spectrograms (with DC component) for the left and right channels 175 | number_frequencies = int(window_length/2)+1 176 | audio_spectrogram1 = abs(audio_stft1[0:number_frequencies, :]) 177 | audio_spectrogram2 = abs(audio_stft2[0:number_frequencies, :]) 178 | 179 | # Estimate the time-frequency masks for the left and right channels for the center 180 | center_mask1 = np.minimum(audio_spectrogram1, audio_spectrogram2)/audio_spectrogram1 181 | center_mask2 = np.minimum(audio_spectrogram1, audio_spectrogram2)/audio_spectrogram2 182 | 183 | # Derive the STFTs for the left and right channels for the center (with mirrored frequencies) 184 | center_stft1 = np.multiply(np.concatenate((center_mask1, center_mask1[-2:0:-1, :])), audio_stft1) 185 | center_stft2 = np.multiply(np.concatenate((center_mask2, center_mask2[-2:0:-1, :])), audio_stft2) 186 | 187 | # Synthesize the signals for the left and right channels for the center 188 | center_signal1 = zaf.istft(center_stft1, window_function, step_length) 189 | center_signal2 = zaf.istft(center_stft2, window_function, step_length) 190 | 191 | # Derive the final stereo center and sides signals 192 | center_signal = np.stack((center_signal1, center_signal2), axis=1) 193 | center_signal = center_signal[0:np.shape(audio_signal)[0], :] 194 | sides_signal = audio_signal-center_signal 195 | 196 | # Write the center and sides signals 197 | zaf.wavwrite(center_signal, sampling_frequency, "center_file.wav") 198 | zaf.wavwrite(sides_signal, sampling_frequency, "sides_file.wav") 199 | 200 | # Display the original, center, and sides signals in seconds 201 | xtick_step = 1 202 | plt.figure(figsize=(14, 7)) 203 | plt.subplot(3, 1, 1), zaf.sigplot(audio_signal, sampling_frequency, xtick_step) 204 | plt.ylim(-1, 1), plt.title("Original signal") 205 | plt.subplot(3, 1, 2), zaf.sigplot(center_signal, sampling_frequency, xtick_step) 206 | plt.ylim(-1, 1), plt.title("Center signal") 207 | plt.subplot(3, 1, 3), zaf.sigplot(sides_signal, sampling_frequency, xtick_step) 208 | plt.ylim(-1, 1), plt.title("Sides signal") 209 | plt.tight_layout() 210 | plt.show() 211 | """ 212 | 213 | # Get the window length in samples and the number of time frames 214 | window_length, number_times = np.shape(audio_stft) 215 | 216 | # Compute the number of samples for the signal 217 | number_samples = number_times * step_length + (window_length - step_length) 218 | 219 | # Initialize the signal 220 | audio_signal = np.zeros(number_samples) 221 | 222 | # Compute the inverse Fourier transform of the frames and take the real part to ensure real values 223 | audio_stft = np.real(np.fft.ifft(audio_stft, axis=0)) 224 | 225 | # Loop over the time frames 226 | i = 0 227 | for j in range(number_times): 228 | 229 | # Perform a constant 
overlap-add (COLA) of the signal (with proper window function and step length)
        audio_signal[i : i + window_length] = (
            audio_signal[i : i + window_length] + audio_stft[:, j]
        )
        i = i + step_length

    # Remove the zero-padding at the start and at the end of the signal
    audio_signal = audio_signal[
        window_length - step_length : number_samples - (window_length - step_length)
    ]

    # Normalize the signal by the gain introduced by the COLA (if any)
    audio_signal = audio_signal / sum(window_function[0:window_length:step_length])

    return audio_signal


def melfilterbank(sampling_frequency, window_length, number_mels):
    """
    Compute the mel filterbank.

    Inputs:
        sampling_frequency: sampling frequency in Hz
        window_length: window length for the Fourier analysis in samples
        number_mels: number of mel filters
    Output:
        mel_filterbank: mel filterbank (sparse) (number_mels, number_frequencies)

    Example: Compute and display the mel filterbank.
        # Import the needed modules
        import numpy as np
        import zaf
        import matplotlib.pyplot as plt

        # Compute the mel filterbank using some parameters
        sampling_frequency = 44100
        window_length = pow(2, int(np.ceil(np.log2(0.04 * sampling_frequency))))
        number_mels = 128
        mel_filterbank = zaf.melfilterbank(sampling_frequency, window_length, number_mels)

        # Display the mel filterbank
        plt.figure(figsize=(14, 5))
        plt.imshow(mel_filterbank.toarray(), aspect="auto", cmap="jet", origin="lower")
        plt.title("Mel filterbank")
        plt.xlabel("Frequency index")
        plt.ylabel("Mel index")
        plt.tight_layout()
        plt.show()
    """

    # Compute the minimum and maximum mels
    minimum_mel = 2595 * np.log10(1 + (sampling_frequency / window_length) / 700)
    maximum_mel = 2595 * np.log10(1 + (sampling_frequency / 2) / 700)

    # Derive the width of the half-overlapping filters in the mel scale (constant)
    filter_width = 2 * (maximum_mel - minimum_mel) / (number_mels + 1)

    # Compute the start and end indices of the filters in the mel scale (linearly spaced)
    filter_indices = np.arange(minimum_mel, maximum_mel + 1, filter_width / 2)

    # Derive the indices of the filters in the linear frequency scale (log spaced)
    filter_indices = np.round(
        700
        * (np.power(10, filter_indices / 2595) - 1)
        * window_length
        / sampling_frequency
    ).astype(int)

    # Initialize the mel filterbank
    mel_filterbank = np.zeros((number_mels, int(window_length / 2)))

    # Loop over the filters
    for i in range(number_mels):

        # Compute the left and right sides of the triangular filters
        # (this is more accurate than creating triangular filters directly)
        mel_filterbank[i, filter_indices[i] - 1 : filter_indices[i + 1]] = np.linspace(
            0,
            1,
            num=filter_indices[i + 1] - filter_indices[i] + 1,
        )
        mel_filterbank[
            i, filter_indices[i + 1] - 1 : filter_indices[i + 2]
        ] = np.linspace(
            1,
            0,
            num=filter_indices[i + 2] - filter_indices[i + 1] + 1,
        )

    # Make the mel filterbank sparse by saving it as a compressed sparse row matrix
    mel_filterbank = scipy.sparse.csr_matrix(mel_filterbank)

    return mel_filterbank


def melspectrogram(audio_signal, 
window_function, step_length, mel_filterbank): 325 | """ 326 | Compute the mel spectrogram using a mel filterbank. 327 | 328 | Inputs: 329 | audio_signal: audio signal (number_samples,) 330 | window_function: window function (window_length,) 331 | step_length: step length in samples 332 | mel_filterbank: mel filterbank (number_mels, number_frequencies) 333 | Output: 334 | mel_spectrogram: mel spectrogram (number_mels, number_times) 335 | 336 | Example: Compute and display the mel spectrogram. 337 | # Import the needed modules 338 | import numpy as np 339 | import scipy.signal 340 | import zaf 341 | import matplotlib.pyplot as plt 342 | 343 | # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels 344 | audio_signal, sampling_frequency = zaf.wavread("audio_file.wav") 345 | audio_signal = np.mean(audio_signal, 1) 346 | 347 | # Set the parameters for the Fourier analysis 348 | window_length = pow(2, int(np.ceil(np.log2(0.04*sampling_frequency)))) 349 | window_function = scipy.signal.hamming(window_length, sym=False) 350 | step_length = int(window_length/2) 351 | 352 | # Compute the mel filterbank 353 | number_mels = 128 354 | mel_filterbank = zaf.melfilterbank(sampling_frequency, window_length, number_mels) 355 | 356 | # Compute the mel spectrogram using the filterbank 357 | mel_spectrogram = zaf.melspectrogram(audio_signal, window_function, step_length, mel_filterbank) 358 | 359 | # Display the mel spectrogram in dB, seconds, and Hz 360 | number_samples = len(audio_signal) 361 | plt.figure(figsize=(14, 5)) 362 | zaf.melspecshow(mel_spectrogram, number_samples, sampling_frequency, window_length, xtick_step=1) 363 | plt.title("Mel spectrogram (dB)") 364 | plt.tight_layout() 365 | plt.show() 366 | """ 367 | 368 | # Compute the magnitude spectrogram (without the DC component and the mirrored frequencies) 369 | audio_stft = stft(audio_signal, window_function, step_length) 370 | audio_spectrogram = abs(audio_stft[1 : int(len(window_function) / 2) + 1, :]) 371 | 372 | # Compute the mel spectrogram by using the filterbank 373 | mel_spectrogram = np.matmul(mel_filterbank.toarray(), audio_spectrogram) 374 | 375 | return mel_spectrogram 376 | 377 | 378 | def mfcc( 379 | audio_signal, window_function, step_length, mel_filterbank, number_coefficients 380 | ): 381 | """ 382 | Compute the mel-frequency cepstral coefficients (MFCCs) using a mel filterbank. 383 | 384 | Inputs: 385 | audio_signal: audio signal (number_samples,) 386 | window_function: window function (window_length,) 387 | step_length: step length in samples 388 | mel_filterbank: mel filterbank (number_mels, number_frequencies) 389 | number_coefficients: number of coefficients (without the 0th coefficient) 390 | Output: 391 | audio_mfcc: audio MFCCs (number_coefficients, number_times) 392 | 393 | Example: Compute and display the MFCCs, delta MFCCs, and delta-delta MFCCs. 
394 | # Import the needed modules 395 | import numpy as np 396 | import scipy.signal 397 | import zaf 398 | import matplotlib.pyplot as plt 399 | 400 | # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels 401 | audio_signal, sampling_frequency = zaf.wavread("audio_file.wav") 402 | audio_signal = np.mean(audio_signal, 1) 403 | 404 | # Set the parameters for the Fourier analysis 405 | window_length = pow(2, int(np.ceil(np.log2(0.04*sampling_frequency)))) 406 | window_function = scipy.signal.hamming(window_length, sym=False) 407 | step_length = int(window_length/2) 408 | 409 | # Compute the mel filterbank 410 | number_mels = 40 411 | mel_filterbank = zaf.melfilterbank(sampling_frequency, window_length, number_mels) 412 | 413 | # Compute the MFCCs using the filterbank 414 | number_coefficients = 20 415 | audio_mfcc = zaf.mfcc(audio_signal, window_function, step_length, mel_filterbank, number_coefficients) 416 | 417 | # Compute the delta and delta-delta MFCCs 418 | audio_dmfcc = np.diff(audio_mfcc, n=1, axis=1) 419 | audio_ddmfcc = np.diff(audio_dmfcc, n=1, axis=1) 420 | 421 | # Display the MFCCs, delta MFCCs, and delta-delta MFCCs in seconds 422 | number_samples = len(audio_signal) 423 | xtick_step = 1 424 | plt.figure(figsize=(14, 7)) 425 | plt.subplot(3, 1, 1) 426 | zaf.mfccshow(audio_mfcc, number_samples, sampling_frequency, xtick_step), plt.title("MFCCs") 427 | plt.subplot(3, 1, 2) 428 | zaf.mfccshow(audio_dmfcc, number_samples, sampling_frequency, xtick_step), plt.title("Delta MFCCs") 429 | plt.subplot(3, 1, 3) 430 | zaf.mfccshow(audio_ddmfcc, number_samples, sampling_frequency, xtick_step), plt.title("Delta-delta MFCCs") 431 | plt.tight_layout() 432 | plt.show() 433 | """ 434 | 435 | # Compute the power spectrogram (without the DC component and the mirrored frequencies) 436 | audio_stft = stft(audio_signal, window_function, step_length) 437 | audio_spectrogram = np.power( 438 | abs(audio_stft[1 : int(len(window_function) / 2) + 1, :]), 2 439 | ) 440 | 441 | # Compute the discrete cosine transform of the log magnitude spectrogram 442 | # mapped onto the mel scale using the filter bank 443 | audio_mfcc = scipy.fftpack.dct( 444 | np.log( 445 | np.matmul(mel_filterbank.toarray(), audio_spectrogram) + np.finfo(float).eps 446 | ), 447 | axis=0, 448 | norm="ortho", 449 | ) 450 | 451 | # Keep only the first coefficients (without the 0th) 452 | audio_mfcc = audio_mfcc[1 : number_coefficients + 1, :] 453 | 454 | return audio_mfcc 455 | 456 | 457 | def cqtkernel( 458 | sampling_frequency, octave_resolution, minimum_frequency, maximum_frequency 459 | ): 460 | """ 461 | Compute the constant-Q transform (CQT) kernel. 462 | 463 | Inputs: 464 | sampling_frequency: sampling frequency in Hz 465 | octave_resolution: number of frequency channels per octave 466 | minimum_frequency: minimum frequency in Hz 467 | maximum_frequency: maximum frequency in Hz 468 | Output: 469 | cqt_kernel: CQT kernel (sparse) (number_frequencies, fft_length) 470 | 471 | Example: Compute and display a CQT kernel. 
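        # With these parameters, the quality factor will be Q = 1/(2**(1/24)-1), about 34.1,
        # and the FFT length the next power of two of 34.1*44100/55 (about 27363), i.e., 32768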
472 | # Import the needed modules 473 | import numpy as np 474 | import zaf 475 | import matplotlib.pyplot as plt 476 | 477 | # Set the parameters for the CQT kernel 478 | sampling_frequency = 44100 479 | octave_resolution = 24 480 | minimum_frequency = 55 481 | maximum_frequency = sampling_frequency/2 482 | 483 | # Compute the CQT kernel 484 | cqt_kernel = zaf.cqtkernel(sampling_frequency, octave_resolution, minimum_frequency, maximum_frequency) 485 | 486 | # Display the magnitude CQT kernel 487 | plt.figure(figsize=(14, 5)) 488 | plt.imshow(np.absolute(cqt_kernel).toarray(), aspect="auto", cmap="jet", origin="lower") 489 | plt.title("Magnitude CQT kernel") 490 | plt.xlabel("FFT index") 491 | plt.ylabel("CQT index") 492 | plt.tight_layout() 493 | plt.show() 494 | """ 495 | 496 | # Compute the constant ratio of frequency to resolution (= fk/(fk+1-fk)) 497 | quality_factor = 1 / (pow(2, 1 / octave_resolution) - 1) 498 | 499 | # Compute the number of frequency channels for the CQT 500 | number_frequencies = round( 501 | octave_resolution * np.log2(maximum_frequency / minimum_frequency) 502 | ) 503 | 504 | # Compute the window length for the FFT (= longest window for the minimum frequency) 505 | fft_length = int( 506 | pow( 507 | 2, np.ceil(np.log2(quality_factor * sampling_frequency / minimum_frequency)) 508 | ) 509 | ) 510 | 511 | # Initialize the (complex) CQT kernel 512 | cqt_kernel = np.zeros((number_frequencies, fft_length), dtype=complex) 513 | 514 | # Loop over the frequency channels 515 | for i in range(number_frequencies): 516 | 517 | # Derive the frequency value in Hz 518 | frequency_value = minimum_frequency * pow(2, i / octave_resolution) 519 | 520 | # Compute the window length in samples (nearest odd value to center the temporal kernel on 0) 521 | window_length = ( 522 | 2 * round(quality_factor * sampling_frequency / frequency_value / 2) + 1 523 | ) 524 | 525 | # Compute the temporal kernel for the current frequency (odd and symmetric) 526 | temporal_kernel = ( 527 | np.hamming(window_length) 528 | * np.exp( 529 | 2 530 | * np.pi 531 | * 1j 532 | * quality_factor 533 | * np.arange(-(window_length - 1) / 2, (window_length - 1) / 2 + 1) 534 | / window_length 535 | ) 536 | / window_length 537 | ) 538 | 539 | # Derive the pad width to center the temporal kernels 540 | pad_width = int((fft_length - window_length + 1) / 2) 541 | 542 | # Save the current temporal kernel at the center 543 | # (the zero-padded temporal kernels are not perfectly symmetric anymore because of the even length here) 544 | cqt_kernel[i, pad_width : pad_width + window_length] = temporal_kernel 545 | 546 | # Derive the spectral kernels by taking the FFT of the temporal kernels 547 | # (the spectral kernels are almost real because the temporal kernels are almost symmetric) 548 | cqt_kernel = np.fft.fft(cqt_kernel, axis=1) 549 | 550 | # Make the CQT kernel sparser by zeroing magnitudes below a threshold 551 | cqt_kernel[np.absolute(cqt_kernel) < 0.01] = 0 552 | 553 | # Make the CQT kernel sparse by saving it as a compressed sparse row matrix 554 | cqt_kernel = scipy.sparse.csr_matrix(cqt_kernel) 555 | 556 | # Get the final CQT kernel by using Parseval's theorem 557 | cqt_kernel = np.conjugate(cqt_kernel) / fft_length 558 | 559 | return cqt_kernel 560 | 561 | 562 | def cqtspectrogram(audio_signal, sampling_frequency, time_resolution, cqt_kernel): 563 | """ 564 | Compute the constant-Q transform (CQT) spectrogram using a CQT kernel. 
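    (The spectrogram is computed frame by frame: with time_resolution = 25 and, for example, a 44.1 kHz sampling frequency, frames are round(44100/25) = 1764 samples apart, so the 23-second example file would yield floor(23*44100/1764) = 575 time frames.)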

    Inputs:
        audio_signal: audio signal (number_samples,)
        sampling_frequency: sampling frequency in Hz
        time_resolution: number of time frames per second
        cqt_kernel: CQT kernel (number_frequencies, fft_length)
    Output:
        cqt_spectrogram: CQT spectrogram (number_frequencies, number_times)

    Example: Compute and display the CQT spectrogram.
        # Import the needed modules
        import numpy as np
        import zaf
        import matplotlib.pyplot as plt

        # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
        audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
        audio_signal = np.mean(audio_signal, 1)

        # Compute the CQT kernel
        octave_resolution = 24
        minimum_frequency = 55
        maximum_frequency = 3520
        cqt_kernel = zaf.cqtkernel(sampling_frequency, octave_resolution, minimum_frequency, maximum_frequency)

        # Compute the CQT spectrogram using the kernel
        time_resolution = 25
        cqt_spectrogram = zaf.cqtspectrogram(audio_signal, sampling_frequency, time_resolution, cqt_kernel)

        # Display the CQT spectrogram in dB, seconds, and Hz
        plt.figure(figsize=(14, 5))
        zaf.cqtspecshow(cqt_spectrogram, time_resolution, octave_resolution, minimum_frequency, xtick_step=1)
        plt.title("CQT spectrogram (dB)")
        plt.tight_layout()
        plt.show()
    """

    # Derive the number of time samples per time frame
    step_length = round(sampling_frequency / time_resolution)

    # Compute the number of time frames
    number_times = int(np.floor(len(audio_signal) / step_length))

    # Get the number of frequency channels and the FFT length
    number_frequencies, fft_length = np.shape(cqt_kernel)

    # Zero-pad the signal to center the CQT
    audio_signal = np.pad(
        audio_signal,
        (
            int(np.ceil((fft_length - step_length) / 2)),
            int(np.floor((fft_length - step_length) / 2)),
        ),
        "constant",
        constant_values=(0, 0),
    )

    # Initialize the CQT spectrogram
    cqt_spectrogram = np.zeros((number_frequencies, number_times))

    # Loop over the time frames
    i = 0
    for j in range(number_times):

        # Compute the magnitude CQT using the kernel
        cqt_spectrogram[:, j] = np.absolute(
            cqt_kernel * np.fft.fft(audio_signal[i : i + fft_length])
        )
        i = i + step_length

    return cqt_spectrogram


def cqtchromagram(
    audio_signal, sampling_frequency, time_resolution, octave_resolution, cqt_kernel
):
    """
    Compute the constant-Q transform (CQT) chromagram using a CQT kernel.

    Inputs:
        audio_signal: audio signal (number_samples,)
        sampling_frequency: sampling frequency in Hz
        time_resolution: number of time frames per second
        octave_resolution: number of frequency channels per octave
        cqt_kernel: CQT kernel (number_frequencies, fft_length)
    Output:
        cqt_chromagram: CQT chromagram (octave_resolution, number_times)

    Example: Compute and display the CQT chromagram.
654 |         # Import the needed modules
655 |         import numpy as np
656 |         import zaf
657 |         import matplotlib.pyplot as plt
658 | 
659 |         # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
660 |         audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
661 |         audio_signal = np.mean(audio_signal, 1)
662 | 
663 |         # Compute the CQT kernel
664 |         octave_resolution = 24
665 |         minimum_frequency = 55
666 |         maximum_frequency = 3520
667 |         cqt_kernel = zaf.cqtkernel(sampling_frequency, octave_resolution, minimum_frequency, maximum_frequency)
668 | 
669 |         # Compute the CQT chromagram using the kernel
670 |         time_resolution = 25
671 |         cqt_chromagram = zaf.cqtchromagram(audio_signal, sampling_frequency, time_resolution, octave_resolution, cqt_kernel)
672 | 
673 |         # Display the CQT chromagram in seconds
674 |         plt.figure(figsize=(14, 3))
675 |         zaf.cqtchromshow(cqt_chromagram, time_resolution, xtick_step=1)
676 |         plt.title("CQT chromagram")
677 |         plt.tight_layout()
678 |         plt.show()
679 |     """
680 | 
681 |     # Compute the CQT spectrogram
682 |     cqt_spectrogram = cqtspectrogram(
683 |         audio_signal, sampling_frequency, time_resolution, cqt_kernel
684 |     )
685 | 
686 |     # Get the number of frequency channels and time frames
687 |     number_frequencies, number_times = np.shape(cqt_spectrogram)
688 | 
689 |     # Initialize the CQT chromagram
690 |     cqt_chromagram = np.zeros((octave_resolution, number_times))
691 | 
692 |     # Loop over the chroma channels
693 |     for i in range(octave_resolution):
694 | 
695 |         # Sum the magnitudes of the frequency channels for every chroma
696 |         cqt_chromagram[i, :] = np.sum(
697 |             cqt_spectrogram[i:number_frequencies:octave_resolution, :], axis=0
698 |         )
699 | 
700 |     return cqt_chromagram
701 | 
702 | 
703 | def dct(audio_signal, dct_type):
704 |     """
705 |     Compute the discrete cosine transform (DCT) using the fast Fourier transform (FFT).
706 | 
707 |     Inputs:
708 |         audio_signal: audio signal (window_length,)
709 |         dct_type: DCT type (1, 2, 3, or 4)
710 |     Output:
711 |         audio_dct: audio DCT (number_frequencies,)
712 | 
713 |     Example: Compute the 4 different DCTs and compare them to SciPy's DCTs.
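        # Note: the DCTs below are orthogonal, so they should match SciPy's norm="ortho"
        # versions up to numerical precision (the differences in the last row should be ~0)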
714 |         # Import the needed modules
715 |         import numpy as np
716 |         import zaf
717 |         import scipy.fftpack
718 |         import matplotlib.pyplot as plt
719 | 
720 |         # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
721 |         audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
722 |         audio_signal = np.mean(audio_signal, 1)
723 | 
724 |         # Get an audio segment for a given window length
725 |         window_length = 1024
726 |         audio_segment = audio_signal[0:window_length]
727 | 
728 |         # Compute the DCT-I, II, III, and IV
729 |         audio_dct1 = zaf.dct(audio_segment, 1)
730 |         audio_dct2 = zaf.dct(audio_segment, 2)
731 |         audio_dct3 = zaf.dct(audio_segment, 3)
732 |         audio_dct4 = zaf.dct(audio_segment, 4)
733 | 
734 |         # Compute SciPy's DCT-I, II, III, and IV (orthogonalized)
735 |         scipy_dct1 = scipy.fftpack.dct(audio_segment, type=1, norm="ortho")
736 |         scipy_dct2 = scipy.fftpack.dct(audio_segment, type=2, norm="ortho")
737 |         scipy_dct3 = scipy.fftpack.dct(audio_segment, type=3, norm="ortho")
738 |         scipy_dct4 = scipy.fftpack.dct(audio_segment, type=4, norm="ortho")
739 | 
740 |         # Plot the DCT-I, II, III, and IV, SciPy's versions, and their differences
741 |         plt.figure(figsize=(14, 7))
742 |         plt.subplot(3, 4, 1), plt.plot(audio_dct1), plt.autoscale(tight=True), plt.title("DCT-I")
743 |         plt.subplot(3, 4, 2), plt.plot(audio_dct2), plt.autoscale(tight=True), plt.title("DCT-II")
744 |         plt.subplot(3, 4, 3), plt.plot(audio_dct3), plt.autoscale(tight=True), plt.title("DCT-III")
745 |         plt.subplot(3, 4, 4), plt.plot(audio_dct4), plt.autoscale(tight=True), plt.title("DCT-IV")
746 |         plt.subplot(3, 4, 5), plt.plot(scipy_dct1), plt.autoscale(tight=True), plt.title("SciPy's DCT-I")
747 |         plt.subplot(3, 4, 6), plt.plot(scipy_dct2), plt.autoscale(tight=True), plt.title("SciPy's DCT-II")
748 |         plt.subplot(3, 4, 7), plt.plot(scipy_dct3), plt.autoscale(tight=True), plt.title("SciPy's DCT-III")
749 |         plt.subplot(3, 4, 8), plt.plot(scipy_dct4), plt.autoscale(tight=True), plt.title("SciPy's DCT-IV")
750 |         plt.subplot(3, 4, 9), plt.plot(audio_dct1-scipy_dct1), plt.autoscale(tight=True), plt.title("DCT-I - SciPy's DCT-I")
751 |         plt.subplot(3, 4, 10), plt.plot(audio_dct2-scipy_dct2), plt.autoscale(tight=True), plt.title("DCT-II - SciPy's DCT-II")
752 |         plt.subplot(3, 4, 11), plt.plot(audio_dct3-scipy_dct3), plt.autoscale(tight=True), plt.title("DCT-III - SciPy's DCT-III")
753 |         plt.subplot(3, 4, 12), plt.plot(audio_dct4-scipy_dct4), plt.autoscale(tight=True), plt.title("DCT-IV - SciPy's DCT-IV")
754 |         plt.tight_layout()
755 |         plt.show()
756 |     """
757 | 
758 |     # Check if the DCT type is I, II, III, or IV
759 |     if dct_type == 1:
760 | 
761 |         # Get the number of samples
762 |         window_length = len(audio_signal)
763 | 
764 |         # Pre-process the signal to make the DCT-I matrix orthogonal
765 |         # (copy the signal to avoid modifying it outside of the function)
766 |         audio_signal = audio_signal.copy()
767 |         audio_signal[[0, -1]] = audio_signal[[0, -1]] * np.sqrt(2)
768 | 
769 |         # Compute the DCT-I using the FFT
770 |         audio_dct = np.concatenate((audio_signal, audio_signal[-2:0:-1]))
771 |         audio_dct = np.fft.fft(audio_dct)
772 |         audio_dct = np.real(audio_dct[0:window_length]) / 2
773 | 
774 |         # Post-process the results to make the DCT-I matrix orthogonal
775 |         audio_dct[[0, -1]] = audio_dct[[0, -1]] / np.sqrt(2)
776 |         audio_dct = audio_dct * np.sqrt(2 / (window_length - 1))
777 | 
778 |         return audio_dct
779 | 
780 |     elif dct_type == 2:
781 | 
782 |         # Get the number of samples
783 |         window_length = len(audio_signal)
784 | 
785 |         # Compute the DCT-II using the FFT
786 |         audio_dct = np.zeros(4 * window_length)
787 |         audio_dct[1 : 2 * window_length : 2] = audio_signal
788 |         audio_dct[2 * window_length + 1 : 4 * window_length : 2] = audio_signal[::-1]
789 |         audio_dct = np.fft.fft(audio_dct)
790 |         audio_dct = np.real(audio_dct[0:window_length]) / 2
791 | 
792 |         # Post-process the results to make the DCT-II matrix orthogonal
793 |         audio_dct[0] = audio_dct[0] / np.sqrt(2)
794 |         audio_dct = audio_dct * np.sqrt(2 / window_length)
795 | 
796 |         return audio_dct
797 | 
798 |     elif dct_type == 3:
799 | 
800 |         # Get the number of samples
801 |         window_length = len(audio_signal)
802 | 
803 |         # Pre-process the signal to make the DCT-III matrix orthogonal
804 |         # (copy the signal to avoid modifying it outside of the function)
805 |         audio_signal = audio_signal.copy()
806 |         audio_signal[0] = audio_signal[0] * np.sqrt(2)
807 | 
808 |         # Compute the DCT-III using the FFT
809 |         audio_dct = np.zeros(4 * window_length)
810 |         audio_dct[0:window_length] = audio_signal
811 |         audio_dct[window_length + 1 : 2 * window_length + 1] = -audio_signal[::-1]
812 |         audio_dct[2 * window_length + 1 : 3 * window_length] = -audio_signal[1:]
813 |         audio_dct[3 * window_length + 1 : 4 * window_length] = audio_signal[:0:-1]
814 |         audio_dct = np.fft.fft(audio_dct)
815 |         audio_dct = np.real(audio_dct[1 : 2 * window_length : 2]) / 4
816 | 
817 |         # Post-process the results to make the DCT-III matrix orthogonal
818 |         audio_dct = audio_dct * np.sqrt(2 / window_length)
819 | 
820 |         return audio_dct
821 | 
822 |     elif dct_type == 4:
823 | 
824 |         # Get the number of samples
825 |         window_length = len(audio_signal)
826 | 
827 |         # Compute the DCT-IV using the FFT
828 |         audio_dct = np.zeros(8 * window_length)
829 |         audio_dct[1 : 2 * window_length : 2] = audio_signal
830 |         audio_dct[2 * window_length + 1 : 4 * window_length : 2] = -audio_signal[::-1]
831 |         audio_dct[4 * window_length + 1 : 6 * window_length : 2] = -audio_signal
832 |         audio_dct[6 * window_length + 1 : 8 * window_length : 2] = audio_signal[::-1]
833 |         audio_dct = np.fft.fft(audio_dct)
834 |         audio_dct = np.real(audio_dct[1 : 2 * window_length : 2]) / 4
835 | 
836 |         # Post-process the results to make the DCT-IV matrix orthogonal
837 |         audio_dct = audio_dct * np.sqrt(2 / window_length)
838 | 
839 |         return audio_dct
840 | 
841 | 
842 | def dst(audio_signal, dst_type):
843 |     """
844 |     Compute the discrete sine transform (DST) using the fast Fourier transform (FFT).
845 | 
846 |     Inputs:
847 |         audio_signal: audio signal (window_length,)
848 |         dst_type: DST type (1, 2, 3, or 4)
849 |     Output:
850 |         audio_dst: audio DST (number_frequencies,)
851 | 
852 |     Example: Compute the 4 different DSTs and compare their respective inverses with the original audio.
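        # Note: the orthogonal DST-I and DST-IV are each their own inverse, while the
        # DST-II and the DST-III are inverses of each other, hence the pairings used below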
853 |         # Import the needed modules
854 |         import numpy as np
855 |         import zaf
856 |         import matplotlib.pyplot as plt
857 | 
858 |         # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
859 |         audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
860 |         audio_signal = np.mean(audio_signal, 1)
861 | 
862 |         # Get an audio segment for a given window length
863 |         window_length = 1024
864 |         audio_segment = audio_signal[0:window_length]
865 | 
866 |         # Compute the DST-I, II, III, and IV
867 |         audio_dst1 = zaf.dst(audio_segment, 1)
868 |         audio_dst2 = zaf.dst(audio_segment, 2)
869 |         audio_dst3 = zaf.dst(audio_segment, 3)
870 |         audio_dst4 = zaf.dst(audio_segment, 4)
871 | 
872 |         # Compute their respective inverses, i.e., DST-I, III, II, and IV
873 |         audio_idst1 = zaf.dst(audio_dst1, 1)
874 |         audio_idst2 = zaf.dst(audio_dst2, 3)
875 |         audio_idst3 = zaf.dst(audio_dst3, 2)
876 |         audio_idst4 = zaf.dst(audio_dst4, 4)
877 | 
878 |         # Plot the DST-I, II, III, and IV, their respective inverses, and their differences with the original audio segment
879 |         plt.figure(figsize=(14, 7))
880 |         plt.subplot(3, 4, 1), plt.plot(audio_dst1), plt.autoscale(tight=True), plt.title("DST-I")
881 |         plt.subplot(3, 4, 2), plt.plot(audio_dst2), plt.autoscale(tight=True), plt.title("DST-II")
882 |         plt.subplot(3, 4, 3), plt.plot(audio_dst3), plt.autoscale(tight=True), plt.title("DST-III")
883 |         plt.subplot(3, 4, 4), plt.plot(audio_dst4), plt.autoscale(tight=True), plt.title("DST-IV")
884 |         plt.subplot(3, 4, 5), plt.plot(audio_idst1), plt.autoscale(tight=True), plt.title("Inverse DST-I (DST-I)")
885 |         plt.subplot(3, 4, 6), plt.plot(audio_idst2), plt.autoscale(tight=True), plt.title("Inverse DST-II (DST-III)")
886 |         plt.subplot(3, 4, 7), plt.plot(audio_idst3), plt.autoscale(tight=True), plt.title("Inverse DST-III (DST-II)")
887 |         plt.subplot(3, 4, 8), plt.plot(audio_idst4), plt.autoscale(tight=True), plt.title("Inverse DST-IV (DST-IV)")
888 |         plt.subplot(3, 4, 9), plt.plot(audio_idst1-audio_segment), plt.autoscale(tight=True)
889 |         plt.title("Inverse DST-I - audio segment")
890 |         plt.subplot(3, 4, 10), plt.plot(audio_idst2-audio_segment), plt.autoscale(tight=True)
891 |         plt.title("Inverse DST-II - audio segment")
892 |         plt.subplot(3, 4, 11), plt.plot(audio_idst3-audio_segment), plt.autoscale(tight=True)
893 |         plt.title("Inverse DST-III - audio segment")
894 |         plt.subplot(3, 4, 12), plt.plot(audio_idst4-audio_segment), plt.autoscale(tight=True)
895 |         plt.title("Inverse DST-IV - audio segment")
896 |         plt.tight_layout()
897 |         plt.show()
898 |     """
899 | 
900 |     # Check if the DST type is I, II, III, or IV
901 |     if dst_type == 1:
902 | 
903 |         # Get the number of samples
904 |         window_length = len(audio_signal)
905 | 
906 |         # Compute the DST-I using the FFT
907 |         audio_dst = np.zeros(2 * window_length + 2)
908 |         audio_dst[1 : window_length + 1] = audio_signal
909 |         audio_dst[window_length + 2 :] = -audio_signal[::-1]
910 |         audio_dst = np.fft.fft(audio_dst)
911 |         audio_dst = -np.imag(audio_dst[1 : window_length + 1]) / 2
912 | 
913 |         # Post-process the results to make the DST-I matrix orthogonal
914 |         audio_dst = audio_dst * np.sqrt(2 / (window_length + 1))
915 | 
916 |         return audio_dst
917 | 
918 |     elif dst_type == 2:
919 | 
920 |         # Get the number of samples
921 |         window_length = len(audio_signal)
922 | 
923 |         # Compute the DST-II using the FFT
924 |         audio_dst = np.zeros(4 * window_length)
925 |         audio_dst[1 : 2 * window_length : 2] = audio_signal
926 |         audio_dst[2 * window_length + 1 : 4 * window_length : 2] = -audio_signal[-1::-1]
927 |         audio_dst = np.fft.fft(audio_dst)
928 |         audio_dst = -np.imag(audio_dst[1 : window_length + 1]) / 2
929 | 
930 |         # Post-process the results to make the DST-II matrix orthogonal
931 |         audio_dst[-1] = audio_dst[-1] / np.sqrt(2)
932 |         audio_dst = audio_dst * np.sqrt(2 / window_length)
933 | 
934 |         return audio_dst
935 | 
936 |     elif dst_type == 3:
937 | 
938 |         # Get the number of samples
939 |         window_length = len(audio_signal)
940 | 
941 |         # Pre-process the signal to make the DST-III matrix orthogonal
942 |         # (copy the signal to avoid modifying it outside of the function)
943 |         audio_signal = audio_signal.copy()
944 |         audio_signal[-1] = audio_signal[-1] * np.sqrt(2)
945 | 
946 |         # Compute the DST-III using the FFT
947 |         audio_dst = np.zeros(4 * window_length)
948 |         audio_dst[1 : window_length + 1] = audio_signal
949 |         audio_dst[window_length + 1 : 2 * window_length] = audio_signal[-2::-1]
950 |         audio_dst[2 * window_length + 1 : 3 * window_length + 1] = -audio_signal
951 |         audio_dst[3 * window_length + 1 : 4 * window_length] = -audio_signal[-2::-1]
952 |         audio_dst = np.fft.fft(audio_dst)
953 |         audio_dst = -np.imag(audio_dst[1 : 2 * window_length : 2]) / 4
954 | 
955 |         # Post-process the results to make the DST-III matrix orthogonal
956 |         audio_dst = audio_dst * np.sqrt(2 / window_length)
957 | 
958 |         return audio_dst
959 | 
960 |     elif dst_type == 4:
961 | 
962 |         # Initialize the DST-IV
963 |         window_length = len(audio_signal)
964 |         audio_dst = np.zeros(8 * window_length)
965 | 
966 |         # Compute the DST-IV using the FFT
967 |         audio_dst[1 : 2 * window_length : 2] = audio_signal
968 |         audio_dst[2 * window_length + 1 : 4 * window_length : 2] = audio_signal[
969 |             window_length - 1 :: -1
970 |         ]
971 |         audio_dst[4 * window_length + 1 : 6 * window_length : 2] = -audio_signal
972 |         audio_dst[6 * window_length + 1 : 8 * window_length : 2] = -audio_signal[
973 |             window_length - 1 :: -1
974 |         ]
975 |         audio_dst = np.fft.fft(audio_dst)
976 |         audio_dst = -np.imag(audio_dst[1 : 2 * window_length : 2]) / 4
977 | 
978 |         # Post-process the results to make the DST-IV matrix orthogonal
979 |         audio_dst = audio_dst * np.sqrt(2 / window_length)
980 | 
981 |         return audio_dst
982 | 
983 | 
984 | def mdct(audio_signal, window_function):
985 |     """
986 |     Compute the modified discrete cosine transform (MDCT) using the fast Fourier transform (FFT).
987 | 
988 |     Inputs:
989 |         audio_signal: audio signal (number_samples,)
990 |         window_function: window function (window_length,)
991 |     Output:
992 |         audio_mdct: audio MDCT (number_frequencies, number_times)
993 | 
994 |     Example: Compute and display the MDCT as used in the AC-3 audio coding format.
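        # Note: the KBD window built below satisfies the Princen-Bradley condition
        # (window_function[n]**2 + window_function[n+window_length//2]**2 == 1), which
        # the MDCT needs for perfect reconstruction with 50% overlap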
995 |         # Import the needed modules
996 |         import numpy as np
997 |         import zaf
998 |         import matplotlib.pyplot as plt
999 | 
1000 |         # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
1001 |         audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
1002 |         audio_signal = np.mean(audio_signal, 1)
1003 | 
1004 |         # Compute the Kaiser-Bessel-derived (KBD) window as used in the AC-3 audio coding format
1005 |         window_length = 512
1006 |         alpha_value = 5
1007 |         window_function = np.kaiser(int(window_length/2)+1, alpha_value*np.pi)
1008 |         window_function2 = np.cumsum(window_function[0:int(window_length/2)])
1009 |         window_function = np.sqrt(np.concatenate((window_function2, window_function2[::-1]))
1010 |                                 /np.sum(window_function))
1011 | 
1012 |         # Compute the MDCT
1013 |         audio_mdct = zaf.mdct(audio_signal, window_function)
1014 | 
1015 |         # Display the MDCT in dB, seconds, and Hz
1016 |         number_samples = len(audio_signal)
1017 |         plt.figure(figsize=(14, 7))
1018 |         zaf.specshow(np.absolute(audio_mdct), number_samples, sampling_frequency, xtick_step=1, ytick_step=1000)
1019 |         plt.title("MDCT (dB)")
1020 |         plt.tight_layout()
1021 |         plt.show()
1022 |     """
1023 | 
1024 |     # Get the number of samples and the window length in samples
1025 |     number_samples = len(audio_signal)
1026 |     window_length = len(window_function)
1027 | 
1028 |     # Derive the step length and the number of frequencies (for clarity)
1029 |     step_length = int(window_length / 2)
1030 |     number_frequencies = int(window_length / 2)
1031 | 
1032 |     # Derive the number of time frames
1033 |     number_times = int(np.ceil(number_samples / step_length)) + 1
1034 | 
1035 |     # Zero-pad the start and the end of the signal to center the windows
1036 |     audio_signal = np.pad(
1037 |         audio_signal,
1038 |         (step_length, (number_times + 1) * step_length - number_samples),
1039 |         "constant",
1040 |         constant_values=0,
1041 |     )
1042 | 
1043 |     # Initialize the MDCT
1044 |     audio_mdct = np.zeros((number_frequencies, number_times))
1045 | 
1046 |     # Prepare the pre-processing and post-processing arrays
1047 |     preprocessing_array = np.exp(
1048 |         -1j * np.pi / window_length * np.arange(0, window_length)
1049 |     )
1050 |     postprocessing_array = np.exp(
1051 |         -1j
1052 |         * np.pi
1053 |         / window_length
1054 |         * (window_length / 2 + 1)
1055 |         * np.arange(0.5, window_length / 2 + 0.5)
1056 |     )
1057 | 
1058 |     # Loop over the time frames
1059 |     # (Do the pre- and post-processing, and take the FFT in the loop to avoid storing frames twice as long)
1060 |     i = 0
1061 |     for j in range(number_times):
1062 | 
1063 |         # Window the signal
1064 |         audio_segment = audio_signal[i : i + window_length] * window_function
1065 |         i = i + step_length
1066 | 
1067 |         # Compute the Fourier transform of the windowed segment using the FFT after pre-processing
1068 |         audio_segment = np.fft.fft(audio_segment * preprocessing_array)
1069 | 
1070 |         # Truncate to the first half before post-processing (and take the real part to ensure real values)
1071 |         audio_mdct[:, j] = np.real(
1072 |             audio_segment[0:number_frequencies] * postprocessing_array
1073 |         )
1074 | 
1075 |     return audio_mdct
1076 | 
1077 | 
1078 | def imdct(audio_mdct, window_function):
1079 |     """
1080 |     Compute the inverse modified discrete cosine transform (MDCT) using the fast Fourier transform (FFT).
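    The frames are inverse-transformed with the FFT, windowed again, and overlap-added
    with a half-window step, so that the time-domain aliasing cancels out (TDAC).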
1081 | 
1082 |     Inputs:
1083 |         audio_mdct: audio MDCT (number_frequencies, number_times)
1084 |         window_function: window function (window_length,)
1085 |     Output:
1086 |         audio_signal: audio signal (number_samples,)
1087 | 
1088 |     Example: Verify that the MDCT is perfectly invertible.
1089 |         # Import the needed modules
1090 |         import numpy as np
1091 |         import zaf
1092 |         import matplotlib.pyplot as plt
1093 | 
1094 |         # Read the audio signal (normalized) with its sampling frequency in Hz, and average it over its channels
1095 |         audio_signal, sampling_frequency = zaf.wavread("audio_file.wav")
1096 |         audio_signal = np.mean(audio_signal, 1)
1097 | 
1098 |         # Compute the MDCT with a slope function as used in the Vorbis audio coding format
1099 |         window_length = 2048
1100 |         window_function = np.sin(np.pi/2*pow(np.sin(np.pi/window_length*np.arange(0.5, window_length+0.5)), 2))
1101 |         audio_mdct = zaf.mdct(audio_signal, window_function)
1102 | 
1103 |         # Compute the inverse MDCT
1104 |         audio_signal2 = zaf.imdct(audio_mdct, window_function)
1105 |         audio_signal2 = audio_signal2[0:len(audio_signal)]
1106 | 
1107 |         # Compute the differences between the original signal and the resynthesized one
1108 |         audio_differences = audio_signal-audio_signal2
1109 |         y_max = np.max(np.absolute(audio_differences))
1110 | 
1111 |         # Display the original and resynthesized signals, and their differences in seconds
1112 |         xtick_step = 1
1113 |         plt.figure(figsize=(14, 7))
1114 |         plt.subplot(3, 1, 1), zaf.sigplot(audio_signal, sampling_frequency, xtick_step)
1115 |         plt.ylim(-1, 1), plt.title("Original signal")
1116 |         plt.subplot(3, 1, 2), zaf.sigplot(audio_signal2, sampling_frequency, xtick_step)
1117 |         plt.ylim(-1, 1), plt.title("Resynthesized signal")
1118 |         plt.subplot(3, 1, 3), zaf.sigplot(audio_differences, sampling_frequency, xtick_step)
1119 |         plt.ylim(-y_max, y_max), plt.title("Original - resynthesized signal")
1120 |         plt.tight_layout()
1121 |         plt.show()
1122 |     """
1123 | 
1124 |     # Get the number of frequency channels and time frames
1125 |     number_frequencies, number_times = np.shape(audio_mdct)
1126 | 
1127 |     # Derive the window length and the step length in samples (for clarity)
1128 |     window_length = 2 * number_frequencies
1129 |     step_length = number_frequencies
1130 | 
1131 |     # Derive the number of samples for the signal
1132 |     number_samples = step_length * (number_times + 1)
1133 | 
1134 |     # Initialize the audio signal
1135 |     audio_signal = np.zeros(number_samples)
1136 | 
1137 |     # Prepare the pre-processing and post-processing arrays
1138 |     preprocessing_array = np.exp(
1139 |         -1j
1140 |         * np.pi
1141 |         / (2 * number_frequencies)
1142 |         * (number_frequencies + 1)
1143 |         * np.arange(0, number_frequencies)
1144 |     )
1145 |     postprocessing_array = (
1146 |         np.exp(
1147 |             -1j
1148 |             * np.pi
1149 |             / (2 * number_frequencies)
1150 |             * np.arange(
1151 |                 0.5 + number_frequencies / 2,
1152 |                 2 * number_frequencies + number_frequencies / 2 + 0.5,
1153 |             )
1154 |         )
1155 |         / number_frequencies
1156 |     )
1157 | 
1158 |     # Compute the Fourier transform of the frames using the FFT after pre-processing (zero-pad to get twice the length)
1159 |     audio_mdct = np.fft.fft(
1160 |         audio_mdct * preprocessing_array[:, np.newaxis],
1161 |         n=2 * number_frequencies,
1162 |         axis=0,
1163 |     )
1164 | 
1165 |     # Apply the window function to the frames after post-processing (take the real part to ensure real values)
1166 |     audio_mdct = 2 * (
1167 |         np.real(audio_mdct * postprocessing_array[:, np.newaxis])
1168 |         * window_function[:, np.newaxis]
1169 |     )
1170 | 
1171 |     # Loop over the time frames
1172 |     i = 0
1173 |     for j in range(number_times):
1174 | 
1175 |         # Recover the signal with the time-domain aliasing cancellation (TDAC) principle
1176 |         audio_signal[i : i + window_length] = (
1177 |             audio_signal[i : i + window_length] + audio_mdct[:, j]
1178 |         )
1179 |         i = i + step_length
1180 | 
1181 |     # Remove the zero-padding at the start and at the end of the signal
1182 |     audio_signal = audio_signal[step_length : -step_length - 1]
1183 | 
1184 |     return audio_signal
1185 | 
1186 | 
1187 | def wavread(audio_file):
1188 |     """
1189 |     Read a WAVE file (using SciPy).
1190 | 
1191 |     Input:
1192 |         audio_file: path to an audio file
1193 |     Outputs:
1194 |         audio_signal: audio signal (number_samples, number_channels)
1195 |         sampling_frequency: sampling frequency in Hz
1196 |     """
1197 | 
1198 |     # Read the audio file and return the sampling frequency in Hz and the non-normalized signal using SciPy
1199 |     sampling_frequency, audio_signal = scipy.io.wavfile.read(audio_file)
1200 | 
1201 |     # Normalize the signal by the data range given the size of an item in bytes
1202 |     audio_signal = audio_signal / pow(2, audio_signal.itemsize * 8 - 1)
1203 | 
1204 |     return audio_signal, sampling_frequency
1205 | 
1206 | 
1207 | def wavwrite(audio_signal, sampling_frequency, audio_file):
1208 |     """
1209 |     Write a WAVE file (using SciPy).
1210 | 
1211 |     Inputs:
1212 |         audio_signal: audio signal (number_samples, number_channels)
1213 |         sampling_frequency: sampling frequency in Hz
1214 |     Output:
1215 |         audio_file: path to the audio file to write
1216 |     """
1217 | 
1218 |     # Write the audio signal using SciPy
1219 |     scipy.io.wavfile.write(audio_file, sampling_frequency, audio_signal)
1220 | 
1221 | 
1222 | def sigplot(
1223 |     audio_signal,
1224 |     sampling_frequency,
1225 |     xtick_step=1,
1226 | ):
1227 |     """
1228 |     Plot a signal in seconds.
1229 | 
1230 |     Inputs:
1231 |         audio_signal: audio signal (number_samples, number_channels) (number_channels>=1)
1232 |         sampling_frequency: sampling frequency in Hz
1233 |         xtick_step: step for the x-axis ticks in seconds (default: 1 second)
1234 |     """
1235 | 
1236 |     # Get the number of samples
1237 |     number_samples = np.shape(audio_signal)[0]
1238 | 
1239 |     # Prepare the tick locations and labels for the x-axis
1240 |     xtick_locations = np.arange(
1241 |         xtick_step * sampling_frequency,
1242 |         number_samples,
1243 |         xtick_step * sampling_frequency,
1244 |     )
1245 |     xtick_labels = np.arange(
1246 |         xtick_step, number_samples / sampling_frequency, xtick_step
1247 |     ).astype(int)
1248 | 
1249 |     # Plot the signal in seconds
1250 |     plt.plot(audio_signal)
1251 |     plt.autoscale(tight=True)
1252 |     plt.xticks(ticks=xtick_locations, labels=xtick_labels)
1253 |     plt.xlabel("Time (s)")
1254 | 
1255 | 
1256 | def specshow(
1257 |     audio_spectrogram,
1258 |     number_samples,
1259 |     sampling_frequency,
1260 |     xtick_step=1,
1261 |     ytick_step=1000,
1262 | ):
1263 |     """
1264 |     Display a spectrogram in dB, seconds, and Hz.
1265 | 1266 | Inputs: 1267 | audio_spectrogram: audio spectrogram (without DC and mirrored frequencies) (number_frequencies, number_times) 1268 | number_samples: number of samples from the original signal 1269 | sampling_frequency: sampling frequency from the original signal in Hz 1270 | xtick_step: step for the x-axis ticks in seconds (default: 1 second) 1271 | ytick_step: step for the y-axis ticks in Hz (default: 1000 Hz) 1272 | """ 1273 | 1274 | # Get the number of frequency channels and time frames 1275 | number_frequencies, number_times = np.shape(audio_spectrogram) 1276 | 1277 | # Derive the number of seconds and Hertz 1278 | number_seconds = number_samples / sampling_frequency 1279 | number_hertz = sampling_frequency / 2 1280 | 1281 | # Derive the number of time frames per second and the number of frequency channels per Hz 1282 | time_resolution = number_times / number_seconds 1283 | frequency_resolution = number_frequencies / number_hertz 1284 | 1285 | # Prepare the tick locations and labels for the x-axis 1286 | xtick_locations = np.arange( 1287 | xtick_step * time_resolution, 1288 | number_times, 1289 | xtick_step * time_resolution, 1290 | ) 1291 | xtick_labels = np.arange(xtick_step, number_seconds, xtick_step).astype(int) 1292 | 1293 | # Prepare the tick locations and labels for the y-axis 1294 | ytick_locations = np.arange( 1295 | ytick_step * frequency_resolution, 1296 | number_frequencies, 1297 | ytick_step * frequency_resolution, 1298 | ) 1299 | ytick_labels = np.arange(ytick_step, number_hertz, ytick_step).astype(int) 1300 | 1301 | # Display the spectrogram in dB, seconds, and Hz 1302 | plt.imshow( 1303 | 20 * np.log10(audio_spectrogram), aspect="auto", cmap="jet", origin="lower" 1304 | ) 1305 | plt.xticks(ticks=xtick_locations, labels=xtick_labels) 1306 | plt.yticks(ticks=ytick_locations, labels=ytick_labels) 1307 | plt.xlabel("Time (s)") 1308 | plt.ylabel("Frequency (Hz)") 1309 | 1310 | 1311 | def melspecshow( 1312 | mel_spectrogram, 1313 | number_samples, 1314 | sampling_frequency, 1315 | window_length, 1316 | xtick_step=1, 1317 | ): 1318 | """ 1319 | Display a mel spectrogram in dB, seconds, and Hz. 
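    (The y-axis labels are obtained by mapping a linearly spaced mel scale back to Hz,
    so they are logarithmically spaced.)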
1320 | 1321 | Inputs: 1322 | mel_spectrogram: mel spectrogram (number_mels, number_times) 1323 | number_samples: number of samples from the original signal 1324 | sampling_frequency: sampling frequency from the original signal in Hz 1325 | window_length: window length from the Fourier analysis in number of samples 1326 | xtick_step: step for the x-axis ticks in seconds (default: 1 second) 1327 | """ 1328 | 1329 | # Get the number of mels and time frames 1330 | number_mels, number_times = np.shape(mel_spectrogram) 1331 | 1332 | # Derive the number of seconds and the number of time frames per second 1333 | number_seconds = number_samples / sampling_frequency 1334 | time_resolution = number_times / number_seconds 1335 | 1336 | # Derive the minimum and maximum mel 1337 | minimum_mel = 2595 * np.log10(1 + (sampling_frequency / window_length) / 700) 1338 | maximum_mel = 2595 * np.log10(1 + (sampling_frequency / 2) / 700) 1339 | 1340 | # Compute the mel scale (linearly spaced) 1341 | mel_scale = np.linspace(minimum_mel, maximum_mel, number_mels) 1342 | 1343 | # Derive the Hertz scale (log spaced) 1344 | hertz_scale = 700 * (np.power(10, mel_scale / 2595) - 1) 1345 | 1346 | # Prepare the tick locations and labels for the x-axis 1347 | xtick_locations = np.arange( 1348 | xtick_step * time_resolution, 1349 | number_times, 1350 | xtick_step * time_resolution, 1351 | ) 1352 | xtick_labels = np.arange(xtick_step, number_seconds, xtick_step).astype(int) 1353 | 1354 | # Prepare the tick locations and labels for the y-axis 1355 | ytick_locations = np.arange(0, number_mels, 8) 1356 | ytick_labels = hertz_scale[::8].astype(int) 1357 | 1358 | # Display the mel spectrogram in dB, seconds, and Hz 1359 | plt.imshow( 1360 | 20 * np.log10(mel_spectrogram), aspect="auto", cmap="jet", origin="lower" 1361 | ) 1362 | plt.xticks(ticks=xtick_locations, labels=xtick_labels) 1363 | plt.yticks(ticks=ytick_locations, labels=ytick_labels) 1364 | plt.xlabel("Time (s)") 1365 | plt.ylabel("Frequency (Hz)") 1366 | 1367 | 1368 | def mfccshow( 1369 | audio_mfcc, 1370 | number_samples, 1371 | sampling_frequency, 1372 | xtick_step=1, 1373 | ): 1374 | """ 1375 | Display MFCCs in seconds. 
1376 | 
1377 |     Inputs:
1378 |         audio_mfcc: audio MFCCs (number_coefficients, number_times)
1379 |         number_samples: number of samples from the original signal
1380 |         sampling_frequency: sampling frequency from the original signal in Hz
1381 |         xtick_step: step for the x-axis ticks in seconds (default: 1 second)
1382 |     """
1383 | 
1384 |     # Get the number of time frames
1385 |     number_times = np.shape(audio_mfcc)[1]
1386 | 
1387 |     # Derive the number of seconds and the number of time frames per second
1388 |     number_seconds = number_samples / sampling_frequency
1389 |     time_resolution = number_times / number_seconds
1390 | 
1391 |     # Prepare the tick locations and labels for the x-axis
1392 |     xtick_locations = np.arange(
1393 |         xtick_step * time_resolution,
1394 |         number_times,
1395 |         xtick_step * time_resolution,
1396 |     )
1397 |     xtick_labels = np.arange(xtick_step, number_seconds, xtick_step).astype(int)
1398 | 
1399 |     # Display the MFCCs in seconds
1400 |     plt.imshow(audio_mfcc, aspect="auto", cmap="jet", origin="lower")
1401 |     plt.xticks(ticks=xtick_locations, labels=xtick_labels)
1402 |     plt.xlabel("Time (s)")
1403 |     plt.ylabel("Coefficients")
1404 | 
1405 | 
1406 | def cqtspecshow(
1407 |     cqt_spectrogram,
1408 |     time_resolution,
1409 |     octave_resolution,
1410 |     minimum_frequency,
1411 |     xtick_step=1,
1412 | ):
1413 |     """
1414 |     Display a CQT spectrogram in dB, seconds, and Hz.
1415 | 
1416 |     Inputs:
1417 |         cqt_spectrogram: CQT spectrogram (number_frequencies, number_times)
1418 |         time_resolution: number of time frames per second
1419 |         octave_resolution: number of frequency channels per octave
1420 |         minimum_frequency: minimum frequency in Hz
1421 |         xtick_step: step for the x-axis ticks in seconds (default: 1 second)
1422 |     """
1423 | 
1424 |     # Get the number of frequency channels and time frames
1425 |     number_frequencies, number_times = np.shape(cqt_spectrogram)
1426 | 
1427 |     # Prepare the tick locations and labels for the x-axis
1428 |     xtick_locations = np.arange(
1429 |         xtick_step * time_resolution,
1430 |         number_times,
1431 |         xtick_step * time_resolution,
1432 |     )
1433 |     xtick_labels = np.arange(
1434 |         xtick_step, number_times / time_resolution, xtick_step
1435 |     ).astype(int)
1436 | 
1437 |     # Prepare the tick locations and labels for the y-axis
1438 |     ytick_locations = np.arange(0, number_frequencies, octave_resolution)
1439 |     ytick_labels = (
1440 |         minimum_frequency * pow(2, ytick_locations / octave_resolution)
1441 |     ).astype(int)
1442 | 
1443 |     # Display the CQT spectrogram in dB, seconds, and Hz
1444 |     plt.imshow(
1445 |         20 * np.log10(cqt_spectrogram), aspect="auto", cmap="jet", origin="lower"
1446 |     )
1447 |     plt.xticks(ticks=xtick_locations, labels=xtick_labels)
1448 |     plt.yticks(ticks=ytick_locations, labels=ytick_labels)
1449 |     plt.xlabel("Time (s)")
1450 |     plt.ylabel("Frequency (Hz)")
1451 | 
1452 | 
1453 | def cqtchromshow(
1454 |     cqt_chromagram,
1455 |     time_resolution,
1456 |     xtick_step=1,
1457 | ):
1458 |     """
1459 |     Display a CQT chromagram in seconds.
1460 | 
1461 |     Inputs:
1462 |         cqt_chromagram: CQT chromagram (number_chromas, number_times)
1463 |         time_resolution: number of time frames per second
1464 |         xtick_step: step for the x-axis ticks in seconds (default: 1 second)
1465 |     """
1466 | 
1467 |     # Get the number of time frames
1468 |     number_times = np.shape(cqt_chromagram)[1]
1469 | 
1470 |     # Prepare the tick locations and labels for the x-axis
1471 |     xtick_locations = np.arange(
1472 |         xtick_step * time_resolution,
1473 |         number_times,
1474 |         xtick_step * time_resolution,
1475 |     )
1476 |     xtick_labels = np.arange(
1477 |         xtick_step, number_times / time_resolution, xtick_step
1478 |     ).astype(int)
1479 | 
1480 |     # Display the CQT chromagram in seconds
1481 |     plt.imshow(cqt_chromagram, aspect="auto", cmap="jet", origin="lower")
1482 |     plt.xticks(ticks=xtick_locations, labels=xtick_labels)
1483 |     plt.xlabel("Time (s)")
1484 |     plt.ylabel("Chroma")
--------------------------------------------------------------------------------