├── .gitmodules
├── LICENSE
├── README.md
├── audio_processing.py
├── data
│   ├── cmu_dictionary
│   ├── debussy_prelude_lyrics.musicxml
│   ├── example1.wav
│   ├── example2.wav
│   ├── examples_filelist.txt
│   ├── haendel_hallelujah.musicxml
│   └── mozart_requiem_kyrie_satb.musicxml
├── data_utils.py
├── distributed.py
├── filelists
│   ├── libritts_speakerinfo.txt
│   ├── libritts_train_clean_100_audiopath_text_sid_atleast5min_val_filelist.txt
│   ├── libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt
│   ├── ljs_audiopaths_text_sid_train_filelist.txt
│   └── ljs_audiopaths_text_sid_val_filelist.txt
├── fp16_optimizer.py
├── hparams.py
├── inference.ipynb
├── layers.py
├── logger.py
├── loss_function.py
├── loss_scaler.py
├── mellotron_logo.png
├── mellotron_utils.py
├── model.py
├── modules.py
├── multiproc.py
├── plotting_utils.py
├── requirements.txt
├── stft.py
├── text
│   ├── LICENSE
│   ├── __init__.py
│   ├── cleaners.py
│   ├── cmudict.py
│   ├── numbers.py
│   └── symbols.py
├── train.py
├── utils.py
└── yin.py

/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "waveglow"]
2 | path = waveglow
3 | url = https://github.com/NVIDIA/waveglow.git
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2019, NVIDIA Corporation
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 |
9 | * Redistributions of source code must retain the above copyright notice, this
10 |   list of conditions and the following disclaimer.
11 |
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 |   this list of conditions and the following disclaimer in the documentation
14 |   and/or other materials provided with the distribution.
15 |
16 | * Neither the name of the copyright holder nor the names of its
17 |   contributors may be used to endorse or promote products derived from
18 |   this software without specific prior written permission.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ![Mellotron](mellotron_logo.png "Mellotron")
2 |
3 | ### Rafael Valle\*, Jason Li\*, Ryan Prenger and Bryan Catanzaro
4 | In our recent [paper] we propose Mellotron: a multispeaker voice synthesis model
5 | based on Tacotron 2 GST that can make a voice emote and sing without emotive or
6 | singing training data.
7 |
8 | By explicitly conditioning on rhythm and continuous pitch
9 | contours from an audio signal or music score, Mellotron is able to generate
10 | speech in a variety of styles ranging from read speech to expressive speech,
11 | from slow drawls to rap and from monotonous voice to singing voice.
12 |
13 | Visit our [website] for audio samples.
14 |
15 | ## Pre-requisites
16 | 1. NVIDIA GPU + CUDA + cuDNN
17 |
18 | ## Setup
19 | 1. Clone this repo: `git clone https://github.com/NVIDIA/mellotron.git`
20 | 2. CD into this repo: `cd mellotron`
21 | 3. Initialize submodule: `git submodule init; git submodule update`
22 | 4. Install [PyTorch]
23 | 5. Install [Apex]
24 | 6. Install Python requirements or build the Docker image
25 |    - Install Python requirements: `pip install -r requirements.txt`
26 |
27 | ## Training
28 | 1. Update the filelists inside the filelists folder to point to your data (see the filelist format note at the end of this README)
29 | 2. `python train.py --output_directory=outdir --log_directory=logdir`
30 | 3. (OPTIONAL) `tensorboard --logdir=outdir/logdir`
31 |
32 | ## Training using a pre-trained model
33 | Training using a pre-trained model can lead to faster convergence.
34 | By default, the speaker embedding layer is [ignored].
35 |
36 | 1. Download our published Mellotron model trained on [LibriTTS] or [LJS]
37 | 2. `python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start`
38 |
39 | ## Multi-GPU (distributed) and Automatic Mixed Precision Training
40 | 1. `python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True`
41 |
42 | ## Inference demo
43 | 1. `jupyter notebook --ip=127.0.0.1 --port=31337`
44 | 2. Load inference.ipynb
45 | 3. (optional) Download our published [WaveGlow](https://drive.google.com/open?id=1okuUstGoBe_qZ4qUEF8CcwEugHP7GM_b) model
46 |
47 | ## Related repos
48 | [WaveGlow](https://github.com/NVIDIA/WaveGlow): a faster-than-real-time flow-based
49 | generative network for speech synthesis.
50 |
51 | ## Acknowledgements
52 | This implementation uses code from the following repos: [Keith
53 | Ito](https://github.com/keithito/tacotron/), [Prem
54 | Seetharaman](https://github.com/pseeth/pytorch-stft),
55 | [Chengqi Deng](https://github.com/KinglittleQ/GST-Tacotron),
56 | [Patrice Guyot](https://github.com/patriceguyot/Yin), as described in our code.
57 |
58 | [ignored]: https://github.com/NVIDIA/mellotron/blob/master/hparams.py#L22
59 | [paper]: https://arxiv.org/abs/1910.11997
60 | [WaveGlow]: https://drive.google.com/open?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF
61 | [LibriTTS]: https://drive.google.com/open?id=1ZesPPyRRKloltRIuRnGZ2LIUEuMSVjkI
62 | [LJS]: https://drive.google.com/open?id=1UwDARlUl8JvB2xSuyMFHFsIWELVpgQD4
63 | [pytorch]: https://github.com/pytorch/pytorch#installation
64 | [website]: https://nv-adlr.github.io/Mellotron
65 | [Apex]: https://github.com/nvidia/apex
66 | [AMP]: https://github.com/NVIDIA/apex/tree/master/apex/amp
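The filelists referenced in the Training section above are plain-text files with one pipe-separated `audio_path|text|speaker_id` entry per line; `data/examples_filelist.txt` shows the expected format:

```
data/example1.wav|exploring the expanses of space to keep our planet safe|1
data/example2.wav|and all the species that call it home|1
```

A minimal sketch of how such a file can be parsed (the repo's own loader is `utils.load_filepaths_and_text`; the helper below is illustrative only, not part of the codebase):

```python
# Illustrative helper (assumption, not repo code): split each pipe-delimited
# filelist line into (audio_path, transcript, speaker_id), the per-entry layout
# that data_utils.TextMelLoader receives from utils.load_filepaths_and_text.
def read_filelist(path, delimiter="|"):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split(delimiter) for line in f if line.strip()]

entries = read_filelist("data/examples_filelist.txt")
audio_path, transcript, speaker_id = entries[0]
```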
--------------------------------------------------------------------------------
/audio_processing.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import numpy as np
3 | from scipy.signal import get_window
4 | import librosa.util as librosa_util
5 |
6 |
7 | def window_sumsquare(window, n_frames, hop_length=200, win_length=800,
8 |                      n_fft=800, dtype=np.float32, norm=None):
9 |     """
10 |     # from librosa 0.6
11 |     Compute the sum-square envelope of a window function at a given hop length.
12 | 13 | This is used to estimate modulation effects induced by windowing 14 | observations in short-time fourier transforms. 15 | 16 | Parameters 17 | ---------- 18 | window : string, tuple, number, callable, or list-like 19 | Window specification, as in `get_window` 20 | 21 | n_frames : int > 0 22 | The number of analysis frames 23 | 24 | hop_length : int > 0 25 | The number of samples to advance between frames 26 | 27 | win_length : [optional] 28 | The length of the window function. By default, this matches `n_fft`. 29 | 30 | n_fft : int > 0 31 | The length of each analysis frame. 32 | 33 | dtype : np.dtype 34 | The data type of the output 35 | 36 | Returns 37 | ------- 38 | wss : np.ndarray, shape=`(n_fft + hop_length * (n_frames - 1))` 39 | The sum-squared envelope of the window function 40 | """ 41 | if win_length is None: 42 | win_length = n_fft 43 | 44 | n = n_fft + hop_length * (n_frames - 1) 45 | x = np.zeros(n, dtype=dtype) 46 | 47 | # Compute the squared window at the desired length 48 | win_sq = get_window(window, win_length, fftbins=True) 49 | win_sq = librosa_util.normalize(win_sq, norm=norm)**2 50 | win_sq = librosa_util.pad_center(win_sq, n_fft) 51 | 52 | # Fill the envelope 53 | for i in range(n_frames): 54 | sample = i * hop_length 55 | x[sample:min(n, sample + n_fft)] += win_sq[:max(0, min(n_fft, n - sample))] 56 | return x 57 | 58 | 59 | def griffin_lim(magnitudes, stft_fn, n_iters=30): 60 | """ 61 | PARAMS 62 | ------ 63 | magnitudes: spectrogram magnitudes 64 | stft_fn: STFT class with transform (STFT) and inverse (ISTFT) methods 65 | """ 66 | 67 | angles = np.angle(np.exp(2j * np.pi * np.random.rand(*magnitudes.size()))) 68 | angles = angles.astype(np.float32) 69 | angles = torch.autograd.Variable(torch.from_numpy(angles)) 70 | signal = stft_fn.inverse(magnitudes, angles).squeeze(1) 71 | 72 | for i in range(n_iters): 73 | _, angles = stft_fn.transform(signal) 74 | signal = stft_fn.inverse(magnitudes, angles).squeeze(1) 75 | return signal 76 | 77 | 78 | def dynamic_range_compression(x, C=1, clip_val=1e-5): 79 | """ 80 | PARAMS 81 | ------ 82 | C: compression factor 83 | """ 84 | return torch.log(torch.clamp(x, min=clip_val) * C) 85 | 86 | 87 | def dynamic_range_decompression(x, C=1): 88 | """ 89 | PARAMS 90 | ------ 91 | C: compression factor used to compress 92 | """ 93 | return torch.exp(x) / C 94 | -------------------------------------------------------------------------------- /data/cmu_dictionary: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVIDIA/mellotron/d5362ccae23984f323e3cb024a01ec1de0493aff/data/cmu_dictionary -------------------------------------------------------------------------------- /data/debussy_prelude_lyrics.musicxml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Prelude 5 | 6 | Debussy 7 | 8 | Finale v26 for Mac 9 | 2019-10-22 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 7.0273 20 | 40 21 | 22 | 23 | 1590 24 | 1229 25 | 26 | 57 27 | 57 28 | 57 29 | 114 30 | 31 | 32 | 57 33 | 57 34 | 57 35 | 114 36 | 37 | 38 | 39 | 40 | 0 41 | 0 42 | 43 | 103 44 | 60 45 | 46 | 47 | 0.918 48 | 5 49 | 0.918 50 | 0.918 51 | 5 52 | 1.0807 53 | 0.957 54 | 0.918 55 | 0.918 56 | 0.918 57 | 60 58 | 60 59 | 120 60 | 8 61 | 62 | 63 | 64 | 65 | 66 | 67 | Prelude 68 | 69 | 70 | Debussy 71 | 72 | 73 | 74 | Flute 75 | Pno. 
[flattened MusicXML markup from data/debussy_prelude_lyrics.musicxml omitted: several hundred lines of pitch, duration, beam, and lyric-syllable data (the sung syllables spell "Prelude to the Afternoon of a Faun") whose XML tags were stripped during extraction]
| 529 | G 530 | 1 531 | 5 532 | 533 | 6 534 | 1 535 | eighth 536 | down 537 | end 538 | 539 | single 540 | The 541 | 542 | 543 | 544 | 545 | E 546 | 5 547 | 548 | 12 549 | 1 550 | quarter 551 | down 552 | 553 | begin 554 | Af 555 | 556 | 557 | 558 | 559 | G 560 | 1 561 | 4 562 | 563 | 6 564 | 1 565 | eighth 566 | up 567 | 568 | middle 569 | ter 570 | 571 | 572 | 573 | 574 | B 575 | 4 576 | 577 | 18 578 | 579 | 1 580 | quarter 581 | 582 | down 583 | 584 | 585 | 586 | 587 | end 588 | noon 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | B 608 | 4 609 | 610 | 6 611 | 612 | 1 613 | eighth 614 | down 615 | begin 616 | 617 | 618 | 619 | 620 | 621 | 622 | 623 | 624 | 625 | B 626 | 4 627 | 628 | 6 629 | 1 630 | eighth 631 | down 632 | continue 633 | 634 | single 635 | Of 636 | 637 | 638 | 639 | 640 | C 641 | 1 642 | 5 643 | 644 | 6 645 | 1 646 | eighth 647 | down 648 | end 649 | 650 | single 651 | A 652 | 653 | 654 | 655 | 656 | 657 | 658 | 659 | 660 | 661 | A 662 | 1 663 | 4 664 | 665 | 36 666 | 1 667 | half 668 | 669 | sharp 670 | up 671 | 672 | 673 | 674 | 675 | single 676 | Faun 677 | 678 | 679 | 680 | light-heavy 681 | 682 | 683 | 684 | 685 | 686 | -------------------------------------------------------------------------------- /data/example1.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVIDIA/mellotron/d5362ccae23984f323e3cb024a01ec1de0493aff/data/example1.wav -------------------------------------------------------------------------------- /data/example2.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVIDIA/mellotron/d5362ccae23984f323e3cb024a01ec1de0493aff/data/example2.wav -------------------------------------------------------------------------------- /data/examples_filelist.txt: -------------------------------------------------------------------------------- 1 | data/example1.wav|exploring the expanses of space to keep our planet safe|1 2 | data/example2.wav|and all the species that call it home|1 3 | -------------------------------------------------------------------------------- /data_utils.py: -------------------------------------------------------------------------------- 1 | import random 2 | import os 3 | import re 4 | import numpy as np 5 | import torch 6 | import torch.utils.data 7 | import librosa 8 | 9 | import layers 10 | from utils import load_wav_to_torch, load_filepaths_and_text 11 | from text import text_to_sequence, cmudict 12 | from yin import compute_yin 13 | 14 | 15 | class TextMelLoader(torch.utils.data.Dataset): 16 | """ 17 | 1) loads audio, text and speaker ids 18 | 2) normalizes text and converts them to sequences of one-hot vectors 19 | 3) computes mel-spectrograms and f0s from audio files. 
20 | """ 21 | def __init__(self, audiopaths_and_text, hparams, speaker_ids=None): 22 | self.audiopaths_and_text = load_filepaths_and_text(audiopaths_and_text) 23 | self.text_cleaners = hparams.text_cleaners 24 | self.max_wav_value = hparams.max_wav_value 25 | self.sampling_rate = hparams.sampling_rate 26 | self.stft = layers.TacotronSTFT( 27 | hparams.filter_length, hparams.hop_length, hparams.win_length, 28 | hparams.n_mel_channels, hparams.sampling_rate, hparams.mel_fmin, 29 | hparams.mel_fmax) 30 | self.sampling_rate = hparams.sampling_rate 31 | self.filter_length = hparams.filter_length 32 | self.hop_length = hparams.hop_length 33 | self.f0_min = hparams.f0_min 34 | self.f0_max = hparams.f0_max 35 | self.harm_thresh = hparams.harm_thresh 36 | self.p_arpabet = hparams.p_arpabet 37 | 38 | self.cmudict = None 39 | if hparams.cmudict_path is not None: 40 | self.cmudict = cmudict.CMUDict(hparams.cmudict_path) 41 | 42 | self.speaker_ids = speaker_ids 43 | if speaker_ids is None: 44 | self.speaker_ids = self.create_speaker_lookup_table( 45 | self.audiopaths_and_text) 46 | 47 | random.seed(1234) 48 | random.shuffle(self.audiopaths_and_text) 49 | 50 | def create_speaker_lookup_table(self, audiopaths_and_text): 51 | speaker_ids = np.sort(np.unique([x[2] for x in audiopaths_and_text])) 52 | d = {int(speaker_ids[i]): i for i in range(len(speaker_ids))} 53 | return d 54 | 55 | def get_f0(self, audio, sampling_rate=22050, frame_length=1024, 56 | hop_length=256, f0_min=100, f0_max=300, harm_thresh=0.1): 57 | f0, harmonic_rates, argmins, times = compute_yin( 58 | audio, sampling_rate, frame_length, hop_length, f0_min, f0_max, 59 | harm_thresh) 60 | pad = int((frame_length / hop_length) / 2) 61 | f0 = [0.0] * pad + f0 + [0.0] * pad 62 | 63 | f0 = np.array(f0, dtype=np.float32) 64 | return f0 65 | 66 | def get_data(self, audiopath_and_text): 67 | audiopath, text, speaker = audiopath_and_text 68 | text = self.get_text(text) 69 | mel, f0 = self.get_mel_and_f0(audiopath) 70 | speaker_id = self.get_speaker_id(speaker) 71 | return (text, mel, speaker_id, f0) 72 | 73 | def get_speaker_id(self, speaker_id): 74 | return torch.IntTensor([self.speaker_ids[int(speaker_id)]]) 75 | 76 | def get_mel_and_f0(self, filepath): 77 | audio, sampling_rate = load_wav_to_torch(filepath) 78 | if sampling_rate != self.stft.sampling_rate: 79 | raise ValueError("{} SR doesn't match target {} SR".format( 80 | sampling_rate, self.stft.sampling_rate)) 81 | audio_norm = audio / self.max_wav_value 82 | audio_norm = audio_norm.unsqueeze(0) 83 | melspec = self.stft.mel_spectrogram(audio_norm) 84 | melspec = torch.squeeze(melspec, 0) 85 | 86 | f0 = self.get_f0(audio.cpu().numpy(), self.sampling_rate, 87 | self.filter_length, self.hop_length, self.f0_min, 88 | self.f0_max, self.harm_thresh) 89 | f0 = torch.from_numpy(f0)[None] 90 | f0 = f0[:, :melspec.size(1)] 91 | 92 | return melspec, f0 93 | 94 | def get_text(self, text): 95 | text_norm = torch.IntTensor( 96 | text_to_sequence(text, self.text_cleaners, self.cmudict, self.p_arpabet)) 97 | 98 | return text_norm 99 | 100 | def __getitem__(self, index): 101 | return self.get_data(self.audiopaths_and_text[index]) 102 | 103 | def __len__(self): 104 | return len(self.audiopaths_and_text) 105 | 106 | 107 | class TextMelCollate(): 108 | """ Zero-pads model inputs and targets based on number of frames per setep 109 | """ 110 | def __init__(self, n_frames_per_step): 111 | self.n_frames_per_step = n_frames_per_step 112 | 113 | def __call__(self, batch): 114 | """Collate's training batch from 
normalized text and mel-spectrogram 115 | PARAMS 116 | ------ 117 | batch: [text_normalized, mel_normalized] 118 | """ 119 | # Right zero-pad all one-hot text sequences to max input length 120 | input_lengths, ids_sorted_decreasing = torch.sort( 121 | torch.LongTensor([len(x[0]) for x in batch]), 122 | dim=0, descending=True) 123 | max_input_len = input_lengths[0] 124 | 125 | text_padded = torch.LongTensor(len(batch), max_input_len) 126 | text_padded.zero_() 127 | for i in range(len(ids_sorted_decreasing)): 128 | text = batch[ids_sorted_decreasing[i]][0] 129 | text_padded[i, :text.size(0)] = text 130 | 131 | # Right zero-pad mel-spec 132 | num_mels = batch[0][1].size(0) 133 | max_target_len = max([x[1].size(1) for x in batch]) 134 | if max_target_len % self.n_frames_per_step != 0: 135 | max_target_len += self.n_frames_per_step - max_target_len % self.n_frames_per_step 136 | assert max_target_len % self.n_frames_per_step == 0 137 | 138 | # include mel padded, gate padded and speaker ids 139 | mel_padded = torch.FloatTensor(len(batch), num_mels, max_target_len) 140 | mel_padded.zero_() 141 | gate_padded = torch.FloatTensor(len(batch), max_target_len) 142 | gate_padded.zero_() 143 | output_lengths = torch.LongTensor(len(batch)) 144 | speaker_ids = torch.LongTensor(len(batch)) 145 | f0_padded = torch.FloatTensor(len(batch), 1, max_target_len) 146 | f0_padded.zero_() 147 | 148 | for i in range(len(ids_sorted_decreasing)): 149 | mel = batch[ids_sorted_decreasing[i]][1] 150 | mel_padded[i, :, :mel.size(1)] = mel 151 | gate_padded[i, mel.size(1)-1:] = 1 152 | output_lengths[i] = mel.size(1) 153 | speaker_ids[i] = batch[ids_sorted_decreasing[i]][2] 154 | f0 = batch[ids_sorted_decreasing[i]][3] 155 | f0_padded[i, :, :f0.size(1)] = f0 156 | 157 | model_inputs = (text_padded, input_lengths, mel_padded, gate_padded, 158 | output_lengths, speaker_ids, f0_padded) 159 | 160 | return model_inputs 161 | -------------------------------------------------------------------------------- /distributed.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.distributed as dist 3 | from torch.nn.modules import Module 4 | from torch.autograd import Variable 5 | 6 | def _flatten_dense_tensors(tensors): 7 | """Flatten dense tensors into a contiguous 1D buffer. Assume tensors are of 8 | same dense type. 9 | Since inputs are dense, the resulting tensor will be a concatenated 1D 10 | buffer. Element-wise operation on this buffer will be equivalent to 11 | operating individually. 12 | Arguments: 13 | tensors (Iterable[Tensor]): dense tensors to flatten. 14 | Returns: 15 | A contiguous 1D buffer containing input tensors. 16 | """ 17 | if len(tensors) == 1: 18 | return tensors[0].contiguous().view(-1) 19 | flat = torch.cat([t.contiguous().view(-1).float() for t in tensors], dim=0) 20 | return flat 21 | 22 | def _unflatten_dense_tensors(flat, tensors): 23 | """View a flat buffer using the sizes of tensors. Assume that tensors are of 24 | same dense type, and that flat is given by _flatten_dense_tensors. 25 | Arguments: 26 | flat (Tensor): flattened dense tensors to unflatten. 27 | tensors (Iterable[Tensor]): dense tensors whose sizes will be used to 28 | unflatten flat. 29 | Returns: 30 | Unflattened dense tensors with sizes same as tensors and values from 31 | flat. 
32 | """ 33 | outputs = [] 34 | offset = 0 35 | for tensor in tensors: 36 | numel = tensor.numel() 37 | outputs.append(flat.narrow(0, offset, numel).view_as(tensor)) 38 | offset += numel 39 | return tuple(outputs) 40 | 41 | 42 | ''' 43 | This version of DistributedDataParallel is designed to be used in conjunction with the multiproc.py 44 | launcher included with this example. It assumes that your run is using multiprocess with 1 45 | GPU/process, that the model is on the correct device, and that torch.set_device has been 46 | used to set the device. 47 | 48 | Parameters are broadcasted to the other processes on initialization of DistributedDataParallel, 49 | and will be allreduced at the finish of the backward pass. 50 | ''' 51 | class DistributedDataParallel(Module): 52 | 53 | def __init__(self, module): 54 | super(DistributedDataParallel, self).__init__() 55 | #fallback for PyTorch 0.3 56 | if not hasattr(dist, '_backend'): 57 | self.warn_on_half = True 58 | else: 59 | self.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False 60 | 61 | self.module = module 62 | 63 | for p in self.module.state_dict().values(): 64 | if not torch.is_tensor(p): 65 | continue 66 | dist.broadcast(p, 0) 67 | 68 | def allreduce_params(): 69 | if(self.needs_reduction): 70 | self.needs_reduction = False 71 | buckets = {} 72 | for param in self.module.parameters(): 73 | if param.requires_grad and param.grad is not None: 74 | tp = type(param.data) 75 | if tp not in buckets: 76 | buckets[tp] = [] 77 | buckets[tp].append(param) 78 | if self.warn_on_half: 79 | if torch.cuda.HalfTensor in buckets: 80 | print("WARNING: gloo dist backend for half parameters may be extremely slow." + 81 | " It is recommended to use the NCCL backend in this case. This currently requires" + 82 | "PyTorch built from top of tree master.") 83 | self.warn_on_half = False 84 | 85 | for tp in buckets: 86 | bucket = buckets[tp] 87 | grads = [param.grad.data for param in bucket] 88 | coalesced = _flatten_dense_tensors(grads) 89 | dist.all_reduce(coalesced) 90 | coalesced /= dist.get_world_size() 91 | for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)): 92 | buf.copy_(synced) 93 | 94 | for param in list(self.module.parameters()): 95 | def allreduce_hook(*unused): 96 | param._execution_engine.queue_callback(allreduce_params) 97 | if param.requires_grad: 98 | param.register_hook(allreduce_hook) 99 | 100 | def forward(self, *inputs, **kwargs): 101 | self.needs_reduction = True 102 | return self.module(*inputs, **kwargs) 103 | 104 | ''' 105 | def _sync_buffers(self): 106 | buffers = list(self.module._all_buffers()) 107 | if len(buffers) > 0: 108 | # cross-node buffer sync 109 | flat_buffers = _flatten_dense_tensors(buffers) 110 | dist.broadcast(flat_buffers, 0) 111 | for buf, synced in zip(buffers, _unflatten_dense_tensors(flat_buffers, buffers)): 112 | buf.copy_(synced) 113 | def train(self, mode=True): 114 | # Clear NCCL communicator and CUDA event cache of the default group ID, 115 | # These cache will be recreated at the later call. This is currently a 116 | # work-around for a potential NCCL deadlock. 
117 | if dist._backend == dist.dist_backend.NCCL: 118 | dist._clear_group_cache() 119 | super(DistributedDataParallel, self).train(mode) 120 | self.module.train(mode) 121 | ''' 122 | ''' 123 | Modifies existing model to do gradient allreduce, but doesn't change class 124 | so you don't need "module" 125 | ''' 126 | def apply_gradient_allreduce(module): 127 | if not hasattr(dist, '_backend'): 128 | module.warn_on_half = True 129 | else: 130 | module.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False 131 | 132 | for p in module.state_dict().values(): 133 | if not torch.is_tensor(p): 134 | continue 135 | dist.broadcast(p, 0) 136 | 137 | def allreduce_params(): 138 | if(module.needs_reduction): 139 | module.needs_reduction = False 140 | buckets = {} 141 | for param in module.parameters(): 142 | if param.requires_grad and param.grad is not None: 143 | tp = type(param.data) 144 | if tp not in buckets: 145 | buckets[tp] = [] 146 | buckets[tp].append(param) 147 | if module.warn_on_half: 148 | if torch.cuda.HalfTensor in buckets: 149 | print("WARNING: gloo dist backend for half parameters may be extremely slow." + 150 | " It is recommended to use the NCCL backend in this case. This currently requires" + 151 | "PyTorch built from top of tree master.") 152 | module.warn_on_half = False 153 | 154 | for tp in buckets: 155 | bucket = buckets[tp] 156 | grads = [param.grad.data for param in bucket] 157 | coalesced = _flatten_dense_tensors(grads) 158 | dist.all_reduce(coalesced) 159 | coalesced /= dist.get_world_size() 160 | for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)): 161 | buf.copy_(synced) 162 | 163 | for param in list(module.parameters()): 164 | def allreduce_hook(*unused): 165 | Variable._execution_engine.queue_callback(allreduce_params) 166 | if param.requires_grad: 167 | param.register_hook(allreduce_hook) 168 | 169 | def set_needs_reduction(self, input, output): 170 | self.needs_reduction = True 171 | 172 | module.register_forward_hook(set_needs_reduction) 173 | return module 174 | -------------------------------------------------------------------------------- /filelists/libritts_train_clean_100_audiopath_text_sid_atleast5min_val_filelist.txt: -------------------------------------------------------------------------------- 1 | /path_to_libritts/7367/86737/7367_86737_000132_000020.wav|There was some surprise, however, that, as he was with his face to the enemy, he should have received a ball between his shoulders. That astonishment ceased when one of the brigands remarked to his comrades that Cucumetto was stationed ten paces in Carlini's rear when he fell.|7367 2 | /path_to_libritts/7511/102419/7511_102419_000020_000000.wav|Mr. Stewart is the queerest man: instead of letting me enjoy the tableau, he solemnly drove on, saying he would not want any one gawking at him if he were the happy man.|7511 3 | /path_to_libritts/7078/271888/7078_271888_000062_000000.wav|"It is a strange fancy of mine," she explained, when I had greeted her. "I'm sure the dress is very becoming--isn't it?" 
And she waved the goblet she was holding above her head.|7078 4 | /path_to_libritts/2289/152257/2289_152257_000026_000000.wav|To the last year of his life Justinian was strong and active and a hard worker.|2289 5 | /path_to_libritts/3879/173592/3879_173592_000033_000002.wav|Friends or foes, French or Spaniards, succor or death,--betwixt these were their hopes and fears divided.|3879 6 | /path_to_libritts/6529/62554/6529_62554_000068_000000.wav|"Isn't what they have done already enough?" asked Pencroft, who did not understand these scruples.|6529 7 | /path_to_libritts/1263/141777/1263_141777_000033_000002.wav|Clumps of small trees and high growing bushes dotted that expanse, an ideal cover.|1263 8 | /path_to_libritts/200/126784/200_126784_000053_000000.wav|"I did not know it until I reached them, and it was then too late to retreat," said Wharton sullenly.|200 9 | /path_to_libritts/6529/62554/6529_62554_000062_000001.wav|"Poor Ayrton!|6529 10 | /path_to_libritts/7190/90543/7190_90543_000039_000000.wav|I had become so accustomed to Quarles jumping to some sudden conclusion that I was disappointed.|7190 11 | /path_to_libritts/7367/86737/7367_86737_000109_000001.wav|he is only two and twenty;--he will gain himself a reputation."|7367 12 | /path_to_libritts/4267/72637/4267_72637_000035_000000.wav|She withdrew entirely now, all but her hand, and her eyes sought the ground.|4267 13 | /path_to_libritts/2436/2476/2436_2476_000004_000003.wav|Then I sought out Carter.|2436 14 | /path_to_libritts/6272/70191/6272_70191_000027_000000.wav|The graveyard was a mile outside the village--a sandy plain where a few stunted pines transplanted from the woods near it struggled to keep alive.|6272 15 | /path_to_libritts/2416/152139/2416_152139_000001_000001.wav|The rest of the house was in darkness.|2416 16 | /path_to_libritts/8419/293469/8419_293469_000003_000002.wav|All the wonderful things of which he told had happened before he came to the meadow, and while he was still a young Frog.|8419 17 | /path_to_libritts/1088/134315/1088_134315_000020_000000.wav|"Is that your creed?" 
she asked quietly.|1088 18 | /path_to_libritts/7367/86737/7367_86737_000132_000001.wav|The old man remained motionless; he felt that some great and unforeseen misfortune hung over his head.|7367 19 | /path_to_libritts/8108/274318/8108_274318_000018_000002.wav|He seemed very pleased with himself, and smiled with an expression of supreme innocence.|8108 20 | /path_to_libritts/3983/5371/3983_5371_000019_000000.wav|Miss Corny paused.|3983 21 | /path_to_libritts/4406/16883/4406_16883_000024_000000.wav|THE FIFTEENTH REMOVE|4406 22 | /path_to_libritts/4640/19187/4640_19187_000038_000001.wav|He returned from his sombre eagle flight into outer darkness.|4640 23 | /path_to_libritts/1502/122615/1502_122615_000004_000001.wav|Slowly and reluctantly yielding to the necessity, he quitted the place, and mingled with the throng that hovered nigh.|1502 24 | /path_to_libritts/6209/34601/6209_34601_000068_000030.wav|I was alone.|6209 25 | /path_to_libritts/7190/90542/7190_90542_000054_000000.wav|We took rooms at a hotel in Medworth, Quarles explaining that our investigations might take some days.|7190 26 | /path_to_libritts/7226/86964/7226_86964_000007_000002.wav|The latter made a pretty picture standing out on the bold bank, backed by a number of huge stacks of golden straw.|7226 27 | /path_to_libritts/6209/34601/6209_34601_000096_000023.wav|What a hurricane!|6209 28 | /path_to_libritts/4406/16882/4406_16882_000018_000002.wav|About two hours in the night, my sweet babe like a lamb departed this life on Feb. 18, 1675.|4406 29 | /path_to_libritts/6818/68772/6818_68772_000008_000000.wav|After luncheon began the speech-making, interspersed with music by the band.|6818 30 | /path_to_libritts/5022/29411/5022_29411_000058_000009.wav|My mother whispered to me--I thanked the little mill-girl, and gave her a kiss.|5022 31 | /path_to_libritts/8088/284756/8088_284756_000185_000002.wav|Her eyes met mine and I knew that I had not misunderstood.|8088 32 | /path_to_libritts/1841/150351/1841_150351_000025_000003.wav|The figures of animals and birds carved upon it represent the mythological ancestors of the family or clan in front of whose abode the pole stands.|1841 33 | /path_to_libritts/2836/5355/2836_5355_000063_000000.wav|She made no reply.|2836 34 | /path_to_libritts/2836/5355/2836_5355_000008_000000.wav|"What do you mean by 'deal?'" he asked, settling the logs to his apparent satisfaction.|2836 35 | /path_to_libritts/8609/262281/8609_262281_000047_000002.wav|One of the girls shall go with us to-day; whichever deserves it best."|8609 36 | /path_to_libritts/7067/76048/7067_76048_000044_000006.wav|The big thing confronts us still.|7067 37 | /path_to_libritts/8324/286681/8324_286681_000008_000003.wav|When he had thought for a while, he waved his big pinching-claws and said, "It would be better for me not to tell what I think.|8324 38 | /path_to_libritts/6385/34669/6385_34669_000022_000001.wav|It was a long platform, floored and tarred, supported by a network of joists, and under which flowed the river.|6385 39 | /path_to_libritts/8238/283452/8238_283452_000022_000008.wav|And there are very few funny child-ghosts--you might almost say none, in comparison with the number of grown-ups.|8238 40 | /path_to_libritts/4018/103416/4018_103416_000018_000001.wav|But I meant them."|4018 41 | /path_to_libritts/7800/283478/7800_283478_000008_000000.wav|"Now you've gone and done it, you young rapscallions!" 
cried Isaac Chase, so excited that he could hardly control his trembling voice.|7800 42 | /path_to_libritts/1246/135815/1246_135815_000015_000000.wav|"He certainly is all right as a digger," exclaimed Peter Rabbit. "My, how he can make the sand fly!|1246 43 | /path_to_libritts/374/180299/374_180299_000013_000000.wav|"We went all over the house, and we shall have everything perfect.|374 44 | /path_to_libritts/2436/2481/2436_2481_000024_000000.wav|I stammered, "If ... if she dies ... will you flash us word?"|2436 45 | /path_to_libritts/7067/76047/7067_76047_000048_000007.wav|And there is something more to be said.|7067 46 | /path_to_libritts/1970/28415/1970_28415_000016_000001.wav|This was his Father's house and his house.|1970 47 | /path_to_libritts/4680/16026/4680_16026_000045_000003.wav|And my mother?|4680 48 | /path_to_libritts/5104/33406/5104_33406_000113_000003.wav|So of those thirty-five ships only fifteen got to Greenland.|5104 49 | /path_to_libritts/7367/86737/7367_86737_000099_000000.wav|"Well, Signor Pastrini," said Franz, "now that my companion is quieted, and you have seen how peaceful my intentions are, tell me who is this Luigi Vampa.|7367 50 | /path_to_libritts/2952/410/2952_410_000034_000001.wav|They judged him to be a hardened criminal, and his story an insult to their intelligence.|2952 51 | /path_to_libritts/1246/124550/1246_124550_000012_000000.wav|Her first acquaintances were the members of the Tincomb Methodist Church, a vast red-brick tabernacle.|1246 52 | /path_to_libritts/5393/19219/5393_19219_000019_000004.wav|The house was no less fragrant than the church; after the incense, roses.|5393 53 | /path_to_libritts/1263/138246/1263_138246_000005_000001.wav|We are bound to make the best of our new lodgings, and make ourselves comfortable.|1263 54 | /path_to_libritts/7059/88364/7059_88364_000012_000007.wav|The California photo-playwright can base his Crowd Picture upon the city-worshipping mobs of San Francisco.|7059 55 | /path_to_libritts/587/54108/587_54108_000030_000001.wav|"Then will you wrap something about you and come down to the river?"|587 56 | /path_to_libritts/7278/246956/7278_246956_000013_000000.wav|The reason of George Bascombe's absence from church that morning was, that, after an early breakfast, he had mounted Helen's mare, and set out to call on Mr. Hooker before he should have gone to church.|7278 57 | /path_to_libritts/4406/16883/4406_16883_000001_000001.wav|Connecticut, to meet with King Philip.|4406 58 | /path_to_libritts/831/130746/831_130746_000052_000001.wav|I was unhappy at home--never mind why.|831 59 | /path_to_libritts/2196/174172/2196_174172_000002_000015.wav|Then begin to talk about it.|2196 60 | /path_to_libritts/7302/86814/7302_86814_000011_000004.wav|"Let us horsewhip the fine gentleman!" 
said others.|7302 61 | /path_to_libritts/4195/186238/4195_186238_000039_000000.wav|"Why, you foolish old Uncle!|4195 62 | /path_to_libritts/1963/142393/1963_142393_000005_000010.wav|As Adam's confidence waned, his patience waned with it, and he thought he must write himself.|1963 63 | /path_to_libritts/2836/5355/2836_5355_000038_000001.wav|"How on the wrong scent?"|2836 64 | /path_to_libritts/7178/34644/7178_34644_000049_000001.wav|This message addressed to justice has been faithfully delivered by the sea."|7178 65 | /path_to_libritts/1263/138246/1263_138246_000056_000001.wav|A few instants alone separate us from an eventful moment.|1263 66 | /path_to_libritts/5678/43302/5678_43302_000015_000001.wav|Then she broke off and sat back.|5678 67 | /path_to_libritts/2836/5355/2836_5355_000065_000001.wav|"These apartments are mine now; they have been transferred into my name, and they can never again afford you accommodation.|2836 68 | /path_to_libritts/8098/278278/8098_278278_000010_000000.wav|Coonskin's opinion didn't benefit Pod much.|8098 69 | /path_to_libritts/8324/286683/8324_286683_000018_000001.wav|You're big enough, but you're just as homely as you can be.|8324 70 | /path_to_libritts/3664/178366/3664_178366_000019_000011.wav|The play of "Buffalo Bill" had a very successful run of six or eight weeks, and was afterwards produced in all the principal cities of the country, everywhere being received with genuine enthusiasm.|3664 71 | /path_to_libritts/5339/14133/5339_14133_000040_000001.wav|I'm noan comin' down again to-night.'|5339 72 | /path_to_libritts/5703/47198/5703_47198_000003_000000.wav|One hot summer day, a few months after the marriage, Juliet, returning to the consulate after a morning spent in very active exercise upon a tennis court, was met on the doorstep by Dora, the youngest of the Clarency Butchers, who was awaiting her approach in a high state of excitement.|5703 73 | /path_to_libritts/78/369/78_369_000066_000009.wav|Urged thus far, I had no choice but to adapt my nature to an element which I had willingly chosen.|78 74 | /path_to_libritts/4051/11218/4051_11218_000036_000000.wav|"It is only a sleeping potion," said the enchantress to Prince Jason. "One always finds a use for these mischievous creatures, sooner or later; so I did not wish to kill him outright.|4051 75 | /path_to_libritts/8838/298545/8838_298545_000055_000000.wav|"I am always in training."|8838 76 | /path_to_libritts/4680/16041/4680_16041_000014_000002.wav|Entering a street was like entering a cellar.|4680 77 | /path_to_libritts/3526/176653/3526_176653_000034_000000.wav|"Good Lord!" said the fisherman, startled, and then he stopped--the words were as innocent on her lips as a benediction.|3526 78 | /path_to_libritts/87/121553/87_121553_000058_000000.wav|Let him the mouth imagine of the horn That in the point beginneth of the axis Round about which the primal wheel revolves,--|87 79 | /path_to_libritts/3526/176653/3526_176653_000006_000000.wav|Her eyes fell at the ancient banter, but she lifted them straightway and stared again.|3526 80 | /path_to_libritts/8629/261139/8629_261139_000031_000000.wav|"I have made an assertion," said he, "before God and before this jury. 
To make it seem a credible one I shall have to tell my own story from the beginning.|8629 81 | /path_to_libritts/6209/34601/6209_34601_000096_000042.wav|Besides, the laws are violated.|6209 82 | /path_to_libritts/3526/176651/3526_176651_000012_000000.wav|She sat at the base of the big tree--her little sunbonnet pushed back, her arms locked about her knees, her bare feet gathered under her crimson gown and her deep eyes fixed on the smoke in the valley below. Her breath was still coming fast between her parted lips.|3526 83 | /path_to_libritts/7402/90848/7402_90848_000024_000003.wav|It was very dignified and wore tortoise-shell glasses."|7402 84 | /path_to_libritts/118/47824/118_47824_000036_000000.wav|"Good morning.|118 85 | /path_to_libritts/405/130895/405_130895_000051_000001.wav|It was white underneath and reddish on top, with big round spots of deep blue encircled in black, its hide quite smooth and ending in a double-lobed fin.|405 86 | /path_to_libritts/8088/284756/8088_284756_000001_000002.wav|Rum-runners, seeking out their hidden port with their cargo of contraband from Cuba.|8088 87 | /path_to_libritts/7190/90542/7190_90542_000002_000000.wav|My association with Christopher Quarles has, however, led to the solution of some strange mysteries, and, since my own achievements are sufficiently well known, I may confine myself to those cases which, single-handed, I should have failed to solve.|7190 88 | /path_to_libritts/5339/14134/5339_14134_000047_000001.wav|'Liking has nought to do with it.'|5339 89 | /path_to_libritts/2289/152258/2289_152258_000017_000000.wav|Mohammed was very earnest and serious.|2289 90 | /path_to_libritts/8098/275181/8098_275181_000020_000006.wav|But Tom overcame me forthwith, choked me nearly black in the face, then, in dumb show, knocked my head with a stone.|8098 91 | /path_to_libritts/6209/34599/6209_34599_000025_000003.wav|An iron knocker was attached to it.|6209 92 | /path_to_libritts/2836/5355/2836_5355_000072_000001.wav|Count them."|2836 93 | /path_to_libritts/5750/100289/5750_100289_000023_000000.wav|Bishop Whipple developed many able preachers, of whom perhaps the most accomplished was the Rev. Charles Smith Cook, of the Yankton Sioux.|5750 94 | /path_to_libritts/1970/26100/1970_26100_000068_000000.wav|"Then Mr. 
Woods wasn't here all through dinner, Jackson?"|1970 95 | /path_to_libritts/7278/91083/7278_91083_000012_000002.wav|Besides, he was never idle, he was economical, his habits were the best, and why should not such a boy succeed?|7278 96 | /path_to_libritts/4297/13006/4297_13006_000058_000000.wav|"That is nonsense, Emily.|4297 97 | /path_to_libritts/7302/86814/7302_86814_000007_000001.wav|The prisoners then approached and formed a circle.|7302 98 | /path_to_libritts/831/130746/831_130746_000022_000001.wav|"And fairish?"|831 99 | /path_to_libritts/2289/152258/2289_152258_000004_000001.wav|He always spoke the truth and never broke a promise.|2289 100 | /path_to_libritts/8419/293469/8419_293469_000010_000005.wav|Oh, how frightened I was!|8419 101 | /path_to_libritts/7178/34645/7178_34645_000042_000002.wav|The work on which he was engaged could only be expressed in these strange words--the construction of a thunderbolt.|7178 102 | /path_to_libritts/6385/34669/6385_34669_000011_000006.wav|The wolf appeared to him in a halo of light.|6385 103 | /path_to_libritts/669/129061/669_129061_000004_000002.wav|His gentleman alone took the opportunity of perusing the newspaper before he laid it by his master's desk.|669 104 | /path_to_libritts/4680/16026/4680_16026_000115_000000.wav|In the meantime she stared at them with a stern but peaceful air.|4680 105 | /path_to_libritts/6415/116629/6415_116629_000012_000001.wav|I said you'd better take your hands out of your pockets, and then your earnings would run in.|6415 106 | /path_to_libritts/8123/275209/8123_275209_000008_000002.wav|How should a poor crawling creature like me know what to do without asking my betters?"|8123 107 | /path_to_libritts/1088/134315/1088_134315_000111_000001.wav|It was a large safe of the usual type.|1088 108 | /path_to_libritts/696/93314/696_93314_000049_000000.wav|"I'm not likely to have forgotten you," said the Lover.|696 109 | /path_to_libritts/3983/5371/3983_5371_000047_000000.wav|"What did she go into hysterics for?" again snapped Miss Carlyle.|3983 110 | /path_to_libritts/3486/166424/3486_166424_000067_000000.wav|Within the black background of the fissure stood a shape, an apparition, a woman--beautiful, awesome, incredible!|3486 111 | /path_to_libritts/730/358/730_358_000004_000004.wav|Sometimes I tried to imitate the pleasant songs of the birds but was unable. Sometimes I wished to express my sensations in my own mode, but the uncouth and inarticulate sounds which broke from me frightened me into silence again.|730 112 | /path_to_libritts/6415/111615/6415_111615_000022_000003.wav|Poor soul, he is like one condemned to harangue the vast, idiotic world through a keyhole, whence his anguish issues thin and faint.|6415 113 | /path_to_libritts/8088/284756/8088_284756_000169_000000.wav|They searched ceaselessly for something, and I guessed that something was food.|8088 114 | /path_to_libritts/8088/284756/8088_284756_000181_000002.wav|The man and the woman came up, and I looked closely into their faces.|8088 115 | /path_to_libritts/7505/83618/7505_83618_000009_000010.wav|King Manco, however, was a real character, the Rudolph of Hapsburg of their reigning family, and flourished about the eleventh century.|7505 116 | /path_to_libritts/4406/16883/4406_16883_000001_000013.wav|When I came ashore, they gathered all about me, I sitting alone in the midst.|4406 117 | /path_to_libritts/5022/29411/5022_29411_000020_000000.wav|"But, Mr. 
Toller," I objected, "something must have happened to distress her.|5022 118 | /path_to_libritts/7190/90542/7190_90542_000113_000000.wav|His calmness almost exasperated me, but he would answer no questions until we had returned to our hotel and had breakfast.|7190 119 | /path_to_libritts/405/130894/405_130894_000046_000000.wav|"Professor Aronnax," he told me, "this calls for heroic measures, or we'll be sealed up in this solidified water as if it were cement."|405 120 | /path_to_libritts/7302/86815/7302_86815_000008_000001.wav|The Judge.|7302 121 | /path_to_libritts/7367/86737/7367_86737_000130_000002.wav|Cucumetto fancied for a moment the young man was about to take her in his arms and fly; but this mattered little to him now Rita had been his; and as for the money, three hundred piastres distributed among the band was so small a sum that he cared little about it.|7367 122 | /path_to_libritts/125/121124/125_121124_000021_000001.wav|But whom are you seeking, Debray?"|125 123 | /path_to_libritts/2092/145706/2092_145706_000054_000000.wav|Quick as lightning the wolf flew round the wood, and in a minute many hundred wolves rose up before him, increasing in number every moment, till they could be counted by thousands.|2092 124 | /path_to_libritts/8238/274553/8238_274553_000018_000000.wav|Your most humble servant,|8238 125 | /path_to_libritts/2836/5355/2836_5355_000085_000000.wav|"Don't I know it?|2836 126 | /path_to_libritts/587/41619/587_41619_000035_000002.wav|Besides, Miss Pierson is too short.|587 127 | /path_to_libritts/6272/70171/6272_70171_000011_000002.wav|In the morning, she said, we should see her three children. She never left them, she was so afraid of their being ill, also telling mother that she would do all in her power to make my stay in Rosville pleasant and profitable.|6272 128 | /path_to_libritts/2436/2481/2436_2481_000005_000001.wav|Both were closed.|2436 129 | /path_to_libritts/2911/7601/2911_7601_000009_000004.wav|I say I knew it well. I knew what the old man felt, and pitied him, although I chuckled at heart.|2911 130 | /path_to_libritts/6064/56165/6064_56165_000059_000000.wav|"She's invited to my cooking party next week," said Nora.|6064 131 | /path_to_libritts/6272/70168/6272_70168_000025_000000.wav|A smile crept into her blue eye, as she said: "My hearing him, or not, would make no difference, since God could hear and answer."|6272 132 | /path_to_libritts/1502/122615/1502_122615_000008_000001.wav|He would greatly have preferred silence and meditation to speech, when a discovery of his real condition might prove so instantly fatal.|1502 133 | /path_to_libritts/3983/5371/3983_5371_000025_000003.wav|What on earth had put him into that state?|3983 134 | /path_to_libritts/6848/76049/6848_76049_000005_000012.wav|Old Grammont had struck the table sharply and the eyes that looked out of his mask had blazed.|6848 135 | /path_to_libritts/460/172359/460_172359_000049_000001.wav|It was all over the town in a minutes.|460 136 | /path_to_libritts/8838/298545/8838_298545_000053_000000.wav|"I said that he was a welter weight."|8838 137 | /path_to_libritts/3983/5331/3983_5331_000042_000002.wav|"Have you told me all?" 
he asked presently, lifting them.|3983 138 | /path_to_libritts/5322/7680/5322_7680_000047_000001.wav|If you don't you will offend me.|5322 139 | /path_to_libritts/2196/170379/2196_170379_000014_000001.wav|This is a legitimate use of regression although it is not used so much these days to uncover past traumatic incidents.|2196 140 | /path_to_libritts/118/47824/118_47824_000101_000000.wav|"I won't blame Carlos for that," Bobby muttered.|118 141 | /path_to_libritts/8088/284756/8088_284756_000135_000001.wav|Something she had drawn from her girdle shone palely in her hand.|8088 142 | /path_to_libritts/6415/116629/6415_116629_000036_000007.wav|Dietrich was walking in steep and dangerous paths; that she was sure of, but he knew the straight road and would not his steps turn back to it again?|6415 143 | /path_to_libritts/2136/5143/2136_5143_000055_000002.wav|Let us talk of something else.'|2136 144 | /path_to_libritts/1246/124548/1246_124548_000020_000001.wav|You take a genuwine, honest-to-God homo Americanibus and there ain't anything he's afraid to tackle.|1246 145 | /path_to_libritts/5022/29405/5022_29405_000021_000003.wav|We will drive out after luncheon, and pay a round of visits." When this prospect was placed before me, I remembered having read in books of sensitive persons receiving impressions which made their blood run cold; I now found myself one of those persons, for the first time in my life.|5022 146 | /path_to_libritts/40/222/40_222_000011_000007.wav|She drew back, trying to beg their pardon, but was, with gentle violence, forced to return; and the others withdrew, after Eleanor had affectionately expressed a wish of being of use or comfort to her.|40 147 | /path_to_libritts/2836/5354/2836_5354_000056_000001.wav|"I cannot help the delay."|2836 148 | /path_to_libritts/887/123289/887_123289_000033_000000.wav|"Yes; to be sure we have.|887 149 | /path_to_libritts/7447/91186/7447_91186_000027_000005.wav|Spontaneous as was his creative power he was most painstaking in regard to the setting of his musical ideas and would often devote weeks to re-writing a single page that every detail might be perfect.|7447 150 | /path_to_libritts/8238/283452/8238_283452_000003_000000.wav|HUMOROUS GHOST STORIES|8238 151 | /path_to_libritts/7447/91187/7447_91187_000018_000000.wav|In regard to the much discussed tempo rubato of Chopin many and fatal blunders have been made.|7447 152 | /path_to_libritts/4018/103416/4018_103416_000070_000000.wav|"Heavens, how late it is!" 
she exclaimed.|4018 153 | /path_to_libritts/7067/76048/7067_76048_000064_000004.wav|They don't seem to know what they are doing.|7067 154 | /path_to_libritts/8838/298546/8838_298546_000013_000000.wav|The sheet of the paper which he held up was a lake of print around an islet of illustration.|8838 155 | /path_to_libritts/4297/13006/4297_13006_000049_000002.wav|He is well educated;--oh, so much better than most men that one meets.|4297 156 | /path_to_libritts/6415/116629/6415_116629_000036_000009.wav|She recalled the evening of the day when her husband was borne from the house to his burial.|6415 157 | /path_to_libritts/8088/284756/8088_284756_000167_000000.wav|There was a gray haze of mist everywhere.|8088 158 | /path_to_libritts/3879/174923/3879_174923_000019_000002.wav|Now Lord Chiltern was again his very intimate friend.|3879 159 | /path_to_libritts/5750/100289/5750_100289_000039_000004.wav|Or should we keep clear of these matters, avoid discussion of official methods and action, and simply aim at arousing racial pride and ambition along new lines, holding up a modern ideal for the support and encouragement of our youth?|5750 160 | /path_to_libritts/2136/5143/2136_5143_000018_000000.wav|'Poor mamma is buried there.'|2136 161 | /path_to_libritts/8629/261139/8629_261139_000033_000004.wav|Still, this would not seem to be reason enough for me to intrude upon her late at night with a plea for a large loan of money, had I not been in a desperate condition of mind, which made any attempt seem reasonable that promised relief from the unendurable burden of a pressing and disreputable debt.|8629 162 | /path_to_libritts/1963/147036/1963_147036_000017_000002.wav|She looked up at his words.|1963 163 | /path_to_libritts/8419/286676/8419_286676_000018_000001.wav|"I think the Ducks spoil their children," said she.|8419 164 | /path_to_libritts/2092/145706/2092_145706_000040_000001.wav|Before to-morrow night all the grain in the kingdom has to be gathered into one big heap, and if as much as a stalk of corn is wanting I must pay for it with my life.'|2092 165 | /path_to_libritts/2416/152139/2416_152139_000083_000000.wav|An instant Jimmie Dale watched the other, then he picked up the sheet of paper.|2416 166 | /path_to_libritts/669/129074/669_129074_000024_000007.wav|Isn't the whole course of life made up of such?|669 167 | /path_to_libritts/7402/90848/7402_90848_000060_000001.wav|He had arrived half an hour late, but he could have that half-hour back again!|7402 168 | /path_to_libritts/8088/284756/8088_284756_000163_000002.wav|One of the occupants of the room was a very old man; his face was wrinkled, and his hair was silvery.|8088 169 | /path_to_libritts/2136/5147/2136_5147_000051_000000.wav|But it was impossible long to be vexed with Cousin Monica.|2136 170 | /path_to_libritts/6818/68772/6818_68772_000026_000002.wav|Forbes, who never earned a dollar in his life, but inherited his money, is trying to take the dollars out of the pockets of the farmers by depriving them of the income derived by selling spaces for advertising signs.|6818 171 | /path_to_libritts/3664/11714/3664_11714_000020_000003.wav|Yet, although hitherto he had bowed his head before the authority of the Church, he had already raised it against the temporal power.|3664 172 | /path_to_libritts/78/369/78_369_000017_000001.wav|I know not whether the fiend possessed the same advantages, but I found that, as before I had daily lost ground in the pursuit, I now gained on him, so much so that when I first saw the ocean he was but one day's 
journey in advance, and I hoped to intercept him before he should reach the beach.|78 173 | /path_to_libritts/4195/186238/4195_186238_000011_000000.wav|Another hour passed.|4195 174 | /path_to_libritts/8088/284756/8088_284756_000077_000003.wav|The apparatus is strewn all over the place."|8088 175 | /path_to_libritts/8088/284756/8088_284756_000055_000005.wav|Then she saw the pool.|8088 176 | /path_to_libritts/8088/284756/8088_284756_000085_000000.wav|"Yes, that's very true: Carson is a most decent sort of chap." The words were not spoken.|8088 177 | /path_to_libritts/4195/186236/4195_186236_000009_000000.wav|"Good-by to my five thousand," said Uncle John, with his chuckling laugh.|4195 178 | /path_to_libritts/4160/14187/4160_14187_000035_000001.wav|"A sound thinker gives equal consideration to the probable and the improbable."|4160 179 | /path_to_libritts/4297/13009/4297_13009_000003_000003.wav|They were habitually indifferent to self-exaltation, and allowed themselves to be thrust into this or that unfitting role, professing that the Queen's Government and the good of the country were their only considerations.|4297 180 | /path_to_libritts/1116/132847/1116_132847_000014_000003.wav|The stick I shall keep for myself, so that I can fly to you if ever you have need of me.'|1116 -------------------------------------------------------------------------------- /filelists/ljs_audiopaths_text_sid_val_filelist.txt: -------------------------------------------------------------------------------- 1 | /path_to_ljs/LJ045-0227.wav|and suggests the possibility, as did his note to his wife just prior to the attempt on General Walker, that he did not expect to escape at all.|0 2 | /path_to_ljs/LJ043-0073.wav|On October nine, nineteen sixty-two he went to the Dallas office of the Texas Employment Commission|0 3 | /path_to_ljs/LJ021-0054.wav|that the way to wealth is through work.|0 4 | /path_to_ljs/LJ036-0134.wav|I asked him where he wanted to go. And he said, "five hundred North Beckley. Well, I started up,|0 5 | /path_to_ljs/LJ045-0069.wav|Katherine Ford, with whom Marina Oswald stayed during her separation from her husband in November of nineteen sixty-two,|0 6 | /path_to_ljs/LJ006-0134.wav|The governor himself admitted that a prisoner of weak intellect who had been severely beaten and much injured by a wardsman did not dare complain|0 7 | /path_to_ljs/LJ047-0168.wav|and that he had given as his address Mrs. Paine's residence in Irving.|0 8 | /path_to_ljs/LJ035-0136.wav|Lovelady and Shelley moved out into the street.|0 9 | /path_to_ljs/LJ028-0219.wav|All the people of Babylon prostrated themselves before him, and, kissing his feet, rejoiced in his sovereignty, while happiness shone on their faces.|0 10 | /path_to_ljs/LJ032-0108.wav|Oswald listed a "Sgt. 
Robert Hidell" as a reference on one job application and "George Hidell" as a reference on another.|0 11 | /path_to_ljs/LJ009-0015.wav|I will quote an extract from the reverend gentleman's own journal.|0 12 | /path_to_ljs/LJ036-0154.wav|The walk from Beckley and Neely to ten twenty-six North Beckley was timed by Commission counsel at five minutes and forty-five seconds.|0 13 | /path_to_ljs/LJ048-0055.wav|a more alert and carefully considered treatment of the Oswald case by the Bureau might have brought about such a referral.|0 14 | /path_to_ljs/LJ013-0063.wav|The other, remaining unclaimed for ten years, was transferred at the end of that time to the commissioners for the reduction of the National Debt.|0 15 | /path_to_ljs/LJ033-0207.wav|In light of the other evidence linking Lee Harvey Oswald, the blanket, and the rifle to the paper bag found on the sixth floor,|0 16 | /path_to_ljs/LJ037-0025.wav|Benavides saw a man standing at the right side of the parked police car. He then heard three shots and saw the policeman fall to the ground.|0 17 | /path_to_ljs/LJ019-0320.wav|was the boon to which willing industry extending over a long period established a certain claim.|0 18 | /path_to_ljs/LJ041-0095.wav|Oswald read a good deal, said Powers, but, quote, he would never be reading any of the shoot-em-up westerns or anything like that.|0 19 | /path_to_ljs/LJ016-0048.wav|a distance of eight or nine feet.|0 20 | /path_to_ljs/LJ027-0026.wav|lead to a preview of certain principles of adaptation, necessary for their interpretation.|0 21 | /path_to_ljs/LJ024-0127.wav|You who know me can have no fear that I would tolerate the destruction by any branch of government of any part of our heritage of freedom.|0 22 | /path_to_ljs/LJ002-0241.wav|The court of the Marshalsea was instituted by Charles the first in the sixth year of his reign,|0 23 | /path_to_ljs/LJ002-0314.wav|was after eighteen oh seven, through the exertions of the keeper of the jail, spent in the purchase of necessaries.|0 24 | /path_to_ljs/LJ004-0221.wav|The food was properly prepared in the prison kitchen.|0 25 | /path_to_ljs/LJ024-0065.wav|Such a succession of appointments should have provided a Court well-balanced as to age.|0 26 | /path_to_ljs/LJ003-0030.wav|These subordinate chiefs were also rewarded out of the scanty prison rations.|0 27 | /path_to_ljs/LJ042-0075.wav|testified that Oswald was extremely sure of himself and seemed, quote, to know what his mission was.|0 28 | /path_to_ljs/LJ018-0059.wav|He saw Mr. Briggs' watch-chain, and followed him instantly into the carriage, determined to have it at all costs.|0 29 | /path_to_ljs/LJ015-0002.wav|The course of the swindlers was by no means smooth, but it was not till eighteen fifty-four that suspicion arose that anything was wrong.|0 30 | /path_to_ljs/LJ049-0177.wav|and Robert I. Bouck, who was in charge of the Protective Research Section of the Secret Service, believed that the accumulation of the facts known to the FBI|0 31 | /path_to_ljs/LJ044-0028.wav|She testified that he threatened to beat her if she did not do so. 
The chapter had never been chartered by the national FPCC organization.|0 32 | /path_to_ljs/LJ022-0185.wav|The answer to this demand was the Federal Reserve System.|0 33 | /path_to_ljs/LJ010-0303.wav|According to Fauntleroy's own case, he found at once that the firm was heavily involved,|0 34 | /path_to_ljs/LJ030-0236.wav|I quickly observed unnatural movement of crowds, like ducking or scattering, and quick movements in the Presidential follow-up car.|0 35 | /path_to_ljs/LJ018-0204.wav|Where it had lain was a yawning gulf or trap sufficient to do for the whole body of police engaged in the capture.|0 36 | /path_to_ljs/LJ024-0058.wav|or make independent on upon the desire or prejudice of any individual justice?|0 37 | /path_to_ljs/LJ023-0130.wav|was to infuse new blood into all our courts.|0 38 | /path_to_ljs/LJ001-0101.wav|It is discouraging to note that the improvement of the last fifty years is almost wholly confined to Great Britain.|0 39 | /path_to_ljs/LJ020-0061.wav|Break the rolls apart from one another and eat warm. They are also good cold, and if the directions be followed implicitly, very good always.|0 40 | /path_to_ljs/LJ005-0118.wav|In many others there were no infirmaries, no places set apart for the confinement of prisoners afflicted with dangerous and infectious disorders.|0 41 | /path_to_ljs/LJ047-0072.wav|stating that he had distributed its pamphlets on the streets of Dallas. This information did not reach Agent Hosty in Dallas until June.|0 42 | /path_to_ljs/LJ050-0156.wav|this money would be used to compensate consultants, to lease standard equipment or to purchase specially designed pilot equipment.|0 43 | /path_to_ljs/LJ018-0040.wav|His trial followed at the next sessions of the Central Criminal Court, and ended in his conviction.|0 44 | /path_to_ljs/LJ030-0123.wav|Special Agent Glen A. Bennett once left his place inside the follow-up car to help keep the crowd away from the President's car.|0 45 | /path_to_ljs/LJ018-0336.wav|Webster's devices for disposing of the body of her victim will call to mind those of Theodore Gardelle,|0 46 | /path_to_ljs/LJ028-0299.wav|"Had I told thee," rejoined the other, "what I was bent on doing, thou wouldst not have suffered it;|0 47 | /path_to_ljs/LJ040-0013.wav|Oswald's complete state of mind and character are now outside of the power of man to know.|0 48 | /path_to_ljs/LJ013-0120.wav|whom he had shown over the plate closet.|0 49 | /path_to_ljs/LJ018-0247.wav|William Roupell himself was brought as a principal witness to clench the case by a confession altogether against himself.|0 50 | /path_to_ljs/LJ015-0106.wav|A reward was forthwith offered for Robson's apprehension.|0 51 | /path_to_ljs/LJ030-0050.wav|The Presidential limousine.|0 52 | /path_to_ljs/LJ031-0020.wav|two rooms were prepared.|0 53 | /path_to_ljs/LJ008-0281.wav|They found at Newgate, under disgraceful conditions as already described,|0 54 | /path_to_ljs/LJ002-0048.wav|two. 
The female debtors' side consisted of a court-yard forty-nine by sixteen feet,|0 55 | /path_to_ljs/LJ017-0085.wav|Palmer's plan was to administer poison in quantities insufficient to cause death, but enough to produce illness which would account for death.|0 56 | /path_to_ljs/LJ045-0111.wav|They asked for Lee Oswald who was not called to the telephone because he was known by the other name.|0 57 | /path_to_ljs/LJ027-0144.wav|Now, it must be obvious|0 58 | /path_to_ljs/LJ040-0066.wav|It had its effect on Lee's mother, Marguerite, his brother Robert, who had been born in nineteen thirty-four,|0 -------------------------------------------------------------------------------- /fp16_optimizer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch.autograd import Variable 4 | from torch.nn.parameter import Parameter 5 | from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors 6 | 7 | from loss_scaler import DynamicLossScaler, LossScaler 8 | 9 | FLOAT_TYPES = (torch.FloatTensor, torch.cuda.FloatTensor) 10 | HALF_TYPES = (torch.HalfTensor, torch.cuda.HalfTensor) 11 | 12 | def conversion_helper(val, conversion): 13 | """Apply conversion to val. Recursively apply conversion if `val` is a nested tuple/list structure.""" 14 | if not isinstance(val, (tuple, list)): 15 | return conversion(val) 16 | rtn = [conversion_helper(v, conversion) for v in val] 17 | if isinstance(val, tuple): 18 | rtn = tuple(rtn) 19 | return rtn 20 | 21 | def fp32_to_fp16(val): 22 | """Convert fp32 `val` to fp16""" 23 | def half_conversion(val): 24 | val_typecheck = val 25 | if isinstance(val_typecheck, (Parameter, Variable)): 26 | val_typecheck = val.data 27 | if isinstance(val_typecheck, FLOAT_TYPES): 28 | val = val.half() 29 | return val 30 | return conversion_helper(val, half_conversion) 31 | 32 | def fp16_to_fp32(val): 33 | """Convert fp16 `val` to fp32""" 34 | def float_conversion(val): 35 | val_typecheck = val 36 | if isinstance(val_typecheck, (Parameter, Variable)): 37 | val_typecheck = val.data 38 | if isinstance(val_typecheck, HALF_TYPES): 39 | val = val.float() 40 | return val 41 | return conversion_helper(val, float_conversion) 42 | 43 | class FP16_Module(nn.Module): 44 | def __init__(self, module): 45 | super(FP16_Module, self).__init__() 46 | self.add_module('module', module.half()) 47 | 48 | def forward(self, *inputs, **kwargs): 49 | return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs)) 50 | 51 | class FP16_Optimizer(object): 52 | """ 53 | FP16_Optimizer is designed to wrap an existing PyTorch optimizer, 54 | and enable an fp16 model to be trained using a master copy of fp32 weights. 55 | 56 | Args: 57 | optimizer (torch.optim.optimizer): Existing optimizer containing initialized fp16 parameters. Internally, FP16_Optimizer replaces the passed optimizer's fp16 parameters with new fp32 parameters copied from the original ones. FP16_Optimizer also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy after each step. 58 | static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale fp16 gradients computed by the model. Scaled gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so static_loss_scale should not affect learning rate. 59 | dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any static_loss_scale option. 
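        Example (a minimal end-to-end sketch; ``MyNet``, ``loss_fn`` and ``loader`` are
        illustrative assumptions, not part of this repo)::

            model = FP16_Module(MyNet()).cuda()          # MyNet: hypothetical fp32 network
            optimizer = FP16_Optimizer(
                torch.optim.SGD(model.parameters(), lr=1e-3),
                dynamic_loss_scale=True)

            for input, target in loader:                 # loader: hypothetical CUDA data loader
                optimizer.zero_grad()
                loss = loss_fn(model(input), target)     # loss_fn: hypothetical criterion
                optimizer.backward(loss)                 # replaces loss.backward()
                optimizer.clip_fp32_grads(clip=1.0)      # optional fp32 grad clipping
                optimizer.step()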
60 | 61 | """ 62 | 63 | def __init__(self, optimizer, static_loss_scale=1.0, dynamic_loss_scale=False): 64 | if not torch.cuda.is_available: 65 | raise SystemError('Cannot use fp16 without CUDA') 66 | 67 | self.fp16_param_groups = [] 68 | self.fp32_param_groups = [] 69 | self.fp32_flattened_groups = [] 70 | for i, param_group in enumerate(optimizer.param_groups): 71 | print("FP16_Optimizer processing param group {}:".format(i)) 72 | fp16_params_this_group = [] 73 | fp32_params_this_group = [] 74 | for param in param_group['params']: 75 | if param.requires_grad: 76 | if param.type() == 'torch.cuda.HalfTensor': 77 | print("FP16_Optimizer received torch.cuda.HalfTensor with {}" 78 | .format(param.size())) 79 | fp16_params_this_group.append(param) 80 | elif param.type() == 'torch.cuda.FloatTensor': 81 | print("FP16_Optimizer received torch.cuda.FloatTensor with {}" 82 | .format(param.size())) 83 | fp32_params_this_group.append(param) 84 | else: 85 | raise TypeError("Wrapped parameters must be either " 86 | "torch.cuda.FloatTensor or torch.cuda.HalfTensor. " 87 | "Received {}".format(param.type())) 88 | 89 | fp32_flattened_this_group = None 90 | if len(fp16_params_this_group) > 0: 91 | fp32_flattened_this_group = _flatten_dense_tensors( 92 | [param.detach().data.clone().float() for param in fp16_params_this_group]) 93 | 94 | fp32_flattened_this_group = Variable(fp32_flattened_this_group, requires_grad = True) 95 | 96 | fp32_flattened_this_group.grad = fp32_flattened_this_group.new( 97 | *fp32_flattened_this_group.size()) 98 | 99 | # python's lovely list concatenation via + 100 | if fp32_flattened_this_group is not None: 101 | param_group['params'] = [fp32_flattened_this_group] + fp32_params_this_group 102 | else: 103 | param_group['params'] = fp32_params_this_group 104 | 105 | self.fp16_param_groups.append(fp16_params_this_group) 106 | self.fp32_param_groups.append(fp32_params_this_group) 107 | self.fp32_flattened_groups.append(fp32_flattened_this_group) 108 | 109 | # print("self.fp32_flattened_groups = ", self.fp32_flattened_groups) 110 | # print("self.fp16_param_groups = ", self.fp16_param_groups) 111 | 112 | self.optimizer = optimizer.__class__(optimizer.param_groups) 113 | 114 | # self.optimizer.load_state_dict(optimizer.state_dict()) 115 | 116 | self.param_groups = self.optimizer.param_groups 117 | 118 | if dynamic_loss_scale: 119 | self.dynamic_loss_scale = True 120 | self.loss_scaler = DynamicLossScaler() 121 | else: 122 | self.dynamic_loss_scale = False 123 | self.loss_scaler = LossScaler(static_loss_scale) 124 | 125 | self.overflow = False 126 | self.first_closure_call_this_step = True 127 | 128 | def zero_grad(self): 129 | """ 130 | Zero fp32 and fp16 parameter grads. 131 | """ 132 | self.optimizer.zero_grad() 133 | for fp16_group in self.fp16_param_groups: 134 | for param in fp16_group: 135 | if param.grad is not None: 136 | param.grad.detach_() # This does appear in torch.optim.optimizer.zero_grad(), 137 | # but I'm not sure why it's needed. 
138 | param.grad.zero_() 139 | 140 | def _check_overflow(self): 141 | params = [] 142 | for group in self.fp16_param_groups: 143 | for param in group: 144 | params.append(param) 145 | for group in self.fp32_param_groups: 146 | for param in group: 147 | params.append(param) 148 | self.overflow = self.loss_scaler.has_overflow(params) 149 | 150 | def _update_scale(self, has_overflow=False): 151 | self.loss_scaler.update_scale(has_overflow) 152 | 153 | def _copy_grads_fp16_to_fp32(self): 154 | for fp32_group, fp16_group in zip(self.fp32_flattened_groups, self.fp16_param_groups): 155 | if len(fp16_group) > 0: 156 | # This might incur one more deep copy than is necessary. 157 | fp32_group.grad.data.copy_( 158 | _flatten_dense_tensors([fp16_param.grad.data for fp16_param in fp16_group])) 159 | 160 | def _downscale_fp32(self): 161 | if self.loss_scale != 1.0: 162 | for param_group in self.optimizer.param_groups: 163 | for param in param_group['params']: 164 | param.grad.data.mul_(1./self.loss_scale) 165 | 166 | def clip_fp32_grads(self, clip=-1): 167 | if not self.overflow: 168 | fp32_params = [] 169 | for param_group in self.optimizer.param_groups: 170 | for param in param_group['params']: 171 | fp32_params.append(param) 172 | if clip > 0: 173 | return torch.nn.utils.clip_grad_norm(fp32_params, clip) 174 | 175 | def _copy_params_fp32_to_fp16(self): 176 | for fp16_group, fp32_group in zip(self.fp16_param_groups, self.fp32_flattened_groups): 177 | if len(fp16_group) > 0: 178 | for fp16_param, fp32_data in zip(fp16_group, 179 | _unflatten_dense_tensors(fp32_group.data, fp16_group)): 180 | fp16_param.data.copy_(fp32_data) 181 | 182 | def state_dict(self): 183 | """ 184 | Returns a dict containing the current state of this FP16_Optimizer instance. 185 | This dict contains attributes of FP16_Optimizer, as well as the state_dict 186 | of the contained Pytorch optimizer. 187 | 188 | Untested. 189 | """ 190 | state_dict = {} 191 | state_dict['loss_scaler'] = self.loss_scaler 192 | state_dict['dynamic_loss_scale'] = self.dynamic_loss_scale 193 | state_dict['overflow'] = self.overflow 194 | state_dict['first_closure_call_this_step'] = self.first_closure_call_this_step 195 | state_dict['optimizer_state_dict'] = self.optimizer.state_dict() 196 | return state_dict 197 | 198 | def load_state_dict(self, state_dict): 199 | """ 200 | Loads a state_dict created by an earlier call to state_dict. 201 | 202 | Untested. 203 | """ 204 | self.loss_scaler = state_dict['loss_scaler'] 205 | self.dynamic_loss_scale = state_dict['dynamic_loss_scale'] 206 | self.overflow = state_dict['overflow'] 207 | self.first_closure_call_this_step = state_dict['first_closure_call_this_step'] 208 | self.optimizer.load_state_dict(state_dict['optimizer_state_dict']) 209 | 210 | def step(self, closure=None): # could add clip option. 211 | """ 212 | If no closure is supplied, step should be called after fp16_optimizer_obj.backward(loss). 213 | step updates the fp32 master copy of parameters using the optimizer supplied to 214 | FP16_Optimizer's constructor, then copies the updated fp32 params into the fp16 params 215 | originally referenced by Fp16_Optimizer's constructor, so the user may immediately run 216 | another forward pass using their model. 217 | 218 | If a closure is supplied, step may be called without a prior call to self.backward(loss). 219 | However, the user should take care that any loss.backward() call within the closure 220 | has been replaced by fp16_optimizer_obj.backward(loss). 
221 | 222 | Args: 223 | closure (optional): Closure that will be supplied to the underlying optimizer originally passed to FP16_Optimizer's constructor. closure should call zero_grad on the FP16_Optimizer object, compute the loss, call .backward(loss), and return the loss. 224 | 225 | Closure example:: 226 | 227 | # optimizer is assumed to be an FP16_Optimizer object, previously constructed from an 228 | # existing pytorch optimizer. 229 | for input, target in dataset: 230 | def closure(): 231 | optimizer.zero_grad() 232 | output = model(input) 233 | loss = loss_fn(output, target) 234 | optimizer.backward(loss) 235 | return loss 236 | optimizer.step(closure) 237 | 238 | .. note:: 239 | The only changes that need to be made compared to 240 | `ordinary optimizer closures`_ are that "optimizer" itself should be an instance of 241 | FP16_Optimizer, and that the call to loss.backward should be replaced by 242 | optimizer.backward(loss). 243 | 244 | .. warning:: 245 | Currently, calling step with a closure is not compatible with dynamic loss scaling. 246 | 247 | .. _`ordinary optimizer closures`: 248 | http://pytorch.org/docs/master/optim.html#optimizer-step-closure 249 | """ 250 | if closure is not None and isinstance(self.loss_scaler, DynamicLossScaler): 251 | raise TypeError("Using step with a closure is currently not " 252 | "compatible with dynamic loss scaling.") 253 | 254 | scale = self.loss_scaler.loss_scale 255 | self._update_scale(self.overflow) 256 | 257 | if self.overflow: 258 | print("OVERFLOW! Skipping step. Attempted loss scale: {}".format(scale)) 259 | return 260 | 261 | if closure is not None: 262 | self._step_with_closure(closure) 263 | else: 264 | self.optimizer.step() 265 | 266 | self._copy_params_fp32_to_fp16() 267 | 268 | return 269 | 270 | def _step_with_closure(self, closure): 271 | def wrapped_closure(): 272 | if self.first_closure_call_this_step: 273 | """ 274 | We expect that the fp16 params are initially fresh on entering self.step(), 275 | so _copy_params_fp32_to_fp16() is unnecessary the first time wrapped_closure() 276 | is called within self.optimizer.step(). 277 | """ 278 | self.first_closure_call_this_step = False 279 | else: 280 | """ 281 | If self.optimizer.step() internally calls wrapped_closure more than once, 282 | it may update the fp32 params after each call. However, self.optimizer 283 | doesn't know about the fp16 params at all. If the fp32 params get updated, 284 | we can't rely on self.optimizer to refresh the fp16 params. We need 285 | to handle that manually: 286 | """ 287 | self._copy_params_fp32_to_fp16() 288 | 289 | """ 290 | Our API expects the user to give us ownership of the backward() call by 291 | replacing all calls to loss.backward() with optimizer.backward(loss). 292 | This requirement holds whether or not the call to backward() is made within 293 | a closure. 294 | If the user is properly calling optimizer.backward(loss) within "closure," 295 | calling closure() here will give the fp32 master params fresh gradients 296 | for the optimizer to play with, 297 | so all wrapped_closure needs to do is call closure() and return the loss. 
298 | """ 299 | temp_loss = closure() 300 | return temp_loss 301 | 302 | self.optimizer.step(wrapped_closure) 303 | 304 | self.first_closure_call_this_step = True 305 | 306 | def backward(self, loss, update_fp32_grads=True): 307 | """ 308 | fp16_optimizer_obj.backward performs the following conceptual operations: 309 | 310 | fp32_loss = loss.float() (see first Note below) 311 | 312 | scaled_loss = fp32_loss*loss_scale 313 | 314 | scaled_loss.backward(), which accumulates scaled gradients into the .grad attributes of the 315 | fp16 model's leaves. 316 | 317 | fp16 grads are then copied to the stored fp32 params' .grad attributes (see second Note). 318 | 319 | Finally, fp32 grads are divided by loss_scale. 320 | 321 | In this way, after fp16_optimizer_obj.backward, the fp32 parameters have fresh gradients, 322 | and fp16_optimizer_obj.step may be called. 323 | 324 | .. note:: 325 | Converting the loss to fp32 before applying the loss scale provides some 326 | additional safety against overflow if the user has supplied an fp16 value. 327 | However, for maximum overflow safety, the user should 328 | compute the loss criterion (MSE, cross entropy, etc) in fp32 before supplying it to 329 | fp16_optimizer_obj.backward. 330 | 331 | .. note:: 332 | The gradients found in an fp16 model's leaves after a call to 333 | fp16_optimizer_obj.backward should not be regarded as valid in general, 334 | because it's possible 335 | they have been scaled (and in the case of dynamic loss scaling, 336 | the scale factor may silently change over time). 337 | If the user wants to inspect gradients after a call to fp16_optimizer_obj.backward, 338 | he/she should query the .grad attribute of FP16_Optimizer's stored fp32 parameters. 339 | 340 | Args: 341 | loss: The loss output by the user's model. loss may be either float or half (but see first Note above). 342 | update_fp32_grads (bool, optional, default=True): Option to copy fp16 grads to fp32 grads on this call. By setting this to False, the user can delay this copy, which is useful to eliminate redundant fp16->fp32 grad copies if fp16_optimizer_obj.backward is being called on multiple losses in one iteration. If set to False, the user becomes responsible for calling fp16_optimizer_obj.update_fp32_grads before calling fp16_optimizer_obj.step. 343 | 344 | Example:: 345 | 346 | # Ordinary operation: 347 | optimizer.backward(loss) 348 | 349 | # Naive operation with multiple losses (technically valid, but less efficient): 350 | # fp32 grads will be correct after the second call, but 351 | # the first call incurs an unnecessary fp16->fp32 grad copy. 352 | optimizer.backward(loss1) 353 | optimizer.backward(loss2) 354 | 355 | # More efficient way to handle multiple losses: 356 | # The fp16->fp32 grad copy is delayed until fp16 grads from all 357 | # losses have been accumulated. 358 | optimizer.backward(loss1, update_fp32_grads=False) 359 | optimizer.backward(loss2, update_fp32_grads=False) 360 | optimizer.update_fp32_grads() 361 | """ 362 | self.loss_scaler.backward(loss.float()) 363 | if update_fp32_grads: 364 | self.update_fp32_grads() 365 | 366 | def update_fp32_grads(self): 367 | """ 368 | Copy the .grad attribute from stored references to fp16 parameters to 369 | the .grad attribute of the master fp32 parameters that are directly 370 | updated by the optimizer. :attr:`update_fp32_grads` only needs to be called if 371 | fp16_optimizer_obj.backward was called with update_fp32_grads=False. 
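        Example (a minimal sketch of the delayed-copy pattern; ``loss1`` and ``loss2`` are
        assumed to be two losses computed in the same iteration)::

            optimizer.zero_grad()
            optimizer.backward(loss1, update_fp32_grads=False)   # fp16 grads accumulate
            optimizer.backward(loss2, update_fp32_grads=False)
            optimizer.update_fp32_grads()                        # one fp16->fp32 grad copy
            optimizer.step()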
372 | """ 373 | if self.dynamic_loss_scale: 374 | self._check_overflow() 375 | if self.overflow: return 376 | self._copy_grads_fp16_to_fp32() 377 | self._downscale_fp32() 378 | 379 | @property 380 | def loss_scale(self): 381 | return self.loss_scaler.loss_scale 382 | -------------------------------------------------------------------------------- /hparams.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from text.symbols import symbols 3 | 4 | 5 | def create_hparams(hparams_string=None, verbose=False): 6 | """Create model hyperparameters. Parse nondefault from given string.""" 7 | 8 | hparams = tf.contrib.training.HParams( 9 | ################################ 10 | # Experiment Parameters # 11 | ################################ 12 | epochs=50000, 13 | iters_per_checkpoint=500, 14 | seed=1234, 15 | dynamic_loss_scaling=True, 16 | fp16_run=False, 17 | distributed_run=False, 18 | dist_backend="nccl", 19 | dist_url="tcp://localhost:54321", 20 | cudnn_enabled=True, 21 | cudnn_benchmark=False, 22 | ignore_layers=['speaker_embedding.weight'], 23 | 24 | ################################ 25 | # Data Parameters # 26 | ################################ 27 | training_files='filelists/ljs_audiopaths_text_sid_train_filelist.txt', 28 | validation_files='filelists/ljs_audiopaths_text_sid_val_filelist.txt', 29 | text_cleaners=['english_cleaners'], 30 | p_arpabet=1.0, 31 | cmudict_path="data/cmu_dictionary", 32 | 33 | ################################ 34 | # Audio Parameters # 35 | ################################ 36 | max_wav_value=32768.0, 37 | sampling_rate=22050, 38 | filter_length=1024, 39 | hop_length=256, 40 | win_length=1024, 41 | n_mel_channels=80, 42 | mel_fmin=0.0, 43 | mel_fmax=8000.0, 44 | f0_min=80, 45 | f0_max=880, 46 | harm_thresh=0.25, 47 | 48 | ################################ 49 | # Model Parameters # 50 | ################################ 51 | n_symbols=len(symbols), 52 | symbols_embedding_dim=512, 53 | 54 | # Encoder parameters 55 | encoder_kernel_size=5, 56 | encoder_n_convolutions=3, 57 | encoder_embedding_dim=512, 58 | 59 | # Decoder parameters 60 | n_frames_per_step=1, # currently only 1 is supported 61 | decoder_rnn_dim=1024, 62 | prenet_dim=256, 63 | prenet_f0_n_layers=1, 64 | prenet_f0_dim=1, 65 | prenet_f0_kernel_size=1, 66 | prenet_rms_dim=0, 67 | prenet_rms_kernel_size=1, 68 | max_decoder_steps=1000, 69 | gate_threshold=0.5, 70 | p_attention_dropout=0.1, 71 | p_decoder_dropout=0.1, 72 | p_teacher_forcing=1.0, 73 | 74 | # Attention parameters 75 | attention_rnn_dim=1024, 76 | attention_dim=128, 77 | 78 | # Location Layer parameters 79 | attention_location_n_filters=32, 80 | attention_location_kernel_size=31, 81 | 82 | # Mel-post processing network parameters 83 | postnet_embedding_dim=512, 84 | postnet_kernel_size=5, 85 | postnet_n_convolutions=5, 86 | 87 | # Speaker embedding 88 | n_speakers=123, 89 | speaker_embedding_dim=128, 90 | 91 | # Reference encoder 92 | with_gst=True, 93 | ref_enc_filters=[32, 32, 64, 64, 128, 128], 94 | ref_enc_size=[3, 3], 95 | ref_enc_strides=[2, 2], 96 | ref_enc_pad=[1, 1], 97 | ref_enc_gru_size=128, 98 | 99 | # Style Token Layer 100 | token_embedding_size=256, 101 | token_num=10, 102 | num_heads=8, 103 | 104 | ################################ 105 | # Optimization Hyperparameters # 106 | ################################ 107 | use_saved_learning_rate=False, 108 | learning_rate=1e-3, 109 | learning_rate_min=1e-5, 110 | learning_rate_anneal=50000, 111 | weight_decay=1e-6, 112 | 
grad_clip_thresh=1.0, 113 | batch_size=32, 114 | mask_padding=True, # set model's padded outputs to padded values 115 | 116 | ) 117 | 118 | if hparams_string: 119 | tf.compat.v1.logging.info('Parsing command line hparams: %s', hparams_string) 120 | hparams.parse(hparams_string) 121 | 122 | if verbose: 123 | tf.compat.v1.logging.info('Final parsed hparams: %s', hparams.values()) 124 | 125 | return hparams 126 | -------------------------------------------------------------------------------- /layers.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from librosa.filters import mel as librosa_mel_fn 3 | from audio_processing import dynamic_range_compression, dynamic_range_decompression 4 | from stft import STFT 5 | 6 | 7 | class LinearNorm(torch.nn.Module): 8 | def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'): 9 | super(LinearNorm, self).__init__() 10 | self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias) 11 | 12 | torch.nn.init.xavier_uniform_( 13 | self.linear_layer.weight, 14 | gain=torch.nn.init.calculate_gain(w_init_gain)) 15 | 16 | def forward(self, x): 17 | return self.linear_layer(x) 18 | 19 | 20 | class ConvNorm(torch.nn.Module): 21 | def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, 22 | padding=None, dilation=1, bias=True, w_init_gain='linear'): 23 | super(ConvNorm, self).__init__() 24 | if padding is None: 25 | assert(kernel_size % 2 == 1) 26 | padding = int(dilation * (kernel_size - 1) / 2) 27 | self.conv = torch.nn.Conv1d(in_channels, out_channels, 28 | kernel_size=kernel_size, stride=stride, 29 | padding=padding, dilation=dilation, 30 | bias=bias) 31 | torch.nn.init.xavier_uniform_( 32 | self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain)) 33 | 34 | def forward(self, signal): 35 | conv_signal = self.conv(signal) 36 | return conv_signal 37 | 38 | 39 | class ConvNorm2D(torch.nn.Module): 40 | def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, 41 | padding=None, dilation=1, bias=True, w_init_gain='linear'): 42 | super(ConvNorm2D, self).__init__() 43 | self.conv = torch.nn.Conv2d(in_channels=in_channels, out_channels=out_channels, 44 | kernel_size=kernel_size, stride=stride, 45 | padding=padding, dilation=dilation, 46 | groups=1, bias=bias) 47 | torch.nn.init.xavier_uniform_( 48 | self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain)) 49 | 50 | def forward(self, signal): 51 | conv_signal = self.conv(signal) 52 | return conv_signal 53 | 54 | 55 | class TacotronSTFT(torch.nn.Module): 56 | def __init__(self, filter_length=1024, hop_length=256, win_length=1024, 57 | n_mel_channels=80, sampling_rate=22050, mel_fmin=0.0, 58 | mel_fmax=8000.0): 59 | super(TacotronSTFT, self).__init__() 60 | self.n_mel_channels = n_mel_channels 61 | self.sampling_rate = sampling_rate 62 | self.stft_fn = STFT(filter_length, hop_length, win_length) 63 | mel_basis = librosa_mel_fn( 64 | sampling_rate, filter_length, n_mel_channels, mel_fmin, mel_fmax) 65 | mel_basis = torch.from_numpy(mel_basis).float() 66 | self.register_buffer('mel_basis', mel_basis) 67 | 68 | def spectral_normalize(self, magnitudes): 69 | output = dynamic_range_compression(magnitudes) 70 | return output 71 | 72 | def spectral_de_normalize(self, magnitudes): 73 | output = dynamic_range_decompression(magnitudes) 74 | return output 75 | 76 | def mel_spectrogram(self, y, ref_level_db = 20, magnitude_power=1.5): 77 | """Computes mel-spectrograms from a batch of waves 78 | PARAMS 79 | ------ 80 | 
y: Variable(torch.FloatTensor) with shape (B, T) in range [-1, 1] 81 | 82 | RETURNS 83 | ------- 84 | mel_output: torch.FloatTensor of shape (B, n_mel_channels, T) 85 | """ 86 | assert(torch.min(y.data) >= -1) 87 | assert(torch.max(y.data) <= 1) 88 | 89 | magnitudes, phases = self.stft_fn.transform(y) 90 | magnitudes = magnitudes.data 91 | mel_output = torch.matmul(self.mel_basis, magnitudes) 92 | mel_output = self.spectral_normalize(mel_output) 93 | return mel_output 94 | -------------------------------------------------------------------------------- /logger.py: -------------------------------------------------------------------------------- 1 | import random 2 | import torch 3 | from tensorboardX import SummaryWriter 4 | from plotting_utils import plot_alignment_to_numpy, plot_spectrogram_to_numpy 5 | from plotting_utils import plot_gate_outputs_to_numpy 6 | 7 | 8 | class Tacotron2Logger(SummaryWriter): 9 | def __init__(self, logdir): 10 | super(Tacotron2Logger, self).__init__(logdir) 11 | 12 | def log_training(self, reduced_loss, grad_norm, learning_rate, duration, 13 | iteration): 14 | self.add_scalar("training.loss", reduced_loss, iteration) 15 | self.add_scalar("grad.norm", grad_norm, iteration) 16 | self.add_scalar("learning.rate", learning_rate, iteration) 17 | self.add_scalar("duration", duration, iteration) 18 | 19 | def log_validation(self, reduced_loss, model, y, y_pred, iteration): 20 | self.add_scalar("validation.loss", reduced_loss, iteration) 21 | _, mel_outputs, gate_outputs, alignments = y_pred 22 | mel_targets, gate_targets = y 23 | 24 | # plot distribution of parameters 25 | for tag, value in model.named_parameters(): 26 | tag = tag.replace('.', '/') 27 | self.add_histogram(tag, value.data.cpu().numpy(), iteration) 28 | 29 | # plot alignment, mel target and predicted, gate target and predicted 30 | idx = random.randint(0, alignments.size(0) - 1) 31 | self.add_image( 32 | "alignment", 33 | plot_alignment_to_numpy(alignments[idx].data.cpu().numpy().T), 34 | iteration, dataformats='HWC') 35 | self.add_image( 36 | "mel_target", 37 | plot_spectrogram_to_numpy(mel_targets[idx].data.cpu().numpy()), 38 | iteration, dataformats='HWC') 39 | self.add_image( 40 | "mel_predicted", 41 | plot_spectrogram_to_numpy(mel_outputs[idx].data.cpu().numpy()), 42 | iteration, dataformats='HWC') 43 | self.add_image( 44 | "gate", 45 | plot_gate_outputs_to_numpy( 46 | gate_targets[idx].data.cpu().numpy(), 47 | torch.sigmoid(gate_outputs[idx]).data.cpu().numpy()), 48 | iteration, dataformats='HWC') 49 | -------------------------------------------------------------------------------- /loss_function.py: -------------------------------------------------------------------------------- 1 | from torch import nn 2 | 3 | 4 | class Tacotron2Loss(nn.Module): 5 | def __init__(self): 6 | super(Tacotron2Loss, self).__init__() 7 | 8 | def forward(self, model_output, targets): 9 | mel_target, gate_target = targets[0], targets[1] 10 | mel_target.requires_grad = False 11 | gate_target.requires_grad = False 12 | gate_target = gate_target.view(-1, 1) 13 | 14 | mel_out, mel_out_postnet, gate_out, _ = model_output 15 | gate_out = gate_out.view(-1, 1) 16 | mel_loss = nn.MSELoss()(mel_out, mel_target) + \ 17 | nn.MSELoss()(mel_out_postnet, mel_target) 18 | gate_loss = nn.BCEWithLogitsLoss()(gate_out, gate_target) 19 | return mel_loss + gate_loss 20 | -------------------------------------------------------------------------------- /loss_scaler.py: 
-------------------------------------------------------------------------------- 1 | import torch 2 | 3 | class LossScaler: 4 | 5 | def __init__(self, scale=1): 6 | self.cur_scale = scale 7 | 8 | # `params` is a list / generator of torch.Variable 9 | def has_overflow(self, params): 10 | return False 11 | 12 | # `x` is a torch.Tensor 13 | def _has_inf_or_nan(x): 14 | return False 15 | 16 | # `overflow` is boolean indicating whether we overflowed in gradient 17 | def update_scale(self, overflow): 18 | pass 19 | 20 | @property 21 | def loss_scale(self): 22 | return self.cur_scale 23 | 24 | def scale_gradient(self, module, grad_in, grad_out): 25 | return tuple(self.loss_scale * g for g in grad_in) 26 | 27 | def backward(self, loss): 28 | scaled_loss = loss*self.loss_scale 29 | scaled_loss.backward() 30 | 31 | class DynamicLossScaler: 32 | 33 | def __init__(self, 34 | init_scale=2**32, 35 | scale_factor=2., 36 | scale_window=1000): 37 | self.cur_scale = init_scale 38 | self.cur_iter = 0 39 | self.last_overflow_iter = -1 40 | self.scale_factor = scale_factor 41 | self.scale_window = scale_window 42 | 43 | # `params` is a list / generator of torch.Variable 44 | def has_overflow(self, params): 45 | # return False 46 | for p in params: 47 | if p.grad is not None and DynamicLossScaler._has_inf_or_nan(p.grad.data): 48 | return True 49 | 50 | return False 51 | 52 | # `x` is a torch.Tensor 53 | def _has_inf_or_nan(x): 54 | cpu_sum = float(x.float().sum()) 55 | if cpu_sum == float('inf') or cpu_sum == -float('inf') or cpu_sum != cpu_sum: 56 | return True 57 | return False 58 | 59 | # `overflow` is boolean indicating whether we overflowed in gradient 60 | def update_scale(self, overflow): 61 | if overflow: 62 | #self.cur_scale /= self.scale_factor 63 | self.cur_scale = max(self.cur_scale/self.scale_factor, 1) 64 | self.last_overflow_iter = self.cur_iter 65 | else: 66 | if (self.cur_iter - self.last_overflow_iter) % self.scale_window == 0: 67 | self.cur_scale *= self.scale_factor 68 | # self.cur_scale = 1 69 | self.cur_iter += 1 70 | 71 | @property 72 | def loss_scale(self): 73 | return self.cur_scale 74 | 75 | def scale_gradient(self, module, grad_in, grad_out): 76 | return tuple(self.loss_scale * g for g in grad_in) 77 | 78 | def backward(self, loss): 79 | scaled_loss = loss*self.loss_scale 80 | scaled_loss.backward() 81 | 82 | ############################################################## 83 | # Example usage below here -- assuming it's in a separate file 84 | ############################################################## 85 | if __name__ == "__main__": 86 | import torch 87 | from torch.autograd import Variable 88 | from dynamic_loss_scaler import DynamicLossScaler 89 | 90 | # N is batch size; D_in is input dimension; 91 | # H is hidden dimension; D_out is output dimension. 92 | N, D_in, H, D_out = 64, 1000, 100, 10 93 | 94 | # Create random Tensors to hold inputs and outputs, and wrap them in Variables. 
95 | x = Variable(torch.randn(N, D_in), requires_grad=False) 96 | y = Variable(torch.randn(N, D_out), requires_grad=False) 97 | 98 | w1 = Variable(torch.randn(D_in, H), requires_grad=True) 99 | w2 = Variable(torch.randn(H, D_out), requires_grad=True) 100 | parameters = [w1, w2] 101 | 102 | learning_rate = 1e-6 103 | optimizer = torch.optim.SGD(parameters, lr=learning_rate) 104 | loss_scaler = DynamicLossScaler() 105 | 106 | for t in range(500): 107 | y_pred = x.mm(w1).clamp(min=0).mm(w2) 108 | loss = (y_pred - y).pow(2).sum() * loss_scaler.loss_scale 109 | print('Iter {} loss scale: {}'.format(t, loss_scaler.loss_scale)) 110 | print('Iter {} scaled loss: {}'.format(t, loss.data[0])) 111 | print('Iter {} unscaled loss: {}'.format(t, loss.data[0] / loss_scaler.loss_scale)) 112 | 113 | # Run backprop 114 | optimizer.zero_grad() 115 | loss.backward() 116 | 117 | # Check for overflow 118 | has_overflow = DynamicLossScaler.has_overflow(parameters) 119 | 120 | # If no overflow, unscale grad and update as usual 121 | if not has_overflow: 122 | for param in parameters: 123 | param.grad.data.mul_(1. / loss_scaler.loss_scale) 124 | optimizer.step() 125 | # Otherwise, don't do anything -- ie, skip iteration 126 | else: 127 | print('OVERFLOW!') 128 | 129 | # Update loss scale for next iteration 130 | loss_scaler.update_scale(has_overflow) 131 | 132 | -------------------------------------------------------------------------------- /mellotron_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVIDIA/mellotron/d5362ccae23984f323e3cb024a01ec1de0493aff/mellotron_logo.png -------------------------------------------------------------------------------- /mellotron_utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import numpy as np 3 | import music21 as m21 4 | import torch 5 | import torch.nn.functional as F 6 | from text import text_to_sequence, get_arpabet, cmudict 7 | 8 | 9 | CMUDICT_PATH = "data/cmu_dictionary" 10 | CMUDICT = cmudict.CMUDict(CMUDICT_PATH) 11 | PHONEME2GRAPHEME = { 12 | 'AA': ['a', 'o', 'ah'], 13 | 'AE': ['a', 'e'], 14 | 'AH': ['u', 'e', 'a', 'h', 'o'], 15 | 'AO': ['o', 'u', 'au'], 16 | 'AW': ['ou', 'ow'], 17 | 'AX': ['a'], 18 | 'AXR': ['er'], 19 | 'AY': ['i'], 20 | 'EH': ['e', 'ae'], 21 | 'EY': ['a', 'ai', 'ei', 'e', 'y'], 22 | 'IH': ['i', 'e', 'y'], 23 | 'IX': ['e', 'i'], 24 | 'IY': ['ea', 'ey', 'y', 'i'], 25 | 'OW': ['oa', 'o'], 26 | 'OY': ['oy'], 27 | 'UH': ['oo'], 28 | 'UW': ['oo', 'u', 'o'], 29 | 'UX': ['u'], 30 | 'B': ['b'], 31 | 'CH': ['ch', 'tch'], 32 | 'D': ['d', 'e', 'de'], 33 | 'DH': ['th'], 34 | 'DX': ['tt'], 35 | 'EL': ['le'], 36 | 'EM': ['m'], 37 | 'EN': ['on'], 38 | 'ER': ['i', 'er'], 39 | 'F': ['f'], 40 | 'G': ['g'], 41 | 'HH': ['h'], 42 | 'JH': ['j'], 43 | 'K': ['k', 'c', 'ch'], 44 | 'KS': ['x'], 45 | 'L': ['ll', 'l'], 46 | 'M': ['m'], 47 | 'N': ['n', 'gn'], 48 | 'NG': ['ng'], 49 | 'NX': ['nn'], 50 | 'P': ['p'], 51 | 'Q': ['-'], 52 | 'R': ['wr', 'r'], 53 | 'S': ['s', 'ce'], 54 | 'SH': ['sh'], 55 | 'T': ['t'], 56 | 'TH': ['th'], 57 | 'V': ['v', 'f', 'e'], 58 | 'W': ['w'], 59 | 'WH': ['wh'], 60 | 'Y': ['y', 'j'], 61 | 'Z': ['z', 's'], 62 | 'ZH': ['s'] 63 | } 64 | 65 | ######################## 66 | # CONSONANT DURATION # 67 | ######################## 68 | PHONEMEDURATION = { 69 | 'B': 0.05, 70 | 'CH': 0.1, 71 | 'D': 0.075, 72 | 'DH': 0.05, 73 | 'DX': 0.05, 74 | 'EL': 0.05, 75 | 'EM': 0.05, 76 | 'EN': 0.05, 77 | 'F': 0.1, 78 | 'G': 0.05, 79 | 
'HH': 0.05, 80 | 'JH': 0.05, 81 | 'K': 0.05, 82 | 'L': 0.05, 83 | 'M': 0.15, 84 | 'N': 0.15, 85 | 'NG': 0.15, 86 | 'NX': 0.05, 87 | 'P': 0.05, 88 | 'Q': 0.075, 89 | 'R': 0.05, 90 | 'S': 0.1, 91 | 'SH': 0.05, 92 | 'T': 0.075, 93 | 'TH': 0.1, 94 | 'V': 0.05, 95 | 'Y': 0.05, 96 | 'W': 0.05, 97 | 'WH': 0.05, 98 | 'Z': 0.05, 99 | 'ZH': 0.05 100 | } 101 | 102 | 103 | def add_space_between_events(events, connect=False): 104 | new_events = [] 105 | for i in range(1, len(events)): 106 | token_a, freq_a, start_time_a, end_time_a = events[i-1][-1] 107 | token_b, freq_b, start_time_b, end_time_b = events[i][0] 108 | 109 | if token_a in (' ', '') and len(events[i-1]) == 1: 110 | new_events.append(events[i-1]) 111 | elif token_a not in (' ', '') and token_b not in (' ', ''): 112 | new_events.append(events[i-1]) 113 | if connect: 114 | new_events.append([[' ', 0, end_time_a, start_time_b]]) 115 | else: 116 | new_events.append([[' ', 0, end_time_a, end_time_a]]) 117 | else: 118 | new_events.append(events[i-1]) 119 | 120 | if new_events[-1][0][0] != ' ': 121 | new_events.append([[' ', 0, end_time_a, end_time_a]]) 122 | new_events.append(events[-1]) 123 | 124 | return new_events 125 | 126 | 127 | def adjust_words(events): 128 | new_events = [] 129 | for event in events: 130 | if len(event) == 1 and event[0][0] == ' ': 131 | new_events.append(event) 132 | else: 133 | for e in event: 134 | if e[0][0].isupper(): 135 | new_events.append([e]) 136 | else: 137 | new_events[-1].extend([e]) 138 | return new_events 139 | 140 | 141 | def adjust_extensions(events, phoneme_durations): 142 | if len(events) == 1: 143 | return events 144 | 145 | idx_last_vowel = None 146 | n_consonants_after_last_vowel = 0 147 | target_ids = np.arange(len(events)) 148 | for i in range(len(events)): 149 | token = re.sub('[0-9{}]', '', events[i][0]) 150 | if idx_last_vowel is None and token not in phoneme_durations: 151 | idx_last_vowel = i 152 | n_consonants_after_last_vowel = 0 153 | else: 154 | if token == '_' and not n_consonants_after_last_vowel: 155 | events[i][0] = events[idx_last_vowel][0] 156 | elif token == '_' and n_consonants_after_last_vowel: 157 | events[i][0] = events[idx_last_vowel][0] 158 | start = idx_last_vowel + 1 159 | target_ids[start:start+n_consonants_after_last_vowel] += 1 160 | target_ids[i] -= n_consonants_after_last_vowel 161 | elif token in phoneme_durations: 162 | n_consonants_after_last_vowel += 1 163 | else: 164 | n_consonants_after_last_vowel = 0 165 | idx_last_vowel = i 166 | 167 | new_events = [0] * len(events) 168 | for i in range(len(events)): 169 | new_events[target_ids[i]] = events[i] 170 | 171 | # adjust time of consonants that were repositioned 172 | for i in range(1, len(new_events)): 173 | if new_events[i][2] < new_events[i-1][2]: 174 | new_events[i][2] = new_events[i-1][2] 175 | new_events[i][3] = new_events[i-1][3] 176 | 177 | return new_events 178 | 179 | 180 | def adjust_consonant_lengths(events, phoneme_durations): 181 | t_init = events[0][2] 182 | 183 | idx_last_vowel = None 184 | for i in range(len(events)): 185 | task = re.sub('[0-9{}]', '', events[i][0]) 186 | if task in phoneme_durations: 187 | duration = phoneme_durations[task] 188 | if idx_last_vowel is None: # consonant comes before any vowel 189 | events[i][2] = t_init 190 | events[i][3] = t_init + duration 191 | else: # consonant comes after a vowel, must offset 192 | events[idx_last_vowel][3] -= duration 193 | for k in range(idx_last_vowel+1, i): 194 | events[k][2] -= duration 195 | events[k][3] -= duration 196 | events[i][2] = 
events[i-1][3] 197 | events[i][3] = events[i-1][3] + duration 198 | else: 199 | events[i][2] = t_init 200 | events[i][3] = events[i][3] 201 | t_init = events[i][3] 202 | idx_last_vowel = i 203 | t_init = events[i][3] 204 | 205 | return events 206 | 207 | 208 | def adjust_consonants(events, phoneme_durations): 209 | if len(events) == 1: 210 | return events 211 | 212 | start = 0 213 | split_ids = [] 214 | t_init = events[0][2] 215 | 216 | # get each substring group 217 | for i in range(1, len(events)): 218 | if events[i][2] != t_init: 219 | split_ids.append((start, i)) 220 | start = i 221 | t_init = events[i][2] 222 | split_ids.append((start, len(events))) 223 | 224 | for (start, end) in split_ids: 225 | events[start:end] = adjust_consonant_lengths( 226 | events[start:end], phoneme_durations) 227 | 228 | return events 229 | 230 | 231 | def adjust_event(event, hop_length=256, sampling_rate=22050): 232 | tokens, freq, start_time, end_time = event 233 | 234 | if tokens == ' ': 235 | return [event] if freq == 0 else [['_', freq, start_time, end_time]] 236 | 237 | return [[token, freq, start_time, end_time] for token in tokens] 238 | 239 | 240 | def musicxml2score(filepath, bpm=60): 241 | track = {} 242 | beat_length_seconds = 60/bpm 243 | data = m21.converter.parse(filepath) 244 | for i in range(len(data.parts)): 245 | part = data.parts[i].flat 246 | events = [] 247 | for k in range(len(part.notesAndRests)): 248 | event = part.notesAndRests[k] 249 | if isinstance(event, m21.note.Note): 250 | freq = event.pitch.frequency 251 | token = event.lyrics[0].text if len(event.lyrics) > 0 else ' ' 252 | start_time = event.offset * beat_length_seconds 253 | end_time = start_time + event.duration.quarterLength * beat_length_seconds 254 | event = [token, freq, start_time, end_time] 255 | elif isinstance(event, m21.note.Rest): 256 | freq = 0 257 | token = ' ' 258 | start_time = event.offset * beat_length_seconds 259 | end_time = start_time + event.duration.quarterLength * beat_length_seconds 260 | event = [token, freq, start_time, end_time] 261 | 262 | if token == '_': 263 | raise Exception("Unexpected token {}".format(token)) 264 | 265 | if len(events) == 0: 266 | events.append(event) 267 | else: 268 | if token == ' ': 269 | if freq == 0: 270 | if events[-1][1] == 0: 271 | events[-1][3] = end_time 272 | else: 273 | events.append(event) 274 | elif freq == events[-1][1]: # is event duration extension ? 
275 | events[-1][-1] = end_time 276 | else: # must be different note on same syllable 277 | events.append(event) 278 | else: 279 | events.append(event) 280 | track[part.partName] = events 281 | return track 282 | 283 | 284 | def track2events(track): 285 | events = [] 286 | for e in track: 287 | events.extend(adjust_event(e)) 288 | group_ids = [i for i in range(len(events)) 289 | if events[i][0] in [' '] or events[i][0].isupper()] 290 | 291 | events_grouped = [] 292 | for i in range(1, len(group_ids)): 293 | start, end = group_ids[i-1], group_ids[i] 294 | events_grouped.append(events[start:end]) 295 | 296 | if events[-1][0] != ' ': 297 | events_grouped.append(events[group_ids[-1]:]) 298 | 299 | return events_grouped 300 | 301 | 302 | def events2eventsarpabet(event): 303 | if event[0][0] == ' ': 304 | return event 305 | 306 | # get word and word arpabet 307 | word = ''.join([e[0] for e in event if e[0] not in('_', ' ')]) 308 | word_arpabet = get_arpabet(word, CMUDICT) 309 | if word_arpabet[0] != '{': 310 | return event 311 | 312 | word_arpabet = word_arpabet.split() 313 | 314 | # align tokens to arpabet 315 | i, k = 0, 0 316 | new_events = [] 317 | while i < len(event) and k < len(word_arpabet): 318 | # single token 319 | token_a, freq_a, start_time_a, end_time_a = event[i] 320 | 321 | if token_a == ' ': 322 | new_events.append([token_a, freq_a, start_time_a, end_time_a]) 323 | i += 1 324 | continue 325 | 326 | if token_a == '_': 327 | new_events.append([token_a, freq_a, start_time_a, end_time_a]) 328 | i += 1 329 | continue 330 | 331 | # two tokens 332 | if i < len(event) - 1: 333 | j = i + 1 334 | token_b, freq_b, start_time_b, end_time_b = event[j] 335 | between_events = [] 336 | while j < len(event) and event[j][0] == '_': 337 | between_events.append([token_b, freq_b, start_time_b, end_time_b]) 338 | j += 1 339 | if j < len(event): 340 | token_b, freq_b, start_time_b, end_time_b = event[j] 341 | 342 | token_compound_2 = (token_a + token_b).lower() 343 | 344 | # single arpabet 345 | arpabet = re.sub('[0-9{}]', '', word_arpabet[k]) 346 | 347 | if k < len(word_arpabet) - 1: 348 | arpabet_compound_2 = ''.join(word_arpabet[k:k+2]) 349 | arpabet_compound_2 = re.sub('[0-9{}]', '', arpabet_compound_2) 350 | 351 | if i < len(event) - 1 and token_compound_2 in PHONEME2GRAPHEME[arpabet]: 352 | new_events.append([word_arpabet[k], freq_a, start_time_a, end_time_a]) 353 | if len(between_events): 354 | new_events.extend(between_events) 355 | if start_time_a != start_time_b: 356 | new_events.append([word_arpabet[k], freq_b, start_time_b, end_time_b]) 357 | i += 2 + len(between_events) 358 | k += 1 359 | elif token_a.lower() in PHONEME2GRAPHEME[arpabet]: 360 | new_events.append([word_arpabet[k], freq_a, start_time_a, end_time_a]) 361 | i += 1 362 | k += 1 363 | elif arpabet_compound_2 in PHONEME2GRAPHEME and token_a.lower() in PHONEME2GRAPHEME[arpabet_compound_2]: 364 | new_events.append([word_arpabet[k], freq_a, start_time_a, end_time_a]) 365 | new_events.append([word_arpabet[k+1], freq_a, start_time_a, end_time_a]) 366 | i += 1 367 | k += 2 368 | else: 369 | k += 1 370 | 371 | # add extensions and pauses at end of words 372 | while i < len(event): 373 | token_a, freq_a, start_time_a, end_time_a = event[i] 374 | 375 | if token_a in (' ', '_'): 376 | new_events.append([token_a, freq_a, start_time_a, end_time_a]) 377 | i += 1 378 | 379 | return new_events 380 | 381 | 382 | def event2alignment(events, hop_length=256, sampling_rate=22050): 383 | frame_length = float(hop_length) / float(sampling_rate) 384 | 
385 | n_frames = int(events[-1][-1][-1] / frame_length) 386 | n_tokens = np.sum([len(e) for e in events]) 387 | alignment = np.zeros((n_tokens, n_frames)) 388 | 389 | cur_event = -1 390 | for event in events: 391 | for i in range(len(event)): 392 | if len(event) == 1 or cur_event == -1 or event[i][0] != event[i-1][0]: 393 | cur_event += 1 394 | token, freq, start_time, end_time = event[i] 395 | alignment[cur_event, int(start_time/frame_length):int(end_time/frame_length)] = 1 396 | 397 | return alignment[:cur_event+1] 398 | 399 | 400 | def event2f0(events, hop_length=256, sampling_rate=22050): 401 | frame_length = float(hop_length) / float(sampling_rate) 402 | n_frames = int(events[-1][-1][-1] / frame_length) 403 | f0s = np.zeros((1, n_frames)) 404 | 405 | for event in events: 406 | for i in range(len(event)): 407 | token, freq, start_time, end_time = event[i] 408 | f0s[0, int(start_time/frame_length):int(end_time/frame_length)] = freq 409 | 410 | return f0s 411 | 412 | 413 | def event2text(events, convert_stress, cmudict=None): 414 | text_clean = '' 415 | for event in events: 416 | for i in range(len(event)): 417 | if i > 0 and event[i][0] == event[i-1][0]: 418 | continue 419 | if event[i][0] == ' ' and len(event) > 1: 420 | if text_clean[-1] != "}": 421 | text_clean = text_clean[:-1] + '} {' 422 | else: 423 | text_clean += ' {' 424 | else: 425 | if event[i][0][-1] in ('}', ' '): 426 | text_clean += event[i][0] 427 | else: 428 | text_clean += event[i][0] + ' ' 429 | 430 | if convert_stress: 431 | text_clean = re.sub('[0-9]', '1', text_clean) 432 | 433 | text_encoded = text_to_sequence(text_clean, [], cmudict) 434 | return text_encoded, text_clean 435 | 436 | 437 | def remove_excess_frames(alignment, f0s): 438 | excess_frames = np.sum(alignment.sum(0) == 0) 439 | alignment = alignment[:, :-excess_frames] if excess_frames > 0 else alignment 440 | f0s = f0s[:, :-excess_frames] if excess_frames > 0 else f0s 441 | return alignment, f0s 442 | 443 | 444 | def get_data_from_musicxml(filepath, bpm, phoneme_durations=None, 445 | convert_stress=False): 446 | if phoneme_durations is None: 447 | phoneme_durations = PHONEMEDURATION 448 | score = musicxml2score(filepath, bpm) 449 | data = {} 450 | for k, v in score.items(): 451 | # ignore empty tracks 452 | if len(v) == 1 and v[0][0] == ' ': 453 | continue 454 | 455 | events = track2events(v) 456 | events = adjust_words(events) 457 | events_arpabet = [events2eventsarpabet(e) for e in events] 458 | 459 | # make adjustments 460 | events_arpabet = [adjust_extensions(e, phoneme_durations) 461 | for e in events_arpabet] 462 | events_arpabet = [adjust_consonants(e, phoneme_durations) 463 | for e in events_arpabet] 464 | events_arpabet = add_space_between_events(events_arpabet) 465 | 466 | # convert data to alignment, f0 and text encoded 467 | alignment = event2alignment(events_arpabet) 468 | f0s = event2f0(events_arpabet) 469 | alignment, f0s = remove_excess_frames(alignment, f0s) 470 | text_encoded, text_clean = event2text(events_arpabet, convert_stress) 471 | 472 | # convert data to torch 473 | alignment = torch.from_numpy(alignment).permute(1, 0)[:, None].float() 474 | f0s = torch.from_numpy(f0s)[None].float() 475 | text_encoded = torch.LongTensor(text_encoded)[None] 476 | data[k] = {'rhythm': alignment, 477 | 'pitch_contour': f0s, 478 | 'text_encoded': text_encoded} 479 | 480 | return data 481 | 482 | 483 | if __name__ == "__main__": 484 | import argparse 485 | # Get defaults so it can work with no Sacred 486 | parser = argparse.ArgumentParser() 487 | 
parser.add_argument('-f', "--filepath", required=True) 488 | args = parser.parse_args() 489 | get_data_from_musicxml(args.filepath, 60) 490 | -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | from math import sqrt 2 | import numpy as np 3 | from numpy import finfo 4 | import torch 5 | from torch.autograd import Variable 6 | from torch import nn 7 | from torch.nn import functional as F 8 | from layers import ConvNorm, LinearNorm 9 | from utils import to_gpu, get_mask_from_lengths 10 | from modules import GST 11 | 12 | drop_rate = 0.5 13 | 14 | def load_model(hparams): 15 | model = Tacotron2(hparams).cuda() 16 | if hparams.fp16_run: 17 | model.decoder.attention_layer.score_mask_value = finfo('float16').min 18 | 19 | return model 20 | 21 | 22 | class LocationLayer(nn.Module): 23 | def __init__(self, attention_n_filters, attention_kernel_size, 24 | attention_dim): 25 | super(LocationLayer, self).__init__() 26 | padding = int((attention_kernel_size - 1) / 2) 27 | self.location_conv = ConvNorm(2, attention_n_filters, 28 | kernel_size=attention_kernel_size, 29 | padding=padding, bias=False, stride=1, 30 | dilation=1) 31 | self.location_dense = LinearNorm(attention_n_filters, attention_dim, 32 | bias=False, w_init_gain='tanh') 33 | 34 | def forward(self, attention_weights_cat): 35 | processed_attention = self.location_conv(attention_weights_cat) 36 | processed_attention = processed_attention.transpose(1, 2) 37 | processed_attention = self.location_dense(processed_attention) 38 | return processed_attention 39 | 40 | 41 | class Attention(nn.Module): 42 | def __init__(self, attention_rnn_dim, embedding_dim, attention_dim, 43 | attention_location_n_filters, attention_location_kernel_size): 44 | super(Attention, self).__init__() 45 | self.query_layer = LinearNorm(attention_rnn_dim, attention_dim, 46 | bias=False, w_init_gain='tanh') 47 | self.memory_layer = LinearNorm(embedding_dim, attention_dim, bias=False, 48 | w_init_gain='tanh') 49 | self.v = LinearNorm(attention_dim, 1, bias=False) 50 | self.location_layer = LocationLayer(attention_location_n_filters, 51 | attention_location_kernel_size, 52 | attention_dim) 53 | self.score_mask_value = -float("inf") 54 | 55 | def get_alignment_energies(self, query, processed_memory, 56 | attention_weights_cat): 57 | """ 58 | PARAMS 59 | ------ 60 | query: decoder output (batch, n_mel_channels * n_frames_per_step) 61 | processed_memory: processed encoder outputs (B, T_in, attention_dim) 62 | attention_weights_cat: cumulative and prev. 
att weights (B, 2, max_time) 63 | 64 | RETURNS 65 | ------- 66 | alignment (batch, max_time) 67 | """ 68 | 69 | processed_query = self.query_layer(query.unsqueeze(1)) 70 | processed_attention_weights = self.location_layer(attention_weights_cat) 71 | energies = self.v(torch.tanh( 72 | processed_query + processed_attention_weights + processed_memory)) 73 | 74 | energies = energies.squeeze(-1) 75 | return energies 76 | 77 | def forward(self, attention_hidden_state, memory, processed_memory, 78 | attention_weights_cat, mask, attention_weights=None): 79 | """ 80 | PARAMS 81 | ------ 82 | attention_hidden_state: attention rnn last output 83 | memory: encoder outputs 84 | processed_memory: processed encoder outputs 85 | attention_weights_cat: previous and cummulative attention weights 86 | mask: binary mask for padded data 87 | """ 88 | if attention_weights is None: 89 | alignment = self.get_alignment_energies( 90 | attention_hidden_state, processed_memory, attention_weights_cat) 91 | 92 | if mask is not None: 93 | alignment.data.masked_fill_(mask, self.score_mask_value) 94 | 95 | attention_weights = F.softmax(alignment, dim=1) 96 | attention_context = torch.bmm(attention_weights.unsqueeze(1), memory) 97 | attention_context = attention_context.squeeze(1) 98 | 99 | return attention_context, attention_weights 100 | 101 | 102 | class Prenet(nn.Module): 103 | def __init__(self, in_dim, sizes): 104 | super(Prenet, self).__init__() 105 | in_sizes = [in_dim] + sizes[:-1] 106 | self.layers = nn.ModuleList( 107 | [LinearNorm(in_size, out_size, bias=False) 108 | for (in_size, out_size) in zip(in_sizes, sizes)]) 109 | 110 | def forward(self, x): 111 | for linear in self.layers: 112 | x = F.dropout(F.relu(linear(x)), p=drop_rate, training=True) 113 | return x 114 | 115 | 116 | class Postnet(nn.Module): 117 | """Postnet 118 | - Five 1-d convolution with 512 channels and kernel size 5 119 | """ 120 | 121 | def __init__(self, hparams): 122 | super(Postnet, self).__init__() 123 | self.convolutions = nn.ModuleList() 124 | 125 | self.convolutions.append( 126 | nn.Sequential( 127 | ConvNorm(hparams.n_mel_channels, hparams.postnet_embedding_dim, 128 | kernel_size=hparams.postnet_kernel_size, stride=1, 129 | padding=int((hparams.postnet_kernel_size - 1) / 2), 130 | dilation=1, w_init_gain='tanh'), 131 | nn.BatchNorm1d(hparams.postnet_embedding_dim)) 132 | ) 133 | 134 | for i in range(1, hparams.postnet_n_convolutions - 1): 135 | self.convolutions.append( 136 | nn.Sequential( 137 | ConvNorm(hparams.postnet_embedding_dim, 138 | hparams.postnet_embedding_dim, 139 | kernel_size=hparams.postnet_kernel_size, stride=1, 140 | padding=int((hparams.postnet_kernel_size - 1) / 2), 141 | dilation=1, w_init_gain='tanh'), 142 | nn.BatchNorm1d(hparams.postnet_embedding_dim)) 143 | ) 144 | 145 | self.convolutions.append( 146 | nn.Sequential( 147 | ConvNorm(hparams.postnet_embedding_dim, hparams.n_mel_channels, 148 | kernel_size=hparams.postnet_kernel_size, stride=1, 149 | padding=int((hparams.postnet_kernel_size - 1) / 2), 150 | dilation=1, w_init_gain='linear'), 151 | nn.BatchNorm1d(hparams.n_mel_channels)) 152 | ) 153 | 154 | def forward(self, x): 155 | for i in range(len(self.convolutions) - 1): 156 | x = F.dropout(torch.tanh(self.convolutions[i](x)), drop_rate, self.training) 157 | x = F.dropout(self.convolutions[-1](x), drop_rate, self.training) 158 | 159 | return x 160 | 161 | 162 | class Encoder(nn.Module): 163 | """Encoder module: 164 | - Three 1-d convolution banks 165 | - Bidirectional LSTM 166 | """ 167 | def 
__init__(self, hparams): 168 | super(Encoder, self).__init__() 169 | 170 | convolutions = [] 171 | for _ in range(hparams.encoder_n_convolutions): 172 | conv_layer = nn.Sequential( 173 | ConvNorm(hparams.encoder_embedding_dim, 174 | hparams.encoder_embedding_dim, 175 | kernel_size=hparams.encoder_kernel_size, stride=1, 176 | padding=int((hparams.encoder_kernel_size - 1) / 2), 177 | dilation=1, w_init_gain='relu'), 178 | nn.BatchNorm1d(hparams.encoder_embedding_dim)) 179 | convolutions.append(conv_layer) 180 | self.convolutions = nn.ModuleList(convolutions) 181 | 182 | self.lstm = nn.LSTM(hparams.encoder_embedding_dim, 183 | int(hparams.encoder_embedding_dim / 2), 1, 184 | batch_first=True, bidirectional=True) 185 | 186 | def forward(self, x, input_lengths): 187 | if x.size()[0] > 1: 188 | print("here") 189 | x_embedded = [] 190 | for b_ind in range(x.size()[0]): # TODO: Speed up 191 | curr_x = x[b_ind:b_ind+1, :, :input_lengths[b_ind]].clone() 192 | for conv in self.convolutions: 193 | curr_x = F.dropout(F.relu(conv(curr_x)), drop_rate, self.training) 194 | x_embedded.append(curr_x[0].transpose(0, 1)) 195 | x = torch.nn.utils.rnn.pad_sequence(x_embedded, batch_first=True) 196 | else: 197 | for conv in self.convolutions: 198 | x = F.dropout(F.relu(conv(x)), drop_rate, self.training) 199 | x = x.transpose(1, 2) 200 | 201 | # pytorch tensor are not reversible, hence the conversion 202 | input_lengths = input_lengths.cpu().numpy() 203 | x = nn.utils.rnn.pack_padded_sequence( 204 | x, input_lengths, batch_first=True) 205 | 206 | self.lstm.flatten_parameters() 207 | outputs, _ = self.lstm(x) 208 | 209 | outputs, _ = nn.utils.rnn.pad_packed_sequence( 210 | outputs, batch_first=True) 211 | 212 | return outputs 213 | 214 | def inference(self, x): 215 | for conv in self.convolutions: 216 | x = F.dropout(F.relu(conv(x)), drop_rate, self.training) 217 | 218 | x = x.transpose(1, 2) 219 | 220 | self.lstm.flatten_parameters() 221 | outputs, _ = self.lstm(x) 222 | 223 | return outputs 224 | 225 | 226 | class Decoder(nn.Module): 227 | def __init__(self, hparams): 228 | super(Decoder, self).__init__() 229 | self.n_mel_channels = hparams.n_mel_channels 230 | self.n_frames_per_step = hparams.n_frames_per_step 231 | self.encoder_embedding_dim = hparams.encoder_embedding_dim + hparams.token_embedding_size + hparams.speaker_embedding_dim 232 | self.attention_rnn_dim = hparams.attention_rnn_dim 233 | self.decoder_rnn_dim = hparams.decoder_rnn_dim 234 | self.prenet_dim = hparams.prenet_dim 235 | self.max_decoder_steps = hparams.max_decoder_steps 236 | self.gate_threshold = hparams.gate_threshold 237 | self.p_attention_dropout = hparams.p_attention_dropout 238 | self.p_decoder_dropout = hparams.p_decoder_dropout 239 | self.p_teacher_forcing = hparams.p_teacher_forcing 240 | 241 | self.prenet_f0 = ConvNorm( 242 | 1, hparams.prenet_f0_dim, 243 | kernel_size=hparams.prenet_f0_kernel_size, 244 | padding=max(0, int(hparams.prenet_f0_kernel_size/2)), 245 | bias=False, stride=1, dilation=1) 246 | 247 | self.prenet = Prenet( 248 | hparams.n_mel_channels * hparams.n_frames_per_step, 249 | [hparams.prenet_dim, hparams.prenet_dim]) 250 | 251 | self.attention_rnn = nn.LSTMCell( 252 | hparams.prenet_dim + hparams.prenet_f0_dim + self.encoder_embedding_dim, 253 | hparams.attention_rnn_dim) 254 | 255 | self.attention_layer = Attention( 256 | hparams.attention_rnn_dim, self.encoder_embedding_dim, 257 | hparams.attention_dim, hparams.attention_location_n_filters, 258 | hparams.attention_location_kernel_size) 259 | 260 | 
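        # torch.nn.LSTMCell's positional arguments are (input_size, hidden_size, bias);
        # it has no num_layers parameter, so the trailing 1 in the call below is taken as
        # the bias flag (truthy, i.e. bias weights are created), not as a layer count.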
self.decoder_rnn = nn.LSTMCell( 261 | hparams.attention_rnn_dim + self.encoder_embedding_dim, 262 | hparams.decoder_rnn_dim, 1) 263 | 264 | self.linear_projection = LinearNorm( 265 | hparams.decoder_rnn_dim + self.encoder_embedding_dim, 266 | hparams.n_mel_channels * hparams.n_frames_per_step) 267 | 268 | self.gate_layer = LinearNorm( 269 | hparams.decoder_rnn_dim + self.encoder_embedding_dim, 1, 270 | bias=True, w_init_gain='sigmoid') 271 | 272 | def get_go_frame(self, memory): 273 | """ Gets all zeros frames to use as first decoder input 274 | PARAMS 275 | ------ 276 | memory: decoder outputs 277 | 278 | RETURNS 279 | ------- 280 | decoder_input: all zeros frames 281 | """ 282 | B = memory.size(0) 283 | decoder_input = Variable(memory.data.new( 284 | B, self.n_mel_channels * self.n_frames_per_step).zero_()) 285 | return decoder_input 286 | 287 | def get_end_f0(self, f0s): 288 | B = f0s.size(0) 289 | dummy = Variable(f0s.data.new(B, 1, f0s.size(1)).zero_()) 290 | return dummy 291 | 292 | def initialize_decoder_states(self, memory, mask): 293 | """ Initializes attention rnn states, decoder rnn states, attention 294 | weights, attention cumulative weights, attention context, stores memory 295 | and stores processed memory 296 | PARAMS 297 | ------ 298 | memory: Encoder outputs 299 | mask: Mask for padded data if training, expects None for inference 300 | """ 301 | B = memory.size(0) 302 | MAX_TIME = memory.size(1) 303 | 304 | self.attention_hidden = Variable(memory.data.new( 305 | B, self.attention_rnn_dim).zero_()) 306 | self.attention_cell = Variable(memory.data.new( 307 | B, self.attention_rnn_dim).zero_()) 308 | 309 | self.decoder_hidden = Variable(memory.data.new( 310 | B, self.decoder_rnn_dim).zero_()) 311 | self.decoder_cell = Variable(memory.data.new( 312 | B, self.decoder_rnn_dim).zero_()) 313 | 314 | self.attention_weights = Variable(memory.data.new( 315 | B, MAX_TIME).zero_()) 316 | self.attention_weights_cum = Variable(memory.data.new( 317 | B, MAX_TIME).zero_()) 318 | self.attention_context = Variable(memory.data.new( 319 | B, self.encoder_embedding_dim).zero_()) 320 | 321 | self.memory = memory 322 | self.processed_memory = self.attention_layer.memory_layer(memory) 323 | self.mask = mask 324 | 325 | def parse_decoder_inputs(self, decoder_inputs): 326 | """ Prepares decoder inputs, i.e. mel outputs 327 | PARAMS 328 | ------ 329 | decoder_inputs: inputs used for teacher-forced training, i.e. 
mel-specs 330 | 331 | RETURNS 332 | ------- 333 | inputs: processed decoder inputs 334 | 335 | """ 336 | # (B, n_mel_channels, T_out) -> (B, T_out, n_mel_channels) 337 | decoder_inputs = decoder_inputs.transpose(1, 2) 338 | decoder_inputs = decoder_inputs.view( 339 | decoder_inputs.size(0), 340 | int(decoder_inputs.size(1)/self.n_frames_per_step), -1) 341 | # (B, T_out, n_mel_channels) -> (T_out, B, n_mel_channels) 342 | decoder_inputs = decoder_inputs.transpose(0, 1) 343 | return decoder_inputs 344 | 345 | def parse_decoder_outputs(self, mel_outputs, gate_outputs, alignments): 346 | """ Prepares decoder outputs for output 347 | PARAMS 348 | ------ 349 | mel_outputs: 350 | gate_outputs: gate output energies 351 | alignments: 352 | 353 | RETURNS 354 | ------- 355 | mel_outputs: 356 | gate_outpust: gate output energies 357 | alignments: 358 | """ 359 | # (T_out, B) -> (B, T_out) 360 | alignments = torch.stack(alignments).transpose(0, 1) 361 | # (T_out, B) -> (B, T_out) 362 | gate_outputs = torch.stack(gate_outputs) 363 | if len(gate_outputs.size()) > 1: 364 | gate_outputs = gate_outputs.transpose(0, 1) 365 | else: 366 | gate_outputs = gate_outputs[None] 367 | gate_outputs = gate_outputs.contiguous() 368 | # (T_out, B, n_mel_channels) -> (B, T_out, n_mel_channels) 369 | mel_outputs = torch.stack(mel_outputs).transpose(0, 1).contiguous() 370 | # decouple frames per step 371 | mel_outputs = mel_outputs.view( 372 | mel_outputs.size(0), -1, self.n_mel_channels) 373 | # (B, T_out, n_mel_channels) -> (B, n_mel_channels, T_out) 374 | mel_outputs = mel_outputs.transpose(1, 2) 375 | 376 | return mel_outputs, gate_outputs, alignments 377 | 378 | def decode(self, decoder_input, attention_weights=None): 379 | """ Decoder step using stored states, attention and memory 380 | PARAMS 381 | ------ 382 | decoder_input: previous mel output 383 | 384 | RETURNS 385 | ------- 386 | mel_output: 387 | gate_output: gate output energies 388 | attention_weights: 389 | """ 390 | cell_input = torch.cat((decoder_input, self.attention_context), -1) 391 | self.attention_hidden, self.attention_cell = self.attention_rnn( 392 | cell_input, (self.attention_hidden, self.attention_cell)) 393 | self.attention_hidden = F.dropout( 394 | self.attention_hidden, self.p_attention_dropout, self.training) 395 | self.attention_cell = F.dropout( 396 | self.attention_cell, self.p_attention_dropout, self.training) 397 | 398 | attention_weights_cat = torch.cat( 399 | (self.attention_weights.unsqueeze(1), 400 | self.attention_weights_cum.unsqueeze(1)), dim=1) 401 | self.attention_context, self.attention_weights = self.attention_layer( 402 | self.attention_hidden, self.memory, self.processed_memory, 403 | attention_weights_cat, self.mask, attention_weights) 404 | 405 | self.attention_weights_cum += self.attention_weights 406 | decoder_input = torch.cat( 407 | (self.attention_hidden, self.attention_context), -1) 408 | self.decoder_hidden, self.decoder_cell = self.decoder_rnn( 409 | decoder_input, (self.decoder_hidden, self.decoder_cell)) 410 | self.decoder_hidden = F.dropout( 411 | self.decoder_hidden, self.p_decoder_dropout, self.training) 412 | self.decoder_cell = F.dropout( 413 | self.decoder_cell, self.p_decoder_dropout, self.training) 414 | 415 | decoder_hidden_attention_context = torch.cat( 416 | (self.decoder_hidden, self.attention_context), dim=1) 417 | 418 | decoder_output = self.linear_projection( 419 | decoder_hidden_attention_context) 420 | 421 | gate_prediction = self.gate_layer(decoder_hidden_attention_context) 422 | return 
decoder_output, gate_prediction, self.attention_weights 423 | 424 | def forward(self, memory, decoder_inputs, memory_lengths, f0s): 425 | """ Decoder forward pass for training 426 | PARAMS 427 | ------ 428 | memory: Encoder outputs 429 | decoder_inputs: Decoder inputs for teacher forcing. i.e. mel-specs 430 | memory_lengths: Encoder output lengths for attention masking. 431 | 432 | RETURNS 433 | ------- 434 | mel_outputs: mel outputs from the decoder 435 | gate_outputs: gate outputs from the decoder 436 | alignments: sequence of attention weights from the decoder 437 | """ 438 | 439 | decoder_input = self.get_go_frame(memory).unsqueeze(0) 440 | decoder_inputs = self.parse_decoder_inputs(decoder_inputs) 441 | decoder_inputs = torch.cat((decoder_input, decoder_inputs), dim=0) 442 | decoder_inputs = self.prenet(decoder_inputs) 443 | 444 | # audio features 445 | f0_dummy = self.get_end_f0(f0s) 446 | f0s = torch.cat((f0s, f0_dummy), dim=2) 447 | f0s = F.relu(self.prenet_f0(f0s)) 448 | f0s = f0s.permute(2, 0, 1) 449 | 450 | self.initialize_decoder_states( 451 | memory, mask=~get_mask_from_lengths(memory_lengths)) 452 | 453 | mel_outputs, gate_outputs, alignments = [], [], [] 454 | while len(mel_outputs) < decoder_inputs.size(0) - 1: 455 | if len(mel_outputs) == 0 or np.random.uniform(0.0, 1.0) <= self.p_teacher_forcing: 456 | decoder_input = torch.cat((decoder_inputs[len(mel_outputs)], 457 | f0s[len(mel_outputs)]), dim=1) 458 | else: 459 | decoder_input = torch.cat((self.prenet(mel_outputs[-1]), 460 | f0s[len(mel_outputs)]), dim=1) 461 | mel_output, gate_output, attention_weights = self.decode( 462 | decoder_input) 463 | mel_outputs += [mel_output.squeeze(1)] 464 | gate_outputs += [gate_output.squeeze()] 465 | alignments += [attention_weights] 466 | 467 | mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs( 468 | mel_outputs, gate_outputs, alignments) 469 | 470 | return mel_outputs, gate_outputs, alignments 471 | 472 | def inference(self, memory, f0s): 473 | """ Decoder inference 474 | PARAMS 475 | ------ 476 | memory: Encoder outputs 477 | 478 | RETURNS 479 | ------- 480 | mel_outputs: mel outputs from the decoder 481 | gate_outputs: gate outputs from the decoder 482 | alignments: sequence of attention weights from the decoder 483 | """ 484 | decoder_input = self.get_go_frame(memory) 485 | 486 | self.initialize_decoder_states(memory, mask=None) 487 | f0_dummy = self.get_end_f0(f0s) 488 | f0s = torch.cat((f0s, f0_dummy), dim=2) 489 | f0s = F.relu(self.prenet_f0(f0s)) 490 | f0s = f0s.permute(2, 0, 1) 491 | 492 | mel_outputs, gate_outputs, alignments = [], [], [] 493 | while True: 494 | if len(mel_outputs) < len(f0s): 495 | f0 = f0s[len(mel_outputs)] 496 | else: 497 | f0 = f0s[-1] * 0 498 | 499 | decoder_input = torch.cat((self.prenet(decoder_input), f0), dim=1) 500 | mel_output, gate_output, alignment = self.decode(decoder_input) 501 | 502 | mel_outputs += [mel_output.squeeze(1)] 503 | gate_outputs += [gate_output] 504 | alignments += [alignment] 505 | 506 | if torch.sigmoid(gate_output.data) > self.gate_threshold: 507 | break 508 | elif len(mel_outputs) == self.max_decoder_steps: 509 | print("Warning! 
Reached max decoder steps") 510 | break 511 | 512 | decoder_input = mel_output 513 | 514 | mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs( 515 | mel_outputs, gate_outputs, alignments) 516 | 517 | return mel_outputs, gate_outputs, alignments 518 | 519 | def inference_noattention(self, memory, f0s, attention_map): 520 | """ Decoder inference 521 | PARAMS 522 | ------ 523 | memory: Encoder outputs 524 | 525 | RETURNS 526 | ------- 527 | mel_outputs: mel outputs from the decoder 528 | gate_outputs: gate outputs from the decoder 529 | alignments: sequence of attention weights from the decoder 530 | """ 531 | decoder_input = self.get_go_frame(memory) 532 | 533 | self.initialize_decoder_states(memory, mask=None) 534 | f0_dummy = self.get_end_f0(f0s) 535 | f0s = torch.cat((f0s, f0_dummy), dim=2) 536 | f0s = F.relu(self.prenet_f0(f0s)) 537 | f0s = f0s.permute(2, 0, 1) 538 | 539 | mel_outputs, gate_outputs, alignments = [], [], [] 540 | for i in range(len(attention_map)): 541 | f0 = f0s[i] 542 | attention = attention_map[i] 543 | decoder_input = torch.cat((self.prenet(decoder_input), f0), dim=1) 544 | mel_output, gate_output, alignment = self.decode(decoder_input, attention) 545 | 546 | mel_outputs += [mel_output.squeeze(1)] 547 | gate_outputs += [gate_output] 548 | alignments += [alignment] 549 | 550 | decoder_input = mel_output 551 | 552 | mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs( 553 | mel_outputs, gate_outputs, alignments) 554 | 555 | return mel_outputs, gate_outputs, alignments 556 | 557 | 558 | class Tacotron2(nn.Module): 559 | def __init__(self, hparams): 560 | super(Tacotron2, self).__init__() 561 | self.mask_padding = hparams.mask_padding 562 | self.fp16_run = hparams.fp16_run 563 | self.n_mel_channels = hparams.n_mel_channels 564 | self.n_frames_per_step = hparams.n_frames_per_step 565 | self.embedding = nn.Embedding( 566 | hparams.n_symbols, hparams.symbols_embedding_dim) 567 | std = sqrt(2.0 / (hparams.n_symbols + hparams.symbols_embedding_dim)) 568 | val = sqrt(3.0) * std # uniform bounds for std 569 | self.embedding.weight.data.uniform_(-val, val) 570 | self.encoder = Encoder(hparams) 571 | self.decoder = Decoder(hparams) 572 | self.postnet = Postnet(hparams) 573 | if hparams.with_gst: 574 | self.gst = GST(hparams) 575 | self.speaker_embedding = nn.Embedding( 576 | hparams.n_speakers, hparams.speaker_embedding_dim) 577 | 578 | def parse_batch(self, batch): 579 | text_padded, input_lengths, mel_padded, gate_padded, \ 580 | output_lengths, speaker_ids, f0_padded = batch 581 | text_padded = to_gpu(text_padded).long() 582 | input_lengths = to_gpu(input_lengths).long() 583 | max_len = torch.max(input_lengths.data).item() 584 | mel_padded = to_gpu(mel_padded).float() 585 | gate_padded = to_gpu(gate_padded).float() 586 | output_lengths = to_gpu(output_lengths).long() 587 | speaker_ids = to_gpu(speaker_ids.data).long() 588 | f0_padded = to_gpu(f0_padded).float() 589 | return ((text_padded, input_lengths, mel_padded, max_len, 590 | output_lengths, speaker_ids, f0_padded), 591 | (mel_padded, gate_padded)) 592 | 593 | def parse_output(self, outputs, output_lengths=None): 594 | if self.mask_padding and output_lengths is not None: 595 | mask = ~get_mask_from_lengths(output_lengths) 596 | mask = mask.expand(self.n_mel_channels, mask.size(0), mask.size(1)) 597 | mask = mask.permute(1, 0, 2) 598 | 599 | outputs[0].data.masked_fill_(mask, 0.0) 600 | outputs[1].data.masked_fill_(mask, 0.0) 601 | outputs[2].data.masked_fill_(mask[:, 0, :], 1e3) # gate 
energies 602 | 603 | return outputs 604 | 605 | def forward(self, inputs): 606 | inputs, input_lengths, targets, max_len, \ 607 | output_lengths, speaker_ids, f0s = inputs 608 | input_lengths, output_lengths = input_lengths.data, output_lengths.data 609 | 610 | embedded_inputs = self.embedding(inputs).transpose(1, 2) 611 | embedded_text = self.encoder(embedded_inputs, input_lengths) 612 | embedded_speakers = self.speaker_embedding(speaker_ids)[:, None] 613 | embedded_gst = self.gst(targets, output_lengths) 614 | embedded_gst = embedded_gst.repeat(1, embedded_text.size(1), 1) 615 | embedded_speakers = embedded_speakers.repeat(1, embedded_text.size(1), 1) 616 | 617 | encoder_outputs = torch.cat( 618 | (embedded_text, embedded_gst, embedded_speakers), dim=2) 619 | 620 | mel_outputs, gate_outputs, alignments = self.decoder( 621 | encoder_outputs, targets, memory_lengths=input_lengths, f0s=f0s) 622 | 623 | mel_outputs_postnet = self.postnet(mel_outputs) 624 | mel_outputs_postnet = mel_outputs + mel_outputs_postnet 625 | 626 | return self.parse_output( 627 | [mel_outputs, mel_outputs_postnet, gate_outputs, alignments], 628 | output_lengths) 629 | 630 | def inference(self, inputs): 631 | text, style_input, speaker_ids, f0s = inputs 632 | embedded_inputs = self.embedding(text).transpose(1, 2) 633 | embedded_text = self.encoder.inference(embedded_inputs) 634 | embedded_speakers = self.speaker_embedding(speaker_ids)[:, None] 635 | if hasattr(self, 'gst'): 636 | if isinstance(style_input, int): 637 | query = torch.zeros(1, 1, self.gst.encoder.ref_enc_gru_size).cuda() 638 | GST = torch.tanh(self.gst.stl.embed) 639 | key = GST[style_input].unsqueeze(0).expand(1, -1, -1) 640 | embedded_gst = self.gst.stl.attention(query, key) 641 | else: 642 | embedded_gst = self.gst(style_input) 643 | 644 | embedded_speakers = embedded_speakers.repeat(1, embedded_text.size(1), 1) 645 | if hasattr(self, 'gst'): 646 | embedded_gst = embedded_gst.repeat(1, embedded_text.size(1), 1) 647 | encoder_outputs = torch.cat( 648 | (embedded_text, embedded_gst, embedded_speakers), dim=2) 649 | else: 650 | encoder_outputs = torch.cat( 651 | (embedded_text, embedded_speakers), dim=2) 652 | 653 | mel_outputs, gate_outputs, alignments = self.decoder.inference( 654 | encoder_outputs, f0s) 655 | 656 | mel_outputs_postnet = self.postnet(mel_outputs) 657 | mel_outputs_postnet = mel_outputs + mel_outputs_postnet 658 | 659 | return self.parse_output( 660 | [mel_outputs, mel_outputs_postnet, gate_outputs, alignments]) 661 | 662 | def inference_noattention(self, inputs): 663 | text, style_input, speaker_ids, f0s, attention_map = inputs 664 | embedded_inputs = self.embedding(text).transpose(1, 2) 665 | embedded_text = self.encoder.inference(embedded_inputs) 666 | embedded_speakers = self.speaker_embedding(speaker_ids)[:, None] 667 | if hasattr(self, 'gst'): 668 | if isinstance(style_input, int): 669 | query = torch.zeros(1, 1, self.gst.encoder.ref_enc_gru_size).cuda() 670 | GST = torch.tanh(self.gst.stl.embed) 671 | key = GST[style_input].unsqueeze(0).expand(1, -1, -1) 672 | embedded_gst = self.gst.stl.attention(query, key) 673 | else: 674 | embedded_gst = self.gst(style_input) 675 | 676 | embedded_speakers = embedded_speakers.repeat(1, embedded_text.size(1), 1) 677 | if hasattr(self, 'gst'): 678 | embedded_gst = embedded_gst.repeat(1, embedded_text.size(1), 1) 679 | encoder_outputs = torch.cat( 680 | (embedded_text, embedded_gst, embedded_speakers), dim=2) 681 | else: 682 | encoder_outputs = torch.cat( 683 | (embedded_text, embedded_speakers), 
dim=2) 684 | 685 | mel_outputs, gate_outputs, alignments = self.decoder.inference_noattention( 686 | encoder_outputs, f0s, attention_map) 687 | 688 | mel_outputs_postnet = self.postnet(mel_outputs) 689 | mel_outputs_postnet = mel_outputs + mel_outputs_postnet 690 | 691 | return self.parse_output( 692 | [mel_outputs, mel_outputs_postnet, gate_outputs, alignments]) 693 | -------------------------------------------------------------------------------- /modules.py: -------------------------------------------------------------------------------- 1 | # adapted from https://github.com/KinglittleQ/GST-Tacotron/blob/master/GST.py 2 | # MIT License 3 | # 4 | # Copyright (c) 2018 MagicGirl Sakura 5 | # 6 | # Permission is hereby granted, free of charge, to any person obtaining a copy 7 | # of this software and associated documentation files (the "Software"), to deal 8 | # in the Software without restriction, including without limitation the rights 9 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 10 | # copies of the Software, and to permit persons to whom the Software is 11 | # furnished to do so, subject to the following conditions: 12 | # 13 | # The above copyright notice and this permission notice shall be included in 14 | # all copies or substantial portions of the Software. 15 | # 16 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 17 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 18 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL 19 | # THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 20 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 21 | # FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 22 | # DEALINGS IN THE SOFTWARE. 
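#
# Minimal usage sketch for the GST module defined below; the hyper-parameter values
# here are illustrative assumptions, not necessarily the repo defaults:
#
#     import torch
#     from modules import GST
#
#     class RefHParams:                       # hypothetical stand-in for hparams.py
#         ref_enc_filters = [32, 32, 64, 64, 128, 128]
#         ref_enc_gru_size = 128
#         n_mel_channels = 80
#         token_num = 10
#         token_embedding_size = 256
#         num_heads = 8
#
#     gst = GST(RefHParams())
#     ref_mels = torch.randn(2, 100, 80)      # [N, frames, n_mel_channels] reference mels
#     style = gst(ref_mels)                   # -> [N, 1, token_embedding_size] style embedding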
23 | 24 | 25 | import torch 26 | import torch.nn as nn 27 | import torch.nn.init as init 28 | import torch.nn.functional as F 29 | 30 | 31 | class ReferenceEncoder(nn.Module): 32 | ''' 33 | inputs --- [N, Ty/r, n_mels*r] mels 34 | outputs --- [N, ref_enc_gru_size] 35 | ''' 36 | 37 | def __init__(self, hp): 38 | 39 | super().__init__() 40 | K = len(hp.ref_enc_filters) 41 | filters = [1] + hp.ref_enc_filters 42 | 43 | convs = [nn.Conv2d(in_channels=filters[i], 44 | out_channels=filters[i + 1], 45 | kernel_size=(3, 3), 46 | stride=(2, 2), 47 | padding=(1, 1)) for i in range(K)] 48 | self.convs = nn.ModuleList(convs) 49 | self.bns = nn.ModuleList( 50 | [nn.BatchNorm2d(num_features=hp.ref_enc_filters[i]) 51 | for i in range(K)]) 52 | 53 | out_channels = self.calculate_channels(hp.n_mel_channels, 3, 2, 1, K) 54 | self.gru = nn.GRU(input_size=hp.ref_enc_filters[-1] * out_channels, 55 | hidden_size=hp.ref_enc_gru_size, 56 | batch_first=True) 57 | self.n_mel_channels = hp.n_mel_channels 58 | self.ref_enc_gru_size = hp.ref_enc_gru_size 59 | 60 | def forward(self, inputs, input_lengths=None): 61 | out = inputs.view(inputs.size(0), 1, -1, self.n_mel_channels) 62 | for conv, bn in zip(self.convs, self.bns): 63 | out = conv(out) 64 | out = bn(out) 65 | out = F.relu(out) 66 | 67 | out = out.transpose(1, 2) # [N, Ty//2^K, 128, n_mels//2^K] 68 | N, T = out.size(0), out.size(1) 69 | out = out.contiguous().view(N, T, -1) # [N, Ty//2^K, 128*n_mels//2^K] 70 | 71 | if input_lengths is not None: 72 | input_lengths = torch.ceil(input_lengths.float() / 2 ** len(self.convs)) 73 | input_lengths = input_lengths.cpu().numpy().astype(int) 74 | out = nn.utils.rnn.pack_padded_sequence( 75 | out, input_lengths, batch_first=True, enforce_sorted=False) 76 | 77 | self.gru.flatten_parameters() 78 | _, out = self.gru(out) 79 | return out.squeeze(0) 80 | 81 | def calculate_channels(self, L, kernel_size, stride, pad, n_convs): 82 | for _ in range(n_convs): 83 | L = (L - kernel_size + 2 * pad) // stride + 1 84 | return L 85 | 86 | 87 | class STL(nn.Module): 88 | ''' 89 | inputs --- [N, token_embedding_size//2] 90 | ''' 91 | def __init__(self, hp): 92 | super().__init__() 93 | self.embed = nn.Parameter(torch.FloatTensor(hp.token_num, hp.token_embedding_size // hp.num_heads)) 94 | d_q = hp.ref_enc_gru_size 95 | d_k = hp.token_embedding_size // hp.num_heads 96 | self.attention = MultiHeadAttention( 97 | query_dim=d_q, key_dim=d_k, num_units=hp.token_embedding_size, 98 | num_heads=hp.num_heads) 99 | 100 | init.normal_(self.embed, mean=0, std=0.5) 101 | 102 | def forward(self, inputs): 103 | N = inputs.size(0) 104 | query = inputs.unsqueeze(1) 105 | keys = torch.tanh(self.embed).unsqueeze(0).expand(N, -1, -1) # [N, token_num, token_embedding_size // num_heads] 106 | style_embed = self.attention(query, keys) 107 | 108 | return style_embed 109 | 110 | 111 | class MultiHeadAttention(nn.Module): 112 | ''' 113 | input: 114 | query --- [N, T_q, query_dim] 115 | key --- [N, T_k, key_dim] 116 | output: 117 | out --- [N, T_q, num_units] 118 | ''' 119 | def __init__(self, query_dim, key_dim, num_units, num_heads): 120 | super().__init__() 121 | self.num_units = num_units 122 | self.num_heads = num_heads 123 | self.key_dim = key_dim 124 | 125 | self.W_query = nn.Linear(in_features=query_dim, out_features=num_units, bias=False) 126 | self.W_key = nn.Linear(in_features=key_dim, out_features=num_units, bias=False) 127 | self.W_value = nn.Linear(in_features=key_dim, out_features=num_units, bias=False) 128 | 129 | def forward(self, query, key): 130 | 
querys = self.W_query(query) # [N, T_q, num_units] 131 | keys = self.W_key(key) # [N, T_k, num_units] 132 | values = self.W_value(key) 133 | 134 | split_size = self.num_units // self.num_heads 135 | querys = torch.stack(torch.split(querys, split_size, dim=2), dim=0) # [h, N, T_q, num_units/h] 136 | keys = torch.stack(torch.split(keys, split_size, dim=2), dim=0) # [h, N, T_k, num_units/h] 137 | values = torch.stack(torch.split(values, split_size, dim=2), dim=0) # [h, N, T_k, num_units/h] 138 | 139 | # score = softmax(QK^T / (d_k ** 0.5)) 140 | scores = torch.matmul(querys, keys.transpose(2, 3)) # [h, N, T_q, T_k] 141 | scores = scores / (self.key_dim ** 0.5) 142 | scores = F.softmax(scores, dim=3) 143 | 144 | # out = score * V 145 | out = torch.matmul(scores, values) # [h, N, T_q, num_units/h] 146 | out = torch.cat(torch.split(out, 1, dim=0), dim=3).squeeze(0) # [N, T_q, num_units] 147 | 148 | return out 149 | 150 | 151 | class GST(nn.Module): 152 | def __init__(self, hp): 153 | super().__init__() 154 | self.encoder = ReferenceEncoder(hp) 155 | self.stl = STL(hp) 156 | 157 | def forward(self, inputs, input_lengths=None): 158 | enc_out = self.encoder(inputs, input_lengths=input_lengths) 159 | style_embed = self.stl(enc_out) 160 | 161 | return style_embed 162 | -------------------------------------------------------------------------------- /multiproc.py: -------------------------------------------------------------------------------- 1 | import time 2 | import torch 3 | import sys 4 | import subprocess 5 | 6 | argslist = list(sys.argv)[1:] 7 | num_gpus = torch.cuda.device_count() 8 | argslist.append('--n_gpus={}'.format(num_gpus)) 9 | workers = [] 10 | job_id = time.strftime("%Y_%m_%d-%H%M%S") 11 | argslist.append("--group_name=group_{}".format(job_id)) 12 | 13 | for i in range(num_gpus): 14 | argslist.append('--rank={}'.format(i)) 15 | stdout = None if i == 0 else open("logs/{}_GPU_{}.log".format(job_id, i), 16 | "w") 17 | print(argslist) 18 | p = subprocess.Popen([str(sys.executable)]+argslist, stdout=stdout) 19 | workers.append(p) 20 | argslist = argslist[:-1] 21 | 22 | for p in workers: 23 | p.wait() 24 | -------------------------------------------------------------------------------- /plotting_utils.py: -------------------------------------------------------------------------------- 1 | import matplotlib 2 | matplotlib.use("Agg") 3 | import matplotlib.pylab as plt 4 | import numpy as np 5 | 6 | 7 | def save_figure_to_numpy(fig): 8 | # save it to a numpy array. 
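    # np.fromstring with sep='' is deprecated in recent NumPy releases;
    # np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8) is the usual
    # drop-in replacement if the call below starts raising a DeprecationWarning.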
9 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 10 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 11 | return data 12 | 13 | 14 | def plot_alignment_to_numpy(alignment, info=None): 15 | fig, ax = plt.subplots(figsize=(6, 4)) 16 | im = ax.imshow(alignment, aspect='auto', origin='lower', 17 | interpolation='none') 18 | fig.colorbar(im, ax=ax) 19 | xlabel = 'Decoder timestep' 20 | if info is not None: 21 | xlabel += '\n\n' + info 22 | plt.xlabel(xlabel) 23 | plt.ylabel('Encoder timestep') 24 | plt.tight_layout() 25 | 26 | fig.canvas.draw() 27 | data = save_figure_to_numpy(fig) 28 | plt.close() 29 | return data 30 | 31 | 32 | def plot_spectrogram_to_numpy(spectrogram): 33 | fig, ax = plt.subplots(figsize=(12, 3)) 34 | im = ax.imshow(spectrogram, aspect="auto", origin="lower", 35 | interpolation='none') 36 | plt.colorbar(im, ax=ax) 37 | plt.xlabel("Frames") 38 | plt.ylabel("Channels") 39 | plt.tight_layout() 40 | 41 | fig.canvas.draw() 42 | data = save_figure_to_numpy(fig) 43 | plt.close() 44 | return data 45 | 46 | 47 | def plot_gate_outputs_to_numpy(gate_targets, gate_outputs): 48 | fig, ax = plt.subplots(figsize=(12, 3)) 49 | ax.scatter(range(len(gate_targets)), gate_targets, alpha=0.5, 50 | color='green', marker='+', s=1, label='target') 51 | ax.scatter(range(len(gate_outputs)), gate_outputs, alpha=0.5, 52 | color='red', marker='.', s=1, label='predicted') 53 | 54 | plt.xlabel("Frames (Green target, Red predicted)") 55 | plt.ylabel("Gate State") 56 | plt.tight_layout() 57 | 58 | fig.canvas.draw() 59 | data = save_figure_to_numpy(fig) 60 | plt.close() 61 | return data 62 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | matplotlib==2.1.0 2 | tensorflow==1.15.2 3 | inflect==0.2.5 4 | librosa==0.6.0 5 | scipy==1.0.0 6 | tensorboardX==1.1 7 | Unidecode==1.0.22 8 | pillow 9 | nltk==3.4.5 10 | jamo==0.4.1 11 | music21 12 | -------------------------------------------------------------------------------- /stft.py: -------------------------------------------------------------------------------- 1 | """ 2 | BSD 3-Clause License 3 | 4 | Copyright (c) 2017, Prem Seetharaman 5 | All rights reserved. 6 | 7 | * Redistribution and use in source and binary forms, with or without 8 | modification, are permitted provided that the following conditions are met: 9 | 10 | * Redistributions of source code must retain the above copyright notice, 11 | this list of conditions and the following disclaimer. 12 | 13 | * Redistributions in binary form must reproduce the above copyright notice, this 14 | list of conditions and the following disclaimer in the 15 | documentation and/or other materials provided with the distribution. 16 | 17 | * Neither the name of the copyright holder nor the names of its 18 | contributors may be used to endorse or promote products derived from this 19 | software without specific prior written permission. 20 | 21 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 22 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 23 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 24 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR 25 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 26 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 27 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON 28 | ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 29 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 30 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 31 | """ 32 | 33 | import torch 34 | import numpy as np 35 | import torch.nn.functional as F 36 | from torch.autograd import Variable 37 | from scipy.signal import get_window 38 | from librosa.util import pad_center, tiny 39 | from audio_processing import window_sumsquare 40 | 41 | 42 | class STFT(torch.nn.Module): 43 | """adapted from Prem Seetharaman's https://github.com/pseeth/pytorch-stft""" 44 | def __init__(self, filter_length=800, hop_length=200, win_length=800, 45 | window='hann'): 46 | super(STFT, self).__init__() 47 | self.filter_length = filter_length 48 | self.hop_length = hop_length 49 | self.win_length = win_length 50 | self.window = window 51 | self.forward_transform = None 52 | scale = self.filter_length / self.hop_length 53 | fourier_basis = np.fft.fft(np.eye(self.filter_length)) 54 | 55 | cutoff = int((self.filter_length / 2 + 1)) 56 | fourier_basis = np.vstack([np.real(fourier_basis[:cutoff, :]), 57 | np.imag(fourier_basis[:cutoff, :])]) 58 | 59 | forward_basis = torch.FloatTensor(fourier_basis[:, None, :]) 60 | inverse_basis = torch.FloatTensor( 61 | np.linalg.pinv(scale * fourier_basis).T[:, None, :]) 62 | 63 | if window is not None: 64 | assert(filter_length >= win_length) 65 | # get window and zero center pad it to filter_length 66 | fft_window = get_window(window, win_length, fftbins=True) 67 | fft_window = pad_center(fft_window, filter_length) 68 | fft_window = torch.from_numpy(fft_window).float() 69 | 70 | # window the bases 71 | forward_basis *= fft_window 72 | inverse_basis *= fft_window 73 | 74 | self.register_buffer('forward_basis', forward_basis.float()) 75 | self.register_buffer('inverse_basis', inverse_basis.float()) 76 | 77 | def transform(self, input_data): 78 | num_batches = input_data.size(0) 79 | num_samples = input_data.size(1) 80 | 81 | self.num_samples = num_samples 82 | 83 | # similar to librosa, reflect-pad the input 84 | input_data = input_data.view(num_batches, 1, num_samples) 85 | input_data = F.pad( 86 | input_data.unsqueeze(1), 87 | (int(self.filter_length / 2), int(self.filter_length / 2), 0, 0), 88 | mode='reflect') 89 | input_data = input_data.squeeze(1) 90 | 91 | forward_transform = F.conv1d( 92 | input_data, 93 | Variable(self.forward_basis, requires_grad=False), 94 | stride=self.hop_length, 95 | padding=0) 96 | 97 | cutoff = int((self.filter_length / 2) + 1) 98 | real_part = forward_transform[:, :cutoff, :] 99 | imag_part = forward_transform[:, cutoff:, :] 100 | 101 | magnitude = torch.sqrt(real_part**2 + imag_part**2) 102 | phase = torch.autograd.Variable( 103 | torch.atan2(imag_part.data, real_part.data)) 104 | 105 | return magnitude, phase 106 | 107 | def inverse(self, magnitude, phase): 108 | recombine_magnitude_phase = torch.cat( 109 | [magnitude*torch.cos(phase), magnitude*torch.sin(phase)], dim=1) 110 | 111 | inverse_transform = F.conv_transpose1d( 112 | recombine_magnitude_phase, 113 | Variable(self.inverse_basis, requires_grad=False), 114 | stride=self.hop_length, 115 | padding=0) 116 | 117 | 
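        # window_sumsquare() returns the overlap-added square of the analysis window;
        # dividing by it where it is non-negligible removes the amplitude modulation
        # introduced by windowing, and the filter_length / hop_length factor that
        # follows restores the overall scale of the reconstruction.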
if self.window is not None: 118 | window_sum = window_sumsquare( 119 | self.window, magnitude.size(-1), hop_length=self.hop_length, 120 | win_length=self.win_length, n_fft=self.filter_length, 121 | dtype=np.float32) 122 | # remove modulation effects 123 | approx_nonzero_indices = torch.from_numpy( 124 | np.where(window_sum > tiny(window_sum))[0]) 125 | window_sum = torch.autograd.Variable( 126 | torch.from_numpy(window_sum), requires_grad=False) 127 | window_sum = window_sum.cuda() if magnitude.is_cuda else window_sum 128 | inverse_transform[:, :, approx_nonzero_indices] /= window_sum[approx_nonzero_indices] 129 | 130 | # scale by hop ratio 131 | inverse_transform *= float(self.filter_length) / self.hop_length 132 | 133 | inverse_transform = inverse_transform[:, :, int(self.filter_length/2):] 134 | inverse_transform = inverse_transform[:, :, :-int(self.filter_length/2):] 135 | 136 | return inverse_transform 137 | 138 | def forward(self, input_data): 139 | self.magnitude, self.phase = self.transform(input_data) 140 | reconstruction = self.inverse(self.magnitude, self.phase) 141 | return reconstruction 142 | -------------------------------------------------------------------------------- /text/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2017 Keith Ito 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 
20 | -------------------------------------------------------------------------------- /text/__init__.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | import re 3 | import random 4 | from text import cleaners 5 | from text.symbols import symbols 6 | 7 | 8 | # Mappings from symbol to numeric ID and vice versa: 9 | _symbol_to_id = {s: i for i, s in enumerate(symbols)} 10 | _id_to_symbol = {i: s for i, s in enumerate(symbols)} 11 | 12 | # Regular expression matching text enclosed in curly braces: 13 | _curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)') 14 | _words_re = re.compile(r"([a-zA-ZÀ-ž]+['][a-zA-ZÀ-ž]{1,2}|[a-zA-ZÀ-ž]+)|([{][^}]+[}]|[^a-zA-ZÀ-ž{}]+)") 15 | 16 | 17 | def get_arpabet(word, dictionary): 18 | word_arpabet = dictionary.lookup(word) 19 | if word_arpabet is not None: 20 | return "{" + word_arpabet[0] + "}" 21 | else: 22 | return word 23 | 24 | 25 | def text_to_sequence(text, cleaner_names, dictionary=None, p_arpabet=1.0): 26 | '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text. 27 | 28 | The text can optionally have ARPAbet sequences enclosed in curly braces embedded 29 | in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street." 30 | 31 | Args: 32 | text: string to convert to a sequence 33 | cleaner_names: names of the cleaner functions to run the text through 34 | dictionary: arpabet class with arpabet dictionary 35 | 36 | Returns: 37 | List of integers corresponding to the symbols in the text 38 | ''' 39 | sequence = [] 40 | 41 | # Check for curly braces and treat their contents as ARPAbet: 42 | while len(text): 43 | m = _curly_re.match(text) 44 | if not m: 45 | clean_text = _clean_text(text, cleaner_names) 46 | if dictionary is not None: 47 | words = _words_re.findall(text) 48 | clean_text = [ 49 | get_arpabet(word[0], dictionary) 50 | if ((word[0] != '') and random.random() < p_arpabet) else word[1] 51 | for word in words] 52 | 53 | for i in range(len(clean_text)): 54 | t = clean_text[i] 55 | if t.startswith("{"): 56 | sequence += _arpabet_to_sequence(t[1:-1]) 57 | else: 58 | sequence += _symbols_to_sequence(t) 59 | #sequence += space 60 | else: 61 | sequence += _symbols_to_sequence(clean_text) 62 | break 63 | 64 | sequence += text_to_sequence(m.group(1), cleaner_names, dictionary, p_arpabet) 65 | sequence += _arpabet_to_sequence(m.group(2)) 66 | text = m.group(3) 67 | 68 | return sequence 69 | 70 | 71 | def sequence_to_text(sequence): 72 | '''Converts a sequence of IDs back to a string''' 73 | result = '' 74 | for symbol_id in sequence: 75 | if symbol_id in _id_to_symbol: 76 | s = _id_to_symbol[symbol_id] 77 | # Enclose ARPAbet back in curly braces: 78 | if len(s) > 1 and s[0] == '@': 79 | s = '{%s}' % s[1:] 80 | result += s 81 | return result.replace('}{', ' ') 82 | 83 | 84 | def _clean_text(text, cleaner_names): 85 | for name in cleaner_names: 86 | cleaner = getattr(cleaners, name) 87 | if not cleaner: 88 | raise Exception('Unknown cleaner: %s' % name) 89 | text = cleaner(text) 90 | return text 91 | 92 | 93 | def _symbols_to_sequence(symbols): 94 | return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)] 95 | 96 | 97 | def _arpabet_to_sequence(text): 98 | return _symbols_to_sequence(['@' + s for s in text.split()]) 99 | 100 | 101 | def _should_keep_symbol(s): 102 | return s in _symbol_to_id and s is not '_' and s is not '~' 103 | 104 | -------------------------------------------------------------------------------- 
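A minimal usage sketch for text_to_sequence above, assuming the CMU dictionary shipped
under data/cmu_dictionary and the english_cleaners pipeline from text/cleaners.py; the
exact call is illustrative:

    from text import text_to_sequence, sequence_to_text
    from text import cmudict

    arpabet_dict = cmudict.CMUDict('data/cmu_dictionary')
    sequence = text_to_sequence("Turn left on {HH AW1 S S T AH0 N} Street.",
                                ['english_cleaners'], dictionary=arpabet_dict,
                                p_arpabet=1.0)
    print(sequence_to_text(sequence))  # ARPAbet chunks are rendered back in curly braces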
/text/cleaners.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | ''' 4 | Cleaners are transformations that run over the input text at both training and eval time. 5 | 6 | Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners" 7 | hyperparameter. Some cleaners are English-specific. You'll typically want to use: 8 | 1. "english_cleaners" for English text 9 | 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using 10 | the Unidecode library (https://pypi.python.org/pypi/Unidecode) 11 | 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update 12 | the symbols in symbols.py to match your data). 13 | ''' 14 | 15 | import re 16 | from unidecode import unidecode 17 | from .numbers import normalize_numbers 18 | 19 | 20 | # Regular expression matching whitespace: 21 | _whitespace_re = re.compile(r'\s+') 22 | 23 | # List of (regular expression, replacement) pairs for abbreviations: 24 | _abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [ 25 | ('mrs', 'misess'), 26 | ('mr', 'mister'), 27 | ('dr', 'doctor'), 28 | ('st', 'saint'), 29 | ('co', 'company'), 30 | ('jr', 'junior'), 31 | ('maj', 'major'), 32 | ('gen', 'general'), 33 | ('drs', 'doctors'), 34 | ('rev', 'reverend'), 35 | ('lt', 'lieutenant'), 36 | ('hon', 'honorable'), 37 | ('sgt', 'sergeant'), 38 | ('capt', 'captain'), 39 | ('esq', 'esquire'), 40 | ('ltd', 'limited'), 41 | ('col', 'colonel'), 42 | ('ft', 'fort'), 43 | ]] 44 | 45 | 46 | def expand_abbreviations(text): 47 | for regex, replacement in _abbreviations: 48 | text = re.sub(regex, replacement, text) 49 | return text 50 | 51 | 52 | def expand_numbers(text): 53 | return normalize_numbers(text) 54 | 55 | 56 | def lowercase(text): 57 | return text.lower() 58 | 59 | 60 | def collapse_whitespace(text): 61 | return re.sub(_whitespace_re, ' ', text) 62 | 63 | 64 | def convert_to_ascii(text): 65 | return unidecode(text) 66 | 67 | 68 | def basic_cleaners(text): 69 | '''Basic pipeline that lowercases and collapses whitespace without transliteration.''' 70 | text = lowercase(text) 71 | text = collapse_whitespace(text) 72 | return text 73 | 74 | 75 | def transliteration_cleaners(text): 76 | '''Pipeline for non-English text that transliterates to ASCII.''' 77 | text = convert_to_ascii(text) 78 | text = lowercase(text) 79 | text = collapse_whitespace(text) 80 | return text 81 | 82 | 83 | def english_cleaners(text): 84 | '''Pipeline for English text, including number and abbreviation expansion.''' 85 | text = convert_to_ascii(text) 86 | text = lowercase(text) 87 | text = expand_numbers(text) 88 | text = expand_abbreviations(text) 89 | text = collapse_whitespace(text) 90 | return text 91 | -------------------------------------------------------------------------------- /text/cmudict.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | import re 4 | 5 | 6 | valid_symbols = [ 7 | 'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1', 'AH2', 8 | 'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0', 'AY1', 'AY2', 9 | 'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', 'ER1', 'ER2', 'EY', 10 | 'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0', 'IH1', 'IH2', 'IY', 'IY0', 'IY1', 11 | 'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OW0', 'OW1', 
'OW2', 'OY', 'OY0', 12 | 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW', 13 | 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH' 14 | ] 15 | 16 | _valid_symbol_set = set(valid_symbols) 17 | 18 | 19 | class CMUDict: 20 | '''Thin wrapper around CMUDict data. http://www.speech.cs.cmu.edu/cgi-bin/cmudict''' 21 | def __init__(self, file_or_path, keep_ambiguous=True): 22 | if isinstance(file_or_path, str): 23 | with open(file_or_path, encoding='latin-1') as f: 24 | entries = _parse_cmudict(f) 25 | else: 26 | entries = _parse_cmudict(file_or_path) 27 | if not keep_ambiguous: 28 | entries = {word: pron for word, pron in entries.items() if len(pron) == 1} 29 | self._entries = entries 30 | 31 | 32 | def __len__(self): 33 | return len(self._entries) 34 | 35 | 36 | def lookup(self, word): 37 | '''Returns list of ARPAbet pronunciations of the given word.''' 38 | return self._entries.get(word.upper()) 39 | 40 | 41 | 42 | _alt_re = re.compile(r'\([0-9]+\)') 43 | 44 | 45 | def _parse_cmudict(file): 46 | cmudict = {} 47 | for line in file: 48 | if len(line) and (line[0] >= 'A' and line[0] <= 'Z' or line[0] == "'"): 49 | parts = line.split(' ') 50 | word = re.sub(_alt_re, '', parts[0]) 51 | pronunciation = _get_pronunciation(parts[1]) 52 | if pronunciation: 53 | if word in cmudict: 54 | cmudict[word].append(pronunciation) 55 | else: 56 | cmudict[word] = [pronunciation] 57 | return cmudict 58 | 59 | 60 | def _get_pronunciation(s): 61 | parts = s.strip().split(' ') 62 | for part in parts: 63 | if part not in _valid_symbol_set: 64 | return None 65 | return ' '.join(parts) 66 | -------------------------------------------------------------------------------- /text/numbers.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | import inflect 4 | import re 5 | 6 | 7 | _inflect = inflect.engine() 8 | _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') 9 | _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') 10 | _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') 11 | _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') 12 | _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') 13 | _number_re = re.compile(r'[0-9]+') 14 | 15 | 16 | def _remove_commas(m): 17 | return m.group(1).replace(',', '') 18 | 19 | 20 | def _expand_decimal_point(m): 21 | return m.group(1).replace('.', ' point ') 22 | 23 | 24 | def _expand_dollars(m): 25 | match = m.group(1) 26 | parts = match.split('.') 27 | if len(parts) > 2: 28 | return match + ' dollars' # Unexpected format 29 | dollars = int(parts[0]) if parts[0] else 0 30 | cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 31 | if dollars and cents: 32 | dollar_unit = 'dollar' if dollars == 1 else 'dollars' 33 | cent_unit = 'cent' if cents == 1 else 'cents' 34 | return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) 35 | elif dollars: 36 | dollar_unit = 'dollar' if dollars == 1 else 'dollars' 37 | return '%s %s' % (dollars, dollar_unit) 38 | elif cents: 39 | cent_unit = 'cent' if cents == 1 else 'cents' 40 | return '%s %s' % (cents, cent_unit) 41 | else: 42 | return 'zero dollars' 43 | 44 | 45 | def _expand_ordinal(m): 46 | return _inflect.number_to_words(m.group(0)) 47 | 48 | 49 | def _expand_number(m): 50 | num = int(m.group(0)) 51 | if num > 1000 and num < 3000: 52 | if num == 2000: 53 | return 'two thousand' 54 | elif num > 2000 and num < 2010: 55 | return 'two thousand ' + _inflect.number_to_words(num % 100) 56 | elif num % 100 == 0: 57 | return 
_inflect.number_to_words(num // 100) + ' hundred' 58 | else: 59 | return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') 60 | else: 61 | return _inflect.number_to_words(num, andword='') 62 | 63 | 64 | def normalize_numbers(text): 65 | text = re.sub(_comma_number_re, _remove_commas, text) 66 | text = re.sub(_pounds_re, r'\1 pounds', text) 67 | text = re.sub(_dollars_re, _expand_dollars, text) 68 | text = re.sub(_decimal_number_re, _expand_decimal_point, text) 69 | text = re.sub(_ordinal_re, _expand_ordinal, text) 70 | text = re.sub(_number_re, _expand_number, text) 71 | return text 72 | -------------------------------------------------------------------------------- /text/symbols.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | ''' 4 | Defines the set of symbols used in text input to the model. 5 | 6 | The default is a set of ASCII characters that works well for English or text that has been run through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details. ''' 7 | from text import cmudict 8 | 9 | _punctuation = '!\'",.:;? ' 10 | _math = '#%&*+-/[]()' 11 | _special = '_@©°½—₩€$' 12 | _accented = 'áçéêëñöøćž' 13 | _numbers = '0123456789' 14 | _letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' 15 | 16 | # Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as 17 | # uppercase letters): 18 | _arpabet = ['@' + s for s in cmudict.valid_symbols] 19 | 20 | # Export all symbols: 21 | symbols = list(_punctuation + _math + _special + _accented + _numbers + _letters) + _arpabet 22 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import argparse 4 | import math 5 | from numpy import finfo 6 | 7 | import torch 8 | from distributed import apply_gradient_allreduce 9 | import torch.distributed as dist 10 | from torch.utils.data.distributed import DistributedSampler 11 | from torch.utils.data import DataLoader 12 | 13 | from model import load_model 14 | from data_utils import TextMelLoader, TextMelCollate 15 | from loss_function import Tacotron2Loss 16 | from logger import Tacotron2Logger 17 | from hparams import create_hparams 18 | 19 | 20 | def reduce_tensor(tensor, n_gpus): 21 | rt = tensor.clone() 22 | dist.all_reduce(rt, op=dist.ReduceOp.SUM) 23 | rt /= n_gpus 24 | return rt 25 | 26 | 27 | def init_distributed(hparams, n_gpus, rank, group_name): 28 | assert torch.cuda.is_available(), "Distributed mode requires CUDA." 29 | print("Initializing Distributed") 30 | 31 | # Set cuda device so everything is done on the right GPU. 
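    # rank % torch.cuda.device_count() maps each process rank to a local GPU index, so on
    # a single node the process launched by multiproc.py with --rank=i runs on GPU i.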
32 | torch.cuda.set_device(rank % torch.cuda.device_count()) 33 | 34 | # Initialize distributed communication 35 | dist.init_process_group( 36 | backend=hparams.dist_backend, init_method=hparams.dist_url, 37 | world_size=n_gpus, rank=rank, group_name=group_name) 38 | 39 | print("Done initializing distributed") 40 | 41 | 42 | def prepare_dataloaders(hparams): 43 | # Get data, data loaders and collate function ready 44 | trainset = TextMelLoader(hparams.training_files, hparams) 45 | valset = TextMelLoader(hparams.validation_files, hparams, 46 | speaker_ids=trainset.speaker_ids) 47 | collate_fn = TextMelCollate(hparams.n_frames_per_step) 48 | 49 | if hparams.distributed_run: 50 | train_sampler = DistributedSampler(trainset) 51 | shuffle = False 52 | else: 53 | train_sampler = None 54 | shuffle = True 55 | 56 | train_loader = DataLoader(trainset, num_workers=1, shuffle=shuffle, 57 | sampler=train_sampler, 58 | batch_size=hparams.batch_size, pin_memory=False, 59 | drop_last=True, collate_fn=collate_fn) 60 | return train_loader, valset, collate_fn, train_sampler 61 | 62 | 63 | def prepare_directories_and_logger(output_directory, log_directory, rank): 64 | if rank == 0: 65 | if not os.path.isdir(output_directory): 66 | os.makedirs(output_directory) 67 | os.chmod(output_directory, 0o775) 68 | logger = Tacotron2Logger(os.path.join(output_directory, log_directory)) 69 | else: 70 | logger = None 71 | return logger 72 | 73 | 74 | def warm_start_model(checkpoint_path, model, ignore_layers): 75 | assert os.path.isfile(checkpoint_path) 76 | print("Warm starting model from checkpoint '{}'".format(checkpoint_path)) 77 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 78 | model_dict = checkpoint_dict['state_dict'] 79 | if len(ignore_layers) > 0: 80 | model_dict = {k: v for k, v in model_dict.items() 81 | if k not in ignore_layers} 82 | dummy_dict = model.state_dict() 83 | dummy_dict.update(model_dict) 84 | model_dict = dummy_dict 85 | model.load_state_dict(model_dict) 86 | return model 87 | 88 | 89 | def load_checkpoint(checkpoint_path, model, optimizer): 90 | assert os.path.isfile(checkpoint_path) 91 | print("Loading checkpoint '{}'".format(checkpoint_path)) 92 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 93 | model.load_state_dict(checkpoint_dict['state_dict']) 94 | optimizer.load_state_dict(checkpoint_dict['optimizer']) 95 | learning_rate = checkpoint_dict['learning_rate'] 96 | iteration = checkpoint_dict['iteration'] 97 | print("Loaded checkpoint '{}' from iteration {}" .format( 98 | checkpoint_path, iteration)) 99 | return model, optimizer, learning_rate, iteration 100 | 101 | 102 | def save_checkpoint(model, optimizer, learning_rate, iteration, filepath): 103 | print("Saving model and optimizer state at iteration {} to {}".format( 104 | iteration, filepath)) 105 | torch.save({'iteration': iteration, 106 | 'state_dict': model.state_dict(), 107 | 'optimizer': optimizer.state_dict(), 108 | 'learning_rate': learning_rate}, filepath) 109 | 110 | 111 | def validate(model, criterion, valset, iteration, batch_size, n_gpus, 112 | collate_fn, logger, distributed_run, rank): 113 | """Handles all the validation scoring and printing""" 114 | model.eval() 115 | with torch.no_grad(): 116 | val_sampler = DistributedSampler(valset) if distributed_run else None 117 | val_loader = DataLoader(valset, sampler=val_sampler, num_workers=1, 118 | shuffle=False, batch_size=batch_size, 119 | pin_memory=False, collate_fn=collate_fn) 120 | 121 | val_loss = 0.0 122 | for i, batch in 
enumerate(val_loader): 123 | x, y = model.parse_batch(batch) 124 | y_pred = model(x) 125 | loss = criterion(y_pred, y) 126 | if distributed_run: 127 | reduced_val_loss = reduce_tensor(loss.data, n_gpus).item() 128 | else: 129 | reduced_val_loss = loss.item() 130 | val_loss += reduced_val_loss 131 | val_loss = val_loss / (i + 1) 132 | 133 | model.train() 134 | if rank == 0: 135 | print("Validation loss {}: {:9f} ".format(iteration, reduced_val_loss)) 136 | logger.log_validation(val_loss, model, y, y_pred, iteration) 137 | 138 | 139 | def train(output_directory, log_directory, checkpoint_path, warm_start, n_gpus, 140 | rank, group_name, hparams): 141 | """Training and validation logging results to tensorboard and stdout 142 | 143 | Params 144 | ------ 145 | output_directory (string): directory to save checkpoints 146 | log_directory (string) directory to save tensorboard logs 147 | checkpoint_path(string): checkpoint path 148 | n_gpus (int): number of gpus 149 | rank (int): rank of current gpu 150 | hparams (object): comma separated list of "name=value" pairs. 151 | """ 152 | if hparams.distributed_run: 153 | init_distributed(hparams, n_gpus, rank, group_name) 154 | 155 | torch.manual_seed(hparams.seed) 156 | torch.cuda.manual_seed(hparams.seed) 157 | 158 | model = load_model(hparams) 159 | learning_rate = hparams.learning_rate 160 | optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, 161 | weight_decay=hparams.weight_decay) 162 | 163 | if hparams.fp16_run: 164 | from apex import amp 165 | model, optimizer = amp.initialize( 166 | model, optimizer, opt_level='O2') 167 | 168 | if hparams.distributed_run: 169 | model = apply_gradient_allreduce(model) 170 | 171 | criterion = Tacotron2Loss() 172 | 173 | logger = prepare_directories_and_logger( 174 | output_directory, log_directory, rank) 175 | 176 | train_loader, valset, collate_fn, train_sampler = prepare_dataloaders(hparams) 177 | 178 | # Load checkpoint if one exists 179 | iteration = 0 180 | epoch_offset = 0 181 | if checkpoint_path is not None: 182 | if warm_start: 183 | model = warm_start_model( 184 | checkpoint_path, model, hparams.ignore_layers) 185 | else: 186 | model, optimizer, _learning_rate, iteration = load_checkpoint( 187 | checkpoint_path, model, optimizer) 188 | if hparams.use_saved_learning_rate: 189 | learning_rate = _learning_rate 190 | iteration += 1 # next iteration is iteration + 1 191 | epoch_offset = max(0, int(iteration / len(train_loader))) 192 | 193 | model.train() 194 | is_overflow = False 195 | # ================ MAIN TRAINNIG LOOP! 
=================== 196 | for epoch in range(epoch_offset, hparams.epochs): 197 | print("Epoch: {}".format(epoch)) 198 | if train_sampler is not None: 199 | train_sampler.set_epoch(epoch) 200 | for i, batch in enumerate(train_loader): 201 | start = time.perf_counter() 202 | if iteration > 0 and iteration % hparams.learning_rate_anneal == 0: 203 | learning_rate = max( 204 | hparams.learning_rate_min, learning_rate * 0.5) 205 | for param_group in optimizer.param_groups: 206 | param_group['lr'] = learning_rate 207 | 208 | model.zero_grad() 209 | x, y = model.parse_batch(batch) 210 | y_pred = model(x) 211 | 212 | loss = criterion(y_pred, y) 213 | if hparams.distributed_run: 214 | reduced_loss = reduce_tensor(loss.data, n_gpus).item() 215 | else: 216 | reduced_loss = loss.item() 217 | 218 | if hparams.fp16_run: 219 | with amp.scale_loss(loss, optimizer) as scaled_loss: 220 | scaled_loss.backward() 221 | else: 222 | loss.backward() 223 | 224 | if hparams.fp16_run: 225 | grad_norm = torch.nn.utils.clip_grad_norm_( 226 | amp.master_params(optimizer), hparams.grad_clip_thresh) 227 | is_overflow = math.isnan(grad_norm) 228 | else: 229 | grad_norm = torch.nn.utils.clip_grad_norm_( 230 | model.parameters(), hparams.grad_clip_thresh) 231 | 232 | optimizer.step() 233 | 234 | if not is_overflow and rank == 0: 235 | duration = time.perf_counter() - start 236 | print("Train loss {} {:.6f} Grad Norm {:.6f} {:.2f}s/it".format( 237 | iteration, reduced_loss, grad_norm, duration)) 238 | logger.log_training( 239 | reduced_loss, grad_norm, learning_rate, duration, iteration) 240 | 241 | if not is_overflow and (iteration % hparams.iters_per_checkpoint == 0): 242 | validate(model, criterion, valset, iteration, 243 | hparams.batch_size, n_gpus, collate_fn, logger, 244 | hparams.distributed_run, rank) 245 | if rank == 0: 246 | checkpoint_path = os.path.join( 247 | output_directory, "checkpoint_{}".format(iteration)) 248 | save_checkpoint(model, optimizer, learning_rate, iteration, 249 | checkpoint_path) 250 | 251 | iteration += 1 252 | 253 | 254 | if __name__ == '__main__': 255 | parser = argparse.ArgumentParser() 256 | parser.add_argument('-o', '--output_directory', type=str, 257 | help='directory to save checkpoints') 258 | parser.add_argument('-l', '--log_directory', type=str, 259 | help='directory to save tensorboard logs') 260 | parser.add_argument('-c', '--checkpoint_path', type=str, default=None, 261 | required=False, help='checkpoint path') 262 | parser.add_argument('--warm_start', action='store_true', 263 | help='load model weights only, ignore specified layers') 264 | parser.add_argument('--n_gpus', type=int, default=1, 265 | required=False, help='number of gpus') 266 | parser.add_argument('--rank', type=int, default=0, 267 | required=False, help='rank of current gpu') 268 | parser.add_argument('--group_name', type=str, default='group_name', 269 | required=False, help='Distributed group name') 270 | parser.add_argument('--hparams', type=str, 271 | required=False, help='comma separated name=value pairs') 272 | 273 | args = parser.parse_args() 274 | hparams = create_hparams(args.hparams) 275 | 276 | torch.backends.cudnn.enabled = hparams.cudnn_enabled 277 | torch.backends.cudnn.benchmark = hparams.cudnn_benchmark 278 | 279 | print("FP16 Run:", hparams.fp16_run) 280 | print("Dynamic Loss Scaling:", hparams.dynamic_loss_scaling) 281 | print("Distributed Run:", hparams.distributed_run) 282 | print("cuDNN Enabled:", hparams.cudnn_enabled) 283 | print("cuDNN Benchmark:", hparams.cudnn_benchmark) 284 | 285 | 
train(args.output_directory, args.log_directory, args.checkpoint_path, 286 | args.warm_start, args.n_gpus, args.rank, args.group_name, hparams) 287 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.io.wavfile import read 3 | import torch 4 | 5 | 6 | def get_mask_from_lengths(lengths): 7 | max_len = torch.max(lengths).item() 8 | ids = torch.arange(0, max_len, out=torch.cuda.LongTensor(max_len)) 9 | mask = (ids < lengths.unsqueeze(1)).bool() 10 | return mask 11 | 12 | 13 | def load_wav_to_torch(full_path): 14 | sampling_rate, data = read(full_path) 15 | return torch.FloatTensor(data.astype(np.float32)), sampling_rate 16 | 17 | 18 | def load_filepaths_and_text(filename, split="|"): 19 | with open(filename, encoding='utf-8') as f: 20 | filepaths_and_text = [line.strip().split(split) for line in f] 21 | return filepaths_and_text 22 | 23 | 24 | def files_to_list(filename): 25 | """ 26 | Takes a text file of filenames and makes a list of filenames 27 | """ 28 | with open(filename, encoding='utf-8') as f: 29 | files = f.readlines() 30 | 31 | files = [f.rstrip() for f in files] 32 | return files 33 | 34 | 35 | def to_gpu(x): 36 | x = x.contiguous() 37 | 38 | if torch.cuda.is_available(): 39 | x = x.cuda(non_blocking=True) 40 | return torch.autograd.Variable(x) 41 | -------------------------------------------------------------------------------- /yin.py: -------------------------------------------------------------------------------- 1 | # adapted from https://github.com/patriceguyot/Yin 2 | 3 | import numpy as np 4 | 5 | 6 | def differenceFunction(x, N, tau_max): 7 | """ 8 | Compute difference function of data x. This corresponds to equation (6) in [1] 9 | This solution is implemented directly with Numpy fft. 10 | 11 | 12 | :param x: audio data 13 | :param N: length of data 14 | :param tau_max: integration window size 15 | :return: difference function 16 | :rtype: list 17 | """ 18 | 19 | x = np.array(x, np.float64) 20 | w = x.size 21 | tau_max = min(tau_max, w) 22 | x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum())) 23 | size = w + tau_max 24 | p2 = (size // 32).bit_length() 25 | nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32) 26 | size_pad = min(x * 2 ** p2 for x in nice_numbers if x * 2 ** p2 >= size) 27 | fc = np.fft.rfft(x, size_pad) 28 | conv = np.fft.irfft(fc * fc.conjugate())[:tau_max] 29 | return x_cumsum[w:w - tau_max:-1] + x_cumsum[w] - x_cumsum[:tau_max] - 2 * conv 30 | 31 | 32 | def cumulativeMeanNormalizedDifferenceFunction(df, N): 33 | """ 34 | Compute cumulative mean normalized difference function (CMND). 35 | 36 | This corresponds to equation (8) in [1] 37 | 38 | :param df: Difference function 39 | :param N: length of data 40 | :return: cumulative mean normalized difference function 41 | :rtype: list 42 | """ 43 | 44 | cmndf = df[1:] * range(1, N) / np.cumsum(df[1:]).astype(float) #scipy method 45 | return np.insert(cmndf, 0, 1) 46 | 47 | 48 | def getPitch(cmdf, tau_min, tau_max, harmo_th=0.1): 49 | """ 50 | Return fundamental period of a frame based on CMND function. 
51 | 
52 |     :param cmdf: Cumulative Mean Normalized Difference function
53 |     :param tau_min: minimum period for speech
54 |     :param tau_max: maximum period for speech
55 |     :param harmo_th: harmonicity threshold to determine if it is necessary to compute pitch frequency
56 |     :return: fundamental period if there are values under threshold, 0 otherwise
57 |     :rtype: float
58 |     """
59 |     tau = tau_min
60 |     while tau < tau_max:
61 |         if cmdf[tau] < harmo_th:
62 |             while tau + 1 < tau_max and cmdf[tau + 1] < cmdf[tau]:
63 |                 tau += 1
64 |             return tau
65 |         tau += 1
66 | 
67 |     return 0    # if unvoiced
68 | 
69 | 
70 | def compute_yin(sig, sr, w_len=512, w_step=256, f0_min=100, f0_max=500,
71 |                 harmo_thresh=0.1):
72 |     """
73 | 
74 |     Compute the Yin Algorithm. Return fundamental frequency and harmonic rate.
75 | 
76 |     :param sig: Audio signal (list of float)
77 |     :param sr: sampling rate (int)
78 |     :param w_len: size of the analysis window (samples)
79 |     :param w_step: size of the lag between two consecutive windows (samples)
80 |     :param f0_min: Minimum fundamental frequency that can be detected (hertz)
81 |     :param f0_max: Maximum fundamental frequency that can be detected (hertz)
82 |     :param harmo_thresh: Threshold of detection. The algorithm returns the first minimum of the CMND function below this threshold.
83 | 
84 |     :returns:
85 | 
86 |     * pitches: list of fundamental frequencies,
87 |     * harmonic_rates: list of harmonic rate values for each fundamental frequency value (= confidence value)
88 |     * argmins: minimums of the Cumulative Mean Normalized Difference Function
89 |     * times: list of the time of each estimation
90 |     :rtype: tuple
91 |     """
92 | 
93 |     tau_min = int(sr / f0_max)
94 |     tau_max = int(sr / f0_min)
95 | 
96 |     timeScale = range(0, len(sig) - w_len, w_step)  # time values for each analysis window
97 |     times = [t/float(sr) for t in timeScale]
98 |     frames = [sig[t:t + w_len] for t in timeScale]
99 | 
100 |     pitches = [0.0] * len(timeScale)
101 |     harmonic_rates = [0.0] * len(timeScale)
102 |     argmins = [0.0] * len(timeScale)
103 | 
104 |     for i, frame in enumerate(frames):
105 |         # Compute YIN
106 |         df = differenceFunction(frame, w_len, tau_max)
107 |         cmdf = cumulativeMeanNormalizedDifferenceFunction(df, tau_max)
108 |         p = getPitch(cmdf, tau_min, tau_max, harmo_thresh)
109 | 
110 |         # Get results
111 |         if np.argmin(cmdf) > tau_min:
112 |             argmins[i] = float(sr / np.argmin(cmdf))
113 |         if p != 0:  # A pitch was found
114 |             pitches[i] = float(sr / p)
115 |             harmonic_rates[i] = cmdf[p]
116 |         else:  # No pitch, but we compute a value of the harmonic rate
117 |             harmonic_rates[i] = min(cmdf)
118 | 
119 |     return pitches, harmonic_rates, argmins, times
120 | 
--------------------------------------------------------------------------------
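yin.py supplies the frame-level f0 estimates from which Mellotron's continuous pitch contours are built. As a quick sanity check, the sketch below (not part of the repository) shows how compute_yin can be combined with load_wav_to_torch from utils.py to extract a pitch track from one of the bundled example files; the peak normalization and the window, hop, f0 range and threshold values are illustrative choices, not values prescribed by the repo.

import numpy as np

from utils import load_wav_to_torch
from yin import compute_yin

# Load one of the example files shipped in data/ and peak-normalize to [-1, 1]
# (assumed preprocessing for this sketch).
audio, sampling_rate = load_wav_to_torch('data/example1.wav')
signal = audio.numpy()
signal = signal / np.max(np.abs(signal))

# Frame-level f0 estimates; unvoiced frames come back as 0 Hz.
pitches, harmonic_rates, argmins, times = compute_yin(
    signal, sampling_rate, w_len=1024, w_step=256,
    f0_min=80, f0_max=880, harmo_thresh=0.25)

# Keep only voiced frames, e.g. for plotting or further processing.
voiced = [(t, f0) for t, f0 in zip(times, pitches) if f0 > 0]
print("voiced frames: {}/{}".format(len(voiced), len(times)))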