├── .gitignore
├── README.md
├── attach_memory_bank.py
├── configs
│   ├── ljs.json
│   └── ljs_reproduce.json
├── durations
│   └── durations.tar.bz2
├── filelists
│   ├── ljs_audio_text_test_filelist.txt
│   ├── ljs_audio_text_test_filelist.txt.cleaned
│   ├── ljs_audio_text_train_filelist.txt
│   ├── ljs_audio_text_train_filelist.txt.cleaned
│   ├── ljs_audio_text_val_filelist.txt
│   └── ljs_audio_text_val_filelist.txt.cleaned
├── models
│   ├── __init__.py
│   ├── attentions.py
│   ├── losses.py
│   ├── models.py
│   ├── modules.py
│   └── soft_dtw.py
├── naturalspeech_training.ipynb
├── preprocess_durations.py
├── preprocess_texts.py
├── requirements.txt
├── resources
│   └── figure1.png
├── text
│   ├── LICENSE
│   ├── __init__.py
│   ├── cleaners.py
│   └── symbols.py
├── train.py
└── utils
    ├── __init__.py
    ├── commons.py
    ├── data_utils.py
    ├── mel_processing.py
    └── utils.py

/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | .ipynb_checkpoints
3 | logs
4 | DUMMY1
5 | nohup.out
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
2 | 
3 | This is an implementation of Microsoft's [NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality](https://arxiv.org/abs/2205.04421) in PyTorch.
4 | 
5 | Contributions and pull requests are highly appreciated!
6 | 
7 | 23.02.09: Demo samples (using the first 1800 epochs) are out. [(link)](https://github.com/heatz123/naturalspeech/wiki)
8 | 
9 | 
10 | ### Overview
11 | 
12 | ![figure1](resources/figure1.png)
13 | 
14 | NaturalSpeech is a VAE-based model that employs several techniques to improve the prior and simplify the posterior. It differs from VITS in several ways, including:
15 | - **Phoneme pre-training**: NaturalSpeech uses a phoneme encoder pre-trained on a large text corpus through masked language modeling on phoneme sequences.
16 | - **Differentiable durator**: The posterior operates at the frame level, while the prior operates at the phoneme level. NaturalSpeech uses a differentiable durator to bridge the length difference, expanding the prior with soft and flexible duration features.
17 | - **Bidirectional Prior/Posterior**: NaturalSpeech reduces the posterior and enhances the prior through a normalizing flow, which maps in both directions with forward and backward losses.
18 | - **Memory-based VAE**: The prior is further enhanced through Q-K-V attention over a learned memory bank.
19 | 
20 | 
21 | ### Notes
22 | - This implementation does not include pre-training of phonemes on a large-scale text corpus (the news-crawl dataset).
23 | - The multiplier for each loss term can be adjusted in the configuration file. Training with unweighted losses may not converge.
24 | - The tuning stage for the last 2k epochs has been omitted.
25 | - Due to the high VRAM usage of the soft-DTW loss, there is an option to use a non-soft-DTW loss for memory efficiency.
26 | - For the soft-DTW loss, the warp factor has been set to 134.4 (0.07 * 192) to match the non-soft-DTW loss, instead of 0.07.
27 | - To train the duration predictor in the warm-up stage, duration labels are required. The paper suggests using any tool to provide the duration labels. In this implementation, a pre-trained VITS model was used.
28 | - To further improve memory efficiency during training, randomly sliced sequences are fed to the decoder, as in the VITS model.
29 | 
30 | 
31 | 
32 | 
33 | ### How to train
34 | 
35 | 0. 
36 | ```
37 | # python >= 3.6
38 | pip install -r requirements.txt
39 | ```
40 | 
41 | 1. clone this repository
42 | 1. download **The LJ Speech Dataset**: [link](https://keithito.com/LJ-Speech-Dataset/)
43 | 1. create a symbolic link to the LJSpeech dataset:
44 | ```
45 | ln -s /path/to/LJSpeech-1.1/wavs/ DUMMY1
46 | ```
47 | 1. text preprocessing (optional; needed if you are using a custom dataset):
48 |    1. `apt-get install espeak`
49 |    2. 
50 | ```
51 | python preprocess_texts.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt
52 | ```
53 | 
54 | 1. duration preprocessing (obtain duration labels using pretrained VITS):
55 |    > If you want to skip this section, extract `durations/durations.tar.bz2` and overwrite the `durations` folder.
56 |    1. `git clone https://github.com/jaywalnut310/vits.git; cd vits`
57 |    2. create a symbolic link to the LJSpeech dataset:
58 | ```
59 | ln -s /path/to/LJSpeech-1.1/wavs/ DUMMY1
60 | ```
61 |    3. download the pretrained VITS model from the official VITS GitHub: [github link](https://github.com/jaywalnut310/vits) / [pretrained models](https://drive.google.com/drive/folders/1ksarh-cJf3F5eKJjLVWY0X1j1qsQqiS2)
62 |    4. set up monotonic alignment search (for VITS inference):
63 | ```
64 | cd monotonic_align; mkdir monotonic_align; python setup.py build_ext --inplace; cd ..
65 | ```
66 |    5. copy the duration preprocessing script to the VITS repo: `cp /path/to/naturalspeech/preprocess_durations.py .`
67 |    6. 
68 | ```
69 | python3 preprocess_durations.py --weights_path ./pretrained_ljs.pth --filelists filelists/ljs_audio_text_train_filelist.txt.cleaned filelists/ljs_audio_text_val_filelist.txt.cleaned filelists/ljs_audio_text_test_filelist.txt.cleaned
70 | ```
71 |    7. once the duration labels are created, copy them to the naturalspeech repo: `cp -r durations/ /path/to/naturalspeech`
72 | 
73 | 1. train (warmup)
74 | ```
75 | python3 train.py -c configs/ljs.json -m [run_name] --warmup
76 | ```
77 | Note that `ljs.json` is for low-resource training, which runs for 1500 epochs and does not use the soft-DTW loss. If you want to reproduce the steps stated in the paper, use `ljs_reproduce.json`, which runs for 15000 epochs and uses the soft-DTW loss.
78 | 
79 | 1. initialize and attach the memory bank after warmup (see the sketch below for the intuition behind this step):
80 | ```
81 | python3 attach_memory_bank.py -c configs/ljs.json --weights_path logs/[run_name]/G_xxx.pth
82 | ```
83 | If you run out of memory, you can specify the `--num_samples` argument to use only a subset of samples.
84 | 
85 | 1. train (resume)
86 | ```
87 | python3 train.py -c configs/ljs.json -m [run_name]
88 | ```
89 | 
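For intuition, the memory bank attached in step 7 performs Q-K-V attention from each latent frame over a learned codebook of vectors, initialized from the k-means centers computed by `attach_memory_bank.py`. Below is a minimal, hypothetical sketch of this mechanism. It is shape-compatible with the `models.memory_bank` section of `configs/ljs.json`, but it is not the repo's actual `VAEMemoryBank` implementation in `models/models.py`:

```python
# Hypothetical sketch of memory-bank Q-K-V attention (not the repo's implementation).
import torch
import torch.nn as nn

class MemoryBankSketch(nn.Module):
    def __init__(self, bank_size=1000, n_hidden_dims=192, n_attn_heads=2):
        super().__init__()
        # learnable memory of shape [hidden, bank_size]; in the repo this is
        # initialized from k-means cluster centers of the posterior latents
        self.bank = nn.Parameter(torch.randn(n_hidden_dims, bank_size))
        self.attn = nn.MultiheadAttention(n_hidden_dims, n_attn_heads, batch_first=True)

    def forward(self, z):  # z: [batch, hidden, time]
        q = z.transpose(1, 2)  # queries: one per latent frame, [batch, time, hidden]
        kv = self.bank.t().unsqueeze(0).expand(z.size(0), -1, -1)  # [batch, bank, hidden]
        out, _ = self.attn(q, kv, kv)  # each frame attends over the whole bank
        return out.transpose(1, 2)  # back to [batch, hidden, time]

print(MemoryBankSketch()(torch.randn(2, 192, 12)).shape)  # torch.Size([2, 192, 12])
```

This matches the shape test at the bottom of `attach_memory_bank.py`, which feeds a `[2, 192, 12]` tensor through the attached bank.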
90 | You can use TensorBoard to monitor training:
91 | ```
92 | tensorboard --logdir /path/to/naturalspeech/logs
93 | ```
94 | 
95 | During each evaluation phase, a selection of samples from the test set is evaluated and saved in the `logs/[run_name]/eval` directory.
96 | 
97 | 
98 | 
99 | ## References
100 | - [VITS implementation](https://github.com/jaywalnut310/vits) by @jaywalnut310 for the normalizing flows, phoneme encoder, and HiFi-GAN decoder implementation
101 | - [Parallel Tacotron 2 implementation](https://github.com/keonlee9420/Parallel-Tacotron2) by @keonlee9420 for the learnable upsampling layer
102 | - [soft-dtw implementation](https://github.com/Maghoumi/pytorch-softdtw-cuda) by @Maghoumi for the soft-DTW loss
103 | 
--------------------------------------------------------------------------------
/attach_memory_bank.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | from pathlib import Path
4 | 
5 | import numpy as np
6 | import torch
7 | from torch.cuda.amp import autocast
8 | from torch.utils.data import DataLoader
9 | 
10 | from text.symbols import symbols
11 | from models.models import SynthesizerTrn
12 | from models.models import VAEMemoryBank
13 | from utils import utils
14 | 
15 | from utils.data_utils import (
16 |     TextAudioLoaderWithDuration,
17 |     TextAudioCollateWithDuration,
18 | )
19 | 
20 | from sklearn.cluster import KMeans
21 | 
22 | 
23 | def load_net_g(hps, weights_path):
24 |     net_g = SynthesizerTrn(
25 |         len(symbols),
26 |         hps.data.filter_length // 2 + 1,
27 |         hps.train.segment_size // hps.data.hop_length,
28 |         hps.models,
29 |     ).cuda()
30 | 
31 |     optim_g = torch.optim.AdamW(
32 |         net_g.parameters(),
33 |         hps.train.learning_rate,
34 |         betas=hps.train.betas,
35 |         eps=hps.train.eps,
36 |     )
37 | 
38 |     def load_checkpoint(checkpoint_path, model, optimizer=None):
39 |         assert os.path.isfile(checkpoint_path)
40 |         checkpoint_dict = torch.load(checkpoint_path, map_location="cpu")
41 |         iteration = checkpoint_dict["iteration"]
42 |         learning_rate = checkpoint_dict["learning_rate"]
43 | 
44 |         if optimizer is not None:
45 |             optimizer.load_state_dict(checkpoint_dict["optimizer"])
46 |         saved_state_dict = checkpoint_dict["model"]
47 | 
48 |         state_dict = model.state_dict()
49 |         new_state_dict = {}
50 |         for k, v in state_dict.items():
51 |             try:
52 |                 new_state_dict[k] = saved_state_dict[k]
53 |             except KeyError:
54 |                 print("%s is not in the checkpoint" % k)
55 |                 new_state_dict[k] = v
56 |         model.load_state_dict(new_state_dict)
57 | 
58 |         print(
59 |             "Loaded checkpoint '{}' (iteration {})".format(checkpoint_path, iteration)
60 |         )
61 |         return model, optimizer, learning_rate, iteration
62 | 
63 |     model, optimizer, learning_rate, iteration = load_checkpoint(
64 |         weights_path, net_g, optim_g
65 |     )
66 | 
67 |     return model, optimizer, learning_rate, iteration
68 | 
69 | 
70 | def get_dataloader(hps):
71 |     train_dataset = TextAudioLoaderWithDuration(hps.data.training_files, hps.data)
72 |     collate_fn = TextAudioCollateWithDuration()
73 |     train_loader = DataLoader(
74 |         train_dataset,
75 |         num_workers=1,
76 |         shuffle=False,
77 |         pin_memory=False,
78 |         collate_fn=collate_fn,
79 |         batch_size=1,
80 |     )
81 |     return train_loader
82 | 
83 | 
84 | def get_zs(net_g, dataloader, num_samples=0):  # uses the module-level `hps` set in __main__
85 |     net_g.eval()
86 |     print(len(dataloader))
87 |     zs = []
88 |     with torch.no_grad():
89 |         for batch_idx, (
90 |             x,
91 |             x_lengths,
92 |             spec,
93 |             spec_lengths,
94 |             y,
95 |             y_lengths,
96 |             duration,
97 |         ) in enumerate(dataloader):
98 |             rank = 0
99 |             x, x_lengths = x.cuda(rank, non_blocking=True), x_lengths.cuda(
100 |                 rank,
non_blocking=True 101 | ) 102 | spec, spec_lengths = spec.cuda(rank, non_blocking=True), spec_lengths.cuda( 103 | rank, non_blocking=True 104 | ) 105 | y, y_lengths = y.cuda(rank, non_blocking=True), y_lengths.cuda( 106 | rank, non_blocking=True 107 | ) 108 | duration = duration.cuda() 109 | with autocast(enabled=hps.train.fp16_run): 110 | ( 111 | y_hat, 112 | l_length, 113 | ids_slice, 114 | x_mask, 115 | z_mask, 116 | (z, z_p, m_p, logs_p, m_q, logs_q, p_mask), 117 | *_, 118 | ) = net_g(x, x_lengths, spec, spec_lengths, duration) 119 | 120 | zs.append(z.squeeze(0).cpu()) 121 | if batch_idx % 100 == 99: 122 | print(batch_idx, zs[batch_idx].shape) 123 | 124 | if num_samples and batch_idx >= num_samples: 125 | break 126 | return zs 127 | 128 | 129 | def k_means(zs): 130 | X = torch.cat(zs, dim=1).transpose(0, 1).numpy() 131 | print(X.shape) 132 | kmeans = KMeans(n_clusters=1000, random_state=0, n_init="auto").fit(X) 133 | print(kmeans.cluster_centers_.shape) 134 | 135 | return kmeans.cluster_centers_ 136 | 137 | 138 | def save_memory_bank(bank): 139 | state_dict = bank.state_dict() 140 | torch.save(state_dict, "./bank_init.pth") 141 | 142 | 143 | def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path): 144 | state_dict = model.state_dict() 145 | torch.save( 146 | { 147 | "model": state_dict, 148 | "iteration": iteration, 149 | "optimizer": optimizer.state_dict(), 150 | "learning_rate": learning_rate, 151 | }, 152 | checkpoint_path, 153 | ) 154 | print("Saving model to " + checkpoint_path) 155 | 156 | 157 | if __name__ == "__main__": 158 | parser = argparse.ArgumentParser() 159 | parser.add_argument("-c", "--config", type=str, default="configs/ljs.json") 160 | parser.add_argument("--weights_path", type=str) 161 | parser.add_argument( 162 | "--num_samples", 163 | type=int, 164 | default=0, 165 | help="samples to use for k-means clustering, 0 for use all samples in dataset", 166 | ) 167 | args = parser.parse_args() 168 | 169 | hps = utils.get_hparams_from_file(args.config) 170 | net_g, optimizer, lr, iterations = load_net_g(hps, weights_path=args.weights_path) 171 | 172 | dataloader = get_dataloader(hps) 173 | zs = get_zs(net_g, dataloader, num_samples=args.num_samples) 174 | centers = k_means(zs) 175 | 176 | memory_bank = VAEMemoryBank( 177 | **hps.models.memory_bank, 178 | init_values=torch.from_numpy(centers).cuda().transpose(0, 1) 179 | ) 180 | save_memory_bank(memory_bank) 181 | 182 | net_g.memory_bank = memory_bank 183 | optimizer.add_param_group( 184 | { 185 | "params": list(memory_bank.parameters()), 186 | "initial_lr": optimizer.param_groups[0]["initial_lr"], 187 | } 188 | ) 189 | 190 | p = Path(args.weights_path) 191 | save_path = p.with_stem(p.stem + "_with_memory").__str__() 192 | save_checkpoint(net_g, optimizer, lr, iterations, save_path) 193 | 194 | # test 195 | print(memory_bank(torch.randn((2, 192, 12))).shape) 196 | -------------------------------------------------------------------------------- /configs/ljs.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 500, 4 | "eval_interval": 5, 5 | "seed": 1234, 6 | "epochs": 1500, 7 | "learning_rate": 2e-4, 8 | "betas": [0.8, 0.99], 9 | "eps": 1e-9, 10 | "batch_size": 16, 11 | "fp16_run": true, 12 | "lr_decay": 0.999, 13 | "segment_size": 8192, 14 | "init_lr_ratio": 1, 15 | "warmup_epochs": 200, 16 | "c_mel": 45, 17 | "c_kl": 1.0, 18 | "c_kl_fwd": 0.001, 19 | "c_e2e": 0.1, 20 | "c_dur": 5.0, 21 | "use_sdtw": false, 22 | 
"use_gt_duration": true 23 | }, 24 | "data": { 25 | "training_files":"filelists/ljs_audio_text_train_filelist.txt.cleaned", 26 | "validation_files":"filelists/ljs_audio_text_val_filelist.txt.cleaned", 27 | "text_cleaners":["english_cleaners2"], 28 | "max_wav_value": 32768.0, 29 | "sampling_rate": 22050, 30 | "filter_length": 1024, 31 | "hop_length": 256, 32 | "win_length": 1024, 33 | "n_mel_channels": 80, 34 | "mel_fmin": 0.0, 35 | "mel_fmax": null, 36 | "add_blank": true, 37 | "n_speakers": 0, 38 | "cleaned_text": true 39 | }, 40 | "models": { 41 | "phoneme_encoder": { 42 | "out_channels": 192, 43 | "hidden_channels": 192, 44 | "filter_channels": 768, 45 | "n_heads": 2, 46 | "n_layers": 6, 47 | "kernel_size": 3, 48 | "p_dropout": 0.1 49 | }, 50 | "decoder": { 51 | "initial_channel": 192, 52 | "resblock": "1", 53 | "resblock_kernel_sizes": [3,7,11], 54 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 55 | "upsample_rates": [8,8,2,2], 56 | "upsample_initial_channel": 256, 57 | "upsample_kernel_sizes": [16,16,4,4], 58 | "gin_channels": 0 59 | }, 60 | "posterior_encoder": { 61 | "out_channels": 192, 62 | "hidden_channels": 192, 63 | "kernel_size": 5, 64 | "dilation_rate": 1, 65 | "n_layers": 16 66 | }, 67 | "flow": { 68 | "channels": 192, 69 | "hidden_channels": 192, 70 | "kernel_size": 5, 71 | "dilation_rate": 1, 72 | "n_layers": 4 73 | }, 74 | "duration_predictor": { 75 | "in_channels": 192, 76 | "filter_channels": 256, 77 | "kernel_size": 3, 78 | "p_dropout": 0.5 79 | }, 80 | "learnable_upsampling": { 81 | "d_predictor": 192, 82 | "kernel_size": 3, 83 | "dropout": 0.0, 84 | "conv_output_size": 8, 85 | "dim_w": 4, 86 | "dim_c": 2, 87 | "max_seq_len": 1000 88 | }, 89 | "memory_bank": { 90 | "bank_size": 1000, 91 | "n_hidden_dims": 192, 92 | "n_attn_heads": 2 93 | } 94 | } 95 | } 96 | -------------------------------------------------------------------------------- /configs/ljs_reproduce.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 1, 4 | "eval_interval": 1, 5 | "seed": 1234, 6 | "epochs": 15000, 7 | "learning_rate": 2e-4, 8 | "betas": [0.8, 0.99], 9 | "eps": 1e-9, 10 | "batch_size": 16, 11 | "fp16_run": true, 12 | "lr_decay": 0.999875, 13 | "segment_size": 8192, 14 | "init_lr_ratio": 1, 15 | "warmup_epochs": 200, 16 | "c_mel": 45, 17 | "c_kl": 1.0, 18 | "c_kl_fwd": 0.02, 19 | "c_e2e": 0.1, 20 | "c_dur": 5.0, 21 | "use_sdtw": true, 22 | "use_gt_duration": false 23 | }, 24 | "data": { 25 | "training_files":"filelists/ljs_audio_text_train_filelist.txt.cleaned", 26 | "validation_files":"filelists/ljs_audio_text_val_filelist.txt.cleaned", 27 | "text_cleaners":["english_cleaners2"], 28 | "max_wav_value": 32768.0, 29 | "sampling_rate": 22050, 30 | "filter_length": 1024, 31 | "hop_length": 256, 32 | "win_length": 1024, 33 | "n_mel_channels": 80, 34 | "mel_fmin": 0.0, 35 | "mel_fmax": null, 36 | "add_blank": true, 37 | "n_speakers": 0, 38 | "cleaned_text": true 39 | }, 40 | "models": { 41 | "phoneme_encoder": { 42 | "out_channels": 192, 43 | "hidden_channels": 192, 44 | "filter_channels": 768, 45 | "n_heads": 2, 46 | "n_layers": 6, 47 | "kernel_size": 3, 48 | "p_dropout": 0.1 49 | }, 50 | "decoder": { 51 | "initial_channel": 192, 52 | "resblock": "1", 53 | "resblock_kernel_sizes": [3,7,11], 54 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 55 | "upsample_rates": [8,8,2,2], 56 | "upsample_initial_channel": 256, 57 | "upsample_kernel_sizes": [16,16,4,4], 58 | "gin_channels": 0 59 | }, 60 | 
"posterior_encoder": { 61 | "out_channels": 192, 62 | "hidden_channels": 192, 63 | "kernel_size": 5, 64 | "dilation_rate": 1, 65 | "n_layers": 16 66 | }, 67 | "flow": { 68 | "channels": 192, 69 | "hidden_channels": 192, 70 | "kernel_size": 5, 71 | "dilation_rate": 1, 72 | "n_layers": 4 73 | }, 74 | "duration_predictor": { 75 | "in_channels": 192, 76 | "filter_channels": 256, 77 | "kernel_size": 3, 78 | "p_dropout": 0.5 79 | }, 80 | "learnable_upsampling": { 81 | "d_predictor": 192, 82 | "kernel_size": 3, 83 | "dropout": 0.0, 84 | "conv_output_size": 8, 85 | "dim_w": 4, 86 | "dim_c": 2, 87 | "max_seq_len": 1000 88 | }, 89 | "memory_bank": { 90 | "bank_size": 1000, 91 | "n_hidden_dims": 192, 92 | "n_attn_heads": 2 93 | } 94 | } 95 | } 96 | -------------------------------------------------------------------------------- /durations/durations.tar.bz2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/heatz123/naturalspeech/ca2f814d960149a8fdc3f3e4d773abb67e7c18ec/durations/durations.tar.bz2 -------------------------------------------------------------------------------- /filelists/ljs_audio_text_val_filelist.txt: -------------------------------------------------------------------------------- 1 | DUMMY1/LJ022-0023.wav|The overwhelming majority of people in this country know how to sift the wheat from the chaff in what they hear and what they read. 2 | DUMMY1/LJ043-0030.wav|If somebody did that to me, a lousy trick like that, to take my wife away, and all the furniture, I would be mad as hell, too. 3 | DUMMY1/LJ005-0201.wav|as is shown by the report of the Commissioners to inquire into the state of the municipal corporations in eighteen thirty-five. 4 | DUMMY1/LJ001-0110.wav|Even the Caslon type when enlarged shows great shortcomings in this respect: 5 | DUMMY1/LJ003-0345.wav|All the committee could do in this respect was to throw the responsibility on others. 6 | DUMMY1/LJ007-0154.wav|These pungent and well-grounded strictures applied with still greater force to the unconvicted prisoner, the man who came to the prison innocent, and still uncontaminated, 7 | DUMMY1/LJ018-0098.wav|and recognized as one of the frequenters of the bogus law-stationers. His arrest led to that of others. 8 | DUMMY1/LJ047-0044.wav|Oswald was, however, willing to discuss his contacts with Soviet authorities. He denied having any involvement with Soviet intelligence agencies 9 | DUMMY1/LJ031-0038.wav|The first physician to see the President at Parkland Hospital was Dr. Charles J. Carrico, a resident in general surgery. 10 | DUMMY1/LJ048-0194.wav|during the morning of November twenty-two prior to the motorcade. 11 | DUMMY1/LJ049-0026.wav|On occasion the Secret Service has been permitted to have an agent riding in the passenger compartment with the President. 12 | DUMMY1/LJ004-0152.wav|although at Mr. Buxton's visit a new jail was in process of erection, the first step towards reform since Howard's visitation in seventeen seventy-four. 13 | DUMMY1/LJ008-0278.wav|or theirs might be one of many, and it might be considered necessary to "make an example." 14 | DUMMY1/LJ043-0002.wav|The Warren Commission Report. By The President's Commission on the Assassination of President Kennedy. Chapter seven. Lee Harvey Oswald: 15 | DUMMY1/LJ009-0114.wav|Mr. Wakefield winds up his graphic but somewhat sensational account by describing another religious service, which may appropriately be inserted here. 
16 | DUMMY1/LJ028-0506.wav|A modern artist would have difficulty in doing such accurate work. 17 | DUMMY1/LJ050-0168.wav|with the particular purposes of the agency involved. The Commission recognizes that this is a controversial area 18 | DUMMY1/LJ039-0223.wav|Oswald's Marine training in marksmanship, his other rifle experience and his established familiarity with this particular weapon 19 | DUMMY1/LJ029-0032.wav|According to O'Donnell, quote, we had a motorcade wherever we went, end quote. 20 | DUMMY1/LJ031-0070.wav|Dr. Clark, who most closely observed the head wound, 21 | DUMMY1/LJ034-0198.wav|Euins, who was on the southwest corner of Elm and Houston Streets testified that he could not describe the man he saw in the window. 22 | DUMMY1/LJ026-0068.wav|Energy enters the plant, to a small extent, 23 | DUMMY1/LJ039-0075.wav|once you know that you must put the crosshairs on the target and that is all that is necessary. 24 | DUMMY1/LJ004-0096.wav|the fatal consequences whereof might be prevented if the justices of the peace were duly authorized 25 | DUMMY1/LJ005-0014.wav|Speaking on a debate on prison matters, he declared that 26 | DUMMY1/LJ012-0161.wav|he was reported to have fallen away to a shadow. 27 | DUMMY1/LJ018-0239.wav|His disappearance gave color and substance to evil reports already in circulation that the will and conveyance above referred to 28 | DUMMY1/LJ019-0257.wav|Here the tread-wheel was in use, there cellular cranks, or hard-labor machines. 29 | DUMMY1/LJ028-0008.wav|you tap gently with your heel upon the shoulder of the dromedary to urge her on. 30 | DUMMY1/LJ024-0083.wav|This plan of mine is no attack on the Court; 31 | DUMMY1/LJ042-0129.wav|No night clubs or bowling alleys, no places of recreation except the trade union dances. I have had enough. 32 | DUMMY1/LJ036-0103.wav|The police asked him whether he could pick out his passenger from the lineup. 33 | DUMMY1/LJ046-0058.wav|During his Presidency, Franklin D. Roosevelt made almost four hundred journeys and traveled more than three hundred fifty thousand miles. 34 | DUMMY1/LJ014-0076.wav|He was seen afterwards smoking and talking with his hosts in their back parlor, and never seen again alive. 35 | DUMMY1/LJ002-0043.wav|long narrow rooms -- one thirty-six feet, six twenty-three feet, and the eighth eighteen, 36 | DUMMY1/LJ009-0076.wav|We come to the sermon. 37 | DUMMY1/LJ017-0131.wav|even when the high sheriff had told him there was no possibility of a reprieve, and within a few hours of execution. 38 | DUMMY1/LJ046-0184.wav|but there is a system for the immediate notification of the Secret Service by the confining institution when a subject is released or escapes. 39 | DUMMY1/LJ014-0263.wav|When other pleasures palled he took a theatre, and posed as a munificent patron of the dramatic art. 40 | DUMMY1/LJ042-0096.wav|(old exchange rate) in addition to his factory salary of approximately equal amount 41 | DUMMY1/LJ049-0050.wav|Hill had both feet on the car and was climbing aboard to assist President and Mrs. Kennedy. 42 | DUMMY1/LJ019-0186.wav|seeing that since the establishment of the Central Criminal Court, Newgate received prisoners for trial from several counties, 43 | DUMMY1/LJ028-0307.wav|then let twenty days pass, and at the end of that time station near the Chaldasan gates a body of four thousand. 44 | DUMMY1/LJ012-0235.wav|While they were in a state of insensibility the murder was committed. 
45 | DUMMY1/LJ034-0053.wav|reached the same conclusion as Latona that the prints found on the cartons were those of Lee Harvey Oswald. 46 | DUMMY1/LJ014-0030.wav|These were damnatory facts which well supported the prosecution. 47 | DUMMY1/LJ015-0203.wav|but were the precautions too minute, the vigilance too close to be eluded or overcome? 48 | DUMMY1/LJ028-0093.wav|but his scribe wrote it in the manner customary for the scribes of those days to write of their royal masters. 49 | DUMMY1/LJ002-0018.wav|The inadequacy of the jail was noticed and reported upon again and again by the grand juries of the city of London, 50 | DUMMY1/LJ028-0275.wav|At last, in the twentieth month, 51 | DUMMY1/LJ012-0042.wav|which he kept concealed in a hiding-place with a trap-door just under his bed. 52 | DUMMY1/LJ011-0096.wav|He married a lady also belonging to the Society of Friends, who brought him a large fortune, which, and his own money, he put into a city firm, 53 | DUMMY1/LJ036-0077.wav|Roger D. Craig, a deputy sheriff of Dallas County, 54 | DUMMY1/LJ016-0318.wav|Other officials, great lawyers, governors of prisons, and chaplains supported this view. 55 | DUMMY1/LJ013-0164.wav|who came from his room ready dressed, a suspicious circumstance, as he was always late in the morning. 56 | DUMMY1/LJ027-0141.wav|is closely reproduced in the life-history of existing deer. Or, in other words, 57 | DUMMY1/LJ028-0335.wav|accordingly they committed to him the command of their whole army, and put the keys of their city into his hands. 58 | DUMMY1/LJ031-0202.wav|Mrs. Kennedy chose the hospital in Bethesda for the autopsy because the President had served in the Navy. 59 | DUMMY1/LJ021-0145.wav|From those willing to join in establishing this hoped-for period of peace, 60 | DUMMY1/LJ016-0288.wav|"Müller, Müller, He's the man," till a diversion was created by the appearance of the gallows, which was received with continuous yells. 61 | DUMMY1/LJ028-0081.wav|Years later, when the archaeologists could readily distinguish the false from the true, 62 | DUMMY1/LJ018-0081.wav|his defense being that he had intended to commit suicide, but that, on the appearance of this officer who had wronged him, 63 | DUMMY1/LJ021-0066.wav|together with a great increase in the payrolls, there has come a substantial rise in the total of industrial profits 64 | DUMMY1/LJ009-0238.wav|After this the sheriffs sent for another rope, but the spectators interfered, and the man was carried back to jail. 65 | DUMMY1/LJ005-0079.wav|and improve the morals of the prisoners, and shall insure the proper measure of punishment to convicted offenders. 66 | DUMMY1/LJ035-0019.wav|drove to the northwest corner of Elm and Houston, and parked approximately ten feet from the traffic signal. 67 | DUMMY1/LJ036-0174.wav|This is the approximate time he entered the roominghouse, according to Earlene Roberts, the housekeeper there. 68 | DUMMY1/LJ046-0146.wav|The criteria in effect prior to November twenty-two, nineteen sixty-three, for determining whether to accept material for the PRS general files 69 | DUMMY1/LJ017-0044.wav|and the deepest anxiety was felt that the crime, if crime there had been, should be brought home to its perpetrator. 70 | DUMMY1/LJ017-0070.wav|but his sporting operations did not prosper, and he became a needy man, always driven to desperate straits for cash. 
71 | DUMMY1/LJ014-0020.wav|He was soon afterwards arrested on suspicion, and a search of his lodgings brought to light several garments saturated with blood; 72 | DUMMY1/LJ016-0020.wav|He never reached the cistern, but fell back into the yard, injuring his legs severely. 73 | DUMMY1/LJ045-0230.wav|when he was finally apprehended in the Texas Theatre. Although it is not fully corroborated by others who were present, 74 | DUMMY1/LJ035-0129.wav|and she must have run down the stairs ahead of Oswald and would probably have seen or heard him. 75 | DUMMY1/LJ008-0307.wav|afterwards express a wish to murder the Recorder for having kept them so long in suspense. 76 | DUMMY1/LJ008-0294.wav|nearly indefinitely deferred. 77 | DUMMY1/LJ047-0148.wav|On October twenty-five, 78 | DUMMY1/LJ008-0111.wav|They entered a "stone cold room," and were presently joined by the prisoner. 79 | DUMMY1/LJ034-0042.wav|that he could only testify with certainty that the print was less than three days old. 80 | DUMMY1/LJ037-0234.wav|Mrs. Mary Brock, the wife of a mechanic who worked at the station, was there at the time and she saw a white male, 81 | DUMMY1/LJ040-0002.wav|Chapter seven. Lee Harvey Oswald: Background and Possible Motives, Part one. 82 | DUMMY1/LJ045-0140.wav|The arguments he used to justify his use of the alias suggest that Oswald may have come to think that the whole world was becoming involved 83 | DUMMY1/LJ012-0035.wav|the number and names on watches, were carefully removed or obliterated after the goods passed out of his hands. 84 | DUMMY1/LJ012-0250.wav|On the seventh July, eighteen thirty-seven, 85 | DUMMY1/LJ016-0179.wav|contracted with sheriffs and conveners to work by the job. 86 | DUMMY1/LJ016-0138.wav|at a distance from the prison. 87 | DUMMY1/LJ027-0052.wav|These principles of homology are essential to a correct interpretation of the facts of morphology. 88 | DUMMY1/LJ031-0134.wav|On one occasion Mrs. Johnson, accompanied by two Secret Service agents, left the room to see Mrs. Kennedy and Mrs. Connally. 89 | DUMMY1/LJ019-0273.wav|which Sir Joshua Jebb told the committee he considered the proper elements of penal discipline. 90 | DUMMY1/LJ014-0110.wav|At the first the boxes were impounded, opened, and found to contain many of O'Connor's effects. 91 | DUMMY1/LJ034-0160.wav|on Brennan's subsequent certain identification of Lee Harvey Oswald as the man he saw fire the rifle. 92 | DUMMY1/LJ038-0199.wav|eleven. If I am alive and taken prisoner, 93 | DUMMY1/LJ014-0010.wav|yet he could not overcome the strange fascination it had for him, and remained by the side of the corpse till the stretcher came. 94 | DUMMY1/LJ033-0047.wav|I noticed when I went out that the light was on, end quote, 95 | DUMMY1/LJ040-0027.wav|He was never satisfied with anything. 96 | DUMMY1/LJ048-0228.wav|and others who were present say that no agent was inebriated or acted improperly. 97 | DUMMY1/LJ003-0111.wav|He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity. 98 | DUMMY1/LJ008-0258.wav|Let me retrace my steps, and speak more in detail of the treatment of the condemned in those bloodthirsty and brutally indifferent days, 99 | DUMMY1/LJ029-0022.wav|The original plan called for the President to spend only one day in the State, making whirlwind visits to Dallas, Fort Worth, San Antonio, and Houston. 100 | DUMMY1/LJ004-0045.wav|Mr. Sturges Bourne, Sir James Mackintosh, Sir James Scarlett, and William Wilberforce. 
101 | -------------------------------------------------------------------------------- /filelists/ljs_audio_text_val_filelist.txt.cleaned: -------------------------------------------------------------------------------- 1 | DUMMY1/LJ022-0023.wav|ðɪ ˌoʊvɚwˈɛlmɪŋ mədʒˈɔːɹɪɾi ʌv pˈiːpəl ɪn ðɪs kˈʌntɹi nˈoʊ hˌaʊ tə sˈɪft ðə wˈiːt fɹʌmðə tʃˈæf ɪn wˌʌt ðeɪ hˈɪɹ ænd wˌʌt ðeɪ ɹˈiːd. 2 | DUMMY1/LJ043-0030.wav|ɪf sˈʌmbɑːdi dˈɪd ðˈæt tə mˌiː, ɐ lˈaʊsi tɹˈɪk lˈaɪk ðˈæt, tə tˈeɪk maɪ wˈaɪf ɐwˈeɪ, ænd ˈɔːl ðə fˈɜːnɪtʃɚ, ˈaɪ wʊd biː mˈæd æz hˈɛl, tˈuː. 3 | DUMMY1/LJ005-0201.wav|ˌæzˌɪz ʃˈoʊn baɪ ðə ɹɪpˈoːɹt ʌvðə kəmˈɪʃənɚz tʊ ɪnkwˈaɪɚɹ ˌɪntʊ ðə stˈeɪt ʌvðə mjuːnˈɪsɪpəl kˌɔːɹpɚɹˈeɪʃənz ɪn eɪtˈiːn θˈɜːɾifˈaɪv. 4 | DUMMY1/LJ001-0110.wav|ˈiːvən ðə kˈæslɑːn tˈaɪp wɛn ɛnlˈɑːɹdʒd ʃˈoʊz ɡɹˈeɪt ʃˈɔːɹtkʌmɪŋz ɪn ðɪs ɹɪspˈɛkt: 5 | DUMMY1/LJ003-0345.wav|ˈɔːl ðə kəmˈɪɾi kʊd dˈuː ɪn ðɪs ɹɪspˈɛkt wʌz tə θɹˈoʊ ðə ɹɪspˌɑːnsəbˈɪlɪɾi ˌɑːn ˈʌðɚz. 6 | DUMMY1/LJ007-0154.wav|ðiːz pˈʌndʒənt ænd wˈɛlɡɹˈaʊndᵻd stɹˈɪktʃɚz ɐplˈaɪd wɪð stˈɪl ɡɹˈeɪɾɚ fˈoːɹs tə ðɪ ʌnkənvˈɪktᵻd pɹˈɪzənɚ, ðə mˈæn hˌuː kˈeɪm tə ðə pɹˈɪzən ˈɪnəsənt, ænd stˈɪl ʌnkəntˈæmᵻnˌeɪɾᵻd, 7 | DUMMY1/LJ018-0098.wav|ænd ɹˈɛkəɡnˌaɪzd æz wˈʌn ʌvðə fɹˈiːkwɛntɚz ʌvðə bˈoʊɡəs lˈɔːstˈeɪʃənɚz. hɪz ɐɹˈɛst lˈɛd tə ðæt ʌv ˈʌðɚz. 8 | DUMMY1/LJ047-0044.wav|ˈɑːswəld wʌz, haʊˈɛvɚ, wˈɪlɪŋ tə dɪskˈʌs hɪz kˈɑːntækts wɪð sˈoʊviət ɐθˈɔːɹɪɾiz. hiː dɪnˈaɪd hˌævɪŋ ˌɛni ɪnvˈɑːlvmənt wɪð sˈoʊviət ɪntˈɛlɪdʒəns ˈeɪdʒənsiz 9 | DUMMY1/LJ031-0038.wav|ðə fˈɜːst fɪzˈɪʃən tə sˈiː ðə pɹˈɛzɪdənt æt pˈɑːɹklənd hˈɑːspɪɾəl wʌz dˈɑːktɚ tʃˈɑːɹlz dʒˈeɪ. kˈæɹɪkˌoʊ, ɐ ɹˈɛzɪdənt ɪn dʒˈɛnɚɹəl sˈɜːdʒɚɹi. 10 | DUMMY1/LJ048-0194.wav|dˈʊɹɪŋ ðə mˈɔːɹnɪŋ ʌv noʊvˈɛmbɚ twˈɛntitˈuː pɹˈaɪɚ tə ðə mˈoʊɾɚkˌeɪd. 11 | DUMMY1/LJ049-0026.wav|ˌɑːn əkˈeɪʒən ðə sˈiːkɹət sˈɜːvɪs hɐzbɪn pɚmˈɪɾᵻd tə hæv ɐn ˈeɪdʒənt ɹˈaɪdɪŋ ɪnðə pˈæsɪndʒɚ kəmpˈɑːɹtmənt wɪððə pɹˈɛzɪdənt. 12 | DUMMY1/LJ004-0152.wav|ɑːlðˈoʊ æt mˈɪstɚ bˈʌkstənz vˈɪzɪt ɐ nˈuː dʒˈeɪl wʌz ɪn pɹˈɑːsɛs ʌv ɪɹˈɛkʃən, ðə fˈɜːst stˈɛp tʊwˈɔːɹdz ɹɪfˈɔːɹm sˈɪns hˈaʊɚdz vˌɪzɪtˈeɪʃən ɪn sˌɛvəntˈiːn sˈɛvəntifˈoːɹ. 13 | DUMMY1/LJ008-0278.wav|ɔːɹ ðˈɛɹz mˌaɪt biː wˈʌn ʌv mˈɛni, ænd ɪt mˌaɪt biː kənsˈɪdɚd nˈɛsəsɚɹi tuː "mˌeɪk ɐn ɛɡzˈæmpəl." 14 | DUMMY1/LJ043-0002.wav|ðə wˈɔːɹən kəmˈɪʃən ɹɪpˈoːɹt. baɪ ðə pɹˈɛzɪdənts kəmˈɪʃən ɑːnðɪ ɐsˌæsᵻnˈeɪʃən ʌv pɹˈɛzɪdənt kˈɛnədi. tʃˈæptɚ sˈɛvən. lˈiː hˈɑːɹvi ˈɑːswəld: 15 | DUMMY1/LJ009-0114.wav|mˈɪstɚ wˈeɪkfiːld wˈaɪndz ˈʌp hɪz ɡɹˈæfɪk bˌʌt sˈʌmwʌt sɛnsˈeɪʃənəl ɐkˈaʊnt baɪ dɪskɹˈaɪbɪŋ ɐnˈʌðɚ ɹɪlˈɪdʒəs sˈɜːvɪs, wˌɪtʃ mˈeɪ ɐpɹˈoʊpɹɪətli biː ɪnsˈɜːɾᵻd hˈɪɹ. 16 | DUMMY1/LJ028-0506.wav|ɐ mˈɑːdɚn ˈɑːɹɾɪst wʊdhɐv dˈɪfɪkˌʌlti ɪn dˌuːɪŋ sˈʌtʃ ˈækjʊɹət wˈɜːk. 17 | DUMMY1/LJ050-0168.wav|wɪððə pɚtˈɪkjʊlɚ pˈɜːpəsᵻz ʌvðɪ ˈeɪdʒənsi ɪnvˈɑːlvd. ðə kəmˈɪʃən ɹˈɛkəɡnˌaɪzɪz ðæt ðɪs ɪz ɐ kˌɑːntɹəvˈɜːʃəl ˈɛɹiə 18 | DUMMY1/LJ039-0223.wav|ˈɑːswəldz mɚɹˈiːn tɹˈeɪnɪŋ ɪn mˈɑːɹksmənʃˌɪp, hɪz ˈʌðɚ ɹˈaɪfəl ɛkspˈiəɹɪəns ænd hɪz ɪstˈæblɪʃt fəmˌɪlɪˈæɹɪɾi wɪð ðɪs pɚtˈɪkjʊlɚ wˈɛpən 19 | DUMMY1/LJ029-0032.wav|ɐkˈoːɹdɪŋ tʊ oʊdˈɑːnəl, kwˈoʊt, wiː hɐd ɐ mˈoʊɾɚkˌeɪd wɛɹɹˈɛvɚ wiː wˈɛnt, ˈɛnd kwˈoʊt. 20 | DUMMY1/LJ031-0070.wav|dˈɑːktɚ klˈɑːɹk, hˌuː mˈoʊst klˈoʊsli ɑːbzˈɜːvd ðə hˈɛd wˈuːnd, 21 | DUMMY1/LJ034-0198.wav|jˈuːɪnz, hˌuː wʌz ɑːnðə saʊθwˈɛst kˈɔːɹnɚɹ ʌv ˈɛlm ænd hjˈuːstən stɹˈiːts tˈɛstɪfˌaɪd ðæt hiː kʊd nˌɑːt dɪskɹˈaɪb ðə mˈæn hiː sˈɔː ɪnðə wˈɪndoʊ. 22 | DUMMY1/LJ026-0068.wav|ˈɛnɚdʒi ˈɛntɚz ðə plˈænt, tʊ ɐ smˈɔːl ɛkstˈɛnt, 23 | DUMMY1/LJ039-0075.wav|wˈʌns juː nˈoʊ ðæt juː mˈʌst pˌʊt ðə kɹˈɔshɛɹz ɑːnðə tˈɑːɹɡɪt ænd ðæt ɪz ˈɔːl ðæt ɪz nˈɛsəsɚɹi. 
24 | DUMMY1/LJ004-0096.wav|ðə fˈeɪɾəl kˈɑːnsɪkwənsᵻz wˈɛɹɑːf mˌaɪt biː pɹɪvˈɛntᵻd ɪf ðə dʒˈʌstɪsᵻz ʌvðə pˈiːs wɜː djˈuːli ˈɔːθɚɹˌaɪzd 25 | DUMMY1/LJ005-0014.wav|spˈiːkɪŋ ˌɑːn ɐ dɪbˈeɪt ˌɑːn pɹˈɪzən mˈæɾɚz, hiː dᵻklˈɛɹd ðˈæt 26 | DUMMY1/LJ012-0161.wav|hiː wʌz ɹɪpˈoːɹɾᵻd tə hæv fˈɔːlən ɐwˈeɪ tʊ ɐ ʃˈædoʊ. 27 | DUMMY1/LJ018-0239.wav|hɪz dˌɪsɐpˈɪɹəns ɡˈeɪv kˈʌlɚ ænd sˈʌbstəns tʊ ˈiːvəl ɹɪpˈoːɹts ɔːlɹˌɛdi ɪn sˌɜːkjʊlˈeɪʃən ðætðə wɪl ænd kənvˈeɪəns əbˌʌv ɹɪfˈɜːd tuː 28 | DUMMY1/LJ019-0257.wav|hˈɪɹ ðə tɹˈɛdwˈiːl wʌz ɪn jˈuːs, ðɛɹ sˈɛljʊlɚ kɹˈæŋks, ɔːɹ hˈɑːɹdlˈeɪbɚ məʃˈiːnz. 29 | DUMMY1/LJ028-0008.wav|juː tˈæp dʒˈɛntli wɪð jʊɹ hˈiːl əpˌɑːn ðə ʃˈoʊldɚɹ ʌvðə dɹˈoʊmdɚɹi tʊ ˈɜːdʒ hɜːɹ ˈɑːn. 30 | DUMMY1/LJ024-0083.wav|ðɪs plˈæn ʌv mˈaɪn ɪz nˈoʊ ɐtˈæk ɑːnðə kˈoːɹt; 31 | DUMMY1/LJ042-0129.wav|nˈoʊ nˈaɪt klˈʌbz ɔːɹ bˈoʊlɪŋ ˈælɪz, nˈoʊ plˈeɪsᵻz ʌv ɹˌɛkɹiːˈeɪʃən ɛksˈɛpt ðə tɹˈeɪd jˈuːniən dˈænsᵻz. ˈaɪ hæv hɐd ɪnˈʌf. 32 | DUMMY1/LJ036-0103.wav|ðə pəlˈiːs ˈæskt hˌɪm wˈɛðɚ hiː kʊd pˈɪk ˈaʊt hɪz pˈæsɪndʒɚ fɹʌmðə lˈaɪnʌp. 33 | DUMMY1/LJ046-0058.wav|dˈʊɹɪŋ hɪz pɹˈɛzɪdənsi, fɹˈæŋklɪn dˈiː. ɹˈoʊzəvˌɛlt mˌeɪd ˈɔːlmoʊst fˈoːɹ hˈʌndɹəd dʒˈɜːnɪz ænd tɹˈævəld mˈoːɹ ðɐn θɹˈiː hˈʌndɹəd fˈɪfti θˈaʊzənd mˈaɪlz. 34 | DUMMY1/LJ014-0076.wav|hiː wʌz sˈiːn ˈæftɚwɚdz smˈoʊkɪŋ ænd tˈɔːkɪŋ wɪð hɪz hˈoʊsts ɪn ðɛɹ bˈæk pˈɑːɹlɚ, ænd nˈɛvɚ sˈiːn ɐɡˈɛn ɐlˈaɪv. 35 | DUMMY1/LJ002-0043.wav|lˈɑːŋ nˈæɹoʊ ɹˈuːmz wˈʌn θˈɜːɾisˈɪks fˈiːt, sˈɪks twˈɛntiθɹˈiː fˈiːt, ænd ðɪ ˈeɪtθ eɪtˈiːn, 36 | DUMMY1/LJ009-0076.wav|wiː kˈʌm tə ðə sˈɜːmən. 37 | DUMMY1/LJ017-0131.wav|ˈiːvən wɛn ðə hˈaɪ ʃˈɛɹɪf hɐd tˈoʊld hˌɪm ðɛɹwˌʌz nˈoʊ pˌɑːsəbˈɪlɪɾi əvɚ ɹɪpɹˈiːv, ænd wɪðˌɪn ɐ fjˈuː ˈaɪʊɹz ʌv ˌɛksɪkjˈuːʃən. 38 | DUMMY1/LJ046-0184.wav|bˌʌt ðɛɹ ɪz ɐ sˈɪstəm fɚðɪ ɪmˈiːdɪət nˌoʊɾɪfɪkˈeɪʃən ʌvðə sˈiːkɹət sˈɜːvɪs baɪ ðə kənfˈaɪnɪŋ ˌɪnstɪtˈuːʃən wɛn ɐ sˈʌbdʒɛkt ɪz ɹɪlˈiːsd ɔːɹ ɛskˈeɪps. 39 | DUMMY1/LJ014-0263.wav|wˌɛn ˈʌðɚ plˈɛʒɚz pˈɔːld hiː tˈʊk ɐ θˈiəɾɚ, ænd pˈoʊzd æz ɐ mjuːnˈɪfɪsənt pˈeɪtɹən ʌvðə dɹəmˈæɾɪk ˈɑːɹt. 40 | DUMMY1/LJ042-0096.wav| ˈoʊld ɛkstʃˈeɪndʒ ɹˈeɪt ɪn ɐdˈɪʃən tə hɪz fˈæktɚɹi sˈælɚɹi ʌv ɐpɹˈɑːksɪmətli ˈiːkwəl ɐmˈaʊnt 41 | DUMMY1/LJ049-0050.wav|hˈɪl hɐd bˈoʊθ fˈiːt ɑːnðə kˈɑːɹ ænd wʌz klˈaɪmɪŋ ɐbˈoːɹd tʊ ɐsˈɪst pɹˈɛzɪdənt ænd mɪsˈɛs kˈɛnədi. 42 | DUMMY1/LJ019-0186.wav|sˈiːɪŋ ðæt sˈɪns ðɪ ɪstˈæblɪʃmənt ʌvðə sˈɛntɹəl kɹˈɪmɪnəl kˈoːɹt, nˈuːɡeɪt ɹɪsˈiːvd pɹˈɪzənɚz fɔːɹ tɹˈaɪəl fɹʌm sˈɛvɹəl kˈaʊntɪz, 43 | DUMMY1/LJ028-0307.wav|ðˈɛn lˈɛt twˈɛnti dˈeɪz pˈæs, ænd æt ðɪ ˈɛnd ʌv ðæt tˈaɪm stˈeɪʃən nˌɪɹ ðə tʃˈældæsən ɡˈeɪts ɐ bˈɑːdi ʌv fˈoːɹ θˈaʊzənd. 44 | DUMMY1/LJ012-0235.wav|wˌaɪl ðeɪ wɜːɹ ɪn ɐ stˈeɪt ʌv ɪnsˌɛnsəbˈɪlɪɾi ðə mˈɜːdɚ wʌz kəmˈɪɾᵻd. 45 | DUMMY1/LJ034-0053.wav|ɹˈiːtʃt ðə sˈeɪm kənklˈuːʒən æz lætˈoʊnə ðætðə pɹˈɪnts fˈaʊnd ɑːnðə kˈɑːɹtənz wɜː ðoʊz ʌv lˈiː hˈɑːɹvi ˈɑːswəld. 46 | DUMMY1/LJ014-0030.wav|ðiːz wɜː dˈæmnətˌoːɹi fˈækts wˌɪtʃ wˈɛl səpˈoːɹɾᵻd ðə pɹˌɑːsɪkjˈuːʃən. 47 | DUMMY1/LJ015-0203.wav|bˌʌt wɜː ðə pɹɪkˈɔːʃənz tˈuː mˈɪnɪt, ðə vˈɪdʒɪləns tˈuː klˈoʊs təbi ɪlˈuːdᵻd ɔːɹ ˌoʊvɚkˈʌm? 48 | DUMMY1/LJ028-0093.wav|bˌʌt hɪz skɹˈaɪb ɹˈoʊt ɪt ɪnðə mˈænɚ kˈʌstəmˌɛɹi fɚðə skɹˈaɪbz ʌv ðoʊz dˈeɪz tə ɹˈaɪt ʌv ðɛɹ ɹˈɔɪəl mˈæstɚz. 49 | DUMMY1/LJ002-0018.wav|ðɪ ɪnˈædɪkwəsi ʌvðə dʒˈeɪl wʌz nˈoʊɾɪsd ænd ɹɪpˈoːɹɾᵻd əpˌɑːn ɐɡˈɛn ænd ɐɡˈɛn baɪ ðə ɡɹˈænd dʒˈʊɹɪz ʌvðə sˈɪɾi ʌv lˈʌndən, 50 | DUMMY1/LJ028-0275.wav|æt lˈæst, ɪnðə twˈɛntiəθ mˈʌnθ, 51 | DUMMY1/LJ012-0042.wav|wˌɪtʃ hiː kˈɛpt kənsˈiːld ɪn ɐ hˈaɪdɪŋplˈeɪs wɪð ɐ tɹˈæpdˈoːɹ dʒˈʌst ˌʌndɚ hɪz bˈɛd. 
52 | DUMMY1/LJ011-0096.wav|hiː mˈæɹɪd ɐ lˈeɪdi ˈɑːlsoʊ bɪlˈɑːŋɪŋ tə ðə səsˈaɪəɾi ʌv fɹˈɛndz, hˌuː bɹˈɔːt hˌɪm ɐ lˈɑːɹdʒ fˈɔːɹtʃən, wˈɪtʃ, ænd hɪz ˈoʊn mˈʌni, hiː pˌʊt ˌɪntʊ ɐ sˈɪɾi fˈɜːm, 53 | DUMMY1/LJ036-0077.wav|ɹˈɑːdʒɚ dˈiː. kɹˈeɪɡ, ɐ dˈɛpjuːɾi ʃˈɛɹɪf ʌv dˈæləs kˈaʊnti, 54 | DUMMY1/LJ016-0318.wav|ˈʌðɚɹ əfˈɪʃəlz, ɡɹˈeɪt lˈɔɪɚz, ɡˈʌvɚnɚz ʌv pɹˈɪzənz, ænd tʃˈæplɪnz səpˈoːɹɾᵻd ðɪs vjˈuː. 55 | DUMMY1/LJ013-0164.wav|hˌuː kˈeɪm fɹʌm hɪz ɹˈuːm ɹˈɛdi dɹˈɛst, ɐ səspˈɪʃəs sˈɜːkəmstˌæns, æz hiː wʌz ˈɔːlweɪz lˈeɪt ɪnðə mˈɔːɹnɪŋ. 56 | DUMMY1/LJ027-0141.wav|ɪz klˈoʊsli ɹɪpɹədˈuːst ɪnðə lˈaɪfhˈɪstɚɹi ʌv ɛɡzˈɪstɪŋ dˈɪɹ. ˈɔːɹ, ɪn ˈʌðɚ wˈɜːdz, 57 | DUMMY1/LJ028-0335.wav|ɐkˈoːɹdɪŋli ðeɪ kəmˈɪɾᵻd tə hˌɪm ðə kəmˈænd ʌv ðɛɹ hˈoʊl ˈɑːɹmi, ænd pˌʊt ðə kˈiːz ʌv ðɛɹ sˈɪɾi ˌɪntʊ hɪz hˈændz. 58 | DUMMY1/LJ031-0202.wav|mɪsˈɛs kˈɛnədi tʃˈoʊz ðə hˈɑːspɪɾəl ɪn bəθˈɛzdə fɚðɪ ˈɔːtɑːpsi bɪkˈʌz ðə pɹˈɛzɪdənt hɐd sˈɜːvd ɪnðə nˈeɪvi. 59 | DUMMY1/LJ021-0145.wav|fɹʌm ðoʊz wˈɪlɪŋ tə dʒˈɔɪn ɪn ɪstˈæblɪʃɪŋ ðɪs hˈoʊptfɔːɹ pˈiəɹɪəd ʌv pˈiːs, 60 | DUMMY1/LJ016-0288.wav|"mˈʌlɚ, mˈʌlɚ, hiːz ðə mˈæn," tˈɪl ɐ daɪvˈɜːʒən wʌz kɹiːˈeɪɾᵻd baɪ ðɪ ɐpˈɪɹəns ʌvðə ɡˈæloʊz, wˌɪtʃ wʌz ɹɪsˈiːvd wɪð kəntˈɪnjuːəs jˈɛlz. 61 | DUMMY1/LJ028-0081.wav|jˈɪɹz lˈeɪɾɚ, wˌɛn ðɪ ˌɑːɹkiːˈɑːlədʒˌɪsts kʊd ɹˈɛdɪli dɪstˈɪŋɡwɪʃ ðə fˈɑːls fɹʌmðə tɹˈuː, 62 | DUMMY1/LJ018-0081.wav|hɪz dɪfˈɛns bˌiːɪŋ ðæt hiː hɐd ɪntˈɛndᵻd tə kəmˈɪt sˈuːɪsˌaɪd, bˌʌt ðˈæt, ɑːnðɪ ɐpˈɪɹəns ʌv ðɪs ˈɑːfɪsɚ hˌuː hɐd ɹˈɔŋd hˌɪm, 63 | DUMMY1/LJ021-0066.wav|təɡˌɛðɚ wɪð ɐ ɡɹˈeɪt ˈɪnkɹiːs ɪnðə pˈeɪɹoʊlz, ðɛɹ hɐz kˈʌm ɐ səbstˈænʃəl ɹˈaɪz ɪnðə tˈoʊɾəl ʌv ɪndˈʌstɹɪəl pɹˈɑːfɪts 64 | DUMMY1/LJ009-0238.wav|ˈæftɚ ðɪs ðə ʃˈɛɹɪfs sˈɛnt fɔːɹ ɐnˈʌðɚ ɹˈoʊp, bˌʌt ðə spɛktˈeɪɾɚz ˌɪntəfˈɪɹd, ænd ðə mˈæn wʌz kˈæɹɪd bˈæk tə dʒˈeɪl. 65 | DUMMY1/LJ005-0079.wav|ænd ɪmpɹˈuːv ðə mˈɔːɹəlz ʌvðə pɹˈɪzənɚz, ænd ʃˌæl ɪnʃˈʊɹ ðə pɹˈɑːpɚ mˈɛʒɚɹ ʌv pˈʌnɪʃmənt tə kənvˈɪktᵻd əfˈɛndɚz. 66 | DUMMY1/LJ035-0019.wav|dɹˈoʊv tə ðə nɔːɹθwˈɛst kˈɔːɹnɚɹ ʌv ˈɛlm ænd hjˈuːstən, ænd pˈɑːɹkt ɐpɹˈɑːksɪmətli tˈɛn fˈiːt fɹʌmðə tɹˈæfɪk sˈɪɡnəl. 67 | DUMMY1/LJ036-0174.wav|ðɪs ɪz ðɪ ɐpɹˈɑːksɪmət tˈaɪm hiː ˈɛntɚd ðə ɹˈuːmɪŋhˌaʊs, ɐkˈoːɹdɪŋ tʊ ˈɜːliːn ɹˈɑːbɚts, ðə hˈaʊskiːpɚ ðˈɛɹ. 68 | DUMMY1/LJ046-0146.wav|ðə kɹaɪtˈiəɹɪə ɪn ɪfˈɛkt pɹˈaɪɚ tə noʊvˈɛmbɚ twˈɛntitˈuː, naɪntˈiːn sˈɪkstiθɹˈiː, fɔːɹ dɪtˈɜːmɪnɪŋ wˈɛðɚ tʊ ɐksˈɛpt mətˈiəɹɪəl fɚðə pˌiːˌɑːɹˈɛs dʒˈɛnɚɹəl fˈaɪlz 69 | DUMMY1/LJ017-0044.wav|ænd ðə dˈiːpəst æŋzˈaɪəɾi wʌz fˈɛlt ðætðə kɹˈaɪm, ɪf kɹˈaɪm ðˈɛɹ hɐdbɪn, ʃˌʊd biː bɹˈɔːt hˈoʊm tʊ ɪts pˈɜːpɪtɹˌeɪɾɚ. 70 | DUMMY1/LJ017-0070.wav|bˌʌt hɪz spˈoːɹɾɪŋ ˌɑːpɚɹˈeɪʃənz dɪdnˌɑːt pɹˈɑːspɚ, ænd hiː bɪkˌeɪm ɐ nˈiːdi mˈæn, ˈɔːlweɪz dɹˈɪvən tə dˈɛspɚɹət stɹˈeɪts fɔːɹ kˈæʃ. 71 | DUMMY1/LJ014-0020.wav|hiː wʌz sˈuːn ˈæftɚwɚdz ɐɹˈɛstᵻd ˌɑːn səspˈɪʃən, ænd ɐ sˈɜːtʃ ʌv hɪz lˈɑːdʒɪŋz bɹˈɔːt tə lˈaɪt sˈɛvɹəl ɡˈɑːɹmənts sˈætʃɚɹˌeɪɾᵻd wɪð blˈʌd; 72 | DUMMY1/LJ016-0020.wav|hiː nˈɛvɚ ɹˈiːtʃt ðə sˈɪstɚn, bˌʌt fˈɛl bˈæk ˌɪntʊ ðə jˈɑːɹd, ˈɪndʒɚɹɪŋ hɪz lˈɛɡz sɪvˈɪɹli. 73 | DUMMY1/LJ045-0230.wav|wˌɛn hiː wʌz fˈaɪnəli ˌæpɹɪhˈɛndᵻd ɪnðə tˈɛksəs θˈiəɾɚ. ɑːlðˈoʊ ɪt ɪz nˌɑːt fˈʊli kɚɹˈɑːbɚɹˌeɪɾᵻd baɪ ˈʌðɚz hˌuː wɜː pɹˈɛzənt, 74 | DUMMY1/LJ035-0129.wav|ænd ʃiː mˈʌstɐv ɹˈʌn dˌaʊn ðə stˈɛɹz ɐhˈɛd ʌv ˈɑːswəld ænd wʊd pɹˈɑːbəbli hæv sˈiːn ɔːɹ hˈɜːd hˌɪm. 75 | DUMMY1/LJ008-0307.wav|ˈæftɚwɚdz ɛkspɹˈɛs ɐ wˈɪʃ tə mˈɜːdɚ ðə ɹɪkˈoːɹdɚ fɔːɹ hˌævɪŋ kˈɛpt ðˌɛm sˌoʊ lˈɑːŋ ɪn səspˈɛns. 76 | DUMMY1/LJ008-0294.wav|nˌɪɹli ɪndˈɛfɪnətli dɪfˈɜːd. 
77 | DUMMY1/LJ047-0148.wav|ˌɑːn ɑːktˈoʊbɚ twˈɛntifˈaɪv, 78 | DUMMY1/LJ008-0111.wav|ðeɪ ˈɛntɚd ˈeɪ "stˈoʊn kˈoʊld ɹˈuːm," ænd wɜː pɹˈɛzəntli dʒˈɔɪnd baɪ ðə pɹˈɪzənɚ. 79 | DUMMY1/LJ034-0042.wav|ðæt hiː kʊd ˈoʊnli tˈɛstɪfˌaɪ wɪð sˈɜːtənti ðætðə pɹˈɪnt wʌz lˈɛs ðɐn θɹˈiː dˈeɪz ˈoʊld. 80 | DUMMY1/LJ037-0234.wav|mɪsˈɛs mˈɛɹi bɹˈɑːk, ðə wˈaɪf əvə mɪkˈænɪk hˌuː wˈɜːkt æt ðə stˈeɪʃən, wʌz ðɛɹ æt ðə tˈaɪm ænd ʃiː sˈɔː ɐ wˈaɪt mˈeɪl, 81 | DUMMY1/LJ040-0002.wav|tʃˈæptɚ sˈɛvən. lˈiː hˈɑːɹvi ˈɑːswəld: bˈækɡɹaʊnd ænd pˈɑːsəbəl mˈoʊɾɪvz, pˈɑːɹt wˌʌn. 82 | DUMMY1/LJ045-0140.wav|ðɪ ˈɑːɹɡjuːmənts hiː jˈuːzd tə dʒˈʌstɪfˌaɪ hɪz jˈuːs ʌvðɪ ˈeɪliəs sədʒˈɛst ðæt ˈɑːswəld mˌeɪhɐv kˈʌm tə θˈɪŋk ðætðə hˈoʊl wˈɜːld wʌz bɪkˈʌmɪŋ ɪnvˈɑːlvd 83 | DUMMY1/LJ012-0035.wav|ðə nˈʌmbɚ ænd nˈeɪmz ˌɑːn wˈɑːtʃᵻz, wɜː kˈɛɹfəli ɹɪmˈuːvd ɔːɹ əblˈɪɾɚɹˌeɪɾᵻd ˈæftɚ ðə ɡˈʊdz pˈæst ˌaʊɾəv hɪz hˈændz. 84 | DUMMY1/LJ012-0250.wav|ɑːnðə sˈɛvənθ dʒuːlˈaɪ, eɪtˈiːn θˈɜːɾisˈɛvən, 85 | DUMMY1/LJ016-0179.wav|kəntɹˈæktᵻd wɪð ʃˈɛɹɪfs ænd kənvˈɛnɚz tə wˈɜːk baɪ ðə dʒˈɑːb. 86 | DUMMY1/LJ016-0138.wav|æɾə dˈɪstəns fɹʌmðə pɹˈɪzən. 87 | DUMMY1/LJ027-0052.wav|ðiːz pɹˈɪnsɪpəlz ʌv həmˈɑːlədʒi ɑːɹ ɪsˈɛnʃəl tʊ ɐ kɚɹˈɛkt ɪntˌɜːpɹɪtˈeɪʃən ʌvðə fˈækts ʌv mɔːɹfˈɑːlədʒi. 88 | DUMMY1/LJ031-0134.wav|ˌɑːn wˈʌn əkˈeɪʒən mɪsˈɛs dʒˈɑːnsən, ɐkˈʌmpənɪd baɪ tˈuː sˈiːkɹət sˈɜːvɪs ˈeɪdʒənts, lˈɛft ðə ɹˈuːm tə sˈiː mɪsˈɛs kˈɛnədi ænd mɪsˈɛs kənˈæli. 89 | DUMMY1/LJ019-0273.wav|wˌɪtʃ sˌɜː dʒˈɑːʃjuːə dʒˈɛb tˈoʊld ðə kəmˈɪɾi hiː kənsˈɪdɚd ðə pɹˈɑːpɚɹ ˈɛlɪmənts ʌv pˈiːnəl dˈɪsɪplˌɪn. 90 | DUMMY1/LJ014-0110.wav|æt ðə fˈɜːst ðə bˈɑːksᵻz wɜːɹ ɪmpˈaʊndᵻd, ˈoʊpənd, ænd fˈaʊnd tə kəntˈeɪn mˈɛnɪəv oʊkˈɑːnɚz ɪfˈɛkts. 91 | DUMMY1/LJ034-0160.wav|ˌɑːn bɹˈɛnənz sˈʌbsɪkwənt sˈɜːtən aɪdˈɛntɪfɪkˈeɪʃən ʌv lˈiː hˈɑːɹvi ˈɑːswəld æz ðə mˈæn hiː sˈɔː fˈaɪɚ ðə ɹˈaɪfəl. 92 | DUMMY1/LJ038-0199.wav|ɪlˈɛvən. ɪf ˈaɪ æm ɐlˈaɪv ænd tˈeɪkən pɹˈɪzənɚ, 93 | DUMMY1/LJ014-0010.wav|jˈɛt hiː kʊd nˌɑːt ˌoʊvɚkˈʌm ðə stɹˈeɪndʒ fˌæsᵻnˈeɪʃən ɪt hˈɐd fɔːɹ hˌɪm, ænd ɹɪmˈeɪnd baɪ ðə sˈaɪd ʌvðə kˈɔːɹps tˈɪl ðə stɹˈɛtʃɚ kˈeɪm. 94 | DUMMY1/LJ033-0047.wav|ˈaɪ nˈoʊɾɪsd wɛn ˈaɪ wɛnt ˈaʊt ðætðə lˈaɪt wʌz ˈɑːn, ˈɛnd kwˈoʊt, 95 | DUMMY1/LJ040-0027.wav|hiː wʌz nˈɛvɚ sˈæɾɪsfˌaɪd wɪð ˈɛnɪθˌɪŋ. 96 | DUMMY1/LJ048-0228.wav|ænd ˈʌðɚz hˌuː wɜː pɹˈɛzənt sˈeɪ ðæt nˈoʊ ˈeɪdʒənt wʌz ɪnˈiːbɹɪˌeɪɾᵻd ɔːɹ ˈæktᵻd ɪmpɹˈɑːpɚli. 97 | DUMMY1/LJ003-0111.wav|hiː wʌz ɪn kˈɑːnsɪkwəns pˌʊt ˌaʊɾəv ðə pɹətˈɛkʃən ʌv ðɛɹ ɪntˈɜːnəl lˈɔː, ˈɛnd kwˈoʊt. ðɛɹ kˈoʊd wʌzɐ sˈʌbdʒɛkt ʌv sˌʌm kjˌʊɹɪˈɑːsɪɾi. 98 | DUMMY1/LJ008-0258.wav|lˈɛt mˌiː ɹɪtɹˈeɪs maɪ stˈɛps, ænd spˈiːk mˈoːɹ ɪn diːtˈeɪl ʌvðə tɹˈiːtmənt ʌvðə kəndˈɛmd ɪn ðoʊz blˈʌdθɜːsti ænd bɹˈuːɾəli ɪndˈɪfɹənt dˈeɪz, 99 | DUMMY1/LJ029-0022.wav|ðɪ ɚɹˈɪdʒɪnəl plˈæn kˈɔːld fɚðə pɹˈɛzɪdənt tə spˈɛnd ˈoʊnli wˈʌn dˈeɪ ɪnðə stˈeɪt, mˌeɪkɪŋ wˈɜːlwɪnd vˈɪzɪts tə dˈæləs, fˈɔːɹt wˈɜːθ, sˌæn æntˈoʊnɪˌoʊ, ænd hjˈuːstən. 100 | DUMMY1/LJ004-0045.wav|mˈɪstɚ stˈɜːdʒᵻz bˈoːɹn, sˌɜː dʒˈeɪmz mˈækɪntˌɑːʃ, sˌɜː dʒˈeɪmz skˈɑːɹlɪt, ænd wˈɪljəm wˈɪlbɚfˌoːɹs. 
101 | -------------------------------------------------------------------------------- /models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/heatz123/naturalspeech/ca2f814d960149a8fdc3f3e4d773abb67e7c18ec/models/__init__.py -------------------------------------------------------------------------------- /models/attentions.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import torch 4 | from torch import nn 5 | from torch.nn import functional as F 6 | 7 | from utils import commons 8 | from models.modules import LayerNorm 9 | 10 | 11 | class Encoder(nn.Module): 12 | def __init__( 13 | self, 14 | hidden_channels, 15 | filter_channels, 16 | n_heads, 17 | n_layers, 18 | kernel_size=1, 19 | p_dropout=0.0, 20 | window_size=4, 21 | **kwargs 22 | ): 23 | super().__init__() 24 | self.hidden_channels = hidden_channels 25 | self.filter_channels = filter_channels 26 | self.n_heads = n_heads 27 | self.n_layers = n_layers 28 | self.kernel_size = kernel_size 29 | self.p_dropout = p_dropout 30 | self.window_size = window_size 31 | 32 | self.drop = nn.Dropout(p_dropout) 33 | self.attn_layers = nn.ModuleList() 34 | self.norm_layers_1 = nn.ModuleList() 35 | self.ffn_layers = nn.ModuleList() 36 | self.norm_layers_2 = nn.ModuleList() 37 | for i in range(self.n_layers): 38 | self.attn_layers.append( 39 | MultiHeadAttention( 40 | hidden_channels, 41 | hidden_channels, 42 | n_heads, 43 | p_dropout=p_dropout, 44 | window_size=window_size, 45 | ) 46 | ) 47 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 48 | self.ffn_layers.append( 49 | FFN( 50 | hidden_channels, 51 | hidden_channels, 52 | filter_channels, 53 | kernel_size, 54 | p_dropout=p_dropout, 55 | ) 56 | ) 57 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 58 | 59 | def forward(self, x, x_mask): 60 | attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 61 | x = x * x_mask 62 | for i in range(self.n_layers): 63 | y = self.attn_layers[i](x, x, attn_mask) 64 | y = self.drop(y) 65 | x = self.norm_layers_1[i](x + y) 66 | 67 | y = self.ffn_layers[i](x, x_mask) 68 | y = self.drop(y) 69 | x = self.norm_layers_2[i](x + y) 70 | x = x * x_mask 71 | return x 72 | 73 | 74 | class Decoder(nn.Module): 75 | def __init__( 76 | self, 77 | hidden_channels, 78 | filter_channels, 79 | n_heads, 80 | n_layers, 81 | kernel_size=1, 82 | p_dropout=0.0, 83 | proximal_bias=False, 84 | proximal_init=True, 85 | **kwargs 86 | ): 87 | super().__init__() 88 | self.hidden_channels = hidden_channels 89 | self.filter_channels = filter_channels 90 | self.n_heads = n_heads 91 | self.n_layers = n_layers 92 | self.kernel_size = kernel_size 93 | self.p_dropout = p_dropout 94 | self.proximal_bias = proximal_bias 95 | self.proximal_init = proximal_init 96 | 97 | self.drop = nn.Dropout(p_dropout) 98 | self.self_attn_layers = nn.ModuleList() 99 | self.norm_layers_0 = nn.ModuleList() 100 | self.encdec_attn_layers = nn.ModuleList() 101 | self.norm_layers_1 = nn.ModuleList() 102 | self.ffn_layers = nn.ModuleList() 103 | self.norm_layers_2 = nn.ModuleList() 104 | for i in range(self.n_layers): 105 | self.self_attn_layers.append( 106 | MultiHeadAttention( 107 | hidden_channels, 108 | hidden_channels, 109 | n_heads, 110 | p_dropout=p_dropout, 111 | proximal_bias=proximal_bias, 112 | proximal_init=proximal_init, 113 | ) 114 | ) 115 | self.norm_layers_0.append(LayerNorm(hidden_channels)) 116 | self.encdec_attn_layers.append( 117 | 
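                # encoder-decoder cross-attention: queries come from the decoder states x, keys/values from the encoder output h (see forward below)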
MultiHeadAttention( 118 | hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout 119 | ) 120 | ) 121 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 122 | self.ffn_layers.append( 123 | FFN( 124 | hidden_channels, 125 | hidden_channels, 126 | filter_channels, 127 | kernel_size, 128 | p_dropout=p_dropout, 129 | causal=True, 130 | ) 131 | ) 132 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 133 | 134 | def forward(self, x, x_mask, h, h_mask): 135 | """ 136 | x: decoder input 137 | h: encoder output 138 | """ 139 | self_attn_mask = commons.subsequent_mask(x_mask.size(2)).to( 140 | device=x.device, dtype=x.dtype 141 | ) 142 | encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 143 | x = x * x_mask 144 | for i in range(self.n_layers): 145 | y = self.self_attn_layers[i](x, x, self_attn_mask) 146 | y = self.drop(y) 147 | x = self.norm_layers_0[i](x + y) 148 | 149 | y = self.encdec_attn_layers[i](x, h, encdec_attn_mask) 150 | y = self.drop(y) 151 | x = self.norm_layers_1[i](x + y) 152 | 153 | y = self.ffn_layers[i](x, x_mask) 154 | y = self.drop(y) 155 | x = self.norm_layers_2[i](x + y) 156 | x = x * x_mask 157 | return x 158 | 159 | 160 | class MultiHeadAttention(nn.Module): 161 | def __init__( 162 | self, 163 | channels, 164 | out_channels, 165 | n_heads, 166 | p_dropout=0.0, 167 | window_size=None, 168 | heads_share=True, 169 | block_length=None, 170 | proximal_bias=False, 171 | proximal_init=False, 172 | ): 173 | super().__init__() 174 | assert channels % n_heads == 0 175 | 176 | self.channels = channels 177 | self.out_channels = out_channels 178 | self.n_heads = n_heads 179 | self.p_dropout = p_dropout 180 | self.window_size = window_size 181 | self.heads_share = heads_share 182 | self.block_length = block_length 183 | self.proximal_bias = proximal_bias 184 | self.proximal_init = proximal_init 185 | self.attn = None 186 | 187 | self.k_channels = channels // n_heads 188 | self.conv_q = nn.Conv1d(channels, channels, 1) 189 | self.conv_k = nn.Conv1d(channels, channels, 1) 190 | self.conv_v = nn.Conv1d(channels, channels, 1) 191 | self.conv_o = nn.Conv1d(channels, out_channels, 1) 192 | self.drop = nn.Dropout(p_dropout) 193 | 194 | if window_size is not None: 195 | n_heads_rel = 1 if heads_share else n_heads 196 | rel_stddev = self.k_channels**-0.5 197 | self.emb_rel_k = nn.Parameter( 198 | torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) 199 | * rel_stddev 200 | ) 201 | self.emb_rel_v = nn.Parameter( 202 | torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) 203 | * rel_stddev 204 | ) 205 | 206 | nn.init.xavier_uniform_(self.conv_q.weight) 207 | nn.init.xavier_uniform_(self.conv_k.weight) 208 | nn.init.xavier_uniform_(self.conv_v.weight) 209 | if proximal_init: 210 | with torch.no_grad(): 211 | self.conv_k.weight.copy_(self.conv_q.weight) 212 | self.conv_k.bias.copy_(self.conv_q.bias) 213 | 214 | def forward(self, x, c, attn_mask=None): 215 | q = self.conv_q(x) 216 | k = self.conv_k(c) 217 | v = self.conv_v(c) 218 | 219 | x, self.attn = self.attention(q, k, v, mask=attn_mask) 220 | 221 | x = self.conv_o(x) 222 | return x 223 | 224 | def attention(self, query, key, value, mask=None): 225 | # reshape [b, d, t] -> [b, n_h, t, d_k] 226 | b, d, t_s, t_t = (*key.size(), query.size(2)) 227 | query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3) 228 | key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3) 229 | value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3) 230 | 231 | scores = 
torch.matmul(query / math.sqrt(self.k_channels), key.transpose(-2, -1)) 232 | if self.window_size is not None: 233 | assert ( 234 | t_s == t_t 235 | ), "Relative attention is only available for self-attention." 236 | key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s) 237 | rel_logits = self._matmul_with_relative_keys( 238 | query / math.sqrt(self.k_channels), key_relative_embeddings 239 | ) 240 | scores_local = self._relative_position_to_absolute_position(rel_logits) 241 | scores = scores + scores_local 242 | if self.proximal_bias: 243 | assert t_s == t_t, "Proximal bias is only available for self-attention." 244 | scores = scores + self._attention_bias_proximal(t_s).to( 245 | device=scores.device, dtype=scores.dtype 246 | ) 247 | if mask is not None: 248 | scores = scores.masked_fill(mask == 0, -1e4) 249 | if self.block_length is not None: 250 | assert ( 251 | t_s == t_t 252 | ), "Local attention is only available for self-attention." 253 | block_mask = ( 254 | torch.ones_like(scores) 255 | .triu(-self.block_length) 256 | .tril(self.block_length) 257 | ) 258 | scores = scores.masked_fill(block_mask == 0, -1e4) 259 | p_attn = F.softmax(scores, dim=-1) # [b, n_h, t_t, t_s] 260 | p_attn = self.drop(p_attn) 261 | output = torch.matmul(p_attn, value) 262 | if self.window_size is not None: 263 | relative_weights = self._absolute_position_to_relative_position(p_attn) 264 | value_relative_embeddings = self._get_relative_embeddings( 265 | self.emb_rel_v, t_s 266 | ) 267 | output = output + self._matmul_with_relative_values( 268 | relative_weights, value_relative_embeddings 269 | ) 270 | output = ( 271 | output.transpose(2, 3).contiguous().view(b, d, t_t) 272 | ) # [b, n_h, t_t, d_k] -> [b, d, t_t] 273 | return output, p_attn 274 | 275 | def _matmul_with_relative_values(self, x, y): 276 | """ 277 | x: [b, h, l, m] 278 | y: [h or 1, m, d] 279 | ret: [b, h, l, d] 280 | """ 281 | ret = torch.matmul(x, y.unsqueeze(0)) 282 | return ret 283 | 284 | def _matmul_with_relative_keys(self, x, y): 285 | """ 286 | x: [b, h, l, d] 287 | y: [h or 1, m, d] 288 | ret: [b, h, l, m] 289 | """ 290 | ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1)) 291 | return ret 292 | 293 | def _get_relative_embeddings(self, relative_embeddings, length): 294 | max_relative_position = 2 * self.window_size + 1 295 | # Pad first before slice to avoid using cond ops. 296 | pad_length = max(length - (self.window_size + 1), 0) 297 | slice_start_position = max((self.window_size + 1) - length, 0) 298 | slice_end_position = slice_start_position + 2 * length - 1 299 | if pad_length > 0: 300 | padded_relative_embeddings = F.pad( 301 | relative_embeddings, 302 | commons.convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]]), 303 | ) 304 | else: 305 | padded_relative_embeddings = relative_embeddings 306 | used_relative_embeddings = padded_relative_embeddings[ 307 | :, slice_start_position:slice_end_position 308 | ] 309 | return used_relative_embeddings 310 | 311 | def _relative_position_to_absolute_position(self, x): 312 | """ 313 | x: [b, h, l, 2*l-1] 314 | ret: [b, h, l, l] 315 | """ 316 | batch, heads, length, _ = x.size() 317 | # Concat columns of pad to shift from relative to absolute indexing. 318 | x = F.pad(x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]])) 319 | 320 | # Concat extra elements so to add up to shape (len+1, 2*len-1). 
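        # (e.g. for length 3: rows of width 2*3-1 = 5 are padded to width 6 and
        # flattened to 18 elements; padding with length-1 = 2 zeros gives 20
        # elements, which re-view as [4, 5]. The extra pad skews each row by one
        # position, so relative offsets line up as absolute positions.)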
321 | x_flat = x.view([batch, heads, length * 2 * length]) 322 | x_flat = F.pad( 323 | x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [0, length - 1]]) 324 | ) 325 | 326 | # Reshape and slice out the padded elements. 327 | x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[ 328 | :, :, :length, length - 1 : 329 | ] 330 | return x_final 331 | 332 | def _absolute_position_to_relative_position(self, x): 333 | """ 334 | x: [b, h, l, l] 335 | ret: [b, h, l, 2*l-1] 336 | """ 337 | batch, heads, length, _ = x.size() 338 | # padd along column 339 | x = F.pad( 340 | x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length - 1]]) 341 | ) 342 | x_flat = x.view([batch, heads, length**2 + length * (length - 1)]) 343 | # add 0's in the beginning that will skew the elements after reshape 344 | x_flat = F.pad(x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [length, 0]])) 345 | x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:] 346 | return x_final 347 | 348 | def _attention_bias_proximal(self, length): 349 | """Bias for self-attention to encourage attention to close positions. 350 | Args: 351 | length: an integer scalar. 352 | Returns: 353 | a Tensor with shape [1, 1, length, length] 354 | """ 355 | r = torch.arange(length, dtype=torch.float32) 356 | diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1) 357 | return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0) 358 | 359 | 360 | class FFN(nn.Module): 361 | def __init__( 362 | self, 363 | in_channels, 364 | out_channels, 365 | filter_channels, 366 | kernel_size, 367 | p_dropout=0.0, 368 | activation=None, 369 | causal=False, 370 | ): 371 | super().__init__() 372 | self.in_channels = in_channels 373 | self.out_channels = out_channels 374 | self.filter_channels = filter_channels 375 | self.kernel_size = kernel_size 376 | self.p_dropout = p_dropout 377 | self.activation = activation 378 | self.causal = causal 379 | 380 | if causal: 381 | self.padding = self._causal_padding 382 | else: 383 | self.padding = self._same_padding 384 | 385 | self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size) 386 | self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size) 387 | self.drop = nn.Dropout(p_dropout) 388 | 389 | def forward(self, x, x_mask): 390 | x = self.conv_1(self.padding(x * x_mask)) 391 | if self.activation == "gelu": 392 | x = x * torch.sigmoid(1.702 * x) 393 | else: 394 | x = torch.relu(x) 395 | x = self.drop(x) 396 | x = self.conv_2(self.padding(x * x_mask)) 397 | return x * x_mask 398 | 399 | def _causal_padding(self, x): 400 | if self.kernel_size == 1: 401 | return x 402 | pad_l = self.kernel_size - 1 403 | pad_r = 0 404 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 405 | x = F.pad(x, commons.convert_pad_shape(padding)) 406 | return x 407 | 408 | def _same_padding(self, x): 409 | if self.kernel_size == 1: 410 | return x 411 | pad_l = (self.kernel_size - 1) // 2 412 | pad_r = self.kernel_size // 2 413 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 414 | x = F.pad(x, commons.convert_pad_shape(padding)) 415 | return x 416 | -------------------------------------------------------------------------------- /models/losses.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | from models.soft_dtw import SoftDTW 4 | 5 | sdtw = SoftDTW(use_cuda=False, gamma=0.01, warp=134.4) 6 | 7 | 8 | def feature_loss(fmap_r, fmap_g): 9 | loss = 0 10 | for dr, dg in zip(fmap_r, fmap_g): 11 | for rl, 
gl in zip(dr, dg): 12 | rl = rl.float().detach() 13 | gl = gl.float() 14 | loss += torch.mean(torch.abs(rl - gl)) 15 | 16 | return loss * 2 17 | 18 | 19 | def discriminator_loss(disc_real_outputs, disc_generated_outputs): 20 | loss = 0 21 | r_losses = [] 22 | g_losses = [] 23 | for dr, dg in zip(disc_real_outputs, disc_generated_outputs): 24 | dr = dr.float() 25 | dg = dg.float() 26 | r_loss = torch.mean((1 - dr) ** 2) 27 | g_loss = torch.mean(dg**2) 28 | loss += r_loss + g_loss 29 | r_losses.append(r_loss.item()) 30 | g_losses.append(g_loss.item()) 31 | 32 | return loss, r_losses, g_losses 33 | 34 | 35 | def generator_loss(disc_outputs): 36 | loss = 0 37 | gen_losses = [] 38 | for dg in disc_outputs: 39 | dg = dg.float() 40 | l = torch.mean((1 - dg) ** 2) 41 | gen_losses.append(l) 42 | loss += l 43 | 44 | return loss, gen_losses 45 | 46 | 47 | def kl_loss(z_p, logs_q, m_p, logs_p, z_mask): 48 | """ 49 | z_p, logs_q: [b, h, t_t] 50 | m_p, logs_p: [b, h, t_t] 51 | """ 52 | z_p = z_p.float() 53 | logs_q = logs_q.float() 54 | m_p = m_p.float() 55 | logs_p = logs_p.float() 56 | z_mask = z_mask.float() 57 | 58 | kl = logs_p - logs_q - 0.5 59 | kl += 0.5 * ((z_p - m_p) ** 2) * torch.exp(-2.0 * logs_p) 60 | kl = torch.sum(kl * z_mask) 61 | l = kl / torch.sum(z_mask) 62 | return l 63 | 64 | 65 | def get_sdtw_kl_matrix(z_p, logs_q, m_p, logs_p): 66 | """ 67 | returns kl matrix with shape [b, t_tp, t_tq] 68 | z_p, logs_q: [b, h, t_tq] 69 | m_p, logs_p: [b, h, t_tp] 70 | """ 71 | z_p = z_p.float() 72 | logs_q = logs_q.float() 73 | m_p = m_p.float() 74 | logs_p = logs_p.float() 75 | 76 | t_tp, t_tq = m_p.size(-1), z_p.size(-1) 77 | b, h, t_tp = m_p.shape 78 | 79 | # for memory testing 80 | # return torch.abs(z_p.mean(dim=1)[:, None, :] - m_p.mean(dim=1)[:, :, None]) 81 | 82 | # z_p = z_p.transpose(0, 1) 83 | # logs_q = logs_q.transpose(0, 1) 84 | # m_p = m_p.transpose(0, 1) 85 | # logs_p = logs_p.transpose(0, 1) 86 | 87 | kls = torch.zeros((b, t_tp, t_tq), dtype=z_p.dtype, device=z_p.device) 88 | for i in range(h): 89 | logs_p_, m_p_, logs_q_, z_p_ = ( 90 | logs_p[:, i, :, None], 91 | m_p[:, i, :, None], 92 | logs_q[:, i, None, :], 93 | z_p[:, i, None, :], 94 | ) 95 | kl = logs_p_ - logs_q_ - 0.5 # [b, t_tp, t_tq] 96 | kl += 0.5 * ((z_p_ - m_p_) ** 2) * torch.exp(-2.0 * logs_p_) 97 | kls += kl 98 | return kls 99 | 100 | kl = logs_p[:, :, :, None] - logs_q[:, :, None, :] - 0.5 # p, q 101 | kl += ( 102 | 0.5 103 | * ((z_p[:, :, None, :] - m_p[:, :, :, None]) ** 2) 104 | * torch.exp(-2.0 * logs_p[:, :, :, None]) 105 | ) 106 | 107 | kl = kl.sum(dim=1) 108 | return kl 109 | 110 | 111 | import torch.utils.checkpoint 112 | 113 | 114 | def kl_loss_dtw(z_p, logs_q, m_p, logs_p, p_mask, q_mask): 115 | INF = 1e5 116 | 117 | kl = get_sdtw_kl_matrix(z_p, logs_q, m_p, logs_p) # [b t_tp t_tq] 118 | kl = torch.nn.functional.pad(kl, (0, 1, 0, 1), "constant", 0) 119 | p_mask = torch.nn.functional.pad(p_mask, (0, 1), "constant", 0) 120 | q_mask = torch.nn.functional.pad(q_mask, (0, 1), "constant", 0) 121 | 122 | kl.masked_fill_(p_mask[:, :, None].bool() ^ q_mask[:, None, :].bool(), INF) 123 | kl.masked_fill_((~p_mask[:, :, None].bool()) & (~q_mask[:, None, :].bool()), 0) 124 | res = sdtw(kl).sum() / p_mask.sum() 125 | return res 126 | 127 | 128 | if __name__ == "__main__": 129 | kl = torch.rand(4, 100, 100) 130 | kl[:, 50:, :] = 1e4 131 | kl[:, :, 50:] = 1e4 132 | kl[:, 50:, 50:] = 0 133 | 134 | print(kl) 135 | print(sdtw(kl).mean() / 50) 136 | 
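Note on the soft-DTW KL loss above: `get_sdtw_kl_matrix` builds the full pairwise KL cost of shape `[b, t_tp, t_tq]` between the prior and the flow-mapped posterior, and `kl_loss_dtw` pads and masks it before reducing it with the module-level `sdtw` instance, so the prior and posterior lengths need not match. The per-channel loop in `get_sdtw_kl_matrix` avoids materializing a `[b, h, t_tp, t_tq]` tensor; the vectorized variant after the early `return kls` is unreachable. A minimal usage sketch (not part of the repository; the shapes and all-ones masks are illustrative):

```python
import torch

from models.losses import kl_loss_dtw

b, h, t_tp, t_tq = 2, 192, 37, 43  # batch, channels, prior/posterior lengths
z_p = torch.randn(b, h, t_tq)      # posterior sample mapped through the flow
logs_q = torch.randn(b, h, t_tq)   # posterior log-stddev
m_p = torch.randn(b, h, t_tp)      # prior mean (at the expanded phoneme level)
logs_p = torch.randn(b, h, t_tp)   # prior log-stddev
p_mask = torch.ones(b, t_tp)       # 1 where a prior frame is valid
q_mask = torch.ones(b, t_tq)       # 1 where a posterior frame is valid

loss = kl_loss_dtw(z_p, logs_q, m_p, logs_p, p_mask, q_mask)
print(loss)  # scalar, normalized by the number of valid prior frames
```

Since the module-level `sdtw` is constructed with `use_cuda=False`, the alignment itself always runs through the numba CPU kernel in `models/soft_dtw.py` (the cost matrix is moved to numpy internally), even when the inputs are CUDA tensors.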
-------------------------------------------------------------------------------- /models/soft_dtw.py: -------------------------------------------------------------------------------- 1 | # MIT License 2 | # 3 | # Copyright (c) 2020 Mehran Maghoumi 4 | # 5 | # Permission is hereby granted, free of charge, to any person obtaining a copy 6 | # of this software and associated documentation files (the "Software"), to deal 7 | # in the Software without restriction, including without limitation the rights 8 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | # copies of the Software, and to permit persons to whom the Software is 10 | # furnished to do so, subject to the following conditions: 11 | # 12 | # The above copyright notice and this permission notice shall be included in all 13 | # copies or substantial portions of the Software. 14 | # 15 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | # SOFTWARE. 22 | # ---------------------------------------------------------------------------------------------------------------------- 23 | 24 | import numpy as np 25 | import torch 26 | import torch.cuda 27 | from numba import jit, prange 28 | from torch.autograd import Function 29 | from numba import cuda 30 | import math 31 | 32 | # ---------------------------------------------------------------------------------------------------------------------- 33 | # 34 | # The following is the CPU implementation based on https://github.com/Sleepwalking/pytorch-softdtw 35 | # Credit goes to Kanru Hua. 36 | # I've added support for batching and pruning. 37 | # 38 | # ---------------------------------------------------------------------------------------------------------------------- 39 | # ---------------------------------------------------------------------------------------------------------------------- 40 | @cuda.jit 41 | def compute_softdtw_cuda(D, gamma, bandwidth, max_i, max_j, n_passes, R): 42 | """ 43 | :param seq_len: The length of the sequence (both inputs are assumed to be of the same size) 44 | :param n_passes: 2 * seq_len - 1 (The number of anti-diagonals) 45 | """ 46 | # Each block processes one pair of examples 47 | b = cuda.blockIdx.x 48 | # We have as many threads as seq_len, because the most number of threads we need 49 | # is equal to the number of elements on the largest anti-diagonal 50 | tid = cuda.threadIdx.x 51 | 52 | # Compute I, J, the indices from [0, seq_len) 53 | 54 | # The row index is always the same as tid 55 | I = tid 56 | 57 | inv_gamma = 1.0 / gamma 58 | 59 | # Go over each anti-diagonal. 
Only process threads that fall on the current on the anti-diagonal 60 | for p in range(n_passes): 61 | 62 | # The index is actually 'p - tid' but need to force it in-bounds 63 | J = max(0, min(p - tid, max_j - 1)) 64 | 65 | # For simplicity, we define i, j which start from 1 (offset from I, J) 66 | i = I + 1 67 | j = J + 1 68 | 69 | # Only compute if element[i, j] is on the current anti-diagonal, and also is within bounds 70 | if I + J == p and (I < max_i and J < max_j): 71 | # Don't compute if outside bandwidth 72 | if not (abs(i - j) > bandwidth > 0): 73 | r0 = -R[b, i - 1, j - 1] * inv_gamma 74 | r1 = -R[b, i - 1, j] * inv_gamma - 0.07 75 | r2 = -R[b, i, j - 1] * inv_gamma - 0.07 76 | rmax = max(max(r0, r1), r2) 77 | rsum = math.exp(r0 - rmax) + math.exp(r1 - rmax) + math.exp(r2 - rmax) 78 | softmin = -gamma * (math.log(rsum) + rmax) 79 | R[b, i, j] = D[b, i - 1, j - 1] + softmin 80 | 81 | # Wait for other threads in this block 82 | cuda.syncthreads() 83 | 84 | 85 | # ---------------------------------------------------------------------------------------------------------------------- 86 | @cuda.jit 87 | def compute_softdtw_backward_cuda( 88 | D, R, inv_gamma, bandwidth, max_i, max_j, n_passes, E 89 | ): 90 | k = cuda.blockIdx.x 91 | tid = cuda.threadIdx.x 92 | 93 | # Indexing logic is the same as above, however, the anti-diagonal needs to 94 | # progress backwards 95 | I = tid 96 | 97 | for p in range(n_passes): 98 | # Reverse the order to make the loop go backward 99 | rev_p = n_passes - p - 1 100 | 101 | # convert tid to I, J, then i, j 102 | J = max(0, min(rev_p - tid, max_j - 1)) 103 | 104 | i = I + 1 105 | j = J + 1 106 | 107 | # Only compute if element[i, j] is on the current anti-diagonal, and also is within bounds 108 | if I + J == rev_p and (I < max_i and J < max_j): 109 | 110 | if math.isinf(R[k, i, j]): 111 | R[k, i, j] = -math.inf 112 | 113 | # Don't compute if outside bandwidth 114 | if not (abs(i - j) > bandwidth > 0): 115 | a = math.exp((R[k, i + 1, j] - R[k, i, j] - D[k, i + 1, j]) * inv_gamma) 116 | b = math.exp((R[k, i, j + 1] - R[k, i, j] - D[k, i, j + 1]) * inv_gamma) 117 | c = math.exp( 118 | (R[k, i + 1, j + 1] - R[k, i, j] - D[k, i + 1, j + 1]) * inv_gamma 119 | ) 120 | E[k, i, j] = ( 121 | E[k, i + 1, j] * a + E[k, i, j + 1] * b + E[k, i + 1, j + 1] * c 122 | ) 123 | 124 | # Wait for other threads in this block 125 | cuda.syncthreads() 126 | 127 | 128 | # ---------------------------------------------------------------------------------------------------------------------- 129 | class _SoftDTWCUDA(Function): 130 | """ 131 | CUDA implementation is inspired by the diagonal one proposed in https://ieeexplore.ieee.org/document/8400444: 132 | "Developing a pattern discovery method in time series data and its GPU acceleration" 133 | """ 134 | 135 | @staticmethod 136 | def forward(ctx, D, gamma, bandwidth, warp): 137 | dev = D.device 138 | dtype = D.dtype 139 | gamma = torch.cuda.FloatTensor([gamma]) 140 | bandwidth = torch.cuda.FloatTensor([bandwidth]) 141 | 142 | B = D.shape[0] 143 | N = D.shape[1] 144 | M = D.shape[2] 145 | threads_per_block = max(N, M) 146 | n_passes = 2 * threads_per_block - 1 147 | 148 | D_ = torch.full((B, N + 2, M + 2), np.inf, dtype=dtype, device=dev) 149 | D_[:, 1 : N + 1, 1 : M + 1] = D 150 | 151 | # Prepare the output array 152 | R = torch.ones((B, N + 2, M + 2), device=dev, dtype=dtype) * math.inf 153 | R[:, 0, 0] = 0 154 | 155 | # Run the CUDA kernel. 
156 | # Set CUDA's grid size to be equal to the batch size (every CUDA block processes one sample pair) 157 | # Set the CUDA block size to be equal to the length of the longer sequence (equal to the size of the largest diagonal) 158 | compute_softdtw_cuda[B, threads_per_block]( 159 | cuda.as_cuda_array(D_.detach()), 160 | gamma.item(), 161 | bandwidth.item(), 162 | N, 163 | M, 164 | n_passes, 165 | cuda.as_cuda_array(R), 166 | ) 167 | ctx.save_for_backward(D, R.clone(), gamma, bandwidth) 168 | return R[:, -2, -2] 169 | 170 | @staticmethod 171 | def backward(ctx, grad_output): 172 | dev = grad_output.device 173 | dtype = grad_output.dtype 174 | D, R, gamma, bandwidth = ctx.saved_tensors 175 | 176 | B = D.shape[0] 177 | N = D.shape[1] 178 | M = D.shape[2] 179 | threads_per_block = max(N, M) 180 | n_passes = 2 * threads_per_block - 1 181 | 182 | D_ = torch.zeros((B, N + 2, M + 2), dtype=dtype, device=dev) 183 | D_[:, 1 : N + 1, 1 : M + 1] = D 184 | 185 | R[:, :, -1] = -math.inf 186 | R[:, -1, :] = -math.inf 187 | R[:, -1, -1] = R[:, -2, -2] 188 | 189 | E = torch.zeros((B, N + 2, M + 2), dtype=dtype, device=dev) 190 | E[:, -1, -1] = 1 191 | 192 | # Grid and block sizes are set same as done above for the forward() call 193 | compute_softdtw_backward_cuda[B, threads_per_block]( 194 | cuda.as_cuda_array(D_), 195 | cuda.as_cuda_array(R), 196 | 1.0 / gamma.item(), 197 | bandwidth.item(), 198 | N, 199 | M, 200 | n_passes, 201 | cuda.as_cuda_array(E), 202 | ) 203 | E = E[:, 1 : N + 1, 1 : M + 1] 204 | return grad_output.view(-1, 1, 1).expand_as(E) * E, None, None, None 205 | 206 | 207 | # ------------------------------------------------- 208 | 209 | 210 | @jit(nopython=True, parallel=True) 211 | def compute_softdtw(D, gamma, bandwidth, warp): 212 | B = D.shape[0] 213 | N = D.shape[1] 214 | M = D.shape[2] 215 | R = np.ones((B, N + 2, M + 2)) * np.inf 216 | R[:, 0, 0] = 0 217 | for b in prange(B): 218 | for j in range(1, M + 1): 219 | for i in range(1, N + 1): 220 | 221 | # Check the pruning condition 222 | if 0 < bandwidth < np.abs(i - j): 223 | continue 224 | 225 | r0 = -R[b, i - 1, j - 1] / gamma 226 | # r1 = -(R[b, i - 1, j]) / gamma 227 | # r2 = -(R[b, i, j - 1]) / gamma 228 | r1 = -(R[b, i - 1, j] + warp) / gamma 229 | r2 = -(R[b, i, j - 1] + warp) / gamma 230 | rmax = max(max(r0, r1), r2) 231 | rsum = np.exp(r0 - rmax) + np.exp(r1 - rmax) + np.exp(r2 - rmax) 232 | softmin = -gamma * (np.log(rsum) + rmax) 233 | R[b, i, j] = D[b, i - 1, j - 1] + softmin 234 | return R 235 | 236 | 237 | # ---------------------------------------------------------------------------------------------------------------------- 238 | @jit(nopython=True, parallel=True) 239 | def compute_softdtw_backward(D_, R, gamma, bandwidth, warp): 240 | B = D_.shape[0] 241 | N = D_.shape[1] 242 | M = D_.shape[2] 243 | D = np.zeros((B, N + 2, M + 2)) 244 | E = np.zeros((B, N + 2, M + 2)) 245 | D[:, 1 : N + 1, 1 : M + 1] = D_ 246 | E[:, -1, -1] = 1 247 | R[:, :, -1] = -np.inf 248 | R[:, -1, :] = -np.inf 249 | R[:, -1, -1] = R[:, -2, -2] 250 | for k in prange(B): 251 | for j in range(M, 0, -1): 252 | for i in range(N, 0, -1): 253 | 254 | if np.isinf(R[k, i, j]): 255 | R[k, i, j] = -np.inf 256 | 257 | # Check the pruning condition 258 | if 0 < bandwidth < np.abs(i - j): 259 | continue 260 | 261 | # a0 = (R[k, i + 1, j] - R[k, i, j] - D[k, i + 1, j]) / gamma 262 | # b0 = (R[k, i, j + 1] - R[k, i, j] - D[k, i, j + 1]) / gamma 263 | a0 = (R[k, i + 1, j] - R[k, i, j] - D[k, i + 1, j] - warp) / gamma 264 | b0 = (R[k, i, j + 1] - R[k, i, j] - 
D[k, i, j + 1] - warp) / gamma 265 | c0 = (R[k, i + 1, j + 1] - R[k, i, j] - D[k, i + 1, j + 1]) / gamma 266 | a = np.exp(a0) 267 | b = np.exp(b0) 268 | c = np.exp(c0) 269 | E[k, i, j] = ( 270 | E[k, i + 1, j] * a + E[k, i, j + 1] * b + E[k, i + 1, j + 1] * c 271 | ) 272 | return E[:, 1 : N + 1, 1 : M + 1] 273 | 274 | 275 | # ---------------------------------------------------------------------------------------------------------------------- 276 | class _SoftDTW(Function): 277 | """ 278 | CPU implementation based on https://github.com/Sleepwalking/pytorch-softdtw 279 | """ 280 | 281 | @staticmethod 282 | def forward(ctx, D, gamma, bandwidth, warp): 283 | dev = D.device 284 | dtype = D.dtype 285 | gamma = torch.Tensor([gamma]).to(dev).type(dtype) # dtype fixed 286 | bandwidth = torch.Tensor([bandwidth]).to(dev).type(dtype) 287 | warp = torch.Tensor([warp]).to(dev).type(dtype) 288 | 289 | D_ = D.detach().cpu().numpy() 290 | g_ = gamma.item() 291 | b_ = bandwidth.item() 292 | w_ = warp.item() 293 | 294 | R = torch.Tensor(compute_softdtw(D_, g_, b_, w_)).to(dev).type(dtype) 295 | ctx.save_for_backward(D, R, gamma, bandwidth, warp) 296 | return R[:, -2, -2] 297 | 298 | @staticmethod 299 | def backward(ctx, grad_output): 300 | dev = grad_output.device 301 | dtype = grad_output.dtype 302 | D, R, gamma, bandwidth, warp = ctx.saved_tensors 303 | D_ = D.detach().cpu().numpy() 304 | R_ = R.detach().cpu().numpy() 305 | g_ = gamma.item() 306 | b_ = bandwidth.item() 307 | w_ = warp.item() 308 | E = ( 309 | torch.Tensor(compute_softdtw_backward(D_, R_, g_, b_, w_)) 310 | .to(dev) 311 | .type(dtype) 312 | ) 313 | return grad_output.view(-1, 1, 1).expand_as(E) * E, None, None, None 314 | 315 | 316 | # ---------------------------------------------------------------------------------------------------------------------- 317 | class SoftDTW(torch.nn.Module): 318 | """ 319 | The soft DTW implementation that optionally supports CUDA 320 | """ 321 | 322 | def __init__( 323 | self, gamma=1.0, use_cuda=True, normalize=False, bandwidth=None, warp=0.07 324 | ): 325 | """ 326 | Initializes a new instance using the supplied parameters 327 | :param use_cuda: Flag indicating whether the CUDA implementation should be used 328 | :param gamma: sDTW's gamma parameter 329 | :param normalize: Flag indicating whether to perform normalization 330 | (as discussed in https://github.com/mblondel/soft-dtw/issues/10#issuecomment-383564790) 331 | :param bandwidth: Sakoe-Chiba bandwidth for pruning. Passing 'None' will disable pruning. 332 | :param dist_func: Optional point-wise distance function to use. If 'None', then a default Euclidean distance function will be used. 333 | """ 334 | super(SoftDTW, self).__init__() 335 | self.gamma = gamma 336 | self.use_cuda = use_cuda 337 | self.bandwidth = 0 if bandwidth is None else float(bandwidth) 338 | 339 | self.warp = warp 340 | 341 | def _get_func_dtw(self, x): 342 | """ 343 | Checks the inputs and selects the proper implementation to use. 
344 | """ 345 | # Finally, return the correct function 346 | return _SoftDTW.apply if not self.use_cuda else _SoftDTWCUDA.apply 347 | 348 | def forward(self, X): 349 | """ 350 | Compute the soft-DTW value between X and Y 351 | :param X: One batch of examples, batch_size x seq_len x dims 352 | :param Y: The other batch of examples, batch_size x seq_len x dims 353 | :return: The computed results 354 | """ 355 | 356 | # Check the inputs and get the correct implementation 357 | func_dtw = self._get_func_dtw(X) 358 | 359 | # D_xy = self.dist_func(X, Y) 360 | D_xy = X 361 | return func_dtw(D_xy, self.gamma, self.bandwidth, self.warp) 362 | 363 | 364 | # ---------------------------------------------------------------------------------------------------------------------- 365 | def timed_run(a, b, sdtw): 366 | """ 367 | Runs a and b through sdtw, and times the forward and backward passes. 368 | Assumes that a requires gradients. 369 | :return: timing, forward result, backward result 370 | """ 371 | from timeit import default_timer as timer 372 | 373 | # Forward pass 374 | start = timer() 375 | forward = sdtw(a, b) 376 | end = timer() 377 | t = end - start 378 | 379 | grad_outputs = torch.ones_like(forward) 380 | 381 | # Backward 382 | start = timer() 383 | grads = torch.autograd.grad(forward, a, grad_outputs=grad_outputs)[0] 384 | end = timer() 385 | 386 | # Total time 387 | t += end - start 388 | 389 | return t, forward, grads 390 | 391 | 392 | # ---------------------------------------------------------------------------------------------------------------------- 393 | def profile(batch_size, seq_len_a, seq_len_b, dims, tol_backward): 394 | sdtw = SoftDTW(False, gamma=1.0, normalize=False) 395 | sdtw_cuda = SoftDTW(True, gamma=1.0, normalize=False) 396 | n_iters = 6 397 | 398 | print( 399 | "Profiling forward() + backward() times for batch_size={}, seq_len_a={}, seq_len_b={}, dims={}...".format( 400 | batch_size, seq_len_a, seq_len_b, dims 401 | ) 402 | ) 403 | 404 | times_cpu = [] 405 | times_gpu = [] 406 | 407 | for i in range(n_iters): 408 | a_cpu = torch.rand((batch_size, seq_len_a, dims), requires_grad=True) 409 | b_cpu = torch.rand((batch_size, seq_len_b, dims)) 410 | a_gpu = a_cpu.cuda() 411 | b_gpu = b_cpu.cuda() 412 | 413 | # GPU 414 | t_gpu, forward_gpu, backward_gpu = timed_run(a_gpu, b_gpu, sdtw_cuda) 415 | 416 | # CPU 417 | t_cpu, forward_cpu, backward_cpu = timed_run(a_cpu, b_cpu, sdtw) 418 | 419 | # Verify the results 420 | assert torch.allclose(forward_cpu, forward_gpu.cpu()) 421 | assert torch.allclose(backward_cpu, backward_gpu.cpu(), atol=tol_backward) 422 | 423 | if ( 424 | i > 0 425 | ): # Ignore the first time we run, in case this is a cold start (because timings are off at a cold start of the script) 426 | times_cpu += [t_cpu] 427 | times_gpu += [t_gpu] 428 | 429 | # Average and log 430 | avg_cpu = np.mean(times_cpu) 431 | avg_gpu = np.mean(times_gpu) 432 | print(" CPU: ", avg_cpu) 433 | print(" GPU: ", avg_gpu) 434 | print(" Speedup: ", avg_cpu / avg_gpu) 435 | print() 436 | 437 | 438 | # ---------------------------------------------------------------------------------------------------------------------- 439 | if __name__ == "__main__": 440 | from timeit import default_timer as timer 441 | 442 | torch.manual_seed(1234) 443 | 444 | profile(128, 17, 15, 2, tol_backward=1e-6) 445 | profile(512, 64, 64, 2, tol_backward=1e-4) 446 | profile(512, 256, 256, 2, tol_backward=1e-3) 447 | -------------------------------------------------------------------------------- 
/naturalspeech_training.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "OKBM9jO5emtp" 7 | }, 8 | "source": [ 9 | "## Set up naturalspeech" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "id": "MnnIzNDiO1MH" 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "%cd /content/\n", 21 | "!git clone https://github.com/heatz123/naturalspeech\n", 22 | "%cd /content/naturalspeech\n", 23 | "!pip install -r requirements.txt\n", 24 | "%cd /content/\n", 25 | "!wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 " 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "id": "OWKnNm3xjf7g" 32 | }, 33 | "source": [ 34 | "## Create symbolic link to dataset" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": { 41 | "id": "oXDKCx1FPnzI" 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "%cd /content/\n", 46 | "!tar -xf LJSpeech-1.1.tar.bz2\n", 47 | "%cd /content/naturalspeech\n", 48 | "!ln -s /content/LJSpeech-1.1/wavs/ DUMMY1" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": { 54 | "id": "Uf63QzpajptC" 55 | }, 56 | "source": [ 57 | "## Uncompress durations labels" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "id": "jt1t3bhObSji" 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "!mv /content/naturalspeech/durations/durations.tar.bz2 /content/naturalspeech/\n", 69 | "!rm -r /content/naturalspeech/durations\n", 70 | "%cd /content/naturalspeech/\n", 71 | "!tar -xf durations.tar.bz2" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": { 77 | "id": "X3dYAXA8j81B" 78 | }, 79 | "source": [ 80 | "## Warmup Training" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": { 87 | "id": "x49LdKJ_ciuX" 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "%cd /content/naturalspeech\n", 92 | "!python3 train.py -c configs/ljs.json -m exp1 --warmup" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "source": [ 98 | "## Training" 99 | ], 100 | "metadata": { 101 | "id": "qE4YV8C19y3d" 102 | } 103 | }, 104 | { 105 | "cell_type": "code", 106 | "source": [ 107 | "!python3 attach_memory_bank.py -c configs/ljs.json --weights_path logs/ext/G_500.pth" 108 | ], 109 | "metadata": { 110 | "id": "t9JH9TZD9zzx" 111 | }, 112 | "execution_count": null, 113 | "outputs": [] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "source": [ 118 | "!python3 train.py -c configs/ljs.json -m exp1" 119 | ], 120 | "metadata": { 121 | "id": "W8bqOU-q98cJ" 122 | }, 123 | "execution_count": null, 124 | "outputs": [] 125 | } 126 | ], 127 | "metadata": { 128 | "accelerator": "GPU", 129 | "colab": { 130 | "provenance": [] 131 | }, 132 | "gpuClass": "standard", 133 | "kernelspec": { 134 | "display_name": "Python 3", 135 | "name": "python3" 136 | }, 137 | "language_info": { 138 | "name": "python" 139 | } 140 | }, 141 | "nbformat": 4, 142 | "nbformat_minor": 0 143 | } -------------------------------------------------------------------------------- /preprocess_durations.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import text 3 | from utils import load_filepaths_and_text 4 | 5 | # from data_utils import ( 6 | # TextAudioLoader, 7 | # ) 8 | from text.symbols import symbols 9 | from models import ( 10 | 
SynthesizerTrn, 11 | ) 12 | import utils 13 | from torch.utils.data import DataLoader 14 | import torch 15 | 16 | import os 17 | import random 18 | import numpy as np 19 | import torch 20 | import torch.utils.data 21 | 22 | import commons 23 | from mel_processing import spectrogram_torch 24 | from utils import load_wav_to_torch, load_filepaths_and_text 25 | from text import text_to_sequence, cleaned_text_to_sequence 26 | 27 | import json 28 | 29 | 30 | class TextAudioLoaderWithPath(torch.utils.data.Dataset): 31 | """ 32 | 1) loads audio, text pairs 33 | 2) normalizes text and converts them to sequences of integers 34 | 3) computes spectrograms from audio files. 35 | """ 36 | 37 | def __init__(self, audiopaths_and_text, hparams): 38 | self.audiopaths_and_text = load_filepaths_and_text(audiopaths_and_text) 39 | self.text_cleaners = hparams.text_cleaners 40 | self.max_wav_value = hparams.max_wav_value 41 | self.sampling_rate = hparams.sampling_rate 42 | self.filter_length = hparams.filter_length 43 | self.hop_length = hparams.hop_length 44 | self.win_length = hparams.win_length 45 | self.sampling_rate = hparams.sampling_rate 46 | 47 | self.cleaned_text = getattr(hparams, "cleaned_text", False) 48 | 49 | self.add_blank = hparams.add_blank 50 | self.min_text_len = getattr(hparams, "min_text_len", 1) 51 | self.max_text_len = getattr(hparams, "max_text_len", 190) 52 | 53 | random.seed(1234) 54 | random.shuffle(self.audiopaths_and_text) 55 | self._filter() 56 | 57 | def _filter(self): 58 | """ 59 | Filter text & store spec lengths 60 | """ 61 | # Store spectrogram lengths for Bucketing 62 | # wav_length ~= file_size / (wav_channels * Bytes per dim) = file_size / (1 * 2) 63 | # spec_length = wav_length // hop_length 64 | 65 | audiopaths_and_text_new = [] 66 | lengths = [] 67 | for audiopath, text in self.audiopaths_and_text: 68 | if self.min_text_len <= len(text) and len(text) <= self.max_text_len: 69 | audiopaths_and_text_new.append([audiopath, text]) 70 | lengths.append(os.path.getsize(audiopath) // (2 * self.hop_length)) 71 | self.audiopaths_and_text = audiopaths_and_text_new 72 | self.lengths = lengths 73 | 74 | def get_audio_text_pair(self, audiopath_and_text): 75 | # separate filename and text 76 | audiopath, text = audiopath_and_text[0], audiopath_and_text[1] 77 | text = self.get_text(text) 78 | spec, wav = self.get_audio(audiopath) 79 | return (text, spec, wav, audiopath) 80 | 81 | def get_audio(self, filename): 82 | audio, sampling_rate = load_wav_to_torch(filename) 83 | if sampling_rate != self.sampling_rate: 84 | raise ValueError( 85 | "{} SR doesn't match target {} SR".format( 86 | sampling_rate, self.sampling_rate 87 | ) 88 | ) 89 | audio_norm = audio / self.max_wav_value 90 | audio_norm = audio_norm.unsqueeze(0) 91 | spec_filename = filename.replace(".wav", ".spec.pt") 92 | if os.path.exists(spec_filename): 93 | spec = torch.load(spec_filename) 94 | else: 95 | spec = spectrogram_torch( 96 | audio_norm, 97 | self.filter_length, 98 | self.sampling_rate, 99 | self.hop_length, 100 | self.win_length, 101 | center=False, 102 | ) 103 | spec = torch.squeeze(spec, 0) 104 | torch.save(spec, spec_filename) 105 | return spec, audio_norm 106 | 107 | def get_text(self, text): 108 | if self.cleaned_text: 109 | text_norm = cleaned_text_to_sequence(text) 110 | else: 111 | text_norm = text_to_sequence(text, self.text_cleaners) 112 | if self.add_blank: 113 | text_norm = commons.intersperse(text_norm, 0) 114 | text_norm = torch.LongTensor(text_norm) 115 | return text_norm 116 | 117 | def 
__getitem__(self, index): 118 | return self.get_audio_text_pair(self.audiopaths_and_text[index]) 119 | 120 | def __len__(self): 121 | return len(self.audiopaths_and_text) 122 | 123 | 124 | class TextAudioCollateWithPath: 125 | """Zero-pads model inputs and targets""" 126 | 127 | def __call__(self, batch): 128 | """Collates training batch from normalized text and audio 129 | PARAMS 130 | ------ 131 | batch: [text_normalized, spec_normalized, wav_normalized] 132 | """ 133 | # Right zero-pad all one-hot text sequences to max input length 134 | _, ids_sorted_decreasing = torch.sort( 135 | torch.LongTensor([x[1].size(1) for x in batch]), dim=0, descending=True 136 | ) 137 | 138 | max_text_len = max([len(x[0]) for x in batch]) 139 | max_spec_len = max([x[1].size(1) for x in batch]) 140 | max_wav_len = max([x[2].size(1) for x in batch]) 141 | 142 | text_lengths = torch.LongTensor(len(batch)) 143 | spec_lengths = torch.LongTensor(len(batch)) 144 | wav_lengths = torch.LongTensor(len(batch)) 145 | 146 | text_padded = torch.LongTensor(len(batch), max_text_len) 147 | spec_padded = torch.FloatTensor(len(batch), batch[0][1].size(0), max_spec_len) 148 | wav_padded = torch.FloatTensor(len(batch), 1, max_wav_len) 149 | text_padded.zero_() 150 | spec_padded.zero_() 151 | wav_padded.zero_() 152 | for i in range(len(ids_sorted_decreasing)): 153 | row = batch[ids_sorted_decreasing[i]] 154 | 155 | text = row[0] 156 | text_padded[i, : text.size(0)] = text 157 | text_lengths[i] = text.size(0) 158 | 159 | spec = row[1] 160 | spec_padded[i, :, : spec.size(1)] = spec 161 | spec_lengths[i] = spec.size(1) 162 | 163 | wav = row[2] 164 | wav_padded[i, :, : wav.size(1)] = wav 165 | wav_lengths[i] = wav.size(1) 166 | 167 | paths = [x[3] for x in batch] 168 | 169 | return ( 170 | text_padded, 171 | text_lengths, 172 | spec_padded, 173 | spec_lengths, 174 | wav_padded, 175 | wav_lengths, 176 | paths, 177 | ) 178 | 179 | 180 | if __name__ == "__main__": 181 | parser = argparse.ArgumentParser() 182 | parser.add_argument("--weights_path", default="./pretrained_ljs.pth") 183 | parser.add_argument( 184 | "--filelists", 185 | nargs="+", 186 | default=[ 187 | "filelists/ljs_audio_text_val_filelist.txt", 188 | "filelists/ljs_audio_text_test_filelist.txt", 189 | ], 190 | ) 191 | 192 | args = parser.parse_args() 193 | 194 | hps = utils.get_hparams_from_file("./configs/ljs_base.json") 195 | net_g = SynthesizerTrn( 196 | len(symbols), 197 | hps.data.filter_length // 2 + 1, 198 | hps.train.segment_size // hps.data.hop_length, 199 | **hps.model, 200 | ).cuda() 201 | _ = utils.load_checkpoint(args.weights_path, net_g, None) 202 | net_g.eval() 203 | 204 | print(args.filelists) 205 | 206 | os.makedirs("durations", exist_ok=True) 207 | 208 | for filelist in args.filelists: 209 | dataset = TextAudioLoaderWithPath(filelist, hps.data) 210 | dataloader = DataLoader( 211 | dataset, 212 | batch_size=1, 213 | shuffle=False, 214 | pin_memory=False, 215 | collate_fn=TextAudioCollateWithPath(), 216 | ) 217 | 218 | with torch.no_grad(): 219 | for batch_idx, ( 220 | x, 221 | x_lengths, 222 | spec, 223 | spec_lengths, 224 | y, 225 | y_lengths, 226 | paths, 227 | ) in enumerate(dataloader): 228 | x, x_lengths = x.cuda(0), x_lengths.cuda(0) 229 | spec, spec_lengths = spec.cuda(0), spec_lengths.cuda(0) 230 | y, y_lengths = y.cuda(0), y_lengths.cuda(0) 231 | (path,) = paths 232 | 233 | ( 234 | y_hat, 235 | l_length, 236 | attn, 237 | ids_slice, 238 | x_mask, 239 | z_mask, 240 | (z, z_p, m_p, logs_p, m_q, logs_q), 241 | ) = net_g(x, x_lengths, spec, 
spec_lengths) 242 | 243 | w = attn.sum(2).flatten() 244 | x = x.flatten() 245 | 246 | save_path = os.path.join("durations", os.path.basename(path) + ".npy") 247 | np.save(save_path, w.cpu().numpy().astype(int)) 248 | 249 | if batch_idx % 100 == 0: 250 | print(f"{batch_idx} / {len(dataloader)}") 251 | -------------------------------------------------------------------------------- /preprocess_texts.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import text 3 | from utils.utils import load_filepaths_and_text 4 | 5 | if __name__ == "__main__": 6 | parser = argparse.ArgumentParser() 7 | parser.add_argument("--out_extension", default="cleaned") 8 | parser.add_argument("--text_index", default=1, type=int) 9 | parser.add_argument( 10 | "--filelists", 11 | nargs="+", 12 | default=[ 13 | "filelists/ljs_audio_text_val_filelist.txt", 14 | "filelists/ljs_audio_text_test_filelist.txt", 15 | ], 16 | ) 17 | parser.add_argument("--text_cleaners", nargs="+", default=["english_cleaners2"]) 18 | 19 | args = parser.parse_args() 20 | 21 | for filelist in args.filelists: 22 | print("START:", filelist) 23 | filepaths_and_text = load_filepaths_and_text(filelist) 24 | for i in range(len(filepaths_and_text)): 25 | original_text = filepaths_and_text[i][args.text_index] 26 | cleaned_text = text._clean_text(original_text, args.text_cleaners) 27 | filepaths_and_text[i][args.text_index] = cleaned_text 28 | 29 | new_filelist = filelist + "." + args.out_extension 30 | with open(new_filelist, "w", encoding="utf-8") as f: 31 | f.writelines(["|".join(x) + "\n" for x in filepaths_and_text]) 32 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | Cython>=0.29.21 2 | librosa>=0.8.0 3 | matplotlib>=3.3.1 4 | numpy>=1.18.5 5 | phonemizer>=2.2.1 6 | scipy>=1.5.2 7 | tensorboard>=2.3.0 8 | torch>=1.6.0 9 | torchvision>=0.7.0 10 | Unidecode>=1.1.1 11 | numba 12 | scikit-learn>=1.2 -------------------------------------------------------------------------------- /resources/figure1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/heatz123/naturalspeech/ca2f814d960149a8fdc3f3e4d773abb67e7c18ec/resources/figure1.png -------------------------------------------------------------------------------- /text/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2017 Keith Ito 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 20 | -------------------------------------------------------------------------------- /text/__init__.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | from text import cleaners 3 | from text.symbols import symbols 4 | 5 | 6 | # Mappings from symbol to numeric ID and vice versa: 7 | _symbol_to_id = {s: i for i, s in enumerate(symbols)} 8 | _id_to_symbol = {i: s for i, s in enumerate(symbols)} 9 | 10 | 11 | def text_to_sequence(text, cleaner_names): 12 | """Converts a string of text to a sequence of IDs corresponding to the symbols in the text. 13 | Args: 14 | text: string to convert to a sequence 15 | cleaner_names: names of the cleaner functions to run the text through 16 | Returns: 17 | List of integers corresponding to the symbols in the text 18 | """ 19 | sequence = [] 20 | 21 | clean_text = _clean_text(text, cleaner_names) 22 | for symbol in clean_text: 23 | symbol_id = _symbol_to_id[symbol] 24 | sequence += [symbol_id] 25 | return sequence 26 | 27 | 28 | def cleaned_text_to_sequence(cleaned_text): 29 | """Converts a string of text to a sequence of IDs corresponding to the symbols in the text. 30 | Args: 31 | text: string to convert to a sequence 32 | Returns: 33 | List of integers corresponding to the symbols in the text 34 | """ 35 | sequence = [_symbol_to_id[symbol] for symbol in cleaned_text] 36 | return sequence 37 | 38 | 39 | def sequence_to_text(sequence): 40 | """Converts a sequence of IDs back to a string""" 41 | result = "" 42 | for symbol_id in sequence: 43 | s = _id_to_symbol[symbol_id] 44 | result += s 45 | return result 46 | 47 | 48 | def _clean_text(text, cleaner_names): 49 | for name in cleaner_names: 50 | cleaner = getattr(cleaners, name) 51 | if not cleaner: 52 | raise Exception("Unknown cleaner: %s" % name) 53 | text = cleaner(text) 54 | return text 55 | -------------------------------------------------------------------------------- /text/cleaners.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | """ 4 | Cleaners are transformations that run over the input text at both training and eval time. 5 | 6 | Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners" 7 | hyperparameter. Some cleaners are English-specific. You'll typically want to use: 8 | 1. "english_cleaners" for English text 9 | 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using 10 | the Unidecode library (https://pypi.python.org/pypi/Unidecode) 11 | 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update 12 | the symbols in symbols.py to match your data). 13 | """ 14 | 15 | import re 16 | from unidecode import unidecode 17 | from phonemizer import phonemize 18 | 19 | 20 | # Regular expression matching whitespace: 21 | _whitespace_re = re.compile(r"\s+") 22 | 23 | # List of (regular expression, replacement) pairs for abbreviations: 24 | _abbreviations = [ 25 | (re.compile("\\b%s\\." 
% x[0], re.IGNORECASE), x[1]) 26 | for x in [ 27 | ("mrs", "misess"), 28 | ("mr", "mister"), 29 | ("dr", "doctor"), 30 | ("st", "saint"), 31 | ("co", "company"), 32 | ("jr", "junior"), 33 | ("maj", "major"), 34 | ("gen", "general"), 35 | ("drs", "doctors"), 36 | ("rev", "reverend"), 37 | ("lt", "lieutenant"), 38 | ("hon", "honorable"), 39 | ("sgt", "sergeant"), 40 | ("capt", "captain"), 41 | ("esq", "esquire"), 42 | ("ltd", "limited"), 43 | ("col", "colonel"), 44 | ("ft", "fort"), 45 | ] 46 | ] 47 | 48 | 49 | def expand_abbreviations(text): 50 | for regex, replacement in _abbreviations: 51 | text = re.sub(regex, replacement, text) 52 | return text 53 | 54 | 55 | def expand_numbers(text): 56 | return normalize_numbers(text) 57 | 58 | 59 | def lowercase(text): 60 | return text.lower() 61 | 62 | 63 | def collapse_whitespace(text): 64 | return re.sub(_whitespace_re, " ", text) 65 | 66 | 67 | def convert_to_ascii(text): 68 | return unidecode(text) 69 | 70 | 71 | def basic_cleaners(text): 72 | """Basic pipeline that lowercases and collapses whitespace without transliteration.""" 73 | text = lowercase(text) 74 | text = collapse_whitespace(text) 75 | return text 76 | 77 | 78 | def transliteration_cleaners(text): 79 | """Pipeline for non-English text that transliterates to ASCII.""" 80 | text = convert_to_ascii(text) 81 | text = lowercase(text) 82 | text = collapse_whitespace(text) 83 | return text 84 | 85 | 86 | def english_cleaners(text): 87 | """Pipeline for English text, including abbreviation expansion.""" 88 | text = convert_to_ascii(text) 89 | text = lowercase(text) 90 | text = expand_abbreviations(text) 91 | phonemes = phonemize(text, language="en-us", backend="espeak", strip=True) 92 | phonemes = collapse_whitespace(phonemes) 93 | return phonemes 94 | 95 | 96 | def english_cleaners2(text): 97 | """Pipeline for English text, including abbreviation expansion. + punctuation + stress""" 98 | text = convert_to_ascii(text) 99 | text = lowercase(text) 100 | text = expand_abbreviations(text) 101 | phonemes = phonemize( 102 | text, 103 | language="en-us", 104 | backend="espeak", 105 | strip=True, 106 | preserve_punctuation=True, 107 | with_stress=True, 108 | ) 109 | phonemes = collapse_whitespace(phonemes) 110 | return phonemes 111 | -------------------------------------------------------------------------------- /text/symbols.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | """ 4 | Defines the set of symbols used in text input to the model. 5 | """ 6 | _pad = "_" 7 | _punctuation = ';:,.!?¡¿—…"«»“” ' 8 | _letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" 9 | _letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ" 10 | 11 | 12 | # Export all symbols: 13 | symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa) 14 | 15 | # Special symbol ids 16 | SPACE_ID = symbols.index(" ") 17 | -------------------------------------------------------------------------------- /utils/README.md: -------------------------------------------------------------------------------- 1 | # NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality 2 | 3 | This is an implementation of Microsoft's [NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality](https://arxiv.org/abs/2205.04421) in Pytorch. 4 | 5 | Contribution and pull requests are highly appreciated! 
6 | 7 | 23.02.01: Pretrained models or demo samples will soon be released. 8 | 9 | 10 | ### Overview 11 | 12 | ![figure1](resources/figure1.png) 13 | 14 | Naturalspeech is a VAE-based model that employs several techniques to improve the prior and simplify the posterior. It differs from VITS in several ways, including: 15 | - **Phoneme pre-training**: Naturalspeech uses a pre-trained phoneme encoder on a large text corpus, obtained through masked language modeling on phoneme sequences. 16 | - **Differentiable durator**: The posterior operates at the frame level, while the prior operates at the phoneme level. Naturalspeech uses a differentiable durator to bridge the length difference, resulting in soft and flexible features that are expanded. 17 | - **Bidirectional Prior/Posterior**: Naturalspeech reduces the posterior and enhances the prior through normalizing flow, which maps in both directions with forward and backward loss. 18 | - **Memory-based VAE**: The prior is further enhanced through a memory bank using Q-K-V attention." 19 | 20 | 21 | ### Notes 22 | 1. Phoneme pre-training with large-scale text corpus from the news-crawl dataset is omitted in this implementation. 23 | 2. Multiplier for each loss is denoted and can be adjusted in config, as using losses with no multiplier doesn't seem to converge. 24 | 3. Tuning stage for last 2k epochs is omitted. 25 | 4. As soft-dtw loss uses quite a lot of VRAM, there is an option for using non-softdtw loss. 26 | 5. For soft-dtw loss, warp is set to 134.4(=0.07 * 192), not 0.07, to match with non-softdtw loss. 27 | 6. To train the duration predictor in the warmup stage, duration labels are needed. As stated in the paper, you can choose any tools to provide duration label. Here I used pretrained VITS model. 28 | 7. For memory efficient training, partial sequences are fed to decoder as in VITS. 29 | 30 | 31 | ### How to train 32 | 33 | 0. 34 | ``` 35 | # python >= 3.6 36 | pip install -r requirements.txt 37 | ``` 38 | 39 | 1. clone this repository 40 | 1. download `The LJ Speech Dataset`: [link](https://keithito.com/LJ-Speech-Dataset/) 41 | 1. create symbolic link to ljspeech dataset: 42 | ``` 43 | ln -s /path/to/LJSpeech-1.1/wavs/ DUMMY1 44 | ``` 45 | 1. text preprocessing (optional, if you are using custom dataset): 46 | 1. `apt-get install espeak` 47 | 2. 48 | ``` 49 | python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 50 | ``` 51 | 52 | 1. duration preprocessing (obtain duration labels using pretrained VITS): 53 | 1. `git clone https://github.com/jaywalnut310/vits.git; cd vits` 54 | 2. create symbolic link to ljspeech dataset 55 | ``` 56 | ln -s /path/to/LJSpeech-1.1/wavs/ DUMMY1 57 | ``` 58 | 3. download pretrained VITS model described as from VITS official github: [github link](https://github.com/jaywalnut310/vits) / [pretrained models](https://drive.google.com/drive/folders/1ksarh-cJf3F5eKJjLVWY0X1j1qsQqiS2) 59 | 4. setup monotonic alignment search (for VITS inference): 60 | ``` 61 | cd monotonic_align; mkdir monotonic_align; python setup.py build_ext --inplace; cd .. 62 | ``` 63 | 5. copy duration preprocessing script to VITS repo: `cp /path/to/naturalspeech/preprocess_durations.py .` 64 | 6. 
65 | ``` 66 | python3 preprocess_durations.py --weights_path ./pretrained_ljs.pth --filelists filelists/ljs_audio_text_train_filelist.txt.cleaned filelists/ljs_audio_text_val_filelist.txt.cleaned filelists/ljs_audio_text_test_filelist.txt.cleaned 67 | ``` 68 | 7. once the duration labels are created, copy the labels to the naturalspeech repo: `cp -r durations/ path/to/naturalspeech` 69 | 70 | 1. train (warmup) 71 | ``` 72 | python3 train.py -c configs/ljs.json -m [run_name] --warmup 73 | ``` 74 | Note here that ljs.json is for low-resource training, which runs for 1500 epochs and does not use soft-dtw loss. If you want to reproduce the steps stated in the paper, use ljs_reproduce.json, which runs for 15000 epochs and uses soft-dtw loss. 75 | 76 | 1. initialize and attach memory bank after warmup: 77 | ``` 78 | python3 attach_memory_bank.py -c configs/ljs.json --weights_path logs/[run_name]/G_xxx.pth 79 | ``` 80 | if you lack memory, you can specify the "--num_samples" argument to use only a subset of samples. 81 | 82 | 1. train (resume) 83 | ``` 84 | python3 train.py -c configs/ljs.json -m [run_name] 85 | ``` 86 | 87 | You can use tensorboard to monitor the training. 88 | ``` 89 | tensorboard --logdir /path/to/naturalspeech/logs 90 | ``` 91 | 92 | During each evaluation phase, a selection of samples from the test set is evaluated and saved in the `logs/[run_name]/eval` directory. 93 | 94 | 95 | 96 | ## References 97 | - [VITS implemetation](https://github.com/jaywalnut310/vits) by @jaywalnut310 for normalizing flows, phoneme encoder, and hifi-gan decoder implementation 98 | - [Parallel Tacotron 2 Implementation](https://github.com/keonlee9420/Parallel-Tacotron2) by @keonlee9420 for learnable upsampling Layer 99 | - [soft-dtw implementation](https://github.com/Maghoumi/pytorch-softdtw-cuda) by @Maghoumi for sdtw loss 100 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/heatz123/naturalspeech/ca2f814d960149a8fdc3f3e4d773abb67e7c18ec/utils/__init__.py -------------------------------------------------------------------------------- /utils/attach_memory_bank.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | from pathlib import Path 4 | 5 | import numpy as np 6 | import torch 7 | from torch.cuda.amp import autocast 8 | from torch.utils.data import DataLoader 9 | 10 | from text.symbols import symbols 11 | from models.models import SynthesizerTrn 12 | from models.models import VAEMemoryBank 13 | from utils import utils 14 | 15 | from utils.data_utils import ( 16 | TextAudioLoaderWithDuration, 17 | TextAudioCollateWithDuration, 18 | ) 19 | 20 | from sklearn.cluster import KMeans 21 | 22 | 23 | def load_net_g(hps, weights_path): 24 | net_g = SynthesizerTrn( 25 | len(symbols), 26 | hps.data.filter_length // 2 + 1, 27 | hps.train.segment_size // hps.data.hop_length, 28 | hps.models, 29 | ).cuda() 30 | 31 | optim_g = torch.optim.AdamW( 32 | net_g.parameters(), 33 | hps.train.learning_rate, 34 | betas=hps.train.betas, 35 | eps=hps.train.eps, 36 | ) 37 | 38 | def load_checkpoint(checkpoint_path, model, optimizer=None): 39 | assert os.path.isfile(checkpoint_path) 40 | checkpoint_dict = torch.load(checkpoint_path, map_location="cpu") 41 | iteration = checkpoint_dict["iteration"] 42 | learning_rate = checkpoint_dict["learning_rate"] 43 | 44 | if optimizer is not None: 
45 | optimizer.load_state_dict(checkpoint_dict["optimizer"]) 46 | saved_state_dict = checkpoint_dict["model"] 47 | 48 | state_dict = model.state_dict() 49 | new_state_dict = {} 50 | for k, v in state_dict.items(): 51 | try: 52 | new_state_dict[k] = saved_state_dict[k] 53 | except: 54 | print("%s is not in the checkpoint" % k) 55 | new_state_dict[k] = v 56 | model.load_state_dict(new_state_dict) 57 | 58 | print( 59 | "Loaded checkpoint '{}' (iteration {})".format(checkpoint_path, iteration) 60 | ) 61 | return model, optimizer, learning_rate, iteration 62 | 63 | model, optimizer, learning_rate, iteration = load_checkpoint( 64 | weights_path, net_g, optim_g 65 | ) 66 | 67 | return model, optimizer, learning_rate, iteration 68 | 69 | 70 | def get_dataloader(hps): 71 | train_dataset = TextAudioLoaderWithDuration(hps.data.training_files, hps.data) 72 | collate_fn = TextAudioCollateWithDuration() 73 | train_loader = DataLoader( 74 | train_dataset, 75 | num_workers=1, 76 | shuffle=False, 77 | pin_memory=False, 78 | collate_fn=collate_fn, 79 | batch_size=1, 80 | ) 81 | return train_loader 82 | 83 | 84 | def get_zs(net_g, dataloader, num_samples=0): 85 | net_g.eval() 86 | print(len(dataloader)) 87 | zs = [] 88 | with torch.no_grad(): 89 | for batch_idx, ( 90 | x, 91 | x_lengths, 92 | spec, 93 | spec_lengths, 94 | y, 95 | y_lengths, 96 | duration, 97 | ) in enumerate(dataloader): 98 | rank = 0 99 | x, x_lengths = x.cuda(rank, non_blocking=True), x_lengths.cuda( 100 | rank, non_blocking=True 101 | ) 102 | spec, spec_lengths = spec.cuda(rank, non_blocking=True), spec_lengths.cuda( 103 | rank, non_blocking=True 104 | ) 105 | y, y_lengths = y.cuda(rank, non_blocking=True), y_lengths.cuda( 106 | rank, non_blocking=True 107 | ) 108 | duration = duration.cuda() 109 | with autocast(enabled=hps.train.fp16_run): 110 | ( 111 | y_hat, 112 | l_length, 113 | ids_slice, 114 | x_mask, 115 | z_mask, 116 | (z, z_p, m_p, logs_p, m_q, logs_q, p_mask), 117 | *_, 118 | ) = net_g(x, x_lengths, spec, spec_lengths, duration) 119 | 120 | zs.append(z.squeeze(0).cpu()) 121 | if batch_idx % 100 == 99: 122 | print(batch_idx, zs[batch_idx].shape) 123 | 124 | if num_samples and batch_idx >= num_samples: 125 | break 126 | return zs 127 | 128 | 129 | def k_means(zs): 130 | X = torch.cat(zs, dim=1).transpose(0, 1).numpy() 131 | print(X.shape) 132 | kmeans = KMeans(n_clusters=1000, random_state=0, n_init="auto").fit(X) 133 | print(kmeans.cluster_centers_.shape) 134 | 135 | return kmeans.cluster_centers_ 136 | 137 | 138 | def save_memory_bank(bank): 139 | state_dict = bank.state_dict() 140 | torch.save(state_dict, "./bank_init.pth") 141 | 142 | 143 | def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path): 144 | state_dict = model.state_dict() 145 | torch.save( 146 | { 147 | "model": state_dict, 148 | "iteration": iteration, 149 | "optimizer": optimizer.state_dict(), 150 | "learning_rate": learning_rate, 151 | }, 152 | checkpoint_path, 153 | ) 154 | print("Saving model to " + checkpoint_path) 155 | 156 | 157 | if __name__ == "__main__": 158 | parser = argparse.ArgumentParser() 159 | parser.add_argument("-c", "--config", type=str, default="configs/ljs.json") 160 | parser.add_argument("--weights_path", type=str) 161 | parser.add_argument( 162 | "--num_samples", 163 | type=int, 164 | default=0, 165 | help="samples to use for k-means clustering, 0 for use all samples in dataset", 166 | ) 167 | args = parser.parse_args() 168 | 169 | hps = utils.get_hparams_from_file(args.config) 170 | net_g, optimizer, lr, iterations 
= load_net_g(hps, weights_path=args.weights_path) 171 | 172 | dataloader = get_dataloader(hps) 173 | zs = get_zs(net_g, dataloader, num_samples=args.num_samples) 174 | centers = k_means(zs) 175 | 176 | memory_bank = VAEMemoryBank( 177 | **hps.models.memory_bank, 178 | init_values=torch.from_numpy(centers).cuda().transpose(0, 1) 179 | ) 180 | save_memory_bank(memory_bank) 181 | 182 | net_g.memory_bank = memory_bank 183 | optimizer.add_param_group( 184 | { 185 | "params": list(memory_bank.parameters()), 186 | "initial_lr": optimizer.param_groups[0]["initial_lr"], 187 | } 188 | ) 189 | 190 | p = Path(args.weights_path) 191 | save_path = p.with_stem(p.stem + "_with_memory").__str__() 192 | save_checkpoint(net_g, optimizer, lr, iterations, save_path) 193 | 194 | # test 195 | print(memory_bank(torch.randn((2, 192, 12))).shape) 196 | -------------------------------------------------------------------------------- /utils/commons.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import torch 4 | from torch import nn 5 | from torch.nn import functional as F 6 | 7 | 8 | def init_weights(m, mean=0.0, std=0.01): 9 | classname = m.__class__.__name__ 10 | if classname.find("Conv") != -1: 11 | m.weight.data.normal_(mean, std) 12 | 13 | 14 | def get_padding(kernel_size, dilation=1): 15 | return int((kernel_size * dilation - dilation) / 2) 16 | 17 | 18 | def convert_pad_shape(pad_shape): 19 | l = pad_shape[::-1] 20 | pad_shape = [item for sublist in l for item in sublist] 21 | return pad_shape 22 | 23 | 24 | def intersperse(lst, item): 25 | result = [item] * (len(lst) * 2 + 1) 26 | result[1::2] = lst 27 | return result 28 | 29 | 30 | def kl_divergence(m_p, logs_p, m_q, logs_q): 31 | """KL(P||Q)""" 32 | kl = (logs_q - logs_p) - 0.5 33 | kl += ( 34 | 0.5 * (torch.exp(2.0 * logs_p) + ((m_p - m_q) ** 2)) * torch.exp(-2.0 * logs_q) 35 | ) 36 | return kl 37 | 38 | 39 | def rand_gumbel(shape): 40 | """Sample from the Gumbel distribution, protect from overflows.""" 41 | uniform_samples = torch.rand(shape) * 0.99998 + 0.00001 42 | return -torch.log(-torch.log(uniform_samples)) 43 | 44 | 45 | def rand_gumbel_like(x): 46 | g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device) 47 | return g 48 | 49 | 50 | def slice_segments(x, ids_str, segment_size=4): 51 | ret = torch.zeros_like(x[:, :, :segment_size]) 52 | for i in range(x.size(0)): 53 | idx_str = ids_str[i] 54 | idx_end = idx_str + segment_size 55 | ret[i] = x[i, :, idx_str:idx_end] 56 | return ret 57 | 58 | 59 | def rand_slice_segments(x, x_lengths=None, segment_size=4): 60 | b, d, t = x.size() 61 | if x_lengths is None: 62 | x_lengths = t 63 | ids_str_max = x_lengths - segment_size + 1 64 | ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long) 65 | ret = slice_segments(x, ids_str, segment_size) 66 | return ret, ids_str 67 | 68 | 69 | def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4): 70 | position = torch.arange(length, dtype=torch.float) 71 | num_timescales = channels // 2 72 | log_timescale_increment = math.log(float(max_timescale) / float(min_timescale)) / ( 73 | num_timescales - 1 74 | ) 75 | inv_timescales = min_timescale * torch.exp( 76 | torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment 77 | ) 78 | scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1) 79 | signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0) 80 | signal = F.pad(signal, [0, 0, 0, channels % 
2]) 81 | signal = signal.view(1, channels, length) 82 | return signal 83 | 84 | 85 | def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4): 86 | b, channels, length = x.size() 87 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 88 | return x + signal.to(dtype=x.dtype, device=x.device) 89 | 90 | 91 | def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1): 92 | b, channels, length = x.size() 93 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 94 | return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis) 95 | 96 | 97 | def subsequent_mask(length): 98 | mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0) 99 | return mask 100 | 101 | 102 | @torch.jit.script 103 | def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): 104 | n_channels_int = n_channels[0] 105 | in_act = input_a + input_b 106 | t_act = torch.tanh(in_act[:, :n_channels_int, :]) 107 | s_act = torch.sigmoid(in_act[:, n_channels_int:, :]) 108 | acts = t_act * s_act 109 | return acts 110 | 111 | 112 | def convert_pad_shape(pad_shape): 113 | l = pad_shape[::-1] 114 | pad_shape = [item for sublist in l for item in sublist] 115 | return pad_shape 116 | 117 | 118 | def shift_1d(x): 119 | x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1] 120 | return x 121 | 122 | 123 | def sequence_mask(length, max_length=None): 124 | if max_length is None: 125 | max_length = length.max() 126 | x = torch.arange(max_length, dtype=length.dtype, device=length.device) 127 | return x.unsqueeze(0) < length.unsqueeze(1) 128 | 129 | 130 | def generate_path(duration, mask): 131 | """ 132 | duration: [b, 1, t_x] 133 | mask: [b, 1, t_y, t_x] 134 | """ 135 | device = duration.device 136 | 137 | b, _, t_y, t_x = mask.shape 138 | cum_duration = torch.cumsum(duration, -1) 139 | 140 | cum_duration_flat = cum_duration.view(b * t_x) 141 | path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype) 142 | path = path.view(b, t_x, t_y) 143 | path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1] 144 | path = path.unsqueeze(1).transpose(2, 3) * mask 145 | return path 146 | 147 | 148 | def clip_grad_value_(parameters, clip_value, norm_type=2): 149 | if isinstance(parameters, torch.Tensor): 150 | parameters = [parameters] 151 | parameters = list(filter(lambda p: p.grad is not None, parameters)) 152 | norm_type = float(norm_type) 153 | if clip_value is not None: 154 | clip_value = float(clip_value) 155 | 156 | total_norm = 0 157 | for p in parameters: 158 | param_norm = p.grad.data.norm(norm_type) 159 | total_norm += param_norm.item() ** norm_type 160 | if clip_value is not None: 161 | p.grad.data.clamp_(min=-clip_value, max=clip_value) 162 | total_norm = total_norm ** (1.0 / norm_type) 163 | return total_norm 164 | -------------------------------------------------------------------------------- /utils/configs/ljs.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 500, 4 | "eval_interval": 5, 5 | "seed": 1234, 6 | "epochs": 1500, 7 | "learning_rate": 2e-4, 8 | "betas": [0.8, 0.99], 9 | "eps": 1e-9, 10 | "batch_size": 16, 11 | "fp16_run": true, 12 | "lr_decay": 0.999, 13 | "segment_size": 8192, 14 | "init_lr_ratio": 1, 15 | "warmup_epochs": 200, 16 | "c_mel": 45, 17 | "c_kl": 1.0, 18 | "c_kl_fwd": 0.001, 19 | "c_e2e": 0.1, 20 | "c_dur": 5.0, 21 | "use_sdtw": false, 22 | "use_gt_duration": true 23 | }, 24 | "data": { 25 | 
"training_files":"filelists/ljs_audio_text_train_filelist.txt.cleaned", 26 | "validation_files":"filelists/ljs_audio_text_val_filelist.txt.cleaned", 27 | "text_cleaners":["english_cleaners2"], 28 | "max_wav_value": 32768.0, 29 | "sampling_rate": 22050, 30 | "filter_length": 1024, 31 | "hop_length": 256, 32 | "win_length": 1024, 33 | "n_mel_channels": 80, 34 | "mel_fmin": 0.0, 35 | "mel_fmax": null, 36 | "add_blank": true, 37 | "n_speakers": 0, 38 | "cleaned_text": true 39 | }, 40 | "models": { 41 | "phoneme_encoder": { 42 | "out_channels": 192, 43 | "hidden_channels": 192, 44 | "filter_channels": 768, 45 | "n_heads": 2, 46 | "n_layers": 6, 47 | "kernel_size": 3, 48 | "p_dropout": 0.1 49 | }, 50 | "decoder": { 51 | "initial_channel": 192, 52 | "resblock": "1", 53 | "resblock_kernel_sizes": [3,7,11], 54 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 55 | "upsample_rates": [8,8,2,2], 56 | "upsample_initial_channel": 256, 57 | "upsample_kernel_sizes": [16,16,4,4], 58 | "gin_channels": 0 59 | }, 60 | "posterior_encoder": { 61 | "out_channels": 192, 62 | "hidden_channels": 192, 63 | "kernel_size": 5, 64 | "dilation_rate": 1, 65 | "n_layers": 16 66 | }, 67 | "flow": { 68 | "channels": 192, 69 | "hidden_channels": 192, 70 | "kernel_size": 5, 71 | "dilation_rate": 1, 72 | "n_layers": 4 73 | }, 74 | "duration_predictor": { 75 | "in_channels": 192, 76 | "filter_channels": 256, 77 | "kernel_size": 3, 78 | "p_dropout": 0.5 79 | }, 80 | "learnable_upsampling": { 81 | "d_predictor": 192, 82 | "kernel_size": 3, 83 | "dropout": 0.0, 84 | "conv_output_size": 8, 85 | "dim_w": 4, 86 | "dim_c": 2, 87 | "max_seq_len": 1000 88 | }, 89 | "memory_bank": { 90 | "bank_size": 1000, 91 | "n_hidden_dims": 192, 92 | "n_attn_heads": 2 93 | } 94 | } 95 | } 96 | -------------------------------------------------------------------------------- /utils/configs/ljs_reproduce.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 1, 4 | "eval_interval": 1, 5 | "seed": 1234, 6 | "epochs": 15000, 7 | "learning_rate": 2e-4, 8 | "betas": [0.8, 0.99], 9 | "eps": 1e-9, 10 | "batch_size": 16, 11 | "fp16_run": true, 12 | "lr_decay": 0.999875, 13 | "segment_size": 8192, 14 | "init_lr_ratio": 1, 15 | "warmup_epochs": 200, 16 | "c_mel": 45, 17 | "c_kl": 1.0, 18 | "c_kl_fwd": 0.02, 19 | "c_e2e": 0.1, 20 | "c_dur": 5.0, 21 | "use_sdtw": true, 22 | "use_gt_duration": false 23 | }, 24 | "data": { 25 | "training_files":"filelists/ljs_audio_text_train_filelist.txt.cleaned", 26 | "validation_files":"filelists/ljs_audio_text_val_filelist.txt.cleaned", 27 | "text_cleaners":["english_cleaners2"], 28 | "max_wav_value": 32768.0, 29 | "sampling_rate": 22050, 30 | "filter_length": 1024, 31 | "hop_length": 256, 32 | "win_length": 1024, 33 | "n_mel_channels": 80, 34 | "mel_fmin": 0.0, 35 | "mel_fmax": null, 36 | "add_blank": true, 37 | "n_speakers": 0, 38 | "cleaned_text": true 39 | }, 40 | "models": { 41 | "phoneme_encoder": { 42 | "out_channels": 192, 43 | "hidden_channels": 192, 44 | "filter_channels": 768, 45 | "n_heads": 2, 46 | "n_layers": 6, 47 | "kernel_size": 3, 48 | "p_dropout": 0.1 49 | }, 50 | "decoder": { 51 | "initial_channel": 192, 52 | "resblock": "1", 53 | "resblock_kernel_sizes": [3,7,11], 54 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 55 | "upsample_rates": [8,8,2,2], 56 | "upsample_initial_channel": 256, 57 | "upsample_kernel_sizes": [16,16,4,4], 58 | "gin_channels": 0 59 | }, 60 | "posterior_encoder": { 61 | "out_channels": 192, 62 | 
"hidden_channels": 192, 63 | "kernel_size": 5, 64 | "dilation_rate": 1, 65 | "n_layers": 16 66 | }, 67 | "flow": { 68 | "channels": 192, 69 | "hidden_channels": 192, 70 | "kernel_size": 5, 71 | "dilation_rate": 1, 72 | "n_layers": 4 73 | }, 74 | "duration_predictor": { 75 | "in_channels": 192, 76 | "filter_channels": 256, 77 | "kernel_size": 3, 78 | "p_dropout": 0.5 79 | }, 80 | "learnable_upsampling": { 81 | "d_predictor": 192, 82 | "kernel_size": 3, 83 | "dropout": 0.0, 84 | "conv_output_size": 8, 85 | "dim_w": 4, 86 | "dim_c": 2, 87 | "max_seq_len": 1000 88 | }, 89 | "memory_bank": { 90 | "bank_size": 1000, 91 | "n_hidden_dims": 192, 92 | "n_attn_heads": 2 93 | } 94 | } 95 | } 96 | -------------------------------------------------------------------------------- /utils/filelists/ljs_audio_text_val_filelist.txt: -------------------------------------------------------------------------------- 1 | DUMMY1/LJ022-0023.wav|The overwhelming majority of people in this country know how to sift the wheat from the chaff in what they hear and what they read. 2 | DUMMY1/LJ043-0030.wav|If somebody did that to me, a lousy trick like that, to take my wife away, and all the furniture, I would be mad as hell, too. 3 | DUMMY1/LJ005-0201.wav|as is shown by the report of the Commissioners to inquire into the state of the municipal corporations in eighteen thirty-five. 4 | DUMMY1/LJ001-0110.wav|Even the Caslon type when enlarged shows great shortcomings in this respect: 5 | DUMMY1/LJ003-0345.wav|All the committee could do in this respect was to throw the responsibility on others. 6 | DUMMY1/LJ007-0154.wav|These pungent and well-grounded strictures applied with still greater force to the unconvicted prisoner, the man who came to the prison innocent, and still uncontaminated, 7 | DUMMY1/LJ018-0098.wav|and recognized as one of the frequenters of the bogus law-stationers. His arrest led to that of others. 8 | DUMMY1/LJ047-0044.wav|Oswald was, however, willing to discuss his contacts with Soviet authorities. He denied having any involvement with Soviet intelligence agencies 9 | DUMMY1/LJ031-0038.wav|The first physician to see the President at Parkland Hospital was Dr. Charles J. Carrico, a resident in general surgery. 10 | DUMMY1/LJ048-0194.wav|during the morning of November twenty-two prior to the motorcade. 11 | DUMMY1/LJ049-0026.wav|On occasion the Secret Service has been permitted to have an agent riding in the passenger compartment with the President. 12 | DUMMY1/LJ004-0152.wav|although at Mr. Buxton's visit a new jail was in process of erection, the first step towards reform since Howard's visitation in seventeen seventy-four. 13 | DUMMY1/LJ008-0278.wav|or theirs might be one of many, and it might be considered necessary to "make an example." 14 | DUMMY1/LJ043-0002.wav|The Warren Commission Report. By The President's Commission on the Assassination of President Kennedy. Chapter seven. Lee Harvey Oswald: 15 | DUMMY1/LJ009-0114.wav|Mr. Wakefield winds up his graphic but somewhat sensational account by describing another religious service, which may appropriately be inserted here. 16 | DUMMY1/LJ028-0506.wav|A modern artist would have difficulty in doing such accurate work. 17 | DUMMY1/LJ050-0168.wav|with the particular purposes of the agency involved. 
The Commission recognizes that this is a controversial area 18 | DUMMY1/LJ039-0223.wav|Oswald's Marine training in marksmanship, his other rifle experience and his established familiarity with this particular weapon 19 | DUMMY1/LJ029-0032.wav|According to O'Donnell, quote, we had a motorcade wherever we went, end quote. 20 | DUMMY1/LJ031-0070.wav|Dr. Clark, who most closely observed the head wound, 21 | DUMMY1/LJ034-0198.wav|Euins, who was on the southwest corner of Elm and Houston Streets testified that he could not describe the man he saw in the window. 22 | DUMMY1/LJ026-0068.wav|Energy enters the plant, to a small extent, 23 | DUMMY1/LJ039-0075.wav|once you know that you must put the crosshairs on the target and that is all that is necessary. 24 | DUMMY1/LJ004-0096.wav|the fatal consequences whereof might be prevented if the justices of the peace were duly authorized 25 | DUMMY1/LJ005-0014.wav|Speaking on a debate on prison matters, he declared that 26 | DUMMY1/LJ012-0161.wav|he was reported to have fallen away to a shadow. 27 | DUMMY1/LJ018-0239.wav|His disappearance gave color and substance to evil reports already in circulation that the will and conveyance above referred to 28 | DUMMY1/LJ019-0257.wav|Here the tread-wheel was in use, there cellular cranks, or hard-labor machines. 29 | DUMMY1/LJ028-0008.wav|you tap gently with your heel upon the shoulder of the dromedary to urge her on. 30 | DUMMY1/LJ024-0083.wav|This plan of mine is no attack on the Court; 31 | DUMMY1/LJ042-0129.wav|No night clubs or bowling alleys, no places of recreation except the trade union dances. I have had enough. 32 | DUMMY1/LJ036-0103.wav|The police asked him whether he could pick out his passenger from the lineup. 33 | DUMMY1/LJ046-0058.wav|During his Presidency, Franklin D. Roosevelt made almost four hundred journeys and traveled more than three hundred fifty thousand miles. 34 | DUMMY1/LJ014-0076.wav|He was seen afterwards smoking and talking with his hosts in their back parlor, and never seen again alive. 35 | DUMMY1/LJ002-0043.wav|long narrow rooms -- one thirty-six feet, six twenty-three feet, and the eighth eighteen, 36 | DUMMY1/LJ009-0076.wav|We come to the sermon. 37 | DUMMY1/LJ017-0131.wav|even when the high sheriff had told him there was no possibility of a reprieve, and within a few hours of execution. 38 | DUMMY1/LJ046-0184.wav|but there is a system for the immediate notification of the Secret Service by the confining institution when a subject is released or escapes. 39 | DUMMY1/LJ014-0263.wav|When other pleasures palled he took a theatre, and posed as a munificent patron of the dramatic art. 40 | DUMMY1/LJ042-0096.wav|(old exchange rate) in addition to his factory salary of approximately equal amount 41 | DUMMY1/LJ049-0050.wav|Hill had both feet on the car and was climbing aboard to assist President and Mrs. Kennedy. 42 | DUMMY1/LJ019-0186.wav|seeing that since the establishment of the Central Criminal Court, Newgate received prisoners for trial from several counties, 43 | DUMMY1/LJ028-0307.wav|then let twenty days pass, and at the end of that time station near the Chaldasan gates a body of four thousand. 44 | DUMMY1/LJ012-0235.wav|While they were in a state of insensibility the murder was committed. 45 | DUMMY1/LJ034-0053.wav|reached the same conclusion as Latona that the prints found on the cartons were those of Lee Harvey Oswald. 46 | DUMMY1/LJ014-0030.wav|These were damnatory facts which well supported the prosecution. 
47 | DUMMY1/LJ015-0203.wav|but were the precautions too minute, the vigilance too close to be eluded or overcome? 48 | DUMMY1/LJ028-0093.wav|but his scribe wrote it in the manner customary for the scribes of those days to write of their royal masters. 49 | DUMMY1/LJ002-0018.wav|The inadequacy of the jail was noticed and reported upon again and again by the grand juries of the city of London, 50 | DUMMY1/LJ028-0275.wav|At last, in the twentieth month, 51 | DUMMY1/LJ012-0042.wav|which he kept concealed in a hiding-place with a trap-door just under his bed. 52 | DUMMY1/LJ011-0096.wav|He married a lady also belonging to the Society of Friends, who brought him a large fortune, which, and his own money, he put into a city firm, 53 | DUMMY1/LJ036-0077.wav|Roger D. Craig, a deputy sheriff of Dallas County, 54 | DUMMY1/LJ016-0318.wav|Other officials, great lawyers, governors of prisons, and chaplains supported this view. 55 | DUMMY1/LJ013-0164.wav|who came from his room ready dressed, a suspicious circumstance, as he was always late in the morning. 56 | DUMMY1/LJ027-0141.wav|is closely reproduced in the life-history of existing deer. Or, in other words, 57 | DUMMY1/LJ028-0335.wav|accordingly they committed to him the command of their whole army, and put the keys of their city into his hands. 58 | DUMMY1/LJ031-0202.wav|Mrs. Kennedy chose the hospital in Bethesda for the autopsy because the President had served in the Navy. 59 | DUMMY1/LJ021-0145.wav|From those willing to join in establishing this hoped-for period of peace, 60 | DUMMY1/LJ016-0288.wav|"Müller, Müller, He's the man," till a diversion was created by the appearance of the gallows, which was received with continuous yells. 61 | DUMMY1/LJ028-0081.wav|Years later, when the archaeologists could readily distinguish the false from the true, 62 | DUMMY1/LJ018-0081.wav|his defense being that he had intended to commit suicide, but that, on the appearance of this officer who had wronged him, 63 | DUMMY1/LJ021-0066.wav|together with a great increase in the payrolls, there has come a substantial rise in the total of industrial profits 64 | DUMMY1/LJ009-0238.wav|After this the sheriffs sent for another rope, but the spectators interfered, and the man was carried back to jail. 65 | DUMMY1/LJ005-0079.wav|and improve the morals of the prisoners, and shall insure the proper measure of punishment to convicted offenders. 66 | DUMMY1/LJ035-0019.wav|drove to the northwest corner of Elm and Houston, and parked approximately ten feet from the traffic signal. 67 | DUMMY1/LJ036-0174.wav|This is the approximate time he entered the roominghouse, according to Earlene Roberts, the housekeeper there. 68 | DUMMY1/LJ046-0146.wav|The criteria in effect prior to November twenty-two, nineteen sixty-three, for determining whether to accept material for the PRS general files 69 | DUMMY1/LJ017-0044.wav|and the deepest anxiety was felt that the crime, if crime there had been, should be brought home to its perpetrator. 70 | DUMMY1/LJ017-0070.wav|but his sporting operations did not prosper, and he became a needy man, always driven to desperate straits for cash. 71 | DUMMY1/LJ014-0020.wav|He was soon afterwards arrested on suspicion, and a search of his lodgings brought to light several garments saturated with blood; 72 | DUMMY1/LJ016-0020.wav|He never reached the cistern, but fell back into the yard, injuring his legs severely. 73 | DUMMY1/LJ045-0230.wav|when he was finally apprehended in the Texas Theatre. 
Although it is not fully corroborated by others who were present, 74 | DUMMY1/LJ035-0129.wav|and she must have run down the stairs ahead of Oswald and would probably have seen or heard him. 75 | DUMMY1/LJ008-0307.wav|afterwards express a wish to murder the Recorder for having kept them so long in suspense. 76 | DUMMY1/LJ008-0294.wav|nearly indefinitely deferred. 77 | DUMMY1/LJ047-0148.wav|On October twenty-five, 78 | DUMMY1/LJ008-0111.wav|They entered a "stone cold room," and were presently joined by the prisoner. 79 | DUMMY1/LJ034-0042.wav|that he could only testify with certainty that the print was less than three days old. 80 | DUMMY1/LJ037-0234.wav|Mrs. Mary Brock, the wife of a mechanic who worked at the station, was there at the time and she saw a white male, 81 | DUMMY1/LJ040-0002.wav|Chapter seven. Lee Harvey Oswald: Background and Possible Motives, Part one. 82 | DUMMY1/LJ045-0140.wav|The arguments he used to justify his use of the alias suggest that Oswald may have come to think that the whole world was becoming involved 83 | DUMMY1/LJ012-0035.wav|the number and names on watches, were carefully removed or obliterated after the goods passed out of his hands. 84 | DUMMY1/LJ012-0250.wav|On the seventh July, eighteen thirty-seven, 85 | DUMMY1/LJ016-0179.wav|contracted with sheriffs and conveners to work by the job. 86 | DUMMY1/LJ016-0138.wav|at a distance from the prison. 87 | DUMMY1/LJ027-0052.wav|These principles of homology are essential to a correct interpretation of the facts of morphology. 88 | DUMMY1/LJ031-0134.wav|On one occasion Mrs. Johnson, accompanied by two Secret Service agents, left the room to see Mrs. Kennedy and Mrs. Connally. 89 | DUMMY1/LJ019-0273.wav|which Sir Joshua Jebb told the committee he considered the proper elements of penal discipline. 90 | DUMMY1/LJ014-0110.wav|At the first the boxes were impounded, opened, and found to contain many of O'Connor's effects. 91 | DUMMY1/LJ034-0160.wav|on Brennan's subsequent certain identification of Lee Harvey Oswald as the man he saw fire the rifle. 92 | DUMMY1/LJ038-0199.wav|eleven. If I am alive and taken prisoner, 93 | DUMMY1/LJ014-0010.wav|yet he could not overcome the strange fascination it had for him, and remained by the side of the corpse till the stretcher came. 94 | DUMMY1/LJ033-0047.wav|I noticed when I went out that the light was on, end quote, 95 | DUMMY1/LJ040-0027.wav|He was never satisfied with anything. 96 | DUMMY1/LJ048-0228.wav|and others who were present say that no agent was inebriated or acted improperly. 97 | DUMMY1/LJ003-0111.wav|He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity. 98 | DUMMY1/LJ008-0258.wav|Let me retrace my steps, and speak more in detail of the treatment of the condemned in those bloodthirsty and brutally indifferent days, 99 | DUMMY1/LJ029-0022.wav|The original plan called for the President to spend only one day in the State, making whirlwind visits to Dallas, Fort Worth, San Antonio, and Houston. 100 | DUMMY1/LJ004-0045.wav|Mr. Sturges Bourne, Sir James Mackintosh, Sir James Scarlett, and William Wilberforce. 101 | -------------------------------------------------------------------------------- /utils/filelists/ljs_audio_text_val_filelist.txt.cleaned: -------------------------------------------------------------------------------- 1 | DUMMY1/LJ022-0023.wav|ðɪ ˌoʊvɚwˈɛlmɪŋ mədʒˈɔːɹɪɾi ʌv pˈiːpəl ɪn ðɪs kˈʌntɹi nˈoʊ hˌaʊ tə sˈɪft ðə wˈiːt fɹʌmðə tʃˈæf ɪn wˌʌt ðeɪ hˈɪɹ ænd wˌʌt ðeɪ ɹˈiːd. 
2 | DUMMY1/LJ043-0030.wav|ɪf sˈʌmbɑːdi dˈɪd ðˈæt tə mˌiː, ɐ lˈaʊsi tɹˈɪk lˈaɪk ðˈæt, tə tˈeɪk maɪ wˈaɪf ɐwˈeɪ, ænd ˈɔːl ðə fˈɜːnɪtʃɚ, ˈaɪ wʊd biː mˈæd æz hˈɛl, tˈuː. 3 | DUMMY1/LJ005-0201.wav|ˌæzˌɪz ʃˈoʊn baɪ ðə ɹɪpˈoːɹt ʌvðə kəmˈɪʃənɚz tʊ ɪnkwˈaɪɚɹ ˌɪntʊ ðə stˈeɪt ʌvðə mjuːnˈɪsɪpəl kˌɔːɹpɚɹˈeɪʃənz ɪn eɪtˈiːn θˈɜːɾifˈaɪv. 4 | DUMMY1/LJ001-0110.wav|ˈiːvən ðə kˈæslɑːn tˈaɪp wɛn ɛnlˈɑːɹdʒd ʃˈoʊz ɡɹˈeɪt ʃˈɔːɹtkʌmɪŋz ɪn ðɪs ɹɪspˈɛkt: 5 | DUMMY1/LJ003-0345.wav|ˈɔːl ðə kəmˈɪɾi kʊd dˈuː ɪn ðɪs ɹɪspˈɛkt wʌz tə θɹˈoʊ ðə ɹɪspˌɑːnsəbˈɪlɪɾi ˌɑːn ˈʌðɚz. 6 | DUMMY1/LJ007-0154.wav|ðiːz pˈʌndʒənt ænd wˈɛlɡɹˈaʊndᵻd stɹˈɪktʃɚz ɐplˈaɪd wɪð stˈɪl ɡɹˈeɪɾɚ fˈoːɹs tə ðɪ ʌnkənvˈɪktᵻd pɹˈɪzənɚ, ðə mˈæn hˌuː kˈeɪm tə ðə pɹˈɪzən ˈɪnəsənt, ænd stˈɪl ʌnkəntˈæmᵻnˌeɪɾᵻd, 7 | DUMMY1/LJ018-0098.wav|ænd ɹˈɛkəɡnˌaɪzd æz wˈʌn ʌvðə fɹˈiːkwɛntɚz ʌvðə bˈoʊɡəs lˈɔːstˈeɪʃənɚz. hɪz ɐɹˈɛst lˈɛd tə ðæt ʌv ˈʌðɚz. 8 | DUMMY1/LJ047-0044.wav|ˈɑːswəld wʌz, haʊˈɛvɚ, wˈɪlɪŋ tə dɪskˈʌs hɪz kˈɑːntækts wɪð sˈoʊviət ɐθˈɔːɹɪɾiz. hiː dɪnˈaɪd hˌævɪŋ ˌɛni ɪnvˈɑːlvmənt wɪð sˈoʊviət ɪntˈɛlɪdʒəns ˈeɪdʒənsiz 9 | DUMMY1/LJ031-0038.wav|ðə fˈɜːst fɪzˈɪʃən tə sˈiː ðə pɹˈɛzɪdənt æt pˈɑːɹklənd hˈɑːspɪɾəl wʌz dˈɑːktɚ tʃˈɑːɹlz dʒˈeɪ. kˈæɹɪkˌoʊ, ɐ ɹˈɛzɪdənt ɪn dʒˈɛnɚɹəl sˈɜːdʒɚɹi. 10 | DUMMY1/LJ048-0194.wav|dˈʊɹɪŋ ðə mˈɔːɹnɪŋ ʌv noʊvˈɛmbɚ twˈɛntitˈuː pɹˈaɪɚ tə ðə mˈoʊɾɚkˌeɪd. 11 | DUMMY1/LJ049-0026.wav|ˌɑːn əkˈeɪʒən ðə sˈiːkɹət sˈɜːvɪs hɐzbɪn pɚmˈɪɾᵻd tə hæv ɐn ˈeɪdʒənt ɹˈaɪdɪŋ ɪnðə pˈæsɪndʒɚ kəmpˈɑːɹtmənt wɪððə pɹˈɛzɪdənt. 12 | DUMMY1/LJ004-0152.wav|ɑːlðˈoʊ æt mˈɪstɚ bˈʌkstənz vˈɪzɪt ɐ nˈuː dʒˈeɪl wʌz ɪn pɹˈɑːsɛs ʌv ɪɹˈɛkʃən, ðə fˈɜːst stˈɛp tʊwˈɔːɹdz ɹɪfˈɔːɹm sˈɪns hˈaʊɚdz vˌɪzɪtˈeɪʃən ɪn sˌɛvəntˈiːn sˈɛvəntifˈoːɹ. 13 | DUMMY1/LJ008-0278.wav|ɔːɹ ðˈɛɹz mˌaɪt biː wˈʌn ʌv mˈɛni, ænd ɪt mˌaɪt biː kənsˈɪdɚd nˈɛsəsɚɹi tuː "mˌeɪk ɐn ɛɡzˈæmpəl." 14 | DUMMY1/LJ043-0002.wav|ðə wˈɔːɹən kəmˈɪʃən ɹɪpˈoːɹt. baɪ ðə pɹˈɛzɪdənts kəmˈɪʃən ɑːnðɪ ɐsˌæsᵻnˈeɪʃən ʌv pɹˈɛzɪdənt kˈɛnədi. tʃˈæptɚ sˈɛvən. lˈiː hˈɑːɹvi ˈɑːswəld: 15 | DUMMY1/LJ009-0114.wav|mˈɪstɚ wˈeɪkfiːld wˈaɪndz ˈʌp hɪz ɡɹˈæfɪk bˌʌt sˈʌmwʌt sɛnsˈeɪʃənəl ɐkˈaʊnt baɪ dɪskɹˈaɪbɪŋ ɐnˈʌðɚ ɹɪlˈɪdʒəs sˈɜːvɪs, wˌɪtʃ mˈeɪ ɐpɹˈoʊpɹɪətli biː ɪnsˈɜːɾᵻd hˈɪɹ. 16 | DUMMY1/LJ028-0506.wav|ɐ mˈɑːdɚn ˈɑːɹɾɪst wʊdhɐv dˈɪfɪkˌʌlti ɪn dˌuːɪŋ sˈʌtʃ ˈækjʊɹət wˈɜːk. 17 | DUMMY1/LJ050-0168.wav|wɪððə pɚtˈɪkjʊlɚ pˈɜːpəsᵻz ʌvðɪ ˈeɪdʒənsi ɪnvˈɑːlvd. ðə kəmˈɪʃən ɹˈɛkəɡnˌaɪzɪz ðæt ðɪs ɪz ɐ kˌɑːntɹəvˈɜːʃəl ˈɛɹiə 18 | DUMMY1/LJ039-0223.wav|ˈɑːswəldz mɚɹˈiːn tɹˈeɪnɪŋ ɪn mˈɑːɹksmənʃˌɪp, hɪz ˈʌðɚ ɹˈaɪfəl ɛkspˈiəɹɪəns ænd hɪz ɪstˈæblɪʃt fəmˌɪlɪˈæɹɪɾi wɪð ðɪs pɚtˈɪkjʊlɚ wˈɛpən 19 | DUMMY1/LJ029-0032.wav|ɐkˈoːɹdɪŋ tʊ oʊdˈɑːnəl, kwˈoʊt, wiː hɐd ɐ mˈoʊɾɚkˌeɪd wɛɹɹˈɛvɚ wiː wˈɛnt, ˈɛnd kwˈoʊt. 20 | DUMMY1/LJ031-0070.wav|dˈɑːktɚ klˈɑːɹk, hˌuː mˈoʊst klˈoʊsli ɑːbzˈɜːvd ðə hˈɛd wˈuːnd, 21 | DUMMY1/LJ034-0198.wav|jˈuːɪnz, hˌuː wʌz ɑːnðə saʊθwˈɛst kˈɔːɹnɚɹ ʌv ˈɛlm ænd hjˈuːstən stɹˈiːts tˈɛstɪfˌaɪd ðæt hiː kʊd nˌɑːt dɪskɹˈaɪb ðə mˈæn hiː sˈɔː ɪnðə wˈɪndoʊ. 22 | DUMMY1/LJ026-0068.wav|ˈɛnɚdʒi ˈɛntɚz ðə plˈænt, tʊ ɐ smˈɔːl ɛkstˈɛnt, 23 | DUMMY1/LJ039-0075.wav|wˈʌns juː nˈoʊ ðæt juː mˈʌst pˌʊt ðə kɹˈɔshɛɹz ɑːnðə tˈɑːɹɡɪt ænd ðæt ɪz ˈɔːl ðæt ɪz nˈɛsəsɚɹi. 24 | DUMMY1/LJ004-0096.wav|ðə fˈeɪɾəl kˈɑːnsɪkwənsᵻz wˈɛɹɑːf mˌaɪt biː pɹɪvˈɛntᵻd ɪf ðə dʒˈʌstɪsᵻz ʌvðə pˈiːs wɜː djˈuːli ˈɔːθɚɹˌaɪzd 25 | DUMMY1/LJ005-0014.wav|spˈiːkɪŋ ˌɑːn ɐ dɪbˈeɪt ˌɑːn pɹˈɪzən mˈæɾɚz, hiː dᵻklˈɛɹd ðˈæt 26 | DUMMY1/LJ012-0161.wav|hiː wʌz ɹɪpˈoːɹɾᵻd tə hæv fˈɔːlən ɐwˈeɪ tʊ ɐ ʃˈædoʊ. 
27 | DUMMY1/LJ018-0239.wav|hɪz dˌɪsɐpˈɪɹəns ɡˈeɪv kˈʌlɚ ænd sˈʌbstəns tʊ ˈiːvəl ɹɪpˈoːɹts ɔːlɹˌɛdi ɪn sˌɜːkjʊlˈeɪʃən ðætðə wɪl ænd kənvˈeɪəns əbˌʌv ɹɪfˈɜːd tuː 28 | DUMMY1/LJ019-0257.wav|hˈɪɹ ðə tɹˈɛdwˈiːl wʌz ɪn jˈuːs, ðɛɹ sˈɛljʊlɚ kɹˈæŋks, ɔːɹ hˈɑːɹdlˈeɪbɚ məʃˈiːnz. 29 | DUMMY1/LJ028-0008.wav|juː tˈæp dʒˈɛntli wɪð jʊɹ hˈiːl əpˌɑːn ðə ʃˈoʊldɚɹ ʌvðə dɹˈoʊmdɚɹi tʊ ˈɜːdʒ hɜːɹ ˈɑːn. 30 | DUMMY1/LJ024-0083.wav|ðɪs plˈæn ʌv mˈaɪn ɪz nˈoʊ ɐtˈæk ɑːnðə kˈoːɹt; 31 | DUMMY1/LJ042-0129.wav|nˈoʊ nˈaɪt klˈʌbz ɔːɹ bˈoʊlɪŋ ˈælɪz, nˈoʊ plˈeɪsᵻz ʌv ɹˌɛkɹiːˈeɪʃən ɛksˈɛpt ðə tɹˈeɪd jˈuːniən dˈænsᵻz. ˈaɪ hæv hɐd ɪnˈʌf. 32 | DUMMY1/LJ036-0103.wav|ðə pəlˈiːs ˈæskt hˌɪm wˈɛðɚ hiː kʊd pˈɪk ˈaʊt hɪz pˈæsɪndʒɚ fɹʌmðə lˈaɪnʌp. 33 | DUMMY1/LJ046-0058.wav|dˈʊɹɪŋ hɪz pɹˈɛzɪdənsi, fɹˈæŋklɪn dˈiː. ɹˈoʊzəvˌɛlt mˌeɪd ˈɔːlmoʊst fˈoːɹ hˈʌndɹəd dʒˈɜːnɪz ænd tɹˈævəld mˈoːɹ ðɐn θɹˈiː hˈʌndɹəd fˈɪfti θˈaʊzənd mˈaɪlz. 34 | DUMMY1/LJ014-0076.wav|hiː wʌz sˈiːn ˈæftɚwɚdz smˈoʊkɪŋ ænd tˈɔːkɪŋ wɪð hɪz hˈoʊsts ɪn ðɛɹ bˈæk pˈɑːɹlɚ, ænd nˈɛvɚ sˈiːn ɐɡˈɛn ɐlˈaɪv. 35 | DUMMY1/LJ002-0043.wav|lˈɑːŋ nˈæɹoʊ ɹˈuːmz wˈʌn θˈɜːɾisˈɪks fˈiːt, sˈɪks twˈɛntiθɹˈiː fˈiːt, ænd ðɪ ˈeɪtθ eɪtˈiːn, 36 | DUMMY1/LJ009-0076.wav|wiː kˈʌm tə ðə sˈɜːmən. 37 | DUMMY1/LJ017-0131.wav|ˈiːvən wɛn ðə hˈaɪ ʃˈɛɹɪf hɐd tˈoʊld hˌɪm ðɛɹwˌʌz nˈoʊ pˌɑːsəbˈɪlɪɾi əvɚ ɹɪpɹˈiːv, ænd wɪðˌɪn ɐ fjˈuː ˈaɪʊɹz ʌv ˌɛksɪkjˈuːʃən. 38 | DUMMY1/LJ046-0184.wav|bˌʌt ðɛɹ ɪz ɐ sˈɪstəm fɚðɪ ɪmˈiːdɪət nˌoʊɾɪfɪkˈeɪʃən ʌvðə sˈiːkɹət sˈɜːvɪs baɪ ðə kənfˈaɪnɪŋ ˌɪnstɪtˈuːʃən wɛn ɐ sˈʌbdʒɛkt ɪz ɹɪlˈiːsd ɔːɹ ɛskˈeɪps. 39 | DUMMY1/LJ014-0263.wav|wˌɛn ˈʌðɚ plˈɛʒɚz pˈɔːld hiː tˈʊk ɐ θˈiəɾɚ, ænd pˈoʊzd æz ɐ mjuːnˈɪfɪsənt pˈeɪtɹən ʌvðə dɹəmˈæɾɪk ˈɑːɹt. 40 | DUMMY1/LJ042-0096.wav| ˈoʊld ɛkstʃˈeɪndʒ ɹˈeɪt ɪn ɐdˈɪʃən tə hɪz fˈæktɚɹi sˈælɚɹi ʌv ɐpɹˈɑːksɪmətli ˈiːkwəl ɐmˈaʊnt 41 | DUMMY1/LJ049-0050.wav|hˈɪl hɐd bˈoʊθ fˈiːt ɑːnðə kˈɑːɹ ænd wʌz klˈaɪmɪŋ ɐbˈoːɹd tʊ ɐsˈɪst pɹˈɛzɪdənt ænd mɪsˈɛs kˈɛnədi. 42 | DUMMY1/LJ019-0186.wav|sˈiːɪŋ ðæt sˈɪns ðɪ ɪstˈæblɪʃmənt ʌvðə sˈɛntɹəl kɹˈɪmɪnəl kˈoːɹt, nˈuːɡeɪt ɹɪsˈiːvd pɹˈɪzənɚz fɔːɹ tɹˈaɪəl fɹʌm sˈɛvɹəl kˈaʊntɪz, 43 | DUMMY1/LJ028-0307.wav|ðˈɛn lˈɛt twˈɛnti dˈeɪz pˈæs, ænd æt ðɪ ˈɛnd ʌv ðæt tˈaɪm stˈeɪʃən nˌɪɹ ðə tʃˈældæsən ɡˈeɪts ɐ bˈɑːdi ʌv fˈoːɹ θˈaʊzənd. 44 | DUMMY1/LJ012-0235.wav|wˌaɪl ðeɪ wɜːɹ ɪn ɐ stˈeɪt ʌv ɪnsˌɛnsəbˈɪlɪɾi ðə mˈɜːdɚ wʌz kəmˈɪɾᵻd. 45 | DUMMY1/LJ034-0053.wav|ɹˈiːtʃt ðə sˈeɪm kənklˈuːʒən æz lætˈoʊnə ðætðə pɹˈɪnts fˈaʊnd ɑːnðə kˈɑːɹtənz wɜː ðoʊz ʌv lˈiː hˈɑːɹvi ˈɑːswəld. 46 | DUMMY1/LJ014-0030.wav|ðiːz wɜː dˈæmnətˌoːɹi fˈækts wˌɪtʃ wˈɛl səpˈoːɹɾᵻd ðə pɹˌɑːsɪkjˈuːʃən. 47 | DUMMY1/LJ015-0203.wav|bˌʌt wɜː ðə pɹɪkˈɔːʃənz tˈuː mˈɪnɪt, ðə vˈɪdʒɪləns tˈuː klˈoʊs təbi ɪlˈuːdᵻd ɔːɹ ˌoʊvɚkˈʌm? 48 | DUMMY1/LJ028-0093.wav|bˌʌt hɪz skɹˈaɪb ɹˈoʊt ɪt ɪnðə mˈænɚ kˈʌstəmˌɛɹi fɚðə skɹˈaɪbz ʌv ðoʊz dˈeɪz tə ɹˈaɪt ʌv ðɛɹ ɹˈɔɪəl mˈæstɚz. 49 | DUMMY1/LJ002-0018.wav|ðɪ ɪnˈædɪkwəsi ʌvðə dʒˈeɪl wʌz nˈoʊɾɪsd ænd ɹɪpˈoːɹɾᵻd əpˌɑːn ɐɡˈɛn ænd ɐɡˈɛn baɪ ðə ɡɹˈænd dʒˈʊɹɪz ʌvðə sˈɪɾi ʌv lˈʌndən, 50 | DUMMY1/LJ028-0275.wav|æt lˈæst, ɪnðə twˈɛntiəθ mˈʌnθ, 51 | DUMMY1/LJ012-0042.wav|wˌɪtʃ hiː kˈɛpt kənsˈiːld ɪn ɐ hˈaɪdɪŋplˈeɪs wɪð ɐ tɹˈæpdˈoːɹ dʒˈʌst ˌʌndɚ hɪz bˈɛd. 52 | DUMMY1/LJ011-0096.wav|hiː mˈæɹɪd ɐ lˈeɪdi ˈɑːlsoʊ bɪlˈɑːŋɪŋ tə ðə səsˈaɪəɾi ʌv fɹˈɛndz, hˌuː bɹˈɔːt hˌɪm ɐ lˈɑːɹdʒ fˈɔːɹtʃən, wˈɪtʃ, ænd hɪz ˈoʊn mˈʌni, hiː pˌʊt ˌɪntʊ ɐ sˈɪɾi fˈɜːm, 53 | DUMMY1/LJ036-0077.wav|ɹˈɑːdʒɚ dˈiː. kɹˈeɪɡ, ɐ dˈɛpjuːɾi ʃˈɛɹɪf ʌv dˈæləs kˈaʊnti, 54 | DUMMY1/LJ016-0318.wav|ˈʌðɚɹ əfˈɪʃəlz, ɡɹˈeɪt lˈɔɪɚz, ɡˈʌvɚnɚz ʌv pɹˈɪzənz, ænd tʃˈæplɪnz səpˈoːɹɾᵻd ðɪs vjˈuː. 
55 | DUMMY1/LJ013-0164.wav|hˌuː kˈeɪm fɹʌm hɪz ɹˈuːm ɹˈɛdi dɹˈɛst, ɐ səspˈɪʃəs sˈɜːkəmstˌæns, æz hiː wʌz ˈɔːlweɪz lˈeɪt ɪnðə mˈɔːɹnɪŋ. 56 | DUMMY1/LJ027-0141.wav|ɪz klˈoʊsli ɹɪpɹədˈuːst ɪnðə lˈaɪfhˈɪstɚɹi ʌv ɛɡzˈɪstɪŋ dˈɪɹ. ˈɔːɹ, ɪn ˈʌðɚ wˈɜːdz, 57 | DUMMY1/LJ028-0335.wav|ɐkˈoːɹdɪŋli ðeɪ kəmˈɪɾᵻd tə hˌɪm ðə kəmˈænd ʌv ðɛɹ hˈoʊl ˈɑːɹmi, ænd pˌʊt ðə kˈiːz ʌv ðɛɹ sˈɪɾi ˌɪntʊ hɪz hˈændz. 58 | DUMMY1/LJ031-0202.wav|mɪsˈɛs kˈɛnədi tʃˈoʊz ðə hˈɑːspɪɾəl ɪn bəθˈɛzdə fɚðɪ ˈɔːtɑːpsi bɪkˈʌz ðə pɹˈɛzɪdənt hɐd sˈɜːvd ɪnðə nˈeɪvi. 59 | DUMMY1/LJ021-0145.wav|fɹʌm ðoʊz wˈɪlɪŋ tə dʒˈɔɪn ɪn ɪstˈæblɪʃɪŋ ðɪs hˈoʊptfɔːɹ pˈiəɹɪəd ʌv pˈiːs, 60 | DUMMY1/LJ016-0288.wav|"mˈʌlɚ, mˈʌlɚ, hiːz ðə mˈæn," tˈɪl ɐ daɪvˈɜːʒən wʌz kɹiːˈeɪɾᵻd baɪ ðɪ ɐpˈɪɹəns ʌvðə ɡˈæloʊz, wˌɪtʃ wʌz ɹɪsˈiːvd wɪð kəntˈɪnjuːəs jˈɛlz. 61 | DUMMY1/LJ028-0081.wav|jˈɪɹz lˈeɪɾɚ, wˌɛn ðɪ ˌɑːɹkiːˈɑːlədʒˌɪsts kʊd ɹˈɛdɪli dɪstˈɪŋɡwɪʃ ðə fˈɑːls fɹʌmðə tɹˈuː, 62 | DUMMY1/LJ018-0081.wav|hɪz dɪfˈɛns bˌiːɪŋ ðæt hiː hɐd ɪntˈɛndᵻd tə kəmˈɪt sˈuːɪsˌaɪd, bˌʌt ðˈæt, ɑːnðɪ ɐpˈɪɹəns ʌv ðɪs ˈɑːfɪsɚ hˌuː hɐd ɹˈɔŋd hˌɪm, 63 | DUMMY1/LJ021-0066.wav|təɡˌɛðɚ wɪð ɐ ɡɹˈeɪt ˈɪnkɹiːs ɪnðə pˈeɪɹoʊlz, ðɛɹ hɐz kˈʌm ɐ səbstˈænʃəl ɹˈaɪz ɪnðə tˈoʊɾəl ʌv ɪndˈʌstɹɪəl pɹˈɑːfɪts 64 | DUMMY1/LJ009-0238.wav|ˈæftɚ ðɪs ðə ʃˈɛɹɪfs sˈɛnt fɔːɹ ɐnˈʌðɚ ɹˈoʊp, bˌʌt ðə spɛktˈeɪɾɚz ˌɪntəfˈɪɹd, ænd ðə mˈæn wʌz kˈæɹɪd bˈæk tə dʒˈeɪl. 65 | DUMMY1/LJ005-0079.wav|ænd ɪmpɹˈuːv ðə mˈɔːɹəlz ʌvðə pɹˈɪzənɚz, ænd ʃˌæl ɪnʃˈʊɹ ðə pɹˈɑːpɚ mˈɛʒɚɹ ʌv pˈʌnɪʃmənt tə kənvˈɪktᵻd əfˈɛndɚz. 66 | DUMMY1/LJ035-0019.wav|dɹˈoʊv tə ðə nɔːɹθwˈɛst kˈɔːɹnɚɹ ʌv ˈɛlm ænd hjˈuːstən, ænd pˈɑːɹkt ɐpɹˈɑːksɪmətli tˈɛn fˈiːt fɹʌmðə tɹˈæfɪk sˈɪɡnəl. 67 | DUMMY1/LJ036-0174.wav|ðɪs ɪz ðɪ ɐpɹˈɑːksɪmət tˈaɪm hiː ˈɛntɚd ðə ɹˈuːmɪŋhˌaʊs, ɐkˈoːɹdɪŋ tʊ ˈɜːliːn ɹˈɑːbɚts, ðə hˈaʊskiːpɚ ðˈɛɹ. 68 | DUMMY1/LJ046-0146.wav|ðə kɹaɪtˈiəɹɪə ɪn ɪfˈɛkt pɹˈaɪɚ tə noʊvˈɛmbɚ twˈɛntitˈuː, naɪntˈiːn sˈɪkstiθɹˈiː, fɔːɹ dɪtˈɜːmɪnɪŋ wˈɛðɚ tʊ ɐksˈɛpt mətˈiəɹɪəl fɚðə pˌiːˌɑːɹˈɛs dʒˈɛnɚɹəl fˈaɪlz 69 | DUMMY1/LJ017-0044.wav|ænd ðə dˈiːpəst æŋzˈaɪəɾi wʌz fˈɛlt ðætðə kɹˈaɪm, ɪf kɹˈaɪm ðˈɛɹ hɐdbɪn, ʃˌʊd biː bɹˈɔːt hˈoʊm tʊ ɪts pˈɜːpɪtɹˌeɪɾɚ. 70 | DUMMY1/LJ017-0070.wav|bˌʌt hɪz spˈoːɹɾɪŋ ˌɑːpɚɹˈeɪʃənz dɪdnˌɑːt pɹˈɑːspɚ, ænd hiː bɪkˌeɪm ɐ nˈiːdi mˈæn, ˈɔːlweɪz dɹˈɪvən tə dˈɛspɚɹət stɹˈeɪts fɔːɹ kˈæʃ. 71 | DUMMY1/LJ014-0020.wav|hiː wʌz sˈuːn ˈæftɚwɚdz ɐɹˈɛstᵻd ˌɑːn səspˈɪʃən, ænd ɐ sˈɜːtʃ ʌv hɪz lˈɑːdʒɪŋz bɹˈɔːt tə lˈaɪt sˈɛvɹəl ɡˈɑːɹmənts sˈætʃɚɹˌeɪɾᵻd wɪð blˈʌd; 72 | DUMMY1/LJ016-0020.wav|hiː nˈɛvɚ ɹˈiːtʃt ðə sˈɪstɚn, bˌʌt fˈɛl bˈæk ˌɪntʊ ðə jˈɑːɹd, ˈɪndʒɚɹɪŋ hɪz lˈɛɡz sɪvˈɪɹli. 73 | DUMMY1/LJ045-0230.wav|wˌɛn hiː wʌz fˈaɪnəli ˌæpɹɪhˈɛndᵻd ɪnðə tˈɛksəs θˈiəɾɚ. ɑːlðˈoʊ ɪt ɪz nˌɑːt fˈʊli kɚɹˈɑːbɚɹˌeɪɾᵻd baɪ ˈʌðɚz hˌuː wɜː pɹˈɛzənt, 74 | DUMMY1/LJ035-0129.wav|ænd ʃiː mˈʌstɐv ɹˈʌn dˌaʊn ðə stˈɛɹz ɐhˈɛd ʌv ˈɑːswəld ænd wʊd pɹˈɑːbəbli hæv sˈiːn ɔːɹ hˈɜːd hˌɪm. 75 | DUMMY1/LJ008-0307.wav|ˈæftɚwɚdz ɛkspɹˈɛs ɐ wˈɪʃ tə mˈɜːdɚ ðə ɹɪkˈoːɹdɚ fɔːɹ hˌævɪŋ kˈɛpt ðˌɛm sˌoʊ lˈɑːŋ ɪn səspˈɛns. 76 | DUMMY1/LJ008-0294.wav|nˌɪɹli ɪndˈɛfɪnətli dɪfˈɜːd. 77 | DUMMY1/LJ047-0148.wav|ˌɑːn ɑːktˈoʊbɚ twˈɛntifˈaɪv, 78 | DUMMY1/LJ008-0111.wav|ðeɪ ˈɛntɚd ˈeɪ "stˈoʊn kˈoʊld ɹˈuːm," ænd wɜː pɹˈɛzəntli dʒˈɔɪnd baɪ ðə pɹˈɪzənɚ. 79 | DUMMY1/LJ034-0042.wav|ðæt hiː kʊd ˈoʊnli tˈɛstɪfˌaɪ wɪð sˈɜːtənti ðætðə pɹˈɪnt wʌz lˈɛs ðɐn θɹˈiː dˈeɪz ˈoʊld. 80 | DUMMY1/LJ037-0234.wav|mɪsˈɛs mˈɛɹi bɹˈɑːk, ðə wˈaɪf əvə mɪkˈænɪk hˌuː wˈɜːkt æt ðə stˈeɪʃən, wʌz ðɛɹ æt ðə tˈaɪm ænd ʃiː sˈɔː ɐ wˈaɪt mˈeɪl, 81 | DUMMY1/LJ040-0002.wav|tʃˈæptɚ sˈɛvən. lˈiː hˈɑːɹvi ˈɑːswəld: bˈækɡɹaʊnd ænd pˈɑːsəbəl mˈoʊɾɪvz, pˈɑːɹt wˌʌn. 
82 | DUMMY1/LJ045-0140.wav|ðɪ ˈɑːɹɡjuːmənts hiː jˈuːzd tə dʒˈʌstɪfˌaɪ hɪz jˈuːs ʌvðɪ ˈeɪliəs sədʒˈɛst ðæt ˈɑːswəld mˌeɪhɐv kˈʌm tə θˈɪŋk ðætðə hˈoʊl wˈɜːld wʌz bɪkˈʌmɪŋ ɪnvˈɑːlvd 83 | DUMMY1/LJ012-0035.wav|ðə nˈʌmbɚ ænd nˈeɪmz ˌɑːn wˈɑːtʃᵻz, wɜː kˈɛɹfəli ɹɪmˈuːvd ɔːɹ əblˈɪɾɚɹˌeɪɾᵻd ˈæftɚ ðə ɡˈʊdz pˈæst ˌaʊɾəv hɪz hˈændz. 84 | DUMMY1/LJ012-0250.wav|ɑːnðə sˈɛvənθ dʒuːlˈaɪ, eɪtˈiːn θˈɜːɾisˈɛvən, 85 | DUMMY1/LJ016-0179.wav|kəntɹˈæktᵻd wɪð ʃˈɛɹɪfs ænd kənvˈɛnɚz tə wˈɜːk baɪ ðə dʒˈɑːb. 86 | DUMMY1/LJ016-0138.wav|æɾə dˈɪstəns fɹʌmðə pɹˈɪzən. 87 | DUMMY1/LJ027-0052.wav|ðiːz pɹˈɪnsɪpəlz ʌv həmˈɑːlədʒi ɑːɹ ɪsˈɛnʃəl tʊ ɐ kɚɹˈɛkt ɪntˌɜːpɹɪtˈeɪʃən ʌvðə fˈækts ʌv mɔːɹfˈɑːlədʒi. 88 | DUMMY1/LJ031-0134.wav|ˌɑːn wˈʌn əkˈeɪʒən mɪsˈɛs dʒˈɑːnsən, ɐkˈʌmpənɪd baɪ tˈuː sˈiːkɹət sˈɜːvɪs ˈeɪdʒənts, lˈɛft ðə ɹˈuːm tə sˈiː mɪsˈɛs kˈɛnədi ænd mɪsˈɛs kənˈæli. 89 | DUMMY1/LJ019-0273.wav|wˌɪtʃ sˌɜː dʒˈɑːʃjuːə dʒˈɛb tˈoʊld ðə kəmˈɪɾi hiː kənsˈɪdɚd ðə pɹˈɑːpɚɹ ˈɛlɪmənts ʌv pˈiːnəl dˈɪsɪplˌɪn. 90 | DUMMY1/LJ014-0110.wav|æt ðə fˈɜːst ðə bˈɑːksᵻz wɜːɹ ɪmpˈaʊndᵻd, ˈoʊpənd, ænd fˈaʊnd tə kəntˈeɪn mˈɛnɪəv oʊkˈɑːnɚz ɪfˈɛkts. 91 | DUMMY1/LJ034-0160.wav|ˌɑːn bɹˈɛnənz sˈʌbsɪkwənt sˈɜːtən aɪdˈɛntɪfɪkˈeɪʃən ʌv lˈiː hˈɑːɹvi ˈɑːswəld æz ðə mˈæn hiː sˈɔː fˈaɪɚ ðə ɹˈaɪfəl. 92 | DUMMY1/LJ038-0199.wav|ɪlˈɛvən. ɪf ˈaɪ æm ɐlˈaɪv ænd tˈeɪkən pɹˈɪzənɚ, 93 | DUMMY1/LJ014-0010.wav|jˈɛt hiː kʊd nˌɑːt ˌoʊvɚkˈʌm ðə stɹˈeɪndʒ fˌæsᵻnˈeɪʃən ɪt hˈɐd fɔːɹ hˌɪm, ænd ɹɪmˈeɪnd baɪ ðə sˈaɪd ʌvðə kˈɔːɹps tˈɪl ðə stɹˈɛtʃɚ kˈeɪm. 94 | DUMMY1/LJ033-0047.wav|ˈaɪ nˈoʊɾɪsd wɛn ˈaɪ wɛnt ˈaʊt ðætðə lˈaɪt wʌz ˈɑːn, ˈɛnd kwˈoʊt, 95 | DUMMY1/LJ040-0027.wav|hiː wʌz nˈɛvɚ sˈæɾɪsfˌaɪd wɪð ˈɛnɪθˌɪŋ. 96 | DUMMY1/LJ048-0228.wav|ænd ˈʌðɚz hˌuː wɜː pɹˈɛzənt sˈeɪ ðæt nˈoʊ ˈeɪdʒənt wʌz ɪnˈiːbɹɪˌeɪɾᵻd ɔːɹ ˈæktᵻd ɪmpɹˈɑːpɚli. 97 | DUMMY1/LJ003-0111.wav|hiː wʌz ɪn kˈɑːnsɪkwəns pˌʊt ˌaʊɾəv ðə pɹətˈɛkʃən ʌv ðɛɹ ɪntˈɜːnəl lˈɔː, ˈɛnd kwˈoʊt. ðɛɹ kˈoʊd wʌzɐ sˈʌbdʒɛkt ʌv sˌʌm kjˌʊɹɪˈɑːsɪɾi. 98 | DUMMY1/LJ008-0258.wav|lˈɛt mˌiː ɹɪtɹˈeɪs maɪ stˈɛps, ænd spˈiːk mˈoːɹ ɪn diːtˈeɪl ʌvðə tɹˈiːtmənt ʌvðə kəndˈɛmd ɪn ðoʊz blˈʌdθɜːsti ænd bɹˈuːɾəli ɪndˈɪfɹənt dˈeɪz, 99 | DUMMY1/LJ029-0022.wav|ðɪ ɚɹˈɪdʒɪnəl plˈæn kˈɔːld fɚðə pɹˈɛzɪdənt tə spˈɛnd ˈoʊnli wˈʌn dˈeɪ ɪnðə stˈeɪt, mˌeɪkɪŋ wˈɜːlwɪnd vˈɪzɪts tə dˈæləs, fˈɔːɹt wˈɜːθ, sˌæn æntˈoʊnɪˌoʊ, ænd hjˈuːstən. 100 | DUMMY1/LJ004-0045.wav|mˈɪstɚ stˈɜːdʒᵻz bˈoːɹn, sˌɜː dʒˈeɪmz mˈækɪntˌɑːʃ, sˌɜː dʒˈeɪmz skˈɑːɹlɪt, ænd wˈɪljəm wˈɪlbɚfˌoːɹs. 
101 | -------------------------------------------------------------------------------- /utils/mel_processing.py: -------------------------------------------------------------------------------- 1 | import math 2 | import os 3 | import random 4 | import torch 5 | from torch import nn 6 | import torch.nn.functional as F 7 | import torch.utils.data 8 | import numpy as np 9 | import librosa 10 | import librosa.util as librosa_util 11 | from librosa.util import normalize, pad_center, tiny 12 | 13 | from scipy.signal import get_window 14 | from scipy.io.wavfile import read 15 | from librosa.filters import mel as librosa_mel_fn 16 | 17 | MAX_WAV_VALUE = 32768.0 18 | 19 | 20 | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5): 21 | """ 22 | PARAMS 23 | ------ 24 | C: compression factor 25 | """ 26 | return torch.log(torch.clamp(x, min=clip_val) * C) 27 | 28 | 29 | def dynamic_range_decompression_torch(x, C=1): 30 | """ 31 | PARAMS 32 | ------ 33 | C: compression factor used to compress 34 | """ 35 | return torch.exp(x) / C 36 | 37 | 38 | def spectral_normalize_torch(magnitudes): 39 | output = dynamic_range_compression_torch(magnitudes) 40 | return output 41 | 42 | 43 | def spectral_de_normalize_torch(magnitudes): 44 | output = dynamic_range_decompression_torch(magnitudes) 45 | return output 46 | 47 | 48 | mel_basis = {} 49 | hann_window = {} 50 | 51 | 52 | def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False): 53 | if torch.min(y) < -1.0: 54 | print("min value is ", torch.min(y)) 55 | if torch.max(y) > 1.0: 56 | print("max value is ", torch.max(y)) 57 | 58 | global hann_window 59 | dtype_device = str(y.dtype) + "_" + str(y.device) 60 | wnsize_dtype_device = str(win_size) + "_" + dtype_device 61 | if wnsize_dtype_device not in hann_window: 62 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to( 63 | dtype=y.dtype, device=y.device 64 | ) 65 | 66 | y = torch.nn.functional.pad( 67 | y.unsqueeze(1), 68 | (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), 69 | mode="reflect", 70 | ) 71 | y = y.squeeze(1) 72 | 73 | spec = torch.stft( 74 | y, 75 | n_fft, 76 | hop_length=hop_size, 77 | win_length=win_size, 78 | window=hann_window[wnsize_dtype_device], 79 | center=center, 80 | pad_mode="reflect", 81 | normalized=False, 82 | onesided=True, 83 | ) 84 | 85 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 86 | return spec 87 | 88 | 89 | def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax): 90 | global mel_basis 91 | dtype_device = str(spec.dtype) + "_" + str(spec.device) 92 | fmax_dtype_device = str(fmax) + "_" + dtype_device 93 | if fmax_dtype_device not in mel_basis: 94 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 95 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( 96 | dtype=spec.dtype, device=spec.device 97 | ) 98 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 99 | spec = spectral_normalize_torch(spec) 100 | return spec 101 | 102 | 103 | def mel_spectrogram_torch( 104 | y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False 105 | ): 106 | if torch.min(y) < -1.0: 107 | print("min value is ", torch.min(y)) 108 | if torch.max(y) > 1.0: 109 | print("max value is ", torch.max(y)) 110 | 111 | global mel_basis, hann_window 112 | dtype_device = str(y.dtype) + "_" + str(y.device) 113 | fmax_dtype_device = str(fmax) + "_" + dtype_device 114 | wnsize_dtype_device = str(win_size) + "_" + dtype_device 115 | if fmax_dtype_device not in mel_basis: 116 | mel = 
librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 117 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( 118 | dtype=y.dtype, device=y.device 119 | ) 120 | if wnsize_dtype_device not in hann_window: 121 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to( 122 | dtype=y.dtype, device=y.device 123 | ) 124 | 125 | y = torch.nn.functional.pad( 126 | y.unsqueeze(1), 127 | (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), 128 | mode="reflect", 129 | ) 130 | y = y.squeeze(1) 131 | 132 | spec = torch.stft( 133 | y, 134 | n_fft, 135 | hop_length=hop_size, 136 | win_length=win_size, 137 | window=hann_window[wnsize_dtype_device], 138 | center=center, 139 | pad_mode="reflect", 140 | normalized=False, 141 | onesided=True, 142 | ) 143 | 144 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 145 | 146 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 147 | spec = spectral_normalize_torch(spec) 148 | 149 | return spec 150 | -------------------------------------------------------------------------------- /utils/models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/heatz123/naturalspeech/ca2f814d960149a8fdc3f3e4d773abb67e7c18ec/utils/models/__init__.py -------------------------------------------------------------------------------- /utils/models/attentions.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import torch 4 | from torch import nn 5 | from torch.nn import functional as F 6 | 7 | from utils import commons 8 | from models.modules import LayerNorm 9 | 10 | 11 | class Encoder(nn.Module): 12 | def __init__( 13 | self, 14 | hidden_channels, 15 | filter_channels, 16 | n_heads, 17 | n_layers, 18 | kernel_size=1, 19 | p_dropout=0.0, 20 | window_size=4, 21 | **kwargs 22 | ): 23 | super().__init__() 24 | self.hidden_channels = hidden_channels 25 | self.filter_channels = filter_channels 26 | self.n_heads = n_heads 27 | self.n_layers = n_layers 28 | self.kernel_size = kernel_size 29 | self.p_dropout = p_dropout 30 | self.window_size = window_size 31 | 32 | self.drop = nn.Dropout(p_dropout) 33 | self.attn_layers = nn.ModuleList() 34 | self.norm_layers_1 = nn.ModuleList() 35 | self.ffn_layers = nn.ModuleList() 36 | self.norm_layers_2 = nn.ModuleList() 37 | for i in range(self.n_layers): 38 | self.attn_layers.append( 39 | MultiHeadAttention( 40 | hidden_channels, 41 | hidden_channels, 42 | n_heads, 43 | p_dropout=p_dropout, 44 | window_size=window_size, 45 | ) 46 | ) 47 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 48 | self.ffn_layers.append( 49 | FFN( 50 | hidden_channels, 51 | hidden_channels, 52 | filter_channels, 53 | kernel_size, 54 | p_dropout=p_dropout, 55 | ) 56 | ) 57 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 58 | 59 | def forward(self, x, x_mask): 60 | attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 61 | x = x * x_mask 62 | for i in range(self.n_layers): 63 | y = self.attn_layers[i](x, x, attn_mask) 64 | y = self.drop(y) 65 | x = self.norm_layers_1[i](x + y) 66 | 67 | y = self.ffn_layers[i](x, x_mask) 68 | y = self.drop(y) 69 | x = self.norm_layers_2[i](x + y) 70 | x = x * x_mask 71 | return x 72 | 73 | 74 | class Decoder(nn.Module): 75 | def __init__( 76 | self, 77 | hidden_channels, 78 | filter_channels, 79 | n_heads, 80 | n_layers, 81 | kernel_size=1, 82 | p_dropout=0.0, 83 | proximal_bias=False, 84 | proximal_init=True, 85 | **kwargs 86 | ): 87 | 
super().__init__() 88 | self.hidden_channels = hidden_channels 89 | self.filter_channels = filter_channels 90 | self.n_heads = n_heads 91 | self.n_layers = n_layers 92 | self.kernel_size = kernel_size 93 | self.p_dropout = p_dropout 94 | self.proximal_bias = proximal_bias 95 | self.proximal_init = proximal_init 96 | 97 | self.drop = nn.Dropout(p_dropout) 98 | self.self_attn_layers = nn.ModuleList() 99 | self.norm_layers_0 = nn.ModuleList() 100 | self.encdec_attn_layers = nn.ModuleList() 101 | self.norm_layers_1 = nn.ModuleList() 102 | self.ffn_layers = nn.ModuleList() 103 | self.norm_layers_2 = nn.ModuleList() 104 | for i in range(self.n_layers): 105 | self.self_attn_layers.append( 106 | MultiHeadAttention( 107 | hidden_channels, 108 | hidden_channels, 109 | n_heads, 110 | p_dropout=p_dropout, 111 | proximal_bias=proximal_bias, 112 | proximal_init=proximal_init, 113 | ) 114 | ) 115 | self.norm_layers_0.append(LayerNorm(hidden_channels)) 116 | self.encdec_attn_layers.append( 117 | MultiHeadAttention( 118 | hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout 119 | ) 120 | ) 121 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 122 | self.ffn_layers.append( 123 | FFN( 124 | hidden_channels, 125 | hidden_channels, 126 | filter_channels, 127 | kernel_size, 128 | p_dropout=p_dropout, 129 | causal=True, 130 | ) 131 | ) 132 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 133 | 134 | def forward(self, x, x_mask, h, h_mask): 135 | """ 136 | x: decoder input 137 | h: encoder output 138 | """ 139 | self_attn_mask = commons.subsequent_mask(x_mask.size(2)).to( 140 | device=x.device, dtype=x.dtype 141 | ) 142 | encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 143 | x = x * x_mask 144 | for i in range(self.n_layers): 145 | y = self.self_attn_layers[i](x, x, self_attn_mask) 146 | y = self.drop(y) 147 | x = self.norm_layers_0[i](x + y) 148 | 149 | y = self.encdec_attn_layers[i](x, h, encdec_attn_mask) 150 | y = self.drop(y) 151 | x = self.norm_layers_1[i](x + y) 152 | 153 | y = self.ffn_layers[i](x, x_mask) 154 | y = self.drop(y) 155 | x = self.norm_layers_2[i](x + y) 156 | x = x * x_mask 157 | return x 158 | 159 | 160 | class MultiHeadAttention(nn.Module): 161 | def __init__( 162 | self, 163 | channels, 164 | out_channels, 165 | n_heads, 166 | p_dropout=0.0, 167 | window_size=None, 168 | heads_share=True, 169 | block_length=None, 170 | proximal_bias=False, 171 | proximal_init=False, 172 | ): 173 | super().__init__() 174 | assert channels % n_heads == 0 175 | 176 | self.channels = channels 177 | self.out_channels = out_channels 178 | self.n_heads = n_heads 179 | self.p_dropout = p_dropout 180 | self.window_size = window_size 181 | self.heads_share = heads_share 182 | self.block_length = block_length 183 | self.proximal_bias = proximal_bias 184 | self.proximal_init = proximal_init 185 | self.attn = None 186 | 187 | self.k_channels = channels // n_heads 188 | self.conv_q = nn.Conv1d(channels, channels, 1) 189 | self.conv_k = nn.Conv1d(channels, channels, 1) 190 | self.conv_v = nn.Conv1d(channels, channels, 1) 191 | self.conv_o = nn.Conv1d(channels, out_channels, 1) 192 | self.drop = nn.Dropout(p_dropout) 193 | 194 | if window_size is not None: 195 | n_heads_rel = 1 if heads_share else n_heads 196 | rel_stddev = self.k_channels**-0.5 197 | self.emb_rel_k = nn.Parameter( 198 | torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) 199 | * rel_stddev 200 | ) 201 | self.emb_rel_v = nn.Parameter( 202 | torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) 203 
| * rel_stddev 204 | ) 205 | 206 | nn.init.xavier_uniform_(self.conv_q.weight) 207 | nn.init.xavier_uniform_(self.conv_k.weight) 208 | nn.init.xavier_uniform_(self.conv_v.weight) 209 | if proximal_init: 210 | with torch.no_grad(): 211 | self.conv_k.weight.copy_(self.conv_q.weight) 212 | self.conv_k.bias.copy_(self.conv_q.bias) 213 | 214 | def forward(self, x, c, attn_mask=None): 215 | q = self.conv_q(x) 216 | k = self.conv_k(c) 217 | v = self.conv_v(c) 218 | 219 | x, self.attn = self.attention(q, k, v, mask=attn_mask) 220 | 221 | x = self.conv_o(x) 222 | return x 223 | 224 | def attention(self, query, key, value, mask=None): 225 | # reshape [b, d, t] -> [b, n_h, t, d_k] 226 | b, d, t_s, t_t = (*key.size(), query.size(2)) 227 | query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3) 228 | key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3) 229 | value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3) 230 | 231 | scores = torch.matmul(query / math.sqrt(self.k_channels), key.transpose(-2, -1)) 232 | if self.window_size is not None: 233 | assert ( 234 | t_s == t_t 235 | ), "Relative attention is only available for self-attention." 236 | key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s) 237 | rel_logits = self._matmul_with_relative_keys( 238 | query / math.sqrt(self.k_channels), key_relative_embeddings 239 | ) 240 | scores_local = self._relative_position_to_absolute_position(rel_logits) 241 | scores = scores + scores_local 242 | if self.proximal_bias: 243 | assert t_s == t_t, "Proximal bias is only available for self-attention." 244 | scores = scores + self._attention_bias_proximal(t_s).to( 245 | device=scores.device, dtype=scores.dtype 246 | ) 247 | if mask is not None: 248 | scores = scores.masked_fill(mask == 0, -1e4) 249 | if self.block_length is not None: 250 | assert ( 251 | t_s == t_t 252 | ), "Local attention is only available for self-attention." 253 | block_mask = ( 254 | torch.ones_like(scores) 255 | .triu(-self.block_length) 256 | .tril(self.block_length) 257 | ) 258 | scores = scores.masked_fill(block_mask == 0, -1e4) 259 | p_attn = F.softmax(scores, dim=-1) # [b, n_h, t_t, t_s] 260 | p_attn = self.drop(p_attn) 261 | output = torch.matmul(p_attn, value) 262 | if self.window_size is not None: 263 | relative_weights = self._absolute_position_to_relative_position(p_attn) 264 | value_relative_embeddings = self._get_relative_embeddings( 265 | self.emb_rel_v, t_s 266 | ) 267 | output = output + self._matmul_with_relative_values( 268 | relative_weights, value_relative_embeddings 269 | ) 270 | output = ( 271 | output.transpose(2, 3).contiguous().view(b, d, t_t) 272 | ) # [b, n_h, t_t, d_k] -> [b, d, t_t] 273 | return output, p_attn 274 | 275 | def _matmul_with_relative_values(self, x, y): 276 | """ 277 | x: [b, h, l, m] 278 | y: [h or 1, m, d] 279 | ret: [b, h, l, d] 280 | """ 281 | ret = torch.matmul(x, y.unsqueeze(0)) 282 | return ret 283 | 284 | def _matmul_with_relative_keys(self, x, y): 285 | """ 286 | x: [b, h, l, d] 287 | y: [h or 1, m, d] 288 | ret: [b, h, l, m] 289 | """ 290 | ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1)) 291 | return ret 292 | 293 | def _get_relative_embeddings(self, relative_embeddings, length): 294 | max_relative_position = 2 * self.window_size + 1 295 | # Pad first before slice to avoid using cond ops. 
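# Worked example of the pad/slice logic below (illustrative numbers only): with the default
# window_size=4 the embedding table covers 2*4+1 = 9 relative offsets [-4..4]. For length=6
# we need 2*6-1 = 11 offsets [-5..5], so pad_length = 6-(4+1) = 1 zero row is added on each
# side (out-of-window offsets get zero embeddings, since F.pad defaults to constant zero
# padding). For length=3 we need only 5 offsets [-2..2], so no padding is applied and we
# slice starting at (4+1)-3 = 2 into the table.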
296 | pad_length = max(length - (self.window_size + 1), 0) 297 | slice_start_position = max((self.window_size + 1) - length, 0) 298 | slice_end_position = slice_start_position + 2 * length - 1 299 | if pad_length > 0: 300 | padded_relative_embeddings = F.pad( 301 | relative_embeddings, 302 | commons.convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]]), 303 | ) 304 | else: 305 | padded_relative_embeddings = relative_embeddings 306 | used_relative_embeddings = padded_relative_embeddings[ 307 | :, slice_start_position:slice_end_position 308 | ] 309 | return used_relative_embeddings 310 | 311 | def _relative_position_to_absolute_position(self, x): 312 | """ 313 | x: [b, h, l, 2*l-1] 314 | ret: [b, h, l, l] 315 | """ 316 | batch, heads, length, _ = x.size() 317 | # Concat columns of pad to shift from relative to absolute indexing. 318 | x = F.pad(x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]])) 319 | 320 | # Concat extra elements so as to add up to shape (len+1, 2*len-1). 321 | x_flat = x.view([batch, heads, length * 2 * length]) 322 | x_flat = F.pad( 323 | x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [0, length - 1]]) 324 | ) 325 | 326 | # Reshape and slice out the padded elements. 327 | x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[ 328 | :, :, :length, length - 1 : 329 | ] 330 | return x_final 331 | 332 | def _absolute_position_to_relative_position(self, x): 333 | """ 334 | x: [b, h, l, l] 335 | ret: [b, h, l, 2*l-1] 336 | """ 337 | batch, heads, length, _ = x.size() 338 | # pad along the column dimension 339 | x = F.pad( 340 | x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length - 1]]) 341 | ) 342 | x_flat = x.view([batch, heads, length**2 + length * (length - 1)]) 343 | # add zeros at the beginning so that the elements are skewed after the reshape 344 | x_flat = F.pad(x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [length, 0]])) 345 | x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:] 346 | return x_final 347 | 348 | def _attention_bias_proximal(self, length): 349 | """Bias for self-attention to encourage attention to close positions. 350 | Args: 351 | length: an integer scalar.
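Note: the bias added at position pair (i, j) is -log(1 + |i - j|), so attention to distant positions is discouraged only logarithmically.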
352 | Returns: 353 | a Tensor with shape [1, 1, length, length] 354 | """ 355 | r = torch.arange(length, dtype=torch.float32) 356 | diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1) 357 | return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0) 358 | 359 | 360 | class FFN(nn.Module): 361 | def __init__( 362 | self, 363 | in_channels, 364 | out_channels, 365 | filter_channels, 366 | kernel_size, 367 | p_dropout=0.0, 368 | activation=None, 369 | causal=False, 370 | ): 371 | super().__init__() 372 | self.in_channels = in_channels 373 | self.out_channels = out_channels 374 | self.filter_channels = filter_channels 375 | self.kernel_size = kernel_size 376 | self.p_dropout = p_dropout 377 | self.activation = activation 378 | self.causal = causal 379 | 380 | if causal: 381 | self.padding = self._causal_padding 382 | else: 383 | self.padding = self._same_padding 384 | 385 | self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size) 386 | self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size) 387 | self.drop = nn.Dropout(p_dropout) 388 | 389 | def forward(self, x, x_mask): 390 | x = self.conv_1(self.padding(x * x_mask)) 391 | if self.activation == "gelu": 392 | x = x * torch.sigmoid(1.702 * x) 393 | else: 394 | x = torch.relu(x) 395 | x = self.drop(x) 396 | x = self.conv_2(self.padding(x * x_mask)) 397 | return x * x_mask 398 | 399 | def _causal_padding(self, x): 400 | if self.kernel_size == 1: 401 | return x 402 | pad_l = self.kernel_size - 1 403 | pad_r = 0 404 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 405 | x = F.pad(x, commons.convert_pad_shape(padding)) 406 | return x 407 | 408 | def _same_padding(self, x): 409 | if self.kernel_size == 1: 410 | return x 411 | pad_l = (self.kernel_size - 1) // 2 412 | pad_r = self.kernel_size // 2 413 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 414 | x = F.pad(x, commons.convert_pad_shape(padding)) 415 | return x 416 | -------------------------------------------------------------------------------- /utils/models/losses.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | from models.soft_dtw import SoftDTW 4 | 5 | sdtw = SoftDTW(use_cuda=False, gamma=0.01, warp=134.4) 6 | 7 | 8 | def feature_loss(fmap_r, fmap_g): 9 | loss = 0 10 | for dr, dg in zip(fmap_r, fmap_g): 11 | for rl, gl in zip(dr, dg): 12 | rl = rl.float().detach() 13 | gl = gl.float() 14 | loss += torch.mean(torch.abs(rl - gl)) 15 | 16 | return loss * 2 17 | 18 | 19 | def discriminator_loss(disc_real_outputs, disc_generated_outputs): 20 | loss = 0 21 | r_losses = [] 22 | g_losses = [] 23 | for dr, dg in zip(disc_real_outputs, disc_generated_outputs): 24 | dr = dr.float() 25 | dg = dg.float() 26 | r_loss = torch.mean((1 - dr) ** 2) 27 | g_loss = torch.mean(dg**2) 28 | loss += r_loss + g_loss 29 | r_losses.append(r_loss.item()) 30 | g_losses.append(g_loss.item()) 31 | 32 | return loss, r_losses, g_losses 33 | 34 | 35 | def generator_loss(disc_outputs): 36 | loss = 0 37 | gen_losses = [] 38 | for dg in disc_outputs: 39 | dg = dg.float() 40 | l = torch.mean((1 - dg) ** 2) 41 | gen_losses.append(l) 42 | loss += l 43 | 44 | return loss, gen_losses 45 | 46 | 47 | def kl_loss(z_p, logs_q, m_p, logs_p, z_mask): 48 | """ 49 | z_p, logs_q: [b, h, t_t] 50 | m_p, logs_p: [b, h, t_t] 51 | """ 52 | z_p = z_p.float() 53 | logs_q = logs_q.float() 54 | m_p = m_p.float() 55 | logs_p = logs_p.float() 56 | z_mask = z_mask.float() 57 | 58 | kl = logs_p - logs_q - 0.5 59 | kl += 0.5 
* ((z_p - m_p) ** 2) * torch.exp(-2.0 * logs_p) 60 | kl = torch.sum(kl * z_mask) 61 | l = kl / torch.sum(z_mask) 62 | return l 63 | 64 | 65 | def get_sdtw_kl_matrix(z_p, logs_q, m_p, logs_p): 66 | """ 67 | returns kl matrix with shape [b, t_tp, t_tq] 68 | z_p, logs_q: [b, h, t_tq] 69 | m_p, logs_p: [b, h, t_tp] 70 | """ 71 | z_p = z_p.float() 72 | logs_q = logs_q.float() 73 | m_p = m_p.float() 74 | logs_p = logs_p.float() 75 | 76 | t_tp, t_tq = m_p.size(-1), z_p.size(-1) 77 | b, h, t_tp = m_p.shape 78 | 79 | # for memory testing 80 | # return torch.abs(z_p.mean(dim=1)[:, None, :] - m_p.mean(dim=1)[:, :, None]) 81 | 82 | # z_p = z_p.transpose(0, 1) 83 | # logs_q = logs_q.transpose(0, 1) 84 | # m_p = m_p.transpose(0, 1) 85 | # logs_p = logs_p.transpose(0, 1) 86 | 87 | kls = torch.zeros((b, t_tp, t_tq), dtype=z_p.dtype, device=z_p.device) 88 | for i in range(h): 89 | logs_p_, m_p_, logs_q_, z_p_ = ( 90 | logs_p[:, i, :, None], 91 | m_p[:, i, :, None], 92 | logs_q[:, i, None, :], 93 | z_p[:, i, None, :], 94 | ) 95 | kl = logs_p_ - logs_q_ - 0.5 # [b, t_tp, t_tq] 96 | kl += 0.5 * ((z_p_ - m_p_) ** 2) * torch.exp(-2.0 * logs_p_) 97 | kls += kl 98 | return kls 99 | 100 | kl = logs_p[:, :, :, None] - logs_q[:, :, None, :] - 0.5 # p, q 101 | kl += ( 102 | 0.5 103 | * ((z_p[:, :, None, :] - m_p[:, :, :, None]) ** 2) 104 | * torch.exp(-2.0 * logs_p[:, :, :, None]) 105 | ) 106 | 107 | kl = kl.sum(dim=1) 108 | return kl 109 | 110 | 111 | import torch.utils.checkpoint 112 | 113 | 114 | def kl_loss_dtw(z_p, logs_q, m_p, logs_p, p_mask, q_mask): 115 | INF = 1e5 116 | 117 | kl = get_sdtw_kl_matrix(z_p, logs_q, m_p, logs_p) # [b t_tp t_tq] 118 | kl = torch.nn.functional.pad(kl, (0, 1, 0, 1), "constant", 0) 119 | p_mask = torch.nn.functional.pad(p_mask, (0, 1), "constant", 0) 120 | q_mask = torch.nn.functional.pad(q_mask, (0, 1), "constant", 0) 121 | 122 | kl.masked_fill_(p_mask[:, :, None].bool() ^ q_mask[:, None, :].bool(), INF) 123 | kl.masked_fill_((~p_mask[:, :, None].bool()) & (~q_mask[:, None, :].bool()), 0) 124 | res = sdtw(kl).sum() / p_mask.sum() 125 | return res 126 | 127 | 128 | if __name__ == "__main__": 129 | kl = torch.rand(4, 100, 100) 130 | kl[:, 50:, :] = 1e4 131 | kl[:, :, 50:] = 1e4 132 | kl[:, 50:, 50:] = 0 133 | 134 | print(kl) 135 | print(sdtw(kl).mean() / 50) 136 | -------------------------------------------------------------------------------- /utils/models/soft_dtw.py: -------------------------------------------------------------------------------- 1 | # MIT License 2 | # 3 | # Copyright (c) 2020 Mehran Maghoumi 4 | # 5 | # Permission is hereby granted, free of charge, to any person obtaining a copy 6 | # of this software and associated documentation files (the "Software"), to deal 7 | # in the Software without restriction, including without limitation the rights 8 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | # copies of the Software, and to permit persons to whom the Software is 10 | # furnished to do so, subject to the following conditions: 11 | # 12 | # The above copyright notice and this permission notice shall be included in all 13 | # copies or substantial portions of the Software. 14 | # 15 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | # SOFTWARE. 22 | # ---------------------------------------------------------------------------------------------------------------------- 23 | 24 | import numpy as np 25 | import torch 26 | import torch.cuda 27 | from numba import jit, prange 28 | from torch.autograd import Function 29 | from numba import cuda 30 | import math 31 | 32 | # ---------------------------------------------------------------------------------------------------------------------- 33 | # 34 | # The CPU implementation (further below) is based on https://github.com/Sleepwalking/pytorch-softdtw 35 | # Credit goes to Kanru Hua. 36 | # I've added support for batching and pruning. 37 | # 38 | # ---------------------------------------------------------------------------------------------------------------------- 39 | # ---------------------------------------------------------------------------------------------------------------------- 40 | @cuda.jit 41 | def compute_softdtw_cuda(D, gamma, bandwidth, max_i, max_j, n_passes, R): 42 | """ 43 | :param max_i, max_j: The lengths of the two input sequences 44 | :param n_passes: 2 * max(max_i, max_j) - 1 (The number of anti-diagonals) 45 | """ 46 | # Each block processes one pair of examples 47 | b = cuda.blockIdx.x 48 | # We have as many threads as max(max_i, max_j), because the largest number of threads we need 49 | # is equal to the number of elements on the largest anti-diagonal 50 | tid = cuda.threadIdx.x 51 | 52 | # Compute I, J, the indices from [0, seq_len) 53 | 54 | # The row index is always the same as tid 55 | I = tid 56 | 57 | inv_gamma = 1.0 / gamma 58 | 59 | # Go over each anti-diagonal.
Only process threads that fall on the current anti-diagonal 60 | for p in range(n_passes): 61 | 62 | # The index is actually 'p - tid', but we need to force it in-bounds 63 | J = max(0, min(p - tid, max_j - 1)) 64 | 65 | # For simplicity, we define i, j which start from 1 (offset from I, J) 66 | i = I + 1 67 | j = J + 1 68 | 69 | # Only compute if element[i, j] is on the current anti-diagonal, and also is within bounds 70 | if I + J == p and (I < max_i and J < max_j): 71 | # Don't compute if outside bandwidth 72 | if not (abs(i - j) > bandwidth > 0): 73 | r0 = -R[b, i - 1, j - 1] * inv_gamma 74 | r1 = -R[b, i - 1, j] * inv_gamma - 0.07 # note: the warp penalty is hardcoded as 0.07 here; the CPU path below uses the configurable warp argument 75 | r2 = -R[b, i, j - 1] * inv_gamma - 0.07 76 | rmax = max(max(r0, r1), r2) 77 | rsum = math.exp(r0 - rmax) + math.exp(r1 - rmax) + math.exp(r2 - rmax) 78 | softmin = -gamma * (math.log(rsum) + rmax) 79 | R[b, i, j] = D[b, i - 1, j - 1] + softmin 80 | 81 | # Wait for other threads in this block 82 | cuda.syncthreads() 83 | 84 | 85 | # ---------------------------------------------------------------------------------------------------------------------- 86 | @cuda.jit 87 | def compute_softdtw_backward_cuda( 88 | D, R, inv_gamma, bandwidth, max_i, max_j, n_passes, E 89 | ): 90 | k = cuda.blockIdx.x 91 | tid = cuda.threadIdx.x 92 | 93 | # Indexing logic is the same as above; however, the anti-diagonal needs to 94 | # progress backwards 95 | I = tid 96 | 97 | for p in range(n_passes): 98 | # Reverse the order to make the loop go backward 99 | rev_p = n_passes - p - 1 100 | 101 | # convert tid to I, J, then i, j 102 | J = max(0, min(rev_p - tid, max_j - 1)) 103 | 104 | i = I + 1 105 | j = J + 1 106 | 107 | # Only compute if element[i, j] is on the current anti-diagonal, and also is within bounds 108 | if I + J == rev_p and (I < max_i and J < max_j): 109 | 110 | if math.isinf(R[k, i, j]): 111 | R[k, i, j] = -math.inf 112 | 113 | # Don't compute if outside bandwidth 114 | if not (abs(i - j) > bandwidth > 0): 115 | a = math.exp((R[k, i + 1, j] - R[k, i, j] - D[k, i + 1, j]) * inv_gamma) 116 | b = math.exp((R[k, i, j + 1] - R[k, i, j] - D[k, i, j + 1]) * inv_gamma) 117 | c = math.exp( 118 | (R[k, i + 1, j + 1] - R[k, i, j] - D[k, i + 1, j + 1]) * inv_gamma 119 | ) 120 | E[k, i, j] = ( 121 | E[k, i + 1, j] * a + E[k, i, j + 1] * b + E[k, i + 1, j + 1] * c 122 | ) 123 | 124 | # Wait for other threads in this block 125 | cuda.syncthreads() 126 | 127 | 128 | # ---------------------------------------------------------------------------------------------------------------------- 129 | class _SoftDTWCUDA(Function): 130 | """ 131 | CUDA implementation is inspired by the diagonal one proposed in https://ieeexplore.ieee.org/document/8400444: 132 | "Developing a pattern discovery method in time series data and its GPU acceleration" 133 | """ 134 | 135 | @staticmethod 136 | def forward(ctx, D, gamma, bandwidth, warp): 137 | dev = D.device 138 | dtype = D.dtype 139 | gamma = torch.cuda.FloatTensor([gamma]) 140 | bandwidth = torch.cuda.FloatTensor([bandwidth]) 141 | 142 | B = D.shape[0] 143 | N = D.shape[1] 144 | M = D.shape[2] 145 | threads_per_block = max(N, M) 146 | n_passes = 2 * threads_per_block - 1 147 | 148 | D_ = torch.full((B, N + 2, M + 2), np.inf, dtype=dtype, device=dev) 149 | D_[:, 1 : N + 1, 1 : M + 1] = D 150 | 151 | # Prepare the output array 152 | R = torch.ones((B, N + 2, M + 2), device=dev, dtype=dtype) * math.inf 153 | R[:, 0, 0] = 0 154 | 155 | # Run the CUDA kernel.
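# The kernel sweeps the (N+2) x (M+2) cost matrix one anti-diagonal per pass:
# cell (i, j) depends on (i-1, j-1), (i-1, j) and (i, j-1), so all cells with
# i + j == p can be filled in parallel once pass p-1 is done. The extra border
# of inf entries in D_ and R lets every thread read its three neighbours
# without bounds checks; R[:, 0, 0] = 0 seeds the recursion, and R[:, -2, -2]
# (returned below) holds the final soft alignment cost.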
156 |         # Set CUDA's grid size to be equal to the batch size (every CUDA block processes one sample pair)
157 |         # Set the CUDA block size to be equal to the length of the longer sequence (equal to the size of the largest diagonal)
158 |         compute_softdtw_cuda[B, threads_per_block](
159 |             cuda.as_cuda_array(D_.detach()),
160 |             gamma.item(),
161 |             bandwidth.item(),
162 |             N,
163 |             M,
164 |             n_passes,
165 |             cuda.as_cuda_array(R),
166 |         )
167 |         ctx.save_for_backward(D, R.clone(), gamma, bandwidth)
168 |         return R[:, -2, -2]
169 | 
170 |     @staticmethod
171 |     def backward(ctx, grad_output):
172 |         dev = grad_output.device
173 |         dtype = grad_output.dtype
174 |         D, R, gamma, bandwidth = ctx.saved_tensors
175 | 
176 |         B = D.shape[0]
177 |         N = D.shape[1]
178 |         M = D.shape[2]
179 |         threads_per_block = max(N, M)
180 |         n_passes = 2 * threads_per_block - 1
181 | 
182 |         D_ = torch.zeros((B, N + 2, M + 2), dtype=dtype, device=dev)
183 |         D_[:, 1 : N + 1, 1 : M + 1] = D
184 | 
185 |         R[:, :, -1] = -math.inf
186 |         R[:, -1, :] = -math.inf
187 |         R[:, -1, -1] = R[:, -2, -2]
188 | 
189 |         E = torch.zeros((B, N + 2, M + 2), dtype=dtype, device=dev)
190 |         E[:, -1, -1] = 1
191 | 
192 |         # Grid and block sizes are set the same as for the forward() call above
193 |         compute_softdtw_backward_cuda[B, threads_per_block](
194 |             cuda.as_cuda_array(D_),
195 |             cuda.as_cuda_array(R),
196 |             1.0 / gamma.item(),
197 |             bandwidth.item(),
198 |             N,
199 |             M,
200 |             n_passes,
201 |             cuda.as_cuda_array(E),
202 |         )
203 |         E = E[:, 1 : N + 1, 1 : M + 1]
204 |         return grad_output.view(-1, 1, 1).expand_as(E) * E, None, None, None
205 | 
206 | 
207 | # -------------------------------------------------
208 | 
209 | 
210 | @jit(nopython=True, parallel=True)
211 | def compute_softdtw(D, gamma, bandwidth, warp):
212 |     B = D.shape[0]
213 |     N = D.shape[1]
214 |     M = D.shape[2]
215 |     R = np.ones((B, N + 2, M + 2)) * np.inf
216 |     R[:, 0, 0] = 0
217 |     for b in prange(B):
218 |         for j in range(1, M + 1):
219 |             for i in range(1, N + 1):
220 | 
221 |                 # Check the pruning condition
222 |                 if 0 < bandwidth < np.abs(i - j):
223 |                     continue
224 | 
225 |                 r0 = -R[b, i - 1, j - 1] / gamma
226 |                 # r1 = -(R[b, i - 1, j]) / gamma
227 |                 # r2 = -(R[b, i, j - 1]) / gamma
228 |                 r1 = -(R[b, i - 1, j] + warp) / gamma
229 |                 r2 = -(R[b, i, j - 1] + warp) / gamma
230 |                 rmax = max(max(r0, r1), r2)
231 |                 rsum = np.exp(r0 - rmax) + np.exp(r1 - rmax) + np.exp(r2 - rmax)
232 |                 softmin = -gamma * (np.log(rsum) + rmax)
233 |                 R[b, i, j] = D[b, i - 1, j - 1] + softmin
234 |     return R
235 | 
236 | 
237 | # ----------------------------------------------------------------------------------------------------------------------
238 | @jit(nopython=True, parallel=True)
239 | def compute_softdtw_backward(D_, R, gamma, bandwidth, warp):
240 |     B = D_.shape[0]
241 |     N = D_.shape[1]
242 |     M = D_.shape[2]
243 |     D = np.zeros((B, N + 2, M + 2))
244 |     E = np.zeros((B, N + 2, M + 2))
245 |     D[:, 1 : N + 1, 1 : M + 1] = D_
246 |     E[:, -1, -1] = 1
247 |     R[:, :, -1] = -np.inf
248 |     R[:, -1, :] = -np.inf
249 |     R[:, -1, -1] = R[:, -2, -2]
250 |     for k in prange(B):
251 |         for j in range(M, 0, -1):
252 |             for i in range(N, 0, -1):
253 | 
254 |                 if np.isinf(R[k, i, j]):
255 |                     R[k, i, j] = -np.inf
256 | 
257 |                 # Check the pruning condition
258 |                 if 0 < bandwidth < np.abs(i - j):
259 |                     continue
260 | 
261 |                 # a0 = (R[k, i + 1, j] - R[k, i, j] - D[k, i + 1, j]) / gamma
262 |                 # b0 = (R[k, i, j + 1] - R[k, i, j] - D[k, i, j + 1]) / gamma
263 |                 a0 = (R[k, i + 1, j] - R[k, i, j] - D[k, i + 1, j] - warp) / gamma
264 |                 b0 = (R[k, i, j + 1] - R[k, i, j] - D[k, i, j + 1] - warp) / gamma
265 |                 c0 = (R[k, i + 1, j + 1] - R[k, i, j] - D[k, i + 1, j + 1]) / gamma
266 |                 a = np.exp(a0)
267 |                 b = np.exp(b0)
268 |                 c = np.exp(c0)
269 |                 E[k, i, j] = (
270 |                     E[k, i + 1, j] * a + E[k, i, j + 1] * b + E[k, i + 1, j + 1] * c
271 |                 )
272 |     return E[:, 1 : N + 1, 1 : M + 1]
273 | 
274 | 
275 | # ----------------------------------------------------------------------------------------------------------------------
276 | class _SoftDTW(Function):
277 |     """
278 |     CPU implementation based on https://github.com/Sleepwalking/pytorch-softdtw
279 |     """
280 | 
281 |     @staticmethod
282 |     def forward(ctx, D, gamma, bandwidth, warp):
283 |         dev = D.device
284 |         dtype = D.dtype
285 |         gamma = torch.Tensor([gamma]).to(dev).type(dtype)  # dtype fixed
286 |         bandwidth = torch.Tensor([bandwidth]).to(dev).type(dtype)
287 |         warp = torch.Tensor([warp]).to(dev).type(dtype)
288 | 
289 |         D_ = D.detach().cpu().numpy()
290 |         g_ = gamma.item()
291 |         b_ = bandwidth.item()
292 |         w_ = warp.item()
293 | 
294 |         R = torch.Tensor(compute_softdtw(D_, g_, b_, w_)).to(dev).type(dtype)
295 |         ctx.save_for_backward(D, R, gamma, bandwidth, warp)
296 |         return R[:, -2, -2]
297 | 
298 |     @staticmethod
299 |     def backward(ctx, grad_output):
300 |         dev = grad_output.device
301 |         dtype = grad_output.dtype
302 |         D, R, gamma, bandwidth, warp = ctx.saved_tensors
303 |         D_ = D.detach().cpu().numpy()
304 |         R_ = R.detach().cpu().numpy()
305 |         g_ = gamma.item()
306 |         b_ = bandwidth.item()
307 |         w_ = warp.item()
308 |         E = (
309 |             torch.Tensor(compute_softdtw_backward(D_, R_, g_, b_, w_))
310 |             .to(dev)
311 |             .type(dtype)
312 |         )
313 |         return grad_output.view(-1, 1, 1).expand_as(E) * E, None, None, None
314 | 
315 | 
316 | # ----------------------------------------------------------------------------------------------------------------------
317 | class SoftDTW(torch.nn.Module):
318 |     """
319 |     The soft DTW implementation that optionally supports CUDA
320 |     """
321 | 
322 |     def __init__(
323 |         self, gamma=1.0, use_cuda=True, normalize=False, bandwidth=None, warp=0.07
324 |     ):
325 |         """
326 |         Initializes a new instance using the supplied parameters
327 |         :param gamma: sDTW's gamma parameter
328 |         :param use_cuda: Flag indicating whether the CUDA implementation should be used
329 |         :param normalize: Accepted for compatibility but currently unused (normalization
330 |         as discussed in https://github.com/mblondel/soft-dtw/issues/10#issuecomment-383564790 is not implemented here)
331 |         :param bandwidth: Sakoe-Chiba bandwidth for pruning. Passing 'None' will disable pruning.
332 |         :param warp: Penalty added to non-diagonal moves. Note that the CUDA kernels ignore this value and hard-code warp = 0.07 * gamma.
333 |         """
334 |         super(SoftDTW, self).__init__()
335 |         self.gamma = gamma
336 |         self.use_cuda = use_cuda
337 |         self.bandwidth = 0 if bandwidth is None else float(bandwidth)
338 | 
339 |         self.warp = warp
340 | 
341 |     def _get_func_dtw(self, x):
342 |         """
343 |         Selects the proper implementation (CPU or CUDA) to use.
344 |         """
345 |         # Finally, return the correct function
346 |         return _SoftDTW.apply if not self.use_cuda else _SoftDTWCUDA.apply
347 | 
348 |     def forward(self, X):
349 |         """
350 |         Compute the soft-DTW value for a batch of precomputed distance matrices
351 |         :param X: One batch of pairwise distance matrices, batch_size x len_a x len_b
352 |         (unlike the upstream soft-DTW modules, this version takes the distance matrix directly)
353 |         :return: The computed results
354 |         """
355 | 
356 |         # Check the inputs and get the correct implementation
357 |         func_dtw = self._get_func_dtw(X)
358 | 
359 |         # D_xy = self.dist_func(X, Y)
360 |         D_xy = X
361 |         return func_dtw(D_xy, self.gamma, self.bandwidth, self.warp)
362 | 
363 | 
364 | # ----------------------------------------------------------------------------------------------------------------------
365 | def timed_run(D, sdtw):
366 |     """
367 |     Runs a batch of distance matrices D through sdtw, and times the forward and backward passes.
368 |     Assumes that D requires gradients.
369 |     :return: timing, forward result, backward result
370 |     """
371 |     from timeit import default_timer as timer
372 | 
373 |     # Forward pass
374 |     start = timer()
375 |     forward = sdtw(D)
376 |     end = timer()
377 |     t = end - start
378 | 
379 |     grad_outputs = torch.ones_like(forward)
380 | 
381 |     # Backward
382 |     start = timer()
383 |     grads = torch.autograd.grad(forward, D, grad_outputs=grad_outputs)[0]
384 |     end = timer()
385 | 
386 |     # Total time
387 |     t += end - start
388 | 
389 |     return t, forward, grads
390 | 
391 | 
392 | # ----------------------------------------------------------------------------------------------------------------------
393 | def profile(batch_size, seq_len_a, seq_len_b, dims, tol_backward):
394 |     sdtw = SoftDTW(use_cuda=False, gamma=1.0, normalize=False)
395 |     sdtw_cuda = SoftDTW(use_cuda=True, gamma=1.0, normalize=False)
396 |     n_iters = 6
397 | 
398 |     print(
399 |         "Profiling forward() + backward() times for batch_size={}, seq_len_a={}, seq_len_b={}, dims={}...".format(
400 |             batch_size, seq_len_a, seq_len_b, dims
401 |         )
402 |     )
403 | 
404 |     times_cpu = []
405 |     times_gpu = []
406 | 
407 |     for i in range(n_iters):
408 |         a_cpu = torch.rand((batch_size, seq_len_a, dims))
409 |         b_cpu = torch.rand((batch_size, seq_len_b, dims))
410 |         D_cpu = (torch.cdist(a_cpu, b_cpu) ** 2).requires_grad_(True)  # SoftDTW consumes precomputed distance matrices
411 |         D_gpu = D_cpu.detach().cuda().requires_grad_(True)
412 | 
413 |         # GPU
414 |         t_gpu, forward_gpu, backward_gpu = timed_run(D_gpu, sdtw_cuda)
415 | 
416 |         # CPU
417 |         t_cpu, forward_cpu, backward_cpu = timed_run(D_cpu, sdtw)
418 | 
419 |         # Verify the results
420 |         assert torch.allclose(forward_cpu, forward_gpu.cpu())
421 |         assert torch.allclose(backward_cpu, backward_gpu.cpu(), atol=tol_backward)
422 | 
423 |         if (
424 |             i > 0
425 |         ):  # Ignore the first time we run, in case this is a cold start (because timings are off at a cold start of the script)
426 |             times_cpu += [t_cpu]
427 |             times_gpu += [t_gpu]
428 | 
429 |     # Average and log
430 |     avg_cpu = np.mean(times_cpu)
431 |     avg_gpu = np.mean(times_gpu)
432 |     print(" CPU: ", avg_cpu)
433 |     print(" GPU: ", avg_gpu)
434 |     print(" Speedup: ", avg_cpu / avg_gpu)
435 |     print()
436 | 
437 | 
438 | # ----------------------------------------------------------------------------------------------------------------------
439 | if __name__ == "__main__":
440 |     from timeit import default_timer as timer
441 | 
442 |     torch.manual_seed(1234)
443 | 
444 |     profile(128, 17, 15, 2, tol_backward=1e-6)
445 |     profile(512, 64, 64, 2, tol_backward=1e-4)
446 |     profile(512, 256, 256, 2, tol_backward=1e-3)
447 | 
--------------------------------------------------------------------------------
/utils/preprocess_durations.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import text
3 | from utils import load_filepaths_and_text
4 | 
5 | # from data_utils import (
6 | #     TextAudioLoader,
7 | # )
8 | from text.symbols import symbols
9 | from models import (
10 |     SynthesizerTrn,
11 | )
12 | import utils
13 | from torch.utils.data import DataLoader
14 | import torch
15 | 
16 | import os
17 | import random
18 | import numpy as np
19 | import torch
20 | import torch.utils.data
21 | 
22 | import commons
23 | from mel_processing import spectrogram_torch
24 | from utils import load_wav_to_torch, load_filepaths_and_text
25 | from text import text_to_sequence, cleaned_text_to_sequence
26 | 
27 | import json
28 | 
29 | 
30 | class TextAudioLoaderWithPath(torch.utils.data.Dataset):
31 |     """
32 |     1) loads audio, text pairs
33 |     2) normalizes text and converts them to sequences of integers
34 |     3) computes spectrograms from audio files.
35 |     """
36 | 
37 |     def __init__(self, audiopaths_and_text, hparams):
38 |         self.audiopaths_and_text = load_filepaths_and_text(audiopaths_and_text)
39 |         self.text_cleaners = hparams.text_cleaners
40 |         self.max_wav_value = hparams.max_wav_value
41 |         self.sampling_rate = hparams.sampling_rate
42 |         self.filter_length = hparams.filter_length
43 |         self.hop_length = hparams.hop_length
44 |         self.win_length = hparams.win_length
45 |         self.sampling_rate = hparams.sampling_rate
46 | 
47 |         self.cleaned_text = getattr(hparams, "cleaned_text", False)
48 | 
49 |         self.add_blank = hparams.add_blank
50 |         self.min_text_len = getattr(hparams, "min_text_len", 1)
51 |         self.max_text_len = getattr(hparams, "max_text_len", 190)
52 | 
53 |         random.seed(1234)
54 |         random.shuffle(self.audiopaths_and_text)
55 |         self._filter()
56 | 
57 |     def _filter(self):
58 |         """
59 |         Filter text & store spec lengths
60 |         """
61 |         # Store spectrogram lengths for Bucketing
62 |         # wav_length ~= file_size / (wav_channels * Bytes per dim) = file_size / (1 * 2)
63 |         # spec_length = wav_length // hop_length
64 | 
65 |         audiopaths_and_text_new = []
66 |         lengths = []
67 |         for audiopath, text in self.audiopaths_and_text:
68 |             if self.min_text_len <= len(text) and len(text) <= self.max_text_len:
69 |                 audiopaths_and_text_new.append([audiopath, text])
70 |                 lengths.append(os.path.getsize(audiopath) // (2 * self.hop_length))
71 |         self.audiopaths_and_text = audiopaths_and_text_new
72 |         self.lengths = lengths
73 | 
74 |     def get_audio_text_pair(self, audiopath_and_text):
75 |         # separate filename and text
76 |         audiopath, text = audiopath_and_text[0], audiopath_and_text[1]
77 |         text = self.get_text(text)
78 |         spec, wav = self.get_audio(audiopath)
79 |         return (text, spec, wav, audiopath)
80 | 
81 |     def get_audio(self, filename):
82 |         audio, sampling_rate = load_wav_to_torch(filename)
83 |         if sampling_rate != self.sampling_rate:
84 |             raise ValueError(
85 |                 "{} SR doesn't match target {} SR".format(
86 |                     sampling_rate, self.sampling_rate
87 |                 )
88 |             )
89 |         audio_norm = audio / self.max_wav_value
90 |         audio_norm = audio_norm.unsqueeze(0)
91 |         spec_filename = filename.replace(".wav", ".spec.pt")
92 |         if os.path.exists(spec_filename):
93 |             spec = torch.load(spec_filename)
94 |         else:
95 |             spec = spectrogram_torch(
96 |                 audio_norm,
97 |                 self.filter_length,
98 |                 self.sampling_rate,
99 |                 self.hop_length,
100 |                 self.win_length,
101 |                 center=False,
102 |             )
103 |             spec = torch.squeeze(spec, 0)
104 |             torch.save(spec, spec_filename)
105 |         return spec, audio_norm
106 | 
107 |     def get_text(self, text):
108 |         if self.cleaned_text:
109 |             text_norm = cleaned_text_to_sequence(text)
110 |         else:
111 |             text_norm = text_to_sequence(text, self.text_cleaners)
112 |         if self.add_blank:
113 |             text_norm = commons.intersperse(text_norm, 0)
114 |         text_norm = torch.LongTensor(text_norm)
115 |         return text_norm
116 | 
117 |     def __getitem__(self, index):
118 |         return self.get_audio_text_pair(self.audiopaths_and_text[index])
119 | 
120 |     def __len__(self):
121 |         return len(self.audiopaths_and_text)
122 | 
123 | 
124 | class TextAudioCollateWithPath:
125 |     """Zero-pads model inputs and targets"""
126 | 
127 |     def __call__(self, batch):
128 |         """Collates a training batch from normalized text and audio
129 |         PARAMS
130 |         ------
131 |         batch: [text_normalized, spec_normalized, wav_normalized]
132 |         """
133 |         # Right zero-pad all one-hot text sequences to max input length
134 |         _, ids_sorted_decreasing = torch.sort(
135 |             torch.LongTensor([x[1].size(1) for x in batch]), dim=0, descending=True
136 |         )
137 | 
138 |         max_text_len = max([len(x[0]) for x in batch])
139 |         max_spec_len = max([x[1].size(1) for x in batch])
140 |         max_wav_len = max([x[2].size(1) for x in batch])
141 | 
142 |         text_lengths = torch.LongTensor(len(batch))
143 |         spec_lengths = torch.LongTensor(len(batch))
144 |         wav_lengths = torch.LongTensor(len(batch))
145 | 
146 |         text_padded = torch.LongTensor(len(batch), max_text_len)
147 |         spec_padded = torch.FloatTensor(len(batch), batch[0][1].size(0), max_spec_len)
148 |         wav_padded = torch.FloatTensor(len(batch), 1, max_wav_len)
149 |         text_padded.zero_()
150 |         spec_padded.zero_()
151 |         wav_padded.zero_()
152 |         for i in range(len(ids_sorted_decreasing)):
153 |             row = batch[ids_sorted_decreasing[i]]
154 | 
155 |             text = row[0]
156 |             text_padded[i, : text.size(0)] = text
157 |             text_lengths[i] = text.size(0)
158 | 
159 |             spec = row[1]
160 |             spec_padded[i, :, : spec.size(1)] = spec
161 |             spec_lengths[i] = spec.size(1)
162 | 
163 |             wav = row[2]
164 |             wav_padded[i, :, : wav.size(1)] = wav
165 |             wav_lengths[i] = wav.size(1)
166 | 
167 |         paths = [x[3] for x in batch]
168 | 
169 |         return (
170 |             text_padded,
171 |             text_lengths,
172 |             spec_padded,
173 |             spec_lengths,
174 |             wav_padded,
175 |             wav_lengths,
176 |             paths,
177 |         )
178 | 
179 | 
180 | if __name__ == "__main__":
181 |     parser = argparse.ArgumentParser()
182 |     parser.add_argument("--weights_path", default="./pretrained_ljs.pth")
183 |     parser.add_argument(
184 |         "--filelists",
185 |         nargs="+",
186 |         default=[
187 |             "filelists/ljs_audio_text_val_filelist.txt",
188 |             "filelists/ljs_audio_text_test_filelist.txt",
189 |         ],
190 |     )
191 | 
192 |     args = parser.parse_args()
193 | 
194 |     hps = utils.get_hparams_from_file("./configs/ljs_base.json")
195 |     net_g = SynthesizerTrn(
196 |         len(symbols),
197 |         hps.data.filter_length // 2 + 1,
198 |         hps.train.segment_size // hps.data.hop_length,
199 |         **hps.model,
200 |     ).cuda()
201 |     _ = utils.load_checkpoint(args.weights_path, net_g, None)
202 |     net_g.eval()
203 | 
204 |     print(args.filelists)
205 | 
206 |     os.makedirs("durations", exist_ok=True)
207 | 
208 |     for filelist in args.filelists:
209 |         dataset = TextAudioLoaderWithPath(filelist, hps.data)
210 |         dataloader = DataLoader(
211 |             dataset,
212 |             batch_size=1,
213 |             shuffle=False,
214 |             pin_memory=False,
215 |             collate_fn=TextAudioCollateWithPath(),
216 |         )
217 | 
218 |         with torch.no_grad():
219 |             for batch_idx, (
220 |                 x,
221 |                 x_lengths,
222 |                 spec,
223 |                 spec_lengths,
224 |                 y,
225 |                 y_lengths,
226 |                 paths,
227 |             ) in enumerate(dataloader):
228 |                 x, x_lengths = x.cuda(0), x_lengths.cuda(0)
229 |                 spec, spec_lengths = spec.cuda(0), spec_lengths.cuda(0)
230 |                 y, y_lengths = y.cuda(0), y_lengths.cuda(0)
231 |                 (path,) = paths
232 | 
233 |                 (
234 |                     y_hat,
235 |                     l_length,
236 |                     attn,
237 |                     ids_slice,
238 |                     x_mask,
239 |                     z_mask,
240 |                     (z, z_p, m_p, logs_p, m_q, logs_q),
241 |                 ) = net_g(x, x_lengths, spec, spec_lengths)
242 | 
243 |                 w = attn.sum(2).flatten()
244 |                 x = x.flatten()
245 | 
246 |                 save_path = os.path.join("durations", os.path.basename(path) + ".npy")
247 |                 np.save(save_path, w.cpu().numpy().astype(int))
248 | 
249 |                 if batch_idx % 100 == 0:
250 |                     print(f"{batch_idx} / {len(dataloader)}")
251 | 
--------------------------------------------------------------------------------
/utils/preprocess_texts.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import text
3 | from utils.utils import load_filepaths_and_text
4 | 
5 | if __name__ == "__main__":
6 |     parser = argparse.ArgumentParser()
7 |     parser.add_argument("--out_extension", default="cleaned")
8 |     parser.add_argument("--text_index", default=1, type=int)
9 |     parser.add_argument(
10 |         "--filelists",
11 |         nargs="+",
12 |         default=[
13 |             "filelists/ljs_audio_text_val_filelist.txt",
14 |             "filelists/ljs_audio_text_test_filelist.txt",
15 |         ],
16 |     )
17 |     parser.add_argument("--text_cleaners", nargs="+", default=["english_cleaners2"])
18 | 
19 |     args = parser.parse_args()
20 | 
21 |     for filelist in args.filelists:
22 |         print("START:", filelist)
23 |         filepaths_and_text = load_filepaths_and_text(filelist)
24 |         for i in range(len(filepaths_and_text)):
25 |             original_text = filepaths_and_text[i][args.text_index]
26 |             cleaned_text = text._clean_text(original_text, args.text_cleaners)
27 |             filepaths_and_text[i][args.text_index] = cleaned_text
28 | 
29 |         new_filelist = filelist + "." + args.out_extension
30 |         with open(new_filelist, "w", encoding="utf-8") as f:
31 |             f.writelines(["|".join(x) + "\n" for x in filepaths_and_text])
32 | 
--------------------------------------------------------------------------------
/utils/requirements.txt:
--------------------------------------------------------------------------------
1 | Cython>=0.29.21
2 | librosa>=0.8.0
3 | matplotlib>=3.3.1
4 | numpy>=1.18.5
5 | phonemizer>=2.2.1
6 | scipy>=1.5.2
7 | tensorboard>=2.3.0
8 | torch>=1.6.0
9 | torchvision>=0.7.0
10 | Unidecode>=1.1.1
--------------------------------------------------------------------------------
/utils/resources/figure1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/heatz123/naturalspeech/ca2f814d960149a8fdc3f3e4d773abb67e7c18ec/utils/resources/figure1.png
--------------------------------------------------------------------------------
/utils/text/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2017 Keith Ito
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy
4 | of this software and associated documentation files (the "Software"), to deal
5 | in the Software without restriction, including without limitation the rights
6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7 | copies of the Software, and to permit persons to whom the Software is
8 | furnished to do so, subject to the following conditions:
9 | 
10 | The above copyright notice and this permission notice shall be included in
11 | all copies or substantial portions of the Software.
12 | 
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
19 | THE SOFTWARE.
20 | 
--------------------------------------------------------------------------------
/utils/text/__init__.py:
--------------------------------------------------------------------------------
1 | """ from https://github.com/keithito/tacotron """
2 | from text import cleaners
3 | from text.symbols import symbols
4 | 
5 | 
6 | # Mappings from symbol to numeric ID and vice versa:
7 | _symbol_to_id = {s: i for i, s in enumerate(symbols)}
8 | _id_to_symbol = {i: s for i, s in enumerate(symbols)}
9 | 
10 | 
11 | def text_to_sequence(text, cleaner_names):
12 |     """Converts a string of text to a sequence of IDs corresponding to the symbols in the text.
13 |     Args:
14 |         text: string to convert to a sequence
15 |         cleaner_names: names of the cleaner functions to run the text through
16 |     Returns:
17 |         List of integers corresponding to the symbols in the text
18 |     """
19 |     sequence = []
20 | 
21 |     clean_text = _clean_text(text, cleaner_names)
22 |     for symbol in clean_text:
23 |         symbol_id = _symbol_to_id[symbol]
24 |         sequence += [symbol_id]
25 |     return sequence
26 | 
27 | 
28 | def cleaned_text_to_sequence(cleaned_text):
29 |     """Converts a string of already-cleaned text to a sequence of IDs corresponding to the symbols in the text.
30 |     Args:
31 |         cleaned_text: string to convert to a sequence
32 |     Returns:
33 |         List of integers corresponding to the symbols in the text
34 |     """
35 |     sequence = [_symbol_to_id[symbol] for symbol in cleaned_text]
36 |     return sequence
37 | 
38 | 
39 | def sequence_to_text(sequence):
40 |     """Converts a sequence of IDs back to a string"""
41 |     result = ""
42 |     for symbol_id in sequence:
43 |         s = _id_to_symbol[symbol_id]
44 |         result += s
45 |     return result
46 | 
47 | 
48 | def _clean_text(text, cleaner_names):
49 |     for name in cleaner_names:
50 |         cleaner = getattr(cleaners, name, None)  # default to None so unknown names reach the check below instead of raising AttributeError
51 |         if not cleaner:
52 |             raise Exception("Unknown cleaner: %s" % name)
53 |         text = cleaner(text)
54 |     return text
55 | 
--------------------------------------------------------------------------------
/utils/text/cleaners.py:
--------------------------------------------------------------------------------
1 | """ from https://github.com/keithito/tacotron """
2 | 
3 | """
4 | Cleaners are transformations that run over the input text at both training and eval time.
5 | 
6 | Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
7 | hyperparameter. Some cleaners are English-specific. You'll typically want to use:
8 |   1. "english_cleaners" for English text
9 |   2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
10 |      the Unidecode library (https://pypi.python.org/pypi/Unidecode)
11 |   3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
12 |      the symbols in symbols.py to match your data).
13 | """ 14 | 15 | import re 16 | from unidecode import unidecode 17 | from phonemizer import phonemize 18 | 19 | 20 | # Regular expression matching whitespace: 21 | _whitespace_re = re.compile(r"\s+") 22 | 23 | # List of (regular expression, replacement) pairs for abbreviations: 24 | _abbreviations = [ 25 | (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1]) 26 | for x in [ 27 | ("mrs", "misess"), 28 | ("mr", "mister"), 29 | ("dr", "doctor"), 30 | ("st", "saint"), 31 | ("co", "company"), 32 | ("jr", "junior"), 33 | ("maj", "major"), 34 | ("gen", "general"), 35 | ("drs", "doctors"), 36 | ("rev", "reverend"), 37 | ("lt", "lieutenant"), 38 | ("hon", "honorable"), 39 | ("sgt", "sergeant"), 40 | ("capt", "captain"), 41 | ("esq", "esquire"), 42 | ("ltd", "limited"), 43 | ("col", "colonel"), 44 | ("ft", "fort"), 45 | ] 46 | ] 47 | 48 | 49 | def expand_abbreviations(text): 50 | for regex, replacement in _abbreviations: 51 | text = re.sub(regex, replacement, text) 52 | return text 53 | 54 | 55 | def expand_numbers(text): 56 | return normalize_numbers(text) 57 | 58 | 59 | def lowercase(text): 60 | return text.lower() 61 | 62 | 63 | def collapse_whitespace(text): 64 | return re.sub(_whitespace_re, " ", text) 65 | 66 | 67 | def convert_to_ascii(text): 68 | return unidecode(text) 69 | 70 | 71 | def basic_cleaners(text): 72 | """Basic pipeline that lowercases and collapses whitespace without transliteration.""" 73 | text = lowercase(text) 74 | text = collapse_whitespace(text) 75 | return text 76 | 77 | 78 | def transliteration_cleaners(text): 79 | """Pipeline for non-English text that transliterates to ASCII.""" 80 | text = convert_to_ascii(text) 81 | text = lowercase(text) 82 | text = collapse_whitespace(text) 83 | return text 84 | 85 | 86 | def english_cleaners(text): 87 | """Pipeline for English text, including abbreviation expansion.""" 88 | text = convert_to_ascii(text) 89 | text = lowercase(text) 90 | text = expand_abbreviations(text) 91 | phonemes = phonemize(text, language="en-us", backend="espeak", strip=True) 92 | phonemes = collapse_whitespace(phonemes) 93 | return phonemes 94 | 95 | 96 | def english_cleaners2(text): 97 | """Pipeline for English text, including abbreviation expansion. + punctuation + stress""" 98 | text = convert_to_ascii(text) 99 | text = lowercase(text) 100 | text = expand_abbreviations(text) 101 | phonemes = phonemize( 102 | text, 103 | language="en-us", 104 | backend="espeak", 105 | strip=True, 106 | preserve_punctuation=True, 107 | with_stress=True, 108 | ) 109 | phonemes = collapse_whitespace(phonemes) 110 | return phonemes 111 | -------------------------------------------------------------------------------- /utils/text/symbols.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | """ 4 | Defines the set of symbols used in text input to the model. 
5 | """ 6 | _pad = "_" 7 | _punctuation = ';:,.!?¡¿—…"«»“” ' 8 | _letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" 9 | _letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ" 10 | 11 | 12 | # Export all symbols: 13 | symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa) 14 | 15 | # Special symbol ids 16 | SPACE_ID = symbols.index(" ") 17 | -------------------------------------------------------------------------------- /utils/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import sys 4 | import argparse 5 | import logging 6 | import json 7 | import subprocess 8 | import numpy as np 9 | from scipy.io.wavfile import read 10 | import torch 11 | 12 | MATPLOTLIB_FLAG = False 13 | 14 | logging.basicConfig(stream=sys.stdout, level=logging.INFO) # can be changed to DEBUG 15 | logger = logging 16 | 17 | 18 | def load_checkpoint(checkpoint_path, model, optimizer=None): 19 | assert os.path.isfile(checkpoint_path) 20 | 21 | checkpoint_dict = torch.load(checkpoint_path, map_location="cpu") 22 | 23 | iteration = checkpoint_dict["iteration"] 24 | 25 | learning_rate = checkpoint_dict["learning_rate"] 26 | 27 | saved_state_dict = checkpoint_dict["model"] 28 | if hasattr(model, "module"): 29 | state_dict = model.module.state_dict() 30 | else: 31 | state_dict = model.state_dict() 32 | new_state_dict = {} 33 | for k, v in state_dict.items(): 34 | try: 35 | new_state_dict[k] = saved_state_dict[k] 36 | except: 37 | logger.info("%s is not in the checkpoint" % k) 38 | new_state_dict[k] = v 39 | if hasattr(model, "module"): 40 | model.module.load_state_dict(new_state_dict) 41 | else: 42 | model.load_state_dict(new_state_dict) 43 | 44 | if optimizer is not None: 45 | optimizer.load_state_dict(checkpoint_dict["optimizer"]) 46 | 47 | logger.info( 48 | "Loaded checkpoint '{}' (iteration {})".format(checkpoint_path, iteration) 49 | ) 50 | return model, optimizer, learning_rate, iteration 51 | 52 | 53 | def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path): 54 | logger.info( 55 | "Saving model and optimizer state at iteration {} to {}".format( 56 | iteration, checkpoint_path 57 | ) 58 | ) 59 | if hasattr(model, "module"): 60 | state_dict = model.module.state_dict() 61 | else: 62 | state_dict = model.state_dict() 63 | torch.save( 64 | { 65 | "model": state_dict, 66 | "iteration": iteration, 67 | "optimizer": optimizer.state_dict(), 68 | "learning_rate": learning_rate, 69 | }, 70 | checkpoint_path, 71 | ) 72 | 73 | 74 | def summarize( 75 | writer, 76 | global_step, 77 | scalars={}, 78 | histograms={}, 79 | images={}, 80 | audios={}, 81 | audio_sampling_rate=22050, 82 | ): 83 | for k, v in scalars.items(): 84 | writer.add_scalar(k, v, global_step) 85 | for k, v in histograms.items(): 86 | writer.add_histogram(k, v, global_step) 87 | for k, v in images.items(): 88 | writer.add_image(k, v, global_step, dataformats="HWC") 89 | for k, v in audios.items(): 90 | writer.add_audio(k, v, global_step, audio_sampling_rate) 91 | 92 | 93 | def latest_checkpoint_path(dir_path, regex="G_*.pth"): 94 | f_list = glob.glob(os.path.join(dir_path, regex)) 95 | f_list.sort(key=lambda f: int("".join(filter(str.isdigit, f)))) 96 | if not f_list: 97 | print("no checkpoint detected") 98 | x = f_list[-1] 99 | print(x) 100 | return x 101 | 102 | 103 | def plot_spectrogram_to_numpy(spectrogram): 104 | global MATPLOTLIB_FLAG 105 | if not 
MATPLOTLIB_FLAG: 106 | import matplotlib 107 | 108 | matplotlib.use("Agg") 109 | MATPLOTLIB_FLAG = True 110 | mpl_logger = logging.getLogger("matplotlib") 111 | mpl_logger.setLevel(logging.WARNING) 112 | import matplotlib.pylab as plt 113 | import numpy as np 114 | 115 | fig, ax = plt.subplots(figsize=(10, 2)) 116 | im = ax.imshow(spectrogram, aspect="auto", origin="lower", interpolation="none") 117 | plt.colorbar(im, ax=ax) 118 | plt.xlabel("Frames") 119 | plt.ylabel("Channels") 120 | plt.tight_layout() 121 | 122 | fig.canvas.draw() 123 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep="") 124 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 125 | plt.close() 126 | return data 127 | 128 | 129 | def plot_alignment_to_numpy(alignment, info=None): 130 | global MATPLOTLIB_FLAG 131 | if not MATPLOTLIB_FLAG: 132 | import matplotlib 133 | 134 | matplotlib.use("Agg") 135 | MATPLOTLIB_FLAG = True 136 | mpl_logger = logging.getLogger("matplotlib") 137 | mpl_logger.setLevel(logging.WARNING) 138 | import matplotlib.pylab as plt 139 | import numpy as np 140 | 141 | fig, ax = plt.subplots(figsize=(6, 4)) 142 | im = ax.imshow( 143 | alignment.transpose(), aspect="auto", origin="lower", interpolation="none" 144 | ) 145 | fig.colorbar(im, ax=ax) 146 | xlabel = "Decoder timestep" 147 | if info is not None: 148 | xlabel += "\n\n" + info 149 | plt.xlabel(xlabel) 150 | plt.ylabel("Encoder timestep") 151 | plt.tight_layout() 152 | 153 | fig.canvas.draw() 154 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep="") 155 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 156 | plt.close() 157 | return data 158 | 159 | 160 | def load_wav_to_torch(full_path): 161 | sampling_rate, data = read(full_path) 162 | return torch.FloatTensor(data.astype(np.float32)), sampling_rate 163 | 164 | 165 | def load_filepaths_and_text(filename, split="|"): 166 | with open(filename, encoding="utf-8") as f: 167 | filepaths_and_text = [line.strip().split(split) for line in f] 168 | return filepaths_and_text 169 | 170 | 171 | def get_hparams(init=True): 172 | parser = argparse.ArgumentParser() 173 | parser.add_argument( 174 | "-c", 175 | "--config", 176 | type=str, 177 | default="./configs/base.json", 178 | help="JSON file for configuration", 179 | ) 180 | parser.add_argument("-m", "--model", type=str, required=True, help="Model name") 181 | parser.add_argument( 182 | "--warmup", action="store_true", help="Whether to train as warmup phase" 183 | ) 184 | 185 | args = parser.parse_args() 186 | model_dir = os.path.join("./logs", args.model) 187 | 188 | if not os.path.exists(model_dir): 189 | os.makedirs(model_dir) 190 | 191 | config_path = args.config 192 | config_save_path = os.path.join(model_dir, "config.json") 193 | if init: 194 | with open(config_path, "r") as f: 195 | data = f.read() 196 | with open(config_save_path, "w") as f: 197 | f.write(data) 198 | else: 199 | with open(config_save_path, "r") as f: 200 | data = f.read() 201 | config = json.loads(data) 202 | 203 | hparams = HParams(**config) 204 | hparams.model_dir = model_dir 205 | hparams.warmup = args.warmup 206 | 207 | return hparams 208 | 209 | 210 | def get_hparams_from_dir(model_dir): 211 | config_save_path = os.path.join(model_dir, "config.json") 212 | with open(config_save_path, "r") as f: 213 | data = f.read() 214 | config = json.loads(data) 215 | 216 | hparams = HParams(**config) 217 | hparams.model_dir = model_dir 218 | return hparams 219 | 220 | 221 | def get_hparams_from_file(config_path): 222 | 
with open(config_path, "r") as f: 223 | data = f.read() 224 | config = json.loads(data) 225 | 226 | hparams = HParams(**config) 227 | return hparams 228 | 229 | 230 | def check_git_hash(model_dir): 231 | source_dir = os.path.dirname(os.path.realpath(__file__)) 232 | if not os.path.exists(os.path.join(source_dir, ".git")): 233 | logger.warn( 234 | "{} is not a git repository, therefore hash value comparison will be ignored.".format( 235 | source_dir 236 | ) 237 | ) 238 | return 239 | 240 | cur_hash = subprocess.getoutput("git rev-parse HEAD") 241 | 242 | path = os.path.join(model_dir, "githash") 243 | if os.path.exists(path): 244 | saved_hash = open(path).read() 245 | if saved_hash != cur_hash: 246 | logger.warn( 247 | "git hash values are different. {}(saved) != {}(current)".format( 248 | saved_hash[:8], cur_hash[:8] 249 | ) 250 | ) 251 | else: 252 | open(path, "w").write(cur_hash) 253 | 254 | 255 | def get_logger(model_dir, filename="train.log"): 256 | global logger 257 | logger = logging.getLogger(os.path.basename(model_dir)) 258 | logger.setLevel(logging.DEBUG) 259 | 260 | formatter = logging.Formatter("%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s") 261 | if not os.path.exists(model_dir): 262 | os.makedirs(model_dir) 263 | h = logging.FileHandler(os.path.join(model_dir, filename)) 264 | h.setLevel(logging.DEBUG) 265 | h.setFormatter(formatter) 266 | logger.addHandler(h) 267 | return logger 268 | 269 | 270 | class HParams: 271 | def __init__(self, **kwargs): 272 | for k, v in kwargs.items(): 273 | if type(v) == dict: 274 | v = HParams(**v) 275 | self[k] = v 276 | 277 | def keys(self): 278 | return self.__dict__.keys() 279 | 280 | def items(self): 281 | return self.__dict__.items() 282 | 283 | def values(self): 284 | return self.__dict__.values() 285 | 286 | def __len__(self): 287 | return len(self.__dict__) 288 | 289 | def __getitem__(self, key): 290 | return getattr(self, key) 291 | 292 | def __setitem__(self, key, value): 293 | return setattr(self, key, value) 294 | 295 | def __contains__(self, key): 296 | return key in self.__dict__ 297 | 298 | def __repr__(self): 299 | return self.__dict__.__repr__() 300 | --------------------------------------------------------------------------------
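Two short usage sketches to tie the utilities together. First, the text front end: a hypothetical round-trip through the cleaners and the symbol mapping, mirroring what `get_text` in preprocess_durations.py does at load time (it assumes `espeak` is installed for the phonemizer backend, as in the README's preprocessing step; the input sentence is arbitrary):

```python
import commons
from text import text_to_sequence, sequence_to_text

# Clean + phonemize, then map phoneme symbols to integer ids.
ids = text_to_sequence("Dr. Smith is here.", ["english_cleaners2"])

# With add_blank enabled, a blank id (0) is interspersed between symbols.
ids = commons.intersperse(ids, 0)

# Dropping the blanks back out recovers the phoneme string for inspection.
print(sequence_to_text(ids[1::2]))
```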
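Second, the config/checkpoint helpers: `HParams` wraps nested JSON dicts so they can be read with attribute access, and `load_checkpoint` restores weights non-strictly (keys missing from the checkpoint fall back to the model's own initialization). A minimal sketch; the paths are illustrative, and the commented line assumes a model built as in preprocess_durations.py:

```python
import utils

hps = utils.get_hparams_from_file("configs/ljs.json")
print(hps.data.sampling_rate)   # nested dicts become attribute-style HParams
print("train" in hps)           # dict-style access is supported too

# net_g = ...  # construct the model first, e.g. as in preprocess_durations.py
# utils.load_checkpoint("logs/[run_name]/G_xxx.pth", net_g, None)
```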