Scientists at the CERN laboratory say they have discovered a new particle.

There’s a way to measure the acute emotional intelligence that has never gone out of style.

President Trump met with other leaders at the Group of 20 conference.

In a statement announcing his resignation, Mr Ross said: "While the intentions may have been well meaning, the reaction to this news shows that Mr Cummings' interpretation of the government advice was not shared by the vast majority of people who have done as the government asked."

## Forward Tacotron + MelGAN Vocoder

The samples are generated with a model trained for 400k steps on [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) together with the pretrained MelGAN vocoder provided by the [MelGAN repo](https://github.com/seungwonpark/melgan).

Scientists at the CERN laboratory say they have discovered a new particle.

There’s a way to measure the acute emotional intelligence that has never gone out of style.

President Trump met with other leaders at the Group of 20 conference.

## Forward Tacotron + WaveRNN Vocoder

The samples are generated with a model trained for 100k steps on [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) together with the pretrained WaveRNN vocoder provided by the [WaveRNN repo](https://github.com/fatchord/WaveRNN).

Scientists at the CERN laboratory say they have discovered a new particle.

Synthetic speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ⏩ ForwardTacotron
2 |
3 | Inspired by Microsoft's [FastSpeech](https://www.microsoft.com/en-us/research/blog/fastspeech-new-text-to-speech-model-improves-on-speed-accuracy-and-controllability/)
4 | we modified Tacotron (a fork of fatchord's [WaveRNN](https://github.com/fatchord/WaveRNN)) to generate speech in a single forward pass, using a duration predictor to align text and generated mel spectrograms. Hence, we call the model ForwardTacotron (see Figure 1).
5 |
6 |
7 |
8 |
9 |
10 | Figure 1: Model Architecture.
11 |
12 |
13 | The model has the following advantages:
14 | - **Robustness:** No repeats and failed attention modes for challenging sentences.
15 | - **Speed:** The generation of a mel spectrogram takes about 0.04s on a GeForce RTX 2080.
16 | - **Controllability:** It is possible to control the speed of the generated utterance (see the --alpha flag in the generation examples below).
17 | - **Efficiency:** In contrast to FastSpeech and Tacotron, ForwardTacotron
18 | does not use any attention. Hence, the required memory grows linearly with text size, which makes it possible to synthesize large articles at once.
19 |
20 |
21 | ## UPDATE: Improved attention mechanism (30.08.2023)
22 | - Faster Tacotron attention buildup by adding alignment conditioning based on [One TTS Alignment To Rule Them All](https://arxiv.org/abs/2108.10447)
23 | - The improved attention translates to improved synthesis quality.
24 |
25 | ## 🔈 Samples
26 |
27 | [Can be found here.](https://as-ideas.github.io/ForwardTacotron/)
28 |
29 | The samples are generated with a model trained on LJSpeech and vocoded with WaveRNN, [MelGAN](https://github.com/seungwonpark/melgan), or [HiFiGAN](https://github.com/jik876/hifi-gan).
30 | You can try out the latest pretrained model with the following notebook:
31 |
32 | [Open in Colab](https://colab.research.google.com/github/as-ideas/ForwardTacotron/blob/master/notebooks/synthesize.ipynb)
33 |
34 | ## ⚙️ Installation
35 |
36 | Make sure you have:
37 |
38 | * Python >= 3.6
39 |
40 | Install espeak as phonemizer backend (for macOS use brew):
41 | ```
42 | sudo apt-get install espeak
43 | ```
44 |
45 | Then install the rest with pip:
46 | ```
47 | pip install -r requirements.txt
48 | ```
49 |
50 | ## 🚀 Training your own Model (Singlespeaker)
51 |
52 | Change the params in configs/singlespeaker.yaml according to your needs and follow the steps below:
53 |
54 | (1) Download and preprocess the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset:
55 | ```
56 | python preprocess.py --path /path/to/ljspeech
57 | ```
58 | (2) Train Tacotron with:
59 | ```
60 | python train_tacotron.py
61 | ```
62 | Once the training is finished, the model will automatically extract the alignment features from the dataset. In case you stopped the training early, you
63 | can use the latest checkpoint to manually run the process with:
64 | ```
65 | python train_tacotron.py --force_align
66 | ```
67 | (3) Train ForwardTacotron with:
68 | ```
69 | python train_forward.py
70 | ```
71 | (4) Generate Sentences with Griffin-Lim vocoder:
72 | ```
73 | python gen_forward.py --alpha 1 --input_text 'this is whatever you want it to be' griffinlim
74 | ```
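The --alpha flag controls the speed of the generated utterance by scaling the predicted phoneme durations (see the Controllability note above); a quick sketch with two assumed values, since whether values above 1 slow the output down or speed it up is best verified in gen_forward.py:
```
python gen_forward.py --alpha 0.8 --input_text 'this is whatever you want it to be' griffinlim
python gen_forward.py --alpha 1.2 --input_text 'this is whatever you want it to be' griffinlim
```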
75 | If you want to use the [MelGAN](https://github.com/seungwonpark/melgan) vocoder, you can produce .mel files with:
76 | ```
77 | python gen_forward.py --input_text 'this is whatever you want it to be' melgan
78 | ```
79 | If you want to use the [HiFiGAN](https://github.com/jik876/hifi-gan) vocoder, you can produce .npy files with:
80 | ```
81 | python gen_forward.py --input_text 'this is whatever you want it to be' hifigan
82 | ```
83 | To vocode the resulting .mel or .npy files, use the inference script from the MelGAN or HiFiGAN repo and point it to the model output folder.
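For example, a rough sketch of vocoding the .npy files with HiFi-GAN (script name, flags, and paths are assumptions based on the HiFiGAN repo; check its README for the exact interface):
```
git clone https://github.com/jik876/hifi-gan
cd hifi-gan
python inference_e2e.py --checkpoint_file /path/to/hifigan_generator --input_mels_dir /path/to/forward_output --output_dir /path/to/wavs
```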
84 |
85 | To train the model on your own dataset, just bring it into the LJSpeech-like format:
86 | ```
87 | |- dataset_folder/
88 | | |- metadata.csv
89 | | |- wav/
90 | | |- file1.wav
91 | | |- ...
92 | ```
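For reference, each line of metadata.csv in this format pairs a wav file id with its transcript, separated by a pipe (a minimal sketch; the original LJSpeech metadata additionally carries a normalized-text column):
```
file1|This is the first text.
file2|This is the second text.
```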
93 |
94 | For languages other than English, change the language and cleaner_name params in the config (see the preprocessing section of configs/singlespeaker.yaml), e.g. for French:
95 | ```
96 | language: 'fr'
97 | cleaner_name: 'no_cleaners'
98 | ```
99 |
100 | ____
101 | You can monitor the training processes for Tacotron and ForwardTacotron with
102 | ```
103 | tensorboard --logdir checkpoints
104 | ```
105 | Here is what the ForwardTacotron tensorboard looks like:
106 |
107 |
108 |
109 |
110 | Figure 2: Tensorboard example for training a ForwardTacotron model.
111 |
112 |
113 |
114 | ## Multispeaker Training
115 | Prepare the data in LJSpeech format:
116 | ```
117 | |- dataset_folder/
118 | | |- metadata.csv
119 | | |- wav/
120 | | |- file1.wav
121 | | |- ...
122 | ```
123 | The metadata.csv is expected to have the speaker id in the second column:
124 | ```
125 | id_001|speaker_1|this is the first text.
126 | id_002|speaker_1|this is the second text.
127 | id_003|speaker_2|this is the third text.
128 | ...
129 | ```
130 | We also support the VCTK and pandas formats
131 | (set in configs/multispeaker.yaml under preprocessing.metafile_format).
132 |
133 | Follow the same steps as for singlespeaker, but provide the multispeaker config:
134 | ```
135 | python preprocess.py --config configs/multispeaker.yaml --path /path/to/ljspeech
136 | python train_tacotron.py --config configs/multispeaker.yaml
137 | python train_forward.py --config configs/multispeaker.yaml
138 | ```
139 |
140 | ## Pretrained Models
141 |
142 | | Model | Dataset | Commit Tag |
143 | |---|---|------------|
144 | |[forward_tacotron](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/ForwardTacotron/forward_step90k.pt)| ljspeech | v3.1 |
145 | |[fastpitch](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/ForwardTacotron/thorsten_fastpitch_50k.pt)| [thorstenmueller (german)](https://github.com/thorstenMueller/deep-learning-german-tts) | v3.1 |
146 |
147 | Our pre-trained LJSpeech model is compatible with the pre-trained vocoders:
148 | - [MelGAN](https://github.com/seungwonpark/melgan)
149 | - [HiFiGAN](https://github.com/jik876/hifi-gan)
150 |
151 |
152 | After downloading the models, you can synthesize text with
153 | ```
154 | python gen_forward.py --input_text 'Hi there!' --checkpoint forward_step90k.pt wavernn --voc_checkpoint wave_step_575k.pt
155 |
156 | ```
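To target one of the compatible neural vocoders instead, the same checkpoint can be combined with the melgan (or hifigan) option shown earlier, e.g. (a sketch assembled from the flags above):
```
python gen_forward.py --input_text 'Hi there!' --checkpoint forward_step90k.pt melgan
```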
157 |
158 | ## Export Model with TorchScript
159 |
160 | Here is a dummy example of exporting the model in TorchScript:
161 | ```
162 | import torch
163 | from models.forward_tacotron import ForwardTacotron
164 |
165 | tts_model = ForwardTacotron.from_checkpoint('checkpoints/ljspeech_tts.forward/latest_model.pt')
166 | tts_model.eval()
167 | model_script = torch.jit.script(tts_model)
168 | x = torch.ones((1, 5)).long()
169 | y = model_script.generate_jit(x)
170 | ```
171 | For the necessary preprocessing steps (text to tokens) please refer to:
172 | ```
173 | gen_forward.py
174 | ```
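A minimal sketch of what that preprocessing might look like, assuming the Tokenizer from utils.text.tokenizer maps already-phonemized text to integer ids (as it does in the duration extraction pipeline); the exact cleaning and phonemization steps live in gen_forward.py:
```
import torch
from utils.text.tokenizer import Tokenizer

tokenizer = Tokenizer()
tokens = tokenizer('həˈloʊ wɜːld')            # assumption: the input text is already phonemized
x = torch.tensor(tokens).unsqueeze(0).long()  # shape [1, T], as expected by generate_jit
y = model_script.generate_jit(x)              # model_script from the export example above
```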
175 |
176 | ## References
177 |
178 | * [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263)
179 | * [FastPitch: Parallel Text-to-speech with Pitch Prediction](https://arxiv.org/abs/2006.06873)
180 | * [HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)
181 | * [MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis](https://arxiv.org/abs/1910.06711)
182 | * [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/abs/1806.04558)
183 |
184 | ## Acknowledgements
185 |
186 | * [https://github.com/keithito/tacotron](https://github.com/keithito/tacotron)
187 | * [https://github.com/fatchord/WaveRNN](https://github.com/fatchord/WaveRNN)
188 | * [https://github.com/seungwonpark/melgan](https://github.com/seungwonpark/melgan)
189 | * [https://github.com/jik876/hifi-gan](https://github.com/jik876/hifi-gan)
190 | * [https://github.com/xcmyz/LightSpeech](https://github.com/xcmyz/LightSpeech)
191 | * [https://github.com/resemble-ai/Resemblyzer](https://github.com/resemble-ai/Resemblyzer)
192 | * [https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch)
194 |
195 | ## Maintainers
196 |
197 | * Christian Schäfer, github: [cschaefer26](https://github.com/cschaefer26)
198 |
199 | ## Copyright
200 |
201 | See [LICENSE](LICENSE) for details.
202 |
--------------------------------------------------------------------------------
/duration_extraction/duration_extraction_pipe.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from dataclasses import dataclass
3 | from logging import INFO
4 | from typing import List, Dict, Any, Tuple
5 |
6 | import numpy as np
7 | import torch
8 | from torch.utils.data import DataLoader, Dataset
9 | from tqdm import tqdm
10 |
11 | from duration_extraction.duration_extractor import DurationExtractor
12 | from models.tacotron import Tacotron
13 | from trainer.common import to_device
14 | from utils.dataset import BinnedLengthSampler, get_binned_taco_dataloader, DurationStats
15 | from utils.files import unpickle_binary
16 | from utils.metrics import attention_score
17 | from utils.paths import Paths
18 | from utils.text.tokenizer import Tokenizer
19 |
20 |
21 | @dataclass
22 | class DurationResult:
23 | item_id: str
24 | att_score: float
25 | align_score: float
26 | durations: np.array
27 |
28 |
29 | class DurationCollator:
30 |
31 | def __call__(self, x: List[DurationResult]) -> DurationResult:
32 | if len(x) > 1:
33 | raise ValueError(f'Batch size must be 1! Found batch size: {len(x)}')
34 | return x[0]
35 |
36 |
37 | class DurationExtractionDataset(Dataset):
38 |
39 | def __init__(self,
40 | duration_extractor: DurationExtractor,
41 | paths: Paths,
42 | dataset_ids: List[str],
43 | text_dict: Dict[str, str],
44 | tokenizer: Tokenizer):
45 | self.metadata = dataset_ids
46 | self.text_dict = text_dict
47 | self.tokenizer = tokenizer
48 | self.text_dict = text_dict
49 | self.duration_extractor = duration_extractor
50 | self.paths = paths
51 |
52 | def __getitem__(self, index: int) -> DurationResult:
53 | item_id = self.metadata[index]
54 | x = self.text_dict[item_id]
55 | x = self.tokenizer(x)
56 | mel = np.load(self.paths.mel / f'{item_id}.npy')
57 | mel = torch.from_numpy(mel)
58 | x = torch.tensor(x)
59 | attention_npy = np.load(str(self.paths.att_pred / f'{item_id}.npy'))
60 | attention = torch.from_numpy(attention_npy)
61 | mel_len = mel.shape[-1]
62 | mel_len = torch.tensor(mel_len).unsqueeze(0)
63 | align_score, _ = attention_score(attention.unsqueeze(0), mel_len, r=1)
64 | align_score = float(align_score)
65 | durations, att_score = self.duration_extractor(x=x, mel=mel, attention=attention)
66 | att_score = float(att_score)
67 | durations_npy = durations.cpu().numpy()
68 | if np.sum(durations_npy) != mel_len:
69 |                 print(f'WARNING: Sum of durations did not match mel length for item {item_id}!')
70 | return DurationResult(item_id=item_id, att_score=att_score,
71 | align_score=align_score, durations=durations_npy)
72 |
73 | def __len__(self):
74 | return len(self.metadata)
75 |
76 |
77 | class DurationExtractionPipeline:
78 |
79 | def __init__(self,
80 | paths: Paths,
81 | config: Dict[str, Any],
82 | duration_extractor: DurationExtractor) -> None:
83 | self.paths = paths
84 | self.config = config
85 | self.duration_extractor = duration_extractor
86 | self.logger = logging.Logger(__name__, level=INFO)
87 |
88 | def extract_attentions(self,
89 | model: Tacotron,
90 | max_batch_size: int = 1) -> float:
91 | """
92 | Performs tacotron inference and stores the attention matrices as npy arrays in paths.data.att_pred.
93 | Returns average attention score.
94 |
95 | Args:
96 | model: Tacotron model to use for attention extraction.
97 |             max_batch_size: Maximum batch size to use for tacotron inference.
98 |
99 | Returns: Mean attention score. The attention matrices are saved as numpy arrays in paths.att_pred.
100 |
101 | """
102 |
103 | device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
104 | model.to(device)
105 |
106 | dataloader = get_binned_taco_dataloader(paths=self.paths, max_batch_size=max_batch_size)
107 |
108 | sum_items = 0
109 | sum_att_score = 0
110 | pbar = tqdm(dataloader, total=len(dataloader), smoothing=0.01)
111 | for i, batch in enumerate(pbar, 1):
112 | batch = to_device(batch, device=device)
113 | with torch.no_grad():
114 | out = model(batch)
115 | attention_batch = out['att']
116 | _, att_score = attention_score(attention_batch, batch['mel_len'], r=1)
117 | sum_att_score += att_score.sum()
118 | B = batch['x_len'].size(0)
119 | sum_items += B
120 | for b in range(B):
121 | x_len = batch['x_len'][b].cpu()
122 | mel_len = batch['mel_len'][b].cpu()
123 | item_id = batch['item_id'][b]
124 | attention = attention_batch[b, :mel_len, :x_len].cpu()
125 | np.save(self.paths.att_pred / f'{item_id}.npy', attention.numpy(), allow_pickle=False)
126 | pbar.set_description(f'Avg attention score: {sum_att_score / sum_items}', refresh=True)
127 |
128 |         return sum_att_score / sum_items
129 |
130 | def extract_durations(self,
131 | num_workers: int = 0,
132 | sampler_bin_size: int = 1) -> Dict[str, DurationStats]:
133 | """
134 | Extracts durations from saved attention matrices.
135 |
136 | Args:
137 | num_workers: Number of workers for multiprocessing.
138 | sampler_bin_size: Bin size of BinnedLengthSampler.
139 | Should be greater than one (but much less than length of dataset) for optimal performance.
140 |
141 | Returns: Dictionary containing the attention scores for each item id.
142 | The durations are saved as numpy arrays in paths.alg.
143 | """
144 |
145 | train_set = unpickle_binary(self.paths.train_dataset)
146 | val_set = unpickle_binary(self.paths.val_dataset)
147 | text_dict = unpickle_binary(self.paths.text_dict)
148 | dataset = train_set + val_set
149 | dataset = [(file_id, mel_len) for file_id, mel_len in dataset
150 | if (self.paths.att_pred / f'{file_id}.npy').is_file()]
151 | len_orig = len(dataset)
152 | data_ids, mel_lens = list(zip(*dataset))
153 | self.logger.info(f'Found {len(data_ids)} / {len_orig} '
154 | f'alignment files in {self.paths.att_pred}')
155 |
156 | duration_stats = {}
157 | sum_att_score = 0
158 |
159 | dataset = DurationExtractionDataset(
160 | duration_extractor=self.duration_extractor,
161 | paths=self.paths, dataset_ids=data_ids,
162 | text_dict=text_dict, tokenizer=Tokenizer())
163 |
164 | dataset = DataLoader(dataset=dataset,
165 | batch_size=1,
166 | shuffle=False,
167 | pin_memory=False,
168 | collate_fn=DurationCollator(),
169 | sampler=BinnedLengthSampler(lengths=mel_lens, batch_size=1, bin_size=sampler_bin_size),
170 | num_workers=num_workers)
171 |
172 | pbar = tqdm(dataset, total=len(dataset), smoothing=0.01)
173 |
174 | for i, res in enumerate(pbar, 1):
175 | sum_att_score += res.att_score
176 | pbar.set_description(f'Avg duration attention score: {sum_att_score / i}', refresh=True)
177 | max_consecutive_ones = self._get_max_consecutive_ones(res.durations)
178 | max_duration = np.max(res.durations)
179 | duration_stats[res.item_id] = DurationStats(att_align_score=res.align_score,
180 | att_sharpness_score=res.att_score,
181 | max_consecutive_ones=max_consecutive_ones,
182 | max_duration=max_duration)
183 | np.save(self.paths.alg / f'{res.item_id}.npy', res.durations.astype(int), allow_pickle=False)
184 |
185 | return duration_stats
186 |
187 | @staticmethod
188 | def _get_max_consecutive_ones(durations: np.array) -> int:
189 | max_count = 0
190 | count = 0
191 | for d in durations:
192 | if d == 1:
193 | count += 1
194 | else:
195 | max_count = max(max_count, count)
196 | count = 0
197 | return max(max_count, count)
198 |
--------------------------------------------------------------------------------
/utils/dsp.py:
--------------------------------------------------------------------------------
1 | import struct
2 | from pathlib import Path
3 | from typing import Dict, Any, Union, List
4 | import numpy as np
5 | import librosa
6 | import torch
7 | import webrtcvad
8 | from scipy.ndimage import binary_dilation
9 | import torchaudio
10 | import torchaudio.transforms as transforms
11 |
12 | from utils.dataset import tensor_to_ndarray, ndarray_to_tensor
13 |
14 |
15 | class DSP:
16 |
17 | def __init__(self,
18 | num_mels: int,
19 | sample_rate: int,
20 | hop_length: int,
21 | win_length: int,
22 | n_fft: int,
23 | fmin: float,
24 | fmax: float,
25 | peak_norm: bool,
26 | trim_start_end_silence: bool,
27 | trim_silence_top_db: int,
28 | trim_long_silences: bool,
29 | vad_sample_rate: int,
30 | vad_window_length: float,
31 | vad_moving_average_width: float,
32 | vad_max_silence_length: int,
33 | **kwargs, # for backward compatibility
34 | ) -> None:
35 |
36 | self.n_mels = num_mels
37 | self.sample_rate = sample_rate
38 | self.hop_length = hop_length
39 | self.win_length = win_length
40 | self.n_fft = n_fft
41 | self.fmin = fmin
42 | self.fmax = fmax
43 |
44 | self.should_peak_norm = peak_norm
45 | self.should_trim_start_end_silence = trim_start_end_silence
46 | self.should_trim_long_silences = trim_long_silences
47 | self.trim_silence_top_db = trim_silence_top_db
48 |
49 | self.vad_sample_rate = vad_sample_rate
50 | self.vad_window_length = vad_window_length
51 | self.vad_moving_average_width = vad_moving_average_width
52 | self.vad_max_silence_length = vad_max_silence_length
53 |
54 | self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
55 |
56 | # init transformation
57 | self.mel_transform = self._init_mel_transform()
58 |
59 | @classmethod
60 | def from_config(cls, config: Dict[str, Any]) -> 'DSP':
61 | """Initialize from configuration object"""
62 | return DSP(**config['dsp'])
63 |
64 | def _init_mel_transform(self):
65 | """Initialize mel transformation"""
66 | mel_transform = transforms.MelSpectrogram(
67 | sample_rate=self.sample_rate,
68 | n_fft=self.n_fft,
69 | win_length=self.win_length,
70 | hop_length=self.hop_length,
71 | power=1,
72 | norm="slaney",
73 | n_mels=self.n_mels,
74 | mel_scale="slaney",
75 | f_min=self.fmin,
76 | f_max=self.fmax,
77 | ).to(self.device)
78 |
79 | return mel_transform
80 |
81 | def load_wav(self, path: Union[str, Path], mono: bool = True) -> torch.Tensor:
82 | """Load audio file into a tensor"""
83 | effects = []
84 | metadata = torchaudio.info(path)
85 |
86 | # merge channels if source is multichannel
87 | if mono and metadata.num_channels > 1:
88 | effects.extend([
89 | ["remix", "-"] # convert to mono
90 | ])
91 |
92 | # resample if source sample rate is different from desired sample rate
93 | if metadata.sample_rate != self.sample_rate:
94 | effects.extend([
95 | ["rate", f'{self.sample_rate}'],
96 | ])
97 |
98 | waveform, _ = torchaudio.sox_effects.apply_effects_file(path, effects=effects)
99 | return waveform
100 |
101 | def save_wav(self, waveform: torch.Tensor, path: Union[str, Path]) -> None:
102 | """Save waveform to file"""
103 | torchaudio.save(filepath=path, src=waveform, sample_rate=self.sample_rate)
104 |
105 | def adjust_volume(self, waveform: torch.Tensor, target_dbfs: int = -30) -> torch.Tensor:
106 | """Adjust volume of the waveform"""
107 | volume_transform = transforms.Vol(gain=target_dbfs, gain_type='db').to(self.device)
108 | return volume_transform(waveform)
109 |
110 | def adjust_volume_batched(self, data: List[torch.Tensor], target_dbfs: int = -30) -> List[torch.Tensor]:
111 | """Adjust volume of the waveforms in the batch"""
112 | lengths = [tensor.size(1) for tensor in data]
113 | padded_batch = [torch.nn.functional.pad(x, (0, max(lengths) - x.size(1))) for x in data]
114 | stacked_tensor = torch.stack(padded_batch, dim=0)
115 | processed_batch = self.adjust_volume(stacked_tensor, target_dbfs=target_dbfs)
116 | result = [processed_waveform[:, :lengths[index]] for index, processed_waveform in enumerate(processed_batch)]
117 | return result
118 |
119 | def waveform_to_mel_batched(self, batch: List[torch.Tensor]) -> List[torch.Tensor]:
120 | """Convert waveform to mel spectrogram for the batch of waveforms"""
121 | lengths = [tensor.size(1) for tensor in batch]
122 | expected_mel_lengths = [x // self.hop_length + 1 for x in lengths]
123 | padded_batch = [torch.nn.functional.pad(x, (0, max(lengths) - x.size(1))) for x in batch]
124 | batch_tensor = torch.stack(padded_batch, dim=0).to(self.device)
125 | mels = self.waveform_to_mel(batch_tensor)
126 | list_of_mels = [mel[:, :, :expected_mel_lengths[index]] for index, mel in enumerate(mels)]
127 | return list_of_mels
128 |
129 | def waveform_to_mel(self, waveform: torch.Tensor, normalized: bool = True) -> torch.Tensor:
130 | """Convert waveform to mel spectrogram"""
131 | mel_spec = self.mel_transform(waveform)
132 | if normalized:
133 | mel_spec = self.normalize(mel_spec)
134 | return mel_spec
135 |
136 | def griffinlim(self, mel: np.array, n_iter: int = 32) -> np.array:
137 | mel = self.denormalize(mel)
138 | S = librosa.feature.inverse.mel_to_stft(
139 | mel,
140 | power=1,
141 | sr=self.sample_rate,
142 | n_fft=self.n_fft,
143 | fmin=self.fmin,
144 | fmax=self.fmax)
145 | wav = librosa.core.griffinlim(
146 | S,
147 | n_iter=n_iter,
148 | hop_length=self.hop_length,
149 | win_length=self.win_length)
150 | return wav
151 |
152 | @staticmethod
153 | def normalize(mel: torch.Tensor) -> torch.Tensor:
154 | """Normalize mel spectrogram"""
155 | mel = torch.clip(mel, min=1.e-5, max=None)
156 | return torch.log(mel)
157 |
158 | @staticmethod
159 | def denormalize(mel: np.ndarray) -> np.ndarray:
160 | """Denormalize mel spectrogram"""
161 | return np.exp(mel)
162 |
163 | def trim_silence(self, waveform: torch.Tensor) -> torch.Tensor:
164 | """Trim silence from the waveform"""
165 | waveform = tensor_to_ndarray(waveform)
166 | trimmed_waveform = librosa.effects.trim(waveform,
167 | top_db=self.trim_silence_top_db,
168 | frame_length=self.win_length,
169 | hop_length=self.hop_length)
170 | return ndarray_to_tensor(trimmed_waveform[0])
171 |
172 | # borrowed from https://github.com/resemble-ai/Resemblyzer/blob/master/resemblyzer/audio.py
173 | def trim_long_silences(self, wav: torch.Tensor) -> torch.Tensor:
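        # Flag voiced windows with WebRTC VAD, smooth the flags with a moving average,
        # dilate the voiced mask to keep short pauses, then drop the remaining silent samples.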
174 | wav = tensor_to_ndarray(wav)
175 | int16_max = (2 ** 15) - 1
176 | samples_per_window = (self.vad_window_length * self.vad_sample_rate) // 1000
177 | wav = wav[:len(wav) - (len(wav) % samples_per_window)]
178 | pcm_wave = struct.pack("%dh" % len(wav), *(np.round(wav * int16_max)).astype(np.int16))
179 | voice_flags = []
180 | vad = webrtcvad.Vad(mode=3)
181 | for window_start in range(0, len(wav), samples_per_window):
182 | window_end = window_start + samples_per_window
183 | voice_flags.append(vad.is_speech(pcm_wave[window_start * 2:window_end * 2],
184 | sample_rate=self.vad_sample_rate))
185 | voice_flags = np.array(voice_flags)
186 | def moving_average(array, width):
187 | array_padded = np.concatenate((np.zeros((width - 1) // 2), array, np.zeros(width // 2)))
188 | ret = np.cumsum(array_padded, dtype=float)
189 | ret[width:] = ret[width:] - ret[:-width]
190 | return ret[width - 1:] / width
191 | audio_mask = moving_average(voice_flags, self.vad_moving_average_width)
192 |         audio_mask = np.round(audio_mask).astype(bool)
193 | audio_mask[:] = binary_dilation(audio_mask[:], np.ones(self.vad_max_silence_length + 1))
194 | audio_mask = np.repeat(audio_mask, samples_per_window)
195 | return ndarray_to_tensor(wav[audio_mask])
196 |
--------------------------------------------------------------------------------
/models/common_layers.py:
--------------------------------------------------------------------------------
1 | import copy
2 | import math
3 | from typing import Optional
4 |
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from torch.nn import LayerNorm, MultiheadAttention
9 | from torch.nn.utils.rnn import pad_sequence
10 |
11 |
12 | class LengthRegulator(nn.Module):
13 |
14 | def __init__(self):
15 | super().__init__()
16 |
17 | def forward(self, x: torch.Tensor, dur: torch.Tensor) -> torch.Tensor:
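        # Expand each encoder timestep by its (rounded) predicted duration so the text-level
        # sequence is stretched to mel-frame length (FastSpeech-style length regulation).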
18 | dur[dur < 0] = 0.
19 | x_expanded = []
20 | for i in range(x.size(0)):
21 | x_exp = torch.repeat_interleave(x[i], (dur[i] + 0.5).long(), dim=0)
22 | x_expanded.append(x_exp)
23 | x_expanded = pad_sequence(x_expanded, padding_value=0., batch_first=True)
24 | return x_expanded
25 |
26 |
27 | class HighwayNetwork(nn.Module):
28 |
29 | def __init__(self, size: int) -> None:
30 | super().__init__()
31 | self.W1 = nn.Linear(size, size)
32 | self.W2 = nn.Linear(size, size)
33 | self.W1.bias.data.fill_(0.)
34 |
35 | def forward(self, x: torch.Tensor) -> torch.Tensor:
36 | x1 = self.W1(x)
37 | x2 = self.W2(x)
38 | g = torch.sigmoid(x2)
39 | y = g * F.relu(x1) + (1. - g) * x
40 | return y
41 |
42 |
43 | class BatchNormConv(nn.Module):
44 |
45 | def __init__(self,
46 | in_channels: int,
47 | out_channels: int,
48 | kernel: int, relu=True) -> None:
49 | super().__init__()
50 | self.conv = nn.Conv1d(in_channels, out_channels, kernel, stride=1, padding=kernel // 2, bias=False)
51 | self.bnorm = nn.BatchNorm1d(out_channels)
52 | self.relu = relu
53 |
54 | def forward(self, x: torch.Tensor) -> torch.Tensor:
55 | x = self.conv(x)
56 | x = F.relu(x) if self.relu is True else x
57 | return self.bnorm(x)
58 |
59 |
60 | class CBHG(nn.Module):
61 |
62 | def __init__(self,
63 | K: int,
64 | in_channels: int,
65 | channels: int,
66 | proj_channels: list,
67 | num_highways: int,
68 | dropout: float = 0.5) -> None:
69 | super().__init__()
70 |
71 | self.dropout = dropout
72 | self.bank_kernels = [i for i in range(1, K + 1)]
73 | self.conv1d_bank = nn.ModuleList()
74 | for k in self.bank_kernels:
75 | conv = BatchNormConv(in_channels, channels, k)
76 | self.conv1d_bank.append(conv)
77 |
78 | self.maxpool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
79 |
80 | self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)
81 | self.conv_project2 = BatchNormConv(proj_channels[0], proj_channels[1], 3, relu=False)
82 |
83 | self.pre_highway = nn.Linear(proj_channels[-1], channels, bias=False)
84 | self.highways = nn.ModuleList()
85 | for i in range(num_highways):
86 | hn = HighwayNetwork(channels)
87 | self.highways.append(hn)
88 |
89 | self.rnn = nn.GRU(channels, channels, batch_first=True, bidirectional=True)
90 |
91 | def forward(self, x: torch.Tensor) -> torch.Tensor:
92 | residual = x
93 | seq_len = x.size(-1)
94 | conv_bank = []
95 |
96 | # Convolution Bank
97 | for conv in self.conv1d_bank:
98 | c = conv(x) # Convolution
99 | conv_bank.append(c[:, :, :seq_len])
100 |
101 | # Stack along the channel axis
102 | conv_bank = torch.cat(conv_bank, dim=1)
103 |
104 | # dump the last padding to fit residual
105 | x = self.maxpool(conv_bank)[:, :, :seq_len]
106 | x = F.dropout(x, p=self.dropout, training=self.training)
107 |
108 | # Conv1d projections
109 | x = self.conv_project1(x)
110 | x = F.dropout(x, p=self.dropout, training=self.training)
111 | x = self.conv_project2(x)
112 |
113 | # Residual Connect
114 | x = x + residual
115 |
116 | # Through the highways
117 | x = x.transpose(1, 2)
118 | x = self.pre_highway(x)
119 | for h in self.highways:
120 | x = h(x)
121 |
122 | # And then the RNN
123 | x, _ = self.rnn(x)
124 | return x
125 |
126 |
127 | class PositionalEncoding(torch.nn.Module):
128 |
129 | def __init__(self, d_model: int, dropout=0.1, max_len=5000) -> None:
130 | super(PositionalEncoding, self).__init__()
131 | self.dropout = torch.nn.Dropout(p=dropout)
132 | self.scale = torch.nn.Parameter(torch.ones(1))
133 |
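        # Precompute sinusoidal positional encodings: pe[pos, 2i] = sin(pos / 10000^(2i/d_model)),
        # pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model)), as in the standard Transformer.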
134 | pe = torch.zeros(max_len, d_model)
135 | position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
136 | div_term = torch.exp(torch.arange(
137 | 0, d_model, 2).float() * (-math.log(10000.0) / d_model))
138 | pe[:, 0::2] = torch.sin(position * div_term)
139 | pe[:, 1::2] = torch.cos(position * div_term)
140 | pe = pe.unsqueeze(0).transpose(0, 1)
141 | self.register_buffer('pe', pe)
142 |
143 | def forward(self, x: torch.Tensor) -> torch.Tensor: # shape: [T, N]
144 | x = x + self.scale * self.pe[:x.size(0), :]
145 | return self.dropout(x)
146 |
147 |
148 | class FFTBlock(nn.Module):
149 |
150 | def __init__(self,
151 | d_model: int,
152 | nhead: int,
153 | conv1_kernel: int,
154 | conv2_kernel: int,
155 | d_fft: int,
156 | dropout: float = 0.1):
157 | super(FFTBlock, self).__init__()
158 | self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
159 | self.dropout = nn.Dropout(dropout)
160 | self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_fft,
161 | kernel_size=conv1_kernel, stride=1, padding=conv1_kernel // 2)
162 | self.conv2 = nn.Conv1d(in_channels=d_fft, out_channels=d_model,
163 | kernel_size=conv2_kernel, stride=1, padding=conv2_kernel // 2)
164 | self.norm1 = LayerNorm(d_model)
165 | self.norm2 = LayerNorm(d_model)
166 | self.dropout1 = nn.Dropout(dropout)
167 | self.dropout2 = nn.Dropout(dropout)
168 | self.activation = torch.nn.ReLU()
169 |
170 | def forward(self,
171 | src: torch.Tensor,
172 | src_pad_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
173 | src2 = self.self_attn(src, src, src,
174 | attn_mask=None,
175 | key_padding_mask=src_pad_mask)[0]
176 | src = src + self.dropout1(src2)
177 | src = self.norm1(src)
178 | src = src.transpose(0, 1).transpose(1, 2)
179 | src2 = self.conv1(src)
180 | src2 = self.activation(src2)
181 | src2 = self.conv2(src2)
182 | src = src + self.dropout2(src2)
183 | src = src.transpose(1, 2).transpose(0, 1)
184 | src = self.norm2(src)
185 | return src
186 |
187 |
188 | class ForwardTransformer(torch.nn.Module):
189 |
190 | def __init__(self,
191 | d_model: int,
192 | d_fft: int,
193 | layers: int,
194 | heads: int,
195 | conv1_kernel: int,
196 | conv2_kernel: int,
197 | dropout: float = 0.1,
198 | ) -> None:
199 | super().__init__()
200 |
201 | self.d_model = d_model
202 | self.pos_encoder = PositionalEncoding(d_model, dropout)
203 | encoder_layer = FFTBlock(d_model=d_model,
204 | nhead=heads,
205 | d_fft=d_fft,
206 | conv1_kernel=conv1_kernel,
207 | conv2_kernel=conv2_kernel,
208 | dropout=dropout)
209 | encoder_norm = LayerNorm(d_model)
210 | self.layers = nn.ModuleList([copy.deepcopy(encoder_layer)
211 | for _ in range(layers)])
212 | self.norm = encoder_norm
213 |
214 | def forward(self,
215 | x: torch.Tensor,
216 | src_pad_mask: Optional[torch.Tensor] = None) -> torch.Tensor: # shape: [N, T]
217 | x = x.transpose(0, 1) # shape: [T, N]
218 | x = self.pos_encoder(x)
219 | for layer in self.layers:
220 | x = layer(x, src_pad_mask=src_pad_mask)
221 | x = self.norm(x)
222 | x = x.transpose(0, 1)
223 | return x
224 |
225 |
226 | def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
227 | mask = torch.triu(torch.ones(sz, sz), 1)
228 | mask = mask.masked_fill(mask == 1, float('-inf'))
229 | return mask
230 |
231 |
232 | def make_token_len_mask(x: torch.Tensor) -> torch.Tensor:
233 | return (x == 0).transpose(0, 1)
234 |
235 |
236 | def make_mel_len_mask(x: torch.Tensor, mel_lens: torch.Tensor) -> torch.Tensor:
237 | len_mask = torch.zeros((x.size(0), x.size(1))).bool().to(x.device)
238 | for i, mel_len in enumerate(mel_lens):
239 | len_mask[i, mel_len:] = True
240 | return len_mask
241 |
--------------------------------------------------------------------------------
/trainer/taco_trainer.py:
--------------------------------------------------------------------------------
1 | import math
2 | import time
3 |
4 | import torch
5 | import torch.nn.functional as F
6 | from torch.optim.optimizer import Optimizer
7 | from torch.utils.data import DataLoader
8 | from torch.utils.tensorboard import SummaryWriter
9 | from typing import Tuple, Dict, Any
10 |
11 | from models.tacotron import Tacotron
12 | from trainer.common import Averager, TTSSession, to_device, np_now, ForwardSumLoss, new_guided_attention_matrix
13 | from utils.checkpoints import save_checkpoint
14 | from utils.dataset import get_taco_dataloaders
15 | from utils.decorators import ignore_exception
16 | from utils.display import stream, simple_table, plot_mel, plot_attention
17 | from utils.dsp import DSP
18 | from utils.files import parse_schedule
19 | from utils.metrics import attention_score
20 | from utils.paths import Paths
21 |
22 |
23 | class TacoTrainer:
24 |
25 | def __init__(self,
26 | paths: Paths,
27 | dsp: DSP,
28 | config: Dict[str, Any]) -> None:
29 | self.paths = paths
30 | self.dsp = dsp
31 | self.config = config
32 | self.train_cfg = config['tacotron']['training']
33 | self.writer = SummaryWriter(log_dir=paths.taco_log, comment='v1')
34 | self.forward_loss = ForwardSumLoss()
35 |
36 | def train(self,
37 | model: Tacotron,
38 | optimizer: Optimizer) -> None:
39 | tts_schedule = self.train_cfg['schedule']
40 | tts_schedule = parse_schedule(tts_schedule)
41 | for i, session_params in enumerate(tts_schedule, 1):
42 | r, lr, max_step, bs = session_params
43 | if model.get_step() < max_step:
44 | train_set, val_set = get_taco_dataloaders(
45 | paths=self.paths, batch_size=bs, r=r,
46 | **self.train_cfg['filter']
47 | )
48 | session = TTSSession(
49 | index=i, r=r, lr=lr, max_step=max_step,
50 | bs=bs, train_set=train_set, val_set=val_set)
51 | self.train_session(model, optimizer, session=session)
52 |
53 | def train_session(self, model: Tacotron,
54 | optimizer: Optimizer,
55 | session: TTSSession) -> None:
56 | current_step = model.get_step()
57 | training_steps = session.max_step - current_step
58 | total_iters = len(session.train_set)
59 | epochs = training_steps // total_iters + 1
60 | model.r = session.r
61 | simple_table([(f'Steps with r={session.r}', str(training_steps // 1000) + 'k Steps'),
62 | ('Batch Size', session.bs),
63 | ('Learning Rate', session.lr),
64 | ('Outputs/Step (r)', model.r)])
65 | for g in optimizer.param_groups:
66 | g['lr'] = session.lr
67 |
68 | loss_avg = Averager()
69 | duration_avg = Averager()
70 | device = next(model.parameters()).device # use same device as model parameters
71 | for e in range(1, epochs + 1):
72 | for i, batch in enumerate(session.train_set, 1):
73 | batch = to_device(batch, device=device)
74 | start = time.time()
75 | model.train()
76 |
77 | out = model(batch)
78 | m1_hat, m2_hat, attention, att_aligner = out['mel'], out['mel_post'], out['att'], out['att_aligner']
79 | ctc_loss = self.forward_loss(att_aligner, text_lens=batch['x_len'], mel_lens=batch['mel_len'])
80 |
81 | m1_loss = F.l1_loss(m1_hat, batch['mel'])
82 | m2_loss = F.l1_loss(m2_hat, batch['mel'])
83 |
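                # Guided attention: penalize attention mass outside the diagonal band defined by
                # new_guided_attention_matrix (the config's dia_loss_matrix_g controls the band width).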
84 | dia_mat = new_guided_attention_matrix(attention=attention,
85 | g=self.train_cfg['dia_loss_matrix_g'])
86 | dia_loss = ((1 - dia_mat) * attention).mean()
87 |
88 | mel_loss = m1_loss + m2_loss
89 | loss = mel_loss + self.train_cfg['ctc_loss_factor'] * ctc_loss \
90 | + self.train_cfg['dia_loss_factor'] * dia_loss
91 |
92 | optimizer.zero_grad()
93 | loss.backward()
94 | torch.nn.utils.clip_grad_norm_(model.parameters(),
95 | self.train_cfg['clip_grad_norm'])
96 | optimizer.step()
97 | loss_avg.add(loss.item())
98 | step = model.get_step()
99 | k = step // 1000
100 |
101 | duration_avg.add(time.time() - start)
102 | speed = 1. / duration_avg.get()
103 | msg = f'| Epoch: {e}/{epochs} ({i}/{total_iters}) | Loss: {loss_avg.get():#.4} ' \
104 | f'| {speed:#.2} steps/s | Step: {k}k | '
105 |
106 | if step % self.train_cfg['checkpoint_every'] == 0:
107 | save_checkpoint(model=model, optim=optimizer, config=self.config,
108 | path=self.paths.taco_checkpoints / f'taco_step{k}k.pt')
109 |
110 | if step % self.train_cfg['plot_every'] == 0:
111 | self.generate_plots(model, session)
112 |
113 | _, att_score = attention_score(attention, batch['mel_len'])
114 | att_score = torch.mean(att_score)
115 | self.writer.add_scalar('Attention_Score/train', att_score, model.get_step())
116 | self.writer.add_scalar('Mel_Loss/train', mel_loss, model.get_step())
117 | self.writer.add_scalar('CTC_Loss/train', ctc_loss, model.get_step())
118 | self.writer.add_scalar('Dia_Loss/train', dia_loss, model.get_step())
119 | self.writer.add_scalar('Params/reduction_factor', session.r, model.get_step())
120 | self.writer.add_scalar('Params/batch_size', session.bs, model.get_step())
121 | self.writer.add_scalar('Params/learning_rate', session.lr, model.get_step())
122 |
123 | stream(msg)
124 |
125 | val_loss, val_att_score = self.evaluate(model, session.val_set)
126 | self.writer.add_scalar('Mel_Loss/val', val_loss, model.get_step())
127 | self.writer.add_scalar('Attention_Score/val', val_att_score, model.get_step())
128 | save_checkpoint(model=model, optim=optimizer, config=self.config,
129 | path=self.paths.taco_checkpoints / 'latest_model.pt')
130 |
131 | loss_avg.reset()
132 | duration_avg.reset()
133 | print(' ')
134 |
135 | def evaluate(self, model: Tacotron, val_set: DataLoader) -> Tuple[float, float]:
136 | model.eval()
137 | model.decoder.prenet.train()
138 | val_loss = 0
139 | val_att_score = 0
140 | device = next(model.parameters()).device
141 | for i, batch in enumerate(val_set, 1):
142 | batch = to_device(batch, device=device)
143 | with torch.no_grad():
144 | out = model(batch)
145 | m1_hat, m2_hat, attention = out['mel'], out['mel_post'], out['att']
146 | m1_loss = F.l1_loss(m1_hat, batch['mel'])
147 | m2_loss = F.l1_loss(m2_hat, batch['mel'])
148 | val_loss += m1_loss.item() + m2_loss.item()
149 | _, att_score = attention_score(attention, batch['mel_len'])
150 | val_att_score += torch.mean(att_score).item()
151 |
152 | return val_loss / len(val_set), val_att_score / len(val_set)
153 |
154 | @ignore_exception
155 | def generate_plots(self, model: Tacotron, session: TTSSession) -> None:
156 | model.eval()
157 | device = next(model.parameters()).device
158 | batch = session.val_sample
159 | batch = to_device(batch, device=device)
160 | with torch.no_grad():
161 | out = model(batch)
162 | m1_hat, m2_hat, att, att_aligner = out['mel'], out['mel_post'], out['att'], out['att_aligner']
163 | att = np_now(att)[0]
164 | att_aligner = np_now(att_aligner.softmax(-1))[0]
165 | m1_hat = np_now(m1_hat)[0, :, :]
166 | m2_hat = np_now(m2_hat)[0, :, :]
167 | m_target = np_now(batch['mel'])[0, :, :]
168 | speaker = batch['speaker_name'][0]
169 |
170 | att_fig = plot_attention(att)
171 | att_aligner_fig = plot_attention(att_aligner)
172 |
173 | m1_hat_fig = plot_mel(m1_hat)
174 | m2_hat_fig = plot_mel(m2_hat)
175 | m_target_fig = plot_mel(m_target)
176 |
177 | self.writer.add_figure(f'Ground_Truth_Aligned/attention/{speaker}', att_fig, model.step)
178 | self.writer.add_figure(f'Ground_Truth_Aligned/attention_aligner/{speaker}', att_aligner_fig, model.step)
179 | self.writer.add_figure(f'Ground_Truth_Aligned/target/{speaker}', m_target_fig, model.step)
180 | self.writer.add_figure(f'Ground_Truth_Aligned/linear/{speaker}', m1_hat_fig, model.step)
181 | self.writer.add_figure(f'Ground_Truth_Aligned/postnet/{speaker}', m2_hat_fig, model.step)
182 |
183 | m2_hat_wav = self.dsp.griffinlim(m2_hat)
184 | target_wav = self.dsp.griffinlim(m_target)
185 |
186 | self.writer.add_audio(
187 | tag=f'Ground_Truth_Aligned/target_wav/{speaker}', snd_tensor=target_wav,
188 | global_step=model.step, sample_rate=self.dsp.sample_rate)
189 | self.writer.add_audio(
190 | tag=f'Ground_Truth_Aligned/postnet_wav/{speaker}', snd_tensor=m2_hat_wav,
191 | global_step=model.step, sample_rate=self.dsp.sample_rate)
--------------------------------------------------------------------------------
/configs/singlespeaker.yaml:
--------------------------------------------------------------------------------
1 |
2 | tts_model_id: 'ljspeech_tts'
3 | data_path: 'data' # output data path
4 |
5 | tts_model: 'forward_tacotron' # choices: [forward_tacotron, fast_pitch]
6 |
7 |
8 | dsp:
9 |
10 | sample_rate: 22050
11 | n_fft: 1024
12 | num_mels: 80
13 | hop_length: 256
14 | win_length: 1024
15 | fmin: 0
16 | fmax: 8000
17 | target_dBFS: -30 # Target loudness in decibels, used for normalization
18 | peak_norm: False # Normalise to the peak of each wav file
19 | trim_start_end_silence: True # Whether to trim leading and trailing silence
20 |   trim_silence_top_db: 60            # Threshold in decibels below reference to consider silence for trimming
21 | # start and end silences with librosa (no trimming if really high)
22 |
23 | trim_long_silences: False # Whether to reduce long silence using WebRTC Voice Activity Detector
24 | vad_window_length: 30 # In milliseconds
25 | vad_moving_average_width: 8
26 | vad_max_silence_length: 12
27 | vad_sample_rate: 16000
28 |
29 |
30 | preprocessing:
31 |
32 | metafile_format: 'ljspeech' # not to be changed, we use the simplest format for singlespeaker models
33 | audio_format: '.wav' # extension for audio files (e.g. .wav or .flac)
34 | seed: 42
35 | n_val: 200
36 | language: 'en-us'
37 | cleaner_name: 'english_cleaners' # choices: ['english_cleaners', 'no_cleaners'], expands numbers and abbreviations.
38 | use_phonemes: True # whether to phonemize the text
39 | # if set to False, you have to provide the phonemized text yourself
40 | min_text_len: 2
41 | pitch_min_freq: 30 # Minimum value for pitch frequency to remove outliers (Common pitch range is
42 | # about 60-300)
43 | pitch_max_freq: 600 # Maximum value for pitch frequency to remove outliers (Common pitch range is
44 |                                      # about 60-300)
45 | pitch_extractor: pyworld # choice of pitch extraction library, choices: [librosa, pyworld]
46 | pitch_frame_length: 2048 # Frame length for extracting pitch with librosa
47 |
48 |
49 | duration_extraction:
50 |
51 | silence_threshold: -11 # normalized mel value below which the voice is considered silent
52 | # minimum mel value = -11.512925465 for zeros in the wav array (=log(1e-5),
53 | # where 1e-5 is a cutoff value)
54 | silence_prob_shift: 0.25 # increase probability for silent characters in periods of silence
55 | # for better durations during non voiced periods
56 | max_batch_size: 32 # max allowed for binned dataloader used for tacotron inference
57 | num_workers: 12 # number of processes for costly dijkstra duration extraction
58 |
59 |
60 | tacotron:
61 |
62 | model:
63 | embed_dims: 256
64 | encoder_dims: 128
65 | decoder_dims: 256
66 | postnet_dims: 128
67 | speaker_emb_dim: 0 # dimension of speaker embedding,
68 | # set to 0 for no speaker conditioning, to 256 for speaker conditioning
69 | encoder_k: 16
70 | lstm_dims: 512
71 | postnet_k: 8
72 | num_highways: 4
73 | dropout: 0.5
74 | stop_threshold: -11 # Value below which audio generation ends.
75 |
76 | aligner_hidden_dims: 256 # text-mel aligner hidden dimensions
77 | aligner_out_dims: 32 # text-mel aligner encoding dimensions for text and mel
78 |
79 | training:
80 | schedule:
81 | - 5, 1e-3, 10_000, 32 # progressive training schedule
82 | - 3, 1e-4, 20_000, 16 # (r, lr, step, batch_size)
83 | - 2, 1e-4, 30_000, 8
84 | - 1, 1e-4, 40_000, 8
85 |
86 |     dia_loss_matrix_g: 0.2             # value of g for diagonal matrix (larger g = broader diagonal)
87 | dia_loss_factor: 1.0 # factor for scaling diagonal loss
88 | ctc_loss_factor: 0.1 # factor for scaling aligner CTC loss
89 | clip_grad_norm: 1.0 # clips the gradient norm to prevent explosion - set to None if not needed
90 | checkpoint_every: 10000 # checkpoints the model every x steps
91 | plot_every: 1000 # generates samples and plots every x steps
92 | num_workers: 2 # number of workers for dataloader
93 |
94 | filter:
95 | max_mel_len: 1250 # filter files with mel len larger than given
96 | filter_duration_stats: False # whether to filter according to the duration stats below
97 | min_attention_sharpness: 0.5 # filter files with bad attention sharpness score, if 0 then no filter
98 | min_attention_alignment: 0.95 # filter files with bad attention alignment score, if 0 then no filter
99 | max_duration: 40 # filter files with durations larger than given
100 | max_consecutive_ones: 6 # filter files where durations contain more consecutive ones than given
101 |
102 |
103 | forward_tacotron:
104 |
105 | model:
106 | embed_dims: 256 # embedding dimension for main model
107 | series_embed_dims: 64 # embedding dimension for series predictor
108 |
109 | durpred_conv_dims: 256
110 | durpred_rnn_dims: 64
111 | durpred_dropout: 0.5
112 |
113 | pitch_conv_dims: 256
114 | pitch_rnn_dims: 128
115 | pitch_dropout: 0.5
116 | pitch_strength: 1. # set to 0 if you want no pitch conditioning
117 |
118 | energy_conv_dims: 256
119 | energy_rnn_dims: 64
120 | energy_dropout: 0.5
121 | energy_strength: 1. # set to 0 if you want no energy conditioning
122 |
123 | prenet_dims: 256
124 | prenet_k: 16
125 | prenet_dropout: 0.5
126 | prenet_num_highways: 4
127 |
128 | rnn_dims: 512
129 |
130 | postnet_dims: 256
131 | postnet_k: 8
132 | postnet_num_highways: 4
133 | postnet_dropout: 0.
134 |
135 | training:
136 | schedule:
137 | - 5e-5, 150_000, 32 # progressive training schedule
138 | - 1e-5, 300_000, 32 # lr, step, batch_size
139 | dur_loss_factor: 0.1
140 | pitch_loss_factor: 0.1
141 | energy_loss_factor: 0.1
142 | pitch_zoneout: 0. # zoneout may regularize conditioning on pitch
143 | energy_zoneout: 0. # zoneout may regularize conditioning on energy
144 |
145 | clip_grad_norm: 1.0 # clips the gradient norm to prevent explosion - set to None if not needed
146 | checkpoint_every: 10_000 # checkpoints the model every x steps
147 | plot_every: 1000 # generates samples and plots every x steps
148 |
149 | filter:
150 | max_mel_len: 1250 # filter files with mel len larger than given
151 | filter_duration_stats: True # whether to filter according to the duration stats below
152 | min_attention_sharpness: 0.5 # filter files with bad attention sharpness score, if 0 then no filter
153 | min_attention_alignment: 0.95 # filter files with bad attention alignment score, if 0 then no filter
154 | max_duration: 40 # filter files with durations larger than given
155 | max_consecutive_ones: 6 # filter files where durations contain more consecutive ones than given
156 |
157 | fast_pitch:
158 |
159 | model:
160 | durpred_d_model: 128
161 | durpred_n_heads: 2
162 | durpred_layers: 4
163 | durpred_d_fft: 128
164 | durpred_dropout: 0.5
165 |
166 | pitch_d_model: 128
167 | pitch_n_heads: 2
168 | pitch_layers: 4
169 | pitch_d_fft: 128
170 | pitch_dropout: 0.5
171 | pitch_strength: 1.0
172 |
173 | energy_d_model: 128
174 | energy_n_heads: 2
175 | energy_layers: 4
176 | energy_d_fft: 128
177 | energy_dropout: 0.5
178 | energy_strength: 1.0
179 |
180 | d_model: 256
181 | conv1_kernel: 9
182 | conv2_kernel: 1
183 |
184 | prenet_layers: 4
185 | prenet_heads: 2
186 | prenet_fft: 1024
187 | prenet_dropout: 0.1
188 |
189 | postnet_layers: 4
190 | postnet_heads: 2
191 | postnet_fft: 1024
192 | postnet_dropout: 0.1
193 |
194 |
195 | training:
196 | schedule:
197 | - 1e-5, 5_000, 32 # progressive training schedule
198 | - 5e-5, 100_000, 32 # lr, step, batch_size
199 | - 2e-5, 300_000, 32
200 | dur_loss_factor: 0.1
201 | pitch_loss_factor: 0.1
202 | energy_loss_factor: 0.1
203 | pitch_zoneout: 0. # zoneout may regularize conditioning on pitch
204 | energy_zoneout: 0. # zoneout may regularize conditioning on energy
205 |
206 | max_mel_len: 1250
207 | clip_grad_norm: 1.0 # clips the gradient norm to prevent explosion - set to None if not needed
208 | checkpoint_every: 10_000 # checkpoints the model every x steps
209 | plot_every: 1000
210 |
211 | filter:
212 | max_mel_len: 1250 # filter files with mel len larger than given
213 | filter_duration_stats: True # whether to filter according to the duration stats below
214 | min_attention_sharpness: 0.5 # filter files with bad attention sharpness score, if 0 then no filter
215 | min_attention_alignment: 0.95 # filter files with bad attention alignment score, if 0 then no filter
216 | max_duration: 40 # filter files with durations larger than given
217 | max_consecutive_ones: 6 # filter files where durations contain more consecutive ones than given
218 |
--------------------------------------------------------------------------------
/train_tacotron.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import itertools
3 | from pathlib import Path
4 | from typing import Tuple, Dict, Any
5 |
6 | import torch
7 | from torch import optim
8 | from torch.utils.data.dataloader import DataLoader
9 | from tqdm import tqdm
10 |
11 | from duration_extraction.duration_extraction_pipe import DurationExtractionPipeline
12 | from duration_extraction.duration_extractor import DurationExtractor
13 | from models.tacotron import Tacotron
14 | from trainer.common import to_device
15 | from trainer.taco_trainer import TacoTrainer
16 | from utils.checkpoints import restore_checkpoint
17 | from utils.dataset import get_taco_dataloaders
18 | from utils.display import *
19 | from utils.dsp import DSP
20 | from utils.files import pickle_binary, unpickle_binary, read_config
21 | from utils.paths import Paths
22 |
23 |
24 | def normalize_values(phoneme_val):
25 | nonzeros = np.concatenate([v[np.where(v != 0.0)[0]]
26 | for item_id, v in phoneme_val])
27 | mean, std = np.mean(nonzeros), np.std(nonzeros)
28 | if not std > 0:
29 | std = 1e10
30 | for item_id, v in phoneme_val:
31 | zero_idxs = np.where(v == 0.0)[0]
32 | v -= mean
33 | v /= std
34 | v[zero_idxs] = 0.0
35 | return mean, std
36 |
37 |
38 | # adapted from https://github.com/NVIDIA/DeepLearningExamples/blob/
39 | # 0b27e359a5869cd23294c1707c92f989c0bf201e/PyTorch/SpeechSynthesis/FastPitch/extract_mels.py
40 | def extract_pitch_energy(save_path_pitch: Path,
41 | save_path_energy: Path,
42 | pitch_min_freq: float,
43 | pitch_max_freq: float) -> Tuple[float, float]:
44 | speaker_dict = unpickle_binary(paths.speaker_dict)
45 |
46 |
47 | speaker_names = set([v for v in speaker_dict.values() if len(v) > 1])
48 | mean, var = 0, 0
49 |
50 | train_data = unpickle_binary(paths.train_dataset)
51 | val_data = unpickle_binary(paths.val_dataset)
52 | all_data = train_data + val_data
53 |
54 | for speaker_name in tqdm(speaker_names, total=len(speaker_names), smoothing=0.1):
55 | all_data_speaker = [(item_id, mel_len) for item_id, mel_len in all_data if speaker_dict[item_id] == speaker_name]
56 | phoneme_pitches = []
57 | phoneme_energies = []
58 | for prog_idx, (item_id, mel_len) in enumerate(all_data_speaker, 1):
59 | try:
60 | dur = np.load(paths.alg / f'{item_id}.npy')
61 | mel = np.load(paths.mel / f'{item_id}.npy')
62 | energy = np.linalg.norm(np.exp(mel), axis=0, ord=2)
63 | assert np.sum(dur) == mel_len
64 | pitch = np.load(paths.raw_pitch / f'{item_id}.npy')
65 | durs_cum = np.cumsum(np.pad(dur, (1, 0)))
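                # durs_cum[i]:durs_cum[i+1] is the mel-frame span of phoneme i; pitch and energy
                # are averaged over that span below to obtain per-phoneme values.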
66 | pitch_char = np.zeros((dur.shape[0],), dtype=np.float32)
67 | energy_char = np.zeros((dur.shape[0],), dtype=np.float32)
68 | for idx, a, b in zip(range(mel_len), durs_cum[:-1], durs_cum[1:]):
69 | values = pitch[a:b][np.where(pitch[a:b] != 0.0)[0]]
70 | values = values[np.where((values >= pitch_min_freq) & (values <= pitch_max_freq))[0]]
71 | pitch_char[idx] = np.mean(values) if len(values) > 0 else 0.0
72 | energy_values = energy[a:b]
73 |                     energy_char[idx] = np.mean(energy_values) if len(energy_values) > 0 else 0.0
74 | phoneme_pitches.append((item_id, pitch_char))
75 | phoneme_energies.append((item_id, energy_char))
76 | bar = progbar(prog_idx, len(all_data))
77 |                 msg = f'{bar} {prog_idx}/{len(all_data_speaker)} Files '
78 | stream(msg)
79 | except Exception as e:
80 | print(e)
81 |
82 | for item_id, phoneme_energy in phoneme_energies:
83 | np.save(str(save_path_energy / f'{item_id}.npy'), phoneme_energy, allow_pickle=False)
84 |
85 | mean, var = normalize_values(phoneme_pitches)
86 | for item_id, phoneme_pitch in phoneme_pitches:
87 | np.save(str(save_path_pitch / f'{item_id}.npy'), phoneme_pitch, allow_pickle=False)
88 |
89 | return mean, var
90 |
91 |
92 | def create_gta_features(model: Tacotron,
93 | train_set: DataLoader,
94 | val_set: DataLoader,
95 | save_path: Path):
96 | model.eval()
97 | device = next(model.parameters()).device # use same device as model parameters
98 | iters = len(train_set) + len(val_set)
99 | dataset = itertools.chain(train_set, val_set)
100 | for i, batch in enumerate(dataset, 1):
101 | batch = to_device(batch, device=device)
102 | with torch.no_grad():
103 |             gta = model(batch)['mel_post']  # assumption: the model returns a dict as in TacoTrainer; postnet mel used as the GTA feature
104 | gta = gta.cpu().numpy()
105 | for j, item_id in enumerate(batch['item_id']):
106 | mel = gta[j][:, :batch['mel_len'][j]]
107 | np.save(str(save_path/f'{item_id}.npy'), mel, allow_pickle=False)
108 | bar = progbar(i, iters)
109 | msg = f'{bar} {i}/{iters} Batches '
110 | stream(msg)
111 |
112 |
113 | def create_align_features(model: Tacotron,
114 | paths: Paths,
115 | config: Dict[str, Any]) -> None:
116 |
117 | assert model.r == 1, f'Reduction factor of tacotron must be 1 for creating alignment features! ' \
118 | f'Reduction factor was: {model.r}'
119 | model.eval()
120 | model.decoder.prenet.train()
121 |
122 | dur_extr_conf = config['duration_extraction']
123 |
124 | duration_extractor = DurationExtractor(silence_threshold=dur_extr_conf['silence_threshold'],
125 | silence_prob_shift=dur_extr_conf['silence_prob_shift'])
126 |
127 | duration_extraction_pipe = DurationExtractionPipeline(paths=paths, config=config,
128 | duration_extractor=duration_extractor)
129 |
130 | print('Extracting attention matrices from tacotron...')
131 | duration_extraction_pipe.extract_attentions(model, max_batch_size=dur_extr_conf['max_batch_size'])
132 |
133 | num_workers = dur_extr_conf['num_workers']
134 | print(f'Extracting durations from attention matrices (num workers={num_workers})...')
135 | duration_stats = duration_extraction_pipe.extract_durations(num_workers=num_workers,
136 | sampler_bin_size=num_workers*4)
137 | pickle_binary(duration_stats, paths.duration_stats)
138 |
139 | print('Extracting Pitch Values...')
140 | extract_pitch_energy(save_path_pitch=paths.phon_pitch,
141 | save_path_energy=paths.phon_energy,
142 | pitch_min_freq=config['preprocessing']['pitch_min_freq'],
143 | pitch_max_freq=config['preprocessing']['pitch_max_freq'])
144 |
145 |
146 | if __name__ == '__main__':
147 | parser = argparse.ArgumentParser(description='Train Tacotron TTS')
148 | parser.add_argument('--force_gta', '-g', action='store_true', help='Force the model to create GTA features')
149 | parser.add_argument('--force_align', '-a', action='store_true', help='Force the model to create attention alignment features')
150 | parser.add_argument('--extract_pitch', '-p', action='store_true', help='Extracts phoneme-pitch values only')
151 | parser.add_argument('--config', metavar='FILE', default='configs/singlespeaker.yaml', help='The config containing all hyperparams.')
152 |
153 | args = parser.parse_args()
154 | config = read_config(args.config)
155 | dsp = DSP.from_config(config)
156 | paths = Paths(config['data_path'], config['tts_model_id'])
157 |
158 | if args.extract_pitch:
159 | print('Extracting Pitch and Energy Values...')
160 | mean, var = extract_pitch_energy(save_path_pitch=paths.phon_pitch,
161 | save_path_energy=paths.phon_energy,
162 | pitch_min_freq=config['preprocessing']['pitch_min_freq'],
163 | pitch_max_freq=config['preprocessing']['pitch_max_freq'])
164 | print('\n\nYou can now train ForwardTacotron - use python train_forward.py\n')
165 | exit()
166 |
167 | device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
168 | print('Using device:', device)
169 |
170 | # Instantiate Tacotron Model
171 | print('\nInitialising Tacotron Model...\n')
172 | model = Tacotron.from_config(config).to(device)
173 |
174 | optimizer = optim.Adam(model.parameters())
175 | restore_checkpoint(model=model, optim=optimizer,
176 | path=paths.taco_checkpoints / 'latest_model.pt',
177 | device=device)
178 |
179 | train_cfg = config['tacotron']['training']
180 | if args.force_gta:
181 | print('Creating Ground Truth Aligned Dataset...\n')
182 | train_set, val_set = get_taco_dataloaders(paths.data, 1, model.r, **train_cfg['filter'])
183 | create_gta_features(model, train_set, val_set, paths.gta)
184 | print('\n\nYou can now train WaveRNN on GTA features - use python train_wavernn.py --gta\n')
185 | elif args.force_align:
186 | print('Creating Attention Alignments and Pitch Values...')
187 | train_set, val_set = get_taco_dataloaders(paths, 1, model.r, **train_cfg['filter'])
188 | create_align_features(model=model, config=config, paths=paths)
189 | print('\n\nYou can now train ForwardTacotron - use python train_forward.py\n')
190 | else:
191 | trainer = TacoTrainer(paths, config=config, dsp=dsp)
192 | trainer.train(model, optimizer)
193 | print('Training finished, now creating Attention Alignments and Pitch Values...')
194 | train_set, val_set = get_taco_dataloaders(paths, 1, model.r, **train_cfg['filter'])
195 | create_align_features(model=model, config=config, paths=paths)
196 | print('\n\nYou can now train ForwardTacotron - use python train_forward.py\n')
197 |
198 |
199 |
200 |
201 |
202 |
203 |
204 |
205 |
206 |
--------------------------------------------------------------------------------
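The tail of the extraction loop above collapses the frame-level pitch and energy tracks into a single value per phoneme by averaging the voiced frames inside each phoneme's duration span, falling back to 0.0 when no voiced frames are present. The snippet below is only an illustrative sketch of that averaging step, not the repo's implementation; `average_by_duration` is a made-up helper name.

```python
import numpy as np

def average_by_duration(values: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Collapse a frame-level track (e.g. pitch) to per-phoneme values by
    averaging the voiced (non-zero) frames inside each phoneme's span."""
    out = np.zeros(len(durations), dtype=np.float32)
    start = 0
    for idx, dur in enumerate(durations):
        window = values[start:start + dur]
        voiced = window[window > 0]
        out[idx] = voiced.mean() if len(voiced) > 0 else 0.0
        start += dur
    return out

pitch = np.array([0., 0., 110., 115., 120., 0., 200., 210.])
durs = np.array([2, 3, 3])
print(average_by_duration(pitch, durs))  # -> [  0. 115. 205.]
```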
/models/fast_pitch.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | from typing import Union, Callable, Dict, Any, Optional
3 |
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from torch.nn import Embedding
9 |
10 | from models.common_layers import LengthRegulator, ForwardTransformer, make_token_len_mask
11 | from utils.text.symbols import phonemes
12 |
13 |
14 | class SeriesPredictor(nn.Module):
15 |
16 | def __init__(self,
17 | num_chars: int,
18 | d_model: int,
19 | n_heads: int,
20 | d_fft: int,
21 | layers: int,
22 | conv1_kernel: int,
23 | conv2_kernel: int,
24 | dropout=0.1):
25 | super().__init__()
26 | self.embedding = Embedding(num_chars, d_model)
27 | self.transformer = ForwardTransformer(heads=n_heads, dropout=dropout,
28 | d_model=d_model, d_fft=d_fft,
29 | conv1_kernel=conv1_kernel,
30 | conv2_kernel=conv2_kernel,
31 | layers=layers)
32 | self.lin = nn.Linear(d_model, 1)
33 |
34 | def forward(self,
35 | x: torch.Tensor,
36 | src_pad_mask: Optional[torch.Tensor] = None,
37 | alpha: float = 1.0) -> torch.Tensor:
38 | x = self.embedding(x)
39 | x = self.transformer(x, src_pad_mask=src_pad_mask)
40 | x = self.lin(x)
41 | return x / alpha
42 |
43 |
44 | class FastPitch(nn.Module):
45 |
46 | def __init__(self,
47 | num_chars: int,
48 | durpred_dropout: float,
49 | durpred_d_model: int,
50 | durpred_n_heads: int,
51 | durpred_layers: int,
52 | durpred_d_fft: int,
53 | pitch_dropout: float,
54 | pitch_d_model: int,
55 | pitch_n_heads: int,
56 | pitch_layers: int,
57 | pitch_d_fft: int,
58 | energy_dropout: float,
59 | energy_d_model: int,
60 | energy_n_heads: int,
61 | energy_layers: int,
62 | energy_d_fft: int,
63 | pitch_strength: float,
64 | energy_strength: float,
65 | d_model: int,
66 | conv1_kernel: int,
67 | conv2_kernel: int,
68 | prenet_layers: int,
69 | prenet_heads: int,
70 | prenet_fft: int,
71 | prenet_dropout: float,
72 | postnet_layers: int,
73 | postnet_heads: int,
74 | postnet_fft: int,
75 | postnet_dropout: float,
76 | n_mels: int,
77 | padding_value=-11.5129):
78 | super().__init__()
79 | self.padding_value = padding_value
80 | self.lr = LengthRegulator()
81 | self.dur_pred = SeriesPredictor(num_chars=num_chars,
82 | d_model=durpred_d_model,
83 | n_heads=durpred_n_heads,
84 | layers=durpred_layers,
85 | d_fft=durpred_d_fft,
86 | conv1_kernel=conv1_kernel,
87 | conv2_kernel=conv2_kernel,
88 | dropout=durpred_dropout)
89 | self.pitch_pred = SeriesPredictor(num_chars=num_chars,
90 | d_model=pitch_d_model,
91 | n_heads=pitch_n_heads,
92 | layers=pitch_layers,
93 | d_fft=pitch_d_fft,
94 | conv1_kernel=conv1_kernel,
95 | conv2_kernel=conv2_kernel,
96 | dropout=pitch_dropout)
97 | self.energy_pred = SeriesPredictor(num_chars=num_chars,
98 | d_model=energy_d_model,
99 | n_heads=energy_n_heads,
100 | layers=energy_layers,
101 | d_fft=energy_d_fft,
102 | conv1_kernel=conv1_kernel,
103 | conv2_kernel=conv2_kernel,
104 | dropout=energy_dropout)
105 | self.embedding = Embedding(num_embeddings=num_chars, embedding_dim=d_model)
106 | self.prenet = ForwardTransformer(heads=prenet_heads, dropout=prenet_dropout,
107 | conv1_kernel=conv1_kernel, conv2_kernel=conv2_kernel,
108 | d_model=d_model, d_fft=prenet_fft, layers=prenet_layers)
109 | self.postnet = ForwardTransformer(heads=postnet_heads, dropout=postnet_dropout,
110 | conv1_kernel=conv1_kernel, conv2_kernel=conv2_kernel,
111 | d_model=d_model, d_fft=postnet_fft, layers=postnet_layers)
112 | self.lin = torch.nn.Linear(d_model, n_mels)
113 | self.register_buffer('step', torch.zeros(1, dtype=torch.long))
114 | self.pitch_strength = pitch_strength
115 | self.energy_strength = energy_strength
116 | self.pitch_proj = nn.Conv1d(1, d_model, kernel_size=3, padding=1)
117 | self.energy_proj = nn.Conv1d(1, d_model, kernel_size=3, padding=1)
118 |
119 | def __repr__(self):
120 | num_params = sum([np.prod(p.size()) for p in self.parameters()])
121 | return f'FastPitch, num params: {num_params}'
122 |
123 | def forward(self, batch: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
124 | x = batch['x']
125 | mel = batch['mel']
126 | dur = batch['dur']
127 | mel_lens = batch['mel_len']
128 | pitch = batch['pitch'].unsqueeze(1)
129 | energy = batch['energy'].unsqueeze(1)
130 |
131 | if self.training:
132 | self.step += 1
133 |
134 | len_mask = make_token_len_mask(x.transpose(0, 1))
135 | dur_hat = self.dur_pred(x, src_pad_mask=len_mask).squeeze(-1)
136 | pitch_hat = self.pitch_pred(x, src_pad_mask=len_mask).transpose(1, 2)
137 | energy_hat = self.energy_pred(x, src_pad_mask=len_mask).transpose(1, 2)
138 |
139 | x = self.embedding(x)
140 | x = self.prenet(x, src_pad_mask=len_mask)
141 |
142 | pitch_proj = self.pitch_proj(pitch)
143 | pitch_proj = pitch_proj.transpose(1, 2)
144 | x = x + pitch_proj * self.pitch_strength
145 |
146 | energy_proj = self.energy_proj(energy)
147 | energy_proj = energy_proj.transpose(1, 2)
148 | x = x + energy_proj * self.energy_strength
149 |
150 | x = self.lr(x, dur)
151 |
152 | len_mask = torch.zeros((x.size(0), x.size(1))).bool().to(x.device)
153 | for i, mel_len in enumerate(mel_lens):
154 | len_mask[i, mel_len:] = True
155 |
156 | x = self.postnet(x, src_pad_mask=len_mask)
157 |
158 | x = self.lin(x)
159 | x = x.transpose(1, 2)
160 |
161 | x_post = self.pad(x, mel.size(2))
162 | x = self.pad(x, mel.size(2))
163 |
164 | return {'mel': x, 'mel_post': x_post,
165 | 'dur': dur_hat, 'pitch': pitch_hat, 'energy': energy_hat}
166 |
167 | def generate(self,
168 | x: torch.Tensor,
169 | alpha=1.0,
170 | pitch_function: Callable[[torch.Tensor], torch.Tensor] = lambda x: x,
171 | energy_function: Callable[[torch.Tensor], torch.Tensor] = lambda x: x) -> Dict[str, torch.Tensor]:
172 | self.eval()
173 | with torch.no_grad():
174 | dur_hat = self.dur_pred(x, alpha=alpha)
175 | dur_hat = dur_hat.squeeze(2)
176 | if torch.sum(dur_hat.long()) <= 0:
177 | torch.fill_(dur_hat, value=2.)
178 | pitch_hat = self.pitch_pred(x).transpose(1, 2)
179 | pitch_hat = pitch_function(pitch_hat)
180 | energy_hat = self.energy_pred(x).transpose(1, 2)
181 | energy_hat = energy_function(energy_hat)
182 | return self._generate_mel(x=x, dur_hat=dur_hat,
183 | pitch_hat=pitch_hat,
184 | energy_hat=energy_hat)
185 |
186 | def pad(self, x: torch.Tensor, max_len: int) -> torch.Tensor:
187 | x = x[:, :, :max_len]
188 | x = F.pad(x, [0, max_len - x.size(2), 0, 0], 'constant', self.padding_value)
189 | return x
190 |
191 | def get_step(self) -> int:
192 | return self.step.data.item()
193 |
194 | def _generate_mel(self,
195 | x: torch.Tensor,
196 | dur_hat: torch.Tensor,
197 | pitch_hat: torch.Tensor,
198 | energy_hat: torch.Tensor) -> Dict[str, torch.Tensor]:
199 |
200 | len_mask = make_token_len_mask(x.transpose(0, 1))
201 |
202 | x = self.embedding(x)
203 | x = self.prenet(x, src_pad_mask=len_mask)
204 |
205 | pitch_proj = self.pitch_proj(pitch_hat)
206 | pitch_proj = pitch_proj.transpose(1, 2)
207 | x = x + pitch_proj * self.pitch_strength
208 |
209 | energy_proj = self.energy_proj(energy_hat)
210 | energy_proj = energy_proj.transpose(1, 2)
211 | x = x + energy_proj * self.energy_strength
212 |
213 | x = self.lr(x, dur_hat)
214 |
215 | x = self.postnet(x, src_pad_mask=None)
216 |
217 | x = self.lin(x)
218 | x = x.transpose(1, 2)
219 |
220 | return {'mel': x, 'mel_post': x, 'dur': dur_hat,
221 | 'pitch': pitch_hat, 'energy': energy_hat}
222 |
223 | @classmethod
224 | def from_config(cls, config: Dict[str, Any]) -> 'FastPitch':
225 | model_config = config['fast_pitch']['model']
226 | model_config['num_chars'] = len(phonemes)
227 | model_config['n_mels'] = config['dsp']['num_mels']
228 | return FastPitch(**model_config)
229 |
230 | @classmethod
231 | def from_checkpoint(cls, path: Union[Path, str]) -> 'FastPitch':
232 | checkpoint = torch.load(path, map_location=torch.device('cpu'))
233 | model = FastPitch.from_config(checkpoint['config'])
234 | model.load_state_dict(checkpoint['model'])
235 | return model
236 |
--------------------------------------------------------------------------------
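Both `FastPitch` above and `ForwardTacotron` below avoid attention at synthesis time by feeding the predicted per-token durations into a `LengthRegulator` (imported from `models/common_layers.py`), which stretches token-level features to frame level before the decoder and postnet. Below is a minimal, illustrative stand-in for that operation, not the repo's actual `LengthRegulator` code:

```python
import torch

def length_regulate(x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    # x: (batch, tokens, channels), durations: (batch, tokens) integer frame counts.
    # Each token vector is repeated `duration` times, then the expanded sequences
    # are padded to the longest length in the batch.
    expanded = [xi.repeat_interleave(di, dim=0) for xi, di in zip(x, durations)]
    return torch.nn.utils.rnn.pad_sequence(expanded, batch_first=True)

x = torch.randn(2, 5, 8)                                # 5 tokens, 8 channels
dur = torch.tensor([[1, 2, 3, 1, 2], [2, 2, 2, 2, 2]])  # per-token durations
print(length_regulate(x, dur).shape)                    # torch.Size([2, 10, 8])
```

The same idea appears in both models' `_generate_mel`, where `self.lr(x, dur_hat)` expands the prenet output before the rest of the network runs.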
/models/forward_tacotron.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | from typing import Union, Callable, Dict, Any
3 | import numpy as np
4 | import torch
5 | import torch.nn as nn
6 | import torch.nn.functional as F
7 | from torch.nn import Embedding
8 | from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
9 |
10 | from models.common_layers import CBHG, LengthRegulator, BatchNormConv
11 | from utils.text.symbols import phonemes
12 |
13 |
14 | class SeriesPredictor(nn.Module):
15 |
16 | def __init__(self, num_chars, emb_dim=64, conv_dims=256, rnn_dims=64, dropout=0.5):
17 | super().__init__()
18 | self.embedding = Embedding(num_chars, emb_dim)
19 | self.convs = torch.nn.ModuleList([
20 | BatchNormConv(emb_dim, conv_dims, 5, relu=True),
21 | BatchNormConv(conv_dims, conv_dims, 5, relu=True),
22 | BatchNormConv(conv_dims, conv_dims, 5, relu=True),
23 | ])
24 | self.rnn = nn.GRU(conv_dims, rnn_dims, batch_first=True, bidirectional=True)
25 | self.lin = nn.Linear(2 * rnn_dims, 1)
26 | self.dropout = dropout
27 |
28 | def forward(self,
29 | x: torch.Tensor,
30 | alpha: float = 1.0) -> torch.Tensor:
31 | x = self.embedding(x)
32 | x = x.transpose(1, 2)
33 | for conv in self.convs:
34 | x = conv(x)
35 | x = F.dropout(x, p=self.dropout, training=self.training)
36 | x = x.transpose(1, 2)
37 | x, _ = self.rnn(x)
38 | x = self.lin(x)
39 | return x / alpha
40 |
41 |
42 | class ForwardTacotron(nn.Module):
43 |
44 | def __init__(self,
45 | embed_dims: int,
46 | series_embed_dims: int,
47 | num_chars: int,
48 | durpred_conv_dims: int,
49 | durpred_rnn_dims: int,
50 | durpred_dropout: float,
51 | pitch_conv_dims: int,
52 | pitch_rnn_dims: int,
53 | pitch_dropout: float,
54 | pitch_strength: float,
55 | energy_conv_dims: int,
56 | energy_rnn_dims: int,
57 | energy_dropout: float,
58 | energy_strength: float,
59 | rnn_dims: int,
60 | prenet_dims: int,
61 | prenet_k: int,
62 | postnet_num_highways: int,
63 | prenet_dropout: float,
64 | postnet_dims: int,
65 | postnet_k: int,
66 | prenet_num_highways: int,
67 | postnet_dropout: float,
68 | n_mels: int,
69 | padding_value=-11.5129):
70 | super().__init__()
71 | self.rnn_dims = rnn_dims
72 | self.padding_value = padding_value
73 | self.embedding = nn.Embedding(num_chars, embed_dims)
74 | self.lr = LengthRegulator()
75 | self.dur_pred = SeriesPredictor(num_chars=num_chars,
76 | emb_dim=series_embed_dims,
77 | conv_dims=durpred_conv_dims,
78 | rnn_dims=durpred_rnn_dims,
79 | dropout=durpred_dropout)
80 | self.pitch_pred = SeriesPredictor(num_chars=num_chars,
81 | emb_dim=series_embed_dims,
82 | conv_dims=pitch_conv_dims,
83 | rnn_dims=pitch_rnn_dims,
84 | dropout=pitch_dropout)
85 | self.energy_pred = SeriesPredictor(num_chars=num_chars,
86 | emb_dim=series_embed_dims,
87 | conv_dims=energy_conv_dims,
88 | rnn_dims=energy_rnn_dims,
89 | dropout=energy_dropout)
90 | self.prenet = CBHG(K=prenet_k,
91 | in_channels=embed_dims,
92 | channels=prenet_dims,
93 | proj_channels=[prenet_dims, embed_dims],
94 | num_highways=prenet_num_highways,
95 | dropout=prenet_dropout)
96 | self.lstm = nn.LSTM(2 * prenet_dims,
97 | rnn_dims,
98 | batch_first=True,
99 | bidirectional=True)
100 | self.lin = torch.nn.Linear(2 * rnn_dims, n_mels)
101 | self.register_buffer('step', torch.zeros(1, dtype=torch.long))
102 | self.postnet = CBHG(K=postnet_k,
103 | in_channels=n_mels,
104 | channels=postnet_dims,
105 | proj_channels=[postnet_dims, n_mels],
106 | num_highways=postnet_num_highways,
107 | dropout=postnet_dropout)
108 | self.post_proj = nn.Linear(2 * postnet_dims, n_mels, bias=False)
109 | self.pitch_strength = pitch_strength
110 | self.energy_strength = energy_strength
111 | self.pitch_proj = nn.Conv1d(1, 2 * prenet_dims, kernel_size=3, padding=1)
112 | self.energy_proj = nn.Conv1d(1, 2 * prenet_dims, kernel_size=3, padding=1)
113 |
114 | def __repr__(self):
115 | num_params = sum([np.prod(p.size()) for p in self.parameters()])
116 | return f'ForwardTacotron, num params: {num_params}'
117 |
118 | def forward(self, batch: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
119 | x = batch['x']
120 | mel = batch['mel']
121 | dur = batch['dur']
122 | mel_lens = batch['mel_len']
123 | pitch = batch['pitch'].unsqueeze(1)
124 | energy = batch['energy'].unsqueeze(1)
125 |
126 | if self.training:
127 | self.step += 1
128 |
129 | dur_hat = self.dur_pred(x).squeeze(-1)
130 | pitch_hat = self.pitch_pred(x).transpose(1, 2)
131 | energy_hat = self.energy_pred(x).transpose(1, 2)
132 |
133 | x = self.embedding(x)
134 | x = x.transpose(1, 2)
135 | x = self.prenet(x)
136 |
137 | pitch_proj = self.pitch_proj(pitch)
138 | pitch_proj = pitch_proj.transpose(1, 2)
139 | x = x + pitch_proj * self.pitch_strength
140 |
141 | energy_proj = self.energy_proj(energy)
142 | energy_proj = energy_proj.transpose(1, 2)
143 | x = x + energy_proj * self.energy_strength
144 |
145 | x = self.lr(x, dur)
146 |
147 | x = pack_padded_sequence(x, lengths=mel_lens.cpu(), enforce_sorted=False,
148 | batch_first=True)
149 |
150 | x, _ = self.lstm(x)
151 |
152 | x, _ = pad_packed_sequence(x, padding_value=self.padding_value, batch_first=True)
153 |
154 | x = self.lin(x)
155 | x = x.transpose(1, 2)
156 |
157 | x_post = self.postnet(x)
158 | x_post = self.post_proj(x_post)
159 | x_post = x_post.transpose(1, 2)
160 |
161 | x_post = self._pad(x_post, mel.size(2))
162 | x = self._pad(x, mel.size(2))
163 |
164 | return {'mel': x, 'mel_post': x_post,
165 | 'dur': dur_hat, 'pitch': pitch_hat, 'energy': energy_hat}
166 |
167 | def generate(self,
168 | x: torch.Tensor,
169 | alpha=1.0,
170 | pitch_function: Callable[[torch.Tensor], torch.Tensor] = lambda x: x,
171 | energy_function: Callable[[torch.Tensor], torch.Tensor] = lambda x: x) -> Dict[str, torch.Tensor]:
172 | self.eval()
173 | with torch.no_grad():
174 | dur_hat = self.dur_pred(x, alpha=alpha)
175 | dur_hat = dur_hat.squeeze(2)
176 | if torch.sum(dur_hat.long()) <= 0:
177 | torch.fill_(dur_hat, value=2.)
178 | pitch_hat = self.pitch_pred(x).transpose(1, 2)
179 | pitch_hat = pitch_function(pitch_hat)
180 | energy_hat = self.energy_pred(x).transpose(1, 2)
181 | energy_hat = energy_function(energy_hat)
182 | return self._generate_mel(x=x, dur_hat=dur_hat,
183 | pitch_hat=pitch_hat,
184 | energy_hat=energy_hat)
185 |
186 | @torch.jit.export
187 | def generate_jit(self,
188 | x: torch.Tensor,
189 | alpha: float = 1.0,
190 | beta: float = 1.0) -> Dict[str, torch.Tensor]:
191 | with torch.no_grad():
192 | dur_hat = self.dur_pred(x, alpha=alpha)
193 | dur_hat = dur_hat.squeeze(2)
194 | if torch.sum(dur_hat.long()) <= 0:
195 | torch.fill_(dur_hat, value=2.)
196 | pitch_hat = self.pitch_pred(x).transpose(1, 2) * beta
197 | energy_hat = self.energy_pred(x).transpose(1, 2)
198 | return self._generate_mel(x=x, dur_hat=dur_hat,
199 | pitch_hat=pitch_hat,
200 | energy_hat=energy_hat)
201 |
202 | def get_step(self) -> int:
203 | return self.step.data.item()
204 |
205 | def _generate_mel(self,
206 | x: torch.Tensor,
207 | dur_hat: torch.Tensor,
208 | pitch_hat: torch.Tensor,
209 | energy_hat: torch.Tensor) -> Dict[str, torch.Tensor]:
210 | x = self.embedding(x)
211 | x = x.transpose(1, 2)
212 | x = self.prenet(x)
213 |
214 | pitch_proj = self.pitch_proj(pitch_hat)
215 | pitch_proj = pitch_proj.transpose(1, 2)
216 | x = x + pitch_proj * self.pitch_strength
217 |
218 | energy_proj = self.energy_proj(energy_hat)
219 | energy_proj = energy_proj.transpose(1, 2)
220 | x = x + energy_proj * self.energy_strength
221 |
222 | x = self.lr(x, dur_hat)
223 |
224 | x, _ = self.lstm(x)
225 |
226 | x = self.lin(x)
227 | x = x.transpose(1, 2)
228 |
229 | x_post = self.postnet(x)
230 | x_post = self.post_proj(x_post)
231 | x_post = x_post.transpose(1, 2)
232 |
233 | return {'mel': x, 'mel_post': x_post, 'dur': dur_hat,
234 | 'pitch': pitch_hat, 'energy': energy_hat}
235 |
236 | def _pad(self, x: torch.Tensor, max_len: int) -> torch.Tensor:
237 | x = x[:, :, :max_len]
238 | x = F.pad(x, [0, max_len - x.size(2), 0, 0], 'constant', self.padding_value)
239 | return x
240 |
241 |
242 | @classmethod
243 | def from_config(cls, config: Dict[str, Any]) -> 'ForwardTacotron':
244 | model_config = config['forward_tacotron']['model']
245 | model_config['num_chars'] = len(phonemes)
246 | model_config['n_mels'] = config['dsp']['num_mels']
247 | return ForwardTacotron(**model_config)
248 |
249 | @classmethod
250 | def from_checkpoint(cls, path: Union[Path, str]) -> 'ForwardTacotron':
251 | checkpoint = torch.load(path, map_location=torch.device('cpu'))
252 | model = ForwardTacotron.from_config(checkpoint['config'])
253 | model.load_state_dict(checkpoint['model'])
254 | return model
--------------------------------------------------------------------------------
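For orientation, here is a minimal inference sketch for the `ForwardTacotron` class above. The checkpoint path is hypothetical, and the random phoneme ids only stand in for the repo's text-to-phoneme frontend:

```python
import torch
from models.forward_tacotron import ForwardTacotron
from utils.text.symbols import phonemes

# Hypothetical checkpoint path - point this at your own trained model.
model = ForwardTacotron.from_checkpoint('checkpoints/forward_tacotron/latest_model.pt')
model.eval()

# Dummy phoneme-id sequence standing in for the output of the text frontend.
x = torch.randint(low=1, high=len(phonemes), size=(1, 30))

# alpha rescales the predicted durations (the duration predictor divides its
# output by alpha); pitch_function post-processes the predicted pitch contour.
out = model.generate(x, alpha=1.0, pitch_function=lambda p: p * 1.1)
mel = out['mel_post']  # (1, n_mels, frames) mel spectrogram to feed a vocoder
```

`generate_jit` is the TorchScript-friendly variant (note the `@torch.jit.export` decorator); it scales the pitch contour with a plain `beta` factor instead of a Python callable.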
/configs/multispeaker.yaml:
--------------------------------------------------------------------------------
1 |
2 | tts_model_id: 'multispeaker_tts'
3 | data_path: 'data_multispeaker' # output data path
4 |
5 | tts_model: 'multi_forward_tacotron' # choices: [multi_forward_tacotron, multi_fast_pitch]
6 |
7 |
8 | dsp:
9 |
10 | sample_rate: 22050
11 | n_fft: 1024
12 | num_mels: 80
13 | hop_length: 256
14 | win_length: 1024
15 | fmin: 0
16 | fmax: 8000
17 | peak_norm: False # Normalise to the peak of each wav file
18 | trim_start_end_silence: True # Whether to trim leading and trailing silence
19 | trim_silence_top_db: 60 # Threshold in decibels below reference to consider as silence when trimming
20 | # start and end silences with librosa (set very high to effectively disable trimming)
21 |
22 | trim_long_silences: True # Whether to reduce long silence using WebRTC Voice Activity Detector
23 | vad_window_length: 30 # In milliseconds
24 | vad_moving_average_width: 8
25 | vad_max_silence_length: 12
26 | vad_sample_rate: 16000
27 |
28 |
29 | preprocessing:
30 |
31 | metafile_format: 'ljspeech_multi' # Choices [ljspeech_multi, vctk, pandas]
32 | # ljspeech_multi expects a .csv file with rows: 'file_id|speaker_id|text'
33 | # pandas expects a .tsv with columns: ['file_id', 'speaker_id', 'text']
34 | # vctk expects VCTK version 0.92 (set audio_format to '_mic1.flac')
35 | audio_format: '.wav' # Audio extension, usually .wav (different for VCTK)
36 | seed: 42
37 | n_val: 2000
38 | language: 'en'
39 | cleaner_name: 'english_cleaners' # choices: ['english_cleaners', 'no_cleaners'], expands numbers and abbreviations.
40 | use_phonemes: True # whether to phonemize the text
41 | # if set to False, you have to provide the phonemized text yourself
42 | min_text_len: 2
43 | pitch_min_freq: 30 # Minimum pitch frequency in Hz used to remove outliers (common speech pitch range is about 60-300 Hz)
44 | pitch_max_freq: 600 # Maximum pitch frequency in Hz used to remove outliers (common speech pitch range is about 60-300 Hz)
45 | pitch_extractor: pyworld # choice of pitch extraction library, choices: [librosa, pyworld]
46 | pitch_frame_length: 2048 # Frame length for extracting pitch with librosa
47 |
48 |
49 | duration_extraction:
50 |
51 | silence_threshold: -11 # normalized mel value below which the voice is considered silent
52 | # minimum mel value = -11.512925465 for zeros in the wav array (=log(1e-5),
53 | # where 1e-5 is a cutoff value)
54 | silence_prob_shift: 0.25 # increase probability for silent characters in periods of silence
55 | # for better durations during non voiced periods
56 | max_batch_size: 32 # max allowed for binned dataloader used for tacotron inference
57 | num_workers: 12 # number of processes for the costly Dijkstra-based duration extraction
58 |
59 |
60 | tacotron:
61 |
62 | model:
63 | embed_dims: 256
64 | encoder_dims: 128
65 | decoder_dims: 256
66 | postnet_dims: 128
67 | speaker_emb_dim: 256 # dimension of speaker embedding,
68 | # set to 0 to disable speaker conditioning (e.g. 256 to enable it)
69 | encoder_k: 16
70 | lstm_dims: 512
71 | postnet_k: 8
72 | num_highways: 4
73 | dropout: 0.5
74 | stop_threshold: -11 # Value below which audio generation ends.
75 |
76 | aligner_hidden_dims: 256 # text-mel aligner hidden dimensions
77 | aligner_out_dims: 32 # text-mel aligner encoding dimensions for text and mel
78 |
79 | training:
80 | schedule:
81 | - 8, 1e-3, 30_000, 32 # progressive training schedule
82 | - 4, 1e-4, 40_000, 16 # (r, lr, step, batch_size)
83 | - 2, 1e-4, 50_000, 8
84 | - 1, 1e-4, 65_000, 8
85 |
86 | dia_loss_matrix_g: 0.2 # value of g for the diagonal attention loss matrix (larger g = broader diagonal)
87 | dia_loss_factor: 1.0 # factor for scaling diagonal loss
88 | ctc_loss_factor: 0.1 # factor for scaling aligner CTC loss
89 | clip_grad_norm: 1.0 # clips the gradient norm to prevent explosion - set to None if not needed
90 | checkpoint_every: 10000 # checkpoints the model every x steps
91 | plot_every: 1000 # generates samples and plots every x steps
92 | num_workers: 2 # number of workers for dataloader
93 |
94 | filter:
95 | max_mel_len: 1250 # filter files with mel len larger than given
96 | filter_duration_stats: False # whether to filter according to the duration stats below
97 | min_attention_sharpness: 0.5 # filter files with bad attention sharpness score, if 0 then no filter
98 | min_attention_alignment: 0.95 # filter files with bad attention alignment score, if 0 then no filter
99 | max_duration: 40 # filter files with durations larger than given
100 | max_consecutive_ones: 6 # filter files where durations contain more consecutive ones than given
101 |
102 |
103 | multi_forward_tacotron:
104 |
105 | model:
106 | speaker_emb_dims: 256
107 | embed_dims: 256 # embedding dimension for main model
108 | series_embed_dims: 128 # embedding dimension for series predictor
109 |
110 | durpred_conv_dims: 256
111 | durpred_rnn_dims: 128
112 | durpred_dropout: 0.5
113 |
114 | pitch_conv_dims: 256
115 | pitch_rnn_dims: 256
116 | pitch_dropout: 0.5
117 | pitch_strength: 1. # set to 0 if you want no pitch conditioning
118 |
119 | energy_conv_dims: 256
120 | energy_rnn_dims: 64
121 | energy_dropout: 0.5
122 | energy_strength: 1. # set to 0 if you want no energy conditioning
123 |
124 | pitch_cond_conv_dims: 256 # predictor for pitch prior (predicts unvoiced phonemes with zero pitch)
125 | pitch_cond_rnn_dims: 128
126 | pitch_cond_dropout: 0.5
127 | pitch_cond_emb_dims: 4 # conditional embedding on pitch softmax prediction (zero vs non-zero pitch)
128 | pitch_cond_categorical_dims: 3 # dimension of categorical output of pitch conditioning, should be set to 3
129 | # (zero=padding, one=zero pitch, two=nonzero pitch)
130 |
131 | prenet_dims: 256
132 | prenet_k: 16
133 | prenet_dropout: 0.5
134 | prenet_num_highways: 4
135 |
136 | rnn_dims: 512
137 |
138 | postnet_dims: 256
139 | postnet_k: 8
140 | postnet_num_highways: 4
141 | postnet_dropout: 0.
142 |
143 | training:
144 | schedule:
145 | - 5e-5, 500_000, 32 # progressive training schedule
146 | - 1e-5, 600_000, 32 # lr, step, batch_size
147 | dur_loss_factor: 0.1
148 | pitch_loss_factor: 0.1
149 | energy_loss_factor: 0.1
150 | pitch_cond_loss_factor: 0.1
151 |
152 | clip_grad_norm: 1.0 # clips the gradient norm to prevent explosion - set to None if not needed
153 | checkpoint_every: 50_000 # checkpoints the model every x steps
154 | plot_every: 5000 # generates samples and plots every x steps
155 | plot_n_speakers: 3 # max number of speakers to generate plots for
156 | plot_speakers: # speakers to generate plots for (additionally to plot_n_speakers)
157 | - default_speaker
158 |
159 | filter:
160 | max_mel_len: 1250 # filter files with mel len larger than given
161 | filter_duration_stats: True # whether to filter according to the duration stats below
162 | min_attention_sharpness: 0.5 # filter files with bad attention sharpness score, if 0 then no filter
163 | min_attention_alignment: 0.95 # filter files with bad attention alignment score, if 0 then no filter
164 | max_duration: 40 # filter files with durations larger than given
165 | max_consecutive_ones: 6 # filter files where durations contain more consecutive ones than given
166 |
167 |
168 | multi_fast_pitch:
169 |
170 | model:
171 | speaker_emb_dims: 256
172 |
173 | durpred_d_model: 128
174 | durpred_n_heads: 2
175 | durpred_layers: 4
176 | durpred_d_fft: 128
177 | durpred_dropout: 0.5
178 |
179 | pitch_d_model: 128
180 | pitch_n_heads: 2
181 | pitch_layers: 4
182 | pitch_d_fft: 128
183 | pitch_dropout: 0.5
184 | pitch_strength: 1.0
185 |
186 | energy_d_model: 128
187 | energy_n_heads: 2
188 | energy_layers: 4
189 | energy_d_fft: 128
190 | energy_dropout: 0.5
191 | energy_strength: 1.0
192 |
193 | pitch_cond_d_model: 128
194 | pitch_cond_n_heads: 2
195 | pitch_cond_layers: 4
196 | pitch_cond_d_fft: 128
197 | pitch_cond_dropout: 0.5
198 | pitch_cond_output_dims: 3 # dimension of categorical output of pitch conditioning, should be set to 3
199 | # (zero=padding, one=zero pitch, two=nonzero pitch)
200 |
201 | d_model: 256
202 | conv1_kernel: 9
203 | conv2_kernel: 1
204 |
205 | prenet_layers: 4
206 | prenet_heads: 2
207 | prenet_fft: 1024
208 | prenet_dropout: 0.1
209 |
210 | postnet_layers: 4
211 | postnet_heads: 2
212 | postnet_fft: 1024
213 | postnet_dropout: 0.1
214 |
215 |
216 | training:
217 | schedule:
218 | - 1e-5, 5_000, 32 # progressive training schedule
219 | - 5e-5, 300_000, 32 # lr, step, batch_size
220 | - 2e-5, 300_000, 32
221 | dur_loss_factor: 0.1
222 | pitch_loss_factor: 0.1
223 | energy_loss_factor: 0.1
224 | pitch_cond_loss_factor: 0.1
225 | pitch_zoneout: 0. # zoneout may regularize conditioning on pitch
226 | energy_zoneout: 0. # zoneout may regularize conditioning on energy
227 |
228 | max_mel_len: 1250
229 | clip_grad_norm: 1.0 # clips the gradient norm to prevent explosion - set to None if not needed
230 | checkpoint_every: 10_000 # checkpoints the model every x steps
231 | plot_every: 1000
232 | plot_n_speakers: 3 # max number of speakers to generate plots for
233 | plot_speakers: # speakers to generate plots for (additionally to plot_n_speakers)
234 | - default_speaker
235 |
236 | filter:
237 | max_mel_len: 1250 # filter files with mel len larger than given
238 | filter_duration_stats: True # whether to filter according to the duration stats below
239 | min_attention_sharpness: 0.5 # filter files with bad attention sharpness score, if 0 then no filter
240 | min_attention_alignment: 0.95 # filter files with bad attention alignment score, if 0 then no filter
241 | max_duration: 40 # filter files with durations larger than given
242 | max_consecutive_ones: 6 # filter files where durations contain more consecutive ones than given
243 |
--------------------------------------------------------------------------------
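This config is consumed by the training scripts via the repo's `read_config` helper (see the `__main__` block of the training script above). The schedule rows parse as plain YAML strings, so they can be inspected easily; the snippet below uses PyYAML directly, purely for illustration:

```python
import yaml

with open('configs/multispeaker.yaml') as f:
    config = yaml.safe_load(f)

print(config['dsp']['sample_rate'])               # 22050
print(config['tacotron']['model']['embed_dims'])  # 256

# Each schedule row is a single 'lr, step, batch_size' string.
for row in config['multi_forward_tacotron']['training']['schedule']:
    lr, step, batch_size = (s.strip() for s in row.split(','))
    print(lr, step, batch_size)
```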